# Audiobooks business case

### Problem

You are given data from an audiobook app, so it covers only the audio versions of books. Every customer in the database has made at least one purchase, which is why he/she is in the database. We want to build a machine learning algorithm from the available data that can predict whether a customer will buy from the audiobook company again.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv file summarizing the data. It contains several variables: Customer ID, Book length overall (sum of the length in minutes of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, his/her rating out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten passwords to assistance with using the app), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID: it is completely arbitrary, more like a name than a number.
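As a minimal sketch of separating the ID from the inputs, assume the table has already been read into a NumPy array with Customer ID in the first column and the target in the last. That column layout and every value below are made up for illustration; they are not taken from the actual .csv.

```python
import numpy as np

# Hypothetical rows: [customer_id, 10 input columns..., target]
# Input order assumed here: book length overall, book length avg, price overall,
# price avg, review, review out of 10, minutes listened, completion,
# support requests, last visited minus purchase date
raw = np.array([
    [101, 1620.0, 1620.0, 19.73, 19.73, 1, 10.0, 1603.8, 0.99, 5, 92, 0],
    [102, 4320.0, 2160.0, 20.26, 10.13, 0, 0.0, 0.0, 0.00, 0, 0, 1],
    [103, 648.0, 648.0, 5.33, 5.33, 0, 0.0, 324.0, 0.50, 0, 30, 1],
])

customer_ids = raw[:, 0].astype(int)  # kept for bookkeeping, never fed to the model
inputs = raw[:, 1:-1]                 # the 10 input variables
targets = raw[:, -1].astype(int)      # 0 = won't buy again, 1 = will buy again

print(inputs.shape, targets)
```

Keeping `customer_ids` in a separate array makes it possible to map predictions back to customers later without letting the ID influence training.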

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether a customer will convert in the next 6 months based on the last 2 years of activity and engagement. Six months is a reasonable window: if customers don't convert within it, chances are they have gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

Good luck!
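One common way to handle integer class labels like the 0s and 1s described in the problem statement is to let the model output one score per class, turn the scores into probabilities with softmax, and measure the error with cross-entropy on the probability of the true class. The NumPy sketch below illustrates that idea on made-up scores; the helper names are hypothetical and the code is not how any particular library implements it.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(targets, probs):
    # Pick the predicted probability of each sample's true class (integer label)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())

# Made-up scores for three customers: column 0 = won't buy, column 1 = will buy
scores = np.array([[2.0, 0.0],
                   [0.0, 0.0],
                   [-1.0, 1.0]])
targets = np.array([0, 1, 1])

probs = softmax(scores)
loss = sparse_cross_entropy(targets, probs)
predicted = probs.argmax(axis=1)  # most probable class per customer

print(probs.round(3), round(loss, 4), predicted)
```

Because the labels stay as plain integers, no one-hot encoding of the targets is needed for this kind of loss.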

## Create the Machine Learning Algorithm

## Import the relevant libraries

In [1]:
import numpy as np
import tensorflow as tf






## Data

In [2]:
# Load data
npz = np.load('Audiobooks_data_train.npz')

# np.ndarray.astype() creates a copy of the array, cast to the specified type
train_inputs = npz['inputs'].astype(float)
# Extract train target
train_targets = npz['targets'].astype(int)

# Validation data
npz = np.load('Audiobooks_data_validation.npz')
# Extract validation inputs and targets
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

# Test data
npz = np.load('Audiobooks_data_test.npz')
# Extract test inputs and targets
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)
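After loading, it can be worth checking that every input row has a matching target and that the two classes are reasonably balanced. The sketch below runs that check on a small stand-in file written to a temporary folder, so it works without the real data. The file name `demo_split.npz` and the random values are assumptions for illustration only; the `inputs` and `targets` keys mirror the ones used above.

```python
import os
import tempfile

import numpy as np

# Stand-in for one of the .npz files: random inputs and 0/1 targets (not the real data)
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(8, 10))
demo_targets = np.array([0, 1, 0, 1, 1, 0, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split.npz')
    np.savez(path, inputs=demo_inputs, targets=demo_targets)
    # Load exactly as in the cell above, closing the file before the folder is removed
    with np.load(path) as npz:
        inputs = npz['inputs'].astype(float)
        targets = npz['targets'].astype(int)

# Each row of inputs must line up with one target
assert inputs.shape[0] == targets.shape[0]

# Count how many samples fall into class 0 and class 1
class_counts = np.bincount(targets, minlength=2)
print('shape:', inputs.shape, 'class counts:', class_counts)
```

The same two checks can be applied to `train_inputs`/`train_targets` and the validation and test arrays once the real files are loaded.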

## Model
Outline, optimizer, loss, early stopping and training

In [3]:
# Input and Output sizes
input_size = 10
output_size = 2

# Hidden layer size for both hidden layers
hidden_layer_size = 50


model = tf.keras.Sequential([
    
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

# Optimizer
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Batch size
batch_size = 100

# Training Epochs
max_epochs = 100

# Early stopping mechanism
# tf.keras.callbacks.EarlyStopping(patience) configures the early stopping mechanism of the algorithm; by default it monitors the validation loss.
# 'patience' lets us decide how many consecutive epochs without improvement we can tolerate
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2) # Set patience=2, to be a bit tolerant against random validation loss increases

## Fit the Model
model.fit(train_inputs, # train inputs (input first)
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  



Epoch 1/100
36/36 - 1s - loss: 0.6174 - accuracy: 0.6597 - val_loss: 0.5240 - val_accuracy: 0.7405 - 1s/epoch - 28ms/step
Epoch 2/100
36/36 - 0s - loss: 0.4745 - accuracy: 0.7589 - val_loss: 0.4374 - val_accuracy: 0.7875 - 78ms/epoch - 2ms/step
Epoch 3/100
36/36 - 0s - loss: 0.4136 - accuracy: 0.7801 - val_loss: 0.3984 - val_accuracy: 0.8031 - 83ms/epoch - 2ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3845 - accuracy: 0.7896 - val_loss: 0.3820 - val_accuracy: 0.7785 - 69ms/epoch - 2ms/step
Epoch 5/100
36/36 - 0s - loss: 0.3674 - accuracy: 0.8033 - val_loss: 0.3732 - val_accuracy: 0.7875 - 73ms/epoch - 2ms/step
Epoch 6/100
36/36 - 0s - loss: 0.3551 - accuracy: 0.8097 - val_loss: 0.3635 - val_accuracy: 0.8121 - 72ms/epoch - 2ms/step
Epoch 7/100
36/36 - 0s - loss: 0.3486 - accuracy: 0.8075 - val_loss: 0.3569 - val_accuracy: 0.8031 - 71ms/epoch - 2ms/step
Epoch 8/100
36/36 - 0s - loss: 0.3429 - accuracy: 0.8161 - val_loss: 0.3562 - val_accuracy: 0.8098 - 71ms/epoch - 2ms/step
Epoch 9/100
3

<keras.src.callbacks.History at 0x29bcb600e80>
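To make the role of `patience=2` more concrete, here is a simplified pure-Python sketch of patience-based stopping. It assumes the rule is "stop once the validation loss has failed to beat its best value for `patience` consecutive epochs". The function `stopping_epoch` is a hypothetical helper written for this illustration, not Keras code. The first eight values are the `val_loss` figures from the log above; the two extra epochs are invented to trigger the stop.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from epochs 1-8 of the training log
logged = [0.5240, 0.4374, 0.3984, 0.3820, 0.3732, 0.3635, 0.3569, 0.3562]

# Two invented epochs that both fail to beat the best value so far (0.3562)
hypothetical = logged + [0.3570, 0.3565]

print(stopping_epoch(logged, patience=2))        # no stop within the logged epochs
print(stopping_epoch(hypothetical, patience=2))  # stops at the second epoch without improvement
```

Note that in this sketch the second invented value (0.3565) is lower than the one before it, yet it still counts against the patience budget because it does not improve on the best loss seen so far.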

## Test Model

It is very important to realize that tuning the hyperparameters against the validation dataset gradually overfits the model to that dataset.

The test set is the absolute final check. You should not test before you are completely done with adjusting your model.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [4]:
# model.evaluate returns the loss value and metrics values for the model in test mode
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [5]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.36. Test accuracy: 81.92%
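The accuracy reported above can be read as the share of test customers whose most probable class matches the target. The sketch below reproduces that calculation on made-up probabilities. In the real workflow the probabilities would presumably come from something like `model.predict(test_inputs)`, which is an assumption here and is not run.

```python
import numpy as np

# Made-up class probabilities for five customers: column 0 = won't buy, column 1 = will buy
probs = np.array([[0.90, 0.10],
                  [0.30, 0.70],
                  [0.60, 0.40],
                  [0.20, 0.80],
                  [0.55, 0.45]])
targets = np.array([0, 1, 1, 1, 0])

predicted = probs.argmax(axis=1)           # most probable class per customer
accuracy = (predicted == targets).mean()   # fraction of correct predictions

print('Accuracy: {0:.2f}%'.format(accuracy * 100.))
```

Here four of the five predictions match their targets, so the sketch reports 80% accuracy.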
