# Audiobooks business case study - Deep Neural Network

## Author/Data-Scientist: Leon Hamnett
## [LinkedIn](https://www.linkedin.com/in/leon-hamnett/) 

### Contents:

1. [Introduction](#introduction)
2. [Preprocessing](#prepro)
3. [Building and training the model](#model)
4. [Testing and Conclusion](#test)

### Introduction:  <a name="introduction"></a>

We have obtained data from an audiobook app. Every customer in the database has made at least one purchase, which is why they appear in the database. We want to create a machine learning algorithm, based on our available data, that can predict whether a customer will buy again from the audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can help identify the metrics that matter most for a customer to come back. Identifying new customers creates value and growth opportunities.

We have obtained a .csv file summarizing the data. There are several variables: Customer ID, Book length overall (sum of the length in minutes of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

These are the inputs (excluding Customer ID, as this variable adds nothing to the analysis).
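
The column layout described above can be written down once as a list of names, which makes later slicing less error-prone. This is a minimal sketch, assuming the file holds 12 columns in the order listed (ID, ten inputs, then the target). The snake_case names, the `label_columns` helper and the single demo row are hypothetical and only for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column names, following the order in which the variables are described.
COLUMN_NAMES = [
    "customer_id",
    "book_length_overall",
    "book_length_avg",
    "price_paid_overall",
    "price_paid_avg",
    "review",
    "review_out_of_10",
    "total_minutes_listened",
    "completion",
    "support_requests",
    "last_visited_minus_purchase_date",
    "target",
]

# Everything except the ID (first) and the target (last) is a model input.
INPUT_COLUMNS = COLUMN_NAMES[1:-1]


def label_columns(raw: np.ndarray) -> pd.DataFrame:
    """Wrap a headerless (n_rows, 12) array in a DataFrame with named columns."""
    if raw.shape[1] != len(COLUMN_NAMES):
        raise ValueError(f"expected {len(COLUMN_NAMES)} columns, got {raw.shape[1]}")
    return pd.DataFrame(raw, columns=COLUMN_NAMES)


# One made-up row, used only to show the shapes involved.
demo_row = np.array([[1, 1620, 1620, 19.73, 19.73, 1, 10, 1603.8, 0.99, 5, 92, 0]], dtype=float)
demo_frame = label_columns(demo_row)
print(demo_frame[INPUT_COLUMNS].shape)
```

Selecting `INPUT_COLUMNS` leaves ten input variables, which matches the input size used for the model later on.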

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether a customer will convert in the next 6 months, based on their last 2 years of activity and engagement. 6 months seems like a reasonable window: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

We will create a machine learning algorithm which is able to predict if a customer will buy again, so that targeted advertising can be directed towards these users.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

### Import libraries:

In [1]:
import numpy as np
import pandas as pd
import random
# We will use the sklearn preprocessing library, as it will be easier to standardize the data.
from sklearn import preprocessing
import tensorflow as tf


## Preprocess the data <a name="prepro"></a>

Since we are dealing with real-life data, we need to preprocess it a bit. Note that the header row, which contained the names of the categories, has already been removed from the .csv file, so the file holds only the numeric data we need for this model.

### Extract the data from the csv

In [2]:
# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)

unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]

### Balance the dataset

In [3]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

# Count the number of targets that are 0. 
# Once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)

### Balance dataset alternate method

The method above loops over every row of the dataset in Python, which can take a lot of time and computing resources on large datasets. It also always keeps the first zeroes it meets and removes the later ones.

For completeness, the alternative below uses pandas to select and drop rows in bulk instead of checking them one at a time, and it picks the zeroes to drop at random. This should be much more efficient for datasets with a large number of rows. Note that the rest of this notebook continues with the output of the first method.

In [4]:

#convert to pandas df
targets_all_pd = pd.DataFrame(data=targets_all)
inputs_all_pd = pd.DataFrame(data=unscaled_inputs_all)
#get stats
total = targets_all_pd.shape[0]
print('total: ',total)
print(targets_all_pd.value_counts())
#count each class directly rather than relying on the ordering of value_counts()
num_ones = int((targets_all_pd[0] == 1).sum())
num_zeroes = int((targets_all_pd[0] == 0).sum())

#we see there are 2237 targets with ones and 11847 targets with zeroes.
#we will downsample the zeroes so that the inputs and targets are balanced

#get list of indices with zeros in targets
zero_indices = targets_all_pd.index[targets_all_pd[0] == 0].tolist()
#select random zeroes to drop leaving the same number of ones and zeroes, making targets balanced
zero_indices_to_drop = random.sample(zero_indices,num_zeroes-num_ones)
#create balanced datasets by dropping the same rows for inputs and targets
targets_balanced = targets_all_pd.drop(index = zero_indices_to_drop) 
inputs_balanced = inputs_all_pd.drop(index = zero_indices_to_drop)

#check all indexes match and datasets are balanced
print('number of rows sharing the same index: ',sum(targets_balanced.index == inputs_balanced.index)) # we see all dfs have the same index (2237 * 2)
print('mean targets for balanced targets :',targets_balanced.mean()) # we also see mean is 0.5 for targets so it is balanced

#turn back into nparray so it can correctly be fed into ML algorithm later
targets_balanced_np = targets_balanced.to_numpy()
inputs_balanced_np = inputs_balanced.to_numpy()

total:  14084
0.0    11847
1.0     2237
dtype: int64
number of rows sharing the same index:  4474
mean targets for balanced targets : 0    0.5
dtype: float64
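
The same random downsampling can also be done with NumPy alone. This is a minimal sketch under stated assumptions: the function name `balance_by_downsampling`, the fixed seed and the small demo arrays are hypothetical and are not part of the original notebook.

```python
import numpy as np


def balance_by_downsampling(inputs, targets, seed=0):
    """Randomly drop rows of the larger class until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    zero_idx = np.flatnonzero(targets == 0)
    one_idx = np.flatnonzero(targets == 1)
    n_keep = min(len(zero_idx), len(one_idx))
    keep = np.concatenate([
        rng.choice(zero_idx, size=n_keep, replace=False),
        rng.choice(one_idx, size=n_keep, replace=False),
    ])
    keep.sort()  # keep the surviving rows in their original order
    return inputs[keep], targets[keep]


# Made-up demo data: 8 zeroes and 2 ones.
demo_targets = np.array([0] * 8 + [1] * 2, dtype=float)
demo_inputs = np.arange(20, dtype=float).reshape(10, 2)

bal_inputs, bal_targets = balance_by_downsampling(demo_inputs, demo_targets)
print(bal_inputs.shape, bal_targets.mean())
```

Fixing the seed makes the selection repeatable, which helps when comparing models trained on the balanced data.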


### Standardise the inputs

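Next we standardise the inputs so that each one is on a comparable scale. Otherwise inputs with a large range (for example, book length in minutes) would dominate inputs with a small range (for example, completion, from 0 to 1) and skew the learning process.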

In [5]:
# This is the only place we use sklearn functionality. We take advantage of its preprocessing capabilities.
# It's a single line of code, which standardizes each input column.
# At the end of the business case, you can try to run the algorithm WITHOUT this line of code.
# The result will be interesting.
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
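
Standardising means subtracting each column's mean and dividing by its standard deviation. The sketch below does this by hand in NumPy, which to my understanding matches what `preprocessing.scale` computes with its default settings. The helper name `standardise`, the zero-variance guard and the demo values are assumptions added for illustration.

```python
import numpy as np


def standardise(x, mean=None, std=None):
    """Scale each column to zero mean and unit variance.

    If mean/std are given (e.g. computed on training data), they are reused
    instead of being recomputed from x.
    """
    x = np.asarray(x, dtype=float)
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return (x - mean) / safe_std, mean, std


# Made-up demo: two columns on very different scales.
demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
scaled_demo, col_mean, col_std = standardise(demo)
print(scaled_demo.mean(axis=0), scaled_demo.std(axis=0))
```

Returning the mean and standard deviation allows new customers to be scaled with the same statistics later. One possible design choice is to compute them on the training rows only and reuse them for validation and test data.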

### Shuffle the data

Now we shuffle the data so that its order is random. This removes ordering effects such as time-variability or changes in the business process, and it means every batch the algorithm learns from is a random mix of customers.

In [6]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
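
Because the shuffle above is unseeded, every run produces a different order. If repeatable results are wanted, one option is a seeded permutation, sketched below. The helper name `shuffle_together`, the seed value and the demo arrays are hypothetical.

```python
import numpy as np


def shuffle_together(inputs, targets, seed=42):
    """Shuffle inputs and targets with the same seeded permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(inputs.shape[0])
    return inputs[order], targets[order]


# Made-up demo data: row i of the inputs starts with the value 2 * i.
demo_inputs = np.arange(10).reshape(5, 2)
demo_targets = np.array([0, 1, 0, 1, 1])

first_inputs, first_targets = shuffle_together(demo_inputs, demo_targets)
second_inputs, second_targets = shuffle_together(demo_inputs, demo_targets)

# The same seed gives the same order on every call.
print(np.array_equal(first_inputs, second_inputs))
```

Using one permutation for both arrays keeps every input row paired with its own target.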

### Split the dataset into train, validation, and test

Now we split the data into three sets: 80% for training, 10% for validation and 10% for testing. The validation set lets us detect overfitting during training, and the test set, which the algorithm has never seen before, gives us a good idea of the accuracy of the model at the end.

In [7]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code, 
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole sheet, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1808.0 3579 0.5051690416317407
227.0 447 0.5078299776286354
202.0 448 0.45089285714285715


We see the sets have been split correctly. The proportion of ones is about 51% in the training set, 51% in the validation set and 45% in the test set, which is within an acceptable range, so we can still consider the sets balanced.
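
If the drift in class proportions between the sets is a concern, one alternative is a stratified split, which splits each class separately so that every set keeps the same ratio. This is a sketch of that idea rather than the method used in this notebook. The function name `stratified_split`, the seed and the demo targets are hypothetical.

```python
import numpy as np


def stratified_split(targets, fractions=(0.8, 0.1), seed=0):
    """Return train/validation/test index arrays that keep the class ratio in each set."""
    rng = np.random.default_rng(seed)
    train_parts, val_parts, test_parts = [], [], []
    for cls in np.unique(targets):
        cls_idx = rng.permutation(np.flatnonzero(targets == cls))
        n_train = int(fractions[0] * len(cls_idx))
        n_val = int(fractions[1] * len(cls_idx))
        train_parts.append(cls_idx[:n_train])
        val_parts.append(cls_idx[n_train:n_train + n_val])
        test_parts.append(cls_idx[n_train + n_val:])  # everything that remains
    # Shuffle inside each set so the classes are interleaved rather than grouped.
    return tuple(
        rng.permutation(np.concatenate(parts))
        for parts in (train_parts, val_parts, test_parts)
    )


# Made-up balanced demo targets: 50 zeroes and 50 ones.
demo_targets = np.array([0] * 50 + [1] * 50)
train_idx, val_idx, test_idx = stratified_split(demo_targets)

for name, idx in (("train", train_idx), ("validation", val_idx), ("test", test_idx)):
    print(name, len(idx), demo_targets[idx].mean())
```

The returned index arrays can be used to slice both the inputs and the targets, for example `shuffled_inputs[train_idx]`.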

### Save the three datasets in *.npz

In [8]:
# Save the three datasets in *.npz. so they can easily be fed into the model.

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
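
As a quick sanity check of the save format, the sketch below writes a small made-up array pair with `np.savez` and reads it back. The file name `demo_split` and the temporary directory are assumptions for illustration, so the real `Audiobooks_data_*.npz` files are not touched.

```python
import os
import tempfile

import numpy as np

# Made-up demo arrays standing in for one of the three datasets.
demo_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
demo_targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_split")
    # np.savez appends the .npz extension when it is missing.
    np.savez(path, inputs=demo_inputs, targets=demo_targets)
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["targets"]

round_trip_ok = np.array_equal(loaded_inputs, demo_inputs) and np.array_equal(loaded_targets, demo_targets)
print(round_trip_ok)
```

The keyword names passed to `np.savez` (`inputs`, `targets`) are the keys used to retrieve the arrays after loading.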

## Create the machine learning algorithm <a name="model"></a>

In this section we will create the machine learning algorithm: we define the data, the model, the loss function and the optimising algorithm, along with any hyperparameters that might be needed.

### Data

We will load the data from the npz files.

In [9]:
# let's create a temporary variable npz, where we will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')

# we extract the inputs using the keyword under which we saved them
# to ensure that they are all floats, let's also take care of that
# (the np.float and np.int aliases were removed in newer NumPy versions, so we use explicit dtypes)
train_inputs = npz['inputs'].astype(np.float64)
# targets must be int because of sparse_categorical_crossentropy (we want to be able to smoothly one-hot encode them)
train_targets = npz['targets'].astype(np.int64)

# we load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')
# we can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

# we load the test data in the temporary variable
npz = np.load('Audiobooks_data_test.npz')
# we create 2 variables that will contain the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

### Model

We will outline the model, choose the optimizer and loss function, and then set up early stopping. We will then move on to feeding the data into the model and completing the training of the model.

In [10]:
# Set the input and output sizes
input_size = 10 # 10 input variables
output_size = 2 # two output classes (0 or 1)
# Use the same hidden layer size for all hidden layers. Not a necessity.
hidden_layer_size = 100

# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='tanh'), # 3rd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 4th hidden layer 
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])


### Choose the optimizer and the loss function

# we define the optimizer we'd like to use, 
# the loss function, 
# and the metrics we are interested in obtaining at each iteration

#call the adam optimizer explicitly so we can change the hyper parameters
optim_model = tf.keras.optimizers.Adam(
    learning_rate=0.00005, beta_1=0.5, beta_2=0.999, epsilon=0.001, amsgrad=True,
    name='Adam',
)

model.compile(optimizer=optim_model, loss='sparse_categorical_crossentropy', metrics=['accuracy'],)
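
To make the loss concrete, the sketch below computes softmax probabilities and the sparse categorical cross-entropy by hand in NumPy for a few made-up output values. It illustrates the formulas only and is not the Keras implementation. The function names and demo numbers are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the correct integer class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


# Made-up raw scores for three customers, two classes each.
demo_logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]])
demo_labels = np.array([0, 1, 1])  # integer class labels, not one-hot vectors

demo_probs = softmax(demo_logits)
demo_loss = sparse_categorical_crossentropy(demo_probs, demo_labels)

# A model that outputs 0.5 for both classes has a loss of ln(2), about 0.693.
uninformed_loss = sparse_categorical_crossentropy(softmax(np.zeros((4, 2))), np.array([0, 1, 0, 1]))
print(demo_loss, uninformed_loss)
```

The value ln(2) ≈ 0.693 is a useful reference point: on a balanced two-class problem, a loss near it means the model is doing no better than guessing.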

### Training
# That's where we train the model we have built.

# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# let's set patience=4, to be a bit tolerant against random validation loss increases (training stops once the validation loss has failed to improve for 4 epochs in a row)
early_stopping = tf.keras.callbacks.EarlyStopping(patience=4)


# fit the model
# note that the train and validation data are plain NumPy arrays (not iterable datasets), so we pass them directly together with a batch size
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when a task is completed
          # task here is to check if val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  

Epoch 1/100
36/36 - 1s - loss: 0.6956 - accuracy: 0.4962 - val_loss: 0.6878 - val_accuracy: 0.5369
Epoch 2/100
36/36 - 0s - loss: 0.6773 - accuracy: 0.5569 - val_loss: 0.6694 - val_accuracy: 0.5861
Epoch 3/100
36/36 - 0s - loss: 0.6578 - accuracy: 0.6108 - val_loss: 0.6509 - val_accuracy: 0.6219
Epoch 4/100
36/36 - 0s - loss: 0.6379 - accuracy: 0.6398 - val_loss: 0.6322 - val_accuracy: 0.6353
Epoch 5/100
36/36 - 0s - loss: 0.6179 - accuracy: 0.6642 - val_loss: 0.6141 - val_accuracy: 0.6667
Epoch 6/100
36/36 - 0s - loss: 0.5983 - accuracy: 0.6963 - val_loss: 0.5968 - val_accuracy: 0.6935
Epoch 7/100
36/36 - 0s - loss: 0.5791 - accuracy: 0.7128 - val_loss: 0.5805 - val_accuracy: 0.7002
Epoch 8/100
36/36 - 0s - loss: 0.5608 - accuracy: 0.7265 - val_loss: 0.5648 - val_accuracy: 0.7092
Epoch 9/100
36/36 - 0s - loss: 0.5431 - accuracy: 0.7318 - val_loss: 0.5499 - val_accuracy: 0.7136
Epoch 10/100
36/36 - 0s - loss: 0.5263 - accuracy: 0.7418 - val_loss: 0.5369 - val_accuracy: 0.7159
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x7fa766875af0>
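
The early stopping rule can be illustrated without TensorFlow. The sketch below is a simplified version of patience-based stopping as I understand the callback's default behaviour (monitoring validation loss with no minimum improvement threshold). The function name and the list of validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the best validation loss was last improved
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3.
demo_val_losses = [0.60, 0.55, 0.54, 0.56, 0.55, 0.57, 0.58, 0.50]
stop_epoch = early_stopping_epoch(demo_val_losses, patience=4)
print(stop_epoch)
```

In this example the loss at epoch 8 would have been a new best, but training has already stopped at epoch 7. That is the trade-off behind the choice of patience: a larger value tolerates longer plateaus at the cost of more training time.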

## Test the model <a name="test"></a>

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset. 

The test is the absolute final instance. We should not test before we are completely done with adjusting our model.

If we adjust the model after testing, we will start overfitting the test dataset, which will defeat its purpose.

In [11]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [12]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.34. Test accuracy: 83.26%
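
Accuracy is a single summary number. For an advertising use case it can also help to look at precision (how many of the customers we would target actually convert) and recall (how many of the converting customers we would reach). The sketch below computes these from true and predicted labels. The function name `binary_report` and the demo labels are made up and are not results from this model.

```python
import numpy as np


def binary_report(true_labels, predicted_labels):
    """Compute confusion-matrix counts plus accuracy, precision and recall."""
    true_labels = np.asarray(true_labels).astype(int)
    predicted_labels = np.asarray(predicted_labels).astype(int)
    tp = int(np.sum((predicted_labels == 1) & (true_labels == 1)))
    tn = int(np.sum((predicted_labels == 0) & (true_labels == 0)))
    fp = int(np.sum((predicted_labels == 1) & (true_labels == 0)))
    fn = int(np.sum((predicted_labels == 0) & (true_labels == 1)))
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(true_labels),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for eight customers.
demo_true = [1, 1, 1, 0, 0, 0, 1, 0]
demo_pred = [1, 1, 0, 0, 0, 1, 1, 0]
demo_report = binary_report(demo_true, demo_pred)
print(demo_report)
```

With the trained model, the predicted labels could be obtained with something like `np.argmax(model.predict(test_inputs), axis=1)`, since the softmax output holds one probability per class.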


### Conclusions



We can see that the algorithm above obtained an accuracy of around 83% on the test set, meaning it classified roughly 83 out of every 100 previously unseen customers correctly. I think this is a suitable level of accuracy for the purpose of this model: it lets us direct selective advertising at the users who are predicted to convert, rather than advertising to everyone, and so in theory generate more income from the advertising we send out.

Many variations of the hyperparameters were tried to see if the accuracy could be improved, but all the different models only seem able to reach an accuracy of around 80%. This suggests that if further accuracy is required, more data would need to be acquired and fed into the model so it can learn more effectively which inputs lead to a successful conversion.