# Deep Learning Case Study: Audiobooks 
#### by Sooyeon Won 

### Keywords 
- Deep Learning 
- TensorFlow2 - Keras
- Unbalanced Data
- Classification Problem


### Contents 

<ul>    
<li><a href="#Introduction">1.  Introduction</a></li>
<li><a href="#Preprocessing">2.  Data Preprocessing</a></li>
<li><a href="#Analysis">3.  Data Analysis</a></li>
<li><a href="#Test">4.  Test the Model</a></li>
</ul>





<a id='Introduction'></a>
## 1. Introduction: Business Problem

This analysis uses data from an Audiobook app. Note that the data relate to the audio versions of books ONLY. Every customer in the database has made at least one purchase. Based on the available data, I created a machine learning (ML) algorithm that predicts whether a customer will purchase from the Audiobook company again. <br><br>
From the company's perspective, if a customer has a low probability of re-purchasing, the company can cut advertising spend on that customer and focus its efforts on the customers who are likely to convert again. In addition, the ML model can help identify the metrics that matter most for customers who come back. Identifying new customers creates value and growth opportunities. <br><br>
The given dataset contains the following information. 
>- **Customer ID**
>- **Book length overall**: Sum of the minute length of all purchases
>- **Book length avg**: Average length in minutes of all purchases
>- **Price paid_overall**: Total price of all purchases  
>- **Price Paid avg**: Average paid price of all purchases
>- **Review**: A binary variable whether the customer left a review
>- **Review out of 10**: A review score that the customer left  
>- **Total minutes listened**: Total listening time in minutes, a measure of engagement.
>- **Completion**: The share of the purchased audiobooks that was listened to, computed as "Total minutes listened" divided by "Book length overall". It ranges from 0 to 1.
>- **Support requests**: Total number of support requests such as forgotten password, assistance for using the App, and so on.
>- **Last visited - purchase date**: Also a measure of engagement. The bigger the difference (in days), the bigger the engagement.

In this analysis, I used all the features as inputs except **Customer ID**, since it is completely arbitrary. 
The target is a binary variable, 0 or 1. The inputs cover a period of 2 years, and the targets cover the following 6 months. 

So, basically, I predict whether a customer will convert in the next 6 months, based on the last 2 years of purchasing patterns and individual engagement. The 6-month window is debatable, but it is reasonable: if customers do not convert within 6 months, chances are they have gone to a competitor or did not like the audiobook way of digesting information.  

My solution approach is to create a machine learning algorithm that predicts whether a customer will buy again. This is a classic classification problem with two classes: will not purchase and will purchase, represented by 0s and 1s, respectively. 



<a id='Preprocessing'></a>
## 2. Data Preprocessing
&nbsp; 2. 1. Extract the data from the csv <br>
&nbsp; 2. 2. Balance the dataset <br>
&nbsp; 2. 3. Standardize the inputs <br>
&nbsp; 2. 4. Shuffle the data <br>
&nbsp; 2. 5. Split the dataset into train, validation, and test <br>
&nbsp; 2. 6. Save the three datasets in a tensor friendly format

### 2. 1. Extract the data from the csv

In [1]:
# Import the relevant libraries 
import numpy as np
from sklearn import preprocessing # for easier data standardization 

In [2]:
# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

# Specify inputs and target
# Inputs - all columns except the arbitrary customer IDs in the first column and the target in the last column
unscaled_inputs_all = raw_csv_data[:,1:-1]
# Target - the last column
targets_all = raw_csv_data[:,-1]

### 2. 2. Balance the dataset

In [3]:
# Target distribution 
num_one_targets = int(np.sum(targets_all))
print('The number of target 1s: ', num_one_targets)
print('The number of target 0s: ', len(targets_all) - num_one_targets)
print('Total targets: ', len(targets_all))

The number of target 1s:  2237
The number of target 0s:  11847
Total targets:  14084


In [4]:
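# Count the 0 targets and mark every 0 beyond the number of 1s for removal
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        # Once there are as many 0s as 1s, every further 0 is surplus
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)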

In [5]:
# Delete the indices marked in the for loop
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)

> **Comment**: First, I count how many targets are 1s, i.e. the number of customers who converted. Then I set a counter for targets that are 0, meaning that the customer did not convert. To create a "balanced" dataset, I simply remove some datapoints from the majority class, as the lecturer recommended. Note that there are various ways to deal with imbalanced data. For example, I used the SMOTE technique in my previous [Starbucks Capstone Project](https://github.com/SooyeonWon/ML_starbucks_capstone_projects). 
<br><br>
To balance the data, I declare a variable that counts the targets equal to 0. Once the number of 0s exceeds the number of 1s, every further entry whose target is 0 is marked for removal. I then create two variables, containing the inputs and the targets, from which all indices marked as "to remove" in the loop above are deleted.
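
> The same undersampling can be sketched without an explicit loop. The snippet below is only an illustration on randomly generated toy data (the arrays `inputs` and `targets` and the helper `undersample_majority` are hypothetical, not part of the notebook); it keeps every 1 and the first zeros in file order, which is what the loop above does.

```python
import numpy as np

def undersample_majority(inputs, targets):
    """Keep all 1s and only the first len(1s) zeros, preserving row order."""
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Assumes the 0 class is the majority, as in the Audiobooks data
    keep_idx = np.sort(np.concatenate([one_idx, zero_idx[:len(one_idx)]]))
    return inputs[keep_idx], targets[keep_idx]

# Toy data standing in for the CSV columns (hypothetical values)
rng = np.random.default_rng(0)
targets = (rng.random(1000) < 0.16).astype(int)
inputs = rng.normal(size=(1000, 10))

balanced_inputs, balanced_targets = undersample_majority(inputs, targets)
print(int(balanced_targets.sum()), len(balanced_targets))
```

> Because the surplus zeros are always taken from the end of the file, this approach (like the loop) discards the most recent non-converting customers when the data is ordered by date; sampling the zeros at random would be an alternative design choice.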

### 2. 3. Standardize the inputs
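> Standardizing the inputs puts all features on a comparable scale, which generally helps the optimizer converge and tends to improve the accuracy of the algorithm.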

In [6]:
# Using the sklearn library, the inputs are standardized 
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
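
> With its default arguments, `preprocessing.scale` centers each column and divides it by its population standard deviation. A minimal NumPy sketch of that computation on made-up data (the array `x` and its column scales are assumptions for illustration):

```python
import numpy as np

# Toy feature matrix with three columns on very different scales (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(loc=[5.0, 100.0, -3.0], scale=[2.0, 30.0, 0.5], size=(200, 3))

mean = x.mean(axis=0)        # per-column mean
std = x.std(axis=0)          # per-column population std (ddof=0)
x_scaled = (x - mean) / std  # z-score of every entry

print(np.round(x_scaled.mean(axis=0), 6))
print(np.round(x_scaled.std(axis=0), 6))
```

> After scaling, every column has mean 0 and standard deviation 1. Keeping `mean` and `std` around would make it possible to scale new customers in the same way later on.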

### 2. 4. Shuffle the data
> The collected data is arranged by date. Since I will batch later, I must shuffle the data, so that it keeps the same information but in a random order. Otherwise the data would be homogeneous inside a batch but heterogeneous between batches. So, I shuffle an array of indices and use it to reorder the inputs and the targets in the same way.

In [7]:
# Shuffle the indices 
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

In [8]:
# Shuffle the inputs and target
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
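
> Indexing both arrays with the same permutation keeps every input row paired with its own target. The toy example below illustrates this; the fixed seed is an assumption added here to make the shuffle reproducible and is not used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (assumption) for reproducibility

# Toy data: the target of each row can be recomputed from the row itself
inputs = np.arange(12).reshape(6, 2)   # rows [0, 1], [2, 3], ..., [10, 11]
targets = inputs[:, 0] // 2            # 0, 1, 2, 3, 4, 5

perm = rng.permutation(inputs.shape[0])  # one shared random order
shuffled_inputs = inputs[perm]
shuffled_targets = targets[perm]

# Every shuffled row still matches its target
print(np.array_equal(shuffled_inputs[:, 0] // 2, shuffled_targets))
```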

### 2. 5. Split the dataset into train, validation, and test
> In this analysis, I split the data into 80-10-10 distribution of training, validation, and test.

In [9]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# 80% of the whole data are assigned to training dataset 
train_samples_count = int(0.8 * samples_count)

# 10% of the whole data are assigned to validation dataset 
validation_samples_count = int(0.1 * samples_count)

# Finally, the test dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

In [10]:
# Create variables that record the inputs and targets for training. 
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test. 
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

> So far I have balanced the dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test sets were taken from a shuffled dataset. So it should also be checked whether the split sets are balanced as well. Note that whenever I rerun the code, I get slightly different values, since the data are randomly shuffled. 

In [11]:
# Print the number of targets that are 1s, the total number of samples, and the proportion for each split dataset.
print('Training set: Target=1:', int(np.sum(train_targets)), ' / Total # targets:', int(train_samples_count), ' / Percentage:', np.round(np.sum(train_targets) / train_samples_count,3))
print('Validation set: Target=1:', int(np.sum(validation_targets)), ' / Total # targets:', int(validation_samples_count),' / Percentage:',  np.round(np.sum(validation_targets) / validation_samples_count,3))
print('Test set: Target=1:', int(np.sum(test_targets)), ' / Total # targets:', int(test_samples_count), ' / Percentage:',  np.round(np.sum(test_targets) / test_samples_count,3))

Training set: Target=1: 1796  / Total # targets: 3579  / Percentage: 0.502
Validation set: Target=1: 216  / Total # targets: 447  / Percentage: 0.483
Test set: Target=1: 225  / Total # targets: 448  / Percentage: 0.502
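
> The set sizes printed above follow directly from the balanced sample count (2237 ones plus 2237 zeros) and the integer truncation in the split. A short check of the arithmetic, using only numbers reported in this notebook:

```python
num_one_targets = 2237
samples_count = 2 * num_one_targets   # balanced dataset: 4474 rows

train_samples_count = int(0.8 * samples_count)        # truncates 3579.2
validation_samples_count = int(0.1 * samples_count)   # truncates 447.4
test_samples_count = samples_count - train_samples_count - validation_samples_count

print(train_samples_count, validation_samples_count, test_samples_count)
# 3579 447 448

# The 1s reported for the three sets add up to all 1s in the data
print(1796 + 216 + 225 == num_one_targets)
# True
```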


### 2. 6. Save the three datasets in *.npz

In [12]:
# Save the three datasets in *.npz.
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
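
> `np.savez` stores each keyword argument as a named array and appends the `.npz` extension to the file name. The round trip below is a small sketch on toy arrays written to a temporary directory (the file name `toy_data` is hypothetical):

```python
import os
import tempfile
import numpy as np

# Toy arrays standing in for one of the three datasets
inputs = np.random.default_rng(0).normal(size=(5, 10))
targets = np.array([0, 1, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')
    np.savez(path, inputs=inputs, targets=targets)   # writes toy_data.npz

    # The archive behaves like a dictionary keyed by the keyword names
    with np.load(path + '.npz') as npz:
        keys = sorted(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(keys)
```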

<a id='Analysis'></a>
## 3. Data Analysis 
&nbsp; 3. 1. Data <br>
&nbsp; 3. 2. Model <br>

In [13]:
# Import TensorFlow to create the model
import tensorflow as tf

### 3. 1. Data

In [14]:
# Load the training set 
npz_train = np.load('Audiobooks_data_train.npz')
# Training Inputs & Targets
# Note that targets must be integer, because of sparse_categorical_crossentropy 
# The aliases np.float and np.int were removed in NumPy 1.24; the builtins float and int are equivalent
train_inputs, train_targets = npz_train['inputs'].astype(float), npz_train['targets'].astype(int)

In [15]:
# Load the validation set 
npz_valid = np.load('Audiobooks_data_validation.npz')
# Inputs & Targets in the validation set
validation_inputs, validation_targets = npz_valid['inputs'].astype(float), npz_valid['targets'].astype(int)

In [16]:
# Load the test dataset
npz_test = np.load('Audiobooks_data_test.npz')
# Inputs & Targets in the test set
test_inputs, test_targets = npz_test['inputs'].astype(float), npz_test['targets'].astype(int)

### 3. 2. Model
#### Outline

In [17]:
# Set the input and output sizes
input_size = 10
output_size = 2
# Set the same hidden layer size for both hidden layers
hidden_layer_size = 50

> There are 10 predictors in the csv and a target with 2 classes, which determine the input and output sizes. 50 hidden units per layer provide enough complexity; I did not use more units at the beginning in order to keep training fast.

In [18]:
# Define the model 
model_audiobook = tf.keras.Sequential([
                            # 1st hidden layer 
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), 
                            # 2nd hidden layer
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), 
                            # Output layer with softmax function
                            tf.keras.layers.Dense(output_size, activation='softmax')
                            ])

> Since the inputs are passed as preprocessed NumPy arrays, an explicit input layer is not necessary: Keras infers the input shape from the data on the first call. 
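
> For the layer sizes chosen above, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_params` below is a hypothetical name used only for this sketch.

```python
input_size, hidden_layer_size, output_size = 10, 50, 2

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one dense layer."""
    return n_in * n_out + n_out

layer_shapes = [
    (input_size, hidden_layer_size),         # 1st hidden layer
    (hidden_layer_size, hidden_layer_size),  # 2nd hidden layer
    (hidden_layer_size, output_size),        # output layer
]
params_per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
# [550, 2550, 102] 3202
```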

#### Optimizer and Loss function

In [19]:
model_audiobook.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

> I defined the optimizer ('adam'), the loss function (sparse categorical cross-entropy), and the metric that is reported during training (accuracy).
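
> To make the loss and the metric concrete, here is a small NumPy sketch of what they compute for integer targets and softmax outputs. It is an illustration under simplifying assumptions (no numerical clipping, toy `logits` and `labels` invented for the example), not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true (integer) class."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

def accuracy(probs, labels):
    """Share of rows whose most probable class equals the label."""
    return (probs.argmax(axis=1) == labels).mean()

# Three toy customers, two classes (0 = will not purchase, 1 = will purchase)
logits = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1, 1])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(round(float(loss), 4), round(float(acc), 4))
```

> The targets stay as plain integers (0 or 1) rather than one-hot vectors, which is why the targets are cast to an integer type when the datasets are loaded.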


#### Early Stopping Mechanism

In [20]:
# I set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

#### Training the Model

In [21]:
# Set the batch size
batch_size = 100

# Set a maximum number of training epochs
max_epochs = 100
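
> With a batch size of 100 and the 3579 training samples from section 2.5, each epoch consists of 36 batches, which is the `36/36` shown in the training log. The last batch is smaller than the others:

```python
import math

train_samples_count = 3579   # size of the training set reported in section 2.5
batch_size = 100

steps_per_epoch = math.ceil(train_samples_count / batch_size)
last_batch_size = train_samples_count - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)
# 36 79
```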

In [22]:
# Note that the train, validation and test data are plain NumPy arrays this time, not iterable datasets
model_audiobook.fit(train_inputs,                                            # 1. train inputs
                    train_targets,                                           # 2. train targets
                    batch_size=batch_size,                                   # 3. batch size
                    epochs=max_epochs,                                       # 4. epochs that I train for 
                    callbacks=[early_stopping],                              # 5. early stopping
                    validation_data=(validation_inputs, validation_targets), # 6. validation data
                    verbose = 2)  

Epoch 1/100
36/36 - 0s - loss: 0.6119 - accuracy: 0.6695 - val_loss: 0.5309 - val_accuracy: 0.7293
Epoch 2/100
36/36 - 0s - loss: 0.4802 - accuracy: 0.7667 - val_loss: 0.4556 - val_accuracy: 0.7651
Epoch 3/100
36/36 - 0s - loss: 0.4211 - accuracy: 0.7857 - val_loss: 0.4197 - val_accuracy: 0.7651
Epoch 4/100
36/36 - 0s - loss: 0.3911 - accuracy: 0.7994 - val_loss: 0.3995 - val_accuracy: 0.7696
Epoch 5/100
36/36 - 0s - loss: 0.3750 - accuracy: 0.7974 - val_loss: 0.3875 - val_accuracy: 0.7852
Epoch 6/100
36/36 - 0s - loss: 0.3637 - accuracy: 0.8078 - val_loss: 0.3801 - val_accuracy: 0.7830
Epoch 7/100
36/36 - 0s - loss: 0.3587 - accuracy: 0.8030 - val_loss: 0.3747 - val_accuracy: 0.7987
Epoch 8/100
36/36 - 0s - loss: 0.3528 - accuracy: 0.8061 - val_loss: 0.3701 - val_accuracy: 0.7987
Epoch 9/100
36/36 - 0s - loss: 0.3471 - accuracy: 0.8167 - val_loss: 0.3641 - val_accuracy: 0.8054
Epoch 10/100
36/36 - 0s - loss: 0.3441 - accuracy: 0.8136 - val_loss: 0.3610 - val_accuracy: 0.8054
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x2053736e970>

> Callbacks are functions that Keras calls at certain points during training, here at the end of each epoch. The task here is to check whether val_loss is increasing. After 12 epochs of training, I reached a validation accuracy of ca. 82%. Since I set an early stopping mechanism, the training did not run through all epochs. 
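
> The stopping rule with `patience=2` can be sketched in plain Python. This is a simplified illustration that ignores options such as `min_delta` and `restore_best_weights`; the function name `epochs_until_stop` and the validation losses are made up for the example and are not taken from the training log.

```python
def epochs_until_stop(val_losses, patience=2):
    """Return the epoch (1-based) at which training would stop."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs were run

# Hypothetical validation losses, one per epoch
val_losses = [0.53, 0.46, 0.42, 0.40, 0.41, 0.39, 0.40, 0.41, 0.38]
stopped_at = epochs_until_stop(val_losses)
print(stopped_at)
# 8
```

> A single worse epoch (epoch 5 in the toy sequence) does not stop training, because the loss improves again right after; only two consecutive epochs without a new best value do.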

<a id='Test'></a>
## 4. Test the Model
> After fitting on the training data and validating on the validation data, I tested the final predictive power of the model by running it on the test dataset, which the algorithm has NEVER seen before.

In [23]:
test_loss, test_accuracy = model_audiobook.evaluate(test_inputs, test_targets)



In [24]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.33. Test accuracy: 84.38%


> The final accuracy is very close to the validation accuracy, since I did not fiddle too much with the hyperparameters. Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly 80%. Again, note that each time the code is rerun, a different accuracy will be obtained, because the shuffling and the weight initialization are random. 
