# Predicting returning audiobook customers

I am building a model that analyses data from customers of an audiobook app and classifies each customer as returning or not (a binary target). A "returning" customer is one who comes back to the app to purchase more products.

Methods used:
- NN
- SVM (as comparison)

The data are imported from an attached CSV file and are structured as follows:

| customer id | average minutes spent per book | total minutes spent on app | average price of book | total spent on app | has left reviews? | review score | completion fraction | minutes listened | number of support requests | last visited time minus purchase date | target (dependent variable) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x | x | x | x | x |

### Methodology - NN model

- data loading from the npz files saved in the BC_preprocessing notebook;
- model definition: NN with adjustable number of layers, nodes, etc;
- definition of an alternative model with an early stopping mechanism, to avoid overfitting;
- testing the model, and table with 2 examples of values obtained with slightly different models;


This work is based on an exercise from https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/

In [1]:
import numpy as np
import tensorflow as tf

## Data

In [2]:
# let's create a temporary variable npz, where we will store each of the three Audiobooks datasets in turn
npz = np.load('Audiobooks_data_train.npz')

# we extract the inputs using the keyword under which we saved them
# to ensure that they are all floats, we cast them explicitly
# (np.float and np.int were removed in NumPy 1.24, so we use np.float64 and np.int64)
train_inputs = npz['inputs'].astype(np.float64)
# targets must be integers because sparse_categorical_crossentropy expects integer class labels, not one-hot vectors
train_targets = npz['targets'].astype(np.int64)

# we load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')
# we can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

# we load the test data in the temporary variable
npz = np.load('Audiobooks_data_test.npz')
# we create 2 variables that will contain the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)
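
The three loading steps above repeat the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `load_split` is a hypothetical helper name, and the demo arrays are synthetic stand-ins (8 rows, 10 features) written to a temporary file, not the real Audiobooks data. It also shows a quick check of shapes, dtypes and class balance.

```python
import os
import tempfile

import numpy as np


def load_split(path):
    """Load one .npz split and return (inputs, targets) with explicit dtypes.

    Hypothetical helper; it assumes the file stores arrays under the keys
    'inputs' and 'targets', as the notebook's files do.
    """
    with np.load(path) as npz:
        inputs = npz["inputs"].astype(np.float64)
        targets = npz["targets"].astype(np.int64)
    return inputs, targets


# Synthetic stand-in data (NOT the real dataset): 8 customers, 10 features each.
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(8, 10))
demo_targets = np.array([0, 1, 0, 1, 1, 0, 0, 1])

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_split.npz")
    np.savez(demo_path, inputs=demo_inputs, targets=demo_targets)
    inputs, targets = load_split(demo_path)

# Quick sanity checks: feature count, dtypes, and how many samples each class has.
class_counts = np.bincount(targets, minlength=2)
print(inputs.shape, inputs.dtype, targets.dtype, class_counts)
```

With the real files, the same helper would be called once per split, for example `load_split('Audiobooks_data_train.npz')`.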

## Model

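- we keep the hidden layer size at 50; we can change it afterwards;
- we don't need the Flatten layer, as the preprocessed data are already flat feature vectors;
- softmax on the output layer turns the two outputs into class probabilities that sum to 1, which suits a classifier.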
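
To make the softmax point concrete, here is a minimal pure-Python sketch of softmax and of the per-sample loss that `sparse_categorical_crossentropy` is based on. It is a simplified illustration, not Keras's implementation (Keras also averages over the batch and clips probabilities), and the logits are made-up numbers.

```python
import math


def softmax(logits):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probabilities, target):
    """Loss for one sample: negative log-probability of the integer target class."""
    return -math.log(probabilities[target])


# Hypothetical raw scores for the two output nodes (class 0, class 1).
demo_logits = [0.5, 2.0]
demo_probs = softmax(demo_logits)

# The loss is small when the true class got a high probability, large otherwise.
loss_target_1 = sparse_categorical_crossentropy(demo_probs, 1)
loss_target_0 = sparse_categorical_crossentropy(demo_probs, 0)

# With two classes, softmax equals a sigmoid applied to the difference of the scores.
sigmoid_of_difference = 1.0 / (1.0 + math.exp(-(demo_logits[1] - demo_logits[0])))

print(demo_probs, loss_target_1, loss_target_0)
```

The targets are passed as plain integers (0 or 1) and used as an index into the probability vector, which is why they were cast to an integer type when loading the data.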

In [3]:
input_size = 10
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

model = tf.keras.Sequential([
    
    # no Flatten (input) layer is needed here: each observation is already a flat vector of input_size features,
    # so the Dense layers can consume it directly
    
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    #tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 3rd hidden layer

    # will have to explore combinations of activation functions
    
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size=100
max_epochs=100

model.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs =max_epochs, 
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 3s - loss: 0.5756 - accuracy: 0.7452 - val_loss: 0.4244 - val_accuracy: 0.8792
Epoch 2/100
3579/3579 - 0s - loss: 0.3863 - accuracy: 0.8664 - val_loss: 0.3174 - val_accuracy: 0.8881
Epoch 3/100
3579/3579 - 0s - loss: 0.3258 - accuracy: 0.8815 - val_loss: 0.2862 - val_accuracy: 0.8971
Epoch 4/100
3579/3579 - 0s - loss: 0.3013 - accuracy: 0.8885 - val_loss: 0.2693 - val_accuracy: 0.9038
Epoch 5/100
3579/3579 - 0s - loss: 0.2862 - accuracy: 0.8919 - val_loss: 0.2597 - val_accuracy: 0.9016
Epoch 6/100
3579/3579 - 0s - loss: 0.2774 - accuracy: 0.9003 - val_loss: 0.2522 - val_accuracy: 0.9016
Epoch 7/100
3579/3579 - 0s - loss: 0.2697 - accuracy: 0.9016 - val_loss: 0.2436 - val_accuracy: 0.9060
Epoch 8/100
3579/3579 - 0s - loss: 0.2640 - accuracy: 0.9033 - val_loss: 0.2419 - val_accuracy: 0.9105
Epoch 9/100
3579/3579 - 0s - loss: 0.2575 - accuracy: 0.9061 - val_loss: 0.2367 - val_accuracy: 0.9083
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x183e46ffdc8>

We see that the validation loss oscillates instead of decreasing steadily, which suggests that the model is starting to overfit the training data.

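## Setting an early stopping mechanism
We do this through the _callbacks_ argument of the fit method.

Callbacks are functions called at certain points during model training. Many are available; we are using EarlyStopping.

tf.keras.callbacks.EarlyStopping(patience): patience is the number of epochs with no improvement in the monitored quantity (val_loss by default) after which training is stopped. It is set to 0 by default, but we can raise it to tolerate a few epochs without improvement.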
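
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Keras's actual code: `early_stopping_epoch` is a hypothetical helper, the first nine values are the val_loss figures from the early-stopping run logged in this notebook, and the tenth value is made up to show a stop.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once `patience` epochs in a row have failed to
    improve on the best validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss for epochs 1-9 as logged in the early-stopping run of this notebook.
logged_val_losses = [0.3859, 0.2846, 0.2646, 0.2533, 0.2457,
                     0.2446, 0.2351, 0.2332, 0.2341]

# Epoch 9 is the first epoch without improvement, so nothing stops yet.
stop_on_logged = early_stopping_epoch(logged_val_losses, patience=2)

# Hypothetical epoch 10: lower than epoch 9 but still above the best (0.2332),
# so it counts as a second epoch without improvement.
stop_with_extra = early_stopping_epoch(logged_val_losses + [0.2340], patience=2)

print(stop_on_logged, stop_with_extra)
```

Note that the comparison is against the best value so far, not against the previous epoch, so a small decrease that does not beat the best still counts against the patience budget.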

In [11]:
input_size = 10
output_size = 2
# Use same hidden layer size for all hidden layers. Not a necessity.
hidden_layer_size = 50

model_es = tf.keras.Sequential([
    
    
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 3rd hidden layer

    # will have to explore combinations of activation functions
    
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

model_es.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size=100
max_epochs=100

early_stopping=tf.keras.callbacks.EarlyStopping(patience=2) # stops once val_loss has not improved for 2 consecutive epochs

model_es.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs =max_epochs, 
          callbacks = [early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 0.5666 - accuracy: 0.7368 - val_loss: 0.3859 - val_accuracy: 0.8926
Epoch 2/100
3579/3579 - 0s - loss: 0.3611 - accuracy: 0.8751 - val_loss: 0.2846 - val_accuracy: 0.8993
Epoch 3/100
3579/3579 - 0s - loss: 0.3054 - accuracy: 0.8894 - val_loss: 0.2646 - val_accuracy: 0.9016
Epoch 4/100
3579/3579 - 0s - loss: 0.2836 - accuracy: 0.8991 - val_loss: 0.2533 - val_accuracy: 0.9038
Epoch 5/100
3579/3579 - 0s - loss: 0.2692 - accuracy: 0.9005 - val_loss: 0.2457 - val_accuracy: 0.9038
Epoch 6/100
3579/3579 - 0s - loss: 0.2599 - accuracy: 0.9030 - val_loss: 0.2446 - val_accuracy: 0.9083
Epoch 7/100
3579/3579 - 0s - loss: 0.2566 - accuracy: 0.9058 - val_loss: 0.2351 - val_accuracy: 0.9060
Epoch 8/100
3579/3579 - 0s - loss: 0.2509 - accuracy: 0.9044 - val_loss: 0.2332 - val_accuracy: 0.9128
Epoch 9/100
3579/3579 - 0s - loss: 0.2428 - accuracy: 0.9098 - val_loss: 0.2341 - val_accuracy: 0.9172
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x183ee2277c8>

val_accuracy: 0.9150; Awesome!

## Test the model

In [12]:
test_loss, test_accuracy = model_es.evaluate(test_inputs, test_targets)



0s 476us/sample - loss: 0.2887 - accuracy: 0.8906
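
As a rough illustration of what the accuracy figure means, the sketch below computes accuracy from softmax outputs by taking the most probable class per sample and comparing it with the integer targets. The helper name and the four demo samples are hypothetical, not values from the test set.

```python
def accuracy_from_probabilities(probabilities, targets):
    """Fraction of samples whose most probable class matches the target."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical softmax outputs for four customers (class 0, class 1).
demo_probabilities = [
    [0.9, 0.1],  # predicted class 0
    [0.2, 0.8],  # predicted class 1
    [0.6, 0.4],  # predicted class 0
    [0.3, 0.7],  # predicted class 1
]
demo_targets = [0, 1, 1, 1]

demo_accuracy = accuracy_from_probabilities(demo_probabilities, demo_targets)
print(demo_accuracy)
```

Here three of the four predictions match their targets, so the accuracy is 0.75.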

## Appendix: trying to improve the accuracy

| hidden layers | nodes | activation | batch size | optimizer | test loss | test accuracy |
|---|---|---|---|---|---|---|
| 2 | 50 | relu | 100 | adam | 0.2887 | 0.8906 |
|   | 100 |   |   |   | 0.3059 | 0.8862 |
| 3 | 50 |   |   |   | 0.2965 | 0.8862 |
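
To put the three configurations in perspective, the sketch below counts their trainable parameters, assuming 10 inputs, 2 outputs and fully connected (Dense) layers with biases, as in the models above. The helper names are hypothetical; each Dense layer contributes `inputs * outputs` weights plus `outputs` biases.

```python
def network_layer_sizes(input_size, hidden_layers, nodes, output_size):
    """List of layer widths, from the input features to the output nodes."""
    return [input_size] + [nodes] * hidden_layers + [output_size]


def dense_parameter_count(layer_sizes):
    """Total weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# (hidden layers, nodes per hidden layer) for the three rows of the table.
configurations = [(2, 50), (2, 100), (3, 50)]

parameter_counts = {
    (layers, nodes): dense_parameter_count(network_layer_sizes(10, layers, nodes, 2))
    for layers, nodes in configurations
}

for config, count in parameter_counts.items():
    print(config, count)
```

This gives 3202, 11402 and 5752 parameters respectively. The larger models did not score better on the test set in the table, which is consistent with the small dataset (3579 training samples) not benefiting from extra capacity, although three runs are too few to draw a firm conclusion.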