# PROJECT: Audiobook Purchase Prediction

## Problem
 
The goal is to create a machine learning algorithm that can predict whether a customer will buy again from the Audiobook company.
The data is collected from an Audiobook app. Each customer in the database has purchased an audio version of a book at least once.

The main idea is to focus our efforts ONLY on customers who are likely to buy from the company again, which can produce significant savings. If a customer has a low probability of coming back, spending resources on advertising to them would be wasteful.

There are several variables:

- 'Customer ID'
- 'Book length avg' (average of all purchases, in minutes)
- 'Book length sum' (sum of all purchases, in minutes)
- 'Avg Price' (average of all purchases)
- 'Sum Price paid' (sum of all purchases)
- 'Review' (a Boolean variable)
- 'Review(out of 10)'
- 'Completion' (from 0 to 1)
- 'Total minutes listened'
- 'Support Requests' (number)
- 'Last visited' (last visit date minus purchase date, in days)
- 'Purchase' (targets)

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months based on their last 2 years of activity and engagement. Six months is a reasonable horizon: if a customer doesn't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.
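The windowing described above can be sketched in plain Python. This is only an illustration: the helper name `make_target`, the purchase-date list, and the day counts (730 and 180) are assumptions, since the notebook loads targets that were already prepared.

```python
from datetime import date, timedelta

# Assumed day counts for the two windows; the source only says
# "2 years" of inputs and "the next 6 months" of targets.
OBSERVATION_DAYS = 730
TARGET_DAYS = 180


def make_target(purchase_dates, window_start):
    """Return 1 if any purchase falls in the target window, else 0.

    Hypothetical helper: the observation window starts at window_start,
    and the target window is the TARGET_DAYS that follow it.
    """
    observation_end = window_start + timedelta(days=OBSERVATION_DAYS)
    target_end = observation_end + timedelta(days=TARGET_DAYS)
    return int(any(observation_end <= d < target_end for d in purchase_dates))


start = date(2020, 1, 1)
returning = make_target([date(2020, 1, 10), date(2022, 3, 1)], start)
one_time = make_target([date(2020, 1, 10)], start)
```

Here `returning` is labelled 1 because the second purchase lands inside the 6-month target window, while `one_time` is labelled 0.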

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

## Create the machine learning algorithm

In [13]:
#import relevant libraries
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import tensorflow as tf
import joblib

### Load Data

In [14]:
#load npz data
train_data = np.load('AudioTrainData.npz')
# we extract the inputs and targets using the keywords under which we saved them
# cast explicitly: inputs to floats, targets to integers
# (the np.float and np.int aliases were removed in NumPy 1.24, so use np.float64 and np.int64)
train_inputs = train_data['inputs'].astype(np.float64)
train_targets = train_data['targets'].astype(np.int64)
# Load validation data and extract inputs and targets
validation_data = np.load('AudioValidationData.npz')
validation_inputs = validation_data['inputs'].astype(np.float64)
validation_targets = validation_data['targets'].astype(np.int64)
# Load test data and extract inputs and targets
test_data = np.load('AudioTestData.npz')
test_inputs = test_data['inputs'].astype(np.float64)
test_targets = test_data['targets'].astype(np.int64)
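The three loading steps follow the same pattern, so they could be folded into one helper. The sketch below is an assumption about how that might look (`load_split` is a hypothetical name); it uses `np.float64` and `np.int64` because the `np.float` and `np.int` aliases were removed in NumPy 1.24.

```python
import numpy as np


def load_split(path):
    """Load one .npz split and return (inputs, targets) with explicit dtypes.

    Hypothetical helper; it assumes the file was saved with the
    keywords 'inputs' and 'targets', as in this notebook.
    """
    with np.load(path) as data:
        inputs = data['inputs'].astype(np.float64)
        targets = data['targets'].astype(np.int64)
    return inputs, targets
```

With the files used in this notebook, a call would look like `train_inputs, train_targets = load_split('AudioTrainData.npz')`.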

## Model

In [15]:
#outline the model
#set input size, output size and hidden layer size
input_size = 10
output_size = 2
hidden_layer_size = 50

# defining the model structure
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # hidden_layer_size and the activation function are the only arguments used in this particular case
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])

#create a custom optimizer with a set value for the learning rate
#choose a loss function
#choose metrics (accuracy in this case)
custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

#training the model

# set batch size and maximum training epochs
batch_size = 100
max_epochs = 100 #maximum number of epochs (reached only if early stopping is not triggered)
#set EarlyStopping: stop once the validation loss has not improved for 2 consecutive epochs
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
#fit the model on the train and validation data
model.fit(train_inputs,
          train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2
         )

Train on 3355 samples, validate on 671 samples
Epoch 1/100
3355/3355 - 1s - loss: 0.4437 - accuracy: 0.7529 - val_loss: 0.4121 - val_accuracy: 0.7347
Epoch 2/100
3355/3355 - 0s - loss: 0.3614 - accuracy: 0.7997 - val_loss: 0.3864 - val_accuracy: 0.7660
Epoch 3/100
3355/3355 - 0s - loss: 0.3476 - accuracy: 0.8155 - val_loss: 0.4022 - val_accuracy: 0.7601
Epoch 4/100
3355/3355 - 0s - loss: 0.3422 - accuracy: 0.8104 - val_loss: 0.3788 - val_accuracy: 0.7794
Epoch 5/100
3355/3355 - 0s - loss: 0.3375 - accuracy: 0.8077 - val_loss: 0.3738 - val_accuracy: 0.7914
Epoch 6/100
3355/3355 - 0s - loss: 0.3323 - accuracy: 0.8188 - val_loss: 0.3783 - val_accuracy: 0.7914
Epoch 7/100
3355/3355 - 0s - loss: 0.3300 - accuracy: 0.8197 - val_loss: 0.3662 - val_accuracy: 0.7869
Epoch 8/100
3355/3355 - 0s - loss: 0.3338 - accuracy: 0.8122 - val_loss: 0.3681 - val_accuracy: 0.7809
Epoch 9/100
3355/3355 - 0s - loss: 0.3305 - accuracy: 0.8185 - val_loss: 0.3862 - val_accuracy: 0.7899
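The log above ends at epoch 9 of a possible 100 because of early stopping. With `patience=2` and the default monitored quantity `val_loss`, training halts once the validation loss has failed to improve on its best value for 2 consecutive epochs. The sketch below is a simplified re-implementation of that rule for illustration, not Keras code; `early_stopping_epoch` is a hypothetical name, and the loss values are copied from the log.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified patience rule: stop after `patience` consecutive epochs
    without a new best (lowest) validation loss.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0          # new best: reset the counter
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# val_loss values from the training log above
val_losses = [0.4121, 0.3864, 0.4022, 0.3788, 0.3738,
              0.3783, 0.3662, 0.3681, 0.3862]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
best_epoch = val_losses.index(min(val_losses)) + 1
```

The best validation loss (0.3662) occurs at epoch 7, and epochs 8 and 9 do not improve on it, so `stop_epoch` is 9. Because `restore_best_weights` defaults to `False` in `EarlyStopping`, the model keeps the weights from the final epoch rather than from epoch 7.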


<tensorflow.python.keras.callbacks.History at 0x19e08defc88>
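To make the comment about `Dense` concrete, here is a minimal NumPy sketch of the forward pass through a network of the same shape (10 inputs, two hidden layers of 50 units with ReLU, 2 softmax outputs), followed by the sparse categorical cross-entropy loss. The weights are random stand-ins, not the trained parameters, and the batch and targets are made up for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # subtract the row maximum for numerical stability
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def dense(inputs, weights, bias, activation):
    # output = activation(dot(input, weight) + bias)
    return activation(inputs @ weights + bias)


def sparse_categorical_crossentropy(probabilities, targets):
    # mean negative log-probability assigned to the true class (integer labels)
    rows = np.arange(len(targets))
    return -np.log(probabilities[rows, targets]).mean()


rng = np.random.default_rng(0)
input_size, hidden_layer_size, output_size = 10, 50, 2

# random stand-in parameters (assumption: not the trained weights)
w1 = rng.normal(scale=0.1, size=(input_size, hidden_layer_size))
w2 = rng.normal(scale=0.1, size=(hidden_layer_size, hidden_layer_size))
w3 = rng.normal(scale=0.1, size=(hidden_layer_size, output_size))
b1 = np.zeros(hidden_layer_size)
b2 = np.zeros(hidden_layer_size)
b3 = np.zeros(output_size)

batch = rng.normal(size=(4, input_size))   # 4 made-up customers
targets = np.array([0, 1, 1, 0])           # made-up labels

hidden1 = dense(batch, w1, b1, relu)
hidden2 = dense(hidden1, w2, b2, relu)
probabilities = dense(hidden2, w3, b3, softmax)
loss = sparse_categorical_crossentropy(probabilities, targets)
```

Each row of `probabilities` sums to 1, and the two columns correspond to the classes "won't buy" (0) and "will buy" (1). The loss works directly with integer targets, which is why the targets do not need to be one-hot encoded.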

## Testing the model

In [16]:
#test model on test data
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [17]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.1f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.33. Test accuracy: 82.8%
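Accuracy alone does not show how well the model serves the stated goal of advertising only to likely buyers. Precision (how many targeted customers actually buy) and recall (how many buyers are targeted) are useful complements. The sketch below computes all three with NumPy on a made-up array of softmax outputs; the numbers are hypothetical and are not results from this dataset. In the notebook, the probabilities would presumably come from `model.predict(test_inputs)`.

```python
import numpy as np

# made-up softmax outputs for 5 customers: columns are P(class 0), P(class 1)
probabilities = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.55, 0.45],
])
targets = np.array([0, 1, 1, 1, 0])   # made-up true labels

# predicted class is the column with the highest probability
predictions = probabilities.argmax(axis=1)

accuracy = (predictions == targets).mean()

true_positives = np.sum((predictions == 1) & (targets == 1))
precision = true_positives / np.sum(predictions == 1)   # of those targeted, how many buy
recall = true_positives / np.sum(targets == 1)          # of those who buy, how many are targeted
```

On this toy data the accuracy is 0.8, the precision is 1.0, and the recall is 2/3: every customer flagged as a buyer does buy, but one actual buyer is missed.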


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly 82.8%.