## Problem

You are given data from an audiobook app. Logically, it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to create a machine learning algorithm, based on the available data, that can predict whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts ONLY on customers who are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs (excluding Customer ID, which is completely arbitrary: it is more like a name than a number).

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

Good luck!

## Create the machine learning algorithm
#### Import the relevant libraries


In [34]:
import numpy as np
import tensorflow as tf

## Data

In [35]:
# get the train data
npz = np.load(
    "/home/angelo/repos/vscode_repos/customer_analytics_2022/pickle_data_models/Audiobooks_data_train.npz"
)

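# the npz file is dict-like: the arrays are stored under the keys "inputs" and "targets"
# ensure that the inputs are floats and the targets are integers
# (np.float and np.int are deprecated since NumPy 1.20, so use explicit dtypes)
train_inputs = npz["inputs"].astype(np.float64)
train_targets = npz["targets"].astype(np.int64)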

# get validation data
npz = np.load(
    "/home/angelo/repos/vscode_repos/customer_analytics_2022/pickle_data_models/Audiobooks_data_validation.npz"
)
validation_inputs = npz["inputs"].astype(np.float64)
validation_targets = npz["targets"].astype(np.int64)

# get test data
npz = np.load(
    "/home/angelo/repos/vscode_repos/customer_analytics_2022/pickle_data_models/Audiobooks_data_test.npz"
)
test_inputs = npz["inputs"].astype(np.float64)
test_targets = npz["targets"].astype(np.int64)


In [36]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices("GPU")))

Num GPUs Available:  0
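
Before training, it can help to confirm that each split has the expected shape and dtype. The sketch below writes a small synthetic `.npz` archive to an in-memory buffer and reads it back in the same way as the loading step above. The array contents and the size (5 rows) are made up for illustration and are not the Audiobooks data.

```python
import io

import numpy as np

# Hypothetical stand-in for one of the Audiobooks_data_*.npz files:
# 5 customers, 10 predictors, binary targets.
rng = np.random.default_rng(0)
buffer = io.BytesIO()
np.savez(
    buffer,
    inputs=rng.normal(size=(5, 10)),
    targets=np.array([0, 1, 1, 0, 1]),
)
buffer.seek(0)

# np.load returns a dict-like object; arrays are accessed by the names used in np.savez
npz = np.load(buffer)
inputs = npz["inputs"].astype(np.float64)
targets = npz["targets"].astype(np.int64)

# one row per customer, one column per predictor, one target per row
print(inputs.shape, inputs.dtype)
print(targets.shape, targets.dtype)
```

The same two `print` calls can be applied to `train_inputs`, `validation_inputs` and `test_inputs` to check that every split has 10 columns.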


## Model
Outline, optimizers, loss, early stopping & training

To create a neural network, we need to define its width and depth.
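
As a rough illustration of what width and depth mean for model size, the sketch below counts the trainable parameters of a fully connected network. It assumes the layer sizes used in the next cell (10 inputs, two hidden layers of 50 units, 2 outputs); the helper name `dense_param_count` is hypothetical and not part of TensorFlow.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the size of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # each dense layer has an (n_in x n_out) weight matrix and n_out biases
        total += n_in * n_out + n_out
    return total


layer_sizes = [10, 50, 50, 2]  # input, hidden, hidden, output
depth = len(layer_sizes) - 2   # number of hidden layers
width = layer_sizes[1]         # units per hidden layer

print(depth, width, dense_param_count(layer_sizes))  # 2 50 3202
```

Doubling the width to 100 raises the count to 11,402 under the same assumptions, so width has a much larger effect on model size than the input or output dimensions here.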


In [37]:
# input size must equal the number of predictors
# we have 10 predictors
from sklearn import metrics


input_size = 10

# output_size must be two, as there are only two classes (0 and 1)
output_size = 2

# hidden layers are usually given the same size
hidden_layer_size = 50

# initialize the model
# Sequential stacks the layers in the order in which they are listed
# for each layer we define its size and its activation function
model = tf.keras.Sequential(
    [
        # 1. define the first hidden layer
        # Dense computes the dot product of the inputs and the weights and adds the bias
        # it can also apply an activation function
        # it takes the hidden layer size and then the activation function
        # common choices: relu (rectified linear unit), tanh, sigmoid
        tf.keras.layers.Dense(hidden_layer_size, activation="relu"),
        # 2. define the second hidden layer
        tf.keras.layers.Dense(hidden_layer_size, activation="relu"),
        # --> You can stack as many layers as you want
        # 3. define the output layer
        # note: whenever we are building a classifier, the activation function
        # must transform the output values into probabilities --> softmax
        tf.keras.layers.Dense(output_size, activation="softmax"),
    ]
)

# choose an optimizer & loss function
model.compile(
    # a good default optimizer: Adaptive Moment Estimation (adam)
    optimizer="adam",
    # regarding the loss: we are training a classifier
    # sparse_categorical_crossentropy takes integer class labels as targets,
    # so we do not need to one-hot encode them ourselves
    loss="sparse_categorical_crossentropy",
    # metric to report during training (the optimizer minimizes the loss, not this metric)
    metrics=["accuracy"],
)


# we need to set a batch size and a maximum number of epochs
# the batch size indicates how many observations are fed to the algorithm at once
batch_size = 100

# this is the simplest stopping mechanism: a cap on the number of epochs the algorithm trains for
max_epochs = 100

# better stopping
# by default this callback monitors the validation loss and stops training the first time it stops improving
# this helps prevent the model from overfitting the training data
early_stopping = tf.keras.callbacks.EarlyStopping()

# the only problem with the early_stopping above is that the validation loss may have increased by chance
# so training would stop even though the increase is only noise
# so adjust the early stopping using the patience parameter
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)


# fit the model
model.fit(
    train_inputs,  # train inputs
    train_targets,  # train targets
    batch_size=batch_size,  # batch size
    epochs=max_epochs,  # epochs that we will train for (assuming early stopping doesn't kick in)
    # callbacks are functions called at set points during training (here, at the end of each epoch)
    # this one checks whether val_loss has stopped improving
    callbacks=[early_stopping],  # early stopping
    validation_data=(validation_inputs, validation_targets),  # validation data
    verbose=2,  # making sure we get enough information about the training process
)

Epoch 1/100
36/36 - 0s - loss: 0.5855 - accuracy: 0.7418 - val_loss: 0.4388 - val_accuracy: 0.8837 - 311ms/epoch - 9ms/step
Epoch 2/100
36/36 - 0s - loss: 0.3849 - accuracy: 0.8731 - val_loss: 0.3240 - val_accuracy: 0.8993 - 49ms/epoch - 1ms/step
Epoch 3/100
36/36 - 0s - loss: 0.3267 - accuracy: 0.8807 - val_loss: 0.2885 - val_accuracy: 0.9016 - 53ms/epoch - 1ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3075 - accuracy: 0.8849 - val_loss: 0.2697 - val_accuracy: 0.8971 - 54ms/epoch - 2ms/step
Epoch 5/100
36/36 - 0s - loss: 0.2939 - accuracy: 0.8902 - val_loss: 0.2540 - val_accuracy: 0.9083 - 50ms/epoch - 1ms/step
Epoch 6/100
36/36 - 0s - loss: 0.2838 - accuracy: 0.8924 - val_loss: 0.2422 - val_accuracy: 0.9060 - 49ms/epoch - 1ms/step
Epoch 7/100
36/36 - 0s - loss: 0.2773 - accuracy: 0.8938 - val_loss: 0.2421 - val_accuracy: 0.9172 - 49ms/epoch - 1ms/step
Epoch 8/100
36/36 - 0s - loss: 0.2696 - accuracy: 0.8975 - val_loss: 0.2330 - val_accuracy: 0.9195 - 54ms/epoch - 1ms/step
Epoch 9/100
36/

<keras.callbacks.History at 0x7f159477c8e0>
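
To make the effect of `patience` concrete, here is a minimal pure-Python sketch of the stopping rule. It is a simplified reading of the documented behaviour (stop after `patience` consecutive epochs without a new best validation loss), not Keras's actual implementation; the function name `stopping_epoch` and the first list of losses are hypothetical.

```python
def stopping_epoch(val_losses, patience=0):
    """Return the 1-based epoch at which training would stop, or None.

    Assumption: training stops once the validation loss has failed to beat
    its best value for `patience` consecutive epochs (at least one epoch).
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best value, reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= max(patience, 1):
                return epoch
    return None


# hypothetical validation losses with a one-off bump at epoch 4
hypothetical_val_losses = [0.44, 0.32, 0.29, 0.30, 0.28, 0.29, 0.30, 0.27]
print(stopping_epoch(hypothetical_val_losses, patience=0))  # 4
print(stopping_epoch(hypothetical_val_losses, patience=2))  # 7

# val_loss values from the first eight epochs of the training log above
logged_val_losses = [0.4388, 0.3240, 0.2885, 0.2697, 0.2540, 0.2422, 0.2421, 0.2330]
print(stopping_epoch(logged_val_losses, patience=2))  # None
```

In the hypothetical run, the default setting stops at the first bump, while `patience=2` trains through it and stops only after two consecutive epochs without improvement. The logged validation loss decreases in each of the first eight epochs, so under this sketch neither setting would have stopped training during them.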