# Problem statement: 
## Create a machine learning algorithm that can predict if a customer will buy again

The data comes from an audiobook app, so it relates to the audio versions of books ONLY. Each customer in the database has made at least one purchase; that is why he/she is in the database. We want to create a machine learning algorithm, based on our available data, that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

The data is summarized in a .csv file with the following variables: Customer ID, Book length overall (sum of the lengths, in minutes, of all purchases), Book length avg (average length, in minutes, of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, his/her score out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from a forgotten password to assistance with using the app), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it works more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months sounds like a reasonable time: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again. 

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 


### classification problem - supervised learning

### action plan
1. preprocess the data
 - balance the dataset
 - divide the dataset into training, validation, and test sets
 - save the data in a tensor-friendly format - .npz
2. create machine learning algorithm


# Load data and Preprocessing

In [30]:
## import the libraries

import numpy as np
from sklearn import preprocessing


In [31]:
## LOADING THE CSV FILE
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

In [32]:
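# Inputs: every column except the first (customer ID) and the last (targets).
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# Targets: the last column (0 or 1).
targets_all = raw_csv_data[:, -1]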
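
As a quick illustration of the slicing above, here is a minimal sketch on a toy array. The rows and values are hypothetical and only mimic the layout described earlier (ID first, target last); they are not taken from `Audiobooks_data.csv`.

```python
import numpy as np

# Toy rows (hypothetical values): [customer_id, feature_a, feature_b, target]
raw = np.array([
    [101.0, 1620.0, 19.73, 0.0],
    [102.0, 2160.0, 5.33, 1.0],
])

# Same slicing as in the notebook: drop the first and last columns for the inputs,
# keep only the last column for the targets.
inputs = raw[:, 1:-1]
targets = raw[:, -1]

print(inputs.shape, targets.shape)
```

The customer ID never reaches the model, and the target column is kept apart from the features.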

In [33]:
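## BALANCING THE DATASET
# Count the positive targets (1s); we keep at most that many negatives (0s).
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

# Mark every 0-target row beyond the first num_one_targets zeros for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)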

In [36]:
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
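
The balancing step keeps as many 0s as there are 1s and drops the remaining 0s. The same idea can be sketched without an explicit loop; this is an alternative formulation on toy data (hypothetical values), not the notebook's own code.

```python
import numpy as np

# Toy targets (hypothetical), with more 0s than 1s.
targets = np.array([0, 1, 0, 0, 1, 0, 0])
inputs = np.arange(targets.shape[0] * 2).reshape(-1, 2)

num_ones = int(np.sum(targets))

# Positions of all 0-targets, in order of appearance.
zero_indices = np.flatnonzero(targets == 0)

# Keep the first num_ones zeros, remove the rest.
indices_to_remove = zero_indices[num_ones:]

balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)

print(balanced_targets)
```

After the deletion, half of the remaining targets are 1s, which is what "equal priors" refers to.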

In [37]:
## STANDARDIZE THE INPUTS

scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
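
With its default settings, `preprocessing.scale` standardizes each column to zero mean and unit variance. The sketch below reproduces that arithmetic in plain NumPy on a toy array (hypothetical values), just to show what the step does to columns with very different ranges.

```python
import numpy as np

# Two toy features on very different scales (hypothetical values).
x = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
])

# Column-wise standardization: subtract the mean, divide by the standard deviation.
column_means = x.mean(axis=0)
column_stds = x.std(axis=0)  # population standard deviation (ddof=0)
scaled = (x - column_means) / column_stds

print(scaled.mean(axis=0), scaled.std(axis=0))
```

Both columns end up on the same scale, so no single feature dominates just because of its units.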

In [40]:
## shuffle the data

shuffles_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffles_indices)

shuffled_inputs = scaled_inputs[shuffles_indices]
shuffled_targets = targets_equal_priors[shuffles_indices]
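
Indexing the inputs and the targets with the same shuffled index array keeps every row paired with its own target. A minimal sketch on toy data follows; the seeded generator is an assumption added here for reproducibility and is not part of the notebook.

```python
import numpy as np

# Seeded generator so the shuffle is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

# Toy data where each target equals its input divided by 10, so pairing is easy to check.
inputs = np.array([[10.0], [20.0], [30.0], [40.0]])
targets = np.array([1, 2, 3, 4])

# One permutation of the row indices, applied to both arrays.
shuffled_indices = rng.permutation(inputs.shape[0])
shuffled_inputs = inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]

print(shuffled_inputs[:, 0], shuffled_targets)
```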

In [41]:
## split the dataset into train, validation, test

sample_count = shuffled_inputs.shape[0]
train_sample_count = int(0.8*sample_count)
validation_sample_count = int(0.1*sample_count)
test_sample_count = sample_count - train_sample_count - validation_sample_count

In [42]:
train_inputs = shuffled_inputs[:train_sample_count]
train_targets = shuffled_targets[:train_sample_count]

validation_inputs = shuffled_inputs[train_sample_count:train_sample_count+validation_sample_count]
validation_targets = shuffled_targets[train_sample_count:train_sample_count+validation_sample_count]

test_inputs = shuffled_inputs[train_sample_count+validation_sample_count:]
test_targets = shuffled_targets[train_sample_count+validation_sample_count:]

In [43]:
print(np.sum(train_targets), train_sample_count, np.sum(train_targets)/train_sample_count)
print(np.sum(validation_targets), validation_sample_count, np.sum(validation_targets)/validation_sample_count)
print(np.sum(test_targets), test_sample_count, np.sum(test_targets)/test_sample_count)

1798.0 3579 0.502374965074043
216.0 447 0.48322147651006714
223.0 448 0.49776785714285715
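
The printed counts can be cross-checked by hand. The sketch below redoes the 80/10/10 arithmetic using only the numbers shown in the output above; the total of 4474 samples is inferred from those three counts.

```python
# Sizes printed above for the train, validation, and test sets.
sample_count = 3579 + 447 + 448

# Same arithmetic as in the notebook: int() truncates, the test set takes the remainder.
train_sample_count = int(0.8 * sample_count)
validation_sample_count = int(0.1 * sample_count)
test_sample_count = sample_count - train_sample_count - validation_sample_count

# Number of 1-targets across the three sets, also from the printed output.
total_ones = 1798 + 216 + 223

print(train_sample_count, validation_sample_count, test_sample_count)
print(total_ones, sample_count - total_ones)
```

The 1s make up exactly half of all samples, as expected after balancing, while each individual split is only approximately balanced because the rows were shuffled before splitting.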


In [44]:
## SAVE THE THREE DATASETS AS .NPZ FILES

np.savez('Audiobooks_data_train', inputs = train_inputs, targets= train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets = test_targets)
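
`np.savez` appends the `.npz` extension to the file name, and the arrays can later be read back by the keyword names used when saving. Here is a minimal round-trip sketch on toy arrays, written to a temporary directory; the file name `toy_data` and the dtype casts are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Toy arrays (hypothetical values) standing in for one of the three datasets.
inputs = np.array([[0.1, -1.2], [0.5, 0.3]])
targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')

    # Saved as toy_data.npz; the keyword names become the keys of the archive.
    np.savez(path, inputs=inputs, targets=targets)

    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        # Targets were stored as floats; integer class labels are often more convenient.
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape, loaded_targets)
```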

# Create Machine Learning algorithm

In [45]:
## import libraries

import tensorflow as tf

In [71]:
## MODEL

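input_size = 10   # number of input features (not used below; Keras infers it from the data)
output_size = 2   # two classes: 0 (won't buy) and 1 (will buy)
hidden_Layer_size = 100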

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_Layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_Layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batchsize = 100
max_epochs = 100
early_stopping = tf.keras.callbacks.EarlyStopping(patience=3)

model.fit(train_inputs,
         train_targets,
         batch_size=batchsize,
         epochs = max_epochs,
         callbacks = [early_stopping],
         validation_data=(validation_inputs,validation_targets),
         verbose=2)

Epoch 1/100
36/36 - 0s - loss: 0.5134 - accuracy: 0.7334 - val_loss: 0.4237 - val_accuracy: 0.7942 - 498ms/epoch - 14ms/step
Epoch 2/100
36/36 - 0s - loss: 0.4095 - accuracy: 0.7896 - val_loss: 0.3786 - val_accuracy: 0.8098 - 67ms/epoch - 2ms/step
Epoch 3/100
36/36 - 0s - loss: 0.3785 - accuracy: 0.7966 - val_loss: 0.3806 - val_accuracy: 0.7740 - 57ms/epoch - 2ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3629 - accuracy: 0.8036 - val_loss: 0.3538 - val_accuracy: 0.7987 - 54ms/epoch - 1ms/step
Epoch 5/100
36/36 - 0s - loss: 0.3529 - accuracy: 0.8069 - val_loss: 0.3451 - val_accuracy: 0.8076 - 58ms/epoch - 2ms/step
Epoch 6/100
36/36 - 0s - loss: 0.3416 - accuracy: 0.8167 - val_loss: 0.3414 - val_accuracy: 0.7964 - 56ms/epoch - 2ms/step
Epoch 7/100
36/36 - 0s - loss: 0.3411 - accuracy: 0.8103 - val_loss: 0.3455 - val_accuracy: 0.7875 - 58ms/epoch - 2ms/step
Epoch 8/100
36/36 - 0s - loss: 0.3346 - accuracy: 0.8226 - val_loss: 0.3323 - val_accuracy: 0.8031 - 56ms/epoch - 2ms/step
Epoch 9/100
36

<keras.callbacks.History at 0x22dc8491400>
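
To make the role of `patience=3` concrete, here is a simplified sketch of the patience rule in plain Python. It assumes the monitored quantity is the validation loss and ignores the other options of the real callback (such as `min_delta` or `restore_best_weights`); the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# val_loss values from the first eight epochs of the log above.
logged_val_losses = [0.4237, 0.3786, 0.3806, 0.3538, 0.3451, 0.3414, 0.3455, 0.3323]
print(early_stopping_epoch(logged_val_losses, patience=3))

# A hypothetical sequence where the loss stops improving after epoch 2.
print(early_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.43], patience=3))
```

In the logged run the validation loss keeps reaching new lows often enough that three consecutive epochs without improvement do not occur within the first eight epochs.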

In [72]:
## TEST MODEL

test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)
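
The loss and accuracy returned by the evaluation can be understood with a small NumPy sketch. The probabilities and labels below are hypothetical, and the loss formula is the plain mean negative log-probability of the true class, without the numerical safeguards a framework would add.

```python
import numpy as np

# Hypothetical softmax outputs for four customers: [P(class 0), P(class 1)].
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
true_labels = np.array([0, 1, 1, 1])

# Predicted class = index of the larger probability.
predicted_labels = np.argmax(probabilities, axis=1)
accuracy = np.mean(predicted_labels == true_labels)

# Sparse categorical cross-entropy: pick the probability assigned to the true class,
# take the negative log, and average over the samples.
prob_of_true_class = probabilities[np.arange(len(true_labels)), true_labels]
loss = -np.mean(np.log(prob_of_true_class))

print(predicted_labels, accuracy, loss)
```

The third customer is misclassified, so the accuracy is 3 out of 4, and that same row contributes the largest term to the loss.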

