# Audiobooks business case

## Problem Description
You are given data from an audiobook app. Logically, it relates to the audio versions of books ONLY. Every customer in the database has made at least one purchase, which is why they appear in it.  
The objective is to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts SOLELY on customers who are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

###### Dataset Summary
You have a .csv file summarizing the data. It contains the following variables:
- Customer ID
- Book length overall (sum of the lengths, in minutes, of all purchases)
- Book length avg (average length, in minutes, of all purchases)
- Price paid overall (sum of all purchases)
- Price paid avg (average of all purchases)
- Review (a Boolean variable indicating whether the customer left a review)
- Review out of 10 (the customer's score out of 10, if they left a review)
- Total minutes listened
- Completion (from 0 to 1)
- Support requests (number of support requests; everything from a forgotten password to assistance with using the app)
- Last visited minus purchase date (in days)

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. If they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

###### Objective
The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.
- 1: The customer converted.
- 0: The customer did not convert.

## Preprocess the data

### Extract the data from the csv

In [1]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Dataset/Audiobooks_data.csv', delimiter = ',')

# The inputs are all columns in the csv, except for the first one (Customer ID), and the last one (targets)
unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column.
targets_all = raw_csv_data[:,-1]
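
The column slicing above can be checked on a tiny stand-in for the real file. This is only a sketch: the three rows and their values are made up for illustration and use just two input columns instead of the full set.

```python
import io
import numpy as np

# Hypothetical three-row stand-in for the real CSV: ID, two inputs, target.
toy_csv = io.StringIO("101,1620,19.73,0\n102,2160,5.33,1\n103,648,8.00,0\n")
toy_data = np.loadtxt(toy_csv, delimiter=',')

toy_inputs = toy_data[:, 1:-1]   # drop the first (ID) and last (target) columns
toy_targets = toy_data[:, -1]    # keep only the last column

print(toy_inputs.shape, toy_targets)
```

The same two slices work for any number of input columns, as long as the ID comes first and the target comes last.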

### Balance the dataset
To avoid introducing bias into the model, the training data will be balanced. This means keeping as many rows with target=0 as there are rows with target=1 and dropping the surplus target=0 rows. The resulting data used for the model has a 50-50 split between targets 0 and 1.

In [2]:
# Count how many targets are 1 
num_one_targets = int(np.sum(targets_all))
                      
# Set a counter for targets that are 0 
zero_targets_counter = 0

indices_to_remove = []
                      
# Collect the indices of the surplus target-0 rows:
# once as many 0s as 1s have been seen, every further 0 is marked for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Store the inputs and targets that remain after the surplus rows have been deleted.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
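
The loop above can also be written without an explicit counter. The sketch below is an alternative, not the notebook's method: it uses `np.flatnonzero` on a small made-up target vector and assumes, as the loop does, that there are more 0s than 1s.

```python
import numpy as np

# Hypothetical toy data: three 1s and six 0s, two input columns.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.shape[0] * 2).reshape(-1, 2)

num_ones = int(np.sum(toy_targets))
zero_positions = np.flatnonzero(toy_targets == 0)

# Every 0-row beyond the first `num_ones` zeros is surplus.
surplus = zero_positions[num_ones:]

balanced_inputs = np.delete(toy_inputs, surplus, axis=0)
balanced_targets = np.delete(toy_targets, surplus, axis=0)

print(balanced_targets, balanced_targets.mean())
```

Like the loop, this keeps the earliest 0-rows and drops the later ones, so it removes the same rows.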

### Standardize the inputs

In [3]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
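
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that by hand with NumPy on a made-up array, assuming no column is constant (a constant column would cause a division by zero here).

```python
import numpy as np

# Hypothetical toy inputs with two columns on very different scales.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)            # population standard deviation (ddof=0)
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After scaling, both columns have the same spread, so no input dominates purely because of its units.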

### Shuffle the data
The dataset is arranged by date, which might lead the model to learn trends that stem from this ordering.  
Shuffling ensures the data is spread as randomly as possible.

In [4]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
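
The cell above produces a different order on every run. If a reproducible shuffle is wanted, one option is a seeded generator, sketched below on made-up arrays; the seed value `42` is arbitrary and not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed, chosen for illustration

# Hypothetical toy data: row k of the inputs starts with the value 2*k.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation of the row indices, applied to both arrays,
# keeps every input row paired with its own target.
order = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[order]
shuffled_toy_targets = toy_targets[order]
```

Indexing both arrays with the same `order` is what keeps inputs and targets aligned; shuffling them separately would break the pairing.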

### Split the dataset into train, validation, and test

In [5]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming an 80-10-10 distribution of training, validation, and test.
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, right after the training ones.
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# Check if the train, validation, and test data are balanced
# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1794.0 3579 0.501257334450964
222.0 447 0.4966442953020134
221.0 448 0.49330357142857145
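
The subset sizes in the printed output follow directly from the 80-10-10 arithmetic. The sketch below wraps that arithmetic in a hypothetical helper, `split_counts`, and applies it to 4474 samples, the total obtained by adding the three printed counts (3579 + 447 + 448).

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; test absorbs the rounding remainder."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))   # (3579, 447, 448)
```

Computing the test size by subtraction guarantees that the three subsets always add up to the full dataset, whatever the rounding.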


### Save the three datasets in *.npz

In [6]:
np.savez('Dataset/Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Dataset/Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Dataset/Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
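
The saved files can be read back with `np.load`, using the keyword names passed to `np.savez` as keys. The round trip below is a minimal sketch on made-up arrays written to a temporary directory, so it does not touch the `Dataset/` folder.

```python
import os
import tempfile
import numpy as np

# Hypothetical toy arrays standing in for one of the three subsets.
toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_train')   # np.savez appends '.npz'
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The keys match the keyword names used when saving.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs.shape, loaded_targets)
```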