# PROJECT: Audio Books' Purchase Prediction.
## Project Goal
 
The goal is to build a machine learning model that predicts whether a customer will buy again from the Audiobook company.
The data comes from an Audiobook app. Every customer in the database has purchased an audiobook at least once.

The main idea is to focus our efforts ONLY on customers who are likely to buy from the company again, which can save considerable resources. If a customer has a low probability of coming back, it would be wasteful to spend resources advertising to them.

The dataset has the following variables:

* 'Customer ID'
* 'Book length avg' (average length of all purchases, in minutes)
* 'Book length sum' (total length of all purchases, in minutes)
* 'Avg Price' (average price of all purchases)
* 'Sum Price paid' (total price of all purchases)
* 'Review' (a Boolean variable)
* 'Review(out of 10)'
* 'Completion' (from 0 to 1)
* 'Total minutes listened'
* 'Support Requests' (number of requests)
* 'Last visited' (last visit date minus purchase date, in days)
* 'Purchase' (the targets)

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months based on their last 2 years of activity and engagement. Six months is a reasonable window: if a customer has not converted by then, chances are they have gone to a competitor or did not like audiobooks as a way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict if a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

## Data Preprocessing
* Preprocess the data.
* Balance the dataset. 
* Create 3 datasets: training, validation, and test. 
* Save the newly created sets in a tensor-friendly format (e.g. `.npz`).


In [29]:
#import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [26]:
#load and read the data provided
data = pd.read_csv('AudioBooks-data.csv')
#print the first 5 rows of the data to get a picture of what the data looks like
data.head()

|   | CustomerID | Book length avg | Book length Sum | Avg Price | Sum Price Paid | Review | Review(out of 10) | Completion | Total Mins Listened | Support Requests | Last Visited(days) | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 994 | 1620.0 | 1620 | 19.73 | 19.73 | 1 | 10.0 | 0.99 | 1603.8 | 5 | 92 | 0 |
| 1 | 1143 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 0 | 0 |
| 2 | 2059 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 388 | 0 |
| 3 | 2882 | 1620.0 | 1620 | 5.96 | 5.96 | 0 | 8.91 | 0.42 | 680.4 | 1 | 129 | 0 |
| 4 | 3342 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.22 | 475.2 | 0 | 361 | 0 |


In [14]:
#checking for any missing values in the data
data.isnull().sum()

CustomerID             0
Book length avg        0
Book length Sum        0
Avg Price              0
Sum Price Paid         0
Review                 0
Review(out of 10)      0
Completion             0
Total Mins Listened    0
Support Requests       0
Last Visited(days)     0
Purchase               0
dtype: int64

The data contains no missing values.

In [43]:
#checking total number of counts for each of the target values
data['Purchase'].value_counts()

0    11847
1     2237
Name: Purchase, dtype: int64

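This shows how unbalanced the data is: there are 11847 customers with target 0 and only 2237 with target 1. The first step is therefore to balance the data, so that the model does not simply learn to favor the majority class.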
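To put a number on the imbalance, the counts can be turned into proportions. The snippet below is a small sketch that hard-codes the two counts from the `value_counts()` output above; `majority_baseline` is a name introduced here for illustration only.

```python
# Class counts copied from the value_counts() output above.
count_no_purchase = 11847
count_purchase = 2237
total = count_no_purchase + count_purchase

# Share of customers who bought again.
positive_rate = count_purchase / total

# Accuracy of a trivial model that always predicts 0 (hypothetical baseline).
majority_baseline = count_no_purchase / total

print(f"total customers: {total}")
print(f"positive rate: {positive_rate:.3f}")
print(f"always-predict-0 accuracy: {majority_baseline:.3f}")
```

Roughly 16% of customers bought again, so a model that always predicts 0 would already be right about 84% of the time without learning anything useful. This is why accuracy on the unbalanced data would be misleading.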

In [46]:
# The inputs are all columns except 'CustomerID' (an arbitrary identifier, more like a name than a number) and the 'Purchase' targets.
# Create NumPy arrays for both inputs and targets for easy manipulation
inputs = np.array(data.drop(['CustomerID', 'Purchase'], axis=1))
targets = np.array(data.Purchase)
print(inputs.shape, targets.shape)

(14084, 10) (14084,)


## Balancing the Data

In [49]:
# Balance the data by keeping every 1 and only as many 0s as there are 1s.
# Takes data_inputs and data_targets as parameters.
def BalanceData(data_inputs, data_targets):
    remove_indices = []
    count_ones = int(np.sum(data_targets))
    count_zeros = 0
    # Iterate over the function's own argument, not the global `targets`
    for i in range(data_targets.shape[0]):
        if data_targets[i] == 0:
            count_zeros += 1
            # Once there are as many 0s as 1s, mark every further 0 for removal
            if count_zeros > count_ones:
                remove_indices.append(i)
    balanced_inputs = np.delete(data_inputs, remove_indices, axis=0)
    balanced_targets = np.delete(data_targets, remove_indices, axis=0)
    return balanced_inputs, balanced_targets

In [50]:
balanced_inputs, balanced_targets = BalanceData(inputs, targets)
print(balanced_inputs.shape, balanced_targets.shape)

(4474, 10) (4474,)
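The function above keeps the first 0s it encounters and drops the rest, so which rows survive depends on the row order of the file. As a possible alternative (not the approach used in this notebook), the sketch below undersamples at random with vectorized NumPy calls. The function name `balance_by_undersampling`, its `seed` parameter, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def balance_by_undersampling(data_inputs, data_targets, seed=0):
    """Randomly undersample the larger class until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    one_idx = np.flatnonzero(data_targets == 1)
    zero_idx = np.flatnonzero(data_targets == 0)
    # Keep the same number of rows from each class, chosen without replacement.
    n_keep = min(len(one_idx), len(zero_idx))
    kept_one_idx = rng.choice(one_idx, size=n_keep, replace=False)
    kept_zero_idx = rng.choice(zero_idx, size=n_keep, replace=False)
    # Sort so the surviving rows keep their original relative order.
    keep = np.sort(np.concatenate([kept_one_idx, kept_zero_idx]))
    return data_inputs[keep], data_targets[keep]

# Toy data (made up for illustration): 8 zeros and 3 ones.
toy_targets = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.size * 2).reshape(-1, 2)

bal_inputs, bal_targets = balance_by_undersampling(toy_inputs, toy_targets)
print(bal_inputs.shape, bal_targets.shape)
print(int(bal_targets.sum()), int((bal_targets == 0).sum()))
```

On the toy data this keeps all 3 ones and a random 3 of the 8 zeros. Random selection avoids a dependence on file order, at the cost of needing a fixed seed for reproducible results.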


In [19]:
# standardize the inputs
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_inputs = scaler.fit_transform(balanced_inputs)
print(scaled_inputs)

[[ 0.21053387 -0.18888517  1.97823887 ...  4.80955413 11.83828419
   0.09415043]
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
  -0.80255852]
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
   2.979214  ]
 ...
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
  -0.7440775 ]
 [ 0.31737498  1.7482432   0.04679395 ... -0.41569922 -0.20183481
  -0.80255852]
 [ 0.31737498  1.7482432  -0.39082475 ... -0.41569922 -0.20183481
  -0.80255852]]
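One caveat about the cell above: the scaler is fitted on all balanced rows, so rows that later end up in the validation and test sets contribute to the mean and standard deviation. A common alternative is to compute the statistics on the training rows only and reuse them for the other rows. The sketch below shows the idea with plain NumPy on made-up data; the array names and the 80-row split point are assumptions, not part of this notebook.

```python
import numpy as np

# Made-up feature matrix standing in for the real inputs.
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Hypothetical split point: first 80 rows for training, the rest held out.
train_part, held_out_part = features[:80], features[80:]

# Compute the statistics on the training rows only.
train_mean = train_part.mean(axis=0)
train_std = train_part.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same transformation to both parts.
train_scaled = (train_part - train_mean) / train_std
held_out_scaled = (held_out_part - train_mean) / train_std

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
print(held_out_scaled.shape)
```

With `StandardScaler`, the equivalent pattern would be to call `fit` on the training inputs and `transform` on the validation and test inputs.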


In [53]:
# Shuffle the data
# After balancing, the rows are still in their original file order, so shuffle them before splitting to keep each subset representative
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)
# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = balanced_targets[shuffled_indices]
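The shuffle above gives a different order on every run because no seed is set. If repeatable splits are wanted, a seeded generator is one option. The following is a minimal sketch on toy arrays; the seed value 42 and the `demo_` names are arbitrary choices made for illustration.

```python
import numpy as np

# Toy arrays (made up): row i of the inputs belongs with target i.
demo_inputs = np.arange(10).reshape(5, 2)
demo_targets = np.array([0, 1, 0, 1, 1])

# A seeded generator makes the shuffle repeatable across runs.
rng = np.random.default_rng(42)
perm = rng.permutation(demo_inputs.shape[0])

# Index both arrays with the same permutation so the pairs stay aligned.
demo_shuffled_inputs = demo_inputs[perm]
demo_shuffled_targets = demo_targets[perm]

print(perm)
print(demo_shuffled_inputs)
print(demo_shuffled_targets)
```

The key point, here as in the cell above, is that inputs and targets are indexed with the same array of indices, so each row keeps its own label.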

In [52]:
#split the data
#total number of instances used
samples_count = shuffled_inputs.shape[0]

# number of instances used in training
train_samples_count = int(0.75*samples_count)
# number of instances used for validation
validation_samples_count = int(0.15*samples_count)
#number of instances used for testing
test_samples_count = samples_count - train_samples_count - validation_samples_count

#train data after shuffling
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# validation data after shuffling
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]

# test data after shuffling
test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

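# For each subset, print the number of 1s, the subset size, and the share of 1s (expected to be close to 0.5)
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)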
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)
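The slicing logic above can also be wrapped in a small helper so the split sizes are easy to check. This is a sketch under assumptions: `three_way_split` and its parameter names are hypothetical, and the toy arrays only mimic the balanced row count of 4474.

```python
import numpy as np

def three_way_split(inputs, targets, train_frac=0.75, val_frac=0.15):
    """Slice already-shuffled arrays into train, validation, and test parts."""
    n = inputs.shape[0]
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train_end, val_end = n_train, n_train + n_val
    return (
        (inputs[:train_end], targets[:train_end]),
        (inputs[train_end:val_end], targets[train_end:val_end]),
        (inputs[val_end:], targets[val_end:]),  # test gets whatever is left
    )

# Toy data (made up): 4474 rows, matching the balanced row count above.
toy_inputs = np.zeros((4474, 10))
toy_targets = np.tile([0, 1], 2237)

train, val, test = three_way_split(toy_inputs, toy_targets)
for name, (x, y) in [("train", train), ("validation", val), ("test", test)]:
    print(name, x.shape[0], float(y.mean()))
```

With 4474 rows, a 75/15/10 split gives 3355 training, 671 validation, and 448 test rows. Because the test set takes the remainder, the three sizes always add up to the total.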

Save the training, validation, and test datasets as .npz files, to be used later when building the model.

In [22]:
np.savez('AudioTrainData', inputs = train_inputs, targets = train_targets)
np.savez('AudioValidationData', inputs = validation_inputs, targets = validation_targets)
np.savez('AudioTestData', inputs = test_inputs, targets = test_targets)
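When the model is built later, the files can be read back with `np.load`. The sketch below does a round trip on toy arrays in a temporary directory, so it does not touch the real files; the `DemoData` file name and the `demo_` arrays are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Toy arrays (made up) standing in for one of the saved splits.
demo_inputs = np.arange(6, dtype=float).reshape(3, 2)
demo_targets = np.array([0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.savez appends ".npz" when the file name does not already end with it.
    base_path = os.path.join(tmp_dir, "DemoData")
    np.savez(base_path, inputs=demo_inputs, targets=demo_targets)
    saved_path = base_path + ".npz"
    file_exists = os.path.exists(saved_path)

    # np.load returns a dict-like object keyed by the keyword names used above.
    with np.load(saved_path) as npz:
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["targets"]

print(file_exists, loaded_inputs.shape, loaded_targets.shape)
```

The same pattern applies to the files saved above: `np.savez('AudioTrainData', ...)` writes `AudioTrainData.npz`, and the arrays are then available under the `inputs` and `targets` keys.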