# Audiobooks business case

The data used in this notebook comes from an audiobook app and relates to the audio versions of books. Every customer in the database has made at least one purchase. We want to build a machine learning model on the available data that predicts whether a customer will buy again from the audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money advertising to them. Focusing efforts solely on customers who are likely to convert again can yield significant savings. In addition, the model can help identify which metrics matter most for a customer to come back.

There are several variables: 
- Customer ID
- Book length overall (sum of the minute length of all purchases)
- Book length avg (average length in minutes of all purchases)
- Price paid overall (sum of all purchases)
- Price paid avg (average of all purchases)
- Review (a Boolean variable indicating whether the customer left a review)
- Review out of 10 (if the customer left a review, the score out of 10)
- Total minutes listened
- Completion (from 0 to 1)
- Support requests (number of support requests; everything from forgotten password to assistance for using the App)
- Last visited minus purchase date (in days)

These are the inputs (excluding customer ID, as it is completely arbitrary).

The target is a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. So we are predicting whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months seems like a reasonable window: customers who don't convert within that period have probably gone to a competitor or didn't like the audiobook format.

### Extract the data from the csv

In [17]:
import numpy as np
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

# The inputs are all columns in the csv except the first one [:,0]
# (the arbitrary customer IDs, which carry no useful information)
# and the last one [:,-1] (which holds the targets)

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]

### Balance the dataset

In [18]:
# I want to create a "balanced" dataset, so I will have to remove some input/target pairs.

# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# Walk through the targets and count the 0s.
# Once there are as many 0s as 1s, record the index of every further 0 in a list.
indices_to_remove = []
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# I delete all rows whose indices are in the list and save the results in two new variables,
# one that will contain the inputs, and one that will contain the targets.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
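
The loop above can also be written without an explicit counter. The following is a minimal sketch, assuming the same rule (keep every 1 and only the first `num_one_targets` zeros); the function name `balance_by_truncating_zeros` and the toy arrays are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def balance_by_truncating_zeros(inputs, targets):
    """Keep every 1-target row and only as many 0-target rows as there are 1s."""
    num_ones = int(np.sum(targets))
    # Positions of all 0 targets, in their original order.
    zero_indices = np.flatnonzero(targets == 0)
    # Every zero after the first `num_ones` zeros is surplus.
    indices_to_remove = zero_indices[num_ones:]
    return (np.delete(inputs, indices_to_remove, axis=0),
            np.delete(targets, indices_to_remove, axis=0))

# Toy data (hypothetical): 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)
balanced_inputs, balanced_targets = balance_by_truncating_zeros(toy_inputs, toy_targets)
```

Like the loop, this drops the surplus zeros from the end of the array, so on date-ordered data the removed rows are the most recent non-converting customers.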

### Standardize the inputs

In [19]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
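
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. To make that concrete, here is a rough NumPy-only sketch of the same idea. The helper name `standardize_columns`, the toy array, and the guard for constant columns are assumptions added for illustration.

```python
import numpy as np

def standardize_columns(x):
    """Subtract each column's mean and divide by its (population) standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)  # ddof=0, i.e. the population standard deviation
    # Assumption: leave constant columns unscaled to avoid dividing by zero.
    std[std == 0] = 1.0
    return (x - mean) / std

# Toy data (hypothetical): two features on very different scales.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled_toy = standardize_columns(toy)
```

After standardization both toy columns hold the same values, which is the point: features measured in minutes and features measured as a 0 to 1 ratio end up on a comparable scale.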

### Shuffle the data

In [20]:
# When the data was collected, it was arranged by date.
# Since I will be batching, the data must be as randomly spread out as possible,
# so it is necessary to shuffle the indices of the data.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
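
If the shuffle needs to be reproducible across runs, one option is a seeded generator. This is a sketch only: the seed value `42` and the toy arrays are arbitrary choices for illustration, and the original notebook uses the unseeded `np.random.shuffle`.

```python
import numpy as np

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(42)  # the seed value is an arbitrary assumption

# Toy data (hypothetical): row i of the inputs starts with 2 * target i.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 2, 3, 4])

# One permutation of the row indices, applied to both arrays,
# keeps each input row paired with its own target.
perm = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]
```

Indexing both arrays with the same permutation is what keeps the input/target pairs aligned; shuffling each array separately would break that pairing.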

### Split the dataset into train, validation, and test

In [21]:
# Count the total number of samples.
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want an 80-10-10 split between training, validation, and test.
# The counts must be integers, so the products are truncated with int().
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training.
# They are the first "train_samples_count" observations in the shuffled dataset.
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" already assigned.
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are the remaining observations.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test sets were
# taken from a shuffled dataset. Check whether they are balanced, too. Note that each time you rerun this code,
# you will get different values, because the data is shuffled randomly each time.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole notebook, the .npz files will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1785.0 3579 0.49874266554903607
215.0 447 0.4809843400447427
237.0 448 0.5290178571428571


The full dataset was balanced before shuffling, so each of the three subsets ends up with a roughly 50-50 distribution of targets 0 and 1 (about 50%, 48%, and 53% here).
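
The split arithmetic can be checked against the printed output. The sketch below wraps the same truncation rule in a hypothetical helper, `split_counts`, and uses the total of 4474 samples implied by the three counts above (3579 + 447 + 448).

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; test takes whatever is left over."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

# Totals taken from the printed output above.
total_samples = 3579 + 447 + 448
total_ones = 1785 + 215 + 237

counts = split_counts(total_samples)
overall_ratio = total_ones / total_samples
```

Because the test set absorbs the remainder of the truncation, it can be slightly larger than the validation set, as it is here. The ones across all three subsets add up to exactly half of the samples, which confirms that only the per-subset proportions vary with the shuffle.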

### Save the three datasets in *.npz

In [22]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
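
`np.savez` appends the `.npz` extension when the file name does not already have one, so the three files are read back as `Audiobooks_data_train.npz` and so on. The round trip can be sketched as follows, using a temporary directory and toy arrays (both made up for illustration) so that the real files are not touched; the casts to `float` and `int` are an assumption about what a later training step might want.

```python
import os
import tempfile
import numpy as np

# Toy arrays (hypothetical) standing in for the preprocessed data.
toy_inputs = np.array([[0.1, -0.2], [0.3, 0.4]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')
    # Saved under the keyword names 'inputs' and 'targets', as in the cell above.
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # np.savez added the '.npz' extension, so it is needed when loading.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)
```

The keys used when loading (`'inputs'`, `'targets'`) are the keyword names passed to `np.savez`.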