## Column Headings
A: User ID <br>
B: Book length_overall in minutes (The total length of all purchases for the user)<br>
C: Book length_average in minutes (The average length of books bought by the user)<br>
D: Price_overall (Total spent by the user)<br>
E: Price_average (The average amount spent by the user per purchase)<br>
F: Review (Yes (1) or No (0))<br>
G: Review (Rating out of 10. Defaults to 8.91 if the user didn't leave a review)<br>
H: Minutes listened (Across all the user's purchases)<br>
I: Completion (Total minutes listened / book length_overall)<br>
J: Support requests (How many times the user has made a request for support)<br>
K: Last visited minus purchase date (Difference between their first purchase and their most recent interaction. Bigger is better)<br>
L: Targets (Whether the user bought another book in the 6 months after this data was collected)<br><br>
This data has been collected from 2 years of engagement with the audio book app.

Our task is to create a model that predicts whether a customer will buy another book. That tells us whether we should focus on marketing the app to this user, helping to ensure they stay with the company.

## Imports

In [2]:
import numpy as np
from sklearn import preprocessing

## Preprocessing

1. Balance the dataset (Make sure the training data has an equal number of samples from each possible target)<br>
2. Divide the dataset into training, validation and testing <br>
3. Save the data in a tensor-friendly format (npz)

### Extract data from CSV file

In [3]:
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',') 

unscaled_inputs_all = raw_csv_data[:, 1:-1] # Takes every column except the first and last columns
targets_all = raw_csv_data[:, -1] # Takes the last column of the CSV which are the targets
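
To make the two slices above concrete, here is a minimal sketch on a hypothetical three-row stand-in for the CSV (ID, two feature columns, target). The values and the `toy_` names are made up for illustration and are not taken from `Audiobooks_data.csv`.

```python
import io
import numpy as np

# Hypothetical stand-in for the real file: ID, two features, target
toy_csv = io.StringIO("101,1620,19.73,0\n102,2160,5.33,1\n103,648,8.00,0")
toy_data = np.loadtxt(toy_csv, delimiter=',')

toy_inputs = toy_data[:, 1:-1]  # drop the first (ID) and last (target) columns
toy_targets = toy_data[:, -1]   # keep only the last column

print(toy_inputs.shape)  # (3, 2)
print(toy_targets)       # [0. 1. 0.]
```

The ID column is dropped because it carries no predictive information. `np.loadtxt` returns floats by default, so the targets come back as `0.0` and `1.0`.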

### Balance the dataset

In [4]:
num_one_targets = int(np.sum(targets_all)) # Count the number of targets which are 1. There are fewer 1's than 0's
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]): # Iterate over every target value
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets: # Once we have the same number of 1's and 0's, we can remove the other samples
            indices_to_remove.append(i)

# Remove the samples we don't need for training
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)
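
The loop above keeps the first `num_one_targets` zeros it meets and removes the rest. The same selection can be sketched without an explicit loop. This is an alternative formulation on hypothetical toy arrays (the `toy_` and `surplus_zero_indices` names are illustrative), not the notebook's own code.

```python
import numpy as np

# Hypothetical toy data: three 1's and six 0's
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.shape[0] * 2).reshape(-1, 2)

num_ones = int(np.sum(toy_targets))
zero_indices = np.flatnonzero(toy_targets == 0)  # positions of every 0 target
surplus_zero_indices = zero_indices[num_ones:]   # every 0 after the first num_ones zeros

balanced_inputs = np.delete(toy_inputs, surplus_zero_indices, axis=0)
balanced_targets = np.delete(toy_targets, surplus_zero_indices, axis=0)

print(balanced_targets)  # [0 1 0 0 1 1]
```

Like the loop, this drops zeros by position, so the surviving zeros are always the earliest ones in the file.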

### Standardize inputs

In [7]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors) # Standardizes all inputs in the array
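
With its default arguments, `preprocessing.scale` gives each column zero mean and unit variance. The calculation can be sketched in plain NumPy on hypothetical toy values. This illustrates the arithmetic only and is not sklearn's internal code.

```python
import numpy as np

# Hypothetical toy inputs: three samples, two features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

column_means = toy.mean(axis=0)
column_stds = toy.std(axis=0)  # population standard deviation (ddof=0)

manual_scaled = (toy - column_means) / column_stds

print(np.allclose(manual_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(manual_scaled.std(axis=0), 1.0))   # True
```

After scaling, a feature measured in thousands of minutes and one measured in single digits contribute on a comparable scale.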

### Shuffle data

In [10]:
shuffled_indices = np.arange(scaled_inputs.shape[0]) # shape[0] is the number of samples, so this creates the indices 0 to n - 1
np.random.shuffle(shuffled_indices) # Shuffle the indices in place

# Make sure the inputs and targets are in the same order so the data is still correct by reordering them with the shuffled indices
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
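
A small sketch of why indexing both arrays with the same shuffled indices keeps every input paired with its target. The toy arrays and the fixed seed are hypothetical and used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical fixed seed, for repeatability only

# Toy data built so that the row holding 10 * i belongs to target i
toy_inputs = np.array([[0.0], [10.0], [20.0], [30.0]])
toy_targets = np.array([0, 1, 2, 3])

permutation = rng.permutation(toy_inputs.shape[0])  # shuffled copy of 0..n-1
toy_shuffled_inputs = toy_inputs[permutation]
toy_shuffled_targets = toy_targets[permutation]

# Every shuffled input still sits next to its original target
print(np.array_equal(toy_shuffled_inputs[:, 0], 10 * toy_shuffled_targets))  # True
```

Shuffling the two arrays separately would break this pairing, which is why the indices are shuffled instead of the data.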

### Training, validation and testing

In [11]:
sample_count = shuffled_inputs.shape[0]

# 80% training, 10% validation, 10% testing
train_samples_count = int(0.8 * sample_count)
validation_samples_count = int(0.1 * sample_count)
test_samples_count = sample_count - train_samples_count - validation_samples_count

# Extract the samples into their sections
train_inputs = shuffled_inputs[: train_samples_count] # Take samples from the start until it has 80% of the samples
train_targets = shuffled_targets[: train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count : train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count : train_samples_count + validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count :]
test_targets = shuffled_targets[train_samples_count + validation_samples_count :]
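
The split sizes above round the training and validation counts down and give the remainder to the test set. A minimal sketch of that arithmetic, with an optional check that a split is still roughly balanced. `split_counts` and `positive_share` are hypothetical helper names, not part of the notebook.

```python
import numpy as np

def split_counts(sample_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(train_fraction * sample_count)
    validation = int(validation_fraction * sample_count)
    test = sample_count - train - validation
    return train, validation, test

def positive_share(targets):
    """Fraction of targets equal to 1."""
    return float(np.mean(targets == 1))

toy_targets = np.array([0, 1] * 10)  # hypothetical balanced targets, 20 samples
train_n, validation_n, test_n = split_counts(toy_targets.shape[0])

print(train_n, validation_n, test_n)          # 16 2 2
print(positive_share(toy_targets[:train_n]))  # 0.5
print(split_counts(1005))                     # (804, 100, 101)
```

Because the data is shuffled after balancing, each split should have a positive share close to 0.5, but not necessarily exactly 0.5.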

### Save as npz

In [None]:
np.savez('Audiobooks_data_train', inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets = test_targets)
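
`np.savez` appends the `.npz` extension when the file name does not already have it, and the keyword names (`inputs`, `targets`) become the keys used when loading. A minimal round-trip sketch, using a temporary directory and hypothetical toy arrays so that no real data files are touched:

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')
    np.savez(path, inputs=toy_inputs, targets=toy_targets)  # writes toy_data_train.npz

    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)  # integer class labels

print(loaded_inputs.shape, loaded_targets)  # (2, 2) [1 0]
```

Casting the targets to integers on load is a choice made for this sketch. The notebook's own targets are saved as floats because `np.loadtxt` reads them that way.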