# Business Case: Audiobook App Buyer Behaviour
## Data pre-processing


Here I analyse buyer data from an audiobook app to inform decisions about the most effective marketing strategy. I use a TensorFlow 2 machine learning approach to predict which customers are likely to make additional purchases. The insights from this analysis will help avoid spending marketing resources on customers who are unlikely to return to the app.

The data for this analysis were collected via the app and represent two years of engagement data. Whether the customer made another audiobook purchase via the app during the 6 months following that 2-year period was then recorded as the target.

Variables overview:  
Customer ID  
Book length_overall  
Book length_avg  
Price_overall  
Price_avg  
Review  
Review 10/10 (all empty values were replaced with the average review rating, 8.91, so any value above this number indicates an above-average opinion and any value below it a below-average opinion)  
Minutes_listened  
Completion (total minutes listened/book length_overall)  
Support Requests (measure of engagement)  
Last visited minus Purchase date (measure of engagement)  
Targets (whether the customer made another purchase during the 6-month period)

1) Preprocess the data  
   - Balance the dataset  
   - Divide the dataset into training, validation and test sets  
   - Save the data in a tensor friendly format (.npz)  

2) Create the Machine learning algorithm   
 

### Extract data (inputs and targets)

In [38]:
import os
os.chdir("C:\\Users\\christiaan.brink\\OneDrive\\Github\\BusinessCase_Audiobooks")
os.getcwd()

'C:\\Users\\christiaan.brink\\OneDrive\\Github\\BusinessCase_Audiobooks'

In [39]:
import numpy as np

# sklearn preprocessing library - for easier standardization of the data.
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

# The first column [:, 0] is an arbitrary customer ID that can be removed, as it contains no useful information for our model.
# The last column [:, -1] contains our targets (whether the customer made another purchase during our 6-month period).

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]

### Balance the dataset
There are many more "0" targets (customers who did not buy again) than "1" targets (return buyers). We need to balance the data by removing surplus "0" observations, so that similar numbers of "0" and "1" targets are used to build the model. Otherwise the model's solution will be biased towards predicting "0".

In [40]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset (where 0 or 1 targets are not over-represented), so we will have to remove some input/target pairs.
# Declare a variable that will remove some observations from the over-represented target:
indices_to_remove = []

# Count the number of targets that are 0. 
# The for loop below marks entries where the target is 0 once an equal number of 0s as 1s have been identified:
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create new variables for both model inputs and targets, while deleting indices marked for removal in the above loop:
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
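
The loop above can also be written without explicit iteration. The following is a minimal sketch rather than the method used in this notebook: the function name `balance_by_undersampling` and the toy arrays are made up for illustration, and it assumes the targets contain only 0s and 1s.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep every 1-target row and only the first equally many 0-target rows."""
    targets = np.asarray(targets)
    num_one_targets = int(np.sum(targets))

    # Positions of all 0-targets, in their original order.
    zero_indices = np.flatnonzero(targets == 0)

    # Every 0-target beyond the first `num_one_targets` is surplus.
    indices_to_remove = zero_indices[num_one_targets:]

    balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
    balanced_targets = np.delete(targets, indices_to_remove, axis=0)
    return balanced_inputs, balanced_targets

# Toy data (hypothetical): 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

balanced_inputs, balanced_targets = balance_by_undersampling(toy_inputs, toy_targets)
print(balanced_targets)  # [0 1 0 1]
```

Like the loop, this keeps the earliest 0-target rows and drops the later ones, so it should produce the same indices to remove.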


### Standardize the inputs

In [41]:
# Use sklearn's preprocessing capabilities to standardize the inputs:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
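
To make the standardization step concrete: with its default settings, `preprocessing.scale` centres each column to zero mean and scales it to unit variance. The sketch below reproduces that calculation with plain NumPy on made-up numbers. The helper name `standardize_columns` is hypothetical, and the guard for constant columns is my own addition for the sketch.

```python
import numpy as np

def standardize_columns(x):
    """Return (x - column mean) / column standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # assumption: leave constant columns unscaled
    return (x - mean) / std

# Toy inputs (hypothetical): two features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

scaled = standardize_columns(toy)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```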

### Shuffle the data

This randomizes the order of the data in order to prevent bias during batching.

In [42]:
# The data was arranged by date when it was collected. Since it will be batched, it needs to be shuffled first.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices above to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
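
If a repeatable shuffle is wanted, one option is a seeded random generator. This is a sketch of a possible variation, not what the cell above does: the seed value 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np

# Assumption: any fixed integer seed gives a repeatable permutation.
rng = np.random.default_rng(42)

# Toy data (hypothetical): row i of the inputs is [2*i, 2*i + 1].
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation of the row indices, applied to both arrays,
# keeps each input row paired with its own target.
shuffled_indices = rng.permutation(toy_inputs.shape[0])
shuffled_inputs = toy_inputs[shuffled_indices]
shuffled_targets = toy_targets[shuffled_indices]
```

Indexing both arrays with the same `shuffled_indices` is the important part, with or without a seed.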

### Split the dataset (training, test, validation)

The data needs to be split into separate datasets for training, validation and testing. An 80-10-10 split will be used.

In [43]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Determine split count
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create training data
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create validation data
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create testing data
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# Check that the various datasets are still balanced (50/50 for targets 0 and 1)
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1797.0 3579 0.5020955574182733
217.0 447 0.4854586129753915
223.0 448 0.49776785714285715
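
The split arithmetic can be checked in isolation. The sketch below wraps the same three calculations in a hypothetical helper, `split_counts`, and applies it to 4474, which is the sum of the three counts printed above.

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; test takes whatever remains."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))  # (3579, 447, 448)
```

Because `int()` truncates, the test set absorbs the rounding remainder, which is why it holds one more sample than the validation set here.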


### Save output files

In [44]:
# Save the three datasets in *.npz.
# Consistent file names (train, validation, test) make it easy to load the matching files when building the model.

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

# Note: because the data is shuffled randomly, the output files will differ each time this code is rerun.
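
For reference, a saved `.npz` file can be read back with `np.load`, using the keyword names passed to `np.savez` as keys. The sketch below does a round trip in a temporary directory with made-up arrays. The file name `toy_data_train` is hypothetical, and the casts to `float` and `int` are an assumption about what a later modelling step might want.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')

    # np.savez appends the ".npz" extension when the name does not have it.
    np.savez(path, inputs=np.zeros((4, 3)), targets=np.array([0, 1, 0, 1]))

    # The arrays are stored under the keyword names used when saving.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)

print(loaded_inputs.shape, loaded_targets)
```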