# Audiobooks business case

## Problem Overview
Imagine you’re working with data from an Audiobook App that tracks customer purchases and engagement. Every customer in the system has made at least one purchase, so we already know they’ve interacted with the app. The goal is to build a machine learning model that predicts whether a customer will make another purchase in the future.

Why does this matter? If we can accurately identify which customers are likely to return, we can focus our marketing efforts and advertising budgets on those who are most likely to convert again. This targeted approach helps save resources and maximize growth potential. In addition, the model can uncover the key factors that influence a customer’s likelihood to return, providing valuable insights into customer behavior.

The dataset contains the following variables:

- Customer ID: an identifier for each customer
- Book Length (overall and average): the total and average length of the audiobooks the customer purchased
- Price Paid (overall and average): how much the customer has spent
- Review: whether the customer left a review
- Review Rating: the rating given, if a review was left
- Total Minutes Listened
- Completion: how much of the audiobook they listened to
- Support Requests: how often they have contacted support
- Time Since Last Purchase: how many days since their last purchase

The task is to use the data from the last 2 years of customer activity to predict whether each customer will make another purchase in the next 6 months. This is a binary classification problem: each customer is classified as either one who will buy again (1) or one who won’t (0).

By building this model, we can gain insights into customer behavior, make smarter marketing decisions, and improve customer retention strategies.

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Because this is real-life data, it needs some preprocessing. The code for that is not complicated, but it is crucial to building a good model.

To see how it works, go through the code and its comments. The same steps should do the trick for most datasets organized as many input columns followed by one column containing the targets (supervised learning datasets). Keep in mind that a specific problem may require additional preprocessing.

Note that we have removed the header row, which contains the column names. We simply want the data.
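If a CSV still has its header row, `np.loadtxt` can skip it instead of requiring manual removal. The following is a minimal sketch on a made-up three-row file; the column names and values are hypothetical and are not taken from the Audiobooks data.

```python
import io
import numpy as np

# Hypothetical CSV text with a header line; names and values are invented for illustration.
csv_text = "id,minutes,price,target\n1,120.0,5.33,1\n2,0.0,10.13,0\n3,45.5,8.91,1\n"

# skiprows=1 tells np.loadtxt to ignore the first (header) line.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

inputs = data[:, 1:-1]   # drop the ID column (first) and the target column (last)
targets = data[:, -1]    # the last column holds the targets

print(data.shape)    # (3, 4)
print(inputs.shape)  # (3, 2)
print(targets)       # [1. 0. 1.]
```

The same `[:, 1:-1]` and `[:, -1]` slicing is what separates inputs from targets in the real file.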

### Extract the data from the csv

```python
import numpy as np

# We use the sklearn preprocessing module because it makes standardizing the data easy.
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('../data/Audiobooks_data.csv', delimiter=',')

# The inputs are all columns in the csv except the first one [:,0]
# (the arbitrary customer IDs, which carry no useful information)
# and the last one [:,-1] (the targets).
unscaled_inputs_all = raw_csv_data[:, 1:-1]
print(unscaled_inputs_all)

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:, -1]
print(targets_all)
```

```
[[2.160e+03 2.160e+03 1.013e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [1.404e+03 2.808e+03 6.660e+00 ... 0.000e+00 0.000e+00 1.820e+02]
 [3.240e+02 3.240e+02 1.013e+01 ... 0.000e+00 1.000e+00 3.340e+02]
 ...
 [1.080e+03 1.080e+03 6.550e+00 ... 0.000e+00 0.000e+00 2.900e+01]
 [2.160e+03 2.160e+03 6.140e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [1.620e+03 1.620e+03 5.330e+00 ... 0.000e+00 0.000e+00 9.000e+01]]
[1. 1. 1. ... 0. 0. 0.]
```


### Balance the dataset

```python
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))
print(num_one_targets)

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# This list collects the row indices to remove:
indices_to_remove = []

# Count the targets that are 0 as we go through the data.
# Once we have seen as many 0s as there are 1s, mark every further 0-target row for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all rows that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
```

```
2237
```
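The balancing loop can also be written without an explicit Python loop. The sketch below is one possible vectorized equivalent, assuming binary 0/1 targets; the helper name `balance_by_truncation` and the toy arrays are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep every 1-target row and only the first N 0-target rows, where N is the number of 1s."""
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets == 1))
    zero_indices = np.flatnonzero(targets == 0)   # positions of all 0-targets, in order
    indices_to_remove = zero_indices[num_ones:]   # every 0-target beyond the first num_ones
    balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
    balanced_targets = np.delete(targets, indices_to_remove, axis=0)
    return balanced_inputs, balanced_targets

# Toy data: 3 ones and 5 zeros, so the last 2 zeros should be dropped.
toy_inputs = np.arange(16).reshape(8, 2)
toy_targets = np.array([1, 0, 0, 1, 0, 0, 0, 1])

bal_inputs, bal_targets = balance_by_truncation(toy_inputs, toy_targets)
print(bal_targets)       # [1 0 0 1 0 1]
print(bal_inputs.shape)  # (6, 2)
```

Like the loop, this keeps the earliest 0-target rows and discards the later ones, so the original row order is preserved.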


### Standardize the inputs

```python
# This is the only place we use sklearn functionality. We take advantage of its preprocessing capabilities.
# It's a single line of code that standardizes the inputs, as we explained in one of the lectures.
# At the end of the business case, you can try to run the algorithm WITHOUT this line of code.
# The result will be interesting.
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
```
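To make the standardization step less of a black box, here is a NumPy-only sketch of what it amounts to, assuming the default behavior of `preprocessing.scale`: each column is shifted to mean 0 and divided by its population standard deviation. The helper `standardize_columns` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize_columns(x):
    """Return x with every column shifted to mean 0 and scaled to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)    # population standard deviation (ddof=0)
    std[std == 0] = 1.0    # leave constant columns unscaled instead of dividing by zero
    return (x - mean) / std

# Toy matrix: two columns on very different scales plus one constant column.
toy = np.array([[1.0, 100.0, 7.0],
                [2.0, 200.0, 7.0],
                [3.0, 300.0, 7.0],
                [4.0, 400.0, 7.0]])

standardized = standardize_columns(toy)
print(standardized.mean(axis=0))  # column means, all (numerically) zero
print(standardized.std(axis=0))   # column standard deviations
```

After this transformation the first two columns are directly comparable even though their raw scales differ by a factor of 100, which is the point of standardizing inputs such as minutes and prices.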

### Shuffle the data

```python
# When the data was collected it was actually arranged by date.
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
```
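The shuffle gives a different ordering on every run. If a repeatable ordering is wanted, one option is a seeded generator, as in this minimal sketch; the seed value 42 and the toy arrays are arbitrary, hypothetical choices. The key point is the same as in the cell above: one set of indices is applied to both inputs and targets so that each row keeps its own target.

```python
import numpy as np

# A seeded generator makes the shuffle repeatable; the seed value is arbitrary.
rng = np.random.default_rng(42)

# Toy data in which the target can be recomputed from the input, so alignment is easy to check.
toy_inputs = (np.arange(5) * 10).reshape(5, 1)
toy_targets = np.arange(5) % 2

# One permutation of the row indices, applied to BOTH arrays.
perm = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# Each shuffled row still carries its original target.
print(np.array_equal(shuffled_toy_targets, (shuffled_toy_inputs[:, 0] // 10) % 2))  # True
```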

### Split the dataset into train, validation, and test

```python
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want an 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training.
# In our shuffled dataset, they are the first "train_samples_count" observations.
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned.
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test sets were
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code,
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole notebook, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print("Train Targets:")
print(f"Sum: {np.sum(train_targets)}, Count: {train_samples_count}, Proportion: {np.sum(train_targets) / train_samples_count}")

print("\nValidation Targets:")
print(f"Sum: {np.sum(validation_targets)}, Count: {validation_samples_count}, Proportion: {np.sum(validation_targets) / validation_samples_count}")

print("\nTest Targets:")
print(f"Sum: {np.sum(test_targets)}, Count: {test_samples_count}, Proportion: {np.sum(test_targets) / test_samples_count}")
```

```
Train Targets:
Sum: 1788.0, Count: 3579, Proportion: 0.49958088851634536

Validation Targets:
Sum: 214.0, Count: 447, Proportion: 0.47874720357941836

Test Targets:
Sum: 235.0, Count: 448, Proportion: 0.5245535714285714
```
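The split arithmetic can be checked in isolation. The sketch below wraps it in two hypothetical helpers, `split_counts` and `three_way_split`, and reproduces the counts printed above for 4474 samples (3579 + 447 + 448). Using `np.split` for the slicing is an alternative to the explicit index ranges, not the approach used in the cell above.

```python
import numpy as np

def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes whatever remains."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

def three_way_split(array, train_count, validation_count):
    """Cut an already shuffled array into consecutive train, validation, and test parts."""
    return np.split(array, [train_count, train_count + validation_count])

# The balanced dataset above has 4474 samples in total.
print(split_counts(4474))  # (3579, 447, 448)

# The same helpers on a toy array of 20 elements.
toy = np.arange(20)
train_n, validation_n, test_n = split_counts(len(toy))
train, validation, test = three_way_split(toy, train_n, validation_n)
print(len(train), len(validation), len(test))  # 16 2 2
```

Because the test size is computed as the remainder, the three parts always add up to the full dataset even when the fractions do not divide the sample count evenly.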


### Save the three datasets in *.npz

```python
# Save the three datasets in *.npz.
# In the next lesson, you will see that it is extremely valuable to name them in such a coherent way!

np.savez('../data/Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('../data/Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('../data/Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
```
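For completeness, here is a minimal sketch of reading such a file back. It writes a toy archive to a temporary directory rather than touching the real files; the file name `toy_train` and the array contents are hypothetical. Note that `np.savez` appends the `.npz` extension when the given name lacks it, and the keyword names used when saving (`inputs`, `targets`) become the keys when loading.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.arange(6, dtype=float).reshape(3, 2)
toy_targets = np.array([1.0, 0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train")
    # np.savez adds the ".npz" extension, so this writes toy_train.npz.
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The keyword names used above are the keys of the loaded archive.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)

print(loaded_inputs.shape)  # (3, 2)
print(loaded_targets)       # [1 0 1]
```

Casting the targets to integers at load time is optional here; it is a common choice when the targets are later used as class labels.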