# Predicting returning audiobook customers

I am building a model that analyses data from customers of an audiobook app and classifies each customer as returning or not (a binary target). A "returning" customer is one who goes back to the app to purchase more products.

Methods used:
- Neural network (NN)
- SVM (as comparison)

The data are imported from the attached CSV file, which is structured as follows:

| customer id | average minutes spent per book | total minutes spent on app | average price of book | total spent on app | has left reviews? | review score | completion fraction | minutes listened | number of support requests | last visited time minus purchase date | target (dependent variable) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x | x | x | x | x |

### Methodology - preprocessing

- Balancing the data: the original data is not balanced in terms of target (outcome) values, so accuracy would be a misleading metric (a model that always predicts the majority class would already score well). I removed surplus majority-class rows to get a roughly equal number of each target value;
- Standardizing the values;
- Shuffling the values: this removes any order that might be present in the way the data was collected;
- Saving the data into CSV: this will be handy for any other analysis;
- Splitting into train, validation and test datasets;
- Saving into npz format.

This work is based on an exercise from https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/

In [9]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn import preprocessing

## Importing data

In [2]:
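# Note: the file appears to have no header row (np.loadtxt below parses every line as numbers),
# so the default header=0 treats the first record as column names; header=None would keep it.
ab_ds = pd.read_csv("Audiobooks_data.csv")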

In [10]:
raw_csv_data=np.loadtxt("Audiobooks_data.csv", delimiter=',')
unscaled_inputs_all= raw_csv_data[:,1:-1] #excluding id and target columns
targets_all=raw_csv_data[:,-1]

In [3]:
ab_ds.columns = ['id','mins_avg','mins_tot','price_avg','price_tot','review','rev_score','completion','mins_listened','support_req','Last_visited_minus_purchase_date','Target']

In [4]:
ab_ds.head()

Unnamed: 0,id,mins_avg,mins_tot,price_avg,price_tot,review,rev_score,completion,mins_listened,support_req,Last_visited_minus_purchase_date,Target
0,611,1404.0,2808,6.66,13.33,1,6.5,0.0,0.0,0,182,1
1,705,324.0,324,10.13,10.13,1,9.0,0.0,0.0,1,334,1
2,391,1620.0,1620,15.31,15.31,0,9.0,0.0,0.0,0,183,1
3,819,432.0,1296,7.11,21.33,1,9.0,0.0,0.0,0,0,1
4,138,2160.0,2160,10.13,10.13,1,9.0,0.0,0.0,0,5,1


In [6]:
ab_ds=ab_ds.assign(num_books=lambda x: x.mins_tot/x.mins_avg)
ab_ds.head()

Unnamed: 0,id,mins_avg,mins_tot,price_avg,price_tot,review,rev_score,completion,mins_listened,support_req,Last_visited_minus_purchase_date,Target,num_books
0,611,1404.0,2808,6.66,13.33,1,6.5,0.0,0.0,0,182,1,2.0
1,705,324.0,324,10.13,10.13,1,9.0,0.0,0.0,1,334,1,1.0
2,391,1620.0,1620,15.31,15.31,0,9.0,0.0,0.0,0,183,1,1.0
3,819,432.0,1296,7.11,21.33,1,9.0,0.0,0.0,0,0,1,3.0
4,138,2160.0,2160,10.13,10.13,1,9.0,0.0,0.0,0,5,1,1.0


In [7]:
sum(ab_ds.Target==1)

2236

In [8]:
sum(ab_ds.Target==0)

11847

The data is therefore imbalanced: only 2236 of the 14083 rows (about 16%) have target 1. We want an equal (or roughly equal) number of 0 and 1 cases before training.
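
With the counts above, the positive fraction is 2236 / (2236 + 11847) ≈ 0.159. As a minimal sketch, the same check can be done directly on a NumPy target vector; the `targets` array here is toy data standing in for the last column of the CSV, not the real dataset.

```python
import numpy as np

# Toy target vector (hypothetical values) standing in for the CSV's last column.
targets = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# np.bincount counts how often each non-negative integer label occurs.
counts = np.bincount(targets, minlength=2)

# Fraction of samples whose target is 1.
positive_fraction = counts[1] / counts.sum()

print(counts, positive_fraction)
```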

## Balancing the dataset

In [11]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

indices_to_remove = []

# Count the targets that are 0 as we go through the data.
# Once we have seen as many 0s as there are 1s, mark every further 0-target entry for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
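
The loop above keeps the first `num_one_targets` zero-target rows and drops the rest. A possible vectorized equivalent is sketched below on toy data; the arrays `inputs` and `targets` are hypothetical stand-ins, and `np.flatnonzero` is used here as an alternative to the explicit loop, not as the notebook's own method.

```python
import numpy as np

# Toy data (hypothetical): 10 samples, 2 features, 3 positive targets.
inputs = np.arange(20, dtype=float).reshape(10, 2)
targets = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], dtype=float)

num_ones = int(targets.sum())

# Positions of all zero-target rows, in their original order.
zero_positions = np.flatnonzero(targets == 0)

# Keep the first `num_ones` zeros; every later zero is marked for removal.
indices_to_remove = zero_positions[num_ones:]

balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)
```

Like the loop, this keeps the earliest zero-target rows, so if the file is ordered (for example by date) the retained negatives all come from the start of the file.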

## Standardize the inputs

In [12]:
scaled_inputs= preprocessing.scale(unscaled_inputs_equal_priors)
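
Standardizing means transforming each feature column to zero mean and unit variance, i.e. z = (x - mean) / std per column. The sketch below reproduces that computation by hand with NumPy on a toy matrix (hypothetical values), assuming the population standard deviation (`ddof=0`), which is what `preprocessing.scale` uses by default.

```python
import numpy as np

# Toy feature matrix (hypothetical values): 4 samples, 2 features on different scales.
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = x.mean(axis=0)
std = x.std(axis=0)

# Each column now has mean 0 and standard deviation 1.
standardized = (x - mean) / std
```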

## Shuffle the inputs and targets
Shuffling makes sure that there is no residual order in the data, in case it was collected by date or by any other parameter.

In [13]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices] # index with the shuffled indices to get the rows in the new order
shuffled_targets = targets_equal_priors[shuffled_indices]
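
Because `np.random.shuffle` draws from the global random state, each run produces a different split. If reproducibility matters, one option is a seeded generator, sketched here on toy arrays; the seed value 42 and the arrays are arbitrary illustrations, not part of the original analysis.

```python
import numpy as np

# Toy data (hypothetical): row i holds [2*i, 2*i + 1], and the target is i % 2.
inputs = np.arange(12, dtype=float).reshape(6, 2)
targets = np.array([0, 1, 0, 1, 0, 1], dtype=float)

# A seeded generator yields the same permutation on every run.
rng = np.random.default_rng(seed=42)
permutation = rng.permutation(inputs.shape[0])

# Indexing both arrays with the same permutation keeps inputs and targets aligned.
shuffled_inputs = inputs[permutation]
shuffled_targets = targets[permutation]
```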

In [22]:
np.savetxt('shuffled_inputs.csv', shuffled_inputs, delimiter=',')
np.savetxt('shuffled_targets.csv', shuffled_targets, delimiter=',')

## Split into train, validation and test datasets
The data is split 80-10-10 into training, validation and test sets.

In [15]:
samples_count =  shuffled_inputs.shape[0]
train_samples_count=int(samples_count*0.8)
val_samples_count=int(samples_count*0.1)
test_samples_count=samples_count-train_samples_count-val_samples_count

train_inputs=shuffled_inputs[:train_samples_count]
train_targets=shuffled_targets[:train_samples_count]

val_inputs=shuffled_inputs[train_samples_count:train_samples_count+val_samples_count]
val_targets=shuffled_targets[train_samples_count:train_samples_count+val_samples_count]

test_inputs=shuffled_inputs[train_samples_count+val_samples_count:]
test_targets=shuffled_targets[train_samples_count+val_samples_count:]
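
The same contiguous 80-10-10 slicing can also be written with `np.split`, which takes the boundary indices and returns the pieces in order. This is a sketch on a toy array (hypothetical data), offered as a more compact alternative to the manual slices.

```python
import numpy as np

# Toy data (hypothetical): 10 samples, 2 features.
data = np.arange(20).reshape(10, 2)

n = data.shape[0]
train_end = int(n * 0.8)            # end of the training slice
val_end = train_end + int(n * 0.1)  # end of the validation slice

# np.split cuts at the given row indices; the remainder becomes the test set.
train, val, test = np.split(data, [train_end, val_end])

print(train.shape, val.shape, test.shape)
```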

In [17]:
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(val_targets), val_samples_count, np.sum(val_targets) / val_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1795.0 3579 0.5015367421067337
216.0 447 0.48322147651006714
226.0 448 0.5044642857142857


## Save in npz format

In [18]:
# Save the three datasets in *.npz.
# Consistent file names make them easy to load again when training the model.

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=val_inputs, targets=val_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
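
For completeness, here is a minimal sketch of how such a file can be read back with `np.load`, using the same `inputs` and `targets` keys. It writes a small toy archive to a temporary directory so that it runs on its own; the file name `example_train` and the array values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy arrays (hypothetical values) saved under the same keys as above.
inputs = np.array([[0.1, -1.2], [0.5, 0.3]])
targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_train')
    np.savez(path, inputs=inputs, targets=targets)  # np.savez appends '.npz'

    # The archive behaves like a dictionary keyed by the names given to np.savez.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)
```

Casting the targets to an integer type on load is a common choice when they are later used as class labels.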