# Audiobooks Predictions:

### Problem:   
* Each customer in the database has made at least one purchase
* Using the given data, create a machine learning algorithm that predicts whether a customer will buy again from the Audiobook company
* We want to focus on customers who are likely to come back and buy again
* The model can also help identify which metrics matter most for a customer coming back
* The inputs:    

1)	ID: customer Id; no information in here. We will skip this in the algorithm  
2)	Book length (mins)_overall: sum of the lengths of all purchased books  
3)	Book length(mins)_avg: sum of the lengths of all purchased books / number of purchased books. So if somebody bought a single book, the average length and the overall length will be the same.  
4)	Price overall: sum of the prices of all purchased books  
5)	Price_avg: sum of the prices of all purchased books / number of purchased books  
6)	Review: it's a Boolean. It shows if a customer left a review. Our assumption is people who leave reviews are more likely to come back again  
7)	Review 10/10: the review score on a scale of 1 to 10. 

      a. First preprocessing trick: only customers who left a review have a value here, and most customers leave no review. We substitute all the missing values with the average review score, which is 8.91. For our ML algorithm, 8.91 then represents the status quo: a score above it indicates above-average feelings. 
  
8)	Minutes listened: this is a measure of engagement  
9) Completion: total minutes listened / book length overall  
10) Support requests: the total number of support requests a person has opened. A high number may mean that the customer got fed up with the platform and abandoned it.   
11) Last visited minus purchase date: the bigger the difference, the better. If the value is 0, the customer has never accessed what they bought  
12)	Targets: 1 if the customer converted (purchased again) and 0 if not. The inputs come from a two-year period; the targets are measured over the additional 6 months that follow  
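
The substitution described in item 7a can be sketched as follows. This is only an illustration on made-up scores, assuming that a missing review is stored as NaN; the variable names are hypothetical and not part of the original notebook.

```python
import numpy as np

# Made-up review scores; np.nan marks customers who left no review (assumption).
review_scores = np.array([10.0, np.nan, 8.0, np.nan, 9.0])

# Average over the customers who did leave a review.
average_review = np.nanmean(review_scores)

# Substitute every missing value with that average.
filled_scores = np.where(np.isnan(review_scores), average_review, review_scores)

print(average_review)
print(filled_scores)
```

After the substitution, a customer without a review carries the "status quo" value, so the model only sees a signal from scores that differ from the average.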



## 1) Preprocess the data
* a) balance the dataset
* b) create 3 datasets: training, validation and test
* c) save the newly created sets in tensor friendly format = npz

The following code should work for most datasets organized this way: many input columns, followed by one column containing the targets (supervised learning datasets). 

In [1]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter =',')
unscaled_inputs_all = raw_csv_data[:,1:-1]
#All rows, and column 1 to last column (excluded). we exclude the first column (0), because we will not use ID column
#We exclude the last column (-1) because we will assign targets separately.
targets_all = raw_csv_data[:,-1]

## a) Balance the dataset

In [2]:
#we have unbalanced dataset, since most of the customers didn't convert in the given time
#we need to balance it with 50% convert and 50% not convert
#we need to count how many targets are 1 (meaning customer did convert)
num_one_targets = int(np.sum(targets_all)) #2237 targets are 1 out of 14,084 (targets_all.shape[0] gives the number of rows)

#set a counter for targets that are 0
zero_targets_counter = 0
#we want to create a balanced dataset, so we will have to remove some indices:
indices_to_remove=[]

#count the number of targets that are 0.
#once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]): #targets_all.shape[0]: returns the number of rows in the dataset
    if targets_all[i] == 0:
        zero_targets_counter += 1 #ztc=ztc+1 
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
        
#create two new variables, one that will contain the inputs, one that will contain the targets
#we delete all the indices that we marked "to remove" in the loop above
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
#np.delete (input array, slice, axis (0=rows))
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
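
The loop above keeps the first zeros it meets and marks the rest for removal. The same selection can be sketched without an explicit loop. This is an alternative formulation on toy data, not the notebook's code; `balance_indices` and the toy arrays are hypothetical names.

```python
import numpy as np

def balance_indices(targets):
    """Return the row indices to drop so that 0s do not outnumber 1s."""
    num_one_targets = int(np.sum(targets))
    zero_indices = np.flatnonzero(targets == 0)
    # Keep the first num_one_targets zeros, mark the remaining zeros for removal.
    return zero_indices[num_one_targets:]

# Toy targets: two 1s and five 0s.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

to_remove = balance_indices(toy_targets)
balanced_inputs = np.delete(toy_inputs, to_remove, axis=0)
balanced_targets = np.delete(toy_targets, to_remove, axis=0)

print(to_remove)
print(balanced_targets)
```

Like the loop, this keeps the earliest zeros in the file, so on data ordered by date the retained non-converting customers all come from the start of the period.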


#### Standardize the inputs: scaling
Inputs with different orders of magnitude create problems for training, so we will standardize the inputs. We will use sklearn for this

In [3]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
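
Standardizing means bringing every column to zero mean and unit variance. What that amounts to can be sketched with plain NumPy. The numbers below are made up and the variable names are hypothetical; the sketch only illustrates the arithmetic and does not replace the sklearn call.

```python
import numpy as np

# Made-up inputs on very different scales (for example minutes and prices).
toy_inputs = np.array([[1620.0, 19.73],
                       [2160.0, 5.33],
                       [648.0, 8.00]])

# Column-wise standardization: subtract the mean, divide by the standard deviation.
column_means = toy_inputs.mean(axis=0)
column_stds = toy_inputs.std(axis=0)
toy_scaled = (toy_inputs - column_means) / column_stds

print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```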

#### Shuffle the data:
When the data was collected it was actually arranged by date. We will shuffle the indices of the data, so the data is not arranged in any way when we feed it.  

Since we will be batching, we want the data to be as randomly spread out as possible

In [4]:
#np.arange gives the indices 0 to 4473, because scaled_inputs.shape[0] = 4474
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

#use the shuffled indices to shuffle the inputs and targets
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
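
Indexing inputs and targets with the same shuffled indices keeps every row paired with its own target. The sketch below checks this on toy data. It uses a seeded generator so the shuffle is reproducible; the seed and the variable names are assumptions made for the illustration.

```python
import numpy as np

# Toy data: the target of row i is i, so the pairing is easy to verify.
toy_inputs = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
toy_targets = np.arange(5)

rng = np.random.default_rng(seed=42)       # the seed value is an arbitrary choice
permutation = rng.permutation(toy_inputs.shape[0])

toy_shuffled_inputs = toy_inputs[permutation]
toy_shuffled_targets = toy_targets[permutation]

# Each row still sits next to its own target after shuffling.
still_aligned = np.array_equal(toy_shuffled_inputs[:, 0] // 2, toy_shuffled_targets)
print(still_aligned)
```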

## b) Split the dataset: train, validation, and test

In [5]:
#count the total number of samples:
samples_count = shuffled_inputs.shape[0]

#count the samples in each subset, we want 80-10-10 distribution: train, validation, test
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
#the test dataset contains all remaining data
test_samples_count = samples_count - train_samples_count - validation_samples_count

#create variables that record the inputs and targets for training
#in our shuffled dataset, they are the first 'train_samples_count' observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

#create variables that record the inputs and targets for the validation
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]

#create variables that record the inputs and targets for the test
test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

#print the number of targets that are 1s, the total number of samples, and the proportion for training, validation and test:
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1815.0 3579 0.5071248952221291
225.0 447 0.5033557046979866
197.0 448 0.43973214285714285
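
The subset sizes printed above follow directly from the 80-10-10 arithmetic. A small sketch, with `split_counts` as a hypothetical helper name, reproduces them for the 4474 balanced samples:

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return the sizes of the train, validation and test subsets."""
    train_count = int(train_fraction * samples_count)
    validation_count = int(validation_fraction * samples_count)
    # The test subset takes whatever remains after rounding down.
    test_count = samples_count - train_count - validation_count
    return train_count, validation_count, test_count

counts = split_counts(4474)
print(counts)
```

The split is a plain slice of the shuffled data and is not stratified, which is why the share of 1s differs between the subsets (about 51%, 50% and 44% in the run above).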


## c) Save the three datasets in .npz

In [6]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets= test_targets)
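
To see what ends up in such a file, the sketch below saves a toy dataset the same way and loads it back. It writes to a temporary directory and uses a hypothetical file name, so it does not touch the Audiobooks files.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.1, 0.3]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.savez appends the .npz extension when the file name lacks it.
    np.savez(os.path.join(tmp_dir, 'toy_data_train'), inputs=toy_inputs, targets=toy_targets)

    # The keyword names used in np.savez become the keys of the loaded file.
    with np.load(os.path.join(tmp_dir, 'toy_data_train.npz')) as npz:
        stored_names = sorted(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(stored_names)
```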

## 2) Create a class that handles batching

Whenever you want to batch the data, you need appropriate methods. TensorFlow (tf.train.batch()) and sklearn have some batching methods built in, but some problems need custom code. You can use the following methods with any machine learning framework (directly or after a little fine-tuning)

In [7]:
#create a class that will do the batching for the algorithm
#this code is extremely reusable. You should just change the Audiobooks_data everywhere in the code
class Audiobooks_Data_Reader():
    def __init__(self, dataset, batch_size=None):
        npz = np.load("Audiobooks_data_{0}.npz".format(dataset))
        #calling this class as x('train', 5) loads 'Audiobooks_data_train.npz' with a batch size of 5
        #np.float and np.int were removed in NumPy 1.24, so we use explicit dtypes instead
        self.inputs, self.targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)
        #two variables that take the values of the inputs and targets.
        if batch_size is None:
            self.batch_size = self.inputs.shape[0]
        else:
            self.batch_size = batch_size
        self.curr_batch = 0
        #number of full batches for the given batch size; leftover samples that do not fill a batch are skipped
        self.batch_count = self.inputs.shape[0] // self.batch_size
        #if the batch size is None (we call the class without entering a batch size), we are either validating or testing, so we take the data in a single batch
    
    #this method loads the next batch
    def __next__(self):
        #once all batches are served, reset the counter and stop, so the next loop starts from the first batch
        if self.curr_batch >= self.batch_count:
            self.curr_batch = 0
            raise StopIteration()
        #slice the dataset in batches; each call returns the batch that follows the previous one
        batch_slice = slice(self.curr_batch * self.batch_size, (self.curr_batch + 1) * self.batch_size)
        inputs_batch = self.inputs[batch_slice]
        target_batch = self.targets[batch_slice]
        self.curr_batch += 1
        #one-hot encoding for the targets. this is useful for classification tasks with more than one class
        classes_num = 2
        targets_one_hot = np.zeros((target_batch.shape[0], classes_num))
        targets_one_hot[range(target_batch.shape[0]), target_batch] = 1
        #return the inputs batch and the one-hot encoded targets
        return inputs_batch, targets_one_hot
    
    #a method needed for iterating over the batches, as we will put them in a loop
    #this following code tells python that the class we're defining is iterable 
    #an iterator in python is a class with a method __next__ that defines exactly how to iterate through its objects
    def __iter__(self):
        return self
    

#### Batching explanation:
We created the class Audiobooks_Data_Reader. There are three methods in the class: init, next and iter.  
This class is an iterator. An iterator is a class that implements the next and iter methods, so it can be used in loops.  
The init method loads the data from npz.  
The next method loads the next batch from the npz.  
The iter tells python that the class is iterable.  
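
The iterator pattern on its own can be sketched with a toy class that follows the same init, next and iter structure but returns batch numbers instead of data. `ToyBatches` is a hypothetical name used only for this illustration.

```python
class ToyBatches:
    """Toy iterator: returns the batch numbers 0 .. batch_count - 1, then resets."""

    def __init__(self, batch_count):
        self.batch_count = batch_count
        self.curr_batch = 0

    def __iter__(self):
        # Returning self tells Python that the object is its own iterator.
        return self

    def __next__(self):
        if self.curr_batch >= self.batch_count:
            # Reset before stopping, so a later loop starts from the first batch again.
            self.curr_batch = 0
            raise StopIteration()
        batch_number = self.curr_batch
        self.curr_batch += 1
        return batch_number

batches = ToyBatches(3)
first_pass = [b for b in batches]
second_pass = [b for b in batches]
print(first_pass, second_pass)
```

Resetting the counter before raising StopIteration is what allows the same object to be looped over once per epoch.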

##### INIT METHOD: 
This method has two arguments: dataset and batch size. The self parameter is the Python convention for the instance the method is called on. It makes init an instance method, which works on a particular object of the class, as opposed to a static method, which is called on the class itself and receives no instance. If we don't declare a batch size, the class will assume it has to load all the data as a single batch.
###### example: 
 x = Audiobooks_Data_Reader('train', 5)  
This loads the train dataset and then proceeds with the operations, taking batches of five samples at a time.
##### batch size = how many samples in each batch
##### batch number = how many batches in total
y = Audiobooks_Data_Reader('validation')  
This loads the validation dataset, with the whole data contained in a single batch.  
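
How batch size and batch number relate can be sketched with the integer division the class uses. `batch_layout` is a hypothetical helper; the sample counts are the train and validation sizes printed in the split step.

```python
def batch_layout(samples_count, batch_size=None):
    """Return the batch size, the number of full batches and the number of skipped samples."""
    if batch_size is None:
        # No batch size given: the whole dataset is one batch.
        batch_size = samples_count
    batch_count = samples_count // batch_size
    skipped_samples = samples_count - batch_count * batch_size
    return batch_size, batch_count, skipped_samples

train_layout = batch_layout(3579, 500)
validation_layout = batch_layout(447)
print(train_layout)
print(validation_layout)
```

Because of the floor division, samples that do not fill a complete batch are never served. With 3579 training samples and a batch size of 500, that is 79 samples per epoch.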

##### NEXT FUNCTION:
The next method slices the next batch out of the dataset and returns it for the current iteration. This is also the place where we one-hot encode the targets: the targets are 0s and 1s, and we turn each of them into a vector with one position per class. The number of classes is set by the variable classes_num. Here there are two classes: 0s and 1s. If you have a classification problem with more than two classes (bread, yoghurt, muffin), then you should change the classes_num variable
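
The one-hot step can be tried in isolation. The sketch below wraps the same indexing trick in a hypothetical helper, `one_hot`, and applies it to a two-class and a three-class toy batch.

```python
import numpy as np

def one_hot(target_batch, classes_num):
    """Turn integer class labels into one-hot rows."""
    targets_one_hot = np.zeros((target_batch.shape[0], classes_num))
    # For row i, set the column given by that row's label to 1.
    targets_one_hot[np.arange(target_batch.shape[0]), target_batch] = 1
    return targets_one_hot

binary = one_hot(np.array([0, 1, 1]), classes_num=2)
three_way = one_hot(np.array([2, 0, 1]), classes_num=3)
print(binary)
print(three_way)
```

The labels must be integers, because they are used as column indices.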

##### The differences we should make for other datasets:
1) adjust the names of the NPZ files  
2) adjust the number of classes





## 3) Create the machine learning algorithm
We will create the algorithm by essentially copy-pasting the MNIST code and adjusting it where needed. TensorFlow code is extremely reusable. We will put the whole code in one cell, so we can simply rerun the cell to train a new model. This works because the whole algorithm is contained in the cell and it calls the tf.reset_default_graph() function. Note that tf.placeholder, tf.reset_default_graph and tf.InteractiveSession belong to the TensorFlow 1.x API; under TensorFlow 2.x they are available through tf.compat.v1.
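
Before the TensorFlow cell, it may help to see the same computation in plain NumPy: two ReLU hidden layers, a linear output layer, and a softmax cross-entropy loss. This is only a sketch of the forward pass with random weights standing in for trained ones; it does no training, and all names in it are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax_cross_entropy(logits, one_hot_targets):
    """Per-sample cross-entropy between softmax(logits) and one-hot targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_targets * log_probs).sum(axis=1)

rng = np.random.default_rng(0)
input_size, hidden_layer_size, output_size = 10, 200, 2

# Random weights stand in for trained ones (assumption for the illustration).
w1 = rng.normal(scale=0.1, size=(input_size, hidden_layer_size))
b1 = np.zeros(hidden_layer_size)
w2 = rng.normal(scale=0.1, size=(hidden_layer_size, hidden_layer_size))
b2 = np.zeros(hidden_layer_size)
w3 = rng.normal(scale=0.1, size=(hidden_layer_size, output_size))
b3 = np.zeros(output_size)

toy_batch = rng.normal(size=(4, input_size))
toy_targets = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

hidden_1 = relu(toy_batch @ w1 + b1)
hidden_2 = relu(hidden_1 @ w2 + b2)
logits = hidden_2 @ w3 + b3            # no activation on the output layer
loss_per_sample = softmax_cross_entropy(logits, toy_targets)
toy_mean_loss = loss_per_sample.mean()
print(logits.shape, toy_mean_loss)
```

With equal logits for both classes, the loss of a sample is log(2), about 0.693, which is a useful reference point when reading the training losses.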

In [12]:
import tensorflow as tf
#the hyperparameters
input_size = 10 #number of columns we use
output_size = 2 # as we one-hot encoded the targets
hidden_layer_size = 200


tf.reset_default_graph()

#the placeholders:
inputs = tf.placeholder(tf.float32, [None, input_size])
targets = tf.placeholder(tf.int32, [None, output_size])

#outline the model with 2 hidden layers:
weights_1 = tf.get_variable('weights_1', [input_size, hidden_layer_size])
biases_1 = tf.get_variable('biases_1', [hidden_layer_size])
outputs_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)

weights_2 = tf.get_variable("weights_2", [hidden_layer_size, hidden_layer_size])
biases_2 = tf.get_variable("biases_2", [hidden_layer_size])
outputs_2 = tf.nn.relu(tf.matmul(outputs_1, weights_2) + biases_2)

weights_3 = tf.get_variable("weights_3", [hidden_layer_size, output_size])
biases_3 = tf.get_variable("biases_3", [output_size])

#no activation output
outputs = tf.matmul(outputs_2, weights_3) + biases_3

#softmax cross entropy loss with logits:
loss = tf.nn.softmax_cross_entropy_with_logits(logits = outputs, labels = targets)
mean_loss = tf.reduce_mean(loss)


#get a 0 or 1 for every input indicating whether it output the correct answer
out_equals_target = tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1))
accuracy = tf.reduce_mean(tf.cast(out_equals_target, tf.float32))

#optimize with Adam:
optimize = tf.train.AdamOptimizer(learning_rate = 0.003).minimize(mean_loss)

#create a session:
sess= tf.InteractiveSession()

#initialize the variables:
initializer = tf.global_variables_initializer()
sess.run(initializer)

#other hyperparameters
batch_size=500
max_epochs =50
prev_validation_loss = 9999999.

#load the training and validation data, using the class we created
train_data = Audiobooks_Data_Reader('train', batch_size)
validation_data = Audiobooks_Data_Reader('validation')

#optimize the algorithm: create for loop for epochs:
for epoch_counter in range(max_epochs):
    curr_epoch_loss=0.
    for input_batch, target_batch in train_data: #iterate over the training data
        _, batch_loss = sess.run([optimize,mean_loss],
                feed_dict = {inputs: input_batch, targets: target_batch})
        curr_epoch_loss +=batch_loss #record the batch loss into the current loss
    curr_epoch_loss /=train_data.batch_count #find the mean curr_epoch_loss
    validation_loss = 0.
    validation_accuracy = 0.
    for input_batch, target_batch in validation_data: #use the same logic of the code to forward propagate the validation set 
        validation_loss, validation_accuracy = sess.run([mean_loss, accuracy],
            feed_dict = {inputs: input_batch, targets: target_batch})
    print('Epoch '+str(epoch_counter+1)+
          '. Training loss: '+'{0:.3f}'.format(curr_epoch_loss)+
          '. Validation loss: '+'{0:.3f}'.format(validation_loss)+
          '. Validation accuracy: '+'{0:.2f}'.format(validation_accuracy * 100.)+'%')
    if validation_loss > prev_validation_loss:
        break
    prev_validation_loss = validation_loss
        
print('End of training')
    
    


Epoch 1. Training loss: 0.533. Validation loss: 0.440. Validation accuracy: 76.96%
Epoch 2. Training loss: 0.394. Validation loss: 0.401. Validation accuracy: 77.63%
Epoch 3. Training loss: 0.356. Validation loss: 0.378. Validation accuracy: 78.97%
Epoch 4. Training loss: 0.341. Validation loss: 0.369. Validation accuracy: 79.87%
Epoch 5. Training loss: 0.332. Validation loss: 0.367. Validation accuracy: 79.64%
Epoch 6. Training loss: 0.326. Validation loss: 0.364. Validation accuracy: 80.09%
Epoch 7. Training loss: 0.320. Validation loss: 0.362. Validation accuracy: 80.09%
Epoch 8. Training loss: 0.316. Validation loss: 0.360. Validation accuracy: 79.42%
Epoch 9. Training loss: 0.314. Validation loss: 0.361. Validation accuracy: 79.42%
End of training
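
Training stopped after epoch 9 because that is the first epoch whose validation loss is higher than the one before it. The stopping rule can be replayed on the validation losses printed above; `epochs_until_stop` is a hypothetical helper name.

```python
# Validation losses copied from the training log above.
validation_losses = [0.440, 0.401, 0.378, 0.369, 0.367, 0.364, 0.362, 0.360, 0.361]

def epochs_until_stop(losses):
    """Return the epoch at which training stops: the first one where the loss goes up."""
    prev_validation_loss = float('inf')
    for epoch_number, validation_loss in enumerate(losses, start=1):
        if validation_loss > prev_validation_loss:
            return epoch_number
        prev_validation_loss = validation_loss
    return len(losses)

stopped_at = epochs_until_stop(validation_losses)
print(stopped_at)
```

This rule reacts to a single increase, however small. Here the increase from 0.360 to 0.361 was enough to end training.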


## 4) Test the model


In [13]:
test_data = Audiobooks_Data_Reader('test')

#forward propagate the test set, just as we did for the validation set
for input_batch, target_batch in test_data:
    test_accuracy = sess.run([accuracy],
        feed_dict = {inputs: input_batch, targets: target_batch})
        
test_accuracy_percent = test_accuracy[0] *100.
#test_accuracy[0]: because it is a list with one value only
print('test accuracy:' + '{0:.2f}'.format(test_accuracy_percent) + '%')

test accuracy:82.59%
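
The accuracy reported here is the share of samples where the largest output matches the position of the 1 in the one-hot target. On made-up outputs, the same calculation in NumPy looks like this (the arrays and names are hypothetical):

```python
import numpy as np

# Made-up model outputs (logits) and one-hot targets for four samples.
toy_outputs = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.0]])
toy_targets = np.array([[1, 0], [0, 1], [1, 0], [1, 0]])

predicted_classes = toy_outputs.argmax(axis=1)   # index of the largest output per row
actual_classes = toy_targets.argmax(axis=1)      # index of the 1 in each one-hot row

toy_accuracy = np.mean(predicted_classes == actual_classes)
print('toy accuracy: ' + '{0:.2f}'.format(toy_accuracy * 100.) + '%')
```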


The dataset is small, so we don't expect much more than 85% accuracy. Test accuracy with other hyperparameter settings:

hidden_layer_size = 50; test accuracy = 81.70  
hidden_layer_size = 80; test accuracy = 84.38  
hidden_layer_size = 95; test accuracy = 84.60  
hidden_layer_size = 80; sigmoid both output; test accuracy = 81.47 (more epochs)  
hidden_layer_size = 80; learning rate = 0.002; test accuracy = 84.15  
hidden_layer_size = 80; learning rate = 0.002; batch size = 1000; test accuracy = 81.70  
hidden_layer_size = 100; learning rate = 0.002; batch size = 100; test accuracy = 84.38  
