# Audiobooks Dataset

The objective is to determine if a person will purchase audiobooks again, based on the inputs.

Since this is a supervised learning example, the last column contains the targets, which indicate whether the person purchased again in the following 6 months (1 = returned, 0 = did not return).

In [47]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn import preprocessing
import pandas as pd

In the column holding the review score (from 0 to 10), a value was only recorded for customers who left a review.
To solve the issue of empty cells, the entries with no review were filled with the average of all reviews, 8.91.
This removes the empty cells and gives the algorithm a neutral baseline for comparing review scores.
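The mean-filling step described above can be sketched as follows. The scores are made up for illustration, and `np.nan` is assumed to mark a missing review; the actual file was already filled before it was loaded here.

```python
import numpy as np

# Hypothetical review scores (0-10); np.nan marks customers who left no review.
review_scores = np.array([9.0, np.nan, 6.5, np.nan, 10.0])

# Mean of the reviews that were actually given (NaN entries are ignored).
mean_score = np.nanmean(review_scores)

# Replace every missing score with that mean.
filled_scores = np.where(np.isnan(review_scores), mean_score, review_scores)

print(mean_score)
print(filled_scores)
```

Filling with the mean keeps the overall average of the column unchanged, which is why it works as a neutral placeholder.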

In [48]:
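# The CSV has no header row, so pandas uses the first record as column names in the preview below;
# passing header=None to read_csv would keep that record as data instead.
data = pd.read_csv('db/Audiobooks_data.csv')
data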

Unnamed: 0,873,2160,2160.1,10.13,10.13.1,0,8.91,0.1,0.2,0.3,0.4,1
0,611,1404.0,2808,6.66,13.33,1,6.50,0.00,0.0,0,182,1
1,705,324.0,324,10.13,10.13,1,9.00,0.00,0.0,1,334,1
2,391,1620.0,1620,15.31,15.31,0,9.00,0.00,0.0,0,183,1
3,819,432.0,1296,7.11,21.33,1,9.00,0.00,0.0,0,0,1
4,138,2160.0,2160,10.13,10.13,1,9.00,0.00,0.0,0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...
14078,27398,2160.0,2160,7.99,7.99,0,8.91,0.00,0.0,0,54,0
14079,28220,1620.0,1620,5.33,5.33,1,9.00,0.61,0.0,0,4,0
14080,28671,1080.0,1080,6.55,6.55,1,6.00,0.29,0.0,0,29,0
14081,31134,2160.0,2160,6.14,6.14,0,8.91,0.00,0.0,0,0,0


## Preprocessing

### Loading the dataset

In [49]:
#loading the data as an array

raw_data = np.loadtxt('db/Audiobooks_data.csv',delimiter=',')

unscaled_inputs_all = raw_data[:,1:-1] # all rows, all columns except the ID (first column) and the target (last column)

targets_all = raw_data[:,-1] # all rows, only the last column (the target)
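To make the slicing concrete, here is a small sketch that parses two of the records shown in the preview above (without the pandas index) from an in-memory string instead of the real file. The variable names are illustrative only.

```python
import io
import numpy as np

# Two records copied from the preview: ID, 10 input columns, target.
csv_text = (
    "611,1404.0,2808,6.66,13.33,1,6.50,0.00,0.0,0,182,1\n"
    "705,324.0,324,10.13,10.13,1,9.00,0.00,0.0,1,334,1\n"
)

demo_raw = np.loadtxt(io.StringIO(csv_text), delimiter=',')

demo_inputs = demo_raw[:, 1:-1]   # drop the ID (first) and the target (last)
demo_targets = demo_raw[:, -1]    # keep only the target column

print(demo_raw.shape, demo_inputs.shape, demo_targets)
```

Each record has 12 values, so removing the ID and the target leaves the 10 input columns the model uses later.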

### Balancing the data

It is important to balance the classes. Otherwise the accuracy may look good even though the model is not making useful predictions, simply because one class is far more common than the other.

Well-balanced data gives a more honest picture of the model's accuracy, especially for a classification algorithm.

    In this dataset, far more customers did not return than returned, so we balance the data before proceeding, keeping as many 0's as there are 1's.

In [50]:
# total number of '1' targets; np.sum returns a float here, so we cast it to an integer
num_one_targets = int(np.sum(targets_all))
zero_target_counter = 0
indices_to_remove = []

# count the 0's and mark every 0 beyond the number of 1's for removal

for i in range(targets_all.shape[0]):
    if targets_all[i] ==0:
        zero_target_counter += 1
        if zero_target_counter > num_one_targets:
            indices_to_remove.append(i)

# remove the excess 0's from the data
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0) # np.delete(array, indices to remove, axis)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
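The same balancing rule (keep every 1, keep only as many 0's as there are 1's, in original order) can also be written without an explicit loop. This is a sketch, not the notebook's code; `balance_binary` and the demo arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep all 1-targets and only the first len(ones) 0-targets, preserving row order."""
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets == 1))
    zero_positions = np.flatnonzero(targets == 0)
    # Zeros beyond the first `num_ones` are the excess rows to drop.
    indices_to_remove = zero_positions[num_ones:]
    return (np.delete(inputs, indices_to_remove, axis=0),
            np.delete(targets, indices_to_remove, axis=0))

# Tiny hypothetical example: 2 ones and 4 zeros.
demo_inputs = np.arange(12).reshape(6, 2)
demo_targets = np.array([0, 1, 0, 0, 1, 0])

balanced_inputs, balanced_targets = balance_binary(demo_inputs, demo_targets)
print(balanced_targets)
print(balanced_inputs)
```

As in the loop version, the zeros that get dropped are the ones that appear latest in the file, so any ordering in the source data affects which rows survive.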

### Standardizing the inputs

In [51]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
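For reference, standardizing means subtracting each column's mean and dividing by its standard deviation. The sketch below does this by hand with NumPy on a made-up feature matrix; it illustrates the computation and is not a replacement for the `preprocessing.scale` call.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales.
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0],
                     [4.0, 400.0]])

column_means = features.mean(axis=0)
column_stds = features.std(axis=0)   # population standard deviation (ddof=0)

standardized = (features - column_means) / column_stds

print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

After this transformation every column has mean 0 and standard deviation 1, so no feature dominates just because of its units.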

### Shuffle the data

In [52]:
#shuffling the indices and then getting the related data

shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
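The shuffle above gives a different order on every run. If a repeatable split is wanted, one option is a seeded generator, sketched here on toy arrays. The seed value 42 and the `demo_*` names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for repeatability

demo_inputs = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
demo_targets = np.arange(5)                 # target i belongs to row i

# One permutation of the row indices, applied to both arrays,
# keeps every input row paired with its own target.
order = rng.permutation(demo_inputs.shape[0])
shuffled_demo_inputs = demo_inputs[order]
shuffled_demo_targets = demo_targets[order]

print(order)
print(shuffled_demo_inputs[:, 0] // 2, shuffled_demo_targets)
```

Indexing both arrays with the same permutation is the key point: shuffling them separately would break the input/target pairing.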

### Splitting the dataset

In [53]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count) # 80% of the data
validation_samples_count = int(0.1*samples_count) # 10% of the data
test_samples_count = samples_count-train_samples_count-validation_samples_count # ~10% of the data

#first 80%
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

#next 10%
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

#remaining ~10%
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

#printing the quantities
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1792.0 3579 0.5006985191394244
226.0 447 0.5055928411633109
219.0 448 0.4888392857142857
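The three counts printed above add up to 4474 balanced samples. The short sketch below reproduces the size arithmetic; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes whatever is left."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

total_samples = 3579 + 447 + 448   # counts taken from the printed output
print(total_samples, split_sizes(total_samples))
```

Because `int()` truncates, the train and validation sizes are rounded down and the test set absorbs the remainder, which is why it has 448 samples rather than 447.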


### Saving the dataset

Storing the shuffled and split data.

In [54]:
np.savez('Audiobooks_data_train',inputs=train_inputs,targets=train_targets)
np.savez('Audiobooks_data_validation',inputs=validation_inputs,targets=validation_targets)
np.savez('Audiobooks_data_test',inputs=test_inputs,targets=test_targets)

#the keywords 'inputs' and 'targets' are arbitrary names; you may use any keywords you prefer, as long as the same names are used when loading
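A small round trip shows how the keyword names chosen in `np.savez` become the keys used when loading. This sketch writes to a temporary directory with made-up arrays, so it does not touch the real dataset files.

```python
import os
import tempfile
import numpy as np

demo_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
demo_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split')   # np.savez appends '.npz'
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    # The file behaves like a dictionary keyed by the keyword names used above.
    with np.load(path + '.npz') as npz:
        stored_keys = sorted(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(stored_keys)
print(loaded_inputs, loaded_targets)
```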

## Model Outlining

10 inputs on the input layer

2 hidden layers with 50 units each

2 outputs on the output layer, one per class (1 = customer returned, 0 = customer did not return)


### Loading the data

    Train

In [55]:
npz = np.load('Audiobooks_data_train.npz')

train_inputs = npz['inputs'].astype(float)
train_targets = npz['targets'].astype(float)

    Validation

In [56]:
npz = np.load('Audiobooks_data_validation.npz')

validation_inputs = npz['inputs'].astype(float)
validation_targets = npz['targets'].astype(float)

    Test

In [57]:
npz = np.load('Audiobooks_data_test.npz')

test_inputs = npz['inputs'].astype(float)
test_targets = npz['targets'].astype(float)

### Model

Preprocessing usually takes the most effort, as it needs to be designed specifically for each dataset.

The modeling itself can usually be applied with few changes across a variety of datasets.

In [58]:
input_size = 10 #10 input columns (features); kept for reference, Keras infers the input shape itself
output_size = 2 #two possible output classes (0 or 1)
hidden_layer = 50

model = tf.keras.Sequential([
    #the input layer does not need to be declared: Keras infers the input shape from the data on the first call
    #hidden layers
    tf.keras.layers.Dense(hidden_layer, activation='relu'), #'relu' = activation function
    tf.keras.layers.Dense(hidden_layer, activation='relu'),
    #output layer
    tf.keras.layers.Dense(output_size, activation='softmax') #'softmax' = activation function -> transforms the values into probabilities
])
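To get a feel for the size of this architecture, the following NumPy sketch counts the trainable parameters of a 10-50-50-2 dense network and runs one forward pass with random weights. It assumes the layer sizes from the cell above; the weights are random placeholders, not the trained Keras model.

```python
import numpy as np

layer_sizes = [10, 50, 50, 2]   # inputs, two hidden layers, two output classes
shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# Each dense layer has (inputs x units) weights plus one bias per unit.
param_counts = [n_in * n_out + n_out for n_in, n_out in shapes]
total_params = sum(param_counts)

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=shape) for shape in shapes]
biases = [np.zeros(n_out) for _, n_out in shapes]

def forward(batch):
    """ReLU on the hidden layers, softmax on the output layer."""
    activations = batch
    for w, b in zip(weights[:-1], biases[:-1]):
        activations = np.maximum(activations @ w + b, 0.0)
    logits = activations @ weights[-1] + biases[-1]
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=1, keepdims=True)

demo_batch = rng.normal(size=(4, 10))   # 4 made-up samples with 10 features
probabilities = forward(demo_batch)

print(param_counts, total_params)
print(probabilities.shape, probabilities.sum(axis=1))
```

Each output row holds two probabilities that sum to 1, one for class 0 and one for class 1.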

### Optimizer and loss function

Loss functions:

-   binary_crossentropy -> used for binary (0/1) targets, typically with a single sigmoid output

-   categorical_crossentropy -> expects the targets to be one-hot encoded already

-   sparse_categorical_crossentropy -> expects integer class labels (0, 1, ...), so the targets do not need to be one-hot encoded
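The two categorical variants compute the same quantity and differ only in the label format they accept. The NumPy sketch below illustrates this with made-up predicted probabilities; it mirrors the definitions and is not the Keras implementation.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 2 classes.
probabilities = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.6, 0.4]])

integer_labels = np.array([0, 1, 1])          # format for the sparse variant
one_hot_labels = np.eye(2)[integer_labels]    # format for the non-sparse variant

# Sparse form: pick the probability of the true class by index.
sparse_loss = -np.log(probabilities[np.arange(3), integer_labels]).mean()

# One-hot form: the label vector masks out everything but the true class.
categorical_loss = -(one_hot_labels * np.log(probabilities)).sum(axis=1).mean()

print(sparse_loss, categorical_loss)
```

Since the targets in this notebook are stored as plain 0/1 labels, the sparse variant can be used without converting them first.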

In [59]:
# model.compile(optimizer, loss)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', # ADAM = adaptive moment estimation
              metrics=['accuracy']) # include metrics that we wish to be calculated during the training and testing

### Fitting the model

In [60]:
batch_size = 100
num_epochs = 100
# early stopping is a callback available in the Keras library (check the documentation for more options)
# by default it monitors the validation loss and interrupts training once it stops improving, to prevent overfitting
# this way, we do not need to know the precise number of epochs in advance
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2) #patience=2 stops training after 2 consecutive epochs without improvement in validation loss

#too many epochs can lead to overfitting

model.fit(train_inputs,train_targets,
          batch_size=batch_size,
          epochs=num_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs,validation_targets),
          verbose=2)


Epoch 1/100
36/36 - 1s - loss: 0.5937 - accuracy: 0.7368 - val_loss: 0.4565 - val_accuracy: 0.8613 - 652ms/epoch - 18ms/step
Epoch 2/100
36/36 - 0s - loss: 0.3833 - accuracy: 0.8754 - val_loss: 0.3443 - val_accuracy: 0.8680 - 53ms/epoch - 1ms/step
Epoch 3/100
36/36 - 0s - loss: 0.3201 - accuracy: 0.8826 - val_loss: 0.3226 - val_accuracy: 0.8770 - 52ms/epoch - 1ms/step
Epoch 4/100
36/36 - 0s - loss: 0.2996 - accuracy: 0.8919 - val_loss: 0.3052 - val_accuracy: 0.8792 - 53ms/epoch - 1ms/step
Epoch 5/100
36/36 - 0s - loss: 0.2853 - accuracy: 0.8941 - val_loss: 0.2936 - val_accuracy: 0.8814 - 60ms/epoch - 2ms/step
Epoch 6/100
36/36 - 0s - loss: 0.2750 - accuracy: 0.8986 - val_loss: 0.2862 - val_accuracy: 0.8904 - 64ms/epoch - 2ms/step
Epoch 7/100
36/36 - 0s - loss: 0.2681 - accuracy: 0.9014 - val_loss: 0.2846 - val_accuracy: 0.8949 - 58ms/epoch - 2ms/step
Epoch 8/100
36/36 - 0s - loss: 0.2619 - accuracy: 0.9014 - val_loss: 0.2868 - val_accuracy: 0.8904 - 53ms/epoch - 1ms/step
Epoch 9/100
36

<keras.callbacks.History at 0x17718400250>

    Validation Accuracy = 89.71% (5.8 sec)
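The patience rule can be illustrated with a few lines of plain Python. This is a simplified sketch of patience-based stopping, not the Keras callback itself; `stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)   # never triggered: all epochs were run

# Hypothetical validation losses: improvement stalls after epoch 3.
demo_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]

print(stopping_epoch(demo_losses, patience=2))
print(stopping_epoch(demo_losses, patience=1))
```

With a patience of 2 the run stops at epoch 5 and never reaches the better loss at epoch 6, which shows the trade-off: a low patience saves time but can stop too early on a noisy validation curve.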

### Testing the model

In [61]:
model.evaluate(test_inputs,test_targets,
               verbose=2)

14/14 - 0s - loss: 0.2403 - accuracy: 0.9085 - 28ms/epoch - 2ms/step


[0.24031133949756622, 0.9084821343421936]

    Test Accuracy: 90.85%

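Test accuracy is usually lower than or about equal to the validation accuracy. Here it came out slightly higher (90.85% vs. 89.71%), which can happen by chance with only 448 test samples.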
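With only a few hundred samples per split, small differences between validation and test accuracy are within sampling noise. As a rough check, the sketch below computes a binomial standard error for the accuracies reported above, under the simplifying assumption that each prediction is an independent trial. The helper name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, sample_count):
    """Binomial standard error of an accuracy estimated on `sample_count` samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / sample_count)

test_se = accuracy_standard_error(0.9085, 448)         # test accuracy, test set size
validation_se = accuracy_standard_error(0.8971, 447)   # validation accuracy, validation set size
gap = 0.9085 - 0.8971

print(round(test_se, 4), round(validation_se, 4), round(gap, 4))
```

The gap of about 1.1 percentage points is smaller than one standard error of either estimate, so it should not be read as the model doing better on the test set.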