# Audiobooks business case

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Because we are working with real-life data, we need to preprocess it before training. The code below is not difficult, but it is crucial to building a good model.

Go through the code to see how each step works. The same approach should work for most supervised learning datasets organized as several input columns followed by one column containing the targets. Keep in mind that a specific problem may require additional preprocessing.

Note that the header row, which contains the column names, has been removed from the CSV file. We want only the numeric data.

This code does not include comments; it is the same as the code in the lesson. Please refer to the other file if you want the commented version.

### Extract the data from the CSV

In [2]:
import numpy as np
from sklearn import preprocessing
import tensorflow as tf

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

unscaled_inputs_all = raw_csv_data[:,1:-1]

targets_all = raw_csv_data[:,-1]

### Balance the dataset

In [3]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
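
The balancing loop can also be written without explicit iteration. The following is a minimal sketch on a made-up toy array (the values and the `_demo` names are illustrative, not taken from the Audiobooks data). Like the loop, it keeps every row with target 1 and only the first equally many rows with target 0.

```python
import numpy as np

# Toy data (hypothetical): 6 samples, 2 features, targets with 2 ones and 4 zeros.
inputs_demo = np.arange(12).reshape(6, 2)
targets_demo = np.array([0, 1, 0, 0, 1, 0])

num_ones = int(np.sum(targets_demo))

# Positions of all zero targets, in their original order.
zero_positions = np.flatnonzero(targets_demo == 0)

# Keep the first `num_ones` zeros and mark the remaining ones for removal.
indices_to_remove_demo = zero_positions[num_ones:]

balanced_inputs_demo = np.delete(inputs_demo, indices_to_remove_demo, axis=0)
balanced_targets_demo = np.delete(targets_demo, indices_to_remove_demo, axis=0)

print(balanced_targets_demo)  # [0 1 0 1]
```

Both versions drop the zeros that appear latest in the file, so if the file is ordered in some meaningful way, shuffling before balancing may be worth considering.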

### Standardize the inputs

In [4]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
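
`preprocessing.scale` subtracts each column's mean and divides by its standard deviation. A minimal NumPy sketch of the same computation, on made-up numbers, shows why those two statistics are worth keeping: a new observation has to be scaled with the mean and standard deviation of the original data, because standardizing a single row on its own collapses it to zeros. All names and values here are illustrative assumptions.

```python
import numpy as np

# Toy data (hypothetical): 3 samples, 2 features.
data_demo = np.array([[1.0, 10.0],
                      [2.0, 20.0],
                      [3.0, 60.0]])

# Column-wise statistics; np.std defaults to the population std (ddof=0).
col_mean = data_demo.mean(axis=0)
col_std = data_demo.std(axis=0)

scaled_demo = (data_demo - col_mean) / col_std

# A new observation is scaled with the SAME statistics.
new_row = np.array([[3.0, 10.0]])
new_row_scaled = (new_row - col_mean) / col_std

# Centering the row by itself: each column's mean is the value itself,
# so the result is all zeros and carries no information.
new_row_alone = new_row - new_row.mean(axis=0)

print(new_row_scaled)
print(new_row_alone)  # [[0. 0.]]
```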

### Shuffle the data

In [5]:

shuffled_indices = np.arange(scaled_inputs.shape[0])

np.random.shuffle(shuffled_indices)
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
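
Because the shuffle is unseeded, each run produces a different ordering, so the split statistics printed further down will vary between runs. If reproducibility matters, one option is a seeded generator. The sketch below uses toy arrays and an arbitrary seed (both are assumptions for illustration) and shows that inputs and targets stay aligned when indexed with the same permutation.

```python
import numpy as np

# Arbitrary seed, chosen only so the example is repeatable.
rng = np.random.default_rng(seed=42)

# Toy data (hypothetical): row i of the inputs is [2*i, 2*i + 1], target i is i.
inputs_demo = np.arange(10).reshape(5, 2)
targets_demo = np.arange(5)

# One permutation of the row indices, applied to both arrays.
perm = rng.permutation(inputs_demo.shape[0])
shuffled_inputs_demo = inputs_demo[perm]
shuffled_targets_demo = targets_demo[perm]

# Each input row still sits next to its own target.
print(np.array_equal(shuffled_inputs_demo[:, 0], 2 * shuffled_targets_demo))  # True
```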




### Split the dataset into train, validation, and test

In [6]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1778.0 3579 0.49678681195864766
219.0 447 0.4899328859060403
240.0 448 0.5357142857142857
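
The three slices above follow an 80/10/10 split in which the test set receives whatever remains after rounding down, which is why the counts add up to the full sample count (3579 + 447 + 448 = 4474). As a compact alternative, the same cut points can be passed to `np.split`. This is a sketch on a toy array, not a replacement for the code above.

```python
import numpy as np

# Toy data (hypothetical): 20 already-shuffled samples.
data_demo = np.arange(20)

n = data_demo.shape[0]
n_train = int(0.8 * n)
n_val = int(0.1 * n)

# np.split cuts the array at the given indices; the last piece is the remainder.
train_demo, val_demo, test_demo = np.split(data_demo, [n_train, n_train + n_val])

print(len(train_demo), len(val_demo), len(test_demo))  # 16 2 2
```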


### Save the three datasets in *.npz

In [7]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

In [8]:
npz = np.load('Audiobooks_data_train.npz')
train_inputs = npz['inputs'].astype(np.float32)
train_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')
validation_inputs = npz['inputs'].astype(np.float32)
validation_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs = npz['inputs'].astype(np.float32)
test_targets = npz['targets'].astype(np.int64)
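
Note that `np.savez` is called with file names that have no extension, while `np.load` reads files ending in `.npz`: NumPy appends the extension when it is missing. The round trip can be sketched on toy arrays in a temporary directory (the file name `demo_split` and the values are assumptions for illustration).

```python
import os
import tempfile

import numpy as np

# Toy data (hypothetical).
inputs_demo = np.array([[0.5, -1.2], [1.5, 0.3]])
targets_demo = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path_without_ext = os.path.join(tmp_dir, "demo_split")

    # The keyword names become the keys of the archive.
    np.savez(path_without_ext, inputs=inputs_demo, targets=targets_demo)

    # np.savez appended ".npz" to the file name.
    with np.load(path_without_ext + ".npz") as npz_demo:
        keys = sorted(npz_demo.files)
        # Cast to the dtypes used for training: float inputs, integer class labels.
        loaded_inputs = npz_demo["inputs"].astype(np.float32)
        loaded_targets = npz_demo["targets"].astype(np.int64)

print(keys)            # ['inputs', 'targets']
print(loaded_targets)  # [0 1]
```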

## Learning

In [9]:
input_size = 10
output_size = 2
hidden_layer_size = 50

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size = 100
max_epochs = 100
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs,
          train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Epoch 1/100
36/36 - 2s - loss: 0.5518 - accuracy: 0.7290 - val_loss: 0.4881 - val_accuracy: 0.7204 - 2s/epoch - 48ms/step
Epoch 2/100
36/36 - 0s - loss: 0.4483 - accuracy: 0.7692 - val_loss: 0.4295 - val_accuracy: 0.7517 - 120ms/epoch - 3ms/step
Epoch 3/100
36/36 - 0s - loss: 0.4044 - accuracy: 0.7865 - val_loss: 0.3992 - val_accuracy: 0.7718 - 121ms/epoch - 3ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3785 - accuracy: 0.8050 - val_loss: 0.3919 - val_accuracy: 0.7696 - 122ms/epoch - 3ms/step
Epoch 5/100
36/36 - 0s - loss: 0.3653 - accuracy: 0.8069 - val_loss: 0.3922 - val_accuracy: 0.7606 - 122ms/epoch - 3ms/step
Epoch 6/100
36/36 - 0s - loss: 0.3546 - accuracy: 0.8122 - val_loss: 0.3741 - val_accuracy: 0.7852 - 121ms/epoch - 3ms/step
Epoch 7/100
36/36 - 0s - loss: 0.3475 - accuracy: 0.8189 - val_loss: 0.3934 - val_accuracy: 0.7606 - 124ms/epoch - 3ms/step
Epoch 8/100
36/36 - 0s - loss: 0.3444 - accuracy: 0.8128 - val_loss: 0.3749 - val_accuracy: 0.7673 - 124ms/epoch - 3ms/step


<keras.callbacks.History at 0x19b77acd370>
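
Training stopped after epoch 8 of 100 because of the early-stopping callback with `patience=2`. The rule can be sketched in plain Python and replayed on the `val_loss` values from the log above. This is a simplified illustration, not Keras's implementation; it assumes the callback monitors `val_loss` and counts any decrease as an improvement, which are its defaults.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values copied from the training log above.
logged_val_losses = [0.4881, 0.4295, 0.3992, 0.3919, 0.3922, 0.3741, 0.3934, 0.3749]

print(stopping_epoch(logged_val_losses, patience=2))  # 8
```

The lowest validation loss was reached at epoch 6; epochs 7 and 8 did not improve on it, so the patience of 2 was exhausted. The model keeps the weights from the last epoch unless `restore_best_weights=True` is passed to `EarlyStopping`.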

## Testing the model

In [10]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



## Predicting

In [12]:
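new_inputs = np.array([[1620, 810, 5.87, 5.87, 0, 5, 0, 1, 0, 253]])
# preprocessing.scale on a single row returns all zeros (each column's mean is the value itself),
# so reuse the mean and standard deviation of the data the training inputs were scaled with.
# Assumes no column of unscaled_inputs_equal_priors is constant (std > 0).
scaled = (new_inputs - unscaled_inputs_equal_priors.mean(axis=0)) / unscaled_inputs_equal_priors.std(axis=0)
scaled.shape
#num = np.argmax(model.predict(scaled), axis=-1)
#print(num)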

(1, 10)
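
The commented-out lines in the cell above turn the model's output into a class label. Since the last layer is a softmax over two units, `model.predict` returns one row of two probabilities per sample, and `np.argmax` along the last axis picks the more likely class. A minimal sketch with made-up probabilities (not actual model output):

```python
import numpy as np

# Hypothetical softmax outputs for three samples: [P(class 0), P(class 1)].
probabilities_demo = np.array([[0.90, 0.10],
                               [0.35, 0.65],
                               [0.50, 0.50]])

# Index of the largest probability in each row; ties resolve to the first index.
predicted_classes = np.argmax(probabilities_demo, axis=-1)

print(predicted_classes)  # [0 1 0]
```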