# Audiobooks business case

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor-friendly format (e.g. `*.npz`)

Since we are dealing with real-life data, we need to preprocess it first. The code below is not complicated, but it is crucial for building a good model.

Go through the code to see how each step works. The same approach should work for most supervised learning datasets laid out this way: several input columns followed by a single column containing the targets. Keep in mind that a specific problem may require additional preprocessing.

Note that we have removed the header row, which contains the column names, from the CSV file. We only want the data.

The code here is the same as in the lesson, with minimal comments. Refer to the other file for the fully commented version.

In [8]:
import numpy as np
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

In [3]:
raw_data = np.loadtxt('Audiobooks-data.csv', delimiter=',')

In [4]:
unscaled_inputs = raw_data[:,1:-1]
unscaled_targets = raw_data[:,-1]
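
The slicing above keeps every column except the first and the last as inputs and uses the last column as targets. The first column is skipped, presumably because it is an identifier, although the notebook does not say so. The sketch below shows the same slicing on a toy CSV; the column names and values are made up for illustration. It also uses the `skiprows` argument of `np.loadtxt`, which is one way to handle a file that still has its header row.

```python
import io
import numpy as np

# Toy CSV standing in for the real file: an ID column, two inputs, one target.
# The column names and values are invented for this example.
csv_text = "id,minutes,price,target\n101,120.0,5.5,0\n102,300.0,8.0,1\n"

# skiprows=1 skips the header row instead of deleting it from the file.
raw = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

inputs = raw[:, 1:-1]   # drop the first column (ID) and the last (target)
targets = raw[:, -1]    # the last column holds the targets

print(inputs.shape, targets)
```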


In [5]:
no_of_ones = int(np.sum(unscaled_targets))
no_of_zeros = 0
indices_to_remove = []
for i in range(unscaled_targets.shape[0]):
    if unscaled_targets[i] == 0:
        no_of_zeros += 1
        if no_of_zeros > no_of_ones:
            indices_to_remove.append(i)

In [6]:
unscaled_inputs = np.delete(unscaled_inputs, indices_to_remove, axis=0)
unscaled_targets = np.delete(unscaled_targets, indices_to_remove, axis=0)
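
The loop above keeps every row with target 1 and only the first `no_of_ones` rows with target 0, so both classes end up the same size. A vectorized sketch of the same idea follows. The function name `balance_binary` is hypothetical, and it assumes, like the loop, that targets are 0 or 1 and that zeros are the majority class.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Drop surplus zero-target rows so zeros do not outnumber ones."""
    zero_positions = np.flatnonzero(targets == 0)   # row indices of all zeros
    n_ones = int(np.sum(targets == 1))
    surplus = zero_positions[n_ones:]               # zeros beyond the first n_ones
    return (np.delete(inputs, surplus, axis=0),
            np.delete(targets, surplus, axis=0))

# Toy example: 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)

balanced_inputs, balanced_targets = balance_binary(toy_inputs, toy_targets)
print(balanced_targets)
```

If the zeros are already the minority, `surplus` is empty and nothing is removed, which matches the behavior of the loop.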

In [9]:
scale = StandardScaler()
scale.fit(unscaled_inputs)
scaled_inputs = scale.transform(unscaled_inputs)
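
`StandardScaler` subtracts each column's mean and divides by its standard deviation. The NumPy-only sketch below reproduces that arithmetic with two hypothetical helpers, `fit_standardizer` and `apply_standardizer`. It also illustrates a common variation that differs from the cell above: computing the statistics on the training rows only and reusing them for the other rows, so that no information from validation or test data reaches the model.

```python
import numpy as np

def fit_standardizer(x):
    """Return per-column mean and (population) standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(x, mean, std):
    """Standardize x with statistics computed elsewhere."""
    return (x - mean) / std

# Toy data: the first three rows play the role of the training set.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
train_rows, other_rows = toy[:3], toy[3:]

mean, std = fit_standardizer(train_rows)
scaled_train = apply_standardizer(train_rows, mean, std)
scaled_other = apply_standardizer(other_rows, mean, std)

print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
```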

In [15]:
shuffled_scaled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_scaled_indices)
shuffled_inputs = scaled_inputs[shuffled_scaled_indices]
shuffled_targets = unscaled_targets[shuffled_scaled_indices]
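
Because `np.random.shuffle` uses an unseeded global generator, every run produces a different order and therefore different splits. If reproducibility matters, one option is a seeded generator, as in the sketch below. The helper name `shuffle_together` and the seed value are arbitrary choices for illustration. The key point is that inputs and targets are indexed with the same permutation, so each row keeps its own target.

```python
import numpy as np

def shuffle_together(inputs, targets, seed=42):
    """Shuffle inputs and targets with one shared, seeded permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(inputs.shape[0])
    return inputs[order], targets[order]

# Toy data built so that row i starts with 2*i and has target i.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)

toy_shuffled_inputs, toy_shuffled_targets = shuffle_together(toy_inputs, toy_targets)

# Each row is still paired with its original target.
print(np.array_equal(toy_shuffled_inputs[:, 0] // 2, toy_shuffled_targets))
```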

In [18]:
sample_count = shuffled_inputs.shape[0]
train_data_count = int(0.8*sample_count)
validation_data_count = int(0.1*sample_count)
test_data_count = sample_count - (train_data_count + validation_data_count)

train_sample_input_data = shuffled_inputs[:train_data_count]
train_sample_target_data = shuffled_targets[:train_data_count]

validation_sample_inputs_data = shuffled_inputs[train_data_count:train_data_count + validation_data_count]
validation_sample_target_data = shuffled_targets[train_data_count:train_data_count + validation_data_count]

test_sample_inputs_data = shuffled_inputs[train_data_count + validation_data_count:]
test_sample_target_data = shuffled_targets[train_data_count + validation_data_count:]

# For each split: number of positive targets, sample count, share of positives (rounded).
print(np.sum(train_sample_target_data), train_data_count, round(np.sum(train_sample_target_data) / train_data_count, 4))
print(np.sum(validation_sample_target_data), validation_data_count, round(np.sum(validation_sample_target_data) / validation_data_count, 4))
print(np.sum(test_sample_target_data), test_data_count, round(np.sum(test_sample_target_data) / test_data_count, 4))

1804.0 3579 0.5041
203.0 447 0.4541
230.0 448 0.5134


In [19]:
np.savez('Audiobook_train_data', inputs=train_sample_input_data, targets=train_sample_target_data)
np.savez('Audiobook_validation_data', inputs=validation_sample_inputs_data, targets=validation_sample_target_data)
np.savez('Audiobook_test_data', inputs=test_sample_inputs_data, targets=test_sample_target_data)
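
As a sanity check on the 80/10/10 split and the saved files, the sketch below splits a toy array, saves one part with `np.savez`, and loads it back. The helper names `split_slices` and `save_and_reload` are hypothetical. The casts to `float32` and `int64` are an assumption about what a later training step might want, not something this notebook requires.

```python
import os
import tempfile
import numpy as np

def split_slices(sample_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) slices; test takes the remainder."""
    train_end = int(train_frac * sample_count)
    validation_end = train_end + int(validation_frac * sample_count)
    return (slice(0, train_end),
            slice(train_end, validation_end),
            slice(validation_end, sample_count))

def save_and_reload(path, inputs, targets):
    """Save to <path>.npz and read the arrays back."""
    np.savez(path, inputs=inputs, targets=targets)   # NumPy appends ".npz" to the name
    with np.load(path + ".npz") as npz:
        return npz["inputs"].astype(np.float32), npz["targets"].astype(np.int64)

# Toy data standing in for the shuffled arrays.
rng = np.random.default_rng(0)
toy_inputs = rng.normal(size=(20, 3))
toy_targets = rng.integers(0, 2, size=20).astype(float)

train, validation, test = split_slices(len(toy_inputs))

with tempfile.TemporaryDirectory() as tmp:
    reloaded_inputs, reloaded_targets = save_and_reload(
        os.path.join(tmp, "toy_train_data"), toy_inputs[train], toy_targets[train])

print(reloaded_inputs.shape, reloaded_targets.dtype)
```

Applied to 4474 samples, the same arithmetic gives 3579, 447, and 448 rows, which matches the counts printed by the split cell.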