<h1>Practical example: Audiobooks</h1>

<h2>Preprocess the data, balance the dataset, and create three datasets: training, validation, and test. Save the newly created sets in a tensor-friendly format (e.g. *.npz)</h2>



Since we are dealing with real-life data, we need to preprocess it before training. The code below is not complicated, but it is crucial for creating a good model.

If you want to know how the preprocessing works, go through the code. It should work for most supervised learning datasets organized in the same way: several input columns followed by one column containing the targets. Keep in mind that a specific problem may require additional preprocessing.

Note that we have removed the header row, which contains the column names, from the CSV file. We want only the numeric data.

This code does not include comments; it is the same as the code in the lesson. Please refer to the other file if you want the commented version.

<h2>Extract the data from csv</h2>

In [3]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
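
To make the column slicing concrete, here is a minimal sketch on a made-up three-row file. The values and the meaning of the first column are assumptions for illustration only (we treat it as an ID-like column that is not a useful input); the real Audiobooks_data.csv has more columns.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV: first column, two inputs, target.
fake_csv = io.StringIO("101,1620,19.73,0\n102,2160,5.33,1\n103,1080,8.00,0\n")
raw = np.loadtxt(fake_csv, delimiter=',')

inputs = raw[:, 1:-1]   # every column except the first and the last
targets = raw[:, -1]    # the last column only

print(inputs.shape)   # (3, 2)
print(targets)        # [0. 1. 0.]
```

The slice `1:-1` starts at the second column and stops before the last one, so neither the first column nor the targets end up in the inputs.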

<h2>Balance the dataset</h2>

In [4]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
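
The loop above keeps every row with target 1 and only the first equally many rows with target 0. As a sketch, the same selection can be written without an explicit loop. The helper name `balance_by_truncation` and the toy arrays are hypothetical, not part of the lesson code.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep all target-1 rows and only the first equally many target-0 rows."""
    num_ones = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    # Zeros beyond the first `num_ones` of them are surplus and get dropped.
    surplus = zero_positions[num_ones:]
    return np.delete(inputs, surplus, axis=0), np.delete(targets, surplus, axis=0)

toy_targets = np.array([0, 0, 1, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)
bal_inputs, bal_targets = balance_by_truncation(toy_inputs, toy_targets)

print(bal_targets)   # [0 0 1 1]
```

Like the loop, this assumes the zeros outnumber the ones. If they do not, nothing is removed and the dataset stays unbalanced.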

<h2>Standardize the inputs</h2>

In [5]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
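
With its default arguments, `preprocessing.scale` standardizes each column to zero mean and unit variance. The sketch below reproduces that idea with plain NumPy on a made-up array. It assumes no column is constant, since a zero standard deviation would cause a division by zero here.

```python
import numpy as np

# Toy inputs: two columns on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Subtract each column's mean and divide by its (population) standard deviation.
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(standardized.mean(axis=0), 0.0))  # True
print(np.allclose(standardized.std(axis=0), 1.0))   # True
```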

<h2>Shuffle the data</h2>

In [6]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
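
Indexing both arrays with the same shuffled indices is what keeps each input row paired with its target. The sketch below checks this on toy data. It uses a seeded generator, which is our assumption and not part of the lesson code, so that the shuffle is reproducible between runs.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for reproducibility

toy_inputs = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ..., [8, 9]
toy_targets = toy_inputs[:, 0] // 2        # row i gets target i, so pairs are easy to verify

order = rng.permutation(toy_inputs.shape[0])
shuf_inputs = toy_inputs[order]
shuf_targets = toy_targets[order]

# Each shuffled row still sits next to its original target.
print(np.array_equal(shuf_inputs[:, 0] // 2, shuf_targets))  # True
```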

<h2>Split the dataset into train, validation, and test</h2>

In [17]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(f"Train targets: {np.sum(train_targets)}, total samples: {train_samples_count}, proportion: {np.sum(train_targets) / train_samples_count}")
print(f"Validation targets: {np.sum(validation_targets)}, total samples: {validation_samples_count}, proportion: {np.sum(validation_targets) / validation_samples_count}")
print(f"Test targets: {np.sum(test_targets)}, total samples: {test_samples_count}, proportion: {np.sum(test_targets) / test_samples_count}")

Train targets: 1787.0, total samples: 3579, proportion: 0.4993014808605756
Validation targets: 217.0, total samples: 447, proportion: 0.4854586129753915
Test targets: 233.0, total samples: 448, proportion: 0.5200892857142857
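
The three sample counts in the output add up to 3579 + 447 + 448 = 4474 balanced samples. As a quick check of the split arithmetic, the sketch below recomputes the counts from that total. The helper name `split_counts` is hypothetical.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))  # (3579, 447, 448)
```

Because `int` truncates, the samples lost to rounding in the first two sets end up in the test set, which is why it is one sample larger than the validation set.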


<h2>Save the three datasets in *.npz</h2>

In [18]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
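
To confirm that a saved file can be read back, here is a minimal round-trip sketch. It writes a toy array pair to a temporary directory instead of touching the real Audiobooks files. The file name `toy_train` and the values are made up.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_train')
    # np.savez appends the '.npz' extension when it is missing.
    np.savez(path, inputs=toy_inputs, targets=toy_targets)
    with np.load(path + '.npz') as npz:
        # The keys match the keyword names passed to np.savez.
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(np.array_equal(loaded_inputs, toy_inputs))    # True
print(np.array_equal(loaded_targets, toy_targets))  # True
```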