# Audiobooks business case

## Preprocess the data. Balance the dataset. Standardize the data. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Since we are dealing with real-life data, we need to preprocess it a bit.

To see how, go through the commented code below. The same steps should work for most datasets organized as many input columns followed by one column containing the targets (supervised learning datasets). Keep in mind that a specific problem may require additional preprocessing.

Note that we have removed the header row, which contains the names of the categories. We simply need the numerical data.

## Extract Data from the csv

In [45]:
import numpy as np

from sklearn.preprocessing import StandardScaler

import pickle

In [46]:
raw_csv_data = np.loadtxt(
    "/home/angelo/repos/vscode_repos/customer_analytics_2022/Data/Audiobooks_data.csv",
    delimiter=",",
)
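`np.loadtxt` only parses numeric text, which is why the header row had to be removed. If a file still had its header, a minimal sketch like the following could skip it at load time instead. The CSV content and column names here are made up for illustration and are not taken from the Audiobooks file.

```python
import io

import numpy as np

# Hypothetical three-row CSV with a header line (ID, two inputs, target).
csv_text = (
    "id,minutes,price,target\n"
    "1,120.0,5.33,0\n"
    "2,300.0,10.13,1\n"
    "3,60.0,8.00,0\n"
)

# skiprows=1 tells np.loadtxt to ignore the first (header) line.
demo_data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(demo_data.shape)  # (3, 4)
```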

In [47]:
# we want all but the first and last column
unscalled_inputs_all = raw_csv_data[:, 1:-1]

# the last column holds the targets (the dependent variable)
targets_all = raw_csv_data[:, -1]
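To make the slicing concrete, here is a small sketch on a toy array. The column layout (an ID first, the target last) is an assumption for illustration only.

```python
import numpy as np

# Hypothetical layout: column 0 = ID, columns 1-2 = inputs, last column = target.
demo = np.array(
    [
        [101, 1.5, 20.0, 0],
        [102, 2.5, 30.0, 1],
    ]
)

demo_inputs = demo[:, 1:-1]  # every row, all columns except the first and the last
demo_targets = demo[:, -1]  # every row, last column only

print(demo_inputs.shape, demo_targets.shape)  # (2, 2) (2,)
```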

## Balance the dataset

In [48]:
# 1. count the number of targets == 1 by summing the column
num_one_targets = int(np.sum(targets_all))


# 2. keep only as many 0s as there are 1s (the first ones encountered, not a random sample)
zero_targets_counter = 0

# list to collect the indices of the rows to be removed
indices_to_remove = []

# targets_all.shape[0] is the length of the vector
for i in range(targets_all.shape[0]):
    # increase the counter by 1 whenever the target is 0
    if targets_all[i] == 0:
        zero_targets_counter += 1

        # once there are more 0s than 1s, mark every further 0 for removal
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(
    unscalled_inputs_all, indices_to_remove, axis=0
)
targets_equals_priors = np.delete(targets_all, indices_to_remove, axis=0)
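The loop above keeps the first 0s it encounters and drops the later ones. If a truly random subset of 0s is preferred, one possible vectorized sketch is shown below. It is an alternative technique (random undersampling), not the method used in this notebook, and the toy arrays and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily, for reproducibility

# Hypothetical imbalanced targets: eight 0s and three 1s.
demo_targets = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
demo_inputs = np.arange(demo_targets.shape[0] * 2).reshape(-1, 2)

one_idx = np.flatnonzero(demo_targets == 1)
zero_idx = np.flatnonzero(demo_targets == 0)

# Randomly pick as many 0-rows as there are 1-rows, without replacement.
kept_zero_idx = rng.choice(zero_idx, size=one_idx.shape[0], replace=False)

# Sort so the balanced set keeps the original row order.
keep_idx = np.sort(np.concatenate([one_idx, kept_zero_idx]))

balanced_inputs = demo_inputs[keep_idx]
balanced_targets = demo_targets[keep_idx]

print(balanced_targets.sum(), balanced_targets.shape[0])  # 3 6
```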

## Standardize the inputs 

In [49]:
scaler_deep_learning = StandardScaler()
scaler_inputs = scaler_deep_learning.fit_transform(unscaled_inputs_equal_priors)
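With its default settings, `StandardScaler` subtracts each column's mean and divides by that column's population standard deviation. The sketch below reproduces that arithmetic with plain NumPy on made-up numbers, to show what the transformed columns look like.

```python
import numpy as np

# Hypothetical inputs: two columns on very different scales.
demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

col_mean = demo.mean(axis=0)
col_std = demo.std(axis=0)  # population standard deviation (ddof=0)

standardized = (demo - col_mean) / col_std

# Each column now has mean ~0 and standard deviation ~1.
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```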

## Shuffle the data to randomize it a little

We assume the data is ordered by date. Because training uses batches of inputs, the samples in a batch drawn from one specific date may be more homogeneous than the dataset as a whole, which can confuse stochastic gradient descent.

In [50]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaler_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaler_inputs[shuffled_indices]
shuffled_targets = targets_equals_priors[shuffled_indices]
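Because `np.random.shuffle` is unseeded here, every run produces a different split. If reproducibility matters, a seeded generator is one option. The sketch below uses toy arrays and an arbitrary seed, and checks that inputs and targets stay paired after shuffling.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, an assumption for this sketch

# Toy data built so that row i of the inputs pairs with target i * 10.
demo_inputs = np.arange(10).reshape(5, 2)
demo_targets = np.arange(5) * 10

# One permutation of the row indices, applied to both arrays.
perm = rng.permutation(demo_inputs.shape[0])
shuf_inputs = demo_inputs[perm]
shuf_targets = demo_targets[perm]

# The pairing survives: the first input column is 2 * i, the target is 10 * i.
print(np.array_equal(shuf_inputs[:, 0] // 2 * 10, shuf_targets))  # True
```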

## Split the dataset into train, validation, and test

In [51]:
samples_count = shuffled_inputs.shape[0]

# use 80-10-10 split (train, val, test)
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count


# now extract them
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[
    train_samples_count : train_samples_count + validation_samples_count
]
validation_targets = shuffled_targets[
    train_samples_count : train_samples_count + validation_samples_count
]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count :]
test_targets = shuffled_targets[train_samples_count + validation_samples_count :]


# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(
    np.sum(train_targets),
    train_samples_count,
    np.sum(train_targets) / train_samples_count,
)
print(
    np.sum(validation_targets),
    validation_samples_count,
    np.sum(validation_targets) / validation_samples_count,
)
print(
    np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count
)

1791.0 3579 0.5004191114836547
213.0 447 0.47651006711409394
233.0 448 0.5200892857142857
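The printed proportions only check the targets. A quick shape check on the inputs as well helps catch slicing mistakes, such as taking an inputs slice from the wrong array. The helper below is a hypothetical sketch of the same 80-10-10 split, run on random toy data with an assumed 10 input columns.

```python
import numpy as np


def split_80_10_10(inputs, targets):
    """Split row-aligned arrays into train, validation, and test parts."""
    n = inputs.shape[0]
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = (inputs[:n_train], targets[:n_train])
    val = (inputs[n_train : n_train + n_val], targets[n_train : n_train + n_val])
    test = (inputs[n_train + n_val :], targets[n_train + n_val :])
    return train, val, test


# Toy data: 100 samples with 10 input columns and alternating 0/1 targets.
demo_inputs = np.random.default_rng(1).normal(size=(100, 10))
demo_targets = np.arange(100) % 2

train, val, test = split_80_10_10(demo_inputs, demo_targets)

for name, (x, y) in {"train": train, "validation": val, "test": test}.items():
    # Inputs must be 2-D and have one row per target.
    assert x.ndim == 2 and x.shape[0] == y.shape[0]
    print(name, x.shape, y.shape)
```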


## Save the three datasets in *.npz

In [52]:
# Save the three datasets in *.npz.
# In the next lesson, you will see that it is extremely valuable to name them in such a coherent way!

np.savez("Audiobooks_data_train", inputs=train_inputs, targets=train_targets)
np.savez(
    "Audiobooks_data_validation", inputs=validation_inputs, targets=validation_targets
)
np.savez("Audiobooks_data_test", inputs=test_inputs, targets=test_targets)
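`np.savez` appends the `.npz` extension automatically and stores each keyword argument under its name. The sketch below shows how such a file can be read back. It writes a toy file to a temporary directory rather than touching the Audiobooks files, and the dtype casts are an assumption about what a later training step might want.

```python
import os
import tempfile

import numpy as np

demo_inputs = np.zeros((4, 3))
demo_targets = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_train")
    np.savez(path, inputs=demo_inputs, targets=demo_targets)  # writes demo_train.npz

    # The keys match the keyword names passed to np.savez.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(loaded_inputs.shape, loaded_targets.dtype)
```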

### Save the scaler

In [53]:
# As with the scalers saved earlier, we also save this scaler so we can apply it to new data
with open(
    "/home/angelo/repos/vscode_repos/customer_analytics_2022/pickle_data_models/scaler_deep_learning.pickle",
    "wb",
) as scaler_file:
    pickle.dump(scaler_deep_learning, scaler_file)
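Saving the scaler matters because new data must be transformed with the same mean and standard deviation that were learned here. A minimal round-trip sketch follows. The toy data and the temporary file name are assumptions, not the notebook's actual files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up data standing in for the balanced inputs.
train_like = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler().fit(train_like)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler_demo.pickle")  # hypothetical name

    with open(scaler_path, "wb") as f:
        pickle.dump(scaler, f)

    with open(scaler_path, "rb") as f:
        restored = pickle.load(f)

# Use transform (not fit_transform) so the stored statistics are reused.
new_data = np.array([[2.0, 200.0]])
scaled_new = restored.transform(new_data)

print(scaled_new)  # [[0. 0.]]
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.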