## Week 10: Day 3 – Business Case 

### Practical Example: Audiobooks

1. Preprocess the data
2. Balance the dataset
3. Create three datasets (training, validation, and test)
4. Save the newly created sets in a tensor-friendly format (e.g. *.npz)

#### Extract the data from the CSV

In [1]:
# Import the necessary packages
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')
# The inputs are all columns in the CSV except the first one [:,0] (arbitrary customer IDs that carry no useful information)
# and the last one [:,-1] (our targets).
unscaled_inputs_all = raw_csv_data[:,1:-1]
# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]
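
To make the two slices concrete, here is a minimal sketch on a made-up array. The values and the `toy_` names are hypothetical and are not taken from `Audiobooks_data.csv`; only the slicing pattern mirrors the cell above.

```python
import numpy as np

# Hypothetical toy data: column 0 = customer ID, columns 1-2 = features, last column = target
toy_csv_data = np.array([
    [101, 5.0, 1.5, 1],
    [102, 3.0, 0.5, 0],
    [103, 8.0, 2.0, 0],
])

toy_inputs = toy_csv_data[:, 1:-1]   # drop the ID column and the target column
toy_targets = toy_csv_data[:, -1]    # keep only the last column

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [1. 0. 0.]
```

Printing the shapes of `unscaled_inputs_all` and `targets_all` in the same way is a cheap check that the ID column was dropped and that both arrays have the same number of rows.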

#### Balancing the dataset

In [2]:
num_one_targets = int(np.sum(targets_all)) # count how many targets are 1
zero_targets_count = 0 # counter for target 0
indices_to_remove = [] # indices of the extra input/target pairs to drop for balance

# Count the targets that are 0. Once there are as many 0s as 1s, mark every further entry whose target is 0 for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_count += 1
        if zero_targets_count > num_one_targets :
            indices_to_remove.append(i)
            
# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
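
The loop above can also be written without explicit iteration. The following is a sketch on hypothetical toy targets, not a replacement for the course code: it finds the positions of all 0 targets with `np.flatnonzero` and marks every one after the first `num_ones` for removal, which should select the same indices as the loop.

```python
import numpy as np

# Hypothetical toy targets: three 1s and five 0s
toy_targets = np.array([0, 1, 0, 0, 1, 0, 1, 0])

num_ones = int(np.sum(toy_targets))
zero_indices = np.flatnonzero(toy_targets == 0)   # positions of all 0 targets, in order
extra_zero_indices = zero_indices[num_ones:]      # every 0 after the first num_ones of them

balanced_targets = np.delete(toy_targets, extra_zero_indices, axis=0)
print(extra_zero_indices)   # [5 7]
print(balanced_targets)     # [0 1 0 0 1 1]
```

Like the loop, this keeps the zero-target rows that appear first in the file and drops the later ones.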

#### Standardizing the inputs

In [3]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
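
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that calculation in plain NumPy on hypothetical toy values, so the arithmetic is visible; it is an illustration, not the library's implementation.

```python
import numpy as np

# Hypothetical toy inputs with two features on very different scales
toy_inputs = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

column_means = toy_inputs.mean(axis=0)
column_stds = toy_inputs.std(axis=0)   # population standard deviation (ddof=0)
toy_scaled = (toy_inputs - column_means) / column_stds

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```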

#### Shuffle the data

In [4]:
# When the data was collected, it was arranged by date.
# Shuffle the indices of the data so that it is no longer in any particular order when we feed it to the model.
# Since we will be batching, we want the data to be as randomly spread out as possible.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
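
`np.random.shuffle` gives a different order on every run. If a repeatable split is wanted, one option is a seeded generator. The sketch below assumes NumPy's `default_rng` interface and an arbitrary seed of 42; the toy arrays are hypothetical. It also checks that indexing inputs and targets with the same index array keeps every row paired with its own target.

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value 42 is an arbitrary choice

toy_inputs = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], [6 7], [8 9]
toy_targets = toy_inputs[:, 0] // 2        # target i labels row i: 0, 1, 2, 3, 4

shuffled_indices = rng.permutation(toy_inputs.shape[0])
shuffled_inputs = toy_inputs[shuffled_indices]
shuffled_targets = toy_targets[shuffled_indices]

# Each target still matches its row after shuffling
print(np.array_equal(shuffled_inputs[:, 0] // 2, shuffled_targets))  # True
```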

#### Split the dataset into training, validation, and test sets

In [5]:
# Count the total number of samples
sample_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_sample_count = int(0.8 * sample_count)
validation_sample_count = int(0.1 * sample_count)

# The 'test' dataset contains all remaining data.
test_sample_count = sample_count - train_sample_count - validation_sample_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_sample_count" observations
train_inputs = shuffled_inputs[:train_sample_count]
train_targets = shuffled_targets[:train_sample_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_sample_count" observations, following the "train_sample_count" we already assigned
validation_inputs = shuffled_inputs[train_sample_count:train_sample_count + validation_sample_count]
validation_targets = shuffled_targets[train_sample_count:train_sample_count + validation_sample_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_sample_count + validation_sample_count:]
test_targets = shuffled_targets[train_sample_count + validation_sample_count:]

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets),train_sample_count,np.sum(train_targets)/train_sample_count)
print(np.sum(validation_targets),validation_sample_count, np.sum(validation_targets)/validation_sample_count)
print(np.sum(test_targets),test_sample_count, np.sum(test_targets)/test_sample_count)

1794.0 3579 0.501257334450964
226.0 447 0.5055928411633109
217.0 448 0.484375
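
The printed counts can be cross-checked by hand. The sketch below uses only the numbers from the output above: the three subset sizes add up to the balanced sample count, the 80-10-10 arithmetic reproduces them, and the 1s make up exactly half of all samples even though each individual subset is only approximately balanced.

```python
sample_count = 3579 + 447 + 448   # subset sizes from the output above

train_sample_count = int(0.8 * sample_count)
validation_sample_count = int(0.1 * sample_count)
test_sample_count = sample_count - train_sample_count - validation_sample_count

print(sample_count, train_sample_count, validation_sample_count, test_sample_count)
# 4474 3579 447 448

ones_total = 1794 + 226 + 217     # targets equal to 1 in each subset, from the output above
print(ones_total, ones_total / sample_count)
# 2237 0.5
```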


#### Save the three datasets in *.npz format

In [6]:
np.savez('Audiobooks_data_train',inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation',inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobooks_data_test',inputs = test_inputs, targets = test_targets)
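
To confirm the files round-trip, they can be read back with `np.load`. The sketch below is an illustration on hypothetical toy arrays written to a temporary directory, so it does not touch the real files. Note that `np.savez` appends `.npz` to a filename that lacks it, and that the keyword names used when saving (`inputs`, `targets`) become the keys used when loading.

```python
import os
import tempfile
import numpy as np

# Hypothetical toy arrays standing in for one of the three subsets
toy_inputs = np.array([[0.1, -0.2], [0.3, 0.4]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')   # saved as 'toy_data.npz'
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    with np.load(path + '.npz') as npz:
        saved_keys = list(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs.shape, loaded_targets.shape)  # (2, 2) (2,)
```

The same pattern, applied to `'Audiobooks_data_train.npz'` and the other two files, would recover the arrays saved above.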