
In [75]:
# This preprocessing can be used for any ML classification problem with two target classes.
# If a future problem has more than two target classes, the dataset must be balanced differently.

In [1]:
# Dependencies
import numpy as np
import pandas as pd
# sklearn utilities for standardizing inputs
from sklearn import preprocessing

# Preprocessing


In [4]:
# Load the CSV with NumPy
raw_csv_data = np.loadtxt('Data/original.csv', delimiter=',')
# Drop the ID (first column) and the targets (last column); keep the rest as inputs
unscaled_inputs = raw_csv_data[:, 1:-1]
# Store the targets (last column) as a rank-1 tensor, i.e. a vector
targets = raw_csv_data[:, -1]
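
To make the column slicing concrete, here is a minimal sketch on a made-up array. It assumes, as the cell above does, that the first column is an ID and the last column is the target; the values and the `toy_` names are invented for illustration.

```python
import numpy as np

# Hypothetical toy data laid out as [id, feature_1, feature_2, target]
toy_csv_data = np.array([
    [101.0, 5.0, 0.2, 1.0],
    [102.0, 3.0, 0.9, 0.0],
    [103.0, 8.0, 0.4, 0.0],
])

toy_inputs = toy_csv_data[:, 1:-1]   # drop the id (first) and target (last) columns
toy_targets = toy_csv_data[:, -1]    # keep only the last column, as a 1-D vector

print(toy_inputs.shape)   # (3, 2)
print(toy_targets.shape)  # (3,)
```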


### Shuffling the data

In [78]:
# The data must be shuffled because we train in batches.
# For example, if the data was collected and ordered by date, confounding variables
# that affect behavior (promotions, day-of-the-week effects, etc.) could create
# homogeneous batches that would confuse the ML algorithm.
shuffled_indices = np.arange(unscaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = unscaled_inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]
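
The index-based shuffle above keeps each input row paired with its target because both arrays are reordered by the same permutation. The sketch below checks that on made-up data. It uses a seeded `np.random.default_rng` instead of `np.random.shuffle` so the result is repeatable; the seed and the `toy_` names are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen only for repeatability

# Hypothetical toy data: each target equals the first feature of its row,
# so alignment is easy to verify after shuffling.
toy_inputs = np.arange(10, dtype=float).reshape(5, 2)   # rows [0, 1], [2, 3], ...
toy_targets = toy_inputs[:, 0].copy()

permutation = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[permutation]
shuffled_toy_targets = toy_targets[permutation]

# Every row still carries its own target.
aligned = np.array_equal(shuffled_toy_inputs[:, 0], shuffled_toy_targets)
print(aligned)  # True
```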

### Balancing the dataset

In [79]:
# Determine how many '1' (converted) targets there are
one_targets = int(np.sum(shuffled_targets))
# Counter for '0' (not converted) targets
zero_targets_counter = 0
# Indices of the surplus '0' rows to remove
indices_to_remove = []

# Keep the first `one_targets` zeros and mark every later zero for removal
for i in range(shuffled_targets.shape[0]):
    if shuffled_targets[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(shuffled_inputs, indices_to_remove, axis=0)
targets_equal_priors = np.delete(shuffled_targets, indices_to_remove, axis=0)
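
The same balancing rule can be written without an explicit loop. The following is a sketch, not the notebook's method: `balance_binary` is a hypothetical helper that keeps every `1` row plus the first equally many `0` rows, and it assumes the targets contain only 0 and 1 with the 1's in the minority.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep all 1-rows and the first equally many 0-rows (hypothetical helper)."""
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Sorting restores the original row order of the kept rows.
    keep = np.sort(np.concatenate([one_idx, zero_idx[:one_idx.size]]))
    return inputs[keep], targets[keep]

# Made-up data: 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(7).reshape(7, 1)

balanced_inputs, balanced_targets = balance_binary(toy_inputs, toy_targets)
print(balanced_targets)  # [0 1 0 1]
```

Like the loop, this keeps the earliest zeros and drops the later ones, so the kept rows are the same in both versions.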

In [80]:
print(f'Our original dataset had {shuffled_inputs.shape[0]} inputs and {shuffled_targets.shape[0]} targets')
print(f'When we balanced our priors, we have {unscaled_inputs_equal_priors.shape[0]} inputs and {targets_equal_priors.shape[0]} targets')



Our original dataset had 14084 inputs and 14084 targets
When we balanced our priors, we have 4474 inputs and 4474 targets


### Shuffling the dataset

In [81]:
# We must shuffle the inputs and targets again after balancing. Balancing keeps only the
# first zeros it meets, so the end of the balanced arrays is almost all 1's; without a
# reshuffle, the validation and test sets (taken from the end) would contain mostly 1's.
shuffled_indices = np.arange(unscaled_inputs_equal_priors.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = unscaled_inputs_equal_priors[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
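
To see why the second shuffle matters, the sketch below applies the same "keep the first zeros" rule to a made-up, deterministic target vector and compares the share of 1's at the start and at the end of the balanced result. The data and names are invented for illustration.

```python
import numpy as np

# Hypothetical targets: 10 ones and 30 zeros, evenly interleaved
toy_targets = np.tile([0, 0, 0, 1], 10)
one_count = int(toy_targets.sum())

# Same rule as the balancing loop: drop every zero after the first `one_count` zeros.
zero_positions = np.flatnonzero(toy_targets == 0)
balanced_targets = np.delete(toy_targets, zero_positions[one_count:])

# Share of ones in the first 80% versus the last 20% of the balanced rows
cut = int(balanced_targets.size * 0.8)
head_share = balanced_targets[:cut].mean()
tail_share = balanced_targets[cut:].mean()
print(head_share, tail_share)  # 0.375 1.0
```

Without a reshuffle, a split that takes validation and test rows from the end would see only 1's in this toy case.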

### Standardizing Inputs

In [82]:
# Standardize the inputs (zero mean, unit variance per column) with sklearn's preprocessing.scale.
# Scale the reshuffled inputs so that every row stays aligned with shuffled_targets.
scaled_inputs = preprocessing.scale(shuffled_inputs)

### Splitting data into train, validation, and test sets

In [83]:
sample_count = scaled_inputs.shape[0]

# Using an 80/10/10 split for train, validation, and test
train_samples_count = int(sample_count * .8)
validation_samples_count = int(sample_count * .1)
test_samples_count = sample_count - (train_samples_count + validation_samples_count)

In [84]:
# Slice the scaled inputs so that the standardized values are the ones that get saved
train_inputs = scaled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = scaled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = scaled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]
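
The cells above compute the standardization over the whole balanced set before splitting. A common alternative, which is not what this notebook does, is to split first and fit the scaler on the training rows only, so that validation and test statistics do not influence the scaling. The sketch below shows that variant with sklearn's `StandardScaler` on made-up data; the seed, array sizes, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed, for a repeatable toy example
toy_inputs = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up features

# 80/10/10 boundaries, computed the same way as in the cells above
train_end = int(toy_inputs.shape[0] * 0.8)
validation_end = train_end + int(toy_inputs.shape[0] * 0.1)

train_raw, validation_raw, test_raw = np.split(toy_inputs, [train_end, validation_end])

# Fit on the training rows only, then reuse that mean and standard deviation elsewhere.
scaler = StandardScaler().fit(train_raw)
train_scaled = scaler.transform(train_raw)
validation_scaled = scaler.transform(validation_raw)
test_scaled = scaler.transform(test_raw)

print(train_scaled.shape, validation_scaled.shape, test_scaled.shape)
# (80, 3) (10, 3) (10, 3)
```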


In [85]:
# Check that each split is balanced: the share of converts (1's) should be close to 0.5

print(f'Our final training dataset consists of {np.sum(train_targets)} converts out of {train_samples_count} customers, a convert share of {np.sum(train_targets)/train_samples_count}')
print('-----------------------------')
print(f'Our final validation dataset consists of {np.sum(validation_targets)} converts out of {validation_samples_count} customers, a convert share of {np.sum(validation_targets)/validation_samples_count}')
print("-----------------------------")
print(f'Our final testing dataset consists of {np.sum(test_targets)} converts out of {test_samples_count} customers, a convert share of {np.sum(test_targets)/test_samples_count}')


Our final training dataset consists of 1783.0 converts out of 3579 customers, a convert share of 0.4981838502374965
-----------------------------
Our final validation dataset consists of 232.0 converts out of 447 customers, a convert share of 0.5190156599552572
-----------------------------
Our final testing dataset consists of 222.0 converts out of 448 customers, a convert share of 0.4955357142857143
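
The three near-identical print statements could be folded into a small helper. This is a sketch only: `convert_share` and the toy target vectors are hypothetical and stand in for the notebook's real splits.

```python
import numpy as np

def convert_share(targets):
    """Return the fraction of targets equal to 1 (hypothetical helper)."""
    return float(np.mean(targets == 1))

# Made-up target vectors standing in for the train/validation/test splits
toy_splits = {
    "train": np.array([1, 0, 1, 0, 1, 0, 0, 1]),
    "validation": np.array([1, 0]),
    "test": np.array([0, 1]),
}

for name, split_targets in toy_splits.items():
    print(f"{name}: convert share {convert_share(split_targets):.2f} of {split_targets.size} customers")
```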


In [87]:
np.savez("Data/Audiobooks_data_train",inputs=train_inputs,targets=train_targets)
np.savez("Data/Audiobooks_data_validation",inputs=validation_inputs,targets=validation_targets)
np.savez("Data/Audiobooks_data_test",inputs=test_inputs,targets=test_targets)
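
`np.savez` appends `.npz` to a file name that lacks it, so the files above end up as `Data/Audiobooks_data_train.npz` and so on. The sketch below shows a save and reload round trip on made-up arrays in a temporary directory; the `toy_` names and the file name are invented for illustration.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.1, -1.2], [0.7, 0.3]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path_without_extension = os.path.join(tmp_dir, "toy_train")
    np.savez(path_without_extension, inputs=toy_inputs, targets=toy_targets)

    # np.savez added ".npz", so that is the file to load.
    with np.load(path_without_extension + ".npz") as npz:
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["targets"]

print(loaded_inputs.shape, loaded_targets.shape)  # (2, 2) (2,)
```

The keyword names passed to `np.savez` (`inputs`, `targets`) are the keys used to read the arrays back.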
