# Deep Neural Network for MNIST Classification

We'll apply all the knowledge from the lectures in this section to write a deep neural network. The problem we've chosen is referred to as the "Hello World" of deep learning because for most students it is the first deep learning algorithm they see.
   
The dataset is called MNIST and refers to handwritten digit recognition. You can find more about it on Yann LeCun's website (Director of AI Research, Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as convolutional neural networks (CNNs).

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

Our goal would be to build a neural network with 2 hidden layers.

### VID 376 Importing relevant packages and loading the data
### Import relevant library

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [2]:
# TensorFlow includes a data provider for MNIST that we'll use.
# It comes with the tensorflow-datasets module

In [3]:
import tensorflow_datasets as tfds

## Data

In [4]:
# tfds.load actually loads a dataset (or downloads and then loads if that's the first time you use it)
# in our case, we are interested in MNIST; the name of the dataset is the only mandatory argument
# there are other arguments we can specify, which we can find useful
# mnist_dataset = tfds.load(name='mnist', as_supervised=True)
# mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
# with_info=True will also provide us with a tuple containing information about the version, features, number of samples
# we will use this information a bit below and we will store it in mnist_info
# as_supervised=True will load the dataset in a 2-tuple structure (input, target)
# alternatively, as_supervised=False, would return a dictionary
# obviously we prefer to have our inputs and targets separated


## VID 377 preprocessing the data- Create a validation set and scale it

In [5]:
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)

# once we have loaded the dataset, we can easily extract the training and testing datasets with the built-in references
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

# by default, TF has training and testing datasets, but no validation sets
# thus we must split it on our own
# we start by defining the number of validation samples as a % of the train samples
# this is also where we make use of mnist_info (we don't have to count the observations)

num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples

# let's cast this number to an integer, as a float may cause an error along the way
# tf.cast(x,dtype)-casts (converts) a variable into a given data type

num_validation_samples = tf.cast(num_validation_samples, tf.int64)

# let's also store the number of test samples in a dedicated variable (instead of using the mnist_info one)
num_test_samples = mnist_info.splits['test'].num_examples

# once more, we'd prefer an integer (rather than the default float)
num_test_samples = tf.cast(num_test_samples, tf.int64)

# normally, we would like to scale our data in some way to make the result more numerically stable
# in this case we will simply prefer to have inputs between 0 and 1
# let's define a function called: scale, that will take an MNIST image and its label

def scale(image, label):
    # we make sure the value is a float
    image = tf.cast(image, tf.float32)
    # since the possible values for the inputs are 0 to 255 (256 different shades of grey),
    # dividing each element by 255 gives the desired result -> all elements will be between 0 and 1
    image /= 255. # the dot at the end means we want the result to be a float
    return image, label

# dataset.map(*function*) applies a custom transformation to a given dataset; it takes as input a function
# which determines the transformation

# we have already decided that we will get the validation data from mnist_train

scaled_train_and_validation_data = mnist_train.map(scale) 
# this will scale the whole train dataset and store it in our new variable

# finally, we scale the test data (it is batched further below)
# we scale it so it has the same magnitude as the train and validation data
# there is no need to shuffle it, because we won't be training on the test data
# later there will be a single batch, equal to the size of the test data

test_data = mnist_test.map(scale)
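
The scaling step can be checked in isolation. The following is a minimal NumPy sketch, not the TensorFlow pipeline itself: the random `image` array is a hypothetical stand-in for one MNIST image, and `scale_pixels` is an illustrative name that mirrors what `scale` does to the pixel values.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28x1 integers in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)

def scale_pixels(image):
    """Cast to float32 and divide by 255 so that every value lies in [0, 1]."""
    return image.astype(np.float32) / 255.0

scaled = scale_pixels(image)
print(scaled.dtype, scaled.min(), scaled.max())
```

The shape is unchanged; only the data type and the value range differ from the raw image.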

# VID 379 preprocess the data-shuffle and Batch
# SHUFFLING: keeping the same information but in a different order

BUFFER_SIZE = 10000
# this BUFFER_SIZE parameter is here for cases when we're dealing with enormous datasets
# then we can't shuffle the whole dataset in one go because we can't fit it all in memory of the computer
# so instead TF only stores BUFFER_SIZE samples in memory at a time and shuffles them
# if BUFFER_SIZE=1 => no shuffling will actually happen
# if BUFFER_SIZE >= num_samples => the whole dataset is shuffled at once (uniformly)
# if 1 < BUFFER_SIZE < num_samples => the shuffle is only approximate, but it needs less memory

shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)
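
To see why the buffer size matters, here is a plain-Python sketch of a buffered shuffle. It is a simplified illustration of the idea described in the comments above, not TensorFlow's actual implementation; `buffered_shuffle` and its `seed` parameter are hypothetical names.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in shuffled order while holding at most buffer_size of them."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    # Fill the buffer with the first buffer_size elements
    buffer = list(itertools.islice(source, buffer_size))
    for item in source:
        # Emit a random buffered element and put the new item in its place
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # The source is exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer

no_shuffle = list(buffered_shuffle(range(10), buffer_size=1))
partial = list(buffered_shuffle(range(10), buffer_size=3))
full = list(buffered_shuffle(range(10), buffer_size=10))
print(no_shuffle)
print(partial)
print(full)
```

With a buffer of 1 the order is unchanged. With a buffer of 3, an element can only move a limited distance forward. With a buffer at least as large as the data, every ordering is possible.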

# once we have scaled and shuffled the data, we can proceed to actually extracting the train and validation
# our validation data would be equal to 10% of the training set, which we've already calculated
# we use the .take() method to take that many samples

# (further below, we batch it with a batch size equal to the total number of validation samples)

validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

# similarly, the train_data is everything else, so we skip as many samples as there are in the validation dataset

train_data = shuffled_train_and_validation_data.skip(num_validation_samples)
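
The take/skip split can be sketched with plain Python iterators. This is only an illustration of the semantics: `take` and `skip` here are hypothetical helper functions, and the 60,000 figure assumes the standard size of the MNIST train split.

```python
from itertools import islice

def take(iterable, count):
    """Return the first count elements."""
    return list(islice(iterable, count))

def skip(iterable, count):
    """Return everything after the first count elements."""
    return list(islice(iterable, count, None))

num_train_examples = 60000                      # assumed size of the MNIST train split
num_validation_samples = num_train_examples // 10

samples = list(range(num_train_examples))       # stand-ins for (image, label) pairs
validation = take(samples, num_validation_samples)
train = skip(samples, num_validation_samples)
print(len(validation), len(train))
```

Because both calls use the same count, the two subsets never overlap and together cover the whole dataset.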

# determine the batch size
batch_size = 100

# we can also take advantage of the occasion to batch the train data
# this would be very helpful when we train, as we would be able to iterate over the different batches

# batch size = 1 => stochastic gradient descent (SGD)
# batch size = number of samples => (single-batch) gradient descent (GD)
# dataset.batch(batch_size) is a method that combines the consecutive elements of a dataset into batches

train_data = train_data.batch(batch_size) # this tells our model how many samples it should take in each batch
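
As a rough illustration of what batching does to the number of steps per epoch, the sketch below groups consecutive elements of a list. The `batch` helper is a hypothetical stand-in for `dataset.batch`, and the sample counts assume the 54,000/6,000 split described above.

```python
def batch(items, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

train_samples = list(range(54000))        # assumed number of training samples
validation_samples = list(range(6000))    # assumed number of validation samples

train_batches = list(batch(train_samples, 100))
validation_batches = list(batch(validation_samples, len(validation_samples)))
print(len(train_batches), len(validation_batches))  # 540 1
```

The 540 training batches correspond to the `540/540` counter shown in the training log, and a batch size equal to the dataset size gives exactly one batch.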

# Since we won't be backpropagating on the validation data, only forward propagating, we don't really need to batch it
# Recall that batching lets us update the weights once per batch (here, 100 samples) rather than after every sample,
# which reduces the noise in the training updates. Whenever we validate or test, we simply forward propagate once
# When batching, we find the average loss and average accuracy. During validation and testing we want the exact values,
# so we take all the data at once. Moreover, forward propagation does not need much computational power, so
# calculating the exact values is not expensive. However, the model expects the validation data in batch form too

validation_data = validation_data.batch(num_validation_samples)

# batch the test data
test_data = test_data.batch(num_test_samples) # takes next batch (it is the only batch)

# because as_supervised=True, we've got a 2-tuple structure (the MNIST data is iterable and in 2-tuple format)
# so we must extract and convert the validation inputs and targets appropriately
# our validation data must have the same shape and object properties as the train and test data 

validation_inputs, validation_targets = next(iter(validation_data))

# iter() creates an object which can be iterated one element at a time (e.g. in a for or while loop). By default it will
# make the dataset iterable but will not load any data
# next() loads the next element (batch) of an iterable object, and since there is only one batch it will load
# the inputs and targets

### VID 381 MNIST; outline the model

In [6]:
# When thinking about a deep learning algorithm, we mostly imagine building the model.

# input_size = 784
# output_size = 10
# Use same hidden layer size for both hidden layers (Not a necessity though)
# underlying assumption is that all hidden layers are of same size
# Recall that width and depth are hyper parameters

# hidden_layer_size = 50

# the underlying assumption is that all hidden layers are of the same size; alternatively, we can create hidden layers with
# different widths and see if they work better for our particular problem

# define how the model will look like
# tf.keras.Sequential()-fn that lays down the model(used to stack layers)

# model = tf.keras.Sequential([
# the first layer (the input layer)
# each observation is 28x28x1 pixels, therefore it is a tensor of rank 3
# so we must flatten the images
# there is a convenient method 'Flatten' that simply takes our 28x28x1 tensor and orders it into a
# (28x28x1,) = (784,) vector
# tf.keras.layers.Flatten(original shape) transforms (flattens) a tensor into a vector
# this allows us to actually create a feed forward neural network
                          
                            # tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    
# tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
# it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    
# tf.keras.layers.Dense(output size) takes the inputs provided to the model, calculates the dot product of the inputs
# and weights, and adds the bias. This is also where we can apply the activation function
                            
                            # tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
                            # tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
# the final layer is no different, we just make sure to activate it with softmax
                            # tf.keras.layers.Dense(output_size, activation='softmax') # output layer
                            # ])
        
# YOU CAN STACK MANY LAYERS AS YOU WANT USING THIS STRUCTURE: 1 WIDTH 2 DEPTH 3 ACTIVATION
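
The outline above can be mimicked in NumPy to make the shapes concrete. This is a sketch of a single forward pass with random, untrained parameters, so the predicted probabilities are meaningless; the helper names (`dense`, `relu`, `softmax`) and the fake image batch are assumptions for illustration, not Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row maximum for numerical stability
    exp = np.exp(x - x.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def dense(inputs, weights, biases, activation):
    """output = activation(dot(input, weight) + bias)"""
    return activation(inputs @ weights + biases)

input_size, hidden_layer_size, output_size = 784, 50, 10

# Five fake 28x28x1 images, flattened to vectors of length 784
images = rng.random((5, 28, 28, 1))
flat = images.reshape(len(images), -1)

# Random (untrained) weights and zero biases for the three dense layers
w1 = rng.normal(scale=0.05, size=(input_size, hidden_layer_size))
w2 = rng.normal(scale=0.05, size=(hidden_layer_size, hidden_layer_size))
w3 = rng.normal(scale=0.05, size=(hidden_layer_size, output_size))
b1, b2, b3 = np.zeros(hidden_layer_size), np.zeros(hidden_layer_size), np.zeros(output_size)

hidden_1 = dense(flat, w1, b1, relu)          # 1st hidden layer
hidden_2 = dense(hidden_1, w2, b2, relu)      # 2nd hidden layer
probabilities = dense(hidden_2, w3, b3, softmax)  # output layer

print(flat.shape, hidden_1.shape, probabilities.shape)
```

Each row of `probabilities` sums to 1, which is what the softmax activation in the output layer guarantees.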

In [7]:
# REMOVING ALL COMMENTS FROM ABOVE

input_size = 784
output_size = 10
hidden_layer_size = 200

model = tf.keras.Sequential([                          
                            tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
# the final layer is no different, we just make sure to activate it with softmax
                            tf.keras.layers.Dense(output_size, activation='softmax') # output layer
                            ])
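
A quick way to get a feel for the width hyperparameter is to count the trainable parameters. The sketch below does the arithmetic by hand for the layer sizes used above, assuming each dense layer has one weight per input-output pair plus one bias per unit; `dense_parameter_count` is an illustrative helper, not a Keras function.

```python
def dense_parameter_count(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

layer_sizes = [784, 200, 200, 10]   # flattened input, two hidden layers, output
per_layer = [dense_parameter_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [157000, 40200, 2010] 199210
```

Most of the parameters sit in the first hidden layer, because it is connected to all 784 inputs.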

### VID 382 select the loss fn and the optimizer

In [8]:
# we define the optimizer we'd like to use
# the loss function
# and the metrics we are interested in obtaining at each iteration

# model.compile(optimizer,loss)-configures the model for training
# the string for optimizer are not case sensitive, so we can use small or capital letters

# In TensorFlow 2 there are 3 built-in variations of cross-entropy (CE) loss:
# BINARY CE - used when there is binary encoding
# CATEGORICAL CE - expects that you have one-hot encoded the targets
# SPARSE CATEGORICAL CE - expects integer class labels and takes care of the one-hot encoding for you

   
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
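
The difference between the categorical and the sparse categorical variants is only the format of the targets. The following NumPy sketch, with made-up probabilities and hypothetical function names, shows that both give the same loss value for the same predictions; it is an illustration of the formula, not the Keras implementation.

```python
import numpy as np

# Made-up softmax outputs for 3 samples and 3 classes
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.2, 0.3, 0.5]])

integer_targets = np.array([0, 1, 2])        # format expected by the sparse variant
one_hot_targets = np.eye(3)[integer_targets]  # format expected by the categorical variant

def categorical_crossentropy(one_hot, probs):
    return -np.mean(np.sum(one_hot * np.log(probs), axis=1))

def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the correct class for each sample
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

loss_dense = categorical_crossentropy(one_hot_targets, probabilities)
loss_sparse = sparse_categorical_crossentropy(integer_targets, probabilities)
print(loss_dense, loss_sparse)
```

Since MNIST labels arrive as integers from 0 to 9, the sparse variant saves us from building the one-hot targets ourselves.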

### VID 383 MNIST: LEARNING
### Training

In [9]:
# Here we train the model we have built

# determine the maximum number of epochs

NUM_EPOCHS = 5

# we fit the model, specifying the
# training data
# the total number of epochs
# and the validation data we created ourselves (a dataset holding a single batch of (inputs, targets))

model.fit(train_data, epochs=NUM_EPOCHS, validation_data=validation_data, verbose=2)

# WHAT HAPPENS INSIDE AN EPOCH
# 1. At the beginning of each epoch, the training loss will be set to 0
# 2. The algorithm will iterate over a preset number of batches, all from train_data
# 3. The weights and biases will be updated as many times as there are batches
# 4. We will get a value for the loss function, indicating how the training is going
# 5. We will also see a training accuracy
# 6. At the end of the epoch, the algorithm will forward propagate the whole validation set

# When we reach the maximum number of epochs the training will be over


Epoch 1/5
540/540 - 103s - loss: 0.2725 - accuracy: 0.9194 - val_loss: 0.1300 - val_accuracy: 0.9623
Epoch 2/5
540/540 - 18s - loss: 0.1030 - accuracy: 0.9691 - val_loss: 0.0884 - val_accuracy: 0.9745
Epoch 3/5
540/540 - 14s - loss: 0.0691 - accuracy: 0.9781 - val_loss: 0.0725 - val_accuracy: 0.9782
Epoch 4/5
540/540 - 17s - loss: 0.0531 - accuracy: 0.9829 - val_loss: 0.0481 - val_accuracy: 0.9873
Epoch 5/5
540/540 - 13s - loss: 0.0381 - accuracy: 0.9879 - val_loss: 0.0477 - val_accuracy: 0.9867


<tensorflow.python.keras.callbacks.History at 0x1d5879c3390>

In [10]:
# first we have the epoch number (Epoch 1/5)
# 2nd is the number of batches (540/540)
# 3rd is the time it took the epoch to conclude
# 4th is the training loss. It should be compared to the training loss of the other epochs; in our case it keeps decreasing.
# notice that it didn't change too much, because even after the first epoch we had already made 540 different weight and bias updates,
# one for each batch
# next is the ACCURACY. It shows in what percent of the cases our outputs were equal to the targets. Logically, it follows the trend
# of the loss, as they both represent how well the outputs match the targets
# lastly come the LOSS and ACCURACY for the VALIDATION dataset. Usually we keep an eye on the validation loss (or set an early stopping mechanism)
# to determine whether our model is overfitting
# the validation accuracy = TRUE ACCURACY OF THE MODEL for the epoch. This is because the training accuracy is THE AVERAGE ACCURACY
# ACROSS BATCHES, WHILE THE VALIDATION ACCURACY IS THAT OF THE WHOLE VALIDATION SET
# To assess the overall accuracy of our model, we look at the validation accuracy of the last epoch (about 98% in the run above).
# this is a good result already

# let's fiddle with some hyperparameters like the hidden_layer_size. Increasing the hidden_layer_size caused the accuracy
# of the validation set to increase, and the training time increased too

# increasing the hidden_layer_size to 200 caused the accuracy to increase and the training time to drop a bit, from 31 to 14 sec
# WIDTH IS THE HIDDEN LAYER SIZE
# DEPTH IS THE NUMBER OF HIDDEN LAYERS

### VID 385 TESTING THE MODEL

In [11]:
# we got about 98% validation accuracy. We still have to test the model on the test dataset, because the final accuracy of the model
# comes from forward propagating the test dataset. The reason is that we may have overfit the validation set by trying different
# combinations of hyperparameters. Therefore, to check the accuracy of the model we use a dataset the model has not seen before (the test dataset)

In [12]:
# we can access the test accuracy using the method evaluate. 
# model.evaluate() returns the loss value and metrics values for the model in "test mode"

test_loss, test_accuracy = model.evaluate(test_data)

      1/Unknown - 3s 3s/step - loss: 0.0726 - accuracy: 0.9776

### NOTE: the main aim of the test dataset is to simulate model deployment. If we get 50 to 60% accuracy, we will
### know for sure that our model has overfit and that it will fail miserably in real life. Getting a value very close to the validation
### accuracy shows we have not overfit. The test accuracy is the accuracy we expect the model to produce
### when deployed in the real world

In [36]:
# We can apply some nice formatting if we want to

print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

# After we test the model, conceptually we are not allowed to change it. If we start changing the model after this point, the test
# data will no longer be a dataset the model has never seen


Test loss: 0.32. Test accuracy: 81.25%


## SECTION 53 DEEP LEARNING BUSINESS CASE EXAMPLE
## VID 389

### Problem

#### You are given data from an Audiobook App. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made a purchase at least once, that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company
    
#### The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.
    
#### You have a .csv summarizing the data. There are several variables: Customer ID, Book length overall (sum of the minute length of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

#### These are the inputs (excluding customer ID, as it is completely arbitrary. It's more like a name, than a number).

#### The targets are a Boolean variable (0 or 1). We are taking a period of 2 years in our inputs, and the next 6 months as targets. So, in fact, we are predicting if: based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

#### The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again.
#### This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

### Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. 
### Save the newly created sets in a tensor friendly format (e.g. *.npz). Since we are dealing with real life data, we will need to preprocess it a bit. This is the relevant code, which is not that hard, but is crucial to creating a good model.
    
###   If you want to know how to do that, go through the code with comments. In any case, this should do the trick for most datasets organized in the way: many inputs, and then 1 cell containing the targets (supervised learning datasets). Keep in mind that a specific problem may require additional preprocessing.
    
### Note that we have removed the header row, which contains the names of the categories. We simply want the data

## Extract the data from the csv

## Note that you can use this same code to preprocess any dataset that has two classes

In [37]:
import numpy as np
from sklearn import preprocessing
# We will use the sklearn preprocessing library, as it will be easier to standardize the data

In [38]:
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

In [39]:
raw_csv_data

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information) and the last one [:,-1] (which is our targets)

array([[9.9400e+02, 1.6200e+03, 1.6200e+03, ..., 5.0000e+00, 9.2000e+01,
        0.0000e+00],
       [1.1430e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [2.0590e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 3.8800e+02,
        0.0000e+00],
       ...,
       [3.1134e+04, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [3.2832e+04, 1.6200e+03, 1.6200e+03, ..., 0.0000e+00, 9.0000e+01,
        0.0000e+00],
       [2.5100e+02, 1.6740e+03, 3.3480e+03, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00]])

In [40]:
unscaled_inputs_all = raw_csv_data[:,1:-1] # this excludes customer ID column and target column

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1] # this outputs only the target columns
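
The column slicing can be tried on a tiny in-memory CSV. The three rows below are made up (an ID column, two input columns, one target column) and stand in for `Audiobooks_data.csv`; the slicing expressions are the same as above.

```python
import io
import numpy as np

# Hypothetical CSV content: customer ID, two inputs, target
csv_text = "101,1620,5.33,0\n102,2160,8.00,1\n103,648,5.33,0\n"
raw = np.loadtxt(io.StringIO(csv_text), delimiter=',')

unscaled_inputs = raw[:, 1:-1]   # drop the first (ID) and last (target) columns
targets = raw[:, -1]             # keep only the last column

print(raw.shape, unscaled_inputs.shape, targets.shape)
```

The inputs stay two-dimensional (one row per customer), while the targets become a one-dimensional array.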

## Balance the dataset

In [41]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a balanced dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

# Count the number of targets that are 0.
# Once there are as many 0s as 1s, mark every further entry where the target is 0 (these are the indices of the data points to be removed).
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# append() is a method that adds (appends) an object to a list

# THE VARIABLE INDICES_TO_REMOVE WILL CONTAIN THE INDICES OF ALL TARGETS WE WON'T NEED, AND DELETING THEM WILL BALANCE THE DATASET


In [42]:
# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
   
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
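
The same balancing rule can be written without an explicit loop. This is an alternative sketch on a made-up target array, not the code used above: it keeps the first zeros (as many as there are ones) and marks the rest for removal.

```python
import numpy as np

# Made-up targets: 2 ones and 6 zeros
targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])
inputs = np.arange(len(targets) * 2).reshape(len(targets), 2)

num_one_targets = int(np.sum(targets))
zero_indices = np.flatnonzero(targets == 0)          # positions of all zeros
indices_to_remove = zero_indices[num_one_targets:]   # zeros beyond the first num_one_targets

balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)
print(indices_to_remove, balanced_targets)
```

After the deletion the two classes have the same number of samples, and the inputs and targets are still aligned row by row.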

## Standardize (scale) the inputs

In [43]:
# this is the only place we use sklearn functionality. We will take advantage of its preprocessing capabilities
# standardizing the inputs will greatly improve the algorithm
# At the end of the business case, you can try to run the algorithm WITHOUT this line of code.
# The result will be interesting.

scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)

# preprocessing.scale(x) is a function that standardizes the dataset along each variable (column)
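
Standardization itself is a short formula: subtract each column's mean and divide by its standard deviation. The NumPy sketch below illustrates this on made-up values; it assumes no column is constant (which would cause a division by zero) and is meant to show the idea rather than reproduce every detail of `preprocessing.scale`.

```python
import numpy as np

# Made-up inputs: 4 samples, 3 variables on very different scales
unscaled = np.array([[1620.0, 5.33, 0.0],
                     [2160.0, 8.00, 1.0],
                     [648.0, 5.33, 0.0],
                     [1134.0, 10.13, 1.0]])

def standardize(x):
    """Give every column zero mean and unit standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

scaled = standardize(unscaled)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, a variable measured in minutes and a 0/1 flag have comparable magnitudes, so no single input dominates the weighted sums inside the network.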

## Shuffle the data

In [44]:
# A little trick is to shuffle the inputs and the targets (since we will be batching). We keep the same information but in a random order

# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.

# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# np.random.shuffle(x) is a method that shuffles the numbers in a given sequence

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
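
The reason for shuffling indices rather than the arrays themselves is that one permutation can then be applied to both arrays, so every input row keeps its own target. A small sketch with made-up data follows; it uses `rng.permutation` from NumPy's `Generator` API instead of `np.random.shuffle`, purely to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)

inputs = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
targets = np.arange(5) * 10            # the target of row i is 10*i

shuffled_indices = rng.permutation(len(inputs))
shuffled_inputs = inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]

print(shuffled_indices)
print(shuffled_inputs[:, 0] // 2 * 10 == shuffled_targets)
```

Shuffling the two arrays separately would break this pairing and make the targets meaningless.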

## Split the dataset into train, validation, and test

In [45]:
# Count the total number of samples

samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, we want to make sure the numbers are integers

train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data

test_samples_count = samples_count - train_samples_count - validation_samples_count


# Create variables that record the inputs and targets for training

# In our shuffled dataset, they are the first "train_samples_count" observations

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test
# They are everything that is remaining

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
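
The split sizes are simple arithmetic and can be checked on their own. In the sketch below, `split_counts` is an illustrative helper, and 4474 is the total implied by the counts printed in the next cell (3579 + 447 + 448).

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes whatever is left."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))  # (3579, 447, 448)
```

Because `int()` rounds down, giving the remainder to the test set ensures that no sample is lost.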

In [46]:
# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code, 
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole sheet, the npzs will be overwritten with your newly preprocessed data.
    
# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1767.0 3579 0.49371332774518023
231.0 447 0.5167785234899329
239.0 448 0.5334821428571429


## Save the three datasets in .npz

In [47]:
# Save the three datasets in .npz.
    
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
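
A quick round trip shows how the keywords passed to `np.savez` become the keys used when loading. The sketch below writes a tiny made-up dataset to a temporary folder (so it does not touch the Audiobooks files) and reads it back.

```python
import os
import tempfile
import numpy as np

inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_data')   # np.savez appends the .npz extension
    np.savez(path, inputs=inputs, targets=targets)

    with np.load(path + '.npz') as npz:
        saved_keys = sorted(npz.files)
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(saved_keys, loaded_inputs.shape, loaded_targets)
```

The targets were saved as floats, so they are cast to integers after loading, which is the form an integer-label loss expects.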

## VID 391 BUSINESS CASE: LOAD THE PREPROCESSED DATA
## The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again.
## This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s

In [48]:
# Our deep learning net has 10 input units (i.e. the inputs from our csv) and 2 output nodes, as there are only 2 possibilities: 0s and 1s
# we will build a net with 2 hidden layers; the number of units in each layer will be 50 (but we can change it), as 50 provide
# enough complexity, so we expect the algorithm to be much more sophisticated than a linear or logistic regression, and we don't want
# to put too many units initially, so we can complete the learning as fast as possible

## Create the machine learning algorithm

In [49]:
# import necessary library
import numpy as np
import tensorflow as tf

## DATA

In [50]:
# let's create a temporary variable npz, where we will store each of the three Audiobooks datasets

npz = np.load('Audiobooks_data_train.npz')

# we extract the inputs using the keyword under which we saved them
# to ensure that they are all floats, let's also take care of that
# (the np.float and np.int aliases were removed in NumPy 1.24, so we use explicit types)
train_inputs = npz['inputs'].astype(np.float64)

# np.ndarray.astype() creates a copy of the array, cast to a specific type

# targets must be int because of sparse_categorical_crossentropy (we want to be able to smoothly one-hot encode them)
train_targets = npz['targets'].astype(np.int64)

# we load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')

# we can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

# we load the test data in the temporary variable
npz = np.load('Audiobooks_data_test.npz')

# we create 2 variables that will contain the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

# Model
### Outline, optimizers, loss, early stopping and training

In [51]:
# Set the input and output sizes
input_size = 10
output_size = 2

# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

# tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
# it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
# the final layer is no different, we just make sure to activate it with softmax

# define how the model will look like

model = tf.keras.Sequential([ 
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
                            tf.keras.layers.Dense(output_size, activation='softmax') # output layer
                            ])

### The Flatten layer

The Flatten layer is primarily used to convert multi-dimensional data into a one-dimensional array. This transformation is crucial when transitioning from convolutional or recurrent layers to fully connected (dense) layers in neural networks. The model above has no Flatten layer because each observation is already a vector of 10 inputs.

Transitioning from convolutional layers to dense layers:

Convolutional layers, commonly used in image processing and computer vision, extract features from images in the form of multi-dimensional feature maps.
Dense layers, however, require one-dimensional input vectors.
The Flatten layer acts as a bridge between these two types of layers by reshaping the multi-dimensional feature maps into a single, long vector.

Simplifying data for dense layers:

Dense layers operate on vectors of features. By flattening the input, we remove any spatial or temporal structure present in the original data and convert it into a format suitable for processing by densely connected neurons.

Effect on model complexity:

Flattening does not change the number of values; it only rearranges them into a vector. The number of parameters in the following dense layer grows with the length of that vector, so large feature maps lead to large dense layers, which can make overfitting more likely in deep networks with many parameters.


In [52]:
### Choose the optimizer and the loss function
# we define the optimizer we'd like to use
# the loss function and the metrics we are interested in obtaining at each iteration

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# sparse_categorical_crossentropy is used to ensure that our integer targets are one-hot encoded appropriately when calculating
# the loss

## Training

In [53]:
# That's where we train the model we have built
# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2) # making sure we get enough information about the training process

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 2s - loss: 0.5854 - accuracy: 0.6843 - val_loss: 0.5219 - val_accuracy: 0.7383
Epoch 2/100
3579/3579 - 0s - loss: 0.4702 - accuracy: 0.7550 - val_loss: 0.4542 - val_accuracy: 0.7718
Epoch 3/100
3579/3579 - 0s - loss: 0.4225 - accuracy: 0.7784 - val_loss: 0.4232 - val_accuracy: 0.7651
Epoch 4/100
3579/3579 - 0s - loss: 0.3986 - accuracy: 0.7832 - val_loss: 0.3919 - val_accuracy: 0.8031
Epoch 5/100
3579/3579 - 0s - loss: 0.3812 - accuracy: 0.7907 - val_loss: 0.3832 - val_accuracy: 0.7696
Epoch 6/100
3579/3579 - 0s - loss: 0.3720 - accuracy: 0.7955 - val_loss: 0.3634 - val_accuracy: 0.8277
Epoch 7/100
3579/3579 - 0s - loss: 0.3648 - accuracy: 0.8011 - val_loss: 0.3553 - val_accuracy: 0.8031
Epoch 8/100
3579/3579 - 0s - loss: 0.3546 - accuracy: 0.8080 - val_loss: 0.3483 - val_accuracy: 0.8188
Epoch 9/100
3579/3579 - 0s - loss: 0.3541 - accuracy: 0.8044 - val_loss: 0.3566 - val_accuracy: 0.7919
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x1d59aa02080>

### WHEN YOU TRAIN FOR TOO LONG, THERE IS A HIGH CHANCE OF OVERFITTING. THIS IS EVIDENT WHEN THE TRAINING LOSS KEEPS DECREASING
### WHILE THE VALIDATION LOSS SOMETIMES INCREASES. A FIX IS TO SET AN EARLY STOPPING MECHANISM
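
The symptom described above can be checked mechanically. This is a small sketch using the loss values from the first nine epochs of the training log above; `diverging_epochs` is a hypothetical helper written for this illustration.

```python
# Per-epoch losses copied from the first nine epochs of the training log above
train_loss = [0.5854, 0.4702, 0.4225, 0.3986, 0.3812, 0.3720, 0.3648, 0.3546, 0.3541]
val_loss = [0.5219, 0.4542, 0.4232, 0.3919, 0.3832, 0.3634, 0.3553, 0.3483, 0.3566]


def diverging_epochs(train, val):
    """Return 1-based epochs where training loss fell but validation loss rose."""
    flagged = []
    for i in range(1, len(val)):
        if train[i] < train[i - 1] and val[i] > val[i - 1]:
            flagged.append(i + 1)
    return flagged


print(diverging_epochs(train_loss, val_loss))  # [9]
```

A single flagged epoch is not proof of overfitting, since validation loss is noisy, but repeated flags late in training are the pattern to watch for.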

### VID 394 BUSINESS CASE. SETTING AN EARLY STOPPING MECHANISM WITH TENSORFLOW
### THE FIT METHOD TAKES AN ARGUMENT CALLED CALLBACKS
THERE ARE DIFFERENT TYPES OF CALLBACKS, BUT WE WILL USE "EARLY STOPPING". IT IS CALLED AT CERTAIN POINTS DURING TRAINING.
EACH TIME THE VALIDATION LOSS IS CALCULATED, IT IS COMPARED TO THE VALIDATION LOSS ONE EPOCH AGO. IF IT STARTS INCREASING, THE MODEL IS LIKELY OVERFITTING AND WE SHOULD STOP TRAINING.

THE EARLY STOPPING MECHANISM IS A HYPERPARAMETER

## setting an early stopping mechanism and repeating the TRAINING

In [54]:
# That's where we train the model we have built
# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# with no arguments, EarlyStopping monitors val_loss and uses the default patience=0
early_stopping = tf.keras.callbacks.EarlyStopping()
                                                  
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are objects that Keras calls at set points during training (e.g. the end of each epoch)
          # here the callback checks whether val_loss has stopped improving
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2) # making sure we get enough information about the training process

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 0s - loss: 0.3098 - accuracy: 0.8212 - val_loss: 0.3238 - val_accuracy: 0.8166
Epoch 2/100
3579/3579 - 0s - loss: 0.3133 - accuracy: 0.8212 - val_loss: 0.3188 - val_accuracy: 0.8389
Epoch 3/100
3579/3579 - 0s - loss: 0.3101 - accuracy: 0.8296 - val_loss: 0.3186 - val_accuracy: 0.8121
Epoch 4/100
3579/3579 - 0s - loss: 0.3138 - accuracy: 0.8282 - val_loss: 0.3052 - val_accuracy: 0.8434
Epoch 5/100
3579/3579 - 0s - loss: 0.3074 - accuracy: 0.8310 - val_loss: 0.3101 - val_accuracy: 0.8434


<tensorflow.python.keras.callbacks.History at 0x1d5915a0780>

#### SOMETIMES, IF THE VALIDATION LOSS HAS INCREASED BY AN INSIGNIFICANT AMOUNT, WE MAY PREFER TO LET ONE OR TWO
#### VALIDATION INCREASES SLIDE. TO ALLOW FOR THIS TOLERANCE, WE CAN ADJUST THE EARLY STOPPING OBJECT THROUGH AN ARGUMENT CALLED PATIENCE, WHICH BY DEFAULT IS SET TO ZERO
#### tf.keras.callbacks.EarlyStopping(patience=n) configures the early stopping mechanism of the algorithm, while patience lets us
#### decide how many epochs without an improvement in the validation loss we can tolerate
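
The effect of patience can be sketched in plain Python. The rule below is a simplified stand-in that happens to reproduce the stopping points in this section's two early-stopping logs; it is an assumption made for illustration, not the exact Keras implementation, and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience=0):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: remember the best validation loss seen so far and count
    the epochs since it last improved; stop once that count reaches `patience`
    (or at the first non-improving epoch when patience is 0).
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses copied from the training logs in this section
default_run = [0.3238, 0.3188, 0.3186, 0.3052, 0.3101]
patience_two_run = [0.3135, 0.3151, 0.3099, 0.3028, 0.3153, 0.3093]

print(stopping_epoch(default_run, patience=0))       # 5
print(stopping_epoch(patience_two_run, patience=2))  # 6
```

Note that in the second run the loss at epoch 6 is lower than at epoch 5, yet training still stops: what counts is improvement over the best value so far (epoch 4), not over the previous epoch.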

## setting an early stopping mechanism with "patience" and repeating the TRAINING

In [55]:
# That's where we train the model we have built
# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
                                                  
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are objects that Keras calls at set points during training (e.g. the end of each epoch)
          # here the callback checks whether val_loss has stopped improving
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2) # making sure we get enough information about the training process

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 0s - loss: 0.3095 - accuracy: 0.8273 - val_loss: 0.3135 - val_accuracy: 0.8367
Epoch 2/100
3579/3579 - 0s - loss: 0.3099 - accuracy: 0.8284 - val_loss: 0.3151 - val_accuracy: 0.8389
Epoch 3/100
3579/3579 - 0s - loss: 0.3072 - accuracy: 0.8310 - val_loss: 0.3099 - val_accuracy: 0.8389
Epoch 4/100
3579/3579 - 0s - loss: 0.3083 - accuracy: 0.8296 - val_loss: 0.3028 - val_accuracy: 0.8367
Epoch 5/100
3579/3579 - 0s - loss: 0.3074 - accuracy: 0.8265 - val_loss: 0.3153 - val_accuracy: 0.8300
Epoch 6/100
3579/3579 - 0s - loss: 0.3081 - accuracy: 0.8310 - val_loss: 0.3093 - val_accuracy: 0.8367


<tensorflow.python.keras.callbacks.History at 0x1d59a9bbb38>

In [56]:
### TAKE HOME POINTS
# 1. Our priors were 50% and 50%, so our algorithm definitely learnt a lot. The final validation accuracy of the model is around 81%:
# it managed to classify around 81% of customers correctly. So if we are given 10 customers, we will be able to correctly
# identify the future customer behaviour of about 8 of them
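
The comparison against the priors can be made explicit with a majority-class baseline. The sketch below uses a hypothetical balanced target list standing in for the real targets, and the 81% figure quoted in the take-home point above.

```python
from collections import Counter

# Hypothetical balanced binary targets standing in for the real ones
targets = [0, 1] * 500

# Accuracy of a model that always predicts the most frequent class
majority_count = Counter(targets).most_common(1)[0][1]
baseline_accuracy = majority_count / len(targets)

# Validation accuracy quoted in the take-home point above
model_accuracy = 0.81

print(baseline_accuracy)                             # 0.5
print(round(model_accuracy - baseline_accuracy, 2))  # 0.31
```

With balanced classes the baseline is 50%, so the gap to the model's accuracy is what the network actually learnt.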

### TEST THE MODEL

In [57]:
# It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.
# The test is the absolute final instance. You should not test before you are completely done with adjusting your model.
# If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

# RECALL THAT EVALUATE RETURNS THE LOSS AND EVERY OTHER METRIC WE HAVE REQUESTED IN OUR MODEL OUTLINE



In [58]:
# let's print with some nice formatting
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

# this is the final accuracy of the model, and naturally it is close to the validation accuracy, as we did not fiddle too much
# with the hyperparameters
# NOTE that you can sometimes get a test accuracy higher than the validation accuracy by luck, but theoretically it should be
# lower than or equal to the validation accuracy


Test loss: 0.30. Test accuracy: 84.38%


# tips to improve an algorithm
1. Improve the preprocessing.
2. Fine-tune the model by adjusting its width (units per hidden layer) and depth (number of hidden layers). Increasing either is computationally expensive, but it can bring great improvement.
3. Play around with the activation functions.
4. Fiddle with the batch size.
   - A batch size of 1 is SGD: the algorithm learns quickly but not so accurately.
   - Apply a batch size that is likely to preserve the underlying dependencies.
5. Experiment with the learning rate and the optimizer, although this may not be as fruitful, since Adam adapts the learning rate dynamically.
   - Visit the TensorFlow website and check out the other optimizers, comparing their performance to that of Adam for your specific problem.
   - Alternatively, check out tf.contrib, where the TensorFlow community contributed to the framework (TensorFlow 1.x only; it was removed in TensorFlow 2.0).
6. Source data elsewhere, e.g. a Kaggle dataset, and apply this strategy to solve a given problem.
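
To make the batch-size tip more tangible, this short sketch counts how many weight updates one epoch produces for a few batch sizes. It uses the training-set size reported in the logs above; `updates_per_epoch` is a hypothetical helper for this illustration.

```python
import math

# Training-set size reported in the training logs above
train_samples = 3579


def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


for batch_size in (1, 100, train_samples):
    print(batch_size, updates_per_epoch(train_samples, batch_size))
# 1 3579
# 100 36
# 3579 1
```

A batch size of 1 gives the most updates per epoch, each based on a single noisy sample, while one full batch gives a single, smoother update; the batch size of 100 used in this notebook sits in between.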