# Deep Neural Network for MNIST

In [1]:
import numpy as np
import tensorflow as tf

# use tensorflow data provider of MNIST

import tensorflow_datasets as tfds

In [2]:
# as_supervised can load the dataset in a two tuple structure input and target
# with_info cna provides a tuple containing info about version, features and number of samples of the dataset

mnist_dataset, mnist_info = tfds.load(name = 'mnist', with_info = True, as_supervised = True)



In [3]:
# extract the train and test dataset
# by default tensorflow has training and testing datasets but no validation datasets
# that's one of the more irritating properties of the tensorflow datasets module
# but in fact it gives us the opportunity to actually practice splitting datasets on our own
# the train dataset is much bigger than the test one
# so it will take the validation data from the train dataset

mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

# take an arbitrary percentage of the train dataset to serve as validation
# it should start by setting the number of validation samples
# it can extract the number of samples by writting mnist_info.splits['train'].num_examples

num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples

# get the validation samples equal the number of training samples divided by 10
# but we are not sure this will be an integer though which is not really a possible number of validation samples
# to solbe this issue effortlessly, can overwrite the numbe of validation samples variables
# use this method will cast the value of stored in the number of validation samples variable to an integer
# thereby preventing any potential issues

num_validation_samples = tf.cast(num_validation_samples, tf.int64)

# store the number of test samples and dedicated variable

num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)

# normally, we'd like to scale data in some way to make the result more numerically stable
# so it will simply prefer to have inputs between 0 and 1
# define a function that will scale the inputs called scale

# as a precaution, let's make sure all values are floats
# next proceed by scaling it, the mnist images contain values from 0 to 255
# it representing the 256 of gray, so divided each element by 255, we'll get the desired result
# all element will be between 0 and 1
# the . at the end signifies that we want a result to be a float

def scale(image, label):
    
    image = tf.cast(image, tf.float32)
    image /= 255.
    
    return image, label

# there is a tensorflow method called map which allow us to apply a custom transformation to a given dataset
# moreover this map can only apply transformation that can take an input and a label and return an input and a label
# already decided we will take the validation data from train
# this will scale the whole train dataset and store it in our new variable

scale_train_and_validation_data = mnist_train.map(scale)

test_data = mnist_test.map(scale)

# will shuffle the data and then creat the validation dataset
# shuffle is mean keeping the same infromation but in a different order

# it's possible that the targets are stored in ascending order resulting in the first batches haveing 0 targets and the other batches having only 1 targets
# since we'll be matching, we'd better shuffle the data, it should be randomly spread as possible so that matching works as intended

# imagine the data is ordered and we have 10 batches, each batch contains only given digit
# so the first batch has only 0, the second has only 1 etc
# it will confuse the stochastic gradient descent algorithm
# becuase each batch is homogenous inside it but completely different from all other batches causing the loss to differ greatly
# in other word, the data should be shuffled

# start by defining a buffer size
# this buffer size parameter is used in cases when we dealing with enormous datasets
# in such cases, we can't shuffle the whole dataset in one go, because we can't possibly fit it all in the memory of the computer
# so instead we must instruct tensorflow to take samples 10000 at time, shuffle them and then take the next 10000

# if buffer_size is 1, there no shuffling will actually happen
# if buffer_size is equal or bigger than the total number of samples, shuffling will take place at once and shuffle them uniformly
# if a buffer_size is between the 1 and the total sample size, it will be optimizing the computational power

BUFFER_SIZE = 10000

# there is shuffle method readily available and we just need to specify the buffer_size

shuffled_train_and_validation_data = scale_train_and_validation_data.shuffle(BUFFER_SIZE)

# once we have scaled and shuffle the data
# we can proceed to actually extracting the train and validation datasets
# our validation data will be equal to 10% of the training set, which we have already calcultated and stored in num_validation_samples
# we can use the method take to extract that many samples

validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

# in the same way, we can create the train data by extracting all element but the first X validation samples

train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

# using batching gradient descent to train this model
# this is the most efficient way to perform deep learning as the tradeoff accuracy and speed is optimal
# to do that we must set a batch size and prepare the data for batching

# the batch size is 1, is the stochastic gradient descent
# the batch size is the nuber of samples, is the singel batch gradient descent
# the batch size is between the 1 and the total sample size, is the mini-batch gradient descent

BATCH_SIZE = 100

# there is a method batch we can use on the dataset to combine its consecutive elements in the batches
# add a new column to tensor that would indicate to the model how many samples it should take in each batch

train_data = train_data.batch(BATCH_SIZE)

# what about the validation data
# since we won't be backpropagation on the validation data
# but only forward propagating, we don't really need to batch
# the batching was useful in updating weights only once per batch, which is like 100 samples rather than at every sample, hence reducing noise in the training updates
# so whenever we validate or test we simply forward propagate once
# when batching we usually find the average loss and average accuracy
# during validation and testing we want the exact values, therefore we should take all the data at once
# moreover when forward propagating we don's use that much computational power so it's not expensive to calculate the exact values
# however the model expects our validation set in batch form too
# create a new column in tensor indicating that the model should take the whole validation dataset at once when it utilizes it 

validation_data = validation_data.batch(num_validation_samples)

test_data = test_data.batch(num_test_samples)

# finally our validation data must have the same shape and object properties as the train and test data
# the mnist data is iterable and in 2-tuple format
# therefore we must extract and convet the validation inputs and targets appropriately

# iter is the python syntax for making the validation data and iterator
# by the default it will make the dataset iterable but will not load any data
# next loads the next batch
# since there is only one batch it will load the inputs and the targets

validation_inputs, validation_targets = next(iter(validation_data))

#### Model

There are 784(28 * 28) inputs, that's this model's input layer, this model also have 10 outputs, one for each digit, so that's this model's output layer and this model will also have two hidden layers, consisting of 50 nodes each, the model's width and depth of the net are hyperparemeters.

In [4]:
input_size = 784
output_size = 10

# the underlying assumption is that all hidden layers are of the same size
# alternatively can create hidden layers with different width and see if they work better for this model

hidden_layer_size = 50

# define the actual model

# the first layer is the input layer
# each observation is 28 by 28 by 1 or a tensor of rank 3
# we need to flatten the image into a vector, this is a common operation in deep learning
# so there is dedicated method called flatten
# flatten is a part of the layers module and takes its argument the shape of object want to flatten
# this method transform it or more specifically flattens it into a vector

# prepare the data for feed forward neural network
# use the Dense to build each consecutive layer
# the Dense method was basically finding th dot product of the inputs and weights and adding the bias
# this also where can apply an activation function
# getting from the inputs to the first hidden layer 
# therefore the output of the first mathematical operation will have the shape of the first hidden layer

# the second hidden layer is the same of the first hidden layer

# the output layer is also use the Dense method
# and creating a classifier, the activation function of the output layer must transform the values into probabilies

model = tf.keras.Sequential([
                            tf.keras.layers.Flatten(input_shape = (28, 28, 1)),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(output_size, activation = 'softmax')
                            ])

#### Choose the optimizer and the loss function

In [5]:
# must specify the optimizer and the loss through the compile method we call on the model object
# start by specifying the optimizer, one of the best choices is the adaptive moment estimation
# the strings are not case sensitive, so can capitalize the first letter of all letter if you wish

# like to employ a loss that is used for classifier, cross-entropy would normally be the first choice
# however there are different types of cross-entropy in tensorflow
# there are three built in variations of cross-entropy loss
# binary_crossentropy refers to the case where we've got binary encoding
# categorical_crossentropy expects that you've one-hot encoded the targets
# sparse_categorical_crossentropy applies one-hot encoding

# can include metrics that we wish to calculate throughout the training and testing processes
# typically that's the accuracy

model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

#### Training

In [6]:
# create a variable storing the number of epochs that we wish to train for
# this number is arbitrary

NUM_EPOCHS = 5

# in tensorflow 2, need to determine validation steps when feeding validation data to the fit function
# batch size

VALIDATION_STEPS = num_validation_samples

# next it can fit the model
# similar to the tensorflow intro, it use the fit method

# first specify the data 
# second set the number of epochs

# need to validate
# have to do is included as an argument that same method

# set verbose 2, make sure receive only the most important information for each epochs

model.fit(train_data, epochs = NUM_EPOCHS, validation_data = (validation_inputs, validation_targets), validation_steps = VALIDATION_STEPS, verbose = 2)


# what happens inside an epoch
# 1. at the beginning of each epoch, the training loss will be set to 0
# 2. the algorithm will iterate over a preset number of batches, all from the train_data, essentially the whole training set will be utilized but in batches
# 3. the weights and biases will be updated as many time as there are batches
# 4. it will get a value for the loss function, indicating how the training is going
# 5. it will also see a training accuracy
# 6. at the end of the epoch, the algorithm will forward propagate the whole validation set in a single batch through the optimized model and calculate the validation accuracy
# when we reach the maximum number of epochs the training will be over


# the output have several line
# first it have information about the number of the epoch
# next it got the number of batches, it says 540/540 because if we had a progress bar that would fill out gradually
# the third of information is the time it took for the epoch to conclude
# next it can see the training loss, it should be compared to the training loss across epoch, in this case it is mostly decreasing
# the loss didn't change too much, because even after the first epoch, we've already had 540 different weight and bias updates one for each batch
# what follow is the accuracy, the accuracy showa and what % of the cases our outputs were equal to the targets
# logically it follows the trend of the loss, after all they both represent how well the outputs match the targets
# finally we've got the loss and the accuracy for the validation dataset, this is our check
# we usually keep an eye on the validation loss(or set early stopping mechanisms) to determine whether the model is overfitting
# the val_accuracy means the validation accuracy is the true accuracy of the model for the epoch
# this is because the training accuracy is the average accuracy across batches
# while the validation accuracy is that of the whole validation set

Epoch 1/5
540/540 - 115s - loss: 0.4087 - accuracy: 0.8864 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/5
540/540 - 111s - loss: 0.1816 - accuracy: 0.9480 - val_loss: 0.1676 - val_accuracy: 0.9523
Epoch 3/5
540/540 - 114s - loss: 0.1409 - accuracy: 0.9588 - val_loss: 0.1482 - val_accuracy: 0.9585
Epoch 4/5
540/540 - 110s - loss: 0.1178 - accuracy: 0.9656 - val_loss: 0.1166 - val_accuracy: 0.9700
Epoch 5/5
540/540 - 111s - loss: 0.0993 - accuracy: 0.9703 - val_loss: 0.1059 - val_accuracy: 0.9707


<tensorflow.python.keras.callbacks.History at 0x141543710>

#### Test the model

It must still test the model on the test dataset because the final accuracy of the model comes from forward propagating the test dataset not the validation.
___

We train on the training data and then validate on the validation data, validate on the validation data make sure our parameters, the weights and the bias dont's overfit.

Once we train our first model though we fiddle with the hyperparameters, normally we won't change only the width of the hidden layers, we can adjust the depth, the learning rate, the batch size, the activation functions for each layer and so on.

Each time we make a change, we run the model once more and check out the validation accuracy improved, after 10 to 20 different combinations, we may reach a model without standing validation accuracy, in essence we are trying to find the best hyperparameters, but what we find are not the best hyperparameters in general.

These are the hyperparameters that fit our valudation dataset best, basically by fine tuning them, we are overfitting the validation dataset.

During the training stage, we can overfit the parameters or the weights and biases, the validation dataset it our reality check that prevents us from overfitting the parameters.

After fiddling with the hyperparameters we can overfit the validation dataset, as we are considering the validation accuracy as a benchmark for how good the model is.

The test dataset is out reality check that prevents us from overfitting the hyperparameters, such like width, depth, batch size, epochs and so on.

The test dataset is the model has truly never seen.

In [8]:
# test the model then we can assess the test accuracy using the method evaluate
# we will be forward propagating the test data through the net
# there would be two outputs, the loss and the accuracy
# after we test the model, conceptually, we are no longer allowed to change it
# the main point of the test dataset is to simulate model deployment
# getting a test accuracy very close to the validation accuracy shows that we have not overfit
# finally the test accuracy is the accuracy we expect to observe if we deploy the model in the real world

test_loss, test_accuracy = model.evaluate(test_data)



In [9]:
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy * 100.))

Test loss: 0.11. Test accuracy: 96.48%
