## Deep Neural Network for MNIST Classification
The problem we've chosen is often called the "Hello World" of machine learning. The dataset is called MNIST and consists of images of handwritten digits. You can find more about it on the website of Yann LeCun (Director of AI Research, Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as convolutional networks.

The dataset provides 28x28 images of handwritten digits (one digit per image), and the goal is to write an algorithm that detects which digit is written. Since there are only 10 digits, this is a classification problem with 10 classes. To exemplify what we've talked about in this section, the lecture builds a network with 2 hidden layers between inputs and outputs; the version below has been extended to 10 hidden layers while working through the exercises.
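To make the data layout concrete, here is a minimal sketch of how one 28x28 image becomes a 784-element input vector and how a digit label becomes a one-hot target with 10 entries. The helper names `flatten_image` and `one_hot` are hypothetical and work on plain Python lists; the MNIST loader used in the notebook does the equivalent work for us.

```python
def flatten_image(image):
    """Turn a 2-D list of pixel rows into one flat list, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1.0 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


# A blank 28x28 image (all pixels 0.0) stands in for a real digit here.
blank_image = [[0.0] * 28 for _ in range(28)]
flat = flatten_image(blank_image)

print(len(flat))    # number of inputs per image
print(one_hot(3))   # target vector for the digit 3
```

This is where the values `input_size = 784` and `output_size = 10` in the model come from.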

## Import the relevant packages

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# data has been split into training, validation, and test
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
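The loader returns the data already split into training, validation, and test sets. As a rough sketch of what such a split involves, the hypothetical helper below holds out part of the training data for validation. The sizes used (60,000 original training examples, 5,000 held out) are assumptions for illustration; check `mnist.train.num_examples` and `mnist.validation.num_examples` for the actual values.

```python
def split_train_validation(examples, validation_size):
    """Hold out the first `validation_size` examples for validation."""
    if not 0 <= validation_size <= len(examples):
        raise ValueError("validation_size must be between 0 and len(examples)")
    validation = examples[:validation_size]
    train = examples[validation_size:]
    return train, validation


# Integers stand in for (image, label) pairs in this sketch.
all_training_examples = list(range(60000))
train, validation = split_train_validation(all_training_examples, 5000)

print(len(train), len(validation))
```

The test set is kept completely separate and is only used once, after training has finished.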


## Outline the model

In [None]:
input_size = 784
output_size = 10
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 5000

# Reset any variables left in memory from previous runs.
tf.reset_default_graph()

# As in the previous example - declare placeholders where the data will be fed into.
inputs = tf.placeholder(tf.float32, [None, input_size])
targets = tf.placeholder(tf.float32, [None, output_size])

# Weights and biases for the first linear combination between the inputs and the first hidden layer.
# Use get_variable in order to make use of the default TensorFlow initializer which is Xavier.
weights_1 = tf.get_variable("weights_1", [input_size, hidden_layer_size])
biases_1 = tf.get_variable("biases_1", [hidden_layer_size])

# Operation between the inputs and the first hidden layer.
# We've chosen ReLu as our activation function. You can try playing with different non-linearities.
outputs_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)

# Weights and biases for the second linear combination.
# This is between the first and second hidden layers.
weights_2 = tf.get_variable("weights_2", [hidden_layer_size, hidden_layer_size])
biases_2 = tf.get_variable("biases_2", [hidden_layer_size])

# Operation between the first and the second hidden layers. Again, we use ReLu.
outputs_2 = tf.nn.relu(tf.matmul(outputs_1, weights_2) + biases_2)


# Don't forget to change the shape of weights_3 and biases_3. 
# They used to lead to the output layer, but now they lead to the third hidden layer
weights_3 = tf.get_variable("weights_3", [hidden_layer_size, hidden_layer_size])
biases_3 = tf.get_variable("biases_3", [hidden_layer_size])

# Create the outputs_3 tensor, using ReLU once again.
outputs_3 = tf.nn.relu(tf.matmul(outputs_2,weights_3) + biases_3)

weights_4 = tf.get_variable("weights_4", [hidden_layer_size, hidden_layer_size])
biases_4 = tf.get_variable("biases_4", [hidden_layer_size])
outputs_4 = tf.nn.relu(tf.matmul(outputs_3,weights_4) + biases_4)

weights_5 = tf.get_variable("weights_5", [hidden_layer_size, hidden_layer_size])
biases_5 = tf.get_variable("biases_5", [hidden_layer_size])
outputs_5 = tf.nn.relu(tf.matmul(outputs_4,weights_5) + biases_5)

weights_6 = tf.get_variable("weights_6", [hidden_layer_size, hidden_layer_size])
biases_6 = tf.get_variable("biases_6", [hidden_layer_size])
outputs_6 = tf.nn.relu(tf.matmul(outputs_5,weights_6) + biases_6)

weights_7 = tf.get_variable("weights_7", [hidden_layer_size, hidden_layer_size])
biases_7 = tf.get_variable("biases_7", [hidden_layer_size])
outputs_7 = tf.nn.relu(tf.matmul(outputs_6,weights_7) + biases_7)

weights_8 = tf.get_variable("weights_8", [hidden_layer_size, hidden_layer_size])
biases_8 = tf.get_variable("biases_8", [hidden_layer_size])
outputs_8 = tf.nn.relu(tf.matmul(outputs_7,weights_8) + biases_8)

weights_9 = tf.get_variable("weights_9", [hidden_layer_size, hidden_layer_size])
biases_9 = tf.get_variable("biases_9", [hidden_layer_size])
outputs_9 = tf.nn.relu(tf.matmul(outputs_8,weights_9) + biases_9)

weights_10 = tf.get_variable("weights_10", [hidden_layer_size, hidden_layer_size])
biases_10 = tf.get_variable("biases_10", [hidden_layer_size])
outputs_10 = tf.nn.relu(tf.matmul(outputs_9,weights_10) + biases_10)

weights_11 = tf.get_variable("weights_11", [hidden_layer_size, output_size])
biases_11 = tf.get_variable("biases_11", [output_size])

# The outputs are a function of outputs_10, weights_11, and biases_11
outputs = tf.matmul(outputs_10, weights_11) + biases_11


# Calculate the loss function for every output/target pair.
# The function used is the same as applying softmax to the last layer and then calculating cross entropy
# with the function we've seen in the lectures. This function, however, combines them in a clever way, 
# which makes it both faster and more numerically stable (when dealing with very small numbers).
# Logits here means: unscaled probabilities (so, the outputs, before they are scaled by the softmax)
# Naturally, the labels are the targets.
loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=outputs, labels=targets)

# Get the average loss
mean_loss = tf.reduce_mean(loss)

# Define the optimization step. Using adaptive optimizers such as Adam in TensorFlow
# is as simple as that.
optimize = tf.train.AdamOptimizer(learning_rate=0.0002).minimize(mean_loss)

# Get a 0 or 1 for every input in the batch indicating whether it output the correct answer out of the 10.
out_equals_target = tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1))

# Get the average accuracy of the outputs.
accuracy = tf.reduce_mean(tf.cast(out_equals_target, tf.float32))

# Declare the session variable.
sess = tf.InteractiveSession()

# Initialize the variables. Default initializer is Xavier.
initializer = tf.global_variables_initializer()
sess.run(initializer)

# Batching
batch_size = 150

# Calculate the number of batches per epoch for the training set.
batches_number = mnist.train.num_examples // batch_size

# Basic early stopping. Set a maximum number of epochs.
max_epochs = 50

# Keep track of the validation loss of the previous epoch.
# If the validation loss starts increasing, we want to trigger early stopping.
# We initially set it to some arbitrarily high number to make sure we don't trigger it
# in the first epoch.
prev_validation_loss = 9999999.

import time
start_time = time.time()

# Create a loop for the epochs. epoch_counter starts from 0.
for epoch_counter in range(max_epochs):
    
    # Keep track of the sum of batch losses in the epoch.
    curr_epoch_loss = 0.
    
    # Iterate over the batches in this epoch.
    for batch_counter in range(batches_number):
        
        # Input batch and target batch are assigned values from the train dataset, given a batch size
        input_batch, target_batch = mnist.train.next_batch(batch_size)
        
        # Run the optimization step and get the mean loss for this batch.
        # Feed it with the inputs and the targets we just got from the train dataset
        _, batch_loss = sess.run([optimize, mean_loss], 
            feed_dict={inputs: input_batch, targets: target_batch})
        
        # Increment the sum of batch losses.
        curr_epoch_loss += batch_loss
    
    # So far curr_epoch_loss contained the sum of all batches inside the epoch
    # We want to find the average batch losses over the whole epoch
    # The average batch loss is a good proxy for the current epoch loss
    curr_epoch_loss /= batches_number
    
    # At the end of each epoch, get the validation loss and accuracy
    # Get the input batch and the target batch from the validation dataset
    input_batch, target_batch = mnist.validation.next_batch(mnist.validation.num_examples)
    
    # Run without the optimization step (simply forward propagate)
    validation_loss, validation_accuracy = sess.run([mean_loss, accuracy], 
        feed_dict={inputs: input_batch, targets: target_batch})
    
    # Print statistics for the current epoch
    # Epoch counter + 1, because epoch_counter automatically starts from 0, instead of 1
    # We format the losses with 3 digits after the dot
    # We format the accuracy in percentages for easier interpretation
    print('Epoch '+str(epoch_counter+1)+
          '. Mean loss: '+'{0:.3f}'.format(curr_epoch_loss)+
          '. Validation loss: '+'{0:.3f}'.format(validation_loss)+
          '. Validation accuracy: '+'{0:.2f}'.format(validation_accuracy * 100.)+'%')
    
    # Trigger early stopping if validation loss begins increasing.
    if validation_loss > prev_validation_loss:
        break
        
    # Store this epoch's validation loss to be used as previous validation loss in the next iteration.
    prev_validation_loss = validation_loss

# Not essential, but it is nice to see in the output when training has finished, rather than having to check the kernel.
print('End of training.')

# Print the time it took the algorithm to train.
print("Training time: %s seconds" % (time.time() - start_time))
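The comments on the loss explain that softmax and cross-entropy are combined into one numerically stable operation. The sketch below shows one common way to get that stability for a single example, the log-sum-exp trick, in plain Python. The function name `softmax_cross_entropy` is hypothetical, and this is an illustration of the idea rather than TensorFlow's actual implementation.

```python
import math


def softmax_cross_entropy(logits, one_hot_target):
    """Cross-entropy between softmax(logits) and a one-hot target.

    Subtracting the largest logit before exponentiating keeps exp()
    from overflowing, without changing the result.
    """
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    # log(softmax(z_i)) = z_i - log(sum_j exp(z_j))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot_target, log_probs))


# With identical logits every class gets probability 1/10,
# so the loss equals log(10) whichever class is the target.
uniform_loss = softmax_cross_entropy([0.0] * 10, [1.0] + [0.0] * 9)
print(uniform_loss, math.log(10))

# Very large logits would overflow a naive exp(), but not here.
print(softmax_cross_entropy([1000.0, 0.0], [1.0, 0.0]))
```

Averaging this per-example loss over a batch corresponds to the `mean_loss` tensor in the notebook.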

In [None]:
# The final accuracy of the model comes from forward propagating through the test dataset,
# because the validation accuracy may be optimistic: we tuned the model against the validation set.
# It took 5 epochs until the validation loss started increasing.
# Batching makes training much faster.
# Validation helps prevent overfitting of the parameters (the weights W and the biases b).
# All of the hyperparameters can be adjusted:
#   width, depth, activations, learning rate, and batch size.
# We look for the best hyperparameters, that is, the ones whose trained model fits the validation set best;
# the parameters themselves are found by the optimizer.
# The test dataset is data the model has not seen yet.
# Validation accuracy is therefore not the best estimate of the model's accuracy.
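The early stopping rule used in the training loop is simple: stop as soon as the validation loss is higher than it was in the previous epoch. A minimal sketch of that rule on its own is shown below. The function name `early_stopping_epoch` is hypothetical and the loss values are made up for illustration; they are not results from this notebook.

```python
def early_stopping_epoch(validation_losses):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops at the first epoch whose validation loss is higher
    than the validation loss of the epoch before it.
    """
    previous_loss = float("inf")  # arbitrarily high, so epoch 1 never triggers
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss > previous_loss:
            return epoch
        previous_loss = loss
    return None


# Made-up validation losses: the loss first rises at epoch 5.
example_losses = [0.40, 0.25, 0.18, 0.15, 0.16, 0.14]
print(early_stopping_epoch(example_losses))  # 5
```

Note that this rule stops at the first increase, even if the loss would have improved again later (as in the made-up sequence above). A more tolerant variant would wait a few epochs before stopping.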

## Test

In [None]:
input_batch, target_batch = mnist.test.next_batch(mnist.test.num_examples)
test_accuracy = sess.run([accuracy],
                        feed_dict = {inputs: input_batch, targets: target_batch})

test_accuracy_percent = test_accuracy[0]*100.

print('Test accuracy: '+'{0:.2f}'.format(test_accuracy_percent)+'%')
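The `accuracy` tensor compares the position of the largest output with the position of the 1 in the one-hot target, then averages the matches. The same computation in plain Python might look like the sketch below. The helper names are hypothetical and the three-class data is made up to keep the example small.

```python
def argmax(values):
    """Index of the largest element in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(outputs, targets):
    """Fraction of rows where the predicted class matches the target class."""
    correct = sum(
        1 for out, tgt in zip(outputs, targets) if argmax(out) == argmax(tgt)
    )
    return correct / len(outputs)


# Four made-up examples with three classes each.
example_outputs = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 0.2, 0.9],    # predicts class 2
    [0.0, 1.5, 0.2],    # predicts class 1
    [1.0, 0.5, 0.1],    # predicts class 0
]
example_targets = [
    [1.0, 0.0, 0.0],    # class 0
    [0.0, 0.0, 1.0],    # class 2
    [0.0, 1.0, 0.0],    # class 1
    [0.0, 1.0, 0.0],    # class 1 (the prediction above is wrong)
]

print(accuracy(example_outputs, example_targets))  # 0.75
```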

In [None]:
# The test accuracy is the final accuracy of the model.
# Roughly 97 to 98 out of every 100 digits are classified correctly.

## Exercises
There are several main adjustments you may try.

It is useful to take note of the time it takes the algorithm to train. This will provide you with an additional level of understanding. That is also your first task.

Using the code from the lecture as the basis, fiddle with the hyperparameters of the algorithm.

The width (the hidden layer size) of the algorithm. Try a hidden layer size of 200. How does the validation accuracy of the model change? What about the time it took the algorithm to train? Can you find a hidden layer size that does better?

    # Changing the hidden layer size to 200 has a minimal effect.

The depth of the algorithm. Add another hidden layer to the algorithm. This is an extremely important exercise! How does the validation accuracy change? What about the time it took the algorithm to train? Hint: Be careful with the shapes of the weights and the biases.

    # Adding another hidden layer actually reduced accuracy slightly

The width and depth of the algorithm. Add as many additional layers as you need to reach 5 hidden layers. Moreover, adjust the width of the algorithm as you find suitable. How does the validation accuracy change? What about the time it took the algorithm to train?

    # Adding multiple hidden layers and increasing or decreasing the width actually reduced accuracy slightly
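One reason width and depth affect training time so strongly is the number of trainable parameters. The sketch below derives the weight and bias shapes for a fully connected network and counts the parameters. The function names `layer_shapes` and `count_parameters` are hypothetical; the shapes follow the same pattern as `weights_1` ... `weights_11` in the model.

```python
def layer_shapes(input_size, hidden_layer_size, num_hidden_layers, output_size):
    """Return a list of (weights_shape, biases_shape) pairs, one per layer."""
    sizes = [input_size] + [hidden_layer_size] * num_hidden_layers + [output_size]
    return [((n_in, n_out), (n_out,)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def count_parameters(shapes):
    """Total number of weights and biases across all layers."""
    return sum(w[0] * w[1] + b[0] for w, b in shapes)


# 2 hidden layers of width 200, as suggested in the width exercise.
small = layer_shapes(784, 200, 2, 10)
# 10 hidden layers of width 5000, as in the model defined in this notebook.
large = layer_shapes(784, 5000, 10, 10)

print(count_parameters(small))   # 199210
print(count_parameters(large))   # 229020010
```

The wide and deep configuration has over a thousand times as many parameters as the small one, which goes a long way toward explaining the difference in training time.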

Fiddle with the activation functions. Try applying a sigmoid transformation to both hidden layers. The sigmoid activation is given by the method: tf.nn.sigmoid()

    # Accuracy decreases to 95.62%

Fiddle with the activation functions. Try applying a ReLU to the first hidden layer and tanh to the second one. The tanh activation is given by the method: tf.nn.tanh()

    # Using tanh for all layers gives an accuracy of 96.98%
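For reference, the three activation functions mentioned in these exercises can be written for a single number as in the sketch below. This is plain Python for illustration only; in the model itself you would use `tf.nn.relu`, `tf.nn.sigmoid`, and `tf.nn.tanh`, which apply the same functions element-wise to whole tensors.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, squashing x into (0, 1).

    The two branches avoid overflow in exp() for large |x|.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def tanh(x):
    """Hyperbolic tangent, squashing x into (-1, 1)."""
    return math.tanh(x)


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), sigmoid(value), tanh(value))
```

Sigmoid and tanh saturate for large positive or negative inputs, while ReLU does not saturate for positive inputs.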

Adjust the batch size. Try a batch size of 1000. How does the required time change? What about the accuracy?

    # Increasing the batch size to 1000 takes the accuracy to 97.23%

Adjust the batch size. Try a batch size of 1. That is stochastic gradient descent (SGD). How do the time and accuracy change? Is the result consistent with the theory?

    # Training time increases drastically because the batch size is so small; it is "crawling". Accuracy is slightly lower.
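The batch size determines how many optimization steps are taken per epoch, which is one reason a batch size of 1 is so slow. The sketch below repeats the integer division used for `batches_number` in the training code. The training set size of 55,000 is an assumption; use `mnist.train.num_examples` for the real value.

```python
def batches_per_epoch(num_examples, batch_size):
    """Number of full batches in one pass over the training data."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return num_examples // batch_size


ASSUMED_TRAIN_SIZE = 55000  # assumption, not read from the dataset

for batch_size in (1, 150, 1000):
    steps = batches_per_epoch(ASSUMED_TRAIN_SIZE, batch_size)
    print("batch size", batch_size, "->", steps, "steps per epoch")
```

Because of the integer division, any examples left over after the last full batch are not counted in `batches_number`.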

Adjust the learning rate. Try a value of 0.0001. Does it make a difference?

    # Accuracy drops to 93.5%

Adjust the learning rate. Try a value of 0.02. Does it make a difference?
 
    # With a learning rate of 0.02, accuracy is 94.58%
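To get a feeling for what the learning rate does inside an adaptive optimizer, the sketch below applies the textbook Adam update to a single number, minimizing the toy function f(w) = w². The function name `adam_minimize` is hypothetical, the default values for `beta1`, `beta2`, and `epsilon` are assumptions chosen to match commonly used defaults, and the code is an illustration rather than TensorFlow's implementation.

```python
import math


def adam_minimize(gradient, start, learning_rate, steps,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run `steps` Adam updates on a scalar and return the final value."""
    w = start
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


def toy_gradient(w):
    """Gradient of f(w) = w**2."""
    return 2.0 * w


w_small_rate = adam_minimize(toy_gradient, start=1.0, learning_rate=0.0001, steps=20)
w_large_rate = adam_minimize(toy_gradient, start=1.0, learning_rate=0.02, steps=20)

print(w_small_rate)  # barely moved from the starting point
print(w_large_rate)  # noticeably closer to the minimum at 0
```

On this toy problem the larger rate makes faster progress in the same number of steps. On a real network a rate that is too large can overshoot and hurt accuracy, which is why both directions are worth trying.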

Combine all the methods above and try to reach a validation accuracy of 98.5+ percent.

    # hidden_layer_size = 5000
    # batch_size = 150
    # learning_rate = 0.0002
    # number of hidden layers = 10