# Deep Neural Network for MNIST Classification

We'll apply all the knowledge from the lectures in this section to write a deep neural network. The problem we've chosen is often called the "Hello World" of machine learning because, for most students, it is their first example. The dataset is called MNIST and consists of images of handwritten digits. You can find more about it on Yann LeCun's website (Director of AI Research, Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as convolutional networks.

The dataset provides 28x28 images of handwritten digits (one per image), and the goal is to write an algorithm that detects which digit is written. Since there are only 10 digits, this is a classification problem with 10 classes. To exemplify what we've talked about in this section, we will build a network with 2 hidden layers between inputs and outputs.
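To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of such a network. The hidden layer width of 50 is the value used later in this notebook. The helper names (`glorot_uniform`, `forward`), the zero biases, and the random input batch are illustrative assumptions and not part of the TensorFlow code below.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, output_size = 784, 50, 10

def glorot_uniform(fan_in, fan_out):
    # Xavier/Glorot uniform initialization: samples from U(-limit, limit).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Hypothetical parameters for the three linear combinations.
# Biases start at zero here purely for simplicity (an assumption).
w1, b1 = glorot_uniform(input_size, hidden_size), np.zeros(hidden_size)
w2, b2 = glorot_uniform(hidden_size, hidden_size), np.zeros(hidden_size)
w3, b3 = glorot_uniform(hidden_size, output_size), np.zeros(output_size)

def forward(x):
    # Two hidden layers with ReLU, then a linear output layer (the logits).
    h1 = np.maximum(0.0, x @ w1 + b1)
    h2 = np.maximum(0.0, h1 @ w2 + b2)
    return h2 @ w3 + b3

# A fake batch of 4 flattened 28x28 images with values in [0, 1).
batch = rng.random((4, input_size))
logits = forward(batch)
print(logits.shape)  # (4, 10): one score per digit class for each image
```

With these sizes the network has 784*50 + 50 + 50*50 + 50 + 50*10 + 10 = 42,310 trainable parameters.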

## Import the relevant packages

In [18]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# TensorFlow's Keras API includes a loader for MNIST, which we use above.
# It downloads the dataset the first time it is called and caches it locally.
# The data comes split into training and test subsets only; we carve out a
# validation subset from the training data further down.
# Every image is a 28x28 array of grayscale pixel intensities stored as integers from 0 to 255.
# Before training, we flatten every image into a vector of length 28x28=784, where every value
# corresponds to the intensity of the corresponding pixel, and rescale the values to the range 0 to 1.
# This representation (flattening the image row by row into
# a vector) is slightly naive but as you'll see it works surprisingly well.
# Since this is a classification problem, our targets are categorical.
# The loader returns them as integer labels from 0 to 9.
# Recall from the lecture on that topic that one way to deal with that is to use one-hot encoding.
# With it, the target for each individual sample is a vector of length 10
# which has nine 0s and a single 1 at the position which corresponds to the correct answer.
# For instance, if the true answer is "1", the target will be [0,1,0,0,0,0,0,0,0,0] (counting from 0).
# Keep in mind that the very first time you execute this command it might take a little while to run
# because it has to download the whole dataset. Later calls load the cached copy, so they're faster.
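The flattening and one-hot encoding described in the comments above can be sketched in NumPy on a tiny made-up batch. The arrays `images` and `labels` below are stand-ins for the real MNIST arrays, not actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 3 fake 28x28 grayscale "images" with integer pixel values 0..255.
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)
labels = np.array([1, 0, 9])

# Flatten each image row by row and rescale the intensities to [0, 1].
flat = images.reshape(images.shape[0], -1).astype(np.float32) / 255.0

# One-hot encode: row i of the identity matrix is the target for class i.
one_hot = np.eye(10, dtype=np.float32)[labels]

print(flat.shape)   # (3, 784)
print(one_hot[0])   # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
```

Indexing the identity matrix with the label array is the same trick the training cell uses in its `one_hot` helper.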


## Outline the model

The whole code is in one cell, so you can simply rerun this cell (instead of the whole notebook) and train a new model.
The tf.keras.backend.clear_session() call takes care of clearing the old parameters. From there on, a completely new training run starts.

In [19]:
input_size = 784
output_size = 10
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

# Reset any variables left in memory from previous runs.
tf.keras.backend.clear_session()

# Use TensorFlow 1.x compatibility mode for placeholders and sessions.
# (tensorflow is already imported as tf in the first cell.)
tf.compat.v1.disable_eager_execution()

# As in the previous example - declare placeholders where the data will be fed into.
inputs = tf.compat.v1.placeholder(tf.float32, [None, input_size])
targets = tf.compat.v1.placeholder(tf.float32, [None, output_size])

# Weights and biases for the first linear combination between the inputs and the first hidden layer.
# Use get_variable in order to make use of the default TensorFlow initializer which is Xavier.
weights_1 = tf.compat.v1.get_variable("weights_1", [input_size, hidden_layer_size])
biases_1 = tf.compat.v1.get_variable("biases_1", [hidden_layer_size])

# Operation between the inputs and the first hidden layer.
# We've chosen ReLU as our activation function. You can try playing with different non-linearities.
outputs_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)

# Weights and biases for the second linear combination.
# This is between the first and second hidden layers.
weights_2 = tf.compat.v1.get_variable("weights_2", [hidden_layer_size, hidden_layer_size])
biases_2 = tf.compat.v1.get_variable("biases_2", [hidden_layer_size])

# Operation between the first and the second hidden layers. Again, we use ReLU.
outputs_2 = tf.nn.relu(tf.matmul(outputs_1, weights_2) + biases_2)

# Weights and biases for the final linear combination.
# That's between the second hidden layer and the output layer.
weights_3 = tf.compat.v1.get_variable("weights_3", [hidden_layer_size, output_size])
biases_3 = tf.compat.v1.get_variable("biases_3", [output_size])

# Operation between the second hidden layer and the final output.
# Notice we have not used an activation function because we'll use the trick to include it directly in 
# the loss function. This works for softmax and sigmoid with cross entropy.
outputs = tf.matmul(outputs_2, weights_3) + biases_3

# Calculate the loss function for every output/target pair.
# The function used is the same as applying softmax to the last layer and then calculating cross entropy
# with the function we've seen in the lectures. This function, however, combines them in a clever way, 
# which makes it both faster and more numerically stable (when dealing with very small numbers).
# Logits here means the raw, unnormalized scores (so, the outputs before the softmax turns them into probabilities).
# Naturally, the labels are the targets.
loss = tf.nn.softmax_cross_entropy_with_logits(logits=outputs, labels=targets)

# Get the average loss
mean_loss = tf.reduce_mean(loss)

# Define the optimization step. Using adaptive optimizers such as Adam in TensorFlow
# is as simple as that.
optimize = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001).minimize(mean_loss)

# Get a 0 or 1 for every input in the batch indicating whether it output the correct answer out of the 10.
out_equals_target = tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1))

# Get the average accuracy of the outputs.
accuracy = tf.reduce_mean(tf.cast(out_equals_target, tf.float32))

# Declare the session variable.
sess = tf.compat.v1.InteractiveSession()

# Initialize the variables. Default initializer is Xavier.
initializer = tf.compat.v1.global_variables_initializer()
sess.run(initializer)

# Batching
batch_size = 100

# Flatten images and normalize
x_train_flat = x_train.reshape(x_train.shape[0], -1).astype(np.float32) / 255.0
x_test_flat = x_test.reshape(x_test.shape[0], -1).astype(np.float32) / 255.0

# One-hot encode labels
def one_hot(labels, num_classes=10):
    return np.eye(num_classes)[labels]

y_train_oh = one_hot(y_train)
y_test_oh = one_hot(y_test)

# Split validation set from training set
validation_size = 5000
x_validation_flat = x_train_flat[-validation_size:]
y_validation_oh = y_train_oh[-validation_size:]
x_train_flat = x_train_flat[:-validation_size]
y_train_oh = y_train_oh[:-validation_size]

# Training parameters
epochs = 10
batches_number = x_train_flat.shape[0] // batch_size
prev_validation_loss = float('inf')

for epoch_counter in range(epochs):
    curr_epoch_loss = 0.
    for batch_counter in range(batches_number):
        batch_start = batch_counter * batch_size
        batch_end = batch_start + batch_size
        input_batch = x_train_flat[batch_start:batch_end]
        target_batch = y_train_oh[batch_start:batch_end]

        # Run the optimization step and get the mean loss for this batch.
        _, batch_loss = sess.run([optimize, mean_loss],
            feed_dict={inputs: input_batch, targets: target_batch})

        # Increment the sum of batch losses.
        curr_epoch_loss += batch_loss

    # Average batch loss for the epoch
    curr_epoch_loss /= batches_number

    # Validation
    input_batch = x_validation_flat
    target_batch = y_validation_oh
    validation_loss, validation_accuracy = sess.run([mean_loss, accuracy],
        feed_dict={inputs: input_batch, targets: target_batch})

    print('Epoch '+str(epoch_counter+1)+
          '. Mean loss: '+'{0:.3f}'.format(curr_epoch_loss)+
          '. Validation loss: '+'{0:.3f}'.format(validation_loss)+
          '. Validation accuracy: '+'{0:.2f}'.format(validation_accuracy * 100.)+'%')

    # Early stopping
    if validation_loss > prev_validation_loss:
        break
    prev_validation_loss = validation_loss

print('End of training.')

ERROR:tensorflow:An interactive session is already active. This can cause out-of-memory errors or some other unexpected errors (due to the unpredictable timing of garbage collection) in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s). Please use `tf.Session()` if you intend to productionize.
Epoch 1. Mean loss: 0.436. Validation loss: 0.160. Validation accuracy: 95.64%
Epoch 2. Mean loss: 0.184. Validation loss: 0.126. Validation accuracy: 96.58%
Epoch 3. Mean loss: 0.139. Validation loss: 0.113. Validation accuracy: 97.06%
Epoch 4. Mean loss: 0.113. Validation loss: 0.104. Validation accuracy: 97.28%
Epoch 5. Mean loss: 0.094. Validation loss: 0.099. Validation accuracy: 97.40%
Epoch 6. Mean loss: 0.080. Validation loss: 0.097. Validation accuracy: 97.48%
Epoch 7. Mean loss: 0.069. Validation loss: 0.095. Validation accuracy: 97.50%
Epoch 8. Mean loss: 0.060. Validation loss: 0.094. Validation accuracy: 97.56%
Epoch 9
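The loss in the training cell above fuses softmax and cross-entropy into one operation. As a rough illustration of why that helps numerically, the NumPy sketch below compares a naive two-step computation with one based on the log-sum-exp trick. The function names are hypothetical, and this is not TensorFlow's actual implementation.

```python
import numpy as np

def naive_softmax_cross_entropy(logits, targets):
    # Softmax first, then cross-entropy: exp() can overflow for large logits.
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -(targets * np.log(probs)).sum(axis=1)

def stable_softmax_cross_entropy(logits, targets):
    # Subtract the row maximum before exponentiating (log-sum-exp trick).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1)

targets = np.array([[0.0, 1.0, 0.0]])

small = np.array([[1.0, 2.0, 3.0]])
# For moderate logits both versions agree.
print(np.allclose(naive_softmax_cross_entropy(small, targets),
                  stable_softmax_cross_entropy(small, targets)))  # True

large = np.array([[1000.0, 1001.0, 1002.0]])
# Shifting all logits by a constant does not change the loss (about 1.4076 here).
print(stable_softmax_cross_entropy(large, targets))
with np.errstate(over="ignore", invalid="ignore"):
    # exp(1000) overflows to inf, so the naive version returns nan.
    print(naive_softmax_cross_entropy(large, targets))
```

The stable version works directly on log-probabilities, so it never exponentiates a large positive number.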

## Test the model

As we discussed in the lectures, after training on the training and validation sets, we test the final prediction power of our model by running it on the test dataset that the algorithm has not seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset. The test set is the absolute final check. You should not test before you are completely done with adjusting your model.

In [21]:
# Use the preprocessed test set
test_accuracy = sess.run([accuracy], 
    feed_dict={inputs: x_test_flat, targets: y_test_oh})

# Test accuracy is a list with 1 value, so we want to extract the value from it, using x[0]
# Uncomment the print to see how it looks before the manipulation
# print(test_accuracy)
test_accuracy_percent = test_accuracy[0] * 100.

# Print the test accuracy formatted in percentages
print('Test accuracy: '+'{0:.2f}'.format(test_accuracy_percent)+'%')

Test accuracy: 97.27%
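The accuracy reported above is the share of samples for which the position of the largest output matches the position of the 1 in the one-hot target. A small NumPy sketch of that calculation, using made-up numbers rather than real model outputs:

```python
import numpy as np

# Made-up logits for 4 samples and 3 classes.
outputs = np.array([[2.0, 0.1, -1.0],
                    [0.3, 0.2, 1.5],
                    [0.0, 4.0, 0.5],
                    [1.0, 0.9, 0.8]])
# One-hot targets: the true classes are 0, 2, 1, 2.
targets = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0],
                    [0, 0, 1]])

# Same logic as tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1)).
out_equals_target = np.argmax(outputs, axis=1) == np.argmax(targets, axis=1)
accuracy = out_equals_target.astype(np.float32).mean()

print(out_equals_target)  # [ True  True  True False]
print(accuracy)           # 0.75
```

Because only the position of the maximum matters, the accuracy can be computed on the raw logits without applying softmax first.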


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly between 97% and 98%. Each time the code is rerun, we get a slightly different accuracy, mainly because the weights are initialized randomly. Note that this version of the training loop goes through the training data in the same order every epoch, so batch order is not a source of variation here.

Finally, we have intentionally stopped at a suboptimal solution, so you have room to build on it.
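One possible way to build on it is to reshuffle the training data at the start of every epoch, which the training loop in this notebook does not do (it slices the arrays in a fixed order). Below is a minimal sketch, assuming NumPy arrays shaped like `x_train_flat` and `y_train_oh`. The helper `shuffled_batches` is a hypothetical name, and the tiny arrays are stand-in data.

```python
import numpy as np

def shuffled_batches(inputs, targets, batch_size, rng):
    # Draw one random permutation per call so inputs and targets stay aligned.
    order = rng.permutation(inputs.shape[0])
    # Like the notebook's loop, drop any incomplete final batch.
    for start in range(0, inputs.shape[0] - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], targets[idx]

# Tiny stand-in data: 10 samples whose target equals the sample index.
x = np.arange(10, dtype=np.float32).reshape(10, 1)
y = np.arange(10)

rng = np.random.default_rng(42)  # a fixed seed makes the shuffle reproducible
for epoch in range(2):
    batches = list(shuffled_batches(x, y, batch_size=5, rng=rng))
    print(epoch, [batch_y.tolist() for _, batch_y in batches])
```

In the training cell, the inner loop over `batch_counter` could iterate over such a generator instead of computing `batch_start` and `batch_end` by hand.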