# Classifying Handwritten Digits from the MNIST dataset with Deep Learning

### This example is adapted from Example 2.1 in François Chollet's 'Deep Learning with Python'.[**]

Begin by importing the relevant modules. This example requires only the Keras library: the ```models``` module to build the overall network, the ```layers``` module to access the layers we will clip together to form the architecture of the network, and the ```to_categorical``` utility to encode the labels.

In [16]:
from keras import models
from keras import layers

from keras.utils import to_categorical

## Section 1 - Load and Preprocess the Data

The popular MNIST dataset contains $70,000$ greyscale images, each $28 \times 28$ pixels, of handwritten digits from the set $\{0,1,2,3,4,5,6,7,8,9\}$. The dataset is so widely used in the machine learning community that Keras provides it as a built-in dataset [36]. To load this data we simply import ```mnist``` from the ```keras.datasets``` module. Calling ```mnist.load_data()``` returns a tuple of two tuples of NumPy arrays: the first, ```(train_images, train_labels)```, contains $60,000$ images and their corresponding labels for training the network, and the second, ```(test_images, test_labels)```, contains the remaining $10,000$ images and corresponding labels for testing the performance of the network after training.

In [3]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In the code cells below, using NumPy's ```reshape``` method, we flatten each $28 \times 28$ array of pixels into a vector of $784$ values. Each pixel value, $p \in \mathbb{Z}$, in this vector satisfies $0 \leq p \leq 255$, where a value of $0$ corresponds to a black pixel, a value of $255$ corresponds to a white pixel and all integer values in between correspond to shades of grey which increase in brightness as $p$ increases. Using this information, we divide by $255$ to rescale the vector so that each pixel value satisfies $p \in [0, 1]$. We do this rescaling to ensure the pixel values are similar in magnitude to the weights [37]. Finally, we convert each numerical label to a categorical (one-hot) label with the help of ```to_categorical``` from the ```keras.utils``` module. The data is now in an appropriate form to be fed to the input layer of the neural network.

In [4]:
train_images = train_images.reshape((60000, 784))
test_images = test_images.reshape((10000, 784))

In [5]:
train_images = train_images / 255.0
test_images = test_images / 255.0

In [6]:
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
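
The three preprocessing steps can be checked on a small example without downloading MNIST. The sketch below is only illustrative: the arrays ```fake_images``` and ```fake_labels``` are made-up stand-ins for the real data, and ```np.eye(10)[...]``` is used here as a plain NumPy substitute that plays the role of ```to_categorical```.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 4 random 28x28 images with 8-bit pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 3])

# Step 1: flatten each 28x28 image into a vector of 784 values.
flat = fake_images.reshape((4, 28 * 28))

# Step 2: rescale integer pixel values from [0, 255] to floats in [0, 1].
scaled = flat / 255.0

# Step 3: one-hot encode. Row i has a 1 in column fake_labels[i], 0 elsewhere.
one_hot = np.eye(10)[fake_labels]

print(flat.shape)   # (4, 784)
print(one_hot[0])   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```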

## Section 2 - Specify the Architecture of the Network

Using Keras' sequential model-building paradigm, we build a neural network with one hidden layer. The architecture of this network is a $784$ unit input layer, followed by a $128$ unit hidden layer and ending in a $10$ unit output layer (one unit for each digit in $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9 \}$). By specifying the shape of the input data, we indirectly specify the shape of the input layer. Furthermore, it is only necessary to specify the input shape for the first layer, as Keras automatically infers the input shape of each subsequent layer from the output dimension of the previous one. We use ReLU as the activation function for the hidden layer and softmax for the output layer; softmax converts the ten outputs into a probability distribution over the digits [38].

In [7]:
network = models.Sequential()





In [8]:
network.add(layers.Dense(128, activation='relu', input_shape = (784, )))
network.add(layers.Dense(10, activation='softmax'))
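
To make the roles of the two layers concrete, the computation they perform on one input vector can be sketched in plain NumPy. This is a rough illustration, not how Keras is implemented: the weights below are random placeholders (not Keras's initialisation, and not trained), and the helper names ```relu``` and ```softmax``` are our own.

```python
import numpy as np

def relu(z):
    # ReLU sets negative values to zero and leaves positive values unchanged.
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Placeholder (untrained) parameters with the same shapes as the two layers.
W1, b1 = rng.normal(scale=0.05, size=(784, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.05, size=(128, 10)), np.zeros(10)

x = rng.random((1, 784))             # one fake flattened image in [0, 1)
hidden = relu(x @ W1 + b1)           # hidden layer: shape (1, 128)
probs = softmax(hidden @ W2 + b2)    # output layer: shape (1, 10)

print(probs.shape)   # (1, 10)
print(probs.sum())   # sums to 1, up to floating-point rounding
```

The softmax output can be read as one probability per digit, and the predicted digit is the index of the largest entry.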





Calling ```network.summary()``` displays an overview of the architecture of the network: it shows the number of layers, the type of each layer and the number of trainable parameters in each layer. For instance, in layer ```dense_1``` we observe there are precisely $100,480 = (784 \times 128) + 128$ parameters. This is the number we expect since, in a densely connected layer, every neuron in the current layer ($128$ of them) is connected to every neuron in the previous layer ($784$ of them), making exactly $100,352 = 784 \times 128$ connections, each carrying one trainable weight. The additional $128$ trainable parameters are the biases, one for each neuron in the hidden layer. By the same argument, ```dense_2``` has precisely $1,290 = (10 \times 128) + 10$ trainable parameters.
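
The parameter arithmetic above is easy to verify directly. The short sketch below uses a helper of our own, ```dense_param_count``` (a hypothetical name, not part of Keras), which counts one weight per connection plus one bias per neuron.

```python
def dense_param_count(n_inputs, n_units):
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_param_count(784, 128)   # first dense layer
output_params = dense_param_count(128, 10)    # second dense layer
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 100480 1290 101770
```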

In [9]:
network.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 128)               100480    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________


## Section 3 - Compile the Network

The next step is to select a suitable loss function, optimiser and metric for the network to use during training. Since we are solving a multi-class classification problem with one-hot encoded labels, categorical cross-entropy is the standard choice of loss function. We select the rmsprop optimiser to start with; if the network does not give the results we expect, we will return here and change it. Accuracy, the fraction of examples the network correctly classifies, is a suitable metric for this task. (Note, however, that it is not suitable for all tasks, as we shall see later.)

In [10]:
network.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics=['accuracy'])
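
As a rough illustration of what the chosen loss measures, categorical cross-entropy for a single example reduces to the negative logarithm of the probability assigned to the true class. The sketch below is a hand-written version for one example with three classes; the function name and the example probabilities are our own, and Keras's implementation additionally handles batching and numerical clipping.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probs):
    # Only the true class (where the label is 1) contributes to the sum.
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_label, predicted_probs) if t > 0)

label = [0, 0, 1]                 # true class is index 2
confident = [0.05, 0.05, 0.90]    # puts most probability on the true class
unsure = [0.40, 0.30, 0.30]       # puts little probability on the true class

loss_confident = categorical_crossentropy(label, confident)
loss_unsure = categorical_crossentropy(label, unsure)

print(round(loss_confident, 3))   # 0.105
print(round(loss_unsure, 3))      # 1.204
```

The loss is small when the network is confidently right and grows as the probability assigned to the true class shrinks, which is what the optimiser tries to reduce during training.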





## Section 4 - Fit the Data to the Model

The final step is to train the network. Using the ```network.fit()``` command, we pass the training data ($60,000$ vectorised pixel arrays and their associated labels) to the network.

The parameter ```epochs``` is the number of complete passes over the training data. By setting ```epochs = 5``` we instruct Keras to pass the full training set through the network (forward and back) $5$ times to train the parameters. The parameter ```batch_size``` is the number of examples the network processes before each update of its parameters. A larger batch size requires more memory but typically decreases the amount of training time required for each epoch, since fewer updates are performed.
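
The relationship between the training set size, the batch size and the number of parameter updates can be worked out with a few lines of arithmetic. The variable names below are our own, and the sketch assumes the final, smaller batch is still used rather than dropped.

```python
import math

n_examples = 60000
batch_size = 128
epochs = 5

# Number of batches (and hence parameter updates) in one epoch.
steps_per_epoch = math.ceil(n_examples / batch_size)

# The last batch holds whatever examples are left over.
last_batch_size = n_examples - (steps_per_epoch - 1) * batch_size

total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 469 96 2345
```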

Notice that at the end of each epoch in the output, we are provided with a loss score and an accuracy score. These are the value of the loss function and the accuracy (the fraction of training examples correctly classified) on the training data for that epoch. After five epochs, we reach an accuracy of approximately $97$ percent.

In [11]:
network.fit(train_images, train_labels, epochs = 5, batch_size = 128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x235710510f0>

Finally, we call the ```network.evaluate()``` command. This command accepts the $10,000$ test examples and their associated labels, passes the examples through the network and compares the network's predictions to the true labels. It calculates the accuracy of the network on the test set to be approximately $97$ percent also. The agreement between the training and test accuracy is a good indication that the network has not overfit the training data: it has generalised well and has good predictive power for unseen examples.

In [12]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print("Test Accuracy: ", test_acc)

Test Accuracy:  0.9745
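
The accuracy reported above is the fraction of examples for which the most probable predicted class matches the true class. A minimal sketch of that calculation, using four made-up predictions over three classes (not real network output), might look as follows.

```python
import numpy as np

# Hypothetical softmax outputs for 4 examples and 3 classes.
predicted_probs = np.array([[0.1, 0.8, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.2, 0.3, 0.5],
                            [0.6, 0.3, 0.1]])

# Matching one-hot encoded true labels.
true_one_hot = np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 1, 0],
                         [1, 0, 0]])

predicted_classes = predicted_probs.argmax(axis=1)   # most probable class
true_classes = true_one_hot.argmax(axis=1)           # position of the 1

# Fraction of examples where the prediction matches the label.
accuracy = (predicted_classes == true_classes).mean()

print(accuracy)  # 0.75
```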


[**] François Chollet (2017). Deep Learning with Python. Manning Publications Co. (Example 2.1 available in print and online at <https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/2.1-a-first-look-at-a-neural-network.ipynb>)