# MNIST handwritten digits classification with MLPs

In this notebook, we'll train a multi-layer perceptron model to classify MNIST digits using [TensorFlow](https://www.tensorflow.org/) (version $\ge$ 2.0 required) with the [Keras API](https://www.tensorflow.org/guide/keras/overview).

First, the needed imports.

In [None]:
%matplotlib inline

from pml_utils import show_failures

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model, to_categorical

# distutils was removed in Python 3.12; packaging provides an equivalent version parser
from packaging.version import Version as LV

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using Tensorflow version: {}, and Keras version: {}.'.format(tf.__version__, tf.keras.__version__))
assert(LV(tf.__version__) >= LV("2.0.0"))

Let's check whether a GPU is available.

In [None]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 0:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    from tensorflow.python.client import device_lib
    for d in device_lib.list_local_devices():
        if d.device_type == 'GPU':
            print('GPU', d.physical_device_desc)
else:
    print('No GPU, using CPU instead.')

## MNIST data set

Next we'll load the MNIST handwritten digits data set using TensorFlow's own tools.  The first time we do this, the data may have to be downloaded, which can take a while.

#### Alternative: Fashion-MNIST

Alternatively, MNIST can be replaced with Fashion-MNIST, which is designed as a drop-in replacement for MNIST.   Fashion-MNIST contains images of 10 fashion categories:

Label|Description|Label|Description
--- | --- |--- | ---
0|T-shirt/top|5|Sandal
1|Trouser|6|Shirt
2|Pullover|7|Sneaker
3|Dress|8|Bag
4|Coat|9|Ankle boot


In [None]:
from tensorflow.keras.datasets import mnist, fashion_mnist

## MNIST:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
## Fashion-MNIST:
#(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

nb_classes = 10

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255.0
X_test /= 255.0

# one-hot encoding:
Y_train = to_categorical(y_train, nb_classes)
Y_test = to_categorical(y_test, nb_classes)

print()
print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('Y_train:', Y_train.shape)

The training data (`X_train`) is a 3rd-order tensor of size (60000, 28, 28), i.e. it consists of 60000 images of size 28x28 pixels. `y_train` is a 60000-dimensional vector containing the correct classes ("0", "1", ..., "9") for each training sample, and `Y_train` is a [one-hot](https://en.wikipedia.org/wiki/One-hot) encoding of `y_train`.
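
To make the one-hot encoding concrete, here is a minimal NumPy sketch of the kind of array `to_categorical` produces. The helper name `one_hot` and the sample labels are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Set column labels[i] of row i to 1.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample_labels = [5, 0, 4]                    # hypothetical class labels
sample_one_hot = one_hot(sample_labels, 10)
print(sample_one_hot)
```

Taking `argmax` along the last axis of such an array recovers the original integer labels.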

Let's take a closer look. Here are the first 10 training digits (or fashion items for Fashion-MNIST):

In [None]:
pltsize=1
plt.figure(figsize=(10*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train[i,:,:], cmap="gray")
    plt.title('Class: '+str(y_train[i]))
    print('Training sample',i,': class:',y_train[i], ', one-hot encoded:', Y_train[i])

## Multi-layer perceptron (MLP) network

Let's create an MLP model that has multiple layers, non-linear activation functions, and optionally dropout layers for regularization.

### Initialization

We first create the `Input` of shape 28x28 to match the size of the input data. Then we use a `Flatten` layer to convert the 2D image data into vectors of size 784.

We add a `Dense` layer that has 20 output nodes. The `Dense` layer connects each input to each output with some weight parameter and then passes the result through a ReLU non-linear activation function.
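
As a rough illustration of what these two layers compute, the following NumPy sketch flattens one image and applies a dense layer with ReLU. The random image and the random weights are placeholders assumed for the example; in the real model, Keras initializes the weights and learns them during training.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((28, 28)).astype("float32")   # stand-in for one normalized image
flat = image.reshape(-1)                         # Flatten: (28, 28) -> (784,)

# Dense layer with 20 units: a 784x20 weight matrix plus one bias per unit.
# These random values are placeholders, not trained weights.
W = rng.normal(scale=0.05, size=(784, 20)).astype("float32")
b = np.zeros(20, dtype="float32")

hidden = np.maximum(0.0, flat @ W + b)           # ReLU(xW + b)
print(flat.shape, hidden.shape)
```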

Commented out is an alternative, more complex, model that you can also try out.  It uses more layers and dropout.  `Dropout()` randomly sets a fraction of inputs to zero during training, which is one approach to regularization and can sometimes help to prevent overfitting.

The output of the last layer needs to be a softmaxed 10-dimensional vector to match the ground truth (`Y_train`).  This means that it will output 10 values between 0 and 1 that sum to 1, so together they can be interpreted as a probability distribution over our 10 classes.
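
The softmax function itself is short enough to write out. Below is a minimal NumPy sketch, assuming made-up raw scores for the 10 classes; the function name `softmax` and the values in `logits` are chosen for illustration only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Made-up raw scores, one per class.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -2.0, 1.5, 0.2])
probs = softmax(logits)
print(probs.round(3), probs.sum())
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large scores.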

After all layers are created, we create the `Model` by specifying its inputs and outputs.

Finally, we select *categorical crossentropy* as the loss function, select [*adam*](https://keras.io/optimizers/#adam) as the optimizer, add *accuracy* to the list of metrics to be evaluated, and `compile()` the model.  Adam is simply an advanced version of stochastic gradient descent.  Note that there are [several different options](https://keras.io/optimizers/) for the optimizer in Keras that we could use instead of *adam*.

In [None]:
# Model initialization:
inputs = keras.Input(shape=(28, 28))
x = layers.Flatten()(inputs)

# A simple model:
x = layers.Dense(units=20, activation="relu")(x)

# A bit more complex model:
#x = layers.Dense(units=50, activation="relu")(x)
#x = layers.Dropout(rate=0.2)(x)
#x = layers.Dense(units=50, activation="relu")(x)
#x = layers.Dropout(rate=0.2)(x)

# The last layer needs to be like this:
outputs = layers.Dense(units=10, activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=outputs,
                    name="mlp_model")
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
print(model.summary())

The summary shows that there are 15,910 parameters in total in our model.

For example, the first dense layer has 785x20 = 15,700 parameters: a 784x20 weight matrix plus 20 bias terms, one per output node (hence 785 rather than 784).
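
The same arithmetic can be checked in a few lines of Python. The helper `dense_params` is a hypothetical name introduced here; it simply counts the weights and biases of a fully connected layer.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

hidden_params = dense_params(28 * 28, 20)   # 785 * 20
output_params = dense_params(20, 10)        # 21 * 10
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

The `Flatten` layer contributes no parameters, so the two dense layers account for the whole total.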

We can also draw a fancier graph of our model.

In [None]:
plot_model(model, show_shapes=True)

### Learning

Next, we'll train our model.  Notice how the interface is similar to scikit-learn: we still call the `fit()` method on our model object.

An *epoch* means one pass through the whole training data.  We'll begin by running training for 10 epochs.

You can run the code below multiple times and it will continue the training process from where it left off.  If you want to start from scratch, re-initialize the model using the code a few cells ago.

We use a batch size of 32, so each batch fed to the model is a 32x28x28 tensor, which the `Flatten` layer turns into a 32x784 matrix.
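
A quick back-of-the-envelope sketch of what this batch size implies, assuming the standard MNIST training set of 60000 images:

```python
import math

n_samples = 60000      # size of the MNIST training set
batch_size = 32

# Number of batches (and hence weight updates) in one epoch.
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)

# Shape of one batch before and after the Flatten layer.
batch_shape = (batch_size, 28, 28)
flattened_shape = (batch_size, 28 * 28)
print(batch_shape, flattened_shape)
```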

In [None]:
%%time
epochs = 10

history = model.fit(X_train, Y_train, 
                    epochs=epochs, 
                    batch_size=32,
                    verbose=2)

Let's now see how the training progressed. 

* *Loss* is a function of the difference of the network output and the target values.  We are minimizing the loss function during training so it should decrease over time.
* *Accuracy* is the classification accuracy for the training data.  It gives some indication of the real accuracy of the model but cannot be fully trusted, as the model may have overfitted and merely memorized the training data.
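
To illustrate the two quantities, here is a small NumPy sketch that computes categorical crossentropy and accuracy by hand. The predictions and targets are made-up values for three samples and three classes, and the code is a simplified version of what Keras computes, not its exact implementation.

```python
import numpy as np

# Made-up softmax outputs for three samples over three classes (rows sum to 1).
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3]])
# One-hot targets: the true classes are 0, 1 and 2.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# Categorical crossentropy: mean over samples of -log(probability of the true class).
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Accuracy: fraction of samples whose most probable class is the true class.
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))
print(loss, accuracy)
```

The third sample is misclassified and also contributes the largest term to the loss, because the model assigns its true class a probability of only 0.3.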

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'])
plt.title('loss')

plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['accuracy'])
plt.title('accuracy');

### Inference

For a better measure of the quality of the model, let's see the model accuracy for the test data. 

In [None]:
%%time
scores = model.evaluate(X_test, Y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

We can now take a closer look at the results using the `show_failures()` helper function.

Here are the first 10 test digits that the MLP misclassified:

In [None]:
predictions = model.predict(X_test)

show_failures(predictions, y_test, X_test)

We can use `show_failures()` to inspect failures in more detail. For example, here are failures in which the true class was "6":

In [None]:
show_failures(predictions, y_test, X_test, trueclass=6)

We can also compute the confusion matrix to see which digits get mixed up the most, and look at classification accuracies separately for each class:

In [None]:
from sklearn.metrics import confusion_matrix

print('Confusion matrix (rows: true classes; columns: predicted classes):'); print()
cm=confusion_matrix(y_test, np.argmax(predictions, axis=1), labels=list(range(10)))
print(cm); print()

print('Classification accuracy for each class:'); print()
for i,j in enumerate(cm.diagonal()/cm.sum(axis=1)): print("%d: %.4f" % (i,j))
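
The per-class accuracy computed above is the diagonal of the confusion matrix divided by the row sums. A small sketch on a hypothetical 3-class confusion matrix (the numbers are invented for illustration) shows the calculation in isolation:

```python
import numpy as np

# Hypothetical confusion matrix (rows: true classes; columns: predicted classes).
toy_cm = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 1, 9]])

# Correct predictions for each class divided by all samples of that class.
per_class_accuracy = toy_cm.diagonal() / toy_cm.sum(axis=1)

# Correct predictions overall divided by the total number of samples.
overall_accuracy = toy_cm.trace() / toy_cm.sum()
print(per_class_accuracy, overall_accuracy)
```

Off-diagonal entries show which classes are confused: in this toy matrix, two samples of class 1 were predicted as class 0 and two as class 2.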

## Model tuning

Modify the MLP model.  Try to improve the classification accuracy, or experiment with the effects of different parameters.  If you are interested in the state-of-the-art performance on permutation-invariant MNIST, see, e.g., this [recent paper](https://arxiv.org/abs/1507.02672) by Aalto University / The Curious AI Company researchers.

You can also consult the Keras documentation at https://keras.io/.  For example, the Dense, Activation, and Dropout layers are described at https://keras.io/layers/core/.