In [156]:
import numpy as np
import pandas as pd
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Flatten, Activation
from keras import optimizers

In [157]:
np.random.seed(1)

Read the MNIST dataset using Keras's mnist.load_data method.
The class labels need to be organised into one-hot-encoded vectors, which we do using keras.utils.to_categorical.

In [158]:
num_classes = 10
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# convert class labels to binary class one-hot-encoded vectors
y_train_ohe = keras.utils.to_categorical(y_train, num_classes)
y_test_ohe = keras.utils.to_categorical(y_test, num_classes)

In [159]:
#convert to float before normalising
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# normalise by dividing by (max-min) i.e. (255-0)
X_train/=255
X_test/=255

In [160]:
print('Shape of the training data', X_train.shape)
print('Example of a y label converted to ohe:', y_train_ohe[2])


Shape of the training data (60000, 28, 28)
Example of a y label converted to ohe: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]


# MNIST ANN
We will now use Keras to build a basic fully connected ANN.
We create a sequential model.

First, the input, which is a 2-dimensional representation of the pixels (28x28), needs to be flattened.
We achieve this by adding a Flatten layer.
Thereafter we add Dense layers, each followed by an Activation layer.
Only the first layer needs to specify the input shape; every other layer need only specify the number of neurons (hidden units) it contains. The output size of one layer automatically becomes the input size of the next.

In [161]:
model = Sequential()

In [162]:
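# flatten each 28x28 image into a vector of 784 pixel values
model.add(Flatten(input_shape=(28, 28)))

# first hidden layer: 512 units; the input size (784) is inferred from Flatten
model.add(Dense(units=512))
model.add(Activation('sigmoid'))

# second hidden layer: 128 units
model.add(Dense(units=128))
model.add(Activation('sigmoid'))

# output layer: one unit per class; softmax turns the outputs into probabilities
model.add(Dense(units=10))
model.add(Activation('softmax'))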

In [168]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_18 (Flatten)         (None, 784)               0         
_________________________________________________________________
dense_38 (Dense)             (None, 512)               401920    
_________________________________________________________________
activation_51 (Activation)   (None, 512)               0         
_________________________________________________________________
dense_39 (Dense)             (None, 128)               65664     
_________________________________________________________________
activation_52 (Activation)   (None, 128)               0         
_________________________________________________________________
dense_40 (Dense)             (None, 10)                1290      
_________________________________________________________________
activation_53 (Activation)   (None, 10)                0         
Total para
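
The Param # column in the summary can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The following is a minimal sketch of that arithmetic; the helper name dense_params is made up for illustration and is not part of Keras.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# layer sizes of the ANN above: 784 -> 512 -> 128 -> 10
sizes = [28 * 28, 512, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

print(per_layer)       # [401920, 65664, 1290]
print(sum(per_layer))  # 468874
```

The three values agree with the dense_38, dense_39 and dense_40 rows of the summary; the Flatten and Activation layers have no trainable parameters.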

Choose the optimiser for the network weight update strategy and then compile the model.

In [170]:
# create the optimizer explicitly so that we can set the learning rate
opt = optimizers.SGD(lr=0.01)
# specify the loss function and the metric for evaluation
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])


In [167]:
model.fit(X_train, y_train_ohe, batch_size=128, epochs=5)

Epoch 1/5
 1152/60000 [..............................] - ETA: 9s - loss: 0.0896 - acc: 0.1302

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1d338155a20>

In [166]:
score = model.evaluate(X_test, y_test_ohe)
print('Total loss on test', score[0])
print('Total accuracy on test', score[1])

Total loss on test 0.0896342469215393
Total accuracy on test 0.128


# Exercise 
Consider how you might improve the accuracy of the basic neural network for MNIST.
- try a different loss function such as categorical_crossentropy (see https://keras.io/losses/#available-loss-functions), which is known to work better with the softmax layer than the mean squared error loss does. The cross-entropy loss measures how far the predicted class probabilities are from the true one-hot label. The formula for calculating cross-entropy loss is given here: https://en.wikipedia.org/wiki/Cross_entropy. The categorical variant is used because there are 10 classes to predict from. If there were 2 classes, we could have used binary_crossentropy.
- try different optimizers such as Adam and RMSprop (see https://keras.io/optimizers/). For instance, the Adam optimizer adapts the step size for each weight and often converges faster than plain SGD (Stochastic Gradient Descent). The optimizer is responsible for updating the weights of the neurons using the gradients computed by backpropagation. In plain SGD, the derivative of the loss function with respect to each weight is multiplied by the learning rate and subtracted from that weight.
- try different activation functions such as relu or tanh (see https://keras.io/activations/#available-activations).
- try increasing the number of hidden units. However, the more complex the network, the longer it takes to train.
- try adding hidden layers.
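
To make the first two suggestions concrete, the following NumPy sketch compares the two losses on a single confidently wrong prediction and performs one plain SGD update. It is an illustration only: the function names, the example prediction and the example gradient are all made up, and Keras's own implementations differ in detail.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the samples."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences over all entries."""
    return float(np.mean((y_true - y_pred) ** 2))

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: move each weight against its gradient, scaled by lr."""
    return w - lr * grad

# one sample whose true class is 4
y_true = np.zeros((1, 10))
y_true[0, 4] = 1.0

# a confident but wrong prediction: 91% on class 9, 1% on every other class
y_wrong = np.full((1, 10), 0.01)
y_wrong[0, 9] = 0.91

print(mean_squared_error(y_true, y_wrong))        # about 0.18
print(categorical_crossentropy(y_true, y_wrong))  # -log(0.01), about 4.6

# hypothetical weights and gradient, just to show the update rule
w = np.array([0.5, -0.3])
grad = np.array([2.0, -1.0])
print(sgd_step(w, grad))  # moves w to about [0.48, -0.29]
```

The cross-entropy penalty for the wrong answer is much larger than the mean squared error, which is one intuition for why it tends to give a stronger training signal with a softmax output.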


# CNN for MNIST
Next we look at how to use the CNN implementation of Keras on the MNIST data.

Here we need to reshape our data so that we maintain the 2-dimensional 28x28 representation instead of flattening it as we did for the basic ANN above. Because we are using only a greyscale representation, we also need to set the number of channels to 1 (instead of, say, 3 if we used an RGB input).

Keras allows us to specify the number of filters we want and the size of the filters. So, in our first layer, 32 is the number of filters and (3, 3) is the size of each filter. We also need to specify the shape of the input, which is (28, 28, 1), but we have to specify it only once, in the first layer.
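
The effect of the filter count and filter size on a convolutional layer can be worked out with simple arithmetic. The sketch below assumes the Conv2D defaults of stride 1 and no padding; the helper names are made up for illustration.

```python
def conv2d_output_size(size, kernel):
    """Output width/height for stride 1 and no padding ('valid')."""
    return size - kernel + 1

def conv2d_params(kernel, channels_in, filters):
    """One kernel x kernel x channels_in weight block plus one bias per filter."""
    return kernel * kernel * channels_in * filters + filters

# first layer: 32 filters of size 3x3 on a 28x28x1 input
print(conv2d_output_size(28, 3), conv2d_params(3, 1, 32))   # 26 320
# second layer: 32 filters of size 3x3 on the 26x26x32 output of the first
print(conv2d_output_size(26, 3), conv2d_params(3, 32, 32))  # 24 9248
```

Each 3x3 convolution therefore trims two pixels from the width and height, and the parameter count depends on the number of input channels but not on the image size.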

Each convolution is followed by an Activation layer. We have used ReLU (rectified linear unit) as our activation function. The ReLU function is f(x) = max(0, x), where x is the input. It sets all negative values in the matrix ‘x’ to 0 and leaves all other values unchanged. It is one of the most widely used activation functions, since it is cheap to compute and helps mitigate the problem of vanishing gradients.
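
As a quick illustration of f(x) = max(0, x), here is a NumPy sketch applied to a small made-up matrix:

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

x = np.array([[-2.0, 0.5],
              [3.0, -0.1]])
print(relu(x))  # negative entries become 0, the others are unchanged
```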

A MaxPooling layer follows the convolutional layers. It down-samples its input by keeping only the maximum value in each pooling window. This makes the learned features less sensitive to small shifts in the image, which helps reduce over-fitting. It also shrinks the feature maps passed to later layers, which reduces the number of parameters to learn and the training time.
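
A 2x2 max pooling step can be sketched in NumPy as follows. This is a simplified single-channel version for illustration, assuming an even height and width; it is not how Keras implements MaxPooling2D internally.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width) array with even sides."""
    h, w = feature_map.shape
    # split the map into non-overlapping 2x2 blocks and take the max of each
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 2, 0, 1],
               [3, 4, 1, 0],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
print(max_pool_2x2(fm))
# [[4 1]
#  [2 8]]
```

Applied to a 24x24 feature map, the same operation yields a 12x12 map, i.e. a quarter of the values.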

It is common practice to add BatchNormalization. BatchNormalization normalizes the output of a convolution layer over each mini-batch so that every channel has approximately the same scale. This often reduces the training time significantly.
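
The normalization step itself is simple. The sketch below shows the training-time computation in NumPy, assuming a scalar scale (gamma) and shift (beta) for brevity. In Keras these are learned per channel, and at prediction time moving averages of the mean and variance are used instead of the batch statistics.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize each feature (last axis) over the batch, then scale and shift."""
    axes = tuple(range(x.ndim - 1))  # every axis except the last
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# a made-up batch of 4 samples with 2 features on very different scales
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])
normed = batch_norm(batch)
print(normed.mean(axis=0))  # close to 0 for both features
print(normed.std(axis=0))   # close to 1 for both features
```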

After creating all the convolutional layers, we need to flatten their output so that it can act as input to the Dense layers.

Dense layers are Keras's name for fully connected layers. These layers give the network the ability to classify the features learned by the convolutional layers.

Dropout is a method used to reduce overfitting. It forces the model to learn multiple independent representations of the same data by randomly disabling neurons during training. In our model, dropout will randomly disable 20% of the neurons.
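
A minimal NumPy sketch of dropout at training time is shown below. It uses the common "inverted dropout" formulation, in which the surviving activations are scaled up so that their expected value is unchanged; the function name and the example activations are made up for illustration.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of the entries and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(1)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # roughly 0.2
print(dropped.mean())         # roughly 1.0, because survivors are scaled by 1 / 0.8
```

At prediction time no neurons are disabled, so the layer simply passes its input through.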

The second-to-last layer is a Dense layer with 10 neurons. The number of neurons in this layer should equal the number of classes we want to predict, as this is the output layer.

The last layer is the softmax Activation layer. Softmax converts the 10 outputs into probabilities that sum to 1. Each class is assigned a probability, and the class with the maximum probability is the model’s prediction for the input.
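
The softmax step and the final choice of class can be sketched in NumPy as follows; the ten raw output values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

# hypothetical raw outputs of the 10-unit Dense layer for one image
logits = np.array([1.0, 2.0, 0.5, 4.0, 0.0, -1.0, 0.2, 3.0, 0.1, 1.5])
probs = softmax(logits)

print(probs.sum())            # 1.0 up to rounding
print(int(np.argmax(probs)))  # 3, the index of the largest raw output
```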

In [129]:
print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)
print("X_test original shape", X_test.shape)
print("y_test original shape", y_test.shape)

X_train original shape (60000, 28, 28)
y_train original shape (60000,)
X_test original shape (10000, 28, 28)
y_test original shape (10000,)


## Reshaping the data for the CNN
After reshaping, the shape of X_train is (60000, 28, 28, 1). As all the images are in grayscale, the number of channels is 1. If they were colour images, the number of channels would be 3 (R, G, B).

The image data has already been rescaled so that each pixel lies in the interval [0, 1] instead of [0, 255]. It is generally a good idea to normalize the input so that each dimension has approximately the same scale.

The labels also need to be one-hot encoded, which we did earlier when creating y_train_ohe and y_test_ohe. In one-hot encoding an integer is converted to an array which contains a single ‘1’, with all other elements set to ‘0’.
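
For illustration, one-hot encoding can be written in a few lines of NumPy. This is only a sketch of what keras.utils.to_categorical produces, with a made-up helper name and three example labels.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Row i is all zeros except for a 1 at column labels[i]."""
    encoded = np.zeros((len(labels), num_classes), dtype='float32')
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([5, 0, 4])
print(one_hot(labels, 10)[2])  # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```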

In [122]:

X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

X_train.shape

(60000, 28, 28, 1)

In [136]:
from keras.layers.normalization import BatchNormalization
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D, GlobalAveragePooling2D
from keras.layers.core import Reshape

In [137]:
# Three steps to create a CNN
# 1. Convolution
# 2. Activation
# 3. Pooling
# Repeat Steps 1,2,3 for adding more hidden layers

# 4. After that, add a fully connected network
# This fully connected network gives the CNN the ability
# to classify the samples

model_cnn = Sequential()

In [138]:

# the data was already reshaped to (28, 28, 1), so no Reshape layer is needed
model_cnn.add(Conv2D(32, (3, 3), input_shape=(28,28,1)))
model_cnn.add(BatchNormalization(axis=-1))
model_cnn.add(Activation('relu'))
model_cnn.add(Conv2D(32, (3, 3)))
model_cnn.add(BatchNormalization(axis=-1))
model_cnn.add(Activation('relu'))
model_cnn.add(MaxPooling2D(pool_size=(2,2)))

# optional second convolutional block with 64 filters (commented out)
#model_cnn.add(Conv2D(64,(3, 3)))
#model_cnn.add(BatchNormalization(axis=-1))
#model_cnn.add(Activation('relu'))
#model_cnn.add(Conv2D(64, (3, 3)))
#model_cnn.add(BatchNormalization(axis=-1))
#model_cnn.add(Activation('relu'))
#model_cnn.add(MaxPooling2D(pool_size=(2,2)))

model_cnn.add(Flatten())

# Fully connected layer
model_cnn.add(Dense(128))
model_cnn.add(BatchNormalization())
model_cnn.add(Activation('relu'))
model_cnn.add(Dropout(0.2))
model_cnn.add(Dense(10))

model_cnn.add(Activation('softmax'))

In [139]:
model_cnn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_9 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
batch_normalization_12 (Batc (None, 26, 26, 32)        128       
_________________________________________________________________
activation_39 (Activation)   (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 24, 24, 32)        9248      
_________________________________________________________________
batch_normalization_13 (Batc (None, 24, 24, 32)        128       
_________________________________________________________________
activation_40 (Activation)   (None, 24, 24, 32)        0         
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 12, 12, 32)        0         
__________

In [144]:

opt = optimizers.Adam(lr=0.01)
model_cnn.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Another technique for reducing over-fitting is data augmentation. Data augmentation rotates, shears, zooms or otherwise transforms the images so that the model learns to generalize rather than memorize specific examples. A model that overfits performs very well on the images it has already seen but fails when given new images.

Keras provides utilities for data augmentation. You can play with the augmentation parameters and check whether they improve the accuracy of the model. The fit call below trains on the original images without augmentation.

Let's fit the model for one epoch.
You will notice that this takes quite a long time compared to the ANN.
This is largely because the convolutional layers perform many more computations per image;
however, there is a significant improvement in accuracy.
We train in batches so that we use less memory and the weights are updated more often, which can also make training faster.
Here we are using a batch size of 64, so the model takes 64 images at a time and trains on them.
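
The idea of feeding augmented mini-batches to a model can be sketched in plain NumPy. The example below is a stand-in only: it applies a random shift of a few pixels, whereas Keras's ImageDataGenerator would be the usual tool in practice. The function names and the random demo data are made up for illustration.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift a (height, width, channels) image by a few pixels, padding with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    # copy the part of the image that stays inside the frame after the shift
    dst = shifted[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
    src = image[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    dst[...] = src
    return shifted

def augmented_batches(X, y, batch_size, rng, max_shift=2):
    """Yield shuffled mini-batches in which every image is randomly shifted."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        images = np.stack([random_shift(img, max_shift, rng) for img in X[idx]])
        yield images, y[idx]

rng = np.random.default_rng(1)
X_demo = rng.random((200, 28, 28, 1)).astype('float32')  # stand-in for X_train
y_demo = rng.integers(0, 10, size=200)                   # stand-in for the labels

batches = list(augmented_batches(X_demo, y_demo, batch_size=64, rng=rng))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
# 4 (64, 28, 28, 1) (8, 28, 28, 1)
```

Only one batch of 64 images needs to be held in its augmented form at a time, which is why batching keeps memory use low.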

In [147]:
model_cnn.fit(X_train, y_train_ohe, batch_size=64, epochs=1, validation_data=(X_test, y_test_ohe))


Train on 60000 samples, validate on 10000 samples
Epoch 1/1


<keras.callbacks.History at 0x1d3369fc828>

In [149]:
score = model_cnn.evaluate(X_test, y_test_ohe)
print()
print('Test accuracy: ', score[1])


Test accuracy:  0.9812


In [155]:
print('Saving the actual and predicted labels for the test set into a file...')
predictions  = model_cnn.predict_classes(X_test)

predictions = list(predictions)
actuals = list(y_test)

sub = pd.DataFrame({'Actual': actuals, 'Predictions': predictions})
sub.to_csv('output_cnn.csv', index=False)

Saving the actual and predicted labels for the test set into a file...
