# Kaggle Competition: Digit recognition on MNIST data
8/3/2017 Wei-Ying Wang

This is my tutorial on how to use Keras to construct a CNN model for digit recognition. It tries to be comprehensive about building CNNs with Keras. Keras is designed to be easy to use and manipulate; however, when I first used it I found it difficult to understand the structure of the models I built. I hope this tutorial helps smooth the learning curve of using Keras.

Most of the information is in sections 2 and 3. I will put a lot of emphasis on knowing the **number of parameters**, **inputs**, and **outputs** of each layer.

I eventually got 99.21% accuracy (with 2 CNN layers and one dense layer, i.e. a normal neural network layer, and with filter sizes different from the ones used here). Note that it is not surprising that one can get 100% on the test set provided by Kaggle, since it is not difficult to find all the MNIST data somewhere else. The real winner so far (correct me if I am wrong) is [Dan Cireşan et al. 2012](https://arxiv.org/abs/1202.2745), who got 99.77% accuracy, which is near human performance. They used CNNs, too.

## Table of contents

### 1. Import modules and preprocessing the data

### 2. A Typical CNN structure: one CNN layer

### 3. Stack more CNN layers

### 4. Using the learned model to predict the test set

You can download this IPython notebook here or at [My Github Website](https://github.com/wayinone/MNIST_Kaggle).


## 1. Import modules and preprocessing the data

In [1]:
from importlib import reload
#from __future__ import print_function
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout,Flatten,Conv2D, MaxPooling2D
from keras.optimizers import RMSprop
from keras.utils import to_categorical  # the np_utils submodule is not available in newer Keras releases
import pandas as pd
from sklearn.model_selection import train_test_split

Import the data.

In [3]:
train = pd.read_csv('../input/train.csv')
train.head()
print('training data is (%d, %d).'% train.shape)

Let's split the training data in two: a training set and a validation set. This is done with `train_test_split` from the `sklearn` module.

In [6]:
X_train_all = (train.iloc[:,1:].values).astype('float32')/255 # all pixel values, scaled to [0,1]
y_train_all = train.iloc[:,0].values.astype('int32') # labels only, i.e. the target digits
y_train_all = to_categorical(y_train_all) # convert y into its one-hot representation
X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, test_size=0.10, random_state=42)
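
To make the one-hot step concrete, here is a minimal NumPy sketch of the kind of encoding `to_categorical` produces. The helper name `one_hot` is hypothetical and this is not the Keras implementation, only an illustration of the idea: label `k` becomes a length-10 vector with a 1 at position `k`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Hypothetical helper: encode integer labels as one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    # Put a 1 in the column given by each label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9])
print(example.shape)           # (3, 10)
print(example.argmax(axis=1))  # [3 0 9], argmax recovers the original labels
```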

We have to map the image vectors (of size 784) back to images. Specifically, we have to reshape each one to `(28,28,1)`, since the convolution layer of Keras only accepts images with 3 dimensions (the last dimension is the color channel).

In [7]:
X_train_img=np.reshape(X_train,(X_train.shape[0],28,28,1))
X_val_img = np.reshape(X_val,(X_val.shape[0],28,28,1))
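
The reshape can be checked on a small stand-in array. The sketch below uses made-up data (not the Kaggle file) and assumes the 784 pixels of each row are stored row by row, so that pixel `j` of the flat vector ends up at row `j // 28`, column `j % 28` of the image.

```python
import numpy as np

# Two fake "images" stored as flat vectors of 784 values each.
flat = np.arange(2 * 784, dtype='float32').reshape(2, 784)

# Same call pattern as above: (n, 784) -> (n, 28, 28, 1).
images = np.reshape(flat, (flat.shape[0], 28, 28, 1))

j = 100  # an arbitrary pixel index in the flat vector
same_pixel = images[0, j // 28, j % 28, 0] == flat[0, j]

# Reshaping back gives the original vectors, so no information is lost.
restored = images.reshape(images.shape[0], -1)
print(images.shape, same_pixel, np.array_equal(restored, flat))
```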

In [13]:
plt.imshow(X_train_img[0,:,:,0],cmap = 'gray')
plt.show()

## 2. A Typical CNN structure: one CNN layer.

We are going to set up the following CNN model. Basically: input layer --> CNN layer --> softmax layer

### 1. Input layer: 
The original input image is `28 x 28`, so `input_shape=(28,28,1)`, where `1` indicates that the number of color channels is 1. 
 
### 2. Convolution layer:
 In the following convolution layer, there are 32 filters, and each filter is `3 x 3`.
  * You can set the border handling with the `padding` argument (called `border_mode` in Keras 1):
       ```
       padding='valid' (default) or 'same'
       ```
  * With `valid` padding, after convolution (with stride 1), the filtered image (AKA **feature map**) has size `(28-2) x (28-2) = 26 x 26`. Note that there are 32 feature maps.
    
  * Set the **stride** to 2 with `strides=(2,2)`. With stride 2 the filter moves two pixels at a time, i.e. it skips every other position (both vertically and horizontally) while applying the convolution. So now every feature map (the filtered and subsampled image) is `13 x 13`.
 
  * There are $$32\cdot3\cdot3+32 =320$$ parameters:
     - Each "pixel" of a feature map is obtained by $$\sum_{i=1}^9 w_i x_i +b,$$ where $$(w_1,...,w_9,b)\in \mathbb{R}^{10}$$ are the parameters of one filter and $$(x_1,...,x_9)$$ are the pixel values of a `3 x 3` image patch, AKA **receptive field**, in the original input image.
   
  * Use `relu` units with `activation='relu'`, then apply max pooling with pool size 2. So the output of this CNN layer is 32 "images" of size `6 x 6`.
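
The sizes in this list can be reproduced with a naive NumPy convolution. This is only a sketch for one filter with random weights; the helpers `conv2d_valid` and `max_pool` are hypothetical names written for this illustration and are far slower than what Keras does internally.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0, stride=1):
    """Slide one filter over a 2-D image without padding ('valid')."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # The receptive field of output pixel (r, c).
            patch = image[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(patch * kernel) + bias  # sum_i w_i x_i + b
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; leftover border rows/columns are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for one MNIST digit
kernel = rng.standard_normal((3, 3))  # one 3 x 3 filter: 9 weights (+ 1 bias)

fmap_stride1 = conv2d_valid(image, kernel, bias=0.1, stride=1)
fmap_stride2 = conv2d_valid(image, kernel, bias=0.1, stride=2)
pooled = max_pool(np.maximum(fmap_stride2, 0.0))  # relu, then 2 x 2 max pooling

print(fmap_stride1.shape, fmap_stride2.shape, pooled.shape)  # (26, 26) (13, 13) (6, 6)
```

Note that the `13 x 13` map has an odd size, so the `2 x 2` pooling drops the last row and column, which is why the result is `6 x 6`.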
   
### 3. Flatten layer:
 One has to 'flatten' the output of the CNN layer before applying the softmax layer. After flattening, the input of the next layer has $$6\cdot 6\cdot 32=1152$$ values.
 
### 4. Softmax layer:
So the last layer (`softmax`) requires $$1152\cdot 10+10=11530$$ parameters.
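
The parameter counts above are simple arithmetic, so they are easy to double check before looking at `model.summary()`. The two helper functions below are hypothetical names for this sketch; they just encode "weights plus one bias per filter or unit".

```python
def conv2d_params(n_filters, kernel_h, kernel_w, in_channels):
    # Each filter has kernel_h * kernel_w * in_channels weights and 1 bias.
    return n_filters * (kernel_h * kernel_w * in_channels + 1)

def dense_params(n_inputs, n_units):
    # Each unit has one weight per input and 1 bias.
    return n_inputs * n_units + n_units

conv_count = conv2d_params(32, 3, 3, 1)         # 32 filters of 3 x 3 on 1 channel
flat_size = 6 * 6 * 32                          # values after pooling and flatten
softmax_count = dense_params(flat_size, 10)     # 10 softmax units
total_count = conv_count + softmax_count

print(conv_count, flat_size, softmax_count, total_count)  # 320 1152 11530 11850
```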


In [8]:
model = Sequential()
model.add(Conv2D(32, (3, 3),
                 activation='relu',
                 input_shape=(28,28,1),strides=(2,2)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten(name='flatten'))
model.add(Dense(10, activation='softmax'))
model.summary()

Before fitting the parameters, we need to compile the model first.

In [9]:
model.compile(loss='categorical_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])

We can now fit the parameters. Note that:

 * A good `batch_size` is usually a power of 2 (e.g. 32, 64, 128, etc.); this advice is from the Keras documentation.
 * Each epoch scans through all the training data, and each parameter update uses `batch_size` samples. The batches are chosen randomly in each epoch, i.e. the data is shuffled before entering the next epoch.
 * `initial_epoch` allows you to continue your last parameter fitting.
   - e.g. `initial_epoch=10` will continue the parameter fitting from the 10th epoch.

 * `verbose` controls the information to be displayed. 0: no information displayed, 1: a progress bar (the most detailed output), 2: one line per epoch.
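
To illustrate what "one epoch" means in terms of parameter updates, here is a plain Python sketch of shuffling the sample indices and cutting them into mini-batches. It is not how Keras implements it internally; `iterate_minibatches` is a hypothetical helper, and the sample count 37800 is an assumption (90% of a 42000-row training file).

```python
import math
import random

def iterate_minibatches(n_samples, batch_size, seed=None):
    """Yield lists of sample indices covering the data once, in random order."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reshuffle before each epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

n_samples = 37800   # assumed size of the training split
batch_size = 64
batches = list(iterate_minibatches(n_samples, batch_size, seed=0))

# One parameter update per batch; the last batch may be smaller.
print(len(batches), len(batches[-1]))  # 591 40
```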

In [10]:
batch_size = 64 
nb_epochs = 5 
history = model.fit(X_train_img, y_train,
                    batch_size=batch_size,
                    epochs=nb_epochs,
                    verbose=2, 
                    validation_data=(X_val_img, y_val),
                    initial_epoch=0)

In my run, the validation accuracy stopped improving after 20 epochs (at around 97.6%), while the training set accuracy was about 98.8%. We should first try a more complicated model to see if the training set accuracy gets higher.

## 3. Stack more CNN layers

In the following model, we have:
### 1. Input layer
* The input is `28 x 28 x 1`. Note that I could describe it as `28 x 28` like before. However, I describe it this way (`28 x 28 x 1`) because it makes it easier to understand how another CNN layer is stacked on top later.

### 2. First CNN layer:
* Filter (kernel) size: `5 x 5 x 1`. With `valid` padding.
* Number of filters: 64.
* The output is 64 feature maps of size `24 x 24`, i.e. `24 x 24 x 64`.
* The number of parameters: $$25\cdot 64+64=1664$$.
    
### 3. Second CNN layer:
* The input is `24 x 24 x 64`
* Filter (kernel) size: `3 x 3 x 64`. With `valid` padding.
    * That is, we treat the input as a `24 x 24` image with **64 color channels**.
* Number of filters: 32.
* The output is 32 filtered images (or say, feature maps) of size `22 x 22`, i.e. `22 x 22 x 32`.
* The number of parameters: $$3\cdot 3 \cdot 64\cdot 32+32 = 18464$$.
    * Think of it this way: the input has 64 nodes, and each node is a `24 x 24` feature map. Each of the 32 filters has $$3\cdot 3\cdot 64$$ weights plus one bias.
    
### 4. MaxPool layer: 
* Set `pool_size = (2,2)`. This means each pixel of the new feature map is the maximum over a `2 x 2` block of 4 pixels in the old feature map.
* The output is `11 x 11 x 32`, i.e. 32 feature maps of size `11 x 11`.

### 5. Dense layer: A normal neural network layer with 128 nodes.
* Because this layer comes before the flatten layer, Keras applies it along the last axis only: at each of the `11 x 11` positions, every one of the 128 nodes connects to the 32 nodes of the previous layer. (Each node in the previous layer is represented as an `11 x 11` image.)
* The output is `11 x 11 x 128`.
* The number of parameters: $$32\cdot 128+128=4224.$$
* Apply dropout with rate 0.2 to prevent overfitting.
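
The shape `11 x 11 x 128` and the small parameter count may look surprising for a dense layer, so here is a NumPy sketch of a dense layer acting on the last axis of a 3-D input. The weights are random stand-ins (not taken from a trained model); the point is only the shapes and the weight sharing across positions.

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.random((11, 11, 32))       # stand-in for the max pool output
W = rng.standard_normal((32, 128))      # one weight matrix shared by all positions
b = np.zeros(128)                       # one bias per node

# Matrix multiplication acts on the last axis: (11, 11, 32) @ (32, 128).
dense_out = np.maximum(pooled @ W + b, 0.0)   # relu activation

n_params = W.size + b.size
print(dense_out.shape, n_params)  # (11, 11, 128) 4224
```

In other words, the same `32 -> 128` mapping is applied independently at each of the 121 positions, which is why the layer needs only 4224 parameters.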

### 6. Flatten layer: flatten the output of the previous layer
* The output is $$11\cdot 11\cdot 128=15488$$ nodes.

### 7.  Softmax layer: 10 softmax units
* The number of parameters: $$15488\cdot10+10=154890.$$
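
The whole walk-through can be traced with a few lines of plain Python. This is a bookkeeping sketch only (no Keras involved); the function name `trace_stacked_model` and the layer descriptions are made up for this illustration, and it assumes `valid` padding and stride 1 as described above.

```python
def trace_stacked_model():
    """Return (layer name, output shape, parameter count) for each layer."""
    shape = (28, 28, 1)
    layers = [
        ("conv 5x5, 64 filters", "conv", (5, 64)),
        ("conv 3x3, 32 filters", "conv", (3, 32)),
        ("max pool 2x2", "pool", (2,)),
        ("dense 128 (on last axis)", "dense", (128,)),
        ("flatten", "flatten", ()),
        ("softmax 10", "dense", (10,)),
    ]
    rows = []
    for name, kind, args in layers:
        params = 0
        if kind == "conv":
            k, filters = args
            h, w, c = shape
            params = k * k * c * filters + filters   # weights + biases
            shape = (h - k + 1, w - k + 1, filters)  # 'valid' padding, stride 1
        elif kind == "pool":
            (size,) = args
            h, w, c = shape
            shape = (h // size, w // size, c)
        elif kind == "dense":
            (units,) = args
            params = shape[-1] * units + units       # acts on the last axis
            shape = shape[:-1] + (units,)
        elif kind == "flatten":
            n = 1
            for d in shape:
                n *= d
            shape = (n,)
        rows.append((name, shape, params))
    return rows

rows = trace_stacked_model()
for name, shape, params in rows:
    print(f"{name:26s} {str(shape):16s} {params}")
total_params = sum(params for _, _, params in rows)
print("total parameters:", total_params)  # total parameters: 179242
```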

In [26]:
model = Sequential()
model.add(Conv2D(64, (5, 5),
                 activation='relu',
                 input_shape=(28,28,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))   
model.add(Flatten(name='flatten'))
model.add(Dense(10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])

The following will be very slow. It takes several minutes to finish one epoch.

In [28]:
batch_size = 64
nb_epochs = 2
history = model.fit(X_train_img, y_train,
                    batch_size=batch_size,
                    epochs=nb_epochs,
                    verbose=2, # verbose controls the information to be displayed. 0: no information displayed
                    validation_data=(X_val_img, y_val),
                    initial_epoch=0)


The function `plot_difficult_samples` in the Appendix shows the wrongly predicted images. For many of them I can't even tell what digit it is...

## 4. Using the learned model to predict the test set.

In [11]:
test = pd.read_csv('../input/test.csv')
X_test = (test.values).astype('float32')/255 # all pixel values
X_test_img = np.reshape(X_test,(X_test.shape[0],28,28,1))

In [12]:
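pred_classes = np.argmax(model.predict(X_test_img, verbose=0), axis=1)  # digit with the highest softmax probability (predict_classes is not available in newer Keras releases)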

Take a look at what we predicted on a test sample.

In [13]:
i=10
plt.imshow(X_test_img[i,:,:,0],cmap='gray')
plt.title('prediction:%d'%pred_classes[i])
plt.show()

## Appendix:

The following function `plot_difficult_samples` plots the wrongly predicted samples from a labeled set (below it is applied to the validation set).

In [36]:
def plot_difficult_samples(model,x,y, verbose=True):
    """
    model: trained model from keras
    x: size(n,h,w,c)
    y: is categorical, i.e. onehot, size(n,10)
    """ 
    #%%
    
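    pred_classes = np.argmax(model.predict(x, verbose=0), axis=1)  # predicted digit per sample
    y_val_classes = np.argmax(y, axis=1)  # true digit per sample
    er_id = np.nonzero(pred_classes!=y_val_classes)[0]  # indices of the wrong predictions
    #%%
    K = int(np.ceil(np.sqrt(len(er_id))))  # side length of the plot grid; add_subplot needs integers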
    fig = plt.figure()
    print('There are %d wrongly predicted images out of %d validation samples'%(len(er_id),x.shape[0]))
    for i in range(len(er_id)):
        ax = fig.add_subplot(K,K,i+1)
        k = er_id[i]
        ax.imshow(x[er_id[i],:,:,0])
        ax.axis('off')
        if verbose:
            ax.set_title('%d as %d'%(y_val_classes[k],pred_classes[k]))

In [37]:
plot_difficult_samples(model,X_val_img,y_val,verbose=False)
plt.show()
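
The index logic inside `plot_difficult_samples` can be tried on a tiny made-up example without a trained model. The probabilities and labels below are invented for illustration (3 classes instead of 10); the steps mirror the function: argmax to get classes, compare, then collect the indices of the mismatches.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes.
probs = np.array([[0.1, 0.9, 0.0],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])
# Made-up one-hot labels for the same 4 samples.
y_onehot = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 1],
                     [1, 0, 0]])

pred_classes = np.argmax(probs, axis=1)      # predicted class per sample
true_classes = np.argmax(y_onehot, axis=1)   # true class per sample
er_id = np.nonzero(pred_classes != true_classes)[0]   # indices of wrong predictions
accuracy = np.mean(pred_classes == true_classes)
grid_side = int(np.ceil(np.sqrt(len(er_id))))  # K x K grid that fits all errors

print(er_id, accuracy, grid_side)  # [1] 0.75 1
```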