In this notebook, I try to build a model that scores above 0.995 on Kaggle's leaderboard (top 10%). To do so, we first pick the best CNN architecture using `RandomizedSearchCV`, then add data augmentation, dropout, and a learning rate schedule.

# Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV

from keras.models import Sequential
from keras.layers import InputLayer, Conv2D, MaxPool2D, Flatten, Dense, Dropout
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
import keras

# Load, preprocess, and split the data

In [2]:
train_set = pd.read_csv("../input/digit-recognizer/train.csv")
test_set = pd.read_csv("../input/digit-recognizer/test.csv")

In [3]:
X_train = train_set.drop("label", axis=1)
y_train = train_set["label"]

X_train = X_train / 255.
X_test = test_set / 255.

X_train = X_train.values.reshape(-1, 28, 28, 1)
X_test = X_test.values.reshape(-1, 28, 28, 1)

# One-hot encode the labels
y_train = to_categorical(y_train, num_classes=10)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
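
As a rough illustration of what the preprocessing cell does, the sketch below repeats the same steps on a tiny made-up array using only NumPy. The `np.eye` indexing is a stand-in that is assumed to give the same one-hot layout as `to_categorical`; the toy pixel values and labels are invented for the example.

```python
import numpy as np

# Toy stand-in for the CSV data: 4 flattened 28x28 images with pixel values in 0..255.
rng = np.random.default_rng(0)
toy_pixels = rng.integers(0, 256, size=(4, 784))
toy_labels = np.array([3, 0, 9, 3])

# Scale to [0, 1] and reshape to (samples, height, width, channels).
toy_images = (toy_pixels / 255.0).reshape(-1, 28, 28, 1)

# One-hot encode: row i of the identity matrix is the vector for class i.
toy_one_hot = np.eye(10)[toy_labels]

print(toy_images.shape)  # (4, 28, 28, 1)
print(toy_one_hot[0])    # 1.0 at index 3, zeros elsewhere
```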

# Build the model

A typical CNN architecture stacks a few convolutional layers followed by a pooling layer, then repeats this block several times.
Let's start by creating a function that builds and compiles our Keras model.

In [4]:
def build_model(hidden_layers=1, feature_maps=16, kernel_size=3, n_neurons=32, dropout=0.1):
    model = Sequential([])
    model.add(InputLayer(input_shape=(28, 28, 1)))
    for n in range(hidden_layers):
        model.add(Conv2D((n + 1) * feature_maps, kernel_size, activation="relu", padding="same"))
        model.add(Conv2D((n + 1) * feature_maps, kernel_size, activation="relu", padding="same"))
        model.add(MaxPool2D())
        model.add(Dropout(dropout))   
    model.add(Flatten())
    model.add(Dense(n_neurons))
    model.add(Dropout(dropout))
    model.add(Dense(10, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
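
To see how the tensor shape evolves through the blocks that `build_model` stacks, here is a small sketch in plain Python. It assumes that `padding="same"` convolutions keep the spatial size and that `MaxPool2D()` with its default 2x2 window halves it, rounding down. The helper name `block_output_shapes` is made up for this example.

```python
def block_output_shapes(hidden_layers, feature_maps, size=28):
    """Return the (height, width, channels) shape after each conv/pool block."""
    shapes = []
    for n in range(hidden_layers):
        channels = (n + 1) * feature_maps  # same rule as build_model
        size = size // 2                   # 2x2 max pooling halves height and width
        shapes.append((size, size, channels))
    return shapes

shapes = block_output_shapes(hidden_layers=3, feature_maps=24)
print(shapes)  # [(14, 14, 24), (7, 7, 48), (3, 3, 72)]

# Number of values the Flatten layer passes to the first Dense layer.
flat_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flat_units)  # 648
```

With three blocks the feature map is already down to 3x3, which is one reason the search does not go deeper than `hidden_layers=3` on 28x28 inputs.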

We will use randomized search rather than grid search because there are many hyperparameters to tune. An exhaustive grid search might find a slightly better model, but the gain is not worth the extra computational cost.
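
To put a number on that cost, the sketch below counts the fits each strategy would need for the search space used in this notebook (four hyperparameters with three values each, 3-fold cross-validation, 15 sampled candidates). It uses only the standard library and imitates the sampling rather than calling scikit-learn, so the sampled combinations are not the ones `RandomizedSearchCV` would pick.

```python
import itertools
import random

search_space = {
    "hidden_layers": [1, 2, 3],
    "feature_maps": [16, 24, 32],
    "n_neurons": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.4],
}

# Grid search would evaluate every combination.
grid = list(itertools.product(*search_space.values()))
print(len(grid), "combinations,", len(grid) * 3, "fits with 3-fold CV")

# Randomized search evaluates only a fixed number of them (15 here).
random.seed(42)
sampled = random.sample(grid, k=15)
print(len(sampled), "combinations,", len(sampled) * 3, "fits with 3-fold CV")
```

The grid has 81 combinations (243 fits), against 45 fits for the randomized search.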

To use `RandomizedSearchCV`, we need to wrap our Keras model in a `KerasClassifier`.

In [5]:
keras.backend.clear_session()
np.random.seed(42)

keras_clf = KerasClassifier(build_model)

params = {
    "hidden_layers": [1, 2, 3],
    "feature_maps": [16, 24, 32],
    "n_neurons": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.4]
}
batch_size = 64

search_cv = RandomizedSearchCV(keras_clf, params, n_iter=15, cv=3, verbose=2)
search_cv.fit(X_train, y_train, epochs=30,
              validation_data=(X_valid, y_valid),
              callbacks=[EarlyStopping(patience=7)],
              batch_size=batch_size,
              verbose=0)

Fitting 3 folds for each of 15 candidates, totalling 45 fits
[CV] n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3 .....


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3, total=  31.1s
[CV] n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3 .....


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.1s remaining:    0.0s


[CV]  n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3, total=  27.0s
[CV] n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3 .....
[CV]  n_neurons=64, hidden_layers=2, feature_maps=16, dropout=0.3, total=  28.3s
[CV] n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2 .....
[CV]  n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2, total=  22.3s
[CV] n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2 .....
[CV]  n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2, total=  21.0s
[CV] n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2 .....
[CV]  n_neurons=64, hidden_layers=1, feature_maps=16, dropout=0.2, total=  17.3s
[CV] n_neurons=128, hidden_layers=2, feature_maps=32, dropout=0.2 ....
[CV]  n_neurons=128, hidden_layers=2, feature_maps=32, dropout=0.2, total=  24.0s
[CV] n_neurons=128, hidden_layers=2, feature_maps=32, dropout=0.2 ....
[CV]  n_neurons=128, hidden_layers=2, feature_maps=32, dropout=0.2, total=  25.5s
[CV] 

[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed: 21.1min finished


RandomizedSearchCV(cv=3,
                   estimator=<tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7eff9ca1c250>,
                   n_iter=15,
                   param_distributions={'dropout': [0.2, 0.3, 0.4],
                                        'feature_maps': [16, 24, 32],
                                        'hidden_layers': [1, 2, 3],
                                        'n_neurons': [64, 128, 256]},
                   verbose=2)
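
The search above passes an `EarlyStopping(patience=7)` callback to every fit. As a rough plain-Python sketch of the patience idea: training stops once the monitored validation loss has failed to improve for `patience` consecutive epochs. This is a simplified model, not Keras's implementation (it ignores options such as `min_delta` and `restore_best_weights`), and the function name and loss values are invented for illustration.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

made_up_losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.15, 0.18]
print(stopping_epoch(made_up_losses, patience=3))  # 6
print(stopping_epoch(made_up_losses, patience=7))  # None
```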

In [6]:
model = search_cv.best_estimator_.model
model.summary()

Model: "sequential_45"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_174 (Conv2D)          (None, 28, 28, 24)        240       
_________________________________________________________________
conv2d_175 (Conv2D)          (None, 28, 28, 24)        5208      
_________________________________________________________________
max_pooling2d_87 (MaxPooling (None, 14, 14, 24)        0         
_________________________________________________________________
dropout_132 (Dropout)        (None, 14, 14, 24)        0         
_________________________________________________________________
conv2d_176 (Conv2D)          (None, 14, 14, 48)        10416     
_________________________________________________________________
conv2d_177 (Conv2D)          (None, 14, 14, 48)        20784     
_________________________________________________________________
max_pooling2d_88 (MaxPooling (None, 7, 7, 48)        
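
The parameter counts in the summary can be checked by hand: a `Conv2D` layer with a square kernel has one weight per kernel position and input channel, plus one bias, for each output filter. The sketch below assumes the default `kernel_size=3` from `build_model` (it is not part of the search space); `conv2d_params` is a helper name made up for this example.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

# (input channels, output channels) for the four convolutions listed in the summary.
layers = [(1, 24), (24, 24), (24, 48), (48, 48)]
counts = [conv2d_params(3, c_in, c_out) for c_in, c_out in layers]
print(counts)  # [240, 5208, 10416, 20784]
```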

In [7]:
search_cv.best_score_

0.9915873010953268

# Data augmentation

We will use Keras's `ImageDataGenerator` class to apply on-the-fly data augmentation. Note that the generator returns only the randomly transformed images; the original, untransformed images are not included in the batches it yields.
To learn more about data augmentation and the `ImageDataGenerator` class, see:
* The blog post [Keras ImageDataGenerator and Data Augmentation](https://www.pyimagesearch.com/2019/07/08/keras-imagedatagenerator-and-data-augmentation/)
* This stackoverflow question: https://stackoverflow.com/questions/51677788/data-augmentation-in-pytorch
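
To make the "on-the-fly" idea concrete, here is a minimal NumPy sketch of a generator that yields batches of randomly shifted images and never the originals as such. It is a simplified stand-in under stated assumptions, not how `ImageDataGenerator` works internally: it only shifts (no rotation or zoom), uses `np.roll`, which wraps pixels around the border, and the name `random_shift_batches` and the `max_shift` parameter are invented for this example.

```python
import numpy as np

def random_shift_batches(images, labels, batch_size, max_shift, seed=0):
    """Yield batches of randomly shifted copies of the images, forever."""
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        idx = rng.permutation(n)  # reshuffle at the start of every pass
        for start in range(0, n, batch_size):
            batch_idx = idx[start:start + batch_size]
            batch = np.zeros_like(images[batch_idx])
            for i, j in enumerate(batch_idx):
                dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
                # np.roll wraps pixels around the border; Keras fills new pixels differently.
                batch[i] = np.roll(images[j], shift=(dy, dx), axis=(0, 1))
            yield batch, labels[batch_idx]

toy_images = np.random.default_rng(1).random((10, 28, 28, 1))
toy_labels = np.arange(10)
batches = random_shift_batches(toy_images, toy_labels, batch_size=4, max_shift=2)
x_batch, y_batch = next(batches)
print(x_batch.shape, y_batch.shape)  # (4, 28, 28, 1) (4,)
```

Because a new random transform is drawn every time an image is served, the model sees a slightly different version of each digit in every epoch.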

In [8]:
n_epochs = 30
s = n_epochs * len(X_train) // batch_size # number of steps in n_epochs epochs
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.001, s, 0.1)

img_generator = ImageDataGenerator(
    rotation_range=0.1,  # in degrees, so 0.1 is a nearly imperceptible rotation
    zoom_range=0.1,
    width_shift_range=0.1, 
    height_shift_range=0.1
)

model.compile(optimizer=Adam(learning_rate), loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(img_generator.flow(X_train, y_train, batch_size=batch_size, seed=42),
                   epochs=n_epochs, validation_data=(X_valid, y_valid),
                   callbacks=[EarlyStopping(patience=7)])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
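
The `ExponentialDecay(0.001, s, 0.1)` schedule used in the training cell multiplies the learning rate by 0.1 over every `s` optimizer steps, decaying smoothly in between (this is the default, non-staircase behaviour). The sketch below reproduces that formula in plain Python; the value `s = 1000` is a placeholder, not the value computed in the notebook.

```python
def exponential_decay(step, initial_lr=0.001, decay_steps=1000, decay_rate=0.1):
    """Learning rate after `step` optimizer updates (continuous, non-staircase decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)

s = 1000  # placeholder for the notebook's `s`
for step in (0, s // 2, s):
    print(step, exponential_decay(step, decay_steps=s))
```

At step 0 the rate is 0.001, halfway through it is about 0.000316, and after `s` steps it has dropped tenfold to about 0.0001.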


# Ensemble Learning

Here we combine 8 CNNs into a single, stronger classifier: for each test image, we predict the class that gets the most votes.

Hopefully, this will give us slightly better accuracy than our first model.

In [9]:
train_set = pd.read_csv("../input/digit-recognizer/train.csv")
test_set = pd.read_csv("../input/digit-recognizer/test.csv")
X_train = train_set.drop("label", axis=1)
y_train = train_set["label"]
X_train = X_train / 255.
X_test = test_set / 255.
X_train = X_train.values.reshape(-1, 28, 28, 1)
X_test = X_test.values.reshape(-1, 28, 28, 1)
y_train = to_categorical(y_train, num_classes=10)

nets = 8
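# clone_model copies the architecture with freshly initialized weights, so each
# ensemble member is a separate network rather than the same object repeated.
ensemble = [keras.models.clone_model(search_cv.best_estimator_.model) for _ in range(nets)]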
history = [0] * nets

for j, clf in enumerate(ensemble):
    X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.15, random_state=42)   
    clf.compile(optimizer=Adam(learning_rate), loss="categorical_crossentropy", metrics=["accuracy"])
    history[j] = clf.fit(img_generator.flow(X_train2, y_train2, batch_size=batch_size, seed=42),
                         epochs=n_epochs, validation_data=(X_valid2, y_valid2), verbose=0)
    print("CNN {0:d}: Epochs={1:d}, Train accuracy={2:.5f}, Validation accuracy={3:.5f}".format(
          j+1,n_epochs,max(history[j].history['accuracy']),max(history[j].history['val_accuracy']) ))

CNN 1: Epochs=30, Train accuracy=0.99501, Validation accuracy=0.99603
CNN 2: Epochs=30, Train accuracy=0.99549, Validation accuracy=0.99540
CNN 3: Epochs=30, Train accuracy=0.99571, Validation accuracy=0.99556
CNN 4: Epochs=30, Train accuracy=0.99636, Validation accuracy=0.99587
CNN 5: Epochs=30, Train accuracy=0.99658, Validation accuracy=0.99556
CNN 6: Epochs=30, Train accuracy=0.99650, Validation accuracy=0.99476
CNN 7: Epochs=30, Train accuracy=0.99669, Validation accuracy=0.99524
CNN 8: Epochs=30, Train accuracy=0.99689, Validation accuracy=0.99492


In [10]:
from scipy.stats import mode

y_pred = np.empty([nets, len(X_test)])

for clf_index, clf in enumerate(ensemble):
    y_pred[clf_index] = np.argmax(clf.predict(X_test), axis=1)

y_pred_majority_votes, n_votes = mode(y_pred, axis=0)
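
For readers who want to see the voting step spelled out, here is a small NumPy-only sketch of majority voting on made-up predictions. It is an alternative to `scipy.stats.mode`, not what the notebook runs; `majority_vote` is a hypothetical helper, and it expects integer class labels (the notebook's `y_pred` array is float, so it would need `.astype(int)` first).

```python
import numpy as np

# Rows are models, columns are test samples; values are predicted digits (made up).
toy_preds = np.array([
    [7, 2, 1],
    [7, 3, 1],
    [9, 2, 1],
    [7, 2, 4],
])

def majority_vote(preds, n_classes=10):
    """Return the most frequent class in each column; ties go to the smallest class."""
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

print(majority_vote(toy_preds))  # [7 2 1]
```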

In [11]:
results = y_pred_majority_votes.reshape([-1]).astype(int)
results = pd.Series(results, name="Label")
# ImageId is 1-based and must cover every test image
image_ids = pd.Series(range(1, len(results) + 1), name="ImageId")
submission = pd.concat([image_ids, results], axis=1)
submission.to_csv("MNIST-CNN-ENSEMBLE.csv", index=False)

# References

The code in this notebook was inspired by the following kernels and tutorials:

* [How to choose CNN Architecture MNIST](https://www.kaggle.com/cdeotte/how-to-choose-cnn-architecture-mnist#Experiment-1)
* [25 Million Images! [0.99757] MNIST](https://www.kaggle.com/cdeotte/25-million-images-0-99757-mnist/data#Accuracy=99.75%-using-25-Million-Training-Images!!)
* [Introduction to CNN Keras - 0.997 (top 6%)](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6#3.-CNN)
* Chapter 10 of the book *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* (Aurélien Géron).