# MNIST - Digit Recognizer Kaggle Competition with CNN

What you will find in this notebook:

* Case study of the application and optimization of a CNN
* Explanation of the functioning of the CNN and the different choices made
___
**Disclaimer**: I wrote this notebook to clarify my own thinking and to share it in case it is useful to others. I don't claim it is flawless, so if you spot anything in the explanations or the code that needs correcting, please feel free to let me know.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense
# Both tuner classes come from the same package (the old `kerastuner` name is deprecated)
from keras_tuner import Hyperband, HyperParameters



## Importing and splitting data

In [2]:
raw_dataset = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
raw_dataset_images = raw_dataset.iloc[:, 1:]
raw_dataset_label = raw_dataset.iloc[:, 0]


training_data_images, testing_data_images, training_data_label, testing_data_label = train_test_split(
    raw_dataset_images, raw_dataset_label, test_size=0.2, random_state=0)

training_data_label = training_data_label.astype(int)
testing_data_label = testing_data_label.astype(int)

## Reshaping each image to 28 x 28 x 1

In [5]:
reshaped_train_df = training_data_images.values.reshape(-1, 28, 28, 1)
reshaped_test_df = testing_data_images.values.reshape(-1, 28, 28, 1)
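
To make the reshape step concrete, here is a small sketch on made-up data (the `flat_rows` array is hypothetical, not the Kaggle CSV). It shows how a row of 784 values maps onto a 28 x 28 grid with one channel. The final rescaling to [0, 1] is an optional step that I am assuming could be useful; it is not applied in this notebook.

```python
import numpy as np

# Two fake "CSV rows" of 784 pixel values each, stored in row-major order.
flat_rows = np.arange(2 * 784).reshape(2, 784) % 256

# -1 lets NumPy infer the number of images; the trailing 1 is the single
# grayscale channel that Conv2D expects.
images = flat_rows.reshape(-1, 28, 28, 1)

# Pixel (row r, column c) of image i comes from flat index r * 28 + c.
assert images[1, 3, 5, 0] == flat_rows[1, 3 * 28 + 5]

# Optional step (an assumption, not done in this notebook):
# rescale 0..255 integers to floats in [0, 1].
scaled = images.astype("float32") / 255.0
print(images.shape, scaled.min(), scaled.max())
```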

## Generating more data to train

To increase the effective number of training images, we can use image augmentation: random transformations (such as rotation, shift or zoom) are applied to each image, which adds variability to the initial dataset and tends to improve the generalization of the model. In this notebook only rotation and zoom are enabled. With Keras, the ``ImageDataGenerator`` class performs these transformations.

In [22]:
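datagen = ImageDataGenerator(
    rotation_range=30,       # random rotation of up to +/- 30 degrees
    zoom_range=0.20,         # random zoom factor between 0.8 and 1.2
    width_shift_range=0.0,   # no horizontal shift
    height_shift_range=0.0   # no vertical shift
)

# fit() only computes statistics for featurewise normalization or ZCA whitening;
# neither is enabled here, so this call does not change the generated images.
datagen.fit(reshaped_train_df)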
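
To give an intuition of what a random rotation does to a digit, here is a simplified NumPy sketch. It is not how ``ImageDataGenerator`` is implemented (Keras interpolates and fills borders differently); the `rotate_nearest` helper and the crude "1" image are hypothetical names made up for this illustration.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate a 2-D image about its centre using nearest-neighbour sampling."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel it comes from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    # Pixels whose source falls outside the image stay at 0 (black).
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(image)
    out[inside] = image[src_y[inside], src_x[inside]]
    return out

rng = np.random.default_rng(0)
digit = np.zeros((28, 28), dtype=np.uint8)
digit[4:24, 13:15] = 255          # a crude vertical stroke, like a "1"

angle = rng.uniform(-30, 30)      # same range as rotation_range=30
augmented = rotate_nearest(digit, angle)
print(augmented.shape)
```

Each call with a new random angle gives a slightly different version of the same digit, which is the variability the augmentation is meant to add.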

## Building CNN

This CNN is built quite classically around the following structure:

**Convolutional Layers:**
Convolutional layers exploit local patterns and spatial relationships in images. Because each filter only looks at a small neighbourhood and its weights are shared across the whole image, the network captures local features with far fewer parameters than a fully connected layer. Stacking several convolutional layers (as in this architecture) provides hierarchical feature extraction, larger receptive fields, and non-linear feature composition: each ``Conv2D`` layer builds upon the lower-level features of the previous one and learns more complex and abstract representations.
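
As a minimal sketch of what a single 3x3 filter computes, assuming one input channel and zero padding (the equivalent of `padding="same"`), the following NumPy code slides a hand-made vertical-edge filter over a tiny image. The `conv2d_same` helper and the filter values are hypothetical; in the real network the filter weights are learned.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Single-channel 2-D cross-correlation with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                                  # dark left half, bright right half
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)    # hand-made filter for illustration

# ReLU keeps only the positive responses, as activation="relu" does.
response = np.maximum(conv2d_same(image, vertical_edge), 0.0)
print(response)
```

The output has the same 6 x 6 size as the input, and the response is non-zero only around the dark-to-bright boundary.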

**MaxPooling Layers:**
After each pair of convolutional layers, a max-pooling layer (``MaxPooling2D``) with a 2x2 pooling window halves the height and width of the feature maps, keeping only the strongest activation in each window.
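
Here is a small NumPy sketch of 2x2 max pooling on a single feature map (the `max_pool_2x2` helper is a hypothetical name; Keras applies the same idea to every channel):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Drop a trailing odd row/column, then group pixels into 2x2 blocks.
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [7, 1, 2, 2]])
pooled = max_pool_2x2(fm)
print(pooled)
```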

**Flatten Layer:**
The flatten layer takes the multi-dimensional feature maps produced by the `Conv2D`/`MaxPooling` layers and converts them into a one-dimensional vector, which is the input format expected by the following `Dense` layer.

**Dense Layers:**
The two hidden Dense layers (with ReLU activation) combine the extracted features and learn the relationships between them and the output classes.

**Output Layer:**
The output Dense layer consists of 10 neurons, one for each digit class (0 to 9) in the MNIST dataset, with a softmax activation that turns the outputs into class probabilities.
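
To see how the tensor shape evolves through this structure, here is a short sketch in plain Python. It assumes 3x3 convolutions with `padding="same"` and a 2x2 pooling after every second convolution, as in the `model_building` function of this notebook; `trace_shapes` is a hypothetical helper, and the filter counts are simply the smallest value of each search range.

```python
def trace_shapes(filters, size=28):
    """Follow the (height, width, channels) shape through the network."""
    shapes = [("input", (size, size, 1))]
    for i, f in enumerate(filters, start=1):
        # A 3x3 convolution with padding="same" keeps height and width.
        shapes.append((f"conv_{i}", (size, size, f)))
        if i % 2 == 0:
            # A 2x2 max pooling after every second convolution halves them.
            size //= 2
            shapes.append((f"pool_{i // 2}", (size, size, f)))
    # Flatten turns the last feature maps into a single vector.
    shapes.append(("flatten", (size * size * filters[-1],)))
    return shapes

for name, shape in trace_shapes([8, 16, 32, 64]):
    print(f"{name:8s} {shape}")
```

The spatial size goes 28 -> 14 -> 7, so the vector passed to the first Dense layer has 7 x 7 x (number of filters in the last convolution) values.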

___
**Regularization and Optimization:**
- L2 regularization (factor 0.0005) is applied to the weights of the convolutional layers to mitigate overfitting: large weights are penalized, which encourages simpler and smoother weight configurations.
- The Adam optimizer updates the model weights during training; its learning rate is a tunable ``Float`` hyperparameter searched on a log scale between 1e-4 and 1e-2.
- The "sparse categorical crossentropy" loss function is used. It is suited to multiclass classification with integer labels, so the labels do not need to be one-hot encoded.
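
As a rough illustration of the two loss terms above, here is a NumPy sketch on made-up numbers. The probabilities, labels and kernel are hypothetical, and the helper functions are my own names; they only mirror the formulas (mean negative log-probability of the true class, and factor times the sum of squared weights).

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

def l2_penalty(weights, factor=0.0005):
    """Penalty added to the loss: factor * sum of squared weights."""
    return factor * float(np.sum(np.square(weights)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])   # predicted probabilities for 2 samples
labels = np.array([0, 2])            # integer labels, no one-hot encoding
kernel = np.ones((3, 3, 1, 8))       # hypothetical 3x3 kernel bank with 8 filters

data_loss = sparse_categorical_crossentropy(probs, labels)
total_loss = data_loss + l2_penalty(kernel)
print(data_loss, total_loss)
```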

In [None]:
def model_building(hp_optimizer):
    model = keras.Sequential()
    model.add(keras.layers.Conv2D(hp_optimizer.Choice('filters_1', values=[8, 16, 24, 32, 40, 48, 56, 64]), 3, activation="relu", input_shape=(28, 28, 1), padding="same", kernel_regularizer=keras.regularizers.L2(0.0005)))
    model.add(keras.layers.Conv2D(hp_optimizer.Choice('filters_2', values=[16, 32, 48, 64, 80, 96, 112, 128]), 3, activation="relu", padding="same", kernel_regularizer=keras.regularizers.L2(0.0005)))
    model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(keras.layers.Conv2D(hp_optimizer.Choice('filters_3', values=[32, 64, 96, 128, 160, 192, 224, 256]), 3, activation="relu", padding="same", kernel_regularizer=keras.regularizers.L2(0.0005)))
    model.add(keras.layers.Conv2D(hp_optimizer.Choice('filters_4', values=[64, 128, 192, 256, 320, 384, 448, 512]), 3, activation="relu", padding="same", kernel_regularizer=keras.regularizers.L2(0.0005)))
    model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(keras.layers.Flatten())
    model.add(Dense(hp_optimizer.Int('units_1', min_value=32, max_value=512, step=32), activation='relu'))
    model.add(Dense(hp_optimizer.Int('units_2', min_value=32, max_value=512, step=32), activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(hp_optimizer.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

## Optimizing hyperparameters

Neural networks and their layers depend on several hyperparameters (number of filters, number of units, learning rate, etc.) whose best values are not known in advance. To simplify the search for these values, packages such as ``keras_tuner`` offer optimization algorithms like Random Search, **Hyperband**, and Bayesian Optimization, which run multiple trials to converge on a good set of hyperparameters.
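
Hyperband builds on the idea of successive halving: try many configurations for a few epochs, keep the best fraction, and train the survivors longer. The sketch below is a simplified illustration of that idea only, not the actual scheduling used by ``keras_tuner``; `successive_halving` and `toy_score` are hypothetical, and `toy_score` merely stands in for "train for some epochs and return the validation accuracy".

```python
import math

def successive_halving(configs, evaluate, min_epochs=1, max_epochs=20, factor=3):
    """Keep the best 1/factor of the candidates at each round and give the
    survivors `factor` times more epochs, until one remains or the budget ends."""
    epochs = min_epochs
    survivors = list(configs)
    while len(survivors) > 1 and epochs <= max_epochs:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs), reverse=True)
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0]

def toy_score(learning_rate, epochs):
    """Toy stand-in for validation accuracy: best near a learning rate of 1e-3."""
    distance = abs(math.log10(learning_rate) + 3.0)
    return 1.0 - distance / (epochs + 1)

candidates = [1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 3e-4, 3e-3]
best = successive_halving(candidates, toy_score)
print(best)
```

With 9 candidates and `factor=3`, the first round keeps 3 of them after 1 epoch and the second round keeps 1 after 3 epochs, so most of the budget goes to the promising configurations.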

To use this optimization process with our neural network, we need to define a function that builds the model (``model_building`` here). This function declares each hyperparameter to optimize together with the values or the range to explore, as done in ``model_building`` with ``Choice``, ``Int`` and ``Float``.
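
To get a feeling for the size of the search space declared in ``model_building``, here is a short sketch in plain Python. The counts come from the ranges in this notebook; the `sample_log_uniform` helper is my own illustration of what sampling "on a log scale" means and is not taken from the ``keras_tuner`` source.

```python
import math
import random

filter_choices = 8                            # 8 listed values for each filters_i
unit_choices = len(range(32, 512 + 1, 32))    # units_i: 32, 64, ..., 512
discrete_combinations = filter_choices ** 4 * unit_choices ** 2
print(discrete_combinations)

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(123)
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(10_000)]

# On a log scale, about half of the draws fall below 1e-3 (the geometric middle);
# with uniform sampling on a linear scale it would be only about 9%.
share_below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(round(share_below_1e3, 2))
```

With more than a million discrete combinations before even counting the learning rate, trying every combination is not realistic, which is why a tuner is used.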

**Note:** The ``EarlyStopping`` callback helps prevent overfitting and saves computational resources by stopping the training process early. Here, training stops when the validation loss has not improved for 5 consecutive epochs, and the weights of the best epoch are restored.
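
The patience logic can be sketched in a few lines of plain Python. This is a simplified model of the behaviour with made-up validation losses (the `early_stopping_epoch` helper is hypothetical); the real callback also handles options such as `min_delta` that are ignored here.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the waiting period.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.16, 0.18, 0.19, 0.14, 0.13]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

In this example training stops before the later improvements are ever seen, which is the trade-off of a small patience value.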

In [None]:
early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

hyperparameters = HyperParameters()

tuner = Hyperband(
    model_building,
    objective='val_accuracy',
    max_epochs=20,
    factor=3,
    seed=123,
    hyperparameters=hyperparameters,
    directory='/kaggle/working/',
    project_name='MNIST_comp'
)

tuner.search(reshaped_train_df, training_data_label,
             validation_data=(reshaped_test_df, testing_data_label),
             epochs=20, callbacks=[early_stopping])

best_model = tuner.get_best_models(num_models=1)[0]
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]


In [8]:
print("[RESULT] optimal number of filters in conv2D layer 1: {}".format(
    best_hyperparameters.get("filters_1")))
print("[RESULT] optimal number of filters in conv2D layer 2: {}".format(
    best_hyperparameters.get("filters_2")))
print("[RESULT] optimal number of filters in conv2D layer 3: {}".format(
    best_hyperparameters.get("filters_3")))
print("[RESULT] optimal number of filters in conv2D layer 4: {}".format(
    best_hyperparameters.get("filters_4")))
print("[RESULT] optimal number of units in dense layer 1: {}".format(
    best_hyperparameters.get("units_1")))
print("[RESULT] optimal number of units in dense layer 2: {}".format(
    best_hyperparameters.get("units_2")))
print("[RESULT] optimal learning rate: {:.4f}".format(
    best_hyperparameters.get("learning_rate")))

[RESULT] optimal number of filters in conv2D layer 1: 40
[RESULT] optimal number of filters in conv2D layer 2: 112
[RESULT] optimal number of filters in conv2D layer 3: 160
[RESULT] optimal number of filters in conv2D layer 4: 384
[RESULT] optimal number of units in dense layer 1: 288
[RESULT] optimal number of units in dense layer 2: 32
[RESULT] optimal learning rate: 0.0005
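
As a sanity check on these values, the number of trainable parameters of the resulting network can be worked out by hand. The sketch below assumes the architecture of ``model_building`` (3x3 kernels, two 2x2 poolings, so a 7 x 7 map before flattening); the helper names are mine.

```python
def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

filters = [40, 112, 160, 384]   # reported filters_1 .. filters_4
units = [288, 32, 10]           # units_1, units_2, output classes

total, channels = 0, 1          # the input image has 1 channel
for f in filters:
    total += conv_params(channels, f)
    channels = f

inputs = 7 * 7 * channels       # 28 -> 14 -> 7 after the two poolings
for u in units:
    total += dense_params(inputs, u)
    inputs = u

print(total)
```

Most of the parameters sit in the first Dense layer, because it is connected to every value of the flattened 7 x 7 x 384 feature maps.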


Now that the best hyperparameters are known, we continue training the best model found by the tuner on our training set, this time with image augmentation:

In [32]:
best_model.compile(optimizer=best_model.optimizer,
                   loss=best_model.loss,
                   metrics=['accuracy'])
# datagen.flow() already yields shuffled batches of 300 images, so fit() does not
# need its own batch_size or shuffle arguments.
best_model.fit(datagen.flow(reshaped_train_df, training_data_label, batch_size=300),
               epochs=30,
               validation_data=(reshaped_test_df, testing_data_label),
               verbose=1)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7a68a43eca60>

## Submit Prediction

In [26]:
raw_test_data = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

resized_test_data = raw_test_data.values.reshape(-1, 28, 28, 1)

In [33]:
predictions = best_model.predict(resized_test_data)
predicted_classes = np.argmax(predictions, axis=1)
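
For reference, here is what the `argmax` step does on a tiny made-up example with 3 classes instead of 10. The logits are hypothetical; in the notebook, `predict` already returns the softmax probabilities.

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 3.0, 0.5],
                   [2.0, 0.1, 0.1]])
probs = softmax(logits)

# For each row (image), keep the index of the most probable class.
classes = np.argmax(probs, axis=1)
print(classes)
```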



In [34]:
# Build the submission directly with the two expected columns, in order:
# ImageId (1-based row number) and Label (predicted digit).
predicted_classes_df = pd.DataFrame({
    "ImageId": np.arange(1, len(predicted_classes) + 1),
    "Label": predicted_classes,
})
predicted_classes_df.to_csv('/kaggle/working/test_output.csv', index=False)
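
The resulting file should contain an `ImageId,Label` header followed by one line per test image. Here is a small standard-library sketch of that layout, written to an in-memory buffer with three hypothetical predictions rather than the real output file:

```python
import csv
import io

predicted = [2, 0, 9]   # hypothetical predictions for three test images

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])
for image_id, label in enumerate(predicted, start=1):   # ImageId starts at 1
    writer.writerow([image_id, label])

csv_text = buffer.getvalue()
print(csv_text)
```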