# Kannada MNIST Knowledge Competition

The purpose of this Kaggle Knowledge competition is  to practice training convolutional neural networks (CNN) using dataset other than famous MNIST dataset. The dataset used in the competition is the recently-released dataset of Kannada digits. Kannada is a language spoken predominantly by people of Karnataka in southwestern India. The language has roughly 45 million native speakers and is written using the Kannada script. Extensive information about the language and its speakers can be found at

https://en.wikipedia.org/wiki/Kannada

Tha dataset consists of 10 distinct digits that have no resemblance to the usual arabic numerals.
One can see how the Kannada digits look like as well as download the original datasets by visiting the Kaggle webpage

https://www.kaggle.com/c/Kannada-MNIST


The code provided below was used in the aforementioned competition. This competition was Knowlege-type one, and its purpose is to practice rather than to find hyperparameters leading to a perfect score. The competition was kernel-based meaning that one had to commit and run the whole code first on the Kaggle provided GPU before generating a submission file. The submission was scored on both the public test set, as well as a private (unseen) test set. Since it takes enormous amount of time 
to run it on CPU, it is feasible to run it only if one has access to GPU.

The CNN architecture provided here allowed to achieve the accuracies 0.99851 and 0.99430 on the custom made training and evaluation sets, and 0.98820 on both public and private test sets. 

## Reading and preparing the datasets

We first import the necessary modules

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split 

from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense,Conv2D,Flatten,MaxPooling2D,Dropout,BatchNormalization
from keras.optimizers import Nadam, RMSprop
from keras.callbacks import ModelCheckpoint, LearningRateScheduler

from tensorflow.keras.preprocessing.image import ImageDataGenerator

#from keras import regularizers

Using TensorFlow backend.


Then we read and examine the data. There are two datasets (that consist of the the pixel-values of the associated images) train.cv and Dig-MNIST.cv. Though one can use one set for training and the other for validation, we combine them in a single dataset, that will be split after shuffling into self-made training, validation and evaluation sets. 

In [2]:
#train_df = pd.read_csv('.>/input/Kannada-MNIST/train.csv')
#eval_df = pd.read_csv('../input/Kannada-MNIST/Dig-MNIST.csv')
#test_df = pd.read_csv('../input/Kannada-MNIST/test.csv')

train_df = pd.read_csv('./data/train.csv')
eval_df = pd.read_csv('./data/Dig-MNIST.csv')
test_df = pd.read_csv('./data/test.csv')


In [3]:
train_df = pd.concat([train_df, eval_df], ignore_index=True)

We seee that we have the completely balanced datasets in terms of the numbers of classes.

In [4]:
#print(train_df['label'].value_counts().sort_index().to_dict())
#print(eval_df['label'].value_counts().sort_index().to_dict())

Let's prepare the data for putting them into the deep network. The images are 28 times 28 in size being gray-scale and thus having only one channel. We first prepare the image data.

In [5]:
IMG_SIZE = 28
N_CHANNELS = 1 

X = train_df.iloc[:, 1:].values.reshape(-1, IMG_SIZE, IMG_SIZE, N_CHANNELS).\
                astype('float32')/255

X_test = test_df.iloc[:, 1:].values.reshape(-1, IMG_SIZE, IMG_SIZE, N_CHANNELS).\
                astype('float32')/255

# print(X[145,:,:,:])

Then we convert the labels using one-hot encoding.

In [6]:
y = to_categorical(train_df.iloc[:, 0].values)

#print(y[556])

We then shuffle the combined dataset. 

In [7]:
X, y = shuffle(X, y, random_state = 736)

After shuffling, we leave 75 percent of the data for training and 25 percent for evaluation.

In [8]:
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size = 0.25, stratify = y, 
                                                  random_state = 6743)

## CNN training and evaluation

First we define some constants that will be used in the training process.

In [9]:
#constants

NUM_CLASSES = 10
INPUT_SHAPE = (IMG_SIZE, IMG_SIZE, N_CHANNELS)

VERBOSE = 1

#BATCH_SIZE = 256
BATCH_SIZE = 512
#BATCH_SIZE = 1024

EPOCHS = 40
#EPOCHS = 120

We use the keras provided ImageDataGenerator to augment the dataset in real time. Using the generator allows us to notably increase the validation and test accuracies. The optimal parameters such as rotation range in degrees as well as others can be seen from below. 

In [10]:
train_datagen = ImageDataGenerator(rotation_range = 11,
                                   width_shift_range = 0.25,
                                   height_shift_range = 0.25,
                                   shear_range = 0.2,
                                   zoom_range = 0.3,
                                   horizontal_flip = False,
                                   vertical_flip = False)

In [11]:
valid_datagen = ImageDataGenerator() 

Then, we will construct the CNN model to train and evaluate the data. The architecture of CNN is as follows:

(i) Three convolutional layers each having 64 feature maps and followed by batch normalization.

(ii) 2D pooling layer with 2x2 filter and 0.25 dropout.

(iii) Three convolutional layers each having 128 feature maps and followed by batch normalization.

(iv) 2D pooling layer with 2x2 filter and 0.25 dropout.

(v) Two convolutional layers each having 256 feature maps and followed by batch normalization.

(vi) 2D pooling layer with 2x2 filter and 0.25 dropout.

(vii) After flattening, we add the Dense layer of size 512 and 0.5 dropout.

(viii) Finally, the Dense layer with 10 nodes (number of classes) and softmax activation is used to make the final decision.

RELu activations and He uniform initializers for weights were used everywhere except the last layer. 

In [12]:
model = Sequential()
    
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform',
                     input_shape=INPUT_SHAPE, padding = 'same'))
model.add(BatchNormalization())

model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform',
                 padding = 'same'))
model.add(BatchNormalization())

model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform',
                     input_shape=INPUT_SHAPE, padding = 'same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform',
                     input_shape=INPUT_SHAPE, padding = 'same'))
model.add(BatchNormalization())

model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform',
                 padding = 'same'))
model.add(BatchNormalization())
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform',
                     input_shape=INPUT_SHAPE, padding = 'same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))


model.add(Conv2D(256, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform',
                     input_shape=INPUT_SHAPE, padding = 'same'))
model.add(BatchNormalization())

model.add(Conv2D(256, (3, 3), activation='relu', kernel_initializer='he_uniform',
                 padding = 'same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))


model.add(Flatten())

model.add(Dense(512, activation='relu', kernel_initializer='he_uniform'))
#model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(NUM_CLASSES, activation='softmax'))


model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 28, 28, 64)        640       
_________________________________________________________________
batch_normalization_1 (Batch (None, 28, 28, 64)        256       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 64)        36928     
_________________________________________________________________
batch_normalization_2 (Batch (None, 28, 28, 64)        256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        36928     
_________________________________________________________________
batch_normalization_3 (Batch (None, 28, 28, 64)        256       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)       

Our model has 2,518,410 parameters. We then split the training dataset on the training set itself and validation set. The size of the validation part is 20 percent, meaning that the size of purely training set is 60 percent (25 persent were left for evaluation after training is complete earlier)

In [13]:
#VALIDATION_SPLIT = 0.25

X_tr, X_valid, y_tr, y_valid = train_test_split(X_train, y_train, test_size = 0.2, 
                                        stratify = y_train, random_state = 0)

The next step compile the model using categorical crossentropy as the loss function.

In [14]:
INITIAL_LR = 0.0025

model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr = INITIAL_LR),
                  metrics=['accuracy'])

#model.compile(loss='categorical_crossentropy', optimizer=Nadam(),
#                  metrics=['accuracy'])

Our model is  fitted using the checkpoint for each epoch. Best model based on validation accuracy will be used for prediction. The RMSppprop optimizer and variable learning rate are used in the process of training. 

In [15]:
# learning rate decay
def lr_decay(epoch):
    return INITIAL_LR * 0.96 ** epoch


checkpoint = ModelCheckpoint('weights.hdf5', monitor='val_accuracy', 
                            verbose=VERBOSE, save_best_only=True, mode='max')

#checkpoint = ModelCheckpoint('/kaggle/working/weights.hdf5', monitor='val_accuracy', 
#                             verbose=VERBOSE, save_best_only=True, mode='max')

callbacks_list = [checkpoint, LearningRateScheduler(lr_decay)]



In [16]:
%%time

history = model.fit_generator(
      train_datagen.flow(X_tr, y_tr, batch_size = BATCH_SIZE),
      steps_per_epoch = int(np.ceil(X_tr.shape[0]/BATCH_SIZE)),
      epochs = EPOCHS,
      callbacks = callbacks_list,
      validation_data = valid_datagen.flow(X_valid, y_valid),
      validation_steps =  int(np.ceil(X_valid.shape[0]/BATCH_SIZE)), 
      verbose = VERBOSE)

Epoch 1/40

Epoch 00001: val_accuracy improved from -inf to 0.70089, saving model to weights.hdf5
Epoch 2/40

Epoch 00002: val_accuracy did not improve from 0.70089
Epoch 3/40

Epoch 00003: val_accuracy did not improve from 0.70089
Epoch 4/40

Epoch 00004: val_accuracy did not improve from 0.70089
Epoch 5/40

Epoch 00005: val_accuracy improved from 0.70089 to 0.91220, saving model to weights.hdf5
Epoch 6/40

Epoch 00006: val_accuracy improved from 0.91220 to 0.98661, saving model to weights.hdf5
Epoch 7/40

Epoch 00007: val_accuracy improved from 0.98661 to 0.98810, saving model to weights.hdf5
Epoch 8/40

Epoch 00008: val_accuracy did not improve from 0.98810
Epoch 9/40

Epoch 00009: val_accuracy did not improve from 0.98810
Epoch 10/40

Epoch 00010: val_accuracy did not improve from 0.98810
Epoch 11/40

Epoch 00011: val_accuracy improved from 0.98810 to 0.99256, saving model to weights.hdf5
Epoch 12/40

Epoch 00012: val_accuracy did not improve from 0.99256
Epoch 13/40

Epoch 00013: 


Epoch 00040: val_accuracy did not improve from 0.99851
Wall time: 1d 3h 23min 28s


In [17]:
score = model.evaluate(X_eval, y_eval,verbose = VERBOSE)

print("Score on evalution set:", score[0])
print('Accuracy on evaluation set:', score[1])

Score on evalution set: 0.026426915126011568
Accuracy on evaluation set: 0.9943052530288696


As a result of training during 40 epochs, we achieved the accuracy of 0.994305 on the evaluation set.

## Prediction and Submission

In [18]:
#prediction

model.load_weights('weights.hdf5')
#model.load_weights('/kaggle/working/weights_v5d.hdf5')

y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis = 1)

# creating submission file
subm = pd.DataFrame({'id': test_df.iloc[:,0].values,
                       'label': y_pred_classes})

subm.to_csv('submission.csv', index=False)

Testing the model on the Kaggle provided test set (which is a public test set) gave the accuracy of 0.98820. The same accuracy was obtained when the model was applied to the private test set to form the private leaderboard. 