# Digit Recognition 99.185%

The MNIST dataset is the de facto "hello world" dataset of computer vision. It contains images of handwritten digits 0-9. In this notebook, we will take a look at some of the data, preprocess it, and build models for classifying the digits. 

## Import Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib
import matplotlib.pyplot as plt

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

from sklearn.model_selection import train_test_split

# Input data files are available in the "../input/" directory.
# The loop below lists all files under the input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/sample_submission.csv
/kaggle/input/digit-recognizer/test.csv


## Import Dataset

In [2]:
train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')

In [3]:
train.head()

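|   | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |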


Separate the training dataset into training and validation sets.

In [4]:
X_train, X_val, y_train, y_val = train_test_split(train.iloc[:, 1:], train.iloc[:, 0], test_size=0.2)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_val:', X_val.shape)
print('y_val:', y_val.shape)


X_train: (33600, 784)
y_train: (33600,)
X_val: (8400, 784)
y_val: (8400,)
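
`train_test_split` shuffles the rows at random, so the exact split changes from run to run unless a seed is fixed (for example through its `random_state` parameter). The sketch below illustrates the idea of a seeded 80/20 hold-out split using plain NumPy index shuffling. The helper name `holdout_indices` and the seed value are illustrative assumptions, not part of the original notebook, and the indices it produces will not match scikit-learn's.

```python
import numpy as np


def holdout_indices(n_samples, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)       # fixed seed makes the split repeatable
    order = rng.permutation(n_samples)      # random ordering of all row indices
    n_val = int(round(n_samples * val_fraction))
    return order[n_val:], order[:n_val]     # remaining rows train, first n_val rows validate


# 42000 rows is the size of the training CSV implied by the shapes printed above.
train_idx, val_idx = holdout_indices(42000)
print(len(train_idx), len(val_idx))
```

With 42,000 rows and a validation fraction of 0.2, this gives the same 33,600 / 8,400 sizes as the output above.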


Let's take a look at some data.

In [5]:
@interact
def show_digital(x=(0, 1000)):
    # Draw the x-th training image as a 28x28 grid and print its label.
    plt.imshow(np.array(X_train.iloc[x]).reshape((28, 28)))
    print('The number is', y_train.iloc[x])



## Preprocessing

The only preprocessing we will do here is to scale the pixel values from [0, 255] to the range [0, 1] and convert the pandas data structures into NumPy arrays.

In [6]:
def normalization(X):
    X = X / 255.0
    return X

In [7]:
X_train = np.array(normalization(X_train))
X_val = np.array(normalization(X_val))
y_train = np.array(y_train)
y_val = np.array(y_val)
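
As a quick sanity check of the scaling step, here is a minimal sketch on a toy array (the three pixel values are made up for illustration). Dividing 8-bit intensities by 255.0 yields floating-point values between 0 and 1.

```python
import numpy as np

# Toy row of 8-bit pixel intensities: darkest, mid-gray, brightest.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Same operation as normalization(): divide by the maximum intensity.
scaled = pixels / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```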

## Modeling

We will run a couple of classification algorithms on the dataset, then train the best one on the entire training set (train + val) and use it to predict the test set.
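
For scikit-learn classifiers, `score(X, y)` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. A minimal sketch of that metric, using a hypothetical `accuracy` helper and made-up labels:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Toy labels: four of the five predictions are correct.
y_true = [3, 1, 4, 1, 5]
y_pred = [3, 1, 4, 0, 5]
print(accuracy(y_true, y_pred))
```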

### Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [8]:
%%time
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
print('Score:', LR.score(X_val, y_val))

Score: 0.9158333333333334
CPU times: user 12.7 s, sys: 84 ms, total: 12.8 s
Wall time: 12.9 s




### Support Vector Machines (SVM)
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [9]:
%%time
from sklearn.svm import LinearSVC
SVM = LinearSVC().fit(X_train, y_train)
print('Score:', SVM.score(X_val, y_val))

Score: 0.9067857142857143
CPU times: user 54 s, sys: 120 ms, total: 54.2 s
Wall time: 54.2 s




### K-Nearest Neighbors (KNN)
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [10]:
%%time
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_train, y_train)
print('Score:', KNN.score(X_val, y_val))

Score: 0.9665476190476191
CPU times: user 11min 37s, sys: 1.43 s, total: 11min 39s
Wall time: 6min 5s


### Convolutional Neural Network (CNN)
https://keras.io/layers/convolutional/

#### Import Libraries

In [11]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical
from keras.losses import categorical_crossentropy
from keras.optimizers import Adadelta

Using TensorFlow backend.


#### Preprocessing

In [12]:
X_train = X_train.reshape(-1, 28, 28)
X_train = np.expand_dims(X_train, axis=1)
X_val = X_val.reshape(-1, 28, 28)
X_val = np.expand_dims(X_val, axis=1)
y_train = to_categorical(y_train, 10)
y_val = to_categorical(y_val, 10)


In [13]:
X_train.shape

(33600, 1, 28, 28)
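
The reshaping and label encoding can be illustrated on a few dummy samples. This is a sketch under stated assumptions: the five zero-filled images and the label values are made up, and `np.eye(10)[labels]` is used as a NumPy stand-in for one-hot encoding rather than Keras's `to_categorical`.

```python
import numpy as np

# Five dummy images, each stored as a flat row of 784 pixels.
flat = np.zeros((5, 784))

# Restore the 28x28 grid, then add a channel axis at position 1.
images = flat.reshape(-1, 28, 28)        # (5, 28, 28)
images = np.expand_dims(images, axis=1)  # (5, 1, 28, 28)

# One-hot encode integer labels by indexing rows of a 10x10 identity matrix.
labels = np.array([1, 0, 4, 9, 2])
one_hot = np.eye(10)[labels]             # (5, 10)

print(images.shape, one_hot.shape)
```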

#### Model Building and Training


In [14]:
def CNN_model():
    model = Sequential()
    model.add(Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(1,28,28), data_format='channels_first'))
    model.add(MaxPooling2D((2,2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2,2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2,2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10, activation='softmax'))
    return model

In [15]:
%%time
CNN = CNN_model()

CNN.compile(loss=categorical_crossentropy,
              optimizer=Adadelta(),
              metrics=['accuracy'])

CNN.fit(X_train, y_train,
          batch_size=128,
          epochs=30,
          verbose=1,
          validation_data=(X_val, y_val))


Train on 33600 samples, validate on 8400 samples
Epoch 1/30
...
Epoch 30/30
CPU times: user 1min 58s, sys: 15.6 s, total: 2min 14s
Wall time: 1min 53s


<keras.callbacks.callbacks.History at 0x7fbe9cc69128>

## Prediction

The convolutional neural network scored the highest validation accuracy, so we will retrain that model on the whole training set and use it to make predictions on the test set.

### Preprocessing

In [16]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test

In [17]:
X_train = np.array(normalization(X_train))
y_train = np.array(y_train)
X_test = np.array(normalization(X_test))

In [18]:
X_train = X_train.reshape(-1, 28, 28)
X_train = np.expand_dims(X_train, axis=1)
X_test = X_test.reshape(-1, 28, 28)
X_test = np.expand_dims(X_test, axis=1)
y_train = to_categorical(y_train, 10)

### Training

In [19]:
CNN = CNN_model()

CNN.compile(loss=categorical_crossentropy,
              optimizer=Adadelta(),
              metrics=['accuracy'])

CNN.fit(X_train, y_train,
          batch_size=128,
          epochs=30,
          verbose=1)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.callbacks.History at 0x7fbe9c475fd0>

### Predicting

In [20]:
prediction_prob = CNN.predict(X_test)

In [21]:
prediction = np.argmax(prediction_prob, axis=1)

In [22]:
prediction

array([2, 0, 9, ..., 3, 9, 2])
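
The step from class probabilities to digit labels can be seen on a toy example. The probability values below are made up and use three classes instead of ten for brevity. `np.argmax(..., axis=1)` picks, for each row, the index of the largest probability.

```python
import numpy as np

# Toy softmax outputs for two samples over three classes.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])

# The index of the largest value in each row is the predicted class.
labels = np.argmax(probs, axis=1)
print(labels)
```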

In [23]:
prediction_df = {"ImageId":range(1, X_test.shape[0]+1), "Label":prediction}
prediction_df = pd.DataFrame(prediction_df)
prediction_df.to_csv("prediction.csv", index = False)
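
To show the layout of the file written above, here is a minimal sketch that builds the same two-column CSV in memory with the standard library. The three labels are the first three values of the `prediction` array shown earlier; writing to an `io.StringIO` buffer instead of `prediction.csv` is a choice made for this illustration.

```python
import csv
import io

# First three predicted labels from the array printed above.
predictions = [2, 0, 9]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])

# ImageId starts at 1, matching range(1, X_test.shape[0] + 1).
for image_id, label in enumerate(predictions, start=1):
    writer.writerow([image_id, label])

submission_text = buffer.getvalue()
print(submission_text)
```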