**MNIST Handwritten Digit Classification**
---

MNIST stands for the Modified National Institute of Standards and Technology dataset. Its training set contains 60,000 grayscale images, each 28×28 pixels, of single handwritten digits between 0 and 9. The task is to classify a given image of a handwritten digit into one of 10 classes representing the integer values from 0 to 9. We are going to use the Keras API to train this model.

### Importing modules

**numpy** - NumPy is a Python library for working with arrays.
It also provides functions for linear algebra, Fourier transforms, and matrices.

**csv** - The csv module from the Python standard library reads and writes CSV (Comma-Separated Values) files. A CSV file stores tabular data (numbers and text) in plain text, such as a spreadsheet or database export. Each line of the file is a data record, and each record consists of one or more fields separated by commas, which is where the format gets its name.

**tqdm** - tqdm is an external Python library for adding simple progress bars to loops. A progress bar gives the user feedback on how far a long-running step has progressed and an estimate of how long it will take to complete, which is useful when loading thousands of images.

**os** - The os module is one of Python's standard utility modules. It provides a portable way of using operating-system-dependent functionality. The *os* and *os.path* modules include many functions for interacting with the file system.

**pandas** - pandas provides ready-to-use, high-performance data structures and data analysis tools. It is built on top of NumPy and is widely used in data science and data analytics.

**keras** - Keras provides predefined classes, functions, and variables that are useful for building deep learning models. By default, Keras runs on top of the TensorFlow backend.

In [None]:
import numpy as np
import csv
from tqdm import tqdm
import os

import pandas as pd
import keras
from keras.preprocessing.image import load_img, img_to_array

In [None]:
from keras.layers import Input, Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2
from keras.callbacks import ModelCheckpoint
from keras.utils import plot_model



import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

%matplotlib inline

In [None]:
DATA_DIR = "/content/drive/My Drive/DATA/identify-digits/Images/train/"
TRAIN_FILE = "/content/drive/My Drive/DATA/identify-digits/train.csv"
TEST_FILE = "/content/drive/My Drive/DATA/identify-digits/Test_fCbTej3_0j1gHmj.csv"
TEST_DIR = "/content/drive/My Drive/DATA/identify-digits/Images/test/"

IMG_SIZE = 28
MODEL_NAME = "{}-{}".format("1-conv-layers", 1)

### Creating, processing, and loading the dataset

**Categorical** data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set, and each value represents a different category. Categorical variables are often called nominal. Many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric. This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert the model's predictions back into categorical form in order to present them or use them in an application. Here, a one-hot encoding is applied to the integer representation of each digit.
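
As a rough illustration of the round trip described above, the sketch below one-hot encodes a few integer labels and then recovers them with argmax. The helper names `to_one_hot` and `from_one_hot` are hypothetical and are not part of this notebook; NumPy is assumed to be installed.

```python
import numpy as np

def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    one_hot = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by labels[i].
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot

def from_one_hot(one_hot):
    """Recover integer labels as the index of the largest entry in each row."""
    return np.argmax(one_hot, axis=1)

encoded = to_one_hot([3, 0, 9])   # hypothetical labels
decoded = from_one_hot(encoded)
print(encoded[0])
print(decoded)
```

The same argmax step is what turns a row of predicted class probabilities back into a digit.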

In [None]:
# convert an integer label (0-9) into a one-hot encoded list of length 10
def label_img(label):
  one_hot = [0] * 10
  one_hot[label] = 1
  return one_hot

In [None]:
# reads the training CSV, loads each image as a grayscale array, and returns
# an array of [one-hot encoded label, image array] pairs
def create_train_data():
  training_data = []
  with open(TRAIN_FILE, 'r') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row

    for row in tqdm(csvreader):
      filename, label = row[0], label_img(int(row[1]))
      path = DATA_DIR + filename
      img = load_img(path, color_mode='grayscale')
      img = img_to_array(img)
      training_data.append([np.array(label), img])
  # the with block has already closed the file; dtype=object keeps the
  # differently shaped [label, image] pairs in one array
  training_data = np.array(training_data, dtype=object)
  np.save("training_data.npy", training_data)
  return training_data

In [None]:
# reads the test CSV, loads each test image as a grayscale array, and returns
# an array of image arrays (no labels are read for the test set)
def create_test_data():
  testing_data = []
  with open(TEST_FILE, 'r') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row

    for row in tqdm(csvreader):
      filename = row[0]
      path = TEST_DIR + filename  # test images are in TEST_DIR, not DATA_DIR
      img = load_img(path, color_mode='grayscale')
      img = img_to_array(img)
      testing_data.append(img)
  testing_data = np.array(testing_data)
  np.save("testing_data.npy", testing_data)  # save the test array, not training_data
  return testing_data

We **normalize** the images so that the model converges faster. When the data is not normalized, the shared weights of the network are calibrated differently for different features, which can make the cost function converge slowly and ineffectively. Scaling the pixel values from the range 0 to 255 down to the range 0 to 1 makes the cost function much easier to optimize.
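
A minimal sketch of this scaling step on a tiny made-up array, assuming 8-bit pixel values (0 to 255) and that NumPy is installed. The 2×2 "image" is purely illustrative and is not taken from the dataset.

```python
import numpy as np

# Hypothetical 2x2 "image" with raw 8-bit pixel intensities.
raw = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Divide by the largest 8-bit value so every pixel lies in [0, 1],
# then store the result as float32, a common dtype for network inputs.
scaled = (raw / 255.0).astype(np.float32)

print(scaled.min(), scaled.max(), scaled.dtype)
```

Whatever scaling is applied to the training images has to be applied to any image passed to the model at prediction time as well.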

In [None]:
def load_dataset(test=False):
  if os.path.exists('/content/drive/My Drive/DATA/identify-digits/' +  'training_data.npy'):
    training_data = np.load('/content/drive/My Drive/DATA/identify-digits/' +  'training_data.npy', allow_pickle=True)
  else:
    training_data = create_train_data()
  
  X_train = np.array([img[1] for img in training_data])
  X_train = X_train/255.
  X_train = X_train.astype('float32')
  print("Data Normalized!!")
  Y_train = np.array([img[0] for img in training_data])

  print("Total training data images = ", X_train.shape[0])
  print("Images data shape = ", X_train.shape[1:])
  print("Total categories = ", Y_train.shape[1])

  X_test = []
  if(test):
    if os.path.exists('/content/drive/My Drive/DATA/identify-digits/'+'testing_data.npy'):
      testing_data = np.load('/content/drive/My Drive/DATA/identify-digits/'+'testing_data.npy', allow_pickle=True)
    else:
      testing_data = create_test_data()
    
    # testing_data is already an array of images with a channel axis,
    # so no per-item indexing or extra reshape is needed
    X_test = np.array(testing_data)
    X_test = X_test/255.
    X_test = X_test.astype('float32')

    print("Total testing data images = ", X_test.shape[0])
    print("Images data shape = ", X_test.shape[1:])
  
  return X_train, Y_train, X_test

In [None]:
X_train, Y_train, X_test = load_dataset(False)

### Defining Model

Since this is a small network, the convolutional front end starts with two ***convolutional layers*** with a small filter size (3,3) and 32 filters each, followed by a *max pooling layer*. A second block repeats this pattern with 64 filters. The resulting feature maps are then flattened to provide features to the classifier.

Given that the problem is a *multi-class classification* task, we need an output layer with 10 nodes in order to predict the probability distribution of an image belonging to each of the 10 classes. This also requires a *softmax* activation function on the output layer. Between the feature extractor and the output layer, we add a dense layer to interpret the features, in this case with 128 nodes.

All hidden layers use the ReLU activation function; the output layer uses softmax.

The categorical cross-entropy loss function, which is suited to multi-class classification, will be optimized. We will monitor classification accuracy, which is an appropriate metric when each of the 10 classes has roughly the same number of examples.
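
To make the softmax and categorical cross-entropy terms concrete, here is a small NumPy sketch for a single example. The function names and the score values are made up for illustration; Keras computes these internally, and this is not its implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

# Made-up scores for the 10 digit classes; class 7 has the highest score.
logits = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 3.0, 0.1, 0.4])
probs = softmax(logits)

y_true = np.zeros(10)
y_true[7] = 1.0  # one-hot label for the digit 7

loss = categorical_cross_entropy(y_true, probs)
print(np.argmax(probs), loss)
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the correct class, so it shrinks toward 0 as that probability approaches 1.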

The digitModel() function below will define and return this model.

In [None]:
def digitModel(input_shape):

  X_input = Input(input_shape)

  X = Conv2D(32, kernel_size=(3,3), activation='relu', padding='same')(X_input)
  X = Conv2D(32,  kernel_size=(3,3), activation='relu', padding='same')(X)
  X = MaxPool2D(pool_size=(2, 2), padding='same')(X)

  X = Conv2D(64, kernel_size=(3,3), activation='relu')(X)
  X = Conv2D(64, kernel_size=(3,3), activation='relu')(X)
  X = MaxPool2D(pool_size=(2, 2))(X)

  X = Flatten()(X)

  X = Dense(128, activation='relu')(X)

  X = Dense(10, activation='softmax')(X)

  model = Model(X_input, X)

  return model

In [None]:
model = digitModel(X_train.shape[1:])

In [None]:
model.summary()
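
The spatial sizes that `model.summary()` reports can be traced by hand. The sketch below follows a 28×28 input (matching `IMG_SIZE = 28`) through the layers of `digitModel`, assuming stride-1 convolutions and pooling with a stride equal to the pool size, which are the Keras defaults. The helpers `conv_out` and `pool_out` are hypothetical names used only for this illustration.

```python
def conv_out(size, kernel, padding):
    """Output width/height of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool, padding):
    """Output width/height of a pooling layer whose stride equals its pool size."""
    if padding == "same":
        return -(-size // pool)  # ceiling division
    return size // pool          # 'valid' drops any remainder

size = 28                              # input is 28x28x1
size = conv_out(size, 3, "same")       # Conv2D(32), same  -> 28
size = conv_out(size, 3, "same")       # Conv2D(32), same  -> 28
size = pool_out(size, 2, "same")       # MaxPool2D, same   -> 14
size = conv_out(size, 3, "valid")      # Conv2D(64), valid -> 12
size = conv_out(size, 3, "valid")      # Conv2D(64), valid -> 10
size = pool_out(size, 2, "valid")      # MaxPool2D, valid  -> 5

flat_features = size * size * 64           # Flatten -> 5 * 5 * 64
dense_params = flat_features * 128 + 128   # weights + biases of Dense(128)
print(size, flat_features, dense_params)
```

Most of the trainable parameters sit in the first dense layer, since every one of the flattened features connects to each of its 128 nodes.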

---

Deep learning models can take hours, days, or even weeks to train. If the run is stopped unexpectedly, you can lose a lot of work. To guard against this, we use the ModelCheckpoint callback from the keras.callbacks module, which saves the model to disk during training.

We monitor validation accuracy and save only the best model seen so far.
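
The "save only the best" rule can be pictured with a few lines of plain Python. This is a simplified sketch of the idea, not Keras's implementation; the function name `epochs_that_save` and the accuracy values are made up for illustration.

```python
def epochs_that_save(val_accuracies):
    """Return the 1-based epochs at which a 'save best only' rule would write a file."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:   # strictly better than anything seen so far
            best = acc
            saved.append(epoch)
    return saved

# Made-up validation accuracies for 5 epochs, for illustration only.
example = [0.91, 0.95, 0.94, 0.97, 0.97]
saved_epochs = epochs_that_save(example)
print(saved_epochs)
```

Under this rule the file on disk always holds the weights from the best epoch so far, so a later, worse epoch never overwrites it.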

In [None]:
checkpoint = ModelCheckpoint('{}.h5'.format(MODEL_NAME),
                            monitor='val_accuracy',
                            save_best_only=True)

We are going to use the **Adam** optimizer and **categorical_crossentropy** as our loss function.


In [None]:
#opt = Adam(learning_rate=0.0005)
model.compile('adam',loss='categorical_crossentropy', metrics=['accuracy'])

We have defined our model and compiled it. Now it is time to train the model on the loaded data by calling the fit() function. Training occurs over epochs, and each epoch is split into batches.
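
As a rough guide to how epochs and batches relate, the sketch below counts the weight updates per epoch for a given batch size and validation split. The dataset size of 10,000 is a hypothetical number chosen to keep the arithmetic simple, and the helper `steps_per_epoch` is not part of Keras or this notebook.

```python
import math

def steps_per_epoch(num_samples, batch_size, validation_split=0.0):
    """Number of weight updates in one epoch; the last batch may be smaller."""
    # Assumption: the validation share is held out first and the
    # remaining samples are used for training.
    num_train = int(num_samples * (1.0 - validation_split))
    return math.ceil(num_train / batch_size)

# 10,000 samples is a hypothetical dataset size, for illustration only.
steps = steps_per_epoch(10_000, batch_size=512, validation_split=0.1)
total_updates = steps * 7   # 7 epochs
print(steps, total_updates)
```

A larger batch size means fewer, smoother updates per epoch; a smaller one means more, noisier updates.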

In [None]:
history = model.fit(X_train, Y_train, batch_size=512, epochs=7, callbacks=[checkpoint], validation_split=0.1)

### History

The model achieves 98% validation accuracy, which is a good result.

In [None]:
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In this case, we can see that the model generally achieves a good fit, with the training and validation learning curves converging. There is no obvious sign of **over-fitting** or **under-fitting**.
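
One simple way to put a number on this reading of the curves is to compare the final training and validation values. The sketch below does that for a dictionary shaped like `history.history`; the helper name `final_gap` is hypothetical, and the curves are made up for illustration and are not results from this notebook.

```python
def final_gap(history_dict, metric="accuracy"):
    """Training value minus validation value at the last epoch."""
    train = history_dict[metric]
    val = history_dict["val_" + metric]
    return train[-1] - val[-1]

# Made-up curves for illustration only; these are not results from this notebook.
fake_history = {
    "accuracy":     [0.90, 0.96, 0.98],
    "val_accuracy": [0.93, 0.96, 0.97],
}
gap = final_gap(fake_history)
print(round(gap, 4))
```

A large positive accuracy gap (training well above validation) is a common hint of over-fitting, while low values on both curves suggest under-fitting; a small gap like the one above is consistent with a good fit.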

We now write the predictions for the test images to a submission file on Drive so that it can be downloaded.

In [None]:
TEST_FILE = "/content/drive/My Drive/DATA/identify-digits/Test_fCbTej3_0j1gHmj.csv"
TEST_DIR = "/content/drive/My Drive/DATA/identify-digits/Images/test/"
SUB_FILE = "/content/drive/My Drive/DATA/identify-digits/sample_submission_npBPSZB.csv"
with open(TEST_FILE, "r") as f1, open(SUB_FILE, 'w') as f2:
  csvreader = csv.reader(f1)
  next(csvreader)  # skip the header row

  f2.write('filename,label\n')

  for row in tqdm(csvreader):
    file_name = row[0]
    path = TEST_DIR + file_name
    img = load_img(path, color_mode='grayscale')
    # scale to [0, 1] to match the normalization applied to the training data
    x = img_to_array(img) / 255.
    x = np.expand_dims(x, axis=0)  # add the batch axis

    val = np.argmax(model.predict(x))

    f2.write('{},{}\n'.format(file_name, int(val)))
# both files are closed automatically when the with block exits

Testing a single image to check that the model works as expected.

In [None]:
TEST_DIR = "/content/drive/My Drive/DATA/identify-digits/Images/test/"
path = TEST_DIR + '49000.png'
img = load_img(path, color_mode='grayscale')
# scale to [0, 1] to match the normalization applied to the training data
x = img_to_array(img) / 255.
x = np.expand_dims(x, axis=0)  # add the batch axis

print(np.argmax(model.predict(x)))
plt.imshow(img, cmap='gray')
plt.show()

In [None]:
from keras.models import load_model
model = load_model('/content/drive/My Drive/DATA/identify-digits/1-conv-layers-1.h5')

In [None]:
model.evaluate(x=X_train, y=Y_train)