## MNIST Dataset Notebook
A Jupyter notebook explaining how to read the MNIST dataset efficiently into memory in Python.

## Contents
- [MNIST database](#id1)
    - [Dataset](#id2)
    - [Performance](#id3)
    - [Classifiers](#id4)
- [References](#idr)

<a id="id1"></a>
## MNIST database
- The **MNIST** database (**Modified National Institute of Standards and Technology database**) is a large database of **handwritten digits** that is commonly used for training various image processing systems.
- The database is also widely used for training and testing in the field of **machine learning**.
- It was created by "re-mixing" the samples from **NIST's original datasets**.
- The creators felt that since **NIST's** training dataset was taken from **American Census Bureau employees**, while the testing dataset was taken from **American high school students**, it was not well-suited for machine learning experiments.
- Furthermore, the black and white images from **NIST** were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced **grayscale levels**.
- The **MNIST** database **contains 60,000 training images** and **10,000 testing images**.
- Half of the training set and half of the test set were taken from **NIST's** training dataset, while the other half of the training set and the other half of the test set were taken from **NIST's** testing dataset.
- An **extended dataset** similar to **MNIST**, called **EMNIST**, was published in 2017. It contains **240,000 training images** and **40,000 testing images** of handwritten digits and characters.
<img src="Images/Sample MNIST.png" alt="Sample MNIST" title="Sample MNIST" />
__*Sample images from MNIST test dataset.*__

<a id="id2"></a>
### Dataset
- The set of images in the **MNIST** database is a combination of two of **NIST's** databases: **Special Database 1** and **Special Database 3**. **Special Database 1** consists of digits written by high school students, and **Special Database 3** consists of digits written by employees of the **United States Census Bureau**.

<a id="id3"></a>
### Performance
- Some researchers have achieved **"near-human performance"** on the **MNIST** database, using a committee of **neural networks**; in the same paper, the authors achieve performance double that of humans on other recognition tasks.
- The highest error rate listed on the original website of the database is **12 percent**, achieved using a **simple linear classifier with no preprocessing**.
- **In 2004**, a best-case error rate of **0.42 percent** was achieved on the database by researchers using a **new classifier** called **LIRA**, which is a **neural classifier with three neuron layers** based on **Rosenblatt's** perceptron principles.
- Some researchers have tested artificial intelligence systems on versions of the database subjected to random distortions. The systems in these cases are usually neural networks, and the distortions used tend to be either affine distortions or elastic distortions. Sometimes these systems can be very successful; one such system achieved an error rate on the database of **0.39 percent**.
- **In 2011**, an error rate of **0.27 percent**, improving on the previous best result, was reported by researchers using a similar system of neural networks.
- **In 2013**, an approach based on regularization of neural networks using DropConnect was claimed to achieve a **0.21 percent** error rate.
- **More recently**, the best performance of a single convolutional neural network was a **0.31 percent** error rate. As of August 2018, the best performance of a single convolutional neural network trained on **MNIST** training data using real-time data augmentation is a **0.26 percent** error rate.
- The Parallel Computing Center (Khmelnitskiy, Ukraine) also obtained an ensemble of only 5 convolutional neural networks that performs on **MNIST** at a **0.21 percent** error rate.
- **Incorrect labeling of the testing dataset may prevent reaching test error rates of 0%**.
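
To put these percentages in perspective, the short sketch below converts an error rate into a number of misclassified images, assuming the 10,000-image test set described earlier. The helper name `misclassified_count` is made up for this illustration.

```python
TEST_SET_SIZE = 10_000  # number of MNIST test images


def misclassified_count(error_rate_percent, num_images=TEST_SET_SIZE):
    """Convert an error rate in percent into a whole number of misclassified images."""
    return round(error_rate_percent / 100 * num_images)


for rate in (12, 0.42, 0.27, 0.21):
    print(f"{rate}% error -> {misclassified_count(rate)} of {TEST_SET_SIZE} test images")
```

At a 0.21 percent error rate only 21 test images are misclassified, which is why even a few incorrectly labeled test images can make a 0% error rate unreachable.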

<a id="id4"></a>
### Classifiers
| Type | Classifier | Distortion | Preprocessing | Error rate (%) |
| ------------- | ------------ | ----------- | ------------ | ----------- |
|Deep neural network|2-layer 784-800-10|None|None|1.6|
|Deep neural network|2-layer 784-800-10|elastic distortions|None|0.7|
|Non-linear classifier|40 PCA + quadratic classifier|None|None|3.3|
|Deep neural network|6-layer 784-2500-2000-1500-1000-500-10|elastic distortions|None|0.35|
|Convolutional neural network|6-layer 784-40-80-500-1000-2000-10|None|Expansion of the training data|0.31|
|Convolutional neural network|6-layer 784-50-100-500-1000-10-10|None|Expansion of the training data|0.27|
|Convolutional neural network|Committee of 35 CNNs, 1-20-P-40-P-150-10|elastic distortions|Width normalizations|0.23|
|Convolutional neural network|Committee of 5 CNNs, 6-layer 784-50-100-500-1000-10-10|None|Expansion of the training data|0.21|
|K-Nearest Neighbors|K-NN with non-linear deformation (P2DHMDM)|None|Shiftable edges|0.52|
|Linear classifier|Pairwise linear classifier|None|Deskewing|7.6|
|Boosted Stumps|Product of stumps on Haar features|None|Haar features|0.87|
|Support vector machine|Virtual SVM, deg-9 poly, 2-pixel jittered|None|Deskewing|0.56|

## Handwritten Digit Recognition
- MNIST is a widely used dataset for the handwritten digit classification task. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes, one for each of the 10 digits. The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.
<img src="Images/MNIST.png" alt="MNIST" title="MNIST" />
__*Sample images from the MNIST dataset.*__
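
As a rough guide to what holding the whole dataset in memory costs, the sketch below multiplies out the image counts and dimensions given above. The bytes-per-pixel values (1 for 8-bit pixels, 4 for 32-bit floats) are assumptions made for illustration; the text does not say which representation a particular loader uses.

```python
NUM_TRAIN, NUM_TEST = 60_000, 10_000
HEIGHT = WIDTH = 28


def image_bytes(num_images, bytes_per_pixel):
    """Bytes needed to hold `num_images` single-channel 28x28 images."""
    return num_images * HEIGHT * WIDTH * bytes_per_pixel


total = NUM_TRAIN + NUM_TEST
# bytes_per_pixel is an assumption: 1 for 8-bit pixels, 4 for 32-bit floats.
print(f"8-bit pixels:  {image_bytes(total, 1) / 1e6:.1f} MB")
print(f"32-bit floats: {image_bytes(total, 4) / 1e6:.1f} MB")
```

Either way the full dataset is small enough to pre-load, which is not true of much larger datasets.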

### Loading Data
- Before we define the model, let’s first fetch the MNIST dataset.
- The following source code downloads and loads the images and the corresponding labels into memory.

In [2]:
import mxnet as mx
mnist = mx.test_utils.get_mnist()

# Fix the seed
mx.random.seed(42)

# Set the compute context: GPU if available, otherwise CPU
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

- After running the above source code, the entire MNIST dataset should be fully loaded into memory. Note that for large datasets it is not feasible to pre-load the entire dataset first as we did here. What is needed is a mechanism to quickly and efficiently stream data directly from the source. MXNet data iterators provide exactly that. A data iterator is the mechanism by which we feed input data into an MXNet training algorithm; iterators are simple to initialize and use, and they are optimized for speed.
- During training, we typically process training samples in small batches, and over the entire training lifetime we end up processing each training example multiple times. In this tutorial, we’ll configure the data iterator to feed examples in batches of 100. Keep in mind that each example is a 28x28 grayscale image and the corresponding label.

- Image batches are commonly represented by a 4-D array with shape (batch_size, num_channels, height, width). For the MNIST dataset, since the images are grayscale, there is only one color channel. The images are 28x28 pixels, so each image has height and width equal to 28. Therefore, the shape of the input is (batch_size, 1, 28, 28).
- Another important consideration is the order of input samples. When feeding training examples, it is critical that we don’t feed samples with the same label in succession, because doing so can slow down training. Data iterators take care of this by randomly shuffling the inputs. Note that we only need to shuffle the training data; the order does not matter for test data.

- The following source code initializes the data iterators for the MNIST dataset. Note that we initialize two iterators: one for train data and one for test data.

In [3]:
batch_size = 100
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
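
The batching and shuffling behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not how `NDArrayIter` is implemented; the helper `iterate_batches` and the toy data are hypothetical, and plain lists stand in for the image arrays so the snippet runs without MXNet.

```python
import random


def iterate_batches(data, labels, batch_size, shuffle=False, seed=None):
    """Yield (data_batch, label_batch) pairs that cover the dataset once."""
    indices = list(range(len(data)))
    if shuffle:
        # Shuffle indices rather than the data so each example stays paired with its label.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chosen = indices[start:start + batch_size]
        yield [data[i] for i in chosen], [labels[i] for i in chosen]


# Toy stand-in for MNIST: 10 "images" sorted by label, the ordering the text warns against.
toy_data = [f"img{i}" for i in range(10)]
toy_labels = [i // 5 for i in range(10)]  # five 0s followed by five 1s

for batch_data, batch_labels in iterate_batches(toy_data, toy_labels, batch_size=4,
                                                shuffle=True, seed=42):
    print(batch_labels, batch_data)
```

In this sketch the last batch is simply shorter when the dataset size is not a multiple of the batch size. One pass over all batches corresponds to one epoch, and a fresh shuffle per epoch would use a different seed each time.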

### Training
- We will cover a couple of approaches for performing the handwritten digit recognition task. The first approach makes use of a traditional deep neural network architecture called the Multilayer Perceptron (MLP). We’ll discuss its drawbacks and use them as motivation to introduce a second, more advanced approach called the Convolutional Neural Network (CNN), which has proven to work very well for image classification tasks.

### Multilayer Perceptron
- The first approach makes use of a Multilayer Perceptron to solve this problem. We’ll define the MLP using MXNet’s symbolic interface. We begin by creating a placeholder variable for the input data. When working with an MLP, we need to flatten our 28x28 images into a flat 1-D structure of 784 (28 * 28) raw pixel values. The order of pixel values in the flattened vector does not matter as long as we are consistent about it across all images.

In [4]:
data = mx.sym.var('data')
# Flatten the data from 4-D shape into 2-D (batch_size, num_channel*width*height)
data = mx.sym.flatten(data=data)
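
The effect of flattening on shapes can be illustrated without MXNet. In the sketch below, nested Python lists stand in for the 4-D batch, and the helper names `flatten_image` and `flatten_batch` are made up for this example; it mirrors the shape change only, not MXNet's implementation.

```python
def flatten_image(image):
    """Flatten a (channels, height, width) nested list into one flat list, row by row."""
    return [pixel for channel in image for row in channel for pixel in row]


def flatten_batch(batch):
    """Turn a (batch_size, channels, height, width) batch into (batch_size, channels*height*width)."""
    return [flatten_image(image) for image in batch]


# Two blank 1x28x28 "images" stand in for an MNIST batch.
batch = [[[[0.0] * 28 for _ in range(28)]] for _ in range(2)]
flat = flatten_batch(batch)
print(len(flat), len(flat[0]))  # batch size and flattened length per image
```

Because every image is flattened in the same row-by-row order, a given position in the 784-element vector always refers to the same pixel location.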

- One might wonder if we are discarding valuable information by flattening. That is indeed true and we’ll cover this more when we talk about convolutional neural networks where we preserve the input shape. For now, we’ll go ahead and work with flattened images.

- MLPs contain several fully connected layers. A fully connected layer, or FC layer for short, is one where each neuron in the layer is connected to every neuron in its preceding layer. From a linear algebra perspective, an FC layer applies an affine transform to the n x m input matrix X and outputs a matrix Y of size n x k, where k is the number of neurons in the FC layer. k is also referred to as the hidden size. The output Y is computed according to the equation $Y = XW^{T} + b$. The FC layer has two learnable parameters: the k x m weight matrix W and the 1 x k bias vector b. The addition of the bias vector follows the broadcasting rules explained in mxnet.sym.broadcast_to(). Conceptually, broadcasting replicates the bias row vector n times to create an n x k matrix before the addition.

- In an MLP, the outputs of most FC layers are fed into an activation function, which applies an element-wise non-linearity. This step is critical and it gives neural networks the ability to classify inputs that are not linearly separable. Common choices for activation functions are sigmoid, tanh, and rectified linear unit (ReLU). In this example, we’ll use the ReLU activation function which has several desirable properties and is typically considered a default choice.

- The following code declares two fully connected layers with 128 and 64 neurons, respectively. Each FC layer is followed by a ReLU activation layer, which applies an element-wise ReLU transformation to the FC layer output.

In [5]:
# The first fully-connected layer and the corresponding activation function
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128)
act1 = mx.sym.Activation(data=fc1, act_type="relu")

# The second fully-connected layer and the corresponding activation function
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=64)
act2 = mx.sym.Activation(data=fc2, act_type="relu")
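
To make the arithmetic behind one FC layer and its activation concrete, here is a minimal plain-Python sketch of the affine transform (input times transposed weights, plus bias) followed by an element-wise ReLU. The function names and the tiny matrices are chosen only for illustration and are not part of MXNet.

```python
def fully_connected(X, W, b):
    """Compute Y = X W^T + b for an n x m input X, k x m weights W, and length-k bias b."""
    return [
        [sum(x_j * w_j for x_j, w_j in zip(x_row, w_row)) + b_i
         for w_row, b_i in zip(W, b)]  # the same bias is reused for every input row
        for x_row in X
    ]


def relu(Y):
    """Element-wise ReLU: negative entries become 0, others pass through unchanged."""
    return [[max(0.0, y) for y in row] for row in Y]


# n = 2 examples, m = 3 inputs, k = 2 neurons; values are arbitrary.
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.5, -1.0]

Y = fully_connected(X, W, b)
print(Y)        # n x k pre-activation outputs
print(relu(Y))  # same shape, negatives clipped to 0
```

Reusing `b` for every row of `X` is the broadcasting step: the 1 x k bias behaves as if it had been copied n times.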

- The last fully connected layer often has its hidden size equal to the number of output classes in the dataset. The activation function for this layer will be the softmax function. The Softmax layer maps its input to a probability score for each class of output. During the training stage, a loss function computes the cross entropy between the probability distribution (softmax output) predicted by the network and the true probability distribution given by the label.

- The following source code declares the final fully connected layer of size 10, which is the number of digit classes. The output from this layer is fed into a SoftmaxOutput layer that performs softmax and cross-entropy loss computation in one go. Note that loss computation only happens during training.

In [6]:
# MNIST has 10 classes
fc3  = mx.sym.FullyConnected(data=act2, num_hidden=10)
# Softmax with cross entropy loss
mlp  = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
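
The two computations that the softmax output layer combines can be written out separately. The sketch below is a plain-Python illustration under the assumption of a single example with an integer class label; it is not MXNet's implementation, and the scores are invented.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross entropy against a one-hot label: -log of the true class probability."""
    return -math.log(probs[label])


# Hypothetical scores from the final 10-unit layer for one image whose true digit is 3.
scores = [0.1, -1.2, 0.3, 4.0, 0.0, -0.5, 1.1, 0.2, -2.0, 0.7]
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(cross_entropy(probs, label=3), 4))
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is what training tries to achieve.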

<img src="Images/NA MNIST.png" alt="NA MNIST" title="NA MNIST" />
__*MLP network architecture for MNIST.*__

<a id="idr"></a>
###### References: 
- __[MNIST database](https://en.wikipedia.org/wiki/MNIST_database)__
- __[Handwritten Digit Recognition](https://mxnet.incubator.apache.org/tutorials/python/mnist.html)__
