# Deep MNIST Classification with a Convolutional Neural Network in TensorFlow

References: 

* [Tensorflow Tutorial](https://www.tensorflow.org/versions/master/get_started/mnist/pros)

MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:

![](https://www.tensorflow.org/images/MNIST.png)

In this notebook, we build a deep MNIST classifier using convolutional layers.

In [1]:
"""A deep MNIST classifier using convolutional layers.

See extensive documentation at
https://www.tensorflow.org/get_started/mnist/pros
"""

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data

## Data import 

The MNIST data is hosted on [this website](http://yann.lecun.com/exdb/mnist/). 


In [2]:
# Import Data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts: 

1. 55,000 data points of training data (`mnist.train`), 
2. 10,000 points of test data (`mnist.test`), 
3. 5,000 points of validation data (`mnist.validation`). 

This split is very important: it's of course essential in ML that we have separate data which we don't learn from so that we can make sure that what we've learned actually generalizes!

Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers:

![](https://www.tensorflow.org/images/MNIST-Matrix.png)

Flattening each image into a vector of 28*28 = 784 values gives `mnist.train.images`, a tensor (an n-dimensional array) of shape [55000, 784].
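
The flattening step can be sketched in plain Python. This is only an illustration: the helper name `flatten_image` is hypothetical, and a row-major pixel order is assumed rather than taken from the MNIST loader.

```python
# Hypothetical helper illustrating row-major flattening; not part of the tutorial code.
ROWS, COLS = 28, 28


def flatten_image(image):
    """Flatten a 2-D list of pixel rows into one flat list, row by row."""
    return [pixel for row in image for pixel in row]


# A dummy "image" whose pixel value encodes its (row, col) position.
image = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]
flat = flatten_image(image)

print(len(flat))  # number of values per flattened image

# Under the row-major assumption, pixel (row, col) lands at index row * 28 + col.
row, col = 3, 5
print(flat[row * COLS + col] == image[row][col])
```

Stacking 55,000 such vectors row by row gives the [55000, 784] shape quoted above.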

## Model creation

Each MNIST image shows a handwritten digit between zero and nine, so there are only ten possible things that a given image can be.
We want to look at an image and give the probability of it being each digit, which is why the model is based on softmax regression, with `softmax()` as the final activation function. `softmax()` produces values that sum to 1, so they map easily to probabilities, which makes it a natural last layer of the model.

* See also the [list of activation functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions)

A softmax regression has two steps: 

1. First, we add up the evidence of our input image being in certain classes. For that, we compute a weighted sum of the pixel intensities, $y = Wx + b$, where a weight is negative if a high intensity for that pixel is evidence against the image being in that class, and positive if it is evidence in favor.
2. Then we convert that evidence into probabilities through the application of the `softmax()` function.
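
The two steps above can be sketched in plain Python. The weights, biases and "pixels" below are made-up toy numbers, and the function names are illustrative, not TensorFlow API.

```python
import math


def evidence_for_classes(W, b, x):
    """Step 1: weighted sum of the pixel intensities plus a bias, one score per class."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def softmax(evidence):
    """Step 2: turn real-valued scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    shift = max(evidence)
    exps = [math.exp(e - shift) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]


# Toy problem (made-up numbers): 3 "pixels", 2 classes.
W = [[1.0, -1.0, 0.5],
     [-0.5, 2.0, 0.0]]
b = [0.1, -0.1]
x = [0.0, 1.0, 1.0]

scores = evidence_for_classes(W, b, x)  # evidence per class
probs = softmax(scores)                 # probabilities per class
print(scores, probs)
```

The second pixel has a negative weight for class 0 and a positive one for class 1, so its high intensity pushes the probability towards class 1.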


In [3]:
# Create the model
x  = tf.placeholder(tf.float32, [None, 784])  # Placeholder for the input images

y_ = tf.placeholder(tf.float32, [None, 10])   # Placeholder to input the **correct** answers

## Deep Neural Network

We are going to use a more sophisticated model than the simple one-layer softmax regression used previously: a small convolutional neural network.

![](https://www.tensorflow.org/images/mnist_deep.png)

The full code implementing the above graph is given below.
From bottom to top, the following steps are defined: 

* `reshape` step: reshapes the flat input into grayscale 28x28 images for use within a convolutional neural net. The last dimension is for "features": there is only one here, since images are grayscale; it would be 3 for an RGB image, 4 for RGBA, etc.
* `conv1`, the first convolutional layer, maps one grayscale image to 32 feature maps.
* `pool1`, the first pooling layer, downsamples by 2X.
* `conv2`, the second convolutional layer, maps 32 feature maps to 64.
* `pool2`, the second pooling layer, downsamples by 2X again.
* `fc1`, the densely (fully) connected layer 1: after two rounds of downsampling, our 28x28 image is down to 7x7x64 feature maps, which this layer maps to 1024 features.
* `dropout` step, which reduces overfitting: it controls the complexity of the model and prevents co-adaptation of features.
* Finally, a readout layer called `fc2`, just like in the one-layer softmax regression of the simple example, maps the 1024 features to 10 classes, one for each digit.
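
The sizes quoted in this list can be checked with a little arithmetic. The sketch below assumes the settings used in this notebook's code: 5x5 filters, `'SAME'` padding (output size is the input size divided by the stride, rounded up), stride 1 for convolutions and stride 2 for pooling. The helper name is hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size under 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)


size = 28
size = same_padding_output(size, 1)  # conv1, stride 1: stays 28
size = same_padding_output(size, 2)  # pool1, stride 2: 14
size = same_padding_output(size, 1)  # conv2, stride 1: stays 14
size = same_padding_output(size, 2)  # pool2, stride 2: 7
flat_features = size * size * 64     # 7 * 7 * 64 inputs to fc1

# Trainable parameters per layer: weights plus biases.
params = {
    'conv1': 5 * 5 * 1 * 32 + 32,
    'conv2': 5 * 5 * 32 * 64 + 64,
    'fc1': flat_features * 1024 + 1024,
    'fc2': 1024 * 10 + 10,
}
print(size, flat_features)
print(params, sum(params.values()))
```

Almost all of the roughly 3.27 million parameters sit in `fc1`, which is one reason dropout is applied right after that layer.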

Further explanations of some design choices follow.

In [4]:
def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.
  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.
  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability that a
    unit is kept (not dropped) by dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after two rounds of downsampling, our 28x28
  # image is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob
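
As a rough illustration of the `dropout` step in `deepnn`: `tf.nn.dropout` keeps each element with probability `keep_prob` and scales the kept elements by `1 / keep_prob`, so that the expected sum is unchanged. The plain-Python function below is a hypothetical stand-in for that behavior, not TensorFlow code.

```python
import random


def dropout_sketch(values, keep_prob, rng):
    """Keep each value with probability keep_prob, scaled by 1 / keep_prob; zero the rest."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0] * 10000  # made-up activations

train_out = dropout_sketch(activations, 0.5, rng)  # training: about half are zeroed
eval_out = dropout_sketch(activations, 1.0, rng)   # evaluation: everything is kept

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(eval_out == activations)
```

Because of the rescaling, the same network can be evaluated with `keep_prob` set to 1.0 without any other adjustment.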


### Toolbox functions

To create this model, we're going to need to create a lot of weights and biases:

* weights should generally be initialized with a small amount of noise for symmetry breaking, and to prevent 0 gradients;
* since [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) neurons will be used, it is also good practice to initialize the biases with a slightly positive value to avoid "dead neurons".

Instead of doing this repeatedly while we build the model, let's create two handy functions to do it for us.

In [5]:
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)
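
To make the initialization concrete: `tf.truncated_normal` draws from a normal distribution and re-picks any value more than two standard deviations from the mean. The plain-Python sketch below mimics that idea with simple rejection sampling; it is an illustration with a hypothetical function name, not TensorFlow's implementation.

```python
import random


def truncated_normal_sketch(n, stddev, rng, mean=0.0):
    """Draw n normal samples, redrawing any that fall more than 2 stddev from the mean."""
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# As many values as W_conv1 holds (5 * 5 * 1 * 32), with the same stddev of 0.1.
weights = truncated_normal_sketch(5 * 5 * 1 * 32, stddev=0.1, rng=rng)
biases = [0.1] * 32      # constant, slightly positive, as in bias_variable

print(len(weights), min(weights), max(weights))
```

Every weight ends up small, non-identical and within [-0.2, 0.2], which is the "small amount of noise" the text asks for.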

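TensorFlow also gives a lot of flexibility in convolution and pooling operations. How do we handle the boundaries? What is our stride size? In this example, we always choose the vanilla version: the convolutions use a stride of one and are zero-padded so that the output has the same size as the input, and the pooling is plain max pooling over 2x2 blocks. These choices are wrapped in the following functions: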


In [6]:
def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')


def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
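
What these two operations compute can be sketched on a single-channel toy image in plain Python. The functions below are simplified stand-ins with hypothetical names (one channel, odd kernel size, even image size), not the TensorFlow ops. Like `tf.nn.conv2d`, the sketch applies the kernel without flipping it.

```python
def conv2d_same_sketch(image, kernel):
    """Stride-1 2-D convolution with zero padding so the output size equals the input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    pad_h, pad_w = kh // 2, kw // 2  # assumes odd kernel sizes
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    r, c = i + di - pad_h, j + dj - pad_w
                    if 0 <= r < h and 0 <= c < w:  # outside the image counts as zero
                        total += image[r][c] * kernel[di][dj]
            out[i][j] = total
    return out


def max_pool_2x2_sketch(image):
    """Take the max over non-overlapping 2x2 blocks (assumes even height and width)."""
    return [[max(image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, len(image[0]), 2)]
            for i in range(0, len(image), 2)]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # 4x4 toy input
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]                      # kernel that copies the input

conv_out = conv2d_same_sketch(image, identity)
pooled = max_pool_2x2_sketch(conv_out)
print(len(conv_out), len(conv_out[0]))  # same spatial size as the input
print(pooled)                           # half the size in each dimension
```

With a 3x3 kernel of ones on an all-ones image, the corner outputs would be 4, the edge outputs 6 and the interior outputs 9, which shows the effect of the zero padding at the boundaries.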


## Build the graph and the deep network model

In [7]:
y_conv, keep_prob = deepnn(x)

## Loss Function, Training and Evaluation

How well does this model do? To train and evaluate it, we will use code that is nearly identical to that for the simple one-layer softmax network used previously. In particular, we still minimize the cross-entropy _loss_ function $H_{y'}(y) = -\sum_i y'_i \log(y_i)$, where $y = \mathrm{softmax}(y\_conv)$ is our predicted probability distribution and $y'$ is the true distribution. As before, we use its numerically stable version, `tf.nn.softmax_cross_entropy_with_logits`, applied to the raw outputs `y_conv`, and then average across the batch.

The differences are that:

* We will replace the steepest gradient descent optimizer with the more sophisticated ADAM optimizer.
* We will include the additional parameter `keep_prob` in `feed_dict` to control the dropout rate (0.5 during training, 1.0 during evaluation) and thus reduce overfitting.
* We will add logging to every 1000th iteration in the training process.
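
To see why a numerically stable cross-entropy matters, the sketch below compares a direct computation with one common stabilization, the log-sum-exp trick. It is written in plain Python for a single example with a one-hot label; it illustrates the idea and does not claim to reproduce the internals of `tf.nn.softmax_cross_entropy_with_logits`.

```python
import math


def naive_cross_entropy(logits, label_index):
    """-log(softmax(logits)[label]) computed directly; breaks down for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label_index] / sum(exps))


def stable_cross_entropy(logits, label_index):
    """Same quantity via the log-sum-exp trick: logsumexp(logits) - logits[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label_index]


small = [2.0, 1.0, 0.1]
print(naive_cross_entropy(small, 0), stable_cross_entropy(small, 0))  # the two agree

large = [1000.0, 0.0, -1000.0]
print(stable_cross_entropy(large, 1))  # stays finite
try:
    naive_cross_entropy(large, 1)
except OverflowError as err:           # math.exp(1000.0) does not fit in a float
    print('naive version failed:', err)
```

Working directly on the logits is what lets the stable version avoid ever forming the huge or vanishing exponentials.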


In [9]:
# The raw formulation of cross-entropy,
#
#   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y_conv)),
#                                 reduction_indices=[1]))
#
# can be numerically unstable.
#
# So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
# outputs of 'y_conv', and then average across the batch.
with tf.name_scope('loss'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_,
                                                                           logits=y_conv))
# Training
with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# Stepwise accuracy computation
with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    correct_prediction = tf.cast(correct_prediction, tf.float32)
    accuracy = tf.reduce_mean(correct_prediction)
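
For readers curious about what the ADAM optimizer does differently from plain gradient descent, here is a sketch of the textbook Adam update on a single scalar parameter. It uses the commonly quoted defaults for `beta1`, `beta2` and `epsilon` and a toy objective; it is an illustration, not the code behind `tf.train.AdamOptimizer`.

```python
import math


def adam_minimize(grad_fn, theta, learning_rate, steps,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Textbook Adam on a single scalar parameter; returns the final value."""
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta


def grad(theta):
    """Gradient of the toy objective f(theta) = theta ** 2."""
    return 2.0 * theta


after_one = adam_minimize(grad, theta=1.0, learning_rate=0.01, steps=1)
after_many = adam_minimize(grad, theta=1.0, learning_rate=0.01, steps=500)
print(after_one)   # the first step has size ~learning_rate, whatever the gradient scale
print(after_many)  # closer to the minimum at 0 than the starting point
```

Dividing by the running root-mean-square of the gradient gives each parameter its own effective step size, which is the main practical difference from steepest descent.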


## Let's go

We will use `tf.Session` rather than `tf.InteractiveSession`. This better separates the process of creating the graph (model specification) and the process of evaluating the graph (model fitting). 
It generally makes for cleaner code. 

In [12]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
      batch = mnist.train.next_batch(50)
      if i % 1000 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
      train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.06
step 1000, training accuracy 1
step 2000, training accuracy 0.94
step 3000, training accuracy 0.98
step 4000, training accuracy 0.98
step 5000, training accuracy 1
step 6000, training accuracy 1
step 7000, training accuracy 0.96
step 8000, training accuracy 1
step 9000, training accuracy 1
step 10000, training accuracy 1
step 11000, training accuracy 1
step 12000, training accuracy 0.98
step 13000, training accuracy 1
step 14000, training accuracy 1
step 15000, training accuracy 1
step 16000, training accuracy 1
step 17000, training accuracy 1
step 18000, training accuracy 0.98
step 19000, training accuracy 1
test accuracy 0.9925
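
A note on reading this log: each training accuracy is measured on the current batch of 50 images only, so it can only take values that are multiples of 1/50 = 0.02, while the test accuracy is measured on all 10,000 test images. The sketch below reproduces that arithmetic with made-up logits; the helper names are illustrative.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_accuracy(logits_batch, one_hot_labels):
    """Fraction of examples whose highest-scoring class matches the one-hot label."""
    hits = [argmax(logits) == argmax(label)
            for logits, label in zip(logits_batch, one_hot_labels)]
    return sum(hits) / len(hits)


# Made-up logits for a batch of 50 three-class examples: 49 right, 1 wrong.
labels = [[0, 1, 0]] * 50
logits = [[0.1, 2.0, -1.0]] * 49 + [[3.0, 0.5, 0.2]]
print(batch_accuracy(logits, labels))  # 49 / 50

# On the 10,000-image test set, 0.9925 corresponds to this many misclassified digits.
print(round(10000 * (1 - 0.9925)))
```

A training accuracy of 1 on one batch therefore says little on its own; the test accuracy is the figure to compare across models.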


That means an accuracy of around 99.25%, which is much better than the previously obtained result (around 92%), while the [best models](https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results) reach 99.79% accuracy.