# MNIST Machine Learning (ML) using TensorFlow

References: 

* [TensorFlow Tutorial](https://www.tensorflow.org/versions/master/get_started/mnist/beginners)

MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:

![](https://www.tensorflow.org/images/MNIST.png)

In [1]:
"""A very simple MNIST classifier.
See extensive documentation at
https://www.tensorflow.org/get_started/mnist/beginners
"""

import tensorflow as tf

# Note: this tutorial helper ships with TensorFlow 1.x only (it was removed in TensorFlow 2.x)
from tensorflow.examples.tutorials.mnist import input_data

## Data import 

The MNIST data is hosted on [this website](http://yann.lecun.com/exdb/mnist/). 


In [2]:
# Import Data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts: 

1. 55,000 data points of training data (`mnist.train`), 
2. 10,000 points of test data (`mnist.test`), 
3. 5,000 points of validation data (`mnist.validation`). 

This split matters: in ML we must hold out data that the model never learns from, so that we can check that what it has learned actually generalizes.

Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers:

![](https://www.tensorflow.org/images/MNIST-Matrix.png)

After flattening each image into a vector of 28*28 = 784 numbers, `mnist.train.images` is a tensor (an n-dimensional array) with a shape of [55000, 784].
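
As a minimal sketch of the flattening step, assume a plain Python nested list stands in for one image. The names `image` and `flatten` are illustrative and not part of the TensorFlow API, and the row-major order is an assumption: any fixed order works as long as it is the same for every image.

```python
# Illustrative only: a 28x28 "image" as a nested list of pixel intensities.
ROWS, COLS = 28, 28
image = [[0.0 for _ in range(COLS)] for _ in range(ROWS)]
image[3][5] = 1.0  # set one pixel so we can track where it ends up


def flatten(img):
    """Concatenate the rows of a 2-D image into one 1-D list (row-major order)."""
    return [pixel for row in img for pixel in row]


vector = flatten(image)
print(len(vector))            # 784
print(vector[3 * COLS + 5])   # 1.0
```

The pixel at row `r` and column `c` ends up at index `r * 28 + c` of the flattened vector.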

## Model creation

Each MNIST image shows a handwritten digit between zero and nine, so there are only ten possible classes for a given image. 
We want to look at an image and give the probability of it being each digit, so we use a softmax regression, with `softmax()` as the activation function. `softmax()` maps its inputs to values that sum to 1 and can therefore be read as probabilities, which makes it a natural last layer of the model. 

* See also [List of activation functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions)

A softmax regression has two steps: 

1. First we add up the evidence of our input image being in certain classes. For that, we compute a weighted sum of the pixel intensities, $z = Wx + b$, where a weight is negative if a high intensity at that pixel is evidence against the image being in that class, and positive if it is evidence in favor. 
2. Then we convert that evidence into probabilities by applying the `softmax()` function.
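
The two steps above can be sketched in plain Python on a toy problem. Everything here is hypothetical (3 "pixels", 2 classes, hand-picked weights), and the helpers `evidence` and `softmax` are illustrative stand-ins, not the TensorFlow implementation. The weight matrix is laid out as `[pixels][classes]`, matching the `[784, 10]` shape used in the notebook.

```python
import math


def evidence(W, x, b):
    """Step 1: weighted sum per class, z_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def softmax(z):
    """Step 2: map evidence scores to probabilities that sum to 1."""
    m = max(z)                            # subtracting the max avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy input: 3 pixel intensities, 2 classes.
x = [1.0, 0.0, 0.5]
W = [[2.0, -1.0],
     [0.5, 0.5],
     [-1.0, 3.0]]
b = [0.0, 0.1]

z = evidence(W, x, b)   # evidence for each class
y = softmax(z)          # probabilities for each class
print(z)
print(y)
```

The class with the most evidence also gets the highest probability, and the probabilities always add up to 1.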


In [3]:
# Create the model
x  = tf.placeholder(tf.float32, [None, 784])  # Placeholder for the input images
W  = tf.Variable(tf.zeros([784, 10]))         # Model parameter: weight
b  = tf.Variable(tf.zeros([10]))              # Model parameter: bias
z  = tf.matmul(x, W) + b                      # BEFORE applying softmax
# Real model: 
# y = tf.nn.softmax(z)
y_ = tf.placeholder(tf.float32, [None, 10])   # Placeholder to input the **correct** answers

## Loss function

The model is therefore given by the following formula:

$$y = \mathrm{softmax}(Wx + b)$$

In order to train this model, we need to define what it means for the model to be bad, through a _loss_ function that training will minimize. One very common loss function is called "cross-entropy", defined as:

$$H_{y'}(y) = -\sum_i y'_i \log(y_i) = -\sum y' \log(y)$$

where $y$ is our predicted probability distribution, and $y'$ is the true distribution (i.e. `y_` in the above model definition).
Since this raw formulation can be numerically unstable, we instead apply `tf.nn.softmax_cross_entropy_with_logits` to the raw outputs `z` (the logits, before softmax), and then average across the batch.
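
To see why the raw formulation can fail, here is a minimal plain-Python sketch for a single example. It compares the direct formula with a version based on the log-sum-exp identity $\log y_i = z_i - \log \sum_j e^{z_j}$. Both helper functions are illustrative; this shows the general idea of working on the logits directly and is not a description of how TensorFlow implements its fused operation.

```python
import math


def naive_cross_entropy(z, y_true):
    """-sum(y' * log(softmax(z))), computed the direct way."""
    exps = [math.exp(v) for v in z]       # overflows for large logits
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t > 0)


def stable_cross_entropy(z, y_true):
    """Same quantity via log-softmax: log y_i = z_i - logsumexp(z)."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(t * (v - log_sum_exp) for t, v in zip(y_true, z))


y_true = [0.0, 1.0]            # one-hot label: the true class is index 1

small = [2.0, 1.0]             # moderate logits: both versions agree
print(naive_cross_entropy(small, y_true), stable_cross_entropy(small, y_true))

large = [1000.0, 0.0]          # extreme logits: the naive version breaks
naive_failed = False
try:
    naive_cross_entropy(large, y_true)
except OverflowError as err:
    naive_failed = True
    print("naive version failed:", err)
print(stable_cross_entropy(large, y_true))
```

For moderate logits the two functions return the same loss; for extreme logits only the second one returns a finite value.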

In [4]:
# The raw formulation of cross-entropy,
#
#   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(z)),
#                                 reduction_indices=[1]))
#
# can be numerically unstable.
#
# So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
# outputs 'z' (the logits), and then average across the batch.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=z))

## Training

Now that we know what we want our model to do, it's very easy to have TensorFlow train it to do so. 
Because TensorFlow knows the entire graph of your computations, it can automatically use the [backpropagation algorithm](https://colah.github.io/posts/2015-08-Backprop) to efficiently determine how the variables $W$ and $b$  affect the loss to be minimized. 
Then it can apply your choice of optimization algorithm to modify the variables and reduce the loss.

TensorFlow offers a broad range of [optimization algorithms](https://www.tensorflow.org/versions/master/api_guides/python/train#Optimizers) for training. Here we minimize `cross_entropy` using the [gradient descent algorithm](https://en.wikipedia.org/wiki/Gradient_descent) with a learning rate of 0.5. Gradient descent is a simple procedure in which TensorFlow shifts each variable a little bit in the direction that reduces the cost. 
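
The update rule behind gradient descent can be sketched on a one-variable problem. The loss $f(w) = (w - 3)^2$, the helper `gradient_descent`, and the learning rate of 0.1 are all hypothetical choices for this illustration; a suitable learning rate depends on the loss being minimized.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly move w against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def grad_f(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which has its minimum at w = 3."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(grad_f, w0=0.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3
```

In the notebook the same idea is applied to every entry of $W$ and $b$ at once, with the gradients obtained by backpropagation.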

In [5]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

Behind the scenes, TensorFlow adds new operations to the graph of the model that implement backpropagation and gradient descent.

## Let's go!

In [6]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run() # initialize the variables created (W and b)

### Training

Let's train: we'll run the training step 1000 times.
At each step of the loop, we get a "batch" of one hundred random data points from our training set. We then run `train_step`, feeding in the batch data to replace the placeholders.
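
One simple way to draw such a random batch is sketched below in plain Python. The helper `random_batch` and the toy data are hypothetical, and the actual `next_batch` method may select its examples differently; the point is only that inputs and labels must be picked with the same indices.

```python
import random


def random_batch(inputs, labels, batch_size, rng=random):
    """Pick batch_size distinct examples at random, keeping inputs and labels aligned."""
    indices = rng.sample(range(len(inputs)), batch_size)
    return [inputs[i] for i in indices], [labels[i] for i in indices]


# Hypothetical toy data set: example i has input [i, i] and label i % 10.
inputs = [[i, i] for i in range(1000)]
labels = [i % 10 for i in range(1000)]

rng = random.Random(0)  # seeded so the sketch is reproducible
batch_xs, batch_ys = random_batch(inputs, labels, batch_size=100, rng=rng)
print(len(batch_xs), len(batch_ys))  # 100 100
```

Training on small random batches instead of the full data set at every step is what makes each step cheap.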

In [7]:
# Train - we'll run the training step 1000 times!
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

### Model Evaluation

Let's figure out where we predicted the correct label. It is of course **crucial** to evaluate on a separate set of images, here `mnist.test`, that was not used for training.
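
The evaluation boils down to comparing the index of the highest score with the index of the 1 in the one-hot label, and averaging the matches. Here is a plain-Python sketch with hypothetical scores for 4 examples and 3 classes; the helpers `argmax` and `accuracy` are illustrative and not TensorFlow functions.

```python
def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(scores, one_hot_labels):
    """Fraction of rows where the highest score matches the one-hot label."""
    correct = [argmax(s) == argmax(t) for s, t in zip(scores, one_hot_labels)]
    return sum(correct) / len(correct)   # True counts as 1, False as 0


# Hypothetical model outputs and one-hot labels.
scores = [[2.0, 0.1, 0.3],
          [0.2, 0.5, 1.5],
          [0.1, 3.0, 0.2],
          [1.0, 0.9, 0.8]]
labels = [[1, 0, 0],
          [0, 1, 0],
          [0, 1, 0],
          [1, 0, 0]]

print(accuracy(scores, labels))  # 0.75
```

Because softmax preserves the ordering of its inputs, taking the argmax of the raw outputs `z` gives the same predicted class as taking the argmax of `softmax(z)`.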

In [10]:
# Test trained model
correct_prediction = tf.equal(tf.argmax(z, 1), tf.argmax(y_, 1))   # one boolean per image
# To determine what fraction are correct, we cast to floating point numbers and then take the mean. 
# For example, [True, False, True, True] would become [1,0,1,1] which would become 0.75.
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                    y_: mnist.test.labels}))

0.919


That means an accuracy of around 92%, whereas the [best models](https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results) reach 99.79% accuracy.