# MNIST Tutorial in TensorFlow

Based on [this page](https://www.tensorflow.org/versions/r0.10/tutorials/mnist/pros/), below is a lightly altered and annotated walkthrough of building a softmax classifier in TensorFlow, first with a shallow architecture and then with a multilayer convolutional architecture.

It's written in Python 3 with TensorFlow 1.0.0.

## Import the data

In [3]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


## Import TensorFlow and start a session

* Sessions connect TensorFlow to its C++ backend.
* The `InteractiveSession` class allows interleaving of operations that **build** the graph and operations that **run** the graph. Useful for interactive contexts like Jupyter.
* In non-interactive contexts, a graph is created first and then launched in a session.

In [4]:
import tensorflow as tf
sess = tf.InteractiveSession()

### Note on computational graphs

TensorFlow avoids expensive data transfer between Python and its backend by describing the whole graph of interacting operations in Python and then running that graph entirely outside of Python, similar to Theano and Torch.

Learn more about the graph [here](https://www.tensorflow.org/versions/r0.10/get_started/basic_usage#the_computation_graph).
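
To make the build-then-run split concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the `Placeholder`, `Add`, `Mul`, and `run` names are hypothetical and exist only for illustration. Building the expression only records its structure, and nothing is computed until `run` is called with values for the placeholders.

```python
# Toy deferred-execution graph; all class and function names are hypothetical.

class Placeholder:
    """A node with no value until one is supplied at run time."""
    def __init__(self, name):
        self.name = name


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right


class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right


def run(node, feed_dict):
    """Evaluate a node recursively, reading placeholder values from feed_dict."""
    if isinstance(node, Placeholder):
        return feed_dict[node]
    if isinstance(node, Add):
        return run(node.left, feed_dict) + run(node.right, feed_dict)
    if isinstance(node, Mul):
        return run(node.left, feed_dict) * run(node.right, feed_dict)
    return node  # plain Python numbers act as constants


# Build phase: no arithmetic happens here.
a = Placeholder("a")
b = Placeholder("b")
expr = Add(Mul(a, 3), b)

# Run phase: values are supplied and the graph is evaluated.
result = run(expr, {a: 2, b: 4})
print(result)  # 10
```

The same `expr` can be run again with a different `feed_dict`, which is roughly the role a session plays for a TensorFlow graph.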

# Build a single layer Softmax Regression Model

We'll start with a single linear layer, then later extend to a multilayer convolutional network.

### 1. Instantiate _nodes_ for input and output

* A `placeholder` doesn't take a specific value until we ask TensorFlow to run a computation.
* However, the arguments can constrain things like the type and shape of data the node will process.

In [5]:
# Image input node, unroll image into row vector
x = tf.placeholder(tf.float32, shape=[None, 28*28])

# True label input node (one-hot), classes 0-9
y_ = tf.placeholder(tf.float32, shape=[None, 10])

___Note___ that the first dimension in `shape` corresponds to the **batch size**. Setting that dimension to `None` indicates that any size is acceptable to the node.

> The shape argument to placeholder is optional, but it allows TensorFlow to automatically catch bugs stemming from inconsistent tensor shapes.
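
As a rough illustration of what a shape of `[None, 784]` accepts, the sketch below checks concrete shapes against a spec in which `None` acts as a wildcard. The `shape_matches` helper is hypothetical and only mimics the idea; it is not a TensorFlow function.

```python
# Hypothetical helper: check a concrete shape against a spec where None is a wildcard.

def shape_matches(shape, spec):
    if len(shape) != len(spec):
        return False
    return all(s is None or s == d for d, s in zip(shape, spec))


spec = [None, 28 * 28]  # same spec as the placeholder `x`

print(shape_matches((100, 784), spec))     # True: batch of 100 flattened images
print(shape_matches((1, 784), spec))       # True: batch of 1
print(shape_matches((100, 28, 28), spec))  # False: images not flattened
```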

### 2. Instantiate the variables (weights and biases) to be learned by the model

* Generally, the model parameters should be instantiated as `Variable`s in the computational graph.

In [6]:
# Weights with dimension=[#features, #outputs]
W = tf.Variable(tf.zeros([784, 10]))

# Bias
b = tf.Variable(tf.zeros([10]))

* Variables must be **initialized** within a session before they can be used.
* All variables can be initialized at once using `tf.global_variables_initializer()`.
* The original tutorial uses `tf.initialize_all_variables()`, but in TensorFlow 1.0 this raises a deprecation warning.

In [8]:
sess.run(tf.global_variables_initializer())

### 3. Implement the single layer prediction and loss

Recall the placeholders and variables we've created:
* x - placeholder for the data
* y_ - placeholder for the true labels
* W - weights for the model
* b - bias for the model

Since the model has a single layer, we need only one matrix multiplication plus the bias, followed by the softmax function to get class probabilities.

In [9]:
# Model output
y = tf.nn.softmax(tf.matmul(x,W) + b)
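
The same computation can be sketched in NumPy to see what the node produces. This is an illustration under assumptions, not TensorFlow's implementation: the `softmax` helper and the random input batch are made up here. With the all-zero weights and bias used above, every class starts with probability 0.1.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


x_batch = np.random.rand(4, 784).astype(np.float32)  # 4 made-up "images"
W0 = np.zeros((784, 10), dtype=np.float32)           # zero weights, as in the tutorial
b0 = np.zeros(10, dtype=np.float32)

y_out = softmax(x_batch @ W0 + b0)
print(y_out.shape)  # (4, 10)
print(y_out[0])     # every entry is 0.1
```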

Specify the loss function.

* `reduce_sum` sums across all classes (the second dimension) for each example
* `reduce_mean` averages those per-example sums over the batch

In [10]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), 
                                             reduction_indices=[1]))
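
A NumPy sketch of the same loss, assuming one-hot labels and made-up uniform predictions, shows the two reductions explicitly. With uniform predictions over 10 classes the loss is ln(10), which is what the untrained zero-weight model starts from.

```python
import numpy as np


def cross_entropy(y_true, y_pred):
    """Mean over the batch of -sum_k y_true[k] * log(y_pred[k])."""
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)  # like reduce_sum over classes
    return per_example.mean()                                # like reduce_mean over the batch


# Two one-hot labels (classes 3 and 7) and uniform predictions.
y_true = np.eye(10)[[3, 7]]
y_uniform = np.full((2, 10), 0.1)

loss = cross_entropy(y_true, y_uniform)
print(loss, np.log(10))  # both are about 2.3026
```

Note that this form breaks down if a predicted probability is exactly 0, since `log(0)` is not finite. TensorFlow provides `tf.nn.softmax_cross_entropy_with_logits`, which works from the unnormalized logits, as a more numerically stable alternative.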

### 4. Train the model

Having specified the model and loss function, TensorFlow knows the entire computational graph, so it can use automatic differentiation to find the gradients of the loss with respect to each variable. TensorFlow also provides [other built-in optimizers](https://www.tensorflow.org/versions/r0.10/api_docs/python/train#optimizers).

We'll use steepest gradient descent with a step length (learning rate) of 0.5.

In [11]:
# Define what happens in a training step
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

___Note___ The above line has actually added operations to the computational graph, such as ones to compute gradients, parameter update steps, and update application steps.

We have to `run` the train step to actually apply gradient descent. The model is trained by repeatedly running the train step.

In [12]:
# Do 1000 steps of training on 100 examples per step
for i in range(1000):
    batch = mnist.train.next_batch(100)     # 100 ex per iter
    train_step.run(feed_dict={x : batch[0], # feed_dict replaces placeholders
                              y_: batch[1]})

___Note___ Any tensor in the graph can be replaced using `feed_dict`, not just `placeholder`s!
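
What one run of the train step does can be sketched by hand in NumPy. This is a minimal sketch under assumptions, not TensorFlow's code: the data is synthetic, the problem is shrunk to 5 features and 3 classes, and the gradients are written out analytically instead of coming from automatic differentiation. All function and variable names are illustrative.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def loss_and_grads(W, b, x, y_true):
    """Cross-entropy loss and its gradients for y = softmax(xW + b)."""
    probs = softmax(x @ W + b)
    loss = -np.mean(np.sum(y_true * np.log(probs + 1e-12), axis=1))
    delta = (probs - y_true) / x.shape[0]  # d(loss)/d(logits)
    return loss, x.T @ delta, delta.sum(axis=0)


rng = np.random.default_rng(0)
x_batch = rng.standard_normal((100, 5))            # synthetic stand-in for a batch
y_batch = np.eye(3)[rng.integers(0, 3, size=100)]  # random one-hot labels

W = np.zeros((5, 3))
b = np.zeros(3)
learning_rate = 0.5

losses = []
for step in range(20):
    loss, grad_W, grad_b = loss_and_grads(W, b, x_batch, y_batch)
    losses.append(loss)
    W -= learning_rate * grad_W  # the kind of update one train step applies
    b -= learning_rate * grad_b

print(losses[0], losses[-1])  # the loss starts near ln(3) and decreases
```

Unlike the tutorial loop, this sketch reuses the same batch at every step, so it shows plain gradient descent and not the stochastic variant obtained by drawing a fresh batch each time.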

## Evaluate the model

TensorFlow has an `argmax` function that can be used along with its `equal` function and the reduce functions to get accuracy.

In [13]:
# Check for correct predictions - outputs boolean
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))

# Cast to floats and take the mean
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
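
The same three steps (argmax, equality test, mean of the cast result) look like this in NumPy, using a small made-up batch of 4 examples and 3 classes:

```python
import numpy as np

# Made-up predicted probabilities for 4 examples over 3 classes.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1]])
# One-hot true labels: classes 0, 1, 2, 1.
y_true = np.eye(3)[[0, 1, 2, 1]]

correct = np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)  # boolean per example
accuracy = correct.astype(np.float32).mean()                      # fraction correct

print(correct)   # [ True  True  True False]
print(accuracy)  # 0.75
```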

Evaluate on the test data.

In [14]:
print(accuracy.eval(feed_dict={x: mnist.test.images, 
                               y_: mnist.test.labels}))

0.9196


About 92%, as expected for this simple model.

# Build a Multilayer Convolutional Network

Weights should be initialized with a small amount of non-zero noise to
* break symmetry
* avoid zero gradients

Also, since we're using ReLU neurons, the biases should be initialized with a **small positive** value to avoid "dead neurons".

Since we'll use many weights and biases, let's make functions to create them.

In [16]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
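
`tf.truncated_normal` draws from a normal distribution and re-picks any value that falls more than two standard deviations from the mean. A NumPy sketch of that behaviour, assuming a simple redraw loop (the `truncated_normal` function here is illustrative, not TensorFlow's implementation):

```python
import numpy as np


def truncated_normal(shape, stddev=0.1, rng=None):
    """Sample N(0, stddev^2), redrawing any value beyond 2 standard deviations."""
    rng = np.random.default_rng() if rng is None else rng
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2 * stddev
    while outside.any():
        values[outside] = rng.normal(0.0, stddev, size=int(outside.sum()))
        outside = np.abs(values) > 2 * stddev
    return values


w = truncated_normal((5, 5, 1, 32), stddev=0.1, rng=np.random.default_rng(0))
bias = np.full(32, 0.1)  # small positive constant, like bias_variable

print(w.shape, np.abs(w).max() <= 0.2)  # (5, 5, 1, 32) True
```

The truncation keeps every initial weight within ±0.2, so no single weight starts out large enough to dominate a neuron's output.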

## Convolution and Pooling

In [17]:
def conv2d(x, W):
    return tf.nn.conv2d(x, 
                        W, 
                        strides=[1, 1, 1, 1], 
                        padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, 
                          ksize=[1, 2, 2, 1], 
                          strides=[1, 2, 2, 1], 
                          padding='SAME')
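
With `'SAME'` padding, the spatial output size depends only on the input size and the stride: it is the input size divided by the stride, rounded up. The sketch below applies that rule to the two convolution and pooling stages used in this tutorial; `same_output_size` is a hypothetical helper name.

```python
import math


def same_output_size(input_size, stride):
    """Spatial output size under 'SAME' padding: ceil(input_size / stride)."""
    return math.ceil(input_size / stride)


size = 28
size = same_output_size(size, 1)  # conv2d, stride 1   -> 28
size = same_output_size(size, 2)  # max_pool_2x2       -> 14
size = same_output_size(size, 1)  # second conv2d      -> 14
size = same_output_size(size, 2)  # second max_pool    -> 7
print(size)  # 7
```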

## First convolutional layer

Convolution followed by pooling
* 32 features for each 5x5 patch.
* Weight tensor shape of `[5, 5, 1, 32]`, i.e. `[patch height, patch width, num input channels, num output channels]`.
* Also want a bias component for each output channel.

In [18]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

In order to apply the layer, the data `x` needs to be reshaped into a 4d tensor laid out as `[batch, height, width, channels]`.
* 1st dim: batch size (`-1` lets TensorFlow infer it)
* 2nd dim: height
* 3rd dim: width
* 4th dim: number of color channels

In [19]:
x_image = tf.reshape(x, [-1, 28, 28, 1])
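
NumPy's `reshape` uses the same row-major layout and the same `-1` convention, so it gives a quick way to see what this reshape does. The input below is made-up data standing in for three flattened images.

```python
import numpy as np

flat = np.arange(3 * 784, dtype=np.float32).reshape(3, 784)  # 3 fake flattened images
images = flat.reshape(-1, 28, 28, 1)                          # -1 is inferred as 3

print(images.shape)  # (3, 28, 28, 1)
# Row-major layout: pixel (row r, column c) of image n is flat[n, r * 28 + c].
print(images[1, 2, 5, 0] == flat[1, 2 * 28 + 5])  # True
```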

Then, 
* convolve `x_image` with the weight tensor
* add the bias
* apply the ReLU
* max pool

In [20]:
# Convolve, add bias, apply ReLU
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

# Max pool
h_pool1 = max_pool_2x2(h_conv1)
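
To see what the pooling step does to the data, here is a NumPy sketch of 2x2 max pooling with stride 2. It assumes even height and width (true for the 28x28 and 14x14 maps here), in which case `'SAME'` padding adds nothing. The function is an illustration, not TensorFlow's implementation.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling, stride 2, for input shaped [batch, height, width, channels].

    Assumes height and width are even, so 'SAME' padding adds nothing.
    """
    n, h, w, c = x.shape
    blocks = x.reshape(n, h // 2, 2, w // 2, 2, c)  # split rows and columns into pairs
    return blocks.max(axis=(2, 4))                   # max over each 2x2 block


x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(x)

print(pooled[0, :, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

Each output value is the largest entry of one non-overlapping 2x2 block, which is why every pooling stage halves the height and width.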

## Second Convolutional Layer

The second layer will have 64 features per 5x5 patch.

In [21]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

## Densely Connected Layer

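After two rounds of 2x2 pooling, the image has been reduced from 28x28 to 7x7. We'll now add a fully connected layer with 1024 neurons that processes the entire image, so the pooled tensor is first flattened into a batch of vectors.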

In [22]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

## Dropout

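In order to reduce overfitting, apply dropout before the readout layer. `keep_prob` is a placeholder so that dropout can be turned on during training (`0.5`) and off during evaluation (`1.0`).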

In [23]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
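
`tf.nn.dropout` keeps each element with probability `keep_prob` and scales the kept elements by `1 / keep_prob`, so the expected sum of the activations is unchanged. A NumPy sketch of that behaviour (the `dropout` function and its `rng` argument are illustrative, not TensorFlow's implementation):

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each element with probability 1 - keep_prob,
    and scale the survivors by 1 / keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
h = np.ones((2, 1024))

train_out = dropout(h, 0.5, rng)  # training: about half zeros, the rest 2.0
eval_out = dropout(h, 1.0, rng)   # evaluation: unchanged

print(np.unique(train_out), train_out.mean())
print(np.array_equal(eval_out, h))  # True
```

Because of the scaling, no extra correction is needed at evaluation time: feeding `keep_prob` as `1.0` simply passes the activations through.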

## Readout Layer

Just a softmax layer, as in the single layer model above.

In [24]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
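
As a sanity check on the architecture, the number of trainable parameters can be counted directly from the weight and bias shapes used above. The sketch below is plain Python; the `layers` dictionary and the `num_elements` helper are illustrative names, and the shapes are copied from this tutorial.

```python
# Shapes copied from the layers defined in this tutorial.
layers = {
    "conv1": ([5, 5, 1, 32], [32]),
    "conv2": ([5, 5, 32, 64], [64]),
    "fc1": ([7 * 7 * 64, 1024], [1024]),
    "fc2": ([1024, 10], [10]),
}


def num_elements(shape):
    """Product of the dimensions in a shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


params = {name: num_elements(w) + num_elements(b) for name, (w, b) in layers.items()}
total = sum(params.values())

for name, count in params.items():
    print(name, count)
print("total", total)  # total 3274634
```

Almost all of the roughly 3.3 million parameters sit in the first fully connected layer, which is where dropout is applied.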

## Train and evaluate

In [25]:
# Loss
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))

# Train step
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# Evaluate
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initialize variables
sess.run(tf.global_variables_initializer())

# Train
for i in range(20000):
    # get batch
    batch = mnist.train.next_batch(50)
    
    # show status every 100 iterations
    if i%100==0:
        train_accuracy = accuracy.eval(feed_dict={x:batch[0], 
                                                  y_:batch[1], 
                                                  keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    
    # run the train step
    train_step.run(feed_dict={x: batch[0], 
                              y_:batch[1], 
                              keep_prob: 0.5})

# Evaluate the model on test set 
print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, 
                                                  y_: mnist.test.labels, 
                                                  keep_prob: 1.0})) 
    

step 0, training accuracy 0.08
step 100, training accuracy 0.82
step 200, training accuracy 0.92
step 300, training accuracy 0.9
step 400, training accuracy 0.92
step 500, training accuracy 0.92
step 600, training accuracy 0.96
step 700, training accuracy 0.96
step 800, training accuracy 0.96
step 900, training accuracy 0.98
step 1000, training accuracy 0.94
step 1100, training accuracy 0.94
step 1200, training accuracy 0.94
step 1300, training accuracy 0.94
step 1400, training accuracy 0.98
step 1500, training accuracy 1
step 1600, training accuracy 0.98
step 1700, training accuracy 0.96
step 1800, training accuracy 1
step 1900, training accuracy 0.98
step 2000, training accuracy 0.98
step 2100, training accuracy 0.96
step 2200, training accuracy 1
step 2300, training accuracy 0.98
step 2400, training accuracy 0.98
step 2500, training accuracy 0.98
step 2600, training accuracy 1
step 2700, training accuracy 0.96
step 2800, training accuracy 1
step 2900, training accuracy 0.98
step 300