In [None]:
import numpy as np
import seaborn as sb
import pandas
import sys
import itertools
import matplotlib.pyplot as plt
import nltk
import csv
import datetime
import tensorflow as tf
%matplotlib notebook

# Deep learning of MNIST data using tensorflow

We closely follow the TensorFlow tutorial, training and optimizing a neural network to recognize handwritten digits.

## Importing MNIST

Tensorflow has its own version of the MNIST data, which we import like this:

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

## Training in an interactive session

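So far, we have created our graph and then called `sess.run()` to evaluate it. We can be a bit more flexible and run an interactive session. It installs itself as the default session, so we can call `.run()` and `.eval()` directly on operations and tensors while we keep changing things under the hood: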

In [None]:
sess = tf.InteractiveSession()

## Defining the data

We first need input and label data. These are tensorflow placeholders that will be populated with the actual data later on. 

Since MNIST images are 28x28 pixels, each flattened image has 784 dimensions. And we have 10 digit labels, which makes the output a **one-hot encoded** vector with 10 dimensions.

In order to initialize these properly, we also tell tensorflow the shape of each placeholder. This is not strictly necessary, but it allows tensorflow to catch dimension mismatches later on (which happen a lot in neural nets!)
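
To make the two shapes concrete, here is a small NumPy sketch (independent of the tensorflow graph) of how labels become one-hot vectors and how images get flattened. The helper `one_hot` and the example arrays are made up for illustration; they are not part of the MNIST loader.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a [len(labels), num_classes] float32 matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# three hypothetical digit labels
example_labels = np.array([3, 0, 9])
y_example = one_hot(example_labels)

# a hypothetical batch of three blank 28x28 images, flattened to 784 values each
example_images = np.zeros((3, 28, 28), dtype=np.float32)
x_example = example_images.reshape(len(example_images), -1)

print(x_example.shape, y_example.shape)  # (3, 784) (3, 10)
print(y_example[0])                      # the 1 sits at index 3
```

These are exactly the shapes that fit the `[None, 784]` and `[None, 10]` placeholders, with `None` standing for the batch size of 3.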

In [None]:
# None tells tensorflow that that dimension can have any size
x = tf.placeholder(tf.float32, shape=[None,784])
y = tf.placeholder(tf.float32, shape=[None,10])

## Step 1: A linear model with softmax loss

We know that a Linear Ridge regression classifier can get us around 88% accuracy on the MNIST data.

Let's use a similar model, but with the softmax cross-entropy loss, which is the standard choice for classification problems.

First the linear model with its parameters. Don't forget that we need to initialize variables before their first use!!

In [None]:
# weights map all dimensions to ten classes
w = tf.Variable(tf.zeros([784,10]))
# biases get added
b = tf.Variable(tf.zeros([10]))
sess.run(tf.global_variables_initializer())

# note the multiplication order: x is [batch, 784] and w is [784, 10],
# so we compute x*w (not w*x)
ym = tf.matmul(x,w)+b
# the output has one column per label, i.e. shape [batch, 10]
print(ym.shape)

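Now for the loss function. We will take the usual softmax cross-entropy function, which is implemented in tensorflow.

The softmax turns the output prediction of the model ($ym$) into class probabilities. For each example, the cross-entropy then sums $-y_c \log(\mathrm{softmax}(ym)_c)$ across all classes ($\sum_{c=1}^{10}$); since $y$ is one-hot, only the true class contributes. The average of these per-example sums over the minibatch is then our total loss.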
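
As a sanity check, the same loss can be sketched in plain NumPy. This is only an illustration of the formula; the fused tensorflow op is more careful about numerical stability. The function names below are our own.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(labels, logits):
    """Mean over the batch of -sum_c labels[c] * log(softmax(logits)[c])."""
    probs = softmax(logits)
    per_example = -(labels * np.log(probs)).sum(axis=1)
    return per_example.mean()

# with all-zero logits (as produced by zero-initialized weights and biases)
# every class gets probability 1/10, so the loss is log(10)
example_labels = np.eye(10)[[3, 7]]      # two one-hot labels
example_logits = np.zeros((2, 10))
example_loss = softmax_cross_entropy(example_labels, example_logits)
print(example_loss, np.log(10))          # both are about 2.3026
```

So before any training, a model with zero weights and biases starts at a loss of about 2.3.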

In [None]:
l = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=ym))

### Gradient descent

Let's do minibatch gradient descent: in each of 1000 steps, we feed in the next 100 training examples and take one gradient step (with learning rate 0.5) on the parameters.
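
What the optimizer does in each step can be sketched by hand in NumPy. The following is a rough illustration on synthetic data (a made-up stand-in for MNIST with 20 features and 3 classes), using the fact that the gradient of the mean softmax cross-entropy with respect to the logits is `(softmax - labels) / batch_size`. All names here are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic stand-in for MNIST: 600 points in 20 dimensions, 3 classes
n, d, k = 600, 20, 3
true_w = rng.randn(d, k)
data = rng.randn(n, d)
targets = np.eye(k)[(data @ true_w).argmax(axis=1)]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(w, b, xs, ys):
    # small constant guards against log(0)
    return -(ys * np.log(softmax(xs @ w + b) + 1e-12)).sum(axis=1).mean()

w_np = np.zeros((d, k))
b_np = np.zeros(k)
learning_rate, batch_size = 0.5, 100
initial_loss = mean_loss(w_np, b_np, data, targets)

for step in range(200):
    # draw a random minibatch (a simplification of mnist.train.next_batch)
    idx = rng.choice(n, batch_size, replace=False)
    xs, ys = data[idx], targets[idx]
    # gradient of the mean cross-entropy with respect to the logits
    grad_logits = (softmax(xs @ w_np + b_np) - ys) / batch_size
    # chain rule through logits = xs @ w + b, then one gradient step
    w_np -= learning_rate * xs.T @ grad_logits
    b_np -= learning_rate * grad_logits.sum(axis=0)

final_loss = mean_loss(w_np, b_np, data, targets)
print(initial_loss, final_loss)
```

The loss on the full synthetic data set should drop well below its starting value of `log(3)`.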

In [None]:
train = tf.train.GradientDescentOptimizer(0.5).minimize(l)
for step in range(1000):
    miniBatch = mnist.train.next_batch(100)
    train.run(feed_dict={x: miniBatch[0], y: miniBatch[1]})

### Evaluate
We check which predicted labels are correct (the result is a boolean vector), cast that to float (giving values that are either 0 or 1), and then average to get the overall accuracy.

Note that these lines are not executed yet; they only define the computational operations in the graph!
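
The same three steps (argmax, compare, average) look like this in NumPy, where everything is evaluated immediately. The four labels and model outputs are made up for illustration.

```python
import numpy as np

# hypothetical one-hot labels and model outputs for four examples, three classes
example_labels = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [0, 1, 0]], dtype=np.float32)
example_outputs = np.array([[ 2.0, 0.1, -1.0],   # predicts class 0 (correct)
                            [ 0.3, 0.2,  0.1],   # predicts class 0 (wrong)
                            [-1.0, 0.0,  3.0],   # predicts class 2 (correct)
                            [ 0.0, 1.5,  0.5]])  # predicts class 1 (correct)

# boolean vector: does the predicted class match the labeled class?
correct_np = example_labels.argmax(axis=1) == example_outputs.argmax(axis=1)
# cast to 0/1 and average to get the fraction of correct predictions
accuracy_np = correct_np.astype(np.float32).mean()
print(correct_np, accuracy_np)   # three of four are correct: 0.75
```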

In [None]:
correct = tf.equal(tf.argmax(y,1),tf.argmax(ym,1))
accuracy = tf.reduce_mean(tf.cast(correct,tf.float32))

Let's evaluate our accuracy on the test set:

In [None]:
print('Linear accuracy:',accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels}))

### Change to regression loss

Let's use our "normal" L2 regression loss (sometimes called Euclidean loss). This is going to be very ugly, since it requires the linear model to reproduce the one-hot vectors themselves as closely as possible, rather than just to rank the correct class highest!
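
One detail worth knowing: `tf.nn.l2_loss` returns a single scalar, half the sum of all squared entries, so the loss grows with the batch size. Here is a NumPy stand-in to illustrate this (the function `l2_loss_np` and the example batch are our own, not tensorflow code).

```python
import numpy as np

def l2_loss_np(t):
    """NumPy stand-in for tf.nn.l2_loss: half the sum of squared entries, a scalar."""
    return np.sum(t ** 2) / 2.0

# hypothetical batch of 100 one-hot labels against all-zero model outputs
example_labels = np.eye(10)[np.arange(100) % 10]
example_outputs = np.zeros((100, 10))

summed = l2_loss_np(example_labels - example_outputs)   # grows with the batch size
per_example = summed / len(example_labels)              # independent of the batch size
print(summed, per_example)                              # 50.0 0.5
```

Because the result is already a scalar, wrapping it in a mean does not change it. A loss that is summed rather than averaged over 100 examples also produces gradients that are 100 times larger, which is likely one reason why a much smaller learning rate is needed with it.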

In [None]:
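# standard Euclidean loss - note, this is a bad idea for classification
# tf.nn.l2_loss already returns the scalar sum(t**2)/2 over the whole minibatch,
# so the outer reduce_mean leaves it unchanged
l = tf.reduce_mean(tf.nn.l2_loss(t=y-ym))
# reset the weights and biases that were trained with the softmax loss
sess.run(tf.global_variables_initializer())
# we need a much smaller learning rate in order to get somewhere with this,
# since the loss is summed over the minibatch and we have not normalized our data
train = tf.train.GradientDescentOptimizer(0.000005).minimize(l)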
for it in np.arange(10000):
    miniBatch = mnist.train.next_batch(100)
    train.run(feed_dict={x: miniBatch[0], y: miniBatch[1]})
    if (it%1000==0):
        print(it,sess.run(l,feed_dict={x: miniBatch[0], y: miniBatch[1]}))

In [None]:
print('Linear accuracy (L2 loss):',accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels}))

In [None]:
sess.close()

# Going deep!

Now, let's train a proper network architecture!

We will use convolutional layers with ReLU activation functions, followed by a softmax cross-entropy loss at the end.

## Layer 1 - CONV

In [None]:
sess = tf.InteractiveSession()
# layer 1 weights and biases
# we use 32 different 5x5 filters on a single input channel (grayscale), so the
# weight tensor has shape [height, width, input channels, output channels]
# the variable is initialized from a truncated normal distribution with
# the given standard deviation
w_conv1 = tf.Variable(tf.truncated_normal([5,5,1,32],stddev=0.1))
# we should not forget the biases!
b_conv1 = tf.Variable(tf.constant(0.1,shape=[32]))
# tf.nn.conv2d requires a 4D input tensor of shape [batch, height, width, channels],
# so we reshape the flat 784-vectors; -1 lets tensorflow infer the batch size
x_tensor = tf.reshape(x,[-1,28,28,1])
# layer 1 output is convolution, followed by RELU
h_conv1 = tf.nn.relu(tf.nn.conv2d(x_tensor,w_conv1,strides=[1,1,1,1],padding='SAME')+b_conv1)
# layer 1 pooling of convolution layer, we use standard 2x2 max pooling
# with stride 2, which halves height and width (28x28 -> 14x14)
h_pool1 = tf.nn.max_pool(h_conv1,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME')
print(h_pool1)
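
To see what the 2x2 max pooling does to a feature map, here is a minimal NumPy sketch. It assumes even height and width, so it needs no padding (unlike `tf.nn.max_pool` with `padding='SAME'`, which also handles odd sizes). The function name and the tiny 4x4 input are made up for illustration.

```python
import numpy as np

def max_pool_2x2(feature_maps):
    """2x2 max pooling with stride 2 on a [batch, height, width, channels] array.

    Assumes height and width are even, so no padding is needed.
    """
    n, h, w, c = feature_maps.shape
    # split height and width into (blocks, 2) and take the max inside each block
    blocks = feature_maps.reshape(n, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))

# one hypothetical 4x4 single-channel feature map with entries 0..15
fmap = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(fmap)
print(pooled.shape)          # (1, 2, 2, 1)
print(pooled[0, :, :, 0])    # rows [5, 7] and [13, 15]
```

Each output value is the maximum of one non-overlapping 2x2 block, so height and width are halved while the number of channels stays the same.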

## Layer 2 - CONV

In [None]:
# layer 2 weights and biases
# we use 5x5 filters again; the input has 32 channels (one per layer 1 filter)
# and we want 64 filters here
w_conv2 = tf.Variable(tf.truncated_normal([5,5,32,64],stddev=0.1))
# and again the biases
b_conv2 = tf.Variable(tf.constant(0.1,shape=[64]))
# layer 2 output is convolution, followed by RELU
# note that the input consists of course of layer1 output, which is a 4D tensor
h_conv2 = tf.nn.relu(tf.nn.conv2d(h_pool1,w_conv2,strides=[1,1,1,1],padding='SAME')+b_conv2)
# layer 2 pooling of the convolution layer with 2x2 max pooling (14x14 -> 7x7)
h_pool2 = tf.nn.max_pool(h_conv2,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME')
print(h_pool2)

## Layer 3 - FC

Now we add a fully-connected layer. It takes the 7x7x64-dimensional tensor output of the second pooling step (the two rounds of 2x2 pooling reduce 28x28 to 14x14 and then to 7x7), flattens it, and maps it to 1024 units with a standard ReLU activation.
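
Where does the 7x7x64 come from? With `padding='SAME'`, the spatial output size of a convolution or pooling step is the input size divided by the stride, rounded up. A quick sketch of that arithmetic (the helper name is our own):

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a 'SAME'-padded conv or pooling op: ceil(size / stride)."""
    return math.ceil(size / stride)

size = 28
size = same_padding_output(size, stride=1)   # conv 1: 28
size = same_padding_output(size, stride=2)   # pool 1: 14
size = same_padding_output(size, stride=1)   # conv 2: 14
size = same_padding_output(size, stride=2)   # pool 2: 7

channels = 64                                # number of filters in conv 2
flat_size = size * size * channels
print(size, flat_size)                       # 7 3136
```

So each image arrives at the fully-connected layer as a vector of 7*7*64 = 3136 values.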

In [None]:
# fully connected layer
w_fc1 = tf.Variable(tf.truncated_normal([7*7*64,1024],stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1,shape=[1024]))

# we need to reshape the input to 7*7*64 (i.e., flatten it)
h_pool2_flattened = tf.reshape(h_pool2,[-1, 7*7*64])

# this will result in a flat, 1024 dimensional output
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flattened,w_fc1)+b_fc1)

## Trick 1 - Dropout

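CNNs with this many parameters can be prone to overfitting. One way to combat this is to use dropout during training. This means that each activation of a layer is kept only with a certain probability in each training step and is set to zero otherwise; the kept activations are scaled up by the inverse of that probability, so that the expected output stays the same. For evaluation, we keep everything by setting the probability to 1.

We can add dropout on the fully-connected layer as follows: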

In [None]:
# dropout layer on the previous FC layer
keep_connection_probability = tf.placeholder(tf.float32)
h_fc1_dropout = tf.nn.dropout(h_fc1,keep_connection_probability)
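
To see what the dropout operation does to its input, here is a minimal NumPy sketch of the same idea (often called inverted dropout): each value is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise it is set to zero. The function and the all-ones activations are made up for illustration.

```python
import numpy as np

def dropout_np(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h_example = np.ones((1000, 1024))             # hypothetical FC activations

train_out = dropout_np(h_example, keep_prob=0.5, rng=rng)
test_out = dropout_np(h_example, keep_prob=1.0, rng=rng)

print(np.unique(train_out))                   # only 0.0 and 2.0 occur
print(train_out.mean())                       # close to 1.0
print(np.array_equal(test_out, h_example))    # True: keep_prob = 1.0 changes nothing
```

This is why we feed a keep probability of 0.5 while training and 1.0 while evaluating.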

## Read-out and softmax loss
Now we add the final layer, which takes the 1024 values from the FC layer (after dropout) and maps them to a 10-dimensional vector of logits, on which we can finally evaluate our loss.

In [None]:
# read-out layer that will feed into softmax cross-entropy loss
w_fc2 = tf.Variable(tf.truncated_normal([1024,10],stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1,shape=[10]))

# the final output as activations of the final, fully-connected layer
y_fc2 = tf.matmul(h_fc1_dropout,w_fc2)+b_fc2

In [None]:
# loss "layer", which is the standard softmax cross-entropy
l = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_fc2))

# Training loop

Let's use a more sophisticated training scheme - the Adam optimizer. It adapts the step size of each parameter individually over the course of the training, based on running averages of the gradients and of the squared gradients; the base learning rate is set to 0.0001 here. It is one of the most-used optimization algorithms for these kinds of networks.
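
For the curious, a single Adam update can be sketched in a few lines of NumPy. This follows the standard textbook formulation with the usual default values for `beta1`, `beta2` and `eps`; the tensorflow implementation may differ in small details, so treat it as an illustration only.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy problem: minimize f(theta) = theta^2 from theta = 1.0; the gradient is 2 * theta
theta0 = np.array([1.0])
theta1, m1, v1 = adam_step(theta0, 2 * theta0, np.zeros(1), np.zeros(1), t=1)
first_step = theta0 - theta1
print(first_step)    # about 1e-4
```

Note how the first step has a size of about `lr`, no matter how large the gradient is: the update is normalized by the running gradient magnitude.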

In [None]:
# select Adam and initialize learning rate
train = tf.train.AdamOptimizer(1e-4).minimize(l)
# evaluate number of correctly-predicted items and accuracy as before
correct = tf.equal(tf.argmax(y_fc2,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
# init
sess.run(tf.global_variables_initializer())
# counts how many times we have evaluated the training accuracy
batch = 0
train_accuracy=list()
# run 4000 training steps
for i in range(4000):
    # get next batch
    miniBatch = mnist.train.next_batch(50)
    if i%100 == 0:
        # every 100 steps, evaluate accuracy with current mini batch (dropout switched off)
        train_accuracy.append(accuracy.eval(feed_dict={
            x:miniBatch[0], y: miniBatch[1], keep_connection_probability: 1.0}))
        print("step %d, training accuracy %g"%(i, train_accuracy[-1]))
        batch = batch + 1
    # for each batch, run one training step with the data and a keep probability of 0.5
    train.run(feed_dict={x: miniBatch[0], y: miniBatch[1], keep_connection_probability: 0.5})


In [None]:
print('CNN test accuracy:',accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels, keep_connection_probability: 1.0}))

In [None]:
from showTensorflowGraph import show_graph

In [None]:
show_graph(tf.get_default_graph().as_graph_def())

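Wait - what? Why does this look so chaotic? Where are the layers? Where is my beautiful

CONV-RELU-POOL-CONV-RELU-POOL-FC-DROPOUT-LOGIT-SOFTMAX

architecture?

The reason we got this "chaotic" graph is that we used the lower-level APIs of tensorflow, so every variable, multiplication, and gradient update shows up as its own node. The graph also still contains the linear model from the first part, since both models were built in the same default graph. If you want a prettier graph that hides some of the complexity of the updates and gradients, etc., you should use the `layers` functionality of the tensorflow API.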