<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#MNIST-Classification-with-a-Long-Short-Term-Memory-(LSTM)-network" data-toc-modified-id="MNIST-Classification-with-a-Long-Short-Term-Memory-(LSTM)-network-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>MNIST Classification with a Long Short Term Memory (LSTM) network</a></span></li><li><span><a href="#Init-vars" data-toc-modified-id="Init-vars-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Init vars</a></span></li><li><span><a href="#Build-the-computational-graph" data-toc-modified-id="Build-the-computational-graph-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Build the computational graph</a></span><ul class="toc-item"><li><span><a href="#ELU-model" data-toc-modified-id="ELU-model-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>ELU model</a></span></li><li><span><a href="#Result" data-toc-modified-id="Result-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Result</a></span></li></ul></li><li><span><a href="#TensorBoard---Second-Run" data-toc-modified-id="TensorBoard---Second-Run-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>TensorBoard - Second Run</a></span><ul class="toc-item"><li><span><a href="#Result" data-toc-modified-id="Result-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Result</a></span></li></ul></li><li><span><a href="#TensorBoard---Third-Run" data-toc-modified-id="TensorBoard---Third-Run-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>TensorBoard - Third Run</a></span><ul class="toc-item"><li><span><a href="#Result" data-toc-modified-id="Result-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Result</a></span></li></ul></li><li><span><a href="#TensorBoard---Fourth-Run" data-toc-modified-id="TensorBoard---Fourth-Run-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>TensorBoard - Fourth Run</a></span><ul class="toc-item"><li><span><a href="#Result" data-toc-modified-id="Result-6.1"><span 
class="toc-item-num">6.1&nbsp;&nbsp;</span>Result</a></span></li></ul></li><li><span><a href="#TensorBoard---Fifth-Run" data-toc-modified-id="TensorBoard---Fifth-Run-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>TensorBoard - Fifth Run</a></span><ul class="toc-item"><li><span><a href="#Result" data-toc-modified-id="Result-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Result</a></span></li></ul></li></ul></div>

<h1>MNIST classification with a Long Short Term Memory (LSTM) network</h1>

<img style="float: left; margin-right: 15px; width: 30%; height: 30%;" src="images/mnist-image.png" />

# Purpose

The purpose of this write-up is to create a predictive classification model using a Long Short Term Memory (LSTM) network written in TensorFlow.

Goals include:
* Build an LSTM predictive classification model
* Collect and graph model performance via TensorBoard
* Make predictions with the trained model on the test data set and examine accuracy

Dataset source:  [The MNIST Database](http://yann.lecun.com/exdb/mnist/)

# Load libraries and data

In [2]:
import os
import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn

from tensorflow.examples.tutorials.mnist import input_data

from functools import partial

In [3]:
def resetGraph(seed = 10):
    # Start from an empty graph and fix the TF and NumPy seeds so runs are repeatable
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [20]:
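import shutil

def cleanLogs():
    # Remove old TensorBoard logs; shutil.rmtree also works on Windows, unlike 'rm -rf'
    shutil.rmtree('./logs/mnistLSTM/', ignore_errors = True)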

# Init vars

The LSTM expects inputs of shape `[samples, timeSteps, features]`, and we have tens of thousands of MNIST images, each 28 x 28 pixels.

One way to think of this is that a complete image is made up of 28 rows of 28 pixels each.  If we step through the rows one by one and stack them up, the image becomes more and more complete as time goes by.  So our units of "time" are the rows stacking together to create a complete image, and the number of features is the number of pixels in the image row at that step in time (i.e. 28).  This gives us:

* samples     = number of observations (i.e. number of images in the mini batch)
* timeSteps   = number of rows we need to step through/stack up to make a complete image (i.e. 28)
* features    = number of pixels in each row we are stepping through (i.e. also 28)
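
To make the row-as-time-step idea concrete, here is a minimal NumPy sketch. The `flat` array is a hypothetical stand-in for a mini batch of flattened MNIST images (it is filled with a counter rather than real pixels), so only the shapes and indexing are meaningful.

```python
import numpy as np

samples, timeSteps, features = 2, 28, 28

# Hypothetical stand-in for a mini batch: each image arrives as one 784-vector
flat = np.arange(samples * timeSteps * features, dtype = np.float32).reshape(samples, -1)

# Each 784-vector becomes 28 time steps of 28 features
batch = flat.reshape(samples, timeSteps, features)

# Time step t of image 0 is simply row t of that image
row3 = batch[0, 3]
print(batch.shape)                                      # (2, 28, 28)
print(np.array_equal(row3, flat[0, 3 * 28 : 4 * 28]))   # True
```

Because the reshape keeps the pixels in row-major order, no pixel is moved; the image is only re-indexed so that axis 1 counts rows.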

Additionally, we only care about the final output of the LSTM network, which should give us the prediction of which numeral the image represents.  Other LSTM networks do care about the output at every time step (translating each word in a sentence, for example), but that doesn't apply in our case.

Having said this, we can continue with initializing the various variables we'll need:

In [22]:
# Setup vars for the MNIST data set
timeSteps = 28
features = 28

lstmUnits = 128
lr = 0.001
epochs = 10
samples = 50

classes = 10

# Notice we are pulling in the labels as one hot encodings!
mninst = input_data.read_data_sets("./datasets/mnist", one_hot = True)

# For use when we create the LSTM network below
testShape = mninst.test.images.shape

# Note the one hot encoding on the label:
print("\n", "Example label: ", mninst.test.labels[0])

Extracting ./datasets/mnist\train-images-idx3-ubyte.gz
Extracting ./datasets/mnist\train-labels-idx1-ubyte.gz
Extracting ./datasets/mnist\t10k-images-idx3-ubyte.gz
Extracting ./datasets/mnist\t10k-labels-idx1-ubyte.gz

 Example label:  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
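
The one-hot label printed above can be decoded back to a digit with an argmax. The sketch below copies that printed label by hand rather than loading the data set, so it runs without TensorFlow.

```python
import numpy as np

classes = 10

# The example label printed above, copied by hand
label = np.array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])

# The position of the single 1 is the digit
digit = int(np.argmax(label))
print(digit)   # 7

# Going the other way: row `digit` of the identity matrix is its one-hot encoding
oneHot = np.eye(classes)[digit]
```

This is the same argmax that the accuracy op later applies to both the predictions and the labels.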


# Build the computational graph

## Static vs Dynamic TensorFlow RNNs

> Dynamic RNN's allow for variable sequence lengths. You might have an input shape (batch_size, max_sequence_length), but this will allow you to run the RNN for the correct number of time steps on those sequences that are shorter than max_sequence_length.
 
> In contrast, there are static RNNs, which expect to run the entire fixed RNN length. There are cases where you might prefer to do this, such as if you are padding your inputs to max_sequence_length anyway.

>In short, dynamic_rnn is usually what you want for variable length sequential data. It has a sequence_length parameter, and it is your friend.  [Source](https://stackoverflow.com/questions/43100981/what-is-a-dynamic-rnn-in-tensorflow)
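
As a rough illustration of what a `sequence_length` argument buys us, here is a minimal NumPy sketch of a plain tanh RNN that honors per-example lengths. It is not TensorFlow's implementation: `simpleRnn`, `wx`, and `wh` are hypothetical names, and the behavior modelled (state copied through and outputs zeroed once a sequence has ended) follows the description in the `tf.nn.dynamic_rnn` documentation.

```python
import numpy as np

def simpleRnn(x, seqLen, wx, wh):
    """Tanh RNN over x of shape [batch, maxTime, features], honoring per-example lengths."""
    batch, maxTime, _ = x.shape
    units = wh.shape[0]
    h = np.zeros((batch, units))
    outputs = np.zeros((batch, maxTime, units))
    for t in range(maxTime):
        hNew = np.tanh(x[:, t] @ wx + h @ wh)
        # True for the examples whose sequence is still running at step t
        active = (t < seqLen)[:, None]
        # Copy the old state through once a sequence has ended
        h = np.where(active, hNew, h)
        # Zero the outputs past the end of a sequence
        outputs[:, t] = np.where(active, hNew, 0.0)
    return outputs, h

rng = np.random.RandomState(0)
x = rng.randn(2, 5, 3)          # 2 sequences padded to 5 steps, 3 features each
seqLen = np.array([5, 2])       # the second sequence really has only 2 steps
wx, wh = rng.randn(3, 4), rng.randn(4, 4)

outputs, final = simpleRnn(x, seqLen, wx, wh)
```

For MNIST every image has exactly 28 rows, so all sequences are the same length and this masking is never needed; the static and dynamic versions should give equivalent results here.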


## LSTM v1

* Utilize tf.contrib.rnn.static_rnn
* tf.random_normal for weight and bias initialization
* Track loss and accuracy
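
Before building the TensorFlow graph, it may help to see the unstack-then-loop pattern in plain NumPy. This is a sketch under stated assumptions, not TensorFlow's code: `lstmStep`, `kernel`, and `bias` are illustrative names, the weights are random, and the gate ordering and forget bias of 1.0 are modelled on `BasicLSTMCell`. Small sizes are used so it runs instantly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmStep(xT, c, h, kernel, bias, forgetBias = 1.0):
    """One LSTM step: compute the gates from [xT, h], then the new cell and hidden state."""
    gates = np.concatenate([xT, h], axis = 1) @ kernel + bias
    i, j, f, o = np.split(gates, 4, axis = 1)    # input, candidate, forget, output
    c = sigmoid(f + forgetBias) * c + sigmoid(i) * np.tanh(j)
    h = sigmoid(o) * np.tanh(c)
    return c, h

samples, timeSteps, features, lstmUnits = 3, 28, 28, 8
rng = np.random.RandomState(10)
x = rng.rand(samples, timeSteps, features)
kernel = rng.randn(features + lstmUnits, 4 * lstmUnits) * 0.1
bias = np.zeros(4 * lstmUnits)

# NumPy analogue of tf.unstack(x, axis = 1): a list of timeSteps arrays [samples, features]
inputs = [x[:, t, :] for t in range(timeSteps)]

c = np.zeros((samples, lstmUnits))
h = np.zeros((samples, lstmUnits))
outputs = []
for xT in inputs:
    c, h = lstmStep(xT, c, h, kernel, bias)
    outputs.append(h)

# outputs[-1] is the hidden state after the last row; this is what feeds the final dense layer
print(len(inputs), inputs[0].shape, outputs[-1].shape)
```

Note that the last entry of `outputs` and the final hidden state `h` are the same array, which is why a model can read its prediction from either one.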

In [23]:
# Reset the TF CG
resetGraph()

# Clean away any old log files
cleanLogs()

# Set the TB logdir - We want two log dirs since we are going to be plotting two values on the same plot
logDirTrain = './logs/mnistLSTM/runOne/train'
logDirTest = './logs/mnistLSTM/runOne/test'


# Create place holders
x = tf.placeholder(tf.float32, shape = [None, timeSteps, features], name = 'x')
# Give 2nd dimension arg to shape since we are using one hot encodings
y = tf.placeholder(tf.int64, shape = [None, classes], name = 'y')

# Create weights and bias tensors
with tf.name_scope("weightBias"):
    # Todo: Better init!!
    w = tf.Variable(tf.random_normal([lstmUnits, classes]))
    b = tf.Variable(tf.random_normal([classes]))


# Add the LSTM cells
with tf.name_scope("LSTM"):
    
    # Later in the code we'll make a call to tf.contrib.rnn.static_rnn
    # tf.contrib.rnn.static_rnn expects a length T list of inputs, each a Tensor of shape [batch_size, input_size]
    # So we need to convert x from one tensor of shape [batchSize, timeSteps, features] into a
    # list of timeSteps tensors, each of shape [batchSize, features]
    #
    # https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/rnn/static_rnn
    
    # https://www.tensorflow.org/api_docs/python/tf/unstack
    inputs = tf.unstack(x, num = timeSteps, axis = 1)
    
    # Create the basic LSTM cell
    # It does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline.
    # https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
    cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
    
    # Add the cell to the RNN
    # https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/rnn/static_rnn
    output, state = tf.contrib.rnn.static_rnn(cell, inputs, dtype = tf.float32)
    
    # We only care about the final output which should be the model's prediction
    yH = tf.matmul(output[-1], w) + b
    
# Add loss function
with tf.name_scope("loss"):
    # We don't use "tf.nn.sparse_softmax_cross_entropy_with_logits" here since we have one hot encodings
    entropy = tf.nn.softmax_cross_entropy_with_logits(logits = yH, labels = y)
    loss = tf.reduce_mean(entropy, name = "loss")
    # Capture loss
    tf.summary.scalar("loss", loss)
    
with tf.name_scope("optimizer"):
    opt = tf.train.AdamOptimizer(learning_rate = lr).minimize(loss)
    
# Eval the model's accuracy
with tf.name_scope("eval"):
    # We don't use "tf.nn.in_top_k(yH, y, 1)" here since we aren't using "tf.nn.sparse_softmax_cross_entropy_with_logits"
    correct = tf.equal(tf.argmax(yH, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    # Capture accuracy
    tf.summary.scalar("accuracy", accuracy)

init = tf.global_variables_initializer()

In [24]:
# Execute the TF CG
counter = 0

with tf.Session() as sess:
    init.run()
    
    # Create the TB writer and init
    trainWriter = tf.summary.FileWriter(logDirTrain, sess.graph)
    testWriter = tf.summary.FileWriter(logDirTest)
    merge = tf.summary.merge_all()
    
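    # Note: range(epochs + 1) makes epochs + 1 passes over the data, so e = 10 is reached and printed below
    for e in range(epochs + 1):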
        for i in range(mninst.train.num_examples // samples):      
            counter += 1     
            
            # Grab the next minibatch
            xBatch, yBatch = mninst.train.next_batch(samples)
            
            # Reshape x to [samples, timeSteps, features] for the LSTM:
            #   The image is given to us as a single vector of dimensionality 784
            #   So to use them we need to gather the number of rows together to be the timeSteps
            xBatch = xBatch.reshape(samples, timeSteps, features)
            
            # Train the model
            summary, _ = sess.run([merge, opt], feed_dict = {x: xBatch, y: yBatch})
            
            # Capture summary data every N steps
            if counter % 10 == 0:
                # Manually add to the train accuracy summary value
                summary, accTrain = sess.run([merge, accuracy], feed_dict = {x: xBatch, y: yBatch})
                trainWriter.add_summary(summary, counter) 
                
                # Manually add to the test accuracy summary value
                
                # If test accuracy calcs are causing speed issues you can reduce the number tested via the following:
                #summary, accTest = sess.run([merge, accuracy], feed_dict = {
                #    x: mninst.test.images[:450].reshape(-1, timeSteps, features), 
                #    y: mninst.test.labels[:450]})
                summary, accTest = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.test.images.reshape(-1, timeSteps, features), 
                    y: mninst.test.labels})
                testWriter.add_summary(summary, counter)
                
        if e % 10 == 0:
            print(e, "Train Acc: ", accTrain, "Test Acc: ", accTest)
        
    print(" ")
    print("FINAL :: ", "Train Acc: ", accTrain, "Test Acc: ", accTest)

0 Train Acc:  0.96 Test Acc:  0.9589
10 Train Acc:  1.0 Test Acc:  0.9882
 
FINAL ::  Train Acc:  1.0 Test Acc:  0.9882


<img style="float: left; margin-right: 15px;" src="images/mnist-run-one.png" />

Although a little slower to train than some other models we've looked at, the LSTM has excellent accuracy on this problem.
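
The loss and accuracy ops used in this model can be checked by hand on a toy batch. The sketch below is a NumPy re-derivation, not TensorFlow's implementation; `softmaxCrossEntropy` is a hypothetical helper and the logits and labels are made-up numbers with three classes instead of ten.

```python
import numpy as np

def softmaxCrossEntropy(logits, oneHot):
    # Subtract the row max before exponentiating for numerical stability
    shifted = logits - logits.max(axis = 1, keepdims = True)
    logProbs = shifted - np.log(np.exp(shifted).sum(axis = 1, keepdims = True))
    # With one-hot labels only the log probability of the true class survives the sum
    return -(oneHot * logProbs).sum(axis = 1)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.2, 0.2, 3.0]])
labels = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [1., 0., 0.]])    # the third example is deliberately misclassified

# Mean cross entropy over the batch, as in the "loss" scope
loss = softmaxCrossEntropy(logits, labels).mean()

# Compare predicted and true class indices, as in the "eval" scope
correct = np.argmax(logits, axis = 1) == np.argmax(labels, axis = 1)
accuracy = correct.astype(np.float32).mean()
print(loss, accuracy)
```

Two of the three toy predictions match their labels, so the accuracy comes out at two thirds, while the loss is dominated by the confidently wrong third example.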

## LSTM v2

* Utilize tf.nn.dynamic_rnn
* Variance scaling (He-style) weight initialization set on the variable scope
* Track loss and accuracy
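
He initialization belongs to the variance scaling family: weights are drawn with zero mean and variance `scale / fanIn`, where `scale = 2.0` gives He initialization and `scale = 1.0` is the default of `tf.variance_scaling_initializer`. The sketch below is a NumPy approximation under one stated assumption: it samples from a plain, untruncated normal, which may not match TensorFlow's sampling exactly. `varianceScaling` is a hypothetical helper.

```python
import numpy as np

def varianceScaling(fanIn, fanOut, scale = 1.0, seed = 10):
    """Zero-mean normal weights with variance scale / fanIn (untruncated, for simplicity)."""
    rng = np.random.RandomState(seed)
    return rng.randn(fanIn, fanOut) * np.sqrt(scale / fanIn)

# LSTM kernel shape for features = 28 and lstmUnits = 128: [features + lstmUnits, 4 * lstmUnits]
fanIn, fanOut = 28 + 128, 4 * 128

wDefault = varianceScaling(fanIn, fanOut, scale = 1.0)
wHe = varianceScaling(fanIn, fanOut, scale = 2.0)    # He initialization

# The empirical standard deviations should sit close to sqrt(scale / fanIn)
print(wDefault.std(), np.sqrt(1.0 / fanIn))
print(wHe.std(), np.sqrt(2.0 / fanIn))
```

Compared with `tf.random_normal`, which uses a standard deviation of 1.0 regardless of layer size, scaling by the fan-in keeps the initial activations from growing with the width of the layer.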




In [40]:
# Reset the TF CG
resetGraph()

# Clean away any old log files
cleanLogs()

# Set the TB logdir - We want two log dirs since we are going to be plotting two values on the same plot
logDirTrain = './logs/mnistLSTM/runTwo/train'
logDirTest = './logs/mnistLSTM/runTwo/test'


# Create place holders
x = tf.placeholder(tf.float32, shape = [None, timeSteps, features], name = 'x')
# Give 2nd dimension arg to shape since we are using one hot encodings
y = tf.placeholder(tf.int64, shape = [None, classes], name = 'y')

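# Add the LSTM cells with variance scaling initialization (we'll let TF worry about the "w" and "b" values)
# Note: the default scale of 1.0 is used here; He initialization proper is tf.variance_scaling_initializer(scale = 2.0)
# Notice to do this we switch from "tf.name_scope" to "tf.variable_scope" and add the "initializer" param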
with tf.variable_scope("LSTM", initializer = tf.variance_scaling_initializer()):
        
    # Create the basic LSTM cell
    # It does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline.
    # https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
    cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
    
    # Add the cell to the RNN
    # https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
    # Notice we don't have to unstack x as in the previous model
    output, state = tf.nn.dynamic_rnn(cell, x, dtype = tf.float32)
    
    # We only care about the final output which should be the model's prediction
    # state is an LSTMStateTuple (c, h), so state[-1] is h: the hidden output at the last time step
    yH = tf.layers.dense(state[-1], classes)
    
# Add loss function
with tf.name_scope("loss"):
    # We don't use "tf.nn.sparse_softmax_cross_entropy_with_logits" here since we have one hot encodings
    entropy = tf.nn.softmax_cross_entropy_with_logits(logits = yH, labels = y)
    loss = tf.reduce_mean(entropy, name = "loss")
    # Capture loss
    tf.summary.scalar("loss", loss)
    
with tf.name_scope("optimizer"):
    opt = tf.train.AdamOptimizer(learning_rate = lr).minimize(loss)
    
# Eval the model's accuracy
with tf.name_scope("eval"):
    # We don't use "tf.nn.in_top_k(yH, y, 1)" here since we aren't using "tf.nn.sparse_softmax_cross_entropy_with_logits"
    correct = tf.equal(tf.argmax(yH, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    # Capture accuracy
    tf.summary.scalar("accuracy", accuracy)

init = tf.global_variables_initializer()

In [None]:
# Execute the TF CG
counter = 0

with tf.Session() as sess:
    init.run()
    
    # Create the TB writer and init
    trainWriter = tf.summary.FileWriter(logDirTrain, sess.graph)
    testWriter = tf.summary.FileWriter(logDirTest)
    merge = tf.summary.merge_all()
    
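    # Note: range(epochs + 1) makes epochs + 1 passes over the data, so e = 10 is reached and printed below
    for e in range(epochs + 1):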
        for i in range(mninst.train.num_examples // samples):      
            counter += 1     
            
            # Grab the next minibatch
            xBatch, yBatch = mninst.train.next_batch(samples)
            
            # Reshape x to [samples, timeSteps, features] for the LSTM:
            #   The image is given to us as a single vector of dimensionality 784
            #   So to use them we need to gather the number of rows together to be the timeSteps
            xBatch = xBatch.reshape(samples, timeSteps, features)
            
            # Train the model
            summary, _ = sess.run([merge, opt], feed_dict = {x: xBatch, y: yBatch})
            
            # Capture summary data every N steps
            if counter % 10 == 0:
                # Manually add to the train accuracy summary value
                summary, accTrain = sess.run([merge, accuracy], feed_dict = {x: xBatch, y: yBatch})
                trainWriter.add_summary(summary, counter) 
                
                # Manually add to the test accuracy summary value
                
                # If test accuracy calcs are causing speed issues you can reduce the number tested via the following:
                #summary, accTest = sess.run([merge, accuracy], feed_dict = {
                #    x: mninst.test.images[:450].reshape(-1, timeSteps, features), 
                #    y: mninst.test.labels[:450]})
                summary, accTest = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.test.images.reshape(-1, timeSteps, features), 
                    y: mninst.test.labels})
                testWriter.add_summary(summary, counter)
                
        if e % 10 == 0:
            print(e, "Train Acc: ", accTrain, "Test Acc: ", accTest)
        
    print(" ")
    print("FINAL :: ", "Train Acc: ", accTrain, "Test Acc: ", accTest)

0 Train Acc:  0.92 Test Acc:  0.9556
