<h1 style="color:Blue;">MNIST handwritten digit classification with TensorFlow<br>
    Part 1 - Training and evaluation</h1>
<br>
<b>
This Jupyter notebook explains how to build a simple CNN for classifying the MNIST dataset using the TensorFlow layers API, then how to train and evaluate it. The complete Python code (mnist_train.py) can be found in this GitHub repo.
<br><br>
The CNN we are going to build will look like this:</b>

![title](img/network.png)

<b>First, we import the necessary Python packages.</b>

In [1]:
import os
import sys
import shutil
import numpy as np
import tensorflow as tf

<b>Now we create some directories for the MNIST dataset, the TensorBoard event logs and the TensorFlow checkpoints. If the log and checkpoint directories already exist, we delete and recreate them so that we always start from scratch.</b>

In [2]:
SCRIPT_DIR = os.getcwd()

TRAIN_GRAPH = 'training_graph.pb'
CHKPT_FILE = 'float_model.ckpt'

CHKPT_DIR = os.path.join(SCRIPT_DIR, 'chkpts')
TB_LOG_DIR = os.path.join(SCRIPT_DIR, 'tb_logs')
CHKPT_PATH = os.path.join(CHKPT_DIR, CHKPT_FILE)
MNIST_DIR = os.path.join(SCRIPT_DIR, 'mnist_dir')


# create a directory for the MNIST dataset if it doesn't already exist
if not os.path.exists(MNIST_DIR):
    os.makedirs(MNIST_DIR)
    print("Directory ", MNIST_DIR, "created ")


# create a fresh directory for the TensorBoard data,
# deleting any existing one first
if os.path.exists(TB_LOG_DIR):
    shutil.rmtree(TB_LOG_DIR)
os.makedirs(TB_LOG_DIR)
print("Directory ", TB_LOG_DIR, "created ")


# create a fresh directory for the checkpoint,
# deleting any existing one first
if os.path.exists(CHKPT_DIR):
    shutil.rmtree(CHKPT_DIR)
os.makedirs(CHKPT_DIR)
print("Directory ", CHKPT_DIR, "created ")

Directory  /home/mharvey/ml/mnist_tf/tb_logs created 
Directory  /home/mharvey/ml/mnist_tf/chkpts created 
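
<b>As a rough illustration of the delete-and-recreate pattern used above, the sketch below wraps it in a small helper. The name 'reset_dir' and the temporary demo directory are assumptions made for this example; they are not part of mnist_train.py.</b>

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete 'path' if it exists, then create it again as an empty directory."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


# work inside a throwaway location so that no real data is touched
base = tempfile.mkdtemp()
demo_dir = os.path.join(base, 'tb_logs_demo')

reset_dir(demo_dir)

# leave a stale file behind, as a previous training run would
with open(os.path.join(demo_dir, 'old_events.txt'), 'w') as f:
    f.write('stale')

# the second call removes the stale file along with the directory
reset_dir(demo_dir)
remaining = os.listdir(demo_dir)
print(remaining)

# clean up the throwaway location
shutil.rmtree(base)
```

<b>Starting from empty directories means that TensorBoard never mixes event files from different runs.</b>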


<b>Set up the learning rate for the optimizer, the batch size and the number of epochs. We will only run for 3 epochs to keep the training time to a minimum. Be aware that real-world machine learning models might need thousands of epochs to train properly.
<br>
A batch size of 100 and just 3 epochs of training will still give us between 97% and 98% final accuracy.</b>

In [3]:
LEARN_RATE = 0.0001
BATCHSIZE = 100
EPOCHS = 3

<h2 style="color:Blue;">Data Wrangling</h2>

<b>
Now we download the MNIST dataset. TensorFlow very conveniently has a built-in function to do the job for us. What you get is a dataset that has been split into 60k images and labels for training and 10k images and labels for test. The 'images' are actually numpy arrays in which each element is an 8-bit unsigned integer. We scale this image data to the range 0 to 1.0 by dividing by 255.0, then reshape each image to 28 x 28 x 1 so that it has an explicit channel dimension. The labels are also integers, so we one-hot encode them using the 'to_categorical()' function.
</b>

In [4]:
# MNIST has 70k images in total: 60k for training and 10k for test.
# Each image is 28x28 pixels, 8 bits per pixel.
# Note: the argument is a cache file name relative to ~/.keras/datasets
mnist_dataset = tf.keras.datasets.mnist.load_data('mnist_data')
(x_train, y_train), (x_test, y_test) = mnist_dataset

# scale pixel values from 0:255 to 0:1
# (the division also converts the uint8 values to float)
x_train = x_train / 255.0
x_test = x_test / 255.0

# reshape train & test images
x_train = np.reshape(x_train, [-1, 28, 28, 1])
x_test = np.reshape(x_test, [-1, 28, 28, 1])

# one-hot encode the labels
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
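
<b>To see what these three steps do to the data, here is a minimal NumPy-only sketch on a tiny made-up array. The 2 x 2 'images' and the labels are hypothetical stand-ins for MNIST, and the identity-matrix lookup is just one common way to one-hot encode ten classes, not necessarily how 'to_categorical()' is implemented.</b>

```python
import numpy as np

# four hypothetical 2x2 'images' with uint8 pixels, standing in for MNIST
images = np.array([[[0, 255], [128, 64]]] * 4, dtype=np.uint8)
labels = np.array([3, 0, 9, 3])

# dividing by a float promotes uint8 to float64 and maps 0..255 onto 0..1
scaled = images / 255.0

# add a trailing channel dimension: (4, 2, 2) -> (4, 2, 2, 1)
scaled = np.reshape(scaled, [-1, 2, 2, 1])

# one-hot encode by selecting rows of a 10x10 identity matrix
one_hot = np.eye(10, dtype=np.float32)[labels]

print(scaled.dtype, scaled.shape)
print(one_hot[0])
```

<b>Each row of 'one_hot' contains a single 1.0 at the index of its label, which is the format the loss function expects later on.</b>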

<b>
The built-in MNIST download function gives us a split of 60k images for training and 10k images for test. We are going to 'steal' the last 5k images from the training set to create a validation set of 'unseen' images, which we will use to evaluate our trained model.
</b>

In [5]:
# take the last 5000 images from the train set to make a validation dataset
x_valid = x_train[55000:]
y_valid = y_train[55000:]

# reduce train dataset to 55000 images
y_train = y_train[:55000]
x_train = x_train[:55000]

# calculate total number of batches
total_batches = int(len(x_train)/BATCHSIZE)
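
<b>The arithmetic behind the split and the batch count can be checked with plain Python. This sketch uses a list of indices as a stand-in for the 60k training samples; the variable names 'samples', 'train', 'valid' and 'total_steps' are chosen for the example only.</b>

```python
BATCHSIZE = 100
EPOCHS = 3

# stand-in for the 60k training samples (indices only, no image data)
samples = list(range(60000))

valid = samples[55000:]   # last 5000 samples are held out
train = samples[:55000]   # first 55000 samples are kept for training

# integer division would drop any incomplete final batch
total_batches = len(train) // BATCHSIZE
total_steps = total_batches * EPOCHS

print(len(train), len(valid), total_batches, total_steps)
# prints: 55000 5000 550 1650
```

<b>With these settings the optimizer is therefore run 1650 times in total.</b>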

<h2 style="color:Blue;">The Computational Graph</h2>

<b>
The placeholders for inputting data have shapes that match the modified datasets. The 'x' placeholder takes in the 28 x 28 pixel images (actually numpy arrays) and so has shape [None, 28, 28, 1]. The 'y' placeholder takes in the one-hot encoded labels and so has shape [None, 10]. The leading 'None' leaves the batch dimension unspecified, so any number of images can be fed at once.
</b>

In [6]:
# define placeholders for the input data & labels
x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1], name='images_in')
y = tf.placeholder(tf.float32, [None, 10], name='labels_in')

<b>
Now we define our simple CNN as a series of layers: a 2D convolution layer, followed by a max pooling layer, then another convolution layer, then a flatten layer, then a fully connected (or dense) layer with 256 outputs, and a final fully connected layer with 10 outputs, one per digit class.
</b>

In [7]:
# define the CNN
def cnn(x):
    net = tf.layers.conv2d(x, 16, [3, 3], activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
    net = tf.layers.conv2d(net, 32, [3, 3], activation=tf.nn.relu)
    net = tf.layers.flatten(net)
    net = tf.layers.dense(net, units=256, activation=None if False else tf.nn.relu)
    logits = tf.layers.dense(net, units=10, activation=None)
    return logits

# build the network, input comes from the 'x' placeholder
logits = cnn(x)
prediction = tf.nn.softmax(logits, name='prediction')
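
<b>It can help to trace how the tensor shape changes through these layers. The sketch below does the arithmetic in plain Python, assuming that the layers use their default settings of no padding ('valid'), a convolution stride of 1 and a bias term per output. The helper name 'conv_out' is made up for this example.</b>

```python
def conv_out(size, kernel, stride=1):
    """Output size along one dimension for an unpadded ('valid') window."""
    return (size - kernel) // stride + 1


side = 28
side = conv_out(side, 3)             # 3x3 convolution, 16 filters -> 26
side = conv_out(side, 2, stride=2)   # 2x2 max pooling, stride 2   -> 13
side = conv_out(side, 3)             # 3x3 convolution, 32 filters -> 11
flat = side * side * 32              # flatten 11 x 11 x 32        -> 3872

# trainable parameters per layer: weights plus one bias per output
params = {
    'conv1': 3 * 3 * 1 * 16 + 16,
    'conv2': 3 * 3 * 16 * 32 + 32,
    'dense1': flat * 256 + 256,
    'dense2': 256 * 10 + 10,
}
total_params = sum(params.values())

print(side, flat, total_params)
# prints: 11 3872 998858
```

<b>Under these assumptions almost all of the parameters sit in the first dense layer, which is typical for a small CNN with only one pooling stage.</b>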

<b>The loss function is a cross entropy function for classification which accepts labels in one-hot format (which explains why we one-hot encoded the labels earlier). The training optimizer is Adam (Adaptive Moment Estimation).</b>

In [8]:
# softmax cross entropy loss function
loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(logits=logits, onehot_labels=y))

# Adam (Adaptive Moment Estimation) optimizer - minimize the loss
optimizer = tf.train.AdamOptimizer(learning_rate=LEARN_RATE, name='Adam').minimize(loss)
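
<b>For intuition, softmax cross entropy with one-hot labels can be written in a few lines of NumPy. This is a sketch of the underlying formula averaged over the batch, not TensorFlow's implementation; the function names are chosen for this example.</b>

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)


def softmax_cross_entropy(logits, onehot_labels):
    """Mean over the batch of -sum(label * log(probability))."""
    probs = softmax(logits)
    per_sample = -np.sum(onehot_labels * np.log(probs), axis=1)
    return np.mean(per_sample)


labels = np.eye(10)[[3, 7]]          # two samples, classes 3 and 7

# an untrained network that outputs equal logits is maximally unsure
loss_uniform = softmax_cross_entropy(np.zeros((2, 10)), labels)

# large logits on the correct classes give a loss close to zero
loss_confident = softmax_cross_entropy(20.0 * labels, labels)

print(loss_uniform, loss_confident)
```

<b>With ten classes and equal logits the loss is ln(10), roughly 2.30, which is about the value you should expect to see in TensorBoard at the very start of training.</b>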

<b>We will calculate the accuracy of our network as the fraction of predictions that match the labels, i.e. the mean of the correct predictions.</b>

In [9]:
# Check to see if the prediction matches the label
correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))

# Calculate accuracy as mean of the correct predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
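
<b>The same calculation on a small made-up batch, using NumPy instead of TensorFlow ops, shows why taking the mean of the boolean comparison gives the accuracy. The probabilities and labels below are hypothetical.</b>

```python
import numpy as np

# hypothetical softmax outputs for 4 samples and 3 classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])

# one-hot labels: the true classes are 1, 0, 1, 1
onehot = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0]])

# argmax recovers the class index from both predictions and labels
correct = np.equal(np.argmax(probs, 1), np.argmax(onehot, 1))

# casting True/False to 1.0/0.0 makes the mean the fraction correct
accuracy = np.mean(correct.astype(np.float32))

print(correct, accuracy)
```

<b>Three of the four predictions match, so the accuracy is 0.75.</b>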

<b>We will collect the loss and accuracy data for displaying in TensorBoard along with the images that are fed into the 'x' placeholder.</b>

In [10]:
# TensorBoard data collection
tf.summary.scalar('cross_entropy_loss', loss)
tf.summary.scalar('accuracy', accuracy)
tf.summary.image('input_images', x)

<tf.Tensor 'input_images:0' shape=() dtype=string>

<b>We define an instance of a saver object which will be used inside our session to save the trained model checkpoint.</b>

In [11]:
# set up saver object
saver = tf.train.Saver()

<h2 style="color:Blue;">The Session</h2>

<b>Inside the session, we initialize all the variables then loop through the number of epochs, sending the training data into the 'x' and 'y' placeholders.

When we exit the training loop, the final accuracy is calculated using the validation set, and then the final trained model is saved as a checkpoint and as a graph in a binary protobuf file.
</b>
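
<b>The batch slicing and the step number passed to the TensorBoard writer can be sketched without TensorFlow. The helper 'iter_batches' and the small sample count are assumptions made to keep the example short.</b>

```python
def iter_batches(n_samples, batch_size):
    """Yield (start, stop) slice bounds for each full batch, in order."""
    for i in range(n_samples // batch_size):
        yield i * batch_size, (i + 1) * batch_size


# small hypothetical numbers instead of 55000 samples
n_samples, batch_size, epochs = 500, 100, 3
total_batches = n_samples // batch_size

steps = []
for epoch in range(epochs):
    for i, (start, stop) in enumerate(iter_batches(n_samples, batch_size)):
        # a step counter that keeps increasing across epochs
        global_step = epoch * total_batches + i
        steps.append((global_step, start, stop))

print(steps[0], steps[-1])
# prints: (0, 0, 100) (14, 400, 500)
```

<b>Because the step counter does not restart at each epoch, the loss and accuracy curves in TensorBoard appear as one continuous line over the whole run. Note that the batches are visited in the same order in every epoch; shuffling is a common refinement but is not done here.</b>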

In [12]:
with tf.Session() as sess:

    sess.run(tf.initializers.global_variables())
    
    # TensorBoard writer
    writer = tf.summary.FileWriter(TB_LOG_DIR, sess.graph)
    tb_summary = tf.summary.merge_all()

    # Training phase with training data
    print ('-------------------------------------------------------------')
    print ('TRAINING PHASE')
    print ('-------------------------------------------------------------')
    for epoch in range(EPOCHS):
        print ("Epoch", epoch+1, "/", EPOCHS)

        # process all batches
        for i in range(total_batches):
            
            # fetch a batch from training dataset
            batch_x, batch_y = x_train[i*BATCHSIZE:i*BATCHSIZE+BATCHSIZE], y_train[i*BATCHSIZE:i*BATCHSIZE+BATCHSIZE]

            # Run graph for optimization, loss, accuracy - i.e. do the training
            _, s = sess.run([optimizer, tb_summary], feed_dict={x: batch_x, y: batch_y})
            writer.add_summary(s, (epoch*total_batches + i))
            # Every 100 batches, display the accuracy measured on the test set
            # (note that it is printed under the label 'Training accuracy')
            if i % 100 == 0:
                acc = sess.run(accuracy, feed_dict={x: x_test, y: y_test})
                print (' Step: {:4d}  Training accuracy: {:1.4f}'.format(i,acc))


    print ('-------------------------------------------------------------')
    print ('FINISHED TRAINING')
    print('Run `tensorboard --logdir=%s --port 6006 --host localhost` to see the results.' % TB_LOG_DIR)
    print ('-------------------------------------------------------------')
    writer.flush()
    writer.close()


    # Evaluation phase with validation dataset
    print ('EVALUATION PHASE:')
    print ("Final Accuracy with validation set:", sess.run(accuracy, feed_dict={x: x_valid, y: y_valid}))
    print ('-------------------------------------------------------------')

    # save post-training checkpoint & graph
    print ('SAVING:')
    save_path = saver.save(sess, CHKPT_PATH)
    print('Saved checkpoint to %s' % CHKPT_PATH)
    tf.train.write_graph(sess.graph_def, CHKPT_DIR, TRAIN_GRAPH, as_text=False)
    print('Saved binary graphDef to %s' % os.path.join(CHKPT_DIR,TRAIN_GRAPH))
    print ('-------------------------------------------------------------')


-------------------------------------------------------------
TRAINING PHASE
-------------------------------------------------------------
Epoch 1 / 3
 Step:    0  Training accuracy: 0.0763
 Step:  100  Training accuracy: 0.8605
 Step:  200  Training accuracy: 0.9064
 Step:  300  Training accuracy: 0.9211
 Step:  400  Training accuracy: 0.9272
 Step:  500  Training accuracy: 0.9341
Epoch 2 / 3
 Step:    0  Training accuracy: 0.9388
 Step:  100  Training accuracy: 0.9425
 Step:  200  Training accuracy: 0.9499
 Step:  300  Training accuracy: 0.9500
 Step:  400  Training accuracy: 0.9539
 Step:  500  Training accuracy: 0.9599
Epoch 3 / 3
 Step:    0  Training accuracy: 0.9614
 Step:  100  Training accuracy: 0.9642
 Step:  200  Training accuracy: 0.9689
 Step:  300  Training accuracy: 0.9670
 Step:  400  Training accuracy: 0.9679
 Step:  500  Training accuracy: 0.9724
-------------------------------------------------------------
FINISHED TRAINING
Run `tensorboard --logdir=/home/mharvey/ml/