# Training Process

In this section, I will show you how to build and train a neural network that recognizes handwritten digits. We use the MNIST dataset, which consists of a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for those who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. We will solve this classification problem with fewer than 120 lines of Python / TensorFlow / TF-Slim code.

Our neural network takes in handwritten digits and classifies them, i.e. states if it recognizes them as a 0, a 1, a 2 and so on up to a 9. Each image in the MNIST dataset is a 28x28 pixel greyscale image.

Below is our main code for training the model. I have tried to make it as short and clear as possible. In this workshop, we will use LeNet as the network.

So, here we go.

First of all, we fetch the workshop files and import the required modules. We need TensorFlow as the main platform. We also import `mnist`, the module that reads the data in the form we need, as well as `load_batch`, which loads the data in batches of a chosen size. The latter helps us stay within our memory and compute limits.


In [1]:
!wget -c -np -r http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie
!cp -r ./ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/* .
!ls -R


--2017-11-30 08:55:47--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie
Resolving ict.icrar.org (ict.icrar.org)... 130.95.227.213
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/ [following]
--2017-11-30 08:55:48--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2183 (2.1K) [text/html]
Saving to: ‘ict.icrar.org/store/staff/rdodson/ML_WS/Ellie.1’


2017-11-30 08:55:48 (263 MB/s) - ‘ict.icrar.org/store/staff/rdodson/ML_WS/Ellie.1’ saved [2183/2183]

Loading robots.txt; please ignore errors.
--2017-11-30 08:55:48--  http://ict.icrar.org/robots.txt
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-11-30 08:55:4

connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:55:57--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/?C=D;O=D
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:55:57--  http://ict.icrar.org/store/staff/rdodson/ML_WS/?C=N;O=D
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:55:58--  http://ict.icrar.org/store/staff/rdodson/ML_WS/?C=M;O=A
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully 

connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:06--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/datasets/download_and_convert_mnist.py
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:06--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/datasets/mnist.py
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:07--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/log/?C=N;O=D
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requeste

416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:14--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/data/?C=N;O=A
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:15--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/data/?C=M;O=D
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2017-11-30 08:56:15--  http://ict.icrar.org/store/staff/rdodson/ML_WS/Ellie/data/?C=S;O=D
Connecting to ict.icrar.org (ict.icrar.org)|130.95.227.213|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing t

In [0]:
import tensorflow as tf

from datasets import mnist

from model import lenet, load_batch

I have tried other libraries before, such as plain TensorFlow and Keras. Both have their pros and cons, but here we will move forward with another flexible library called Slim (TF-Slim). Slim is well supported and comes with many pretrained models such as ResNet, VGG, Inception, and others. Slim is a clean and lightweight wrapper around TensorFlow, and it keeps our script much neater than using the low-level TensorFlow API directly.

Slim ships with TensorFlow 1.x as `tf.contrib.slim`, so there is no need to install it separately. A big advantage, yeah?!

In [0]:
slim = tf.contrib.slim

Let's make the script easier to configure by defining some flags for the data directory, the batch size used when loading data, etc.

In [0]:
flags = tf.app.flags
flags.DEFINE_string('data_dir', './data/',
                    'Directory with the mnist data.')
flags.DEFINE_integer('batch_size', 5, 'Batch size.')
flags.DEFINE_integer('num_batches', None,
                     'Num of batches to train (epochs).')
flags.DEFINE_string('log_dir', './log',
                    'Directory with the log data.')
FLAGS = flags.FLAGS
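
For readers who want to see what such flags amount to without TensorFlow installed, here is a minimal sketch that declares the same four options with the standard-library `argparse` module. It is only an analogy, not how `tf.app.flags` is implemented; the helper name `build_parser` is made up for this illustration, while the flag names and defaults are copied from the cell above.

```python
import argparse


def build_parser():
    """Build a parser with the same flag names and defaults as the cell above."""
    parser = argparse.ArgumentParser(description="MNIST training flags (sketch)")
    parser.add_argument("--data_dir", type=str, default="./data/",
                        help="Directory with the mnist data.")
    parser.add_argument("--batch_size", type=int, default=5,
                        help="Batch size.")
    parser.add_argument("--num_batches", type=int, default=None,
                        help="Num of batches to train.")
    parser.add_argument("--log_dir", type=str, default="./log",
                        help="Directory with the log data.")
    return parser


# No command-line arguments: every flag keeps its default value.
defaults = build_parser().parse_args([])

# Overriding one flag, as one would do on the command line.
overridden = build_parser().parse_args(["--batch_size", "100"])

print(defaults.batch_size, overridden.batch_size)
```

The point of the design is the same in both cases: values such as the batch size live in one place and can be changed from the command line without editing the script.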

We load the training dataset using `mnist.get_split`. This function contains the instructions for reading and decoding the data stored as `TFRecords`.

In [0]:
dataset = mnist.get_split('train', FLAGS.data_dir)

Next, we load batches of the dataset. The important arguments of `load_batch` are the dataset from the step above and the batch size, which we have already provided through the flags. For this workshop, we choose a batch size of 100; note that the `batch_size` flag defined above defaults to 5, so pass a different value if you want batches of 100. Feel free to change the batch size, but be aware of your memory limits! Don't forget that 100 here means 100 images + 100 labels.

In [0]:
images, labels = load_batch(
    dataset,
    FLAGS.batch_size,
    is_training=True)
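
The idea of batching can be illustrated with a few lines of plain Python. This is a conceptual sketch only: the real `load_batch` lives in the workshop's `model` module and is not shown here, so the generator `iterate_batches` and the toy data below are assumptions made for illustration.

```python
def iterate_batches(images, labels, batch_size):
    """Yield successive (images, labels) batches of at most batch_size items."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    for start in range(0, len(images), batch_size):
        end = start + batch_size
        yield images[start:end], labels[start:end]


# Toy stand-ins for the data: 10 "images" (plain integers) and their labels.
toy_images = list(range(10))
toy_labels = [i % 3 for i in toy_images]

batches = list(iterate_batches(toy_images, toy_labels, batch_size=4))
batch_sizes = [len(batch_images) for batch_images, _ in batches]

# With 60,000 training examples and a batch size of 100, one pass over
# the training set takes 60000 / 100 = 600 steps.
steps_per_epoch = 60000 // 100

print(batch_sizes, steps_per_epoch)
```

Each batch always carries the images together with their matching labels, which is why a batch size of 100 means 100 images plus 100 labels in memory at once.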

Here the training digits are fed into the deep neural network, 100 at a time: we run each batch of images through the neural network model.

In [0]:
predictions = lenet(images)
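
To make the output of the network concrete: for each image, the model produces a vector of 10 scores, one per digit, and the recognized digit is the position of the largest score. The sketch below shows that last step in plain Python. The function `predicted_digit` and the example scores are invented for illustration, and it assumes the workshop's `lenet` returns one score per class.

```python
def predicted_digit(scores):
    """Return the index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up scores for one image; the largest value sits at index 3.
example_scores = [0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 0.9, 0.2, -0.1, 0.4]
digit = predicted_digit(example_scores)

print(digit)
```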

Then we encode the labels using one-hot encoding. This means that we represent the label "6", for instance, by a vector of 10 values that are all zero except the entry at index 6 (the seventh value, since counting starts at 0), which is 1 (see the image below).
This is handy because the format is very similar to how our neural network outputs predictions, which is also a vector of 10 values.

In [0]:
one_hot_labels = slim.one_hot_encoding(
    labels,
    dataset.num_classes)
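
As a minimal sketch of what one-hot encoding does to a single label, here is a plain-Python version. The helper `one_hot` is written for this illustration and is not the Slim implementation.

```python
def one_hot(label, num_classes):
    """Return a list of num_classes zeros with a single 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


# The digit 6 becomes a vector whose entry at index 6 is 1.
encoded_six = one_hot(6, 10)

print(encoded_six)
```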

To drive the training, we define a loss function, i.e. a value representing how badly the system recognizes the digits, and try to minimise it. A suitable loss function has to be chosen for the particular problem at hand; for this classification task we use the softmax cross-entropy. With `tf.summary` we write the loss values as summaries to be plotted later in TensorBoard.

In [9]:
slim.losses.softmax_cross_entropy(
    predictions,
    one_hot_labels)

total_loss = slim.losses.get_total_loss()
tf.summary.scalar('loss', total_loss)

Instructions for updating:
Use tf.losses.softmax_cross_entropy instead. Note that the order of the logits and labels arguments has been changed.
Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
Instructions for updating:
Use tf.losses.add_loss instead.
Instructions for updating:
Use tf.losses.get_total_loss instead.
Instructions for updating:
Use tf.losses.get_losses instead.
Instructions for updating:
Use tf.losses.get_regularization_losses instead.


<tf.Tensor 'loss:0' shape=() dtype=string>
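
To see what the softmax cross-entropy measures, here is a small worked sketch in plain Python for a single image. The helper functions and the example scores are written for this illustration only; they are not the TensorFlow implementation, which also averages over the batch.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p)
                for p, t in zip(probabilities, one_hot_label) if t > 0)


# One-hot label for the digit 6.
label = [0.0] * 10
label[6] = 1.0

# A network that knows nothing gives every digit the same score.
uniform_loss = cross_entropy(softmax([0.0] * 10), label)

# A network that is confident and right gives index 6 a much larger score.
confident_logits = [0.0] * 10
confident_logits[6] = 8.0
confident_loss = cross_entropy(softmax(confident_logits), label)

print(uniform_loss, confident_loss)
```

With equal scores for all ten digits the loss is ln(10), about 2.30, which is roughly the level of the loss values logged at the first training steps further down; a confident correct prediction drives the loss towards 0.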

This is where the TensorFlow magic happens. You select an optimiser (there are many available, such as `GradientDescentOptimizer`, `MomentumOptimizer`, `AdamOptimizer`, etc.) and ask it to minimise the cross-entropy loss. In this step, TensorFlow computes the partial derivatives of the loss function with respect to all the weights and all the biases (the gradient). Here, we use `RMSProp` as the optimizer; it normalizes each gradient by the magnitude of recent gradients. The two arguments below are the learning rate (0.001) and the decay (0.9).

In [0]:
optimizer = tf.train.RMSPropOptimizer(0.001, 0.9)
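
The normalization that RMSProp performs can be sketched for a single weight in plain Python. This is a simplified textbook form of the update without momentum, not the TensorFlow source; the function `rmsprop_step` and the toy objective are assumptions for illustration, with the learning rate 0.001 and decay 0.9 taken from the cell above.

```python
import math


def rmsprop_step(weight, gradient, mean_square,
                 learning_rate=0.001, decay=0.9, epsilon=1e-10):
    """Apply one RMSProp update to a single weight.

    mean_square is a running average of squared gradients; dividing by its
    square root keeps the step size close to learning_rate regardless of
    how large or small the raw gradient is.
    """
    mean_square = decay * mean_square + (1.0 - decay) * gradient ** 2
    weight = weight - learning_rate * gradient / math.sqrt(mean_square + epsilon)
    return weight, mean_square


# Toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
weight, mean_square = 0.0, 0.0
for _ in range(5000):
    gradient = 2.0 * (weight - 3.0)
    weight, mean_square = rmsprop_step(weight, gradient, mean_square)

print(weight)
```

Because every step is normalized, the weight moves by about the learning rate per step, so the small value 0.001 translates into many small, steady steps towards the minimum at 3.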

Before starting the training loop, note that it needs a training operation, which we call `train_op`.
This is a crucial `Operation` that:

(a) computes the loss,

(b) applies the gradients to update the weights and

(c) returns the value of the loss. 

`slim.learning.create_train_op` creates such an `Operation`.

In [11]:
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    summarize_gradients=True)

Instructions for updating:
Please switch to tf.train.get_or_create_global_step
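
The three responsibilities of the training operation listed above, and the loop that repeatedly runs it, can be sketched on a toy problem in plain Python. Everything here is hypothetical and for illustration only: `TinyModel` and `make_train_step` are made-up names, the model is a single weight fitting y = 2x, and plain gradient descent is used in place of RMSProp to keep the example short.

```python
class TinyModel:
    """A one-parameter model that predicts weight * x."""

    def __init__(self):
        self.weight = 0.0


def make_train_step(model, inputs, targets, learning_rate):
    """Return a function that performs one training step when called."""

    def train_step():
        count = len(inputs)
        # (a) compute the loss: mean squared error of the predictions
        errors = [model.weight * x - y for x, y in zip(inputs, targets)]
        loss = sum(e * e for e in errors) / count
        # (b) apply the gradient to update the weight
        gradient = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / count
        model.weight -= learning_rate * gradient
        # (c) return the value of the loss (measured before the update)
        return loss

    return train_step


model = TinyModel()
train_step = make_train_step(
    model, inputs=[1.0, 2.0, 3.0], targets=[2.0, 4.0, 6.0], learning_rate=0.05)

# The training loop simply runs the training step over and over.
losses = [train_step() for _ in range(50)]

print(losses[0], losses[-1], model.weight)
```

The loss starts high and shrinks as the weight approaches 2, which is the same pattern, on a much smaller scale, as the loss values that a training loop logs step by step.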


Finally, it is time to run the training loop. All the TensorFlow instructions up to this point have been preparing a computation graph in memory, but nothing has been computed yet. At this stage, it is better to start the evaluation part first.

"Training" the neural network actually means using training images and labels to adjust the weights and biases so as to minimise the cross-entropy loss function. Now we are all set to start the training loop. Use `slim.learning.train` for that. As you can see, this function uses the operation `train_op` that we created in the previous step.

In [0]:
slim.learning.train(
    train_op,
    FLAGS.log_dir,
    save_summaries_secs=20)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./log/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 1: loss = 2.4864 (0.230 sec/step)
INFO:tensorflow:global step 2: loss = 2.3287 (0.028 sec/step)
INFO:tensorflow:global step 3: loss = 2.1605 (0.020 sec/step)
INFO:tensorflow:global step 4: loss = 2.1992 (0.022 sec/step)
INFO:tensorflow:global step 5: loss = 2.4702 (0.021 sec/step)
INFO:tensorflow:global step 6: loss = 2.1694 (0.021 sec/step)
INFO:tensorflow:global step 7: loss = 2.2946 (0.025 sec/step)
INFO:tensorflow:global step 8: loss = 2.5279 (0.020 sec/step)
INFO:tensorflow:global step 9: loss = 2.2254 (0.021 sec/step)
INFO:tensorflow:global step 10: loss = 2.2740 (0.019 sec/step)
INFO:tensorflow:global step 11: loss = 2.2076 (0.022 sec/step)
INFO:tensorflow:global step 12: loss = 2.4333 (0.020 sec/step)
INFO:tensorflow:global step 13: loss = 2.2090 (

INFO:tensorflow:global step 53: loss = 2.1533 (0.021 sec/step)
INFO:tensorflow:global step 54: loss = 2.0096 (0.022 sec/step)
INFO:tensorflow:global step 55: loss = 2.5490 (0.022 sec/step)
INFO:tensorflow:global step 56: loss = 2.0941 (0.021 sec/step)
INFO:tensorflow:global step 57: loss = 2.0626 (0.020 sec/step)
INFO:tensorflow:global step 58: loss = 2.0256 (0.022 sec/step)
INFO:tensorflow:global step 59: loss = 2.0333 (0.021 sec/step)
INFO:tensorflow:global step 60: loss = 2.1240 (0.020 sec/step)
INFO:tensorflow:global step 61: loss = 2.5375 (0.021 sec/step)
INFO:tensorflow:global step 62: loss = 2.0690 (0.022 sec/step)
INFO:tensorflow:global step 63: loss = 2.0492 (0.020 sec/step)
INFO:tensorflow:global step 64: loss = 2.1520 (0.026 sec/step)
INFO:tensorflow:global step 65: loss = 2.1766 (0.024 sec/step)
INFO:tensorflow:global step 66: loss = 2.1530 (0.022 sec/step)
INFO:tensorflow:global step 67: loss = 1.8145 (0.022 sec/step)
INFO:tensorflow:global step 68: loss = 2.1435 (0.020 se

INFO:tensorflow:global step 108: loss = 1.5433 (0.035 sec/step)
INFO:tensorflow:global step 109: loss = 1.4513 (0.027 sec/step)
INFO:tensorflow:global step 110: loss = 1.4933 (0.033 sec/step)
INFO:tensorflow:global step 111: loss = 1.0821 (0.031 sec/step)
INFO:tensorflow:global step 112: loss = 0.7160 (0.031 sec/step)
INFO:tensorflow:global step 113: loss = 1.0162 (0.031 sec/step)
INFO:tensorflow:global step 114: loss = 1.0199 (0.031 sec/step)
INFO:tensorflow:global step 115: loss = 0.3885 (0.032 sec/step)
INFO:tensorflow:global step 116: loss = 1.4441 (0.030 sec/step)
INFO:tensorflow:global step 117: loss = 1.0752 (0.029 sec/step)
INFO:tensorflow:global step 118: loss = 0.1914 (0.029 sec/step)
INFO:tensorflow:global step 119: loss = 0.8358 (0.030 sec/step)
INFO:tensorflow:global step 120: loss = 1.0660 (0.037 sec/step)
INFO:tensorflow:global step 121: loss = 1.1599 (0.034 sec/step)
INFO:tensorflow:global step 122: loss = 0.6038 (0.033 sec/step)
INFO:tensorflow:global step 123: loss = 

INFO:tensorflow:global step 163: loss = 0.0457 (0.032 sec/step)
INFO:tensorflow:global step 164: loss = 0.2062 (0.031 sec/step)
INFO:tensorflow:global step 165: loss = 0.6588 (0.030 sec/step)
INFO:tensorflow:global step 166: loss = 0.2319 (0.028 sec/step)
INFO:tensorflow:global step 167: loss = 0.0477 (0.029 sec/step)
INFO:tensorflow:global step 168: loss = 1.3552 (0.029 sec/step)
INFO:tensorflow:global step 169: loss = 0.4578 (0.031 sec/step)
INFO:tensorflow:global step 170: loss = 1.1459 (0.035 sec/step)
INFO:tensorflow:global step 171: loss = 0.5539 (0.044 sec/step)
INFO:tensorflow:global step 172: loss = 0.7273 (0.033 sec/step)
INFO:tensorflow:global step 173: loss = 0.7337 (0.037 sec/step)
INFO:tensorflow:global step 174: loss = 0.7945 (0.029 sec/step)
INFO:tensorflow:global step 175: loss = 0.2171 (0.028 sec/step)
INFO:tensorflow:global step 176: loss = 0.3682 (0.030 sec/step)
INFO:tensorflow:global step 177: loss = 0.4748 (0.029 sec/step)
INFO:tensorflow:global step 178: loss = 

INFO:tensorflow:global step 218: loss = 0.1022 (0.028 sec/step)
INFO:tensorflow:global step 219: loss = 0.3716 (0.028 sec/step)
INFO:tensorflow:global step 220: loss = 0.7992 (0.029 sec/step)
INFO:tensorflow:global step 221: loss = 0.8264 (0.029 sec/step)
INFO:tensorflow:global step 222: loss = 0.3033 (0.029 sec/step)
INFO:tensorflow:global step 223: loss = 0.5777 (0.030 sec/step)
INFO:tensorflow:global step 224: loss = 0.2111 (0.028 sec/step)
INFO:tensorflow:global step 225: loss = 0.3514 (0.030 sec/step)
INFO:tensorflow:global step 226: loss = 0.2413 (0.034 sec/step)
INFO:tensorflow:global step 227: loss = 0.0484 (0.029 sec/step)
INFO:tensorflow:global step 228: loss = 0.2347 (0.031 sec/step)
INFO:tensorflow:global step 229: loss = 0.1794 (0.030 sec/step)
INFO:tensorflow:global step 230: loss = 0.1096 (0.030 sec/step)
INFO:tensorflow:global step 231: loss = 0.3333 (0.026 sec/step)
INFO:tensorflow:global step 232: loss = 0.0342 (0.029 sec/step)
INFO:tensorflow:global step 233: loss = 

INFO:tensorflow:global step 273: loss = 2.7698 (0.030 sec/step)
INFO:tensorflow:global step 274: loss = 0.5794 (0.028 sec/step)
INFO:tensorflow:global step 275: loss = 0.5081 (0.030 sec/step)
INFO:tensorflow:global step 276: loss = 0.0367 (0.036 sec/step)
INFO:tensorflow:global step 277: loss = 1.9344 (0.028 sec/step)
INFO:tensorflow:global step 278: loss = 0.0550 (0.030 sec/step)
INFO:tensorflow:global step 279: loss = 0.2350 (0.030 sec/step)
INFO:tensorflow:global step 280: loss = 0.0148 (0.028 sec/step)
INFO:tensorflow:global step 281: loss = 0.3525 (0.027 sec/step)
INFO:tensorflow:global step 282: loss = 0.0132 (0.028 sec/step)
INFO:tensorflow:global step 283: loss = 0.0477 (0.030 sec/step)
INFO:tensorflow:global step 284: loss = 1.0694 (0.029 sec/step)
INFO:tensorflow:global step 285: loss = 0.0520 (0.030 sec/step)
INFO:tensorflow:global step 286: loss = 0.1428 (0.028 sec/step)
INFO:tensorflow:global step 287: loss = 0.0177 (0.034 sec/step)
INFO:tensorflow:global step 288: loss = 

INFO:tensorflow:global step 328: loss = 0.0075 (0.028 sec/step)
INFO:tensorflow:global step 329: loss = 0.0007 (0.028 sec/step)
INFO:tensorflow:global step 330: loss = 0.3379 (0.029 sec/step)
INFO:tensorflow:global step 331: loss = 0.0014 (0.026 sec/step)
INFO:tensorflow:global step 332: loss = 0.0143 (0.027 sec/step)
INFO:tensorflow:global step 333: loss = 0.3713 (0.029 sec/step)
INFO:tensorflow:global step 334: loss = 1.0393 (0.034 sec/step)
INFO:tensorflow:global step 335: loss = 0.0026 (0.028 sec/step)
INFO:tensorflow:global step 336: loss = 0.0200 (0.028 sec/step)
INFO:tensorflow:global step 337: loss = 0.0006 (0.029 sec/step)
INFO:tensorflow:global step 338: loss = 0.3842 (0.027 sec/step)
INFO:tensorflow:global step 339: loss = 0.0064 (0.028 sec/step)
INFO:tensorflow:global step 340: loss = 0.0159 (0.030 sec/step)
INFO:tensorflow:global step 341: loss = 0.0168 (0.029 sec/step)
INFO:tensorflow:global step 342: loss = 0.1549 (0.027 sec/step)
INFO:tensorflow:global step 343: loss = 

INFO:tensorflow:global step 383: loss = 0.0040 (0.029 sec/step)
INFO:tensorflow:global step 384: loss = 0.1355 (0.028 sec/step)
INFO:tensorflow:global step 385: loss = 0.1175 (0.026 sec/step)
INFO:tensorflow:global step 386: loss = 2.2660 (0.030 sec/step)
INFO:tensorflow:global step 387: loss = 0.0002 (0.029 sec/step)
INFO:tensorflow:global step 388: loss = 0.0229 (0.028 sec/step)
INFO:tensorflow:global step 389: loss = 0.0669 (0.027 sec/step)
INFO:tensorflow:global step 390: loss = 0.0078 (0.029 sec/step)
INFO:tensorflow:global step 391: loss = 0.2718 (0.036 sec/step)
INFO:tensorflow:global step 392: loss = 0.0064 (0.025 sec/step)
INFO:tensorflow:global step 393: loss = 0.0237 (0.029 sec/step)
INFO:tensorflow:global step 394: loss = 0.4880 (0.030 sec/step)
INFO:tensorflow:global step 395: loss = 0.3797 (0.027 sec/step)
INFO:tensorflow:global step 396: loss = 0.0764 (0.027 sec/step)
INFO:tensorflow:global step 397: loss = 0.1783 (0.028 sec/step)
INFO:tensorflow:global step 398: loss = 