Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
import os

First reload the data we generated in `1_notmnist.ipynb`.

In [2]:
# Create data directory path
dpath = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
dpath = os.path.join(dpath, 'data')
# create pickle data file path
pickle_file = os.path.join(dpath,'notMNIST.pickle')

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (500000, 28, 28) (500000,)
Validation set (29000, 28, 28) (29000,)
Test set (18000, 28, 28) (18000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (500000, 784) (500000, 10)
Validation set (29000, 784) (29000, 10)
Test set (18000, 784) (18000, 10)


In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
              / predictions.shape[0])

Evaluate up to this point for all computations. After this point only evaluate the graphs you are interested in re-calculating and then run the relevant training.

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---

#### First we work on logistic regression

 - We use the minibatch implementation from assignment 2.

In [9]:
# Create TensorFlow graph

batch_size = 128
# regularisation constant
gamma = 0.01

graph1 = tf.Graph()
with graph1.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    weights = tf.Variable(
         tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
    
    # tf.reduce_mean because we take the average cross entropy over the batch.
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    # add regularisation to loss
    # notes: regularise both weights and biases
    loss = loss + gamma * (
    tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases)
    )
        
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [6]:
# run tensorFlow graph.

num_steps = 3001

with tf.Session(graph=graph1) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 47.305332
Minibatch accuracy: 14.1%
Validation accuracy: 12.3%
Minibatch loss at step 500: 0.836007
Minibatch accuracy: 80.5%
Validation accuracy: 81.3%
Minibatch loss at step 1000: 0.762268
Minibatch accuracy: 78.1%
Validation accuracy: 81.1%
Minibatch loss at step 1500: 0.754452
Minibatch accuracy: 83.6%
Validation accuracy: 80.3%
Minibatch loss at step 2000: 0.772565
Minibatch accuracy: 82.0%
Validation accuracy: 81.0%
Minibatch loss at step 2500: 0.840660
Minibatch accuracy: 78.9%
Validation accuracy: 79.2%
Minibatch loss at step 3000: 0.795423
Minibatch accuracy: 82.0%
Validation accuracy: 80.6%
Test accuracy: 87.3%


#### Now let's work on a neural network with a hidden layer

- We use the example from assignment 2:

In [11]:
batch_size = 128
hidden_nodes = 1024
# regularisation constant
gamma = 0.01

graph2 = tf.Graph()
with graph2.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    logits = tf.matmul(hidden_layer, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    # add regularisation for all weights.
    loss = loss + gamma * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [8]:
num_steps = 3001

with tf.Session(graph=graph2) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3565.484863
Minibatch accuracy: 14.8%
Validation accuracy: 37.9%
Minibatch loss at step 500: 20.734781
Minibatch accuracy: 81.2%
Validation accuracy: 80.1%
Minibatch loss at step 1000: 0.903361
Minibatch accuracy: 78.9%
Validation accuracy: 80.6%
Minibatch loss at step 1500: 0.827278
Minibatch accuracy: 81.2%
Validation accuracy: 79.4%
Minibatch loss at step 2000: 0.875641
Minibatch accuracy: 79.7%
Validation accuracy: 81.5%
Minibatch loss at step 2500: 0.885272
Minibatch accuracy: 77.3%
Validation accuracy: 78.9%
Minibatch loss at step 3000: 0.859798
Minibatch accuracy: 80.5%
Validation accuracy: 80.5%
Test accuracy: 87.3%


---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

#### First we work on logistic regression

In [10]:
### NOTE: Rerun graph1 build step before running ###

# run tensorFlow graph for logistic regression.

num_steps = 3001

with tf.Session(graph=graph1) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: restrict offset to [1, 500]
        offset = np.random.choice(list(range(1, 501)))
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 50.389458
Minibatch accuracy: 4.7%
Validation accuracy: 11.7%
Minibatch loss at step 500: 0.474692
Minibatch accuracy: 100.0%
Validation accuracy: 77.5%
Minibatch loss at step 1000: 0.319731
Minibatch accuracy: 100.0%
Validation accuracy: 77.3%
Minibatch loss at step 1500: 0.373724
Minibatch accuracy: 97.7%
Validation accuracy: 77.6%
Minibatch loss at step 2000: 0.410751
Minibatch accuracy: 97.7%
Validation accuracy: 77.6%
Minibatch loss at step 2500: 0.295078
Minibatch accuracy: 100.0%
Validation accuracy: 77.7%
Minibatch loss at step 3000: 0.313557
Minibatch accuracy: 99.2%
Validation accuracy: 77.8%
Test accuracy: 84.6%


#### Now let's work on a neural network with a hidden layer

In [12]:
### NOTE: Rerun graph2 build step before running ###
num_steps = 3001

with tf.Session(graph=graph2) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: restrict offset to [1, 500]
        offset = np.random.choice(list(range(1, 501)))
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3682.959961
Minibatch accuracy: 10.9%
Validation accuracy: 38.8%
Minibatch loss at step 500: 20.913164
Minibatch accuracy: 100.0%
Validation accuracy: 75.3%
Minibatch loss at step 1000: 0.465225
Minibatch accuracy: 99.2%
Validation accuracy: 76.5%
Minibatch loss at step 1500: 0.290670
Minibatch accuracy: 100.0%
Validation accuracy: 77.5%
Minibatch loss at step 2000: 0.313743
Minibatch accuracy: 100.0%
Validation accuracy: 77.8%
Minibatch loss at step 2500: 0.317977
Minibatch accuracy: 100.0%
Validation accuracy: 77.6%
Minibatch loss at step 3000: 0.304730
Minibatch accuracy: 100.0%
Validation accuracy: 77.0%
Test accuracy: 84.1%


---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---

#### Introducing dropout for the hidden layer

In [13]:
batch_size = 128
hidden_nodes = 1024

graph3 = tf.Graph()
with graph3.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    hidden_layer_d = tf.nn.dropout(hidden_layer, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer, weights2) + biases2
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer_d, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [14]:
num_steps = 3001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph3) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3939.137451
Minibatch accuracy: 11.7%
Validation accuracy: 36.1%
Minibatch loss at step 500: 22.897213
Minibatch accuracy: 73.4%
Validation accuracy: 70.8%
Minibatch loss at step 1000: 1.046426
Minibatch accuracy: 78.9%
Validation accuracy: 78.3%
Minibatch loss at step 1500: 0.970355
Minibatch accuracy: 75.0%
Validation accuracy: 77.0%
Minibatch loss at step 2000: 1.003586
Minibatch accuracy: 78.9%
Validation accuracy: 79.4%
Minibatch loss at step 2500: 1.014875
Minibatch accuracy: 75.0%
Validation accuracy: 75.7%
Minibatch loss at step 3000: 0.939036
Minibatch accuracy: 77.3%
Validation accuracy: 78.3%
Test accuracy: 84.9%


#### Restricting training data

In [15]:
num_steps = 3001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph3) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: restrict offset to [1, 500]
        offset = np.random.choice(list(range(1, 501)))
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3819.755859
Minibatch accuracy: 12.5%
Validation accuracy: 37.6%
Minibatch loss at step 500: 21.988455
Minibatch accuracy: 100.0%
Validation accuracy: 75.8%
Minibatch loss at step 1000: 0.479301
Minibatch accuracy: 100.0%
Validation accuracy: 77.4%
Minibatch loss at step 1500: 0.384016
Minibatch accuracy: 98.4%
Validation accuracy: 76.9%
Minibatch loss at step 2000: 0.383462
Minibatch accuracy: 99.2%
Validation accuracy: 77.2%
Minibatch loss at step 2500: 0.349100
Minibatch accuracy: 98.4%
Validation accuracy: 77.3%
Minibatch loss at step 3000: 0.338586
Minibatch accuracy: 99.2%
Validation accuracy: 76.9%
Test accuracy: 84.0%


Dropout didn't do much against overfit in this specific case.

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---


### 1. We start by increasing the training steps on the regularised with dropout 1 hidden_layer network.

In [16]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph3) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3703.312500
Minibatch accuracy: 14.1%
Validation accuracy: 38.3%
Minibatch loss at step 500: 22.707386
Minibatch accuracy: 72.7%
Validation accuracy: 75.4%
Minibatch loss at step 1000: 1.014906
Minibatch accuracy: 79.7%
Validation accuracy: 79.1%
Minibatch loss at step 1500: 0.968754
Minibatch accuracy: 75.8%
Validation accuracy: 76.6%
Minibatch loss at step 2000: 1.008845
Minibatch accuracy: 79.7%
Validation accuracy: 79.3%
Minibatch loss at step 2500: 1.029608
Minibatch accuracy: 75.0%
Validation accuracy: 76.0%
Minibatch loss at step 3000: 0.970152
Minibatch accuracy: 78.1%
Validation accuracy: 78.2%
Minibatch loss at step 3500: 1.000505
Minibatch accuracy: 78.9%
Validation accuracy: 79.4%
Minibatch loss at step 4000: 0.640494
Minibatch accuracy: 87.5%
Validation accuracy: 80.5%
Minibatch loss at step 4500: 0.856306
Minibatch accuracy: 80.5%
Validation accuracy: 80.5%
Minibatch loss at step 5000: 1.047778
Minibatch accuracy: 78.9%
Validation acc

Increasing the number of steps only slightly increased performance!

### 2. Let's increase regularisation.

In [17]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.03

with tf.Session(graph=graph3) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 10184.334961
Minibatch accuracy: 8.6%
Validation accuracy: 40.5%
Minibatch loss at step 500: 1.058177
Minibatch accuracy: 75.8%
Validation accuracy: 77.9%
Minibatch loss at step 1000: 1.016949
Minibatch accuracy: 76.6%
Validation accuracy: 79.3%
Minibatch loss at step 1500: 1.112250
Minibatch accuracy: 74.2%
Validation accuracy: 77.2%
Minibatch loss at step 2000: 1.123813
Minibatch accuracy: 77.3%
Validation accuracy: 79.5%
Minibatch loss at step 2500: 1.119868
Minibatch accuracy: 75.8%
Validation accuracy: 76.1%
Minibatch loss at step 3000: 1.053246
Minibatch accuracy: 81.2%
Validation accuracy: 78.1%
Minibatch loss at step 3500: 1.110701
Minibatch accuracy: 77.3%
Validation accuracy: 79.6%
Minibatch loss at step 4000: 0.812795
Minibatch accuracy: 87.5%
Validation accuracy: 79.5%
Minibatch loss at step 4500: 1.022451
Minibatch accuracy: 81.2%
Validation accuracy: 80.1%
Minibatch loss at step 5000: 1.192332
Minibatch accuracy: 76.6%
Validation accu

#### Results:

Increasing regularisation above 0.01 didn't increase performance!

### 3. Let's double the width of the hidden layer:

In [18]:
batch_size = 128
hidden_nodes = 2*1024

graph4 = tf.Graph()
with graph4.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    hidden_layer_d = tf.nn.dropout(hidden_layer, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer, weights2) + biases2
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer_d, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [19]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph4) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 7352.192383
Minibatch accuracy: 10.2%
Validation accuracy: 30.7%
Minibatch loss at step 500: 49.886929
Minibatch accuracy: 70.3%
Validation accuracy: 69.4%
Minibatch loss at step 1000: 1.375002
Minibatch accuracy: 75.0%
Validation accuracy: 74.6%
Minibatch loss at step 1500: 1.038224
Minibatch accuracy: 75.0%
Validation accuracy: 76.6%
Minibatch loss at step 2000: 1.017531
Minibatch accuracy: 76.6%
Validation accuracy: 79.7%
Minibatch loss at step 2500: 1.013178
Minibatch accuracy: 75.8%
Validation accuracy: 75.0%
Minibatch loss at step 3000: 0.978678
Minibatch accuracy: 75.8%
Validation accuracy: 77.4%
Minibatch loss at step 3500: 0.978029
Minibatch accuracy: 78.1%
Validation accuracy: 79.7%
Minibatch loss at step 4000: 0.659742
Minibatch accuracy: 87.5%
Validation accuracy: 80.2%
Minibatch loss at step 4500: 0.861175
Minibatch accuracy: 79.7%
Validation accuracy: 80.5%
Minibatch loss at step 5000: 1.059662
Minibatch accuracy: 79.7%
Validation acc

The accuracy of the network did not significantly increase with the increase of the hidden nodes.

### 4. Let's try 2 hidden layers

In [20]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph5a = tf.Graph()
with graph5a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.matmul(tf_train_dataset, weights1) + biases1
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    hidden_layer2 = tf.matmul(hidden_layer1_d, weights2) + biases2
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) +
        tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    hidden_layer2_val = tf.matmul(hidden_layer1_val, weights2) + biases2
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.matmul(tf_test_dataset, weights1) + biases1
    hidden_layer2_test = tf.matmul(hidden_layer1_test, weights2) + biases2
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [21]:
num_steps = 36001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph5a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 19410.802734
Minibatch accuracy: 10.2%
Validation accuracy: 14.7%
Minibatch loss at step 2000: 4462.788086
Minibatch accuracy: 74.2%
Validation accuracy: 78.4%
Minibatch loss at step 4000: 4186.888672
Minibatch accuracy: 79.7%
Validation accuracy: 78.7%
Minibatch loss at step 6000: 4059.929688
Minibatch accuracy: 78.9%
Validation accuracy: 79.5%
Minibatch loss at step 8000: 3852.190674
Minibatch accuracy: 75.0%
Validation accuracy: 79.5%
Minibatch loss at step 10000: 3683.679443
Minibatch accuracy: 80.5%
Validation accuracy: 79.4%
Minibatch loss at step 12000: 3528.583984
Minibatch accuracy: 78.9%
Validation accuracy: 78.3%
Minibatch loss at step 14000: 3403.141113
Minibatch accuracy: 71.9%
Validation accuracy: 78.4%
Minibatch loss at step 16000: 3244.826904
Minibatch accuracy: 76.6%
Validation accuracy: 79.5%
Minibatch loss at step 18000: 3101.539307
Minibatch accuracy: 75.0%
Validation accuracy: 79.7%
Minibatch loss at step 20000: 2992.891357
Min

The network accuracy increased slightly.

### 5. Let's add relu actication functions

In [22]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph6a = tf.Graph()
with graph6a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.003).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [23]:
num_steps = 36001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph6a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 10236.666016
Minibatch accuracy: 4.7%
Validation accuracy: 21.2%
Minibatch loss at step 2000: 3910.544678
Minibatch accuracy: 60.2%
Validation accuracy: 62.1%
Minibatch loss at step 4000: 3462.163818
Minibatch accuracy: 67.2%
Validation accuracy: 60.7%
Minibatch loss at step 6000: 3070.966309
Minibatch accuracy: 61.7%
Validation accuracy: 60.9%
Minibatch loss at step 8000: 2724.812500
Minibatch accuracy: 60.2%
Validation accuracy: 61.6%
Minibatch loss at step 10000: 2415.524658
Minibatch accuracy: 60.2%
Validation accuracy: 65.7%
Minibatch loss at step 12000: 2141.513428
Minibatch accuracy: 70.3%
Validation accuracy: 67.2%
Minibatch loss at step 14000: 1899.520630
Minibatch accuracy: 68.8%
Validation accuracy: 69.3%
Minibatch loss at step 16000: 1685.882568
Minibatch accuracy: 73.4%
Validation accuracy: 71.4%
Minibatch loss at step 18000: 1494.413208
Minibatch accuracy: 74.2%
Validation accuracy: 71.9%
Minibatch loss at step 20000: 1325.702759
Mini

Relu activation functions didn't change performance.

### 6. Try again without dropout !?

In [24]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph6b = tf.Graph()
with graph6b.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.003).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [25]:
num_steps = 24001
# dropout layer keep probability
keep_probl = 0.05 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph6b) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 8685.554688
Minibatch accuracy: 4.7%
Validation accuracy: 14.6%
Minibatch loss at step 2000: 3947.577393
Minibatch accuracy: 71.9%
Validation accuracy: 75.7%
Minibatch loss at step 4000: 3472.562744
Minibatch accuracy: 82.0%
Validation accuracy: 73.9%
Minibatch loss at step 6000: 3076.671875
Minibatch accuracy: 78.9%
Validation accuracy: 71.4%
Minibatch loss at step 8000: 2728.282715
Minibatch accuracy: 75.8%
Validation accuracy: 72.2%
Minibatch loss at step 10000: 2420.081055
Minibatch accuracy: 66.4%
Validation accuracy: 72.9%
Minibatch loss at step 12000: 2145.671387
Minibatch accuracy: 71.1%
Validation accuracy: 74.7%
Minibatch loss at step 14000: 1902.915771
Minibatch accuracy: 77.3%
Validation accuracy: 76.3%
Minibatch loss at step 16000: 1687.582886
Minibatch accuracy: 78.1%
Validation accuracy: 76.0%
Minibatch loss at step 18000: 1496.763306
Minibatch accuracy: 82.0%
Validation accuracy: 77.1%
Minibatch loss at step 20000: 1327.720581
Minib

Performance got marginally worse!

### 7. Let's try using a variable learning rate !

In [26]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph7a = tf.Graph()
with graph7a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [27]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.05 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph7a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 9047.990234
Minibatch accuracy: 8.6%
Validation accuracy: 20.2%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 1044.396362
Minibatch accuracy: 84.4%
Validation accuracy: 71.0%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 324.970398
Minibatch accuracy: 82.8%
Validation accuracy: 79.7%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 126.331551
Minibatch accuracy: 85.2%
Validation accuracy: 82.4%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 59.028793
Minibatch accuracy: 84.4%
Validation accuracy: 83.9%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 32.146633
Minibatch accuracy: 82.0%
Validation accuracy: 84.7%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 19.722095
Minibatch accuracy: 80.5%
Validation accuracy: 85.0%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 13.2

As we can see an exponential learning rate significantly increased our results!!!

### 8. Let's introduce dropout

In [28]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph7b = tf.Graph()
with graph7b.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [29]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph7b) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 11545.306641
Minibatch accuracy: 9.4%
Validation accuracy: 18.8%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 1039.421753
Minibatch accuracy: 44.5%
Validation accuracy: 25.1%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 323.846863
Minibatch accuracy: 61.7%
Validation accuracy: 69.0%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 125.893112
Minibatch accuracy: 79.7%
Validation accuracy: 78.0%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 58.889122
Minibatch accuracy: 82.8%
Validation accuracy: 80.7%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 32.254601
Minibatch accuracy: 75.0%
Validation accuracy: 82.0%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 19.747429
Minibatch accuracy: 76.6%
Validation accuracy: 82.8%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 13.

Slightly worse performance compared to without dropout

### 9. Let's add a 3rd hidden layer (relu without dropout)

In [30]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64

graph8a = tf.Graph()
with graph8a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rdd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the output layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, num_labels], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    # keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2, weights3) + biases3)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer3, weights4) + biases4
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    logits_val = tf.matmul(hidden_layer3_val, weights4) + biases4
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    logits_test = tf.matmul(hidden_layer3_test, weights4) + biases4
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [31]:
num_steps = 64001
# dropout layer keep probability
#keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph8a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 44.204887
Minibatch accuracy: 14.1%
Validation accuracy: 17.8%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 10.375601
Minibatch accuracy: 90.6%
Validation accuracy: 85.0%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 3.801160
Minibatch accuracy: 86.7%
Validation accuracy: 85.2%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 1.837403
Minibatch accuracy: 88.3%
Validation accuracy: 85.3%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.301690
Minibatch accuracy: 83.6%
Validation accuracy: 85.4%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.164474
Minibatch accuracy: 85.2%
Validation accuracy: 85.4%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.032391
Minibatch accuracy: 82.8%
Validation accuracy: 85.5%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 0.874629
Minib

We observe a small improvement

### 10. Let's use 4 hidden layers (relu without dropout).

In [32]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph9a = tf.Graph()
with graph9a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    # keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2, weights3) + biases3)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3, weights4) + biases4)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [33]:
num_steps = 48001
# dropout layer keep probability
#keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph9a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 60.033215
Minibatch accuracy: 14.8%
Validation accuracy: 15.9%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 13.990759
Minibatch accuracy: 92.2%
Validation accuracy: 85.3%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 4.937146
Minibatch accuracy: 87.5%
Validation accuracy: 85.5%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 2.302892
Minibatch accuracy: 88.3%
Validation accuracy: 85.7%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.538446
Minibatch accuracy: 84.4%
Validation accuracy: 85.7%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.287957
Minibatch accuracy: 85.2%
Validation accuracy: 85.8%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.113971
Minibatch accuracy: 83.6%
Validation accuracy: 85.9%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 0.955568
Minib

**Our best result so far !!!**

In [34]:
# let's use some diferent parameters

num_steps = 48001
# dropout layer keep probability
#keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.02
# learning rate (initial)
learning_rate_i = 0.01

with tf.Session(graph=graph9a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 116.976448
Minibatch accuracy: 7.8%
Validation accuracy: 11.3%
Current learning rate: 0.00999947264790535
Minibatch loss at step 4000: 27.595068
Minibatch accuracy: 89.8%
Validation accuracy: 84.0%
Current learning rate: 0.008099572733044624
Minibatch loss at step 8000: 9.375543
Minibatch accuracy: 83.6%
Validation accuracy: 84.0%
Current learning rate: 0.006560653448104858
Minibatch loss at step 12000: 4.188735
Minibatch accuracy: 85.9%
Validation accuracy: 83.9%
Current learning rate: 0.005314128939062357
Minibatch loss at step 16000: 2.550613
Minibatch accuracy: 82.0%
Validation accuracy: 83.8%
Current learning rate: 0.004304444417357445
Minibatch loss at step 20000: 1.989176
Minibatch accuracy: 82.0%
Validation accuracy: 83.8%
Current learning rate: 0.003486599773168564
Minibatch loss at step 24000: 1.674007
Minibatch accuracy: 79.7%
Validation accuracy: 83.8%
Current learning rate: 0.0028241456020623446
Minibatch loss at step 28000: 1.382593
M

Higher regularisation and reduced learning rate didn't help

### 11. Let's try 4 hidden layers with dropout

In [35]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph10a = tf.Graph()
with graph10a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [36]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph10a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 65.865868
Minibatch accuracy: 11.7%
Validation accuracy: 13.3%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 14.213797
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 5.097067
Minibatch accuracy: 83.6%
Validation accuracy: 83.5%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 2.446087
Minibatch accuracy: 86.7%
Validation accuracy: 84.1%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.634861
Minibatch accuracy: 83.6%
Validation accuracy: 84.3%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.434211
Minibatch accuracy: 82.8%
Validation accuracy: 84.6%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.275495
Minibatch accuracy: 80.5%
Validation accuracy: 84.7%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 1.117421
Minib

In [37]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.05

with tf.Session(graph=graph10a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 65.649063
Minibatch accuracy: 15.6%
Validation accuracy: 10.9%
Current learning rate: 0.04999736696481705
Minibatch loss at step 4000: 2.216931
Minibatch accuracy: 91.4%
Validation accuracy: 83.9%
Current learning rate: 0.04049786552786827
Minibatch loss at step 8000: 0.997186
Minibatch accuracy: 85.2%
Validation accuracy: 84.8%
Current learning rate: 0.03280326724052429
Minibatch loss at step 12000: 0.799263
Minibatch accuracy: 87.5%
Validation accuracy: 85.2%
Current learning rate: 0.026570646092295647
Minibatch loss at step 16000: 0.902531
Minibatch accuracy: 82.8%
Validation accuracy: 85.3%
Current learning rate: 0.021522222086787224
Minibatch loss at step 20000: 1.065320
Minibatch accuracy: 82.0%
Validation accuracy: 85.5%
Current learning rate: 0.01743300072848797
Minibatch loss at step 24000: 1.002119
Minibatch accuracy: 82.8%
Validation accuracy: 85.6%
Current learning rate: 0.014120727777481079
Minibatch loss at step 28000: 0.959251
Miniba

In [38]:
# let's decrease exponential decay

batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph10b = tf.Graph()
with graph10b.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.95)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [39]:
num_steps = 80001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.05

with tf.Session(graph=graph10b) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 67.260567
Minibatch accuracy: 5.5%
Validation accuracy: 13.3%
Current learning rate: 0.04999871551990509
Minibatch loss at step 4000: 2.019092
Minibatch accuracy: 88.3%
Validation accuracy: 84.0%
Current learning rate: 0.04512384161353111
Minibatch loss at step 8000: 0.956087
Minibatch accuracy: 83.6%
Validation accuracy: 84.9%
Current learning rate: 0.040724266320466995
Minibatch loss at step 12000: 0.727410
Minibatch accuracy: 86.7%
Validation accuracy: 85.3%
Current learning rate: 0.03675365075469017
Minibatch loss at step 16000: 0.878806
Minibatch accuracy: 85.9%
Validation accuracy: 85.3%
Current learning rate: 0.03317016735672951
Minibatch loss at step 20000: 1.062245
Minibatch accuracy: 82.8%
Validation accuracy: 85.5%
Current learning rate: 0.02993607521057129
Minibatch loss at step 24000: 0.980987
Minibatch accuracy: 83.6%
Validation accuracy: 85.6%
Current learning rate: 0.02701730839908123
Minibatch loss at step 28000: 0.932819
Minibatch

Accuracy hasn't increased more with further training but it is on par with our best results!

### 12. Let's try momentum optimiser

In [40]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph11a = tf.Graph()
with graph11a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    #flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.95)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(ilrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [41]:
num_steps = 80001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial) - calculate within loop
#learning_rate_i = 0.5

with tf.Session(graph=graph11a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Pre calculate from wolfram alpha
        # https://www.wolframalpha.com/input/?i=plot+(0.1)*(x%2F2000)%5E2*e%5E(-x%2F7000)+%7Bx,0,80000%7D
        learning_rate_i = 0.01 * ((step/2000)**2)*np.exp(-step/7000)
        #print(learning_rate_i)
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(ilrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 65.677353
Minibatch accuracy: 11.7%
Validation accuracy: 14.0%
Current learning rate: 0.0
Minibatch loss at step 4000: 0.885373
Minibatch accuracy: 86.7%
Validation accuracy: 83.3%
Current learning rate: 0.02258872427046299
Minibatch loss at step 8000: 1.051952
Minibatch accuracy: 82.8%
Validation accuracy: 82.3%
Current learning rate: 0.05102504789829254
Minibatch loss at step 12000: 1.017034
Minibatch accuracy: 82.0%
Validation accuracy: 82.9%
Current learning rate: 0.06483323127031326
Minibatch loss at step 16000: 1.055306
Minibatch accuracy: 80.5%
Validation accuracy: 83.2%
Current learning rate: 0.06508889049291611
Minibatch loss at step 20000: 1.249661
Minibatch accuracy: 76.6%
Validation accuracy: 83.1%
Current learning rate: 0.057432617992162704
Minibatch loss at step 24000: 1.143483
Minibatch accuracy: 78.9%
Validation accuracy: 83.0%
Current learning rate: 0.04670386761426926
Minibatch loss at step 28000: 1.055857
Minibatch accuracy: 80.5

Good performance, close to our best results!

### 13 Let's try some external examples:

In [42]:
# taken from
# https://github.com/rndbrtrnd/udacity-deep-learning/blob/master/3_regularization.ipynb

batch_size = 128
num_hidden_nodes1 = 1024
num_hidden_nodes2 = 100
beta_regul = 1e-3

graphex1 = tf.Graph()
with graphex1.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  global_step = tf.Variable(0)

  # Variables.
  weights1 = tf.Variable(
    tf.truncated_normal(
        [image_size * image_size, num_hidden_nodes1],
        stddev=np.sqrt(2.0 / (image_size * image_size)))
    )
  biases1 = tf.Variable(tf.zeros([num_hidden_nodes1]))
  weights2 = tf.Variable(
    tf.truncated_normal([num_hidden_nodes1, num_hidden_nodes2], stddev=np.sqrt(2.0 / num_hidden_nodes1)))
  biases2 = tf.Variable(tf.zeros([num_hidden_nodes2]))
  weights3 = tf.Variable(
    tf.truncated_normal([num_hidden_nodes2, num_labels], stddev=np.sqrt(2.0 / num_hidden_nodes2)))
  biases3 = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  lay1_train = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
  lay2_train = tf.nn.relu(tf.matmul(lay1_train, weights2) + biases2)
  logits = tf.matmul(lay2_train, weights3) + biases3
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels)) + \
      beta_regul * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
  
  # Optimizer.
  learning_rate = tf.train.exponential_decay(0.5, global_step, 1000, 0.65, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  lay1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
  lay2_valid = tf.nn.relu(tf.matmul(lay1_valid, weights2) + biases2)
  valid_prediction = tf.nn.softmax(tf.matmul(lay2_valid, weights3) + biases3)
  lay1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
  lay2_test = tf.nn.relu(tf.matmul(lay1_test, weights2) + biases2)
  test_prediction = tf.nn.softmax(tf.matmul(lay2_test, weights3) + biases3)

In [43]:
num_steps = 9001

with tf.Session(graph=graphex1) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3.209594
Minibatch accuracy: 14.1%
Validation accuracy: 27.9%
Minibatch loss at step 500: 0.963903
Minibatch accuracy: 85.9%
Validation accuracy: 85.9%
Minibatch loss at step 1000: 0.769660
Minibatch accuracy: 88.3%
Validation accuracy: 86.9%
Minibatch loss at step 1500: 0.616891
Minibatch accuracy: 89.8%
Validation accuracy: 87.3%
Minibatch loss at step 2000: 0.713609
Minibatch accuracy: 85.9%
Validation accuracy: 88.1%
Minibatch loss at step 2500: 0.586468
Minibatch accuracy: 88.3%
Validation accuracy: 88.4%
Minibatch loss at step 3000: 0.552738
Minibatch accuracy: 89.1%
Validation accuracy: 88.9%
Minibatch loss at step 3500: 0.548657
Minibatch accuracy: 89.1%
Validation accuracy: 89.3%
Minibatch loss at step 4000: 0.373092
Minibatch accuracy: 94.5%
Validation accuracy: 89.4%
Minibatch loss at step 4500: 0.437540
Minibatch accuracy: 92.2%
Validation accuracy: 89.7%
Minibatch loss at step 5000: 0.527299
Minibatch accuracy: 88.3%
Validation accurac

This accuracy beats anything we have done up to now - with a lot less complexity !!!

 - Let's try to replicate it!
 
 
 
 ### 14 Let's re-build a 2 hidden layer network.

In [44]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 128

graph6c = tf.Graph()
with graph6c.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.random_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 7000, 0.65)
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)
    
    #optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        #loss, global_step=gstep)
    

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [45]:
num_steps = 20001
# dropout layer keep probability - not used in this computation
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.001
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph6c) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 15.640331
Minibatch accuracy: 8.6%
Validation accuracy: 38.7%
Current learning rate: 0.019998770207166672
Minibatch loss at step 2000: 1.507531
Minibatch accuracy: 85.9%
Validation accuracy: 86.9%
Current learning rate: 0.0176827535033226
Minibatch loss at step 4000: 0.779524
Minibatch accuracy: 93.0%
Validation accuracy: 88.5%
Current learning rate: 0.01563495211303234
Minibatch loss at step 6000: 0.646230
Minibatch accuracy: 93.8%
Validation accuracy: 89.1%
Current learning rate: 0.013824302703142166
Minibatch loss at step 8000: 0.559870
Minibatch accuracy: 91.4%
Validation accuracy: 89.7%
Current learning rate: 0.012223341502249241
Minibatch loss at step 10000: 0.482934
Minibatch accuracy: 92.2%
Validation accuracy: 90.0%
Current learning rate: 0.010807783342897892
Minibatch loss at step 12000: 0.373444
Minibatch accuracy: 92.2%
Validation accuracy: 90.1%
Current learning rate: 0.00955615658313036
Minibatch loss at step 14000: 0.499577
Minibatch

It appears our problem was a big regularisation rate!

Momentum optimiser also helps !

### 15. Let's go back at 4 layers

In [46]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64
hidden_nodes4 = 16

graph11b = tf.Graph()
with graph11b.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [47]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.8 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.00001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.02

with tf.Session(graph=graph11b) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.

Initialized
Minibatch loss at step 0: 2.387257
Minibatch accuracy: 4.7%
Validation accuracy: 9.1%
Current learning rate: 0.019999278709292412
Minibatch loss at step 4000: 0.277373
Minibatch accuracy: 95.3%
Validation accuracy: 88.4%
Current learning rate: 0.01731988601386547
Minibatch loss at step 8000: 0.375719
Minibatch accuracy: 89.8%
Validation accuracy: 89.5%
Current learning rate: 0.014999459497630596
Minibatch loss at step 12000: 0.252970
Minibatch accuracy: 93.8%
Validation accuracy: 90.1%
Current learning rate: 0.012989913113415241
Minibatch loss at step 16000: 0.368733
Minibatch accuracy: 91.4%
Validation accuracy: 90.5%
Current learning rate: 0.011249594390392303
Minibatch loss at step 20000: 0.353610
Minibatch accuracy: 89.8%
Validation accuracy: 90.9%
Current learning rate: 0.009742435067892075
Minibatch loss at step 24000: 0.288595
Minibatch accuracy: 91.4%
Validation accuracy: 91.1%
Current learning rate: 0.008437196724116802
Minibatch loss at step 28000: 0.291545
Miniba

**Our best result so far!!**

Let's experiment some more!

In [48]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64
hidden_nodes4 = 16

graph11c = tf.Graph()
with graph11c.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [49]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.01

with tf.Session(graph=graph11c) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.507250
Minibatch accuracy: 7.8%
Validation accuracy: 12.5%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.234791
Minibatch accuracy: 95.3%
Validation accuracy: 88.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.326435
Minibatch accuracy: 91.4%
Validation accuracy: 90.0%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.240333
Minibatch accuracy: 90.6%
Validation accuracy: 90.5%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.283140
Minibatch accuracy: 89.1%
Validation accuracy: 90.9%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.263084
Minibatch accuracy: 90.6%
Validation accuracy: 91.1%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.279186
Minibatch accuracy: 90.6%
Validation accuracy: 91.4%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.214593
Mini

A new record in our accuracy scores!

In [50]:
batch_size = 128
hidden_nodes1 = 4096
hidden_nodes2 = 1024
hidden_nodes3 = 256
hidden_nodes4 = 64

graph11d = tf.Graph()
with graph11d.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [51]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.01

with tf.Session(graph=graph11d) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.365612
Minibatch accuracy: 11.7%
Validation accuracy: 17.8%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.181067
Minibatch accuracy: 95.3%
Validation accuracy: 89.6%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.263961
Minibatch accuracy: 91.4%
Validation accuracy: 90.9%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.162937
Minibatch accuracy: 95.3%
Validation accuracy: 91.5%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.224015
Minibatch accuracy: 93.8%
Validation accuracy: 91.9%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.235155
Minibatch accuracy: 93.0%
Validation accuracy: 92.2%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.205722
Minibatch accuracy: 92.2%
Validation accuracy: 92.4%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.145824
Min

**Our best result yet !!**

### 16 ...

In [52]:
batch_size = 128
hidden_nodes1 = 784
hidden_nodes2 = 1568
hidden_nodes3 = 500
hidden_nodes4 = 50

graph11e = tf.Graph()
with graph11e.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [53]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.01

with tf.Session(graph=graph11e) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.315355
Minibatch accuracy: 14.1%
Validation accuracy: 11.8%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.233346
Minibatch accuracy: 93.0%
Validation accuracy: 89.0%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.303109
Minibatch accuracy: 92.2%
Validation accuracy: 90.3%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.167816
Minibatch accuracy: 94.5%
Validation accuracy: 90.9%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.285912
Minibatch accuracy: 91.4%
Validation accuracy: 91.4%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.231068
Minibatch accuracy: 92.2%
Validation accuracy: 91.6%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.251174
Minibatch accuracy: 93.0%
Validation accuracy: 91.8%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.237505
Min

Close to our best performance!

Let's try some more:

In [54]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 500
hidden_nodes4 = 50

graph11f = tf.Graph()
with graph11f.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [55]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.01

with tf.Session(graph=graph11f) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.503897
Minibatch accuracy: 5.5%
Validation accuracy: 8.9%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.161727
Minibatch accuracy: 95.3%
Validation accuracy: 89.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.293331
Minibatch accuracy: 92.2%
Validation accuracy: 90.7%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.194031
Minibatch accuracy: 94.5%
Validation accuracy: 91.3%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.245192
Minibatch accuracy: 94.5%
Validation accuracy: 91.8%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.228671
Minibatch accuracy: 93.8%
Validation accuracy: 92.0%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.188091
Minibatch accuracy: 95.3%
Validation accuracy: 92.3%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.178602
Minib

Re-achieving our top performance

Let's try some more:

In [56]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 1000
hidden_nodes4 = 100

graph11g = tf.Graph()
with graph11g.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
    tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [57]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.01

with tf.Session(graph=graph11g) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.480237
Minibatch accuracy: 7.0%
Validation accuracy: 12.3%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.219135
Minibatch accuracy: 92.2%
Validation accuracy: 89.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.286077
Minibatch accuracy: 92.2%
Validation accuracy: 90.9%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.158314
Minibatch accuracy: 94.5%
Validation accuracy: 91.4%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.247719
Minibatch accuracy: 93.8%
Validation accuracy: 91.9%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.189712
Minibatch accuracy: 95.3%
Validation accuracy: 92.1%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.205624
Minibatch accuracy: 93.8%
Validation accuracy: 92.3%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.143715
Mini

Our new high score !!

### 17 Let's try to change the regularisation weights:

In [58]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 500
hidden_nodes4 = 50

graph12a = tf.Graph()
with graph12a.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
    tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
    tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        (tf.nn.l2_loss(weights1)/(image_size * image_size * hidden_nodes1)) 
        + (tf.nn.l2_loss(weights2)/(hidden_nodes1 * hidden_nodes2))
        + (tf.nn.l2_loss(weights3)/(hidden_nodes2 * hidden_nodes3))
        + (tf.nn.l2_loss(weights4)/(hidden_nodes3 * hidden_nodes4))
        + (tf.nn.l2_loss(weights5)/(hidden_nodes4 * num_labels)))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [59]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.8 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.00001
# learning rate (initial) - calculate within loop
learning_rate_i = 0.008

with tf.Session(graph=graph12a) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.455624
Minibatch accuracy: 7.8%
Validation accuracy: 12.7%
Current learning rate: 0.007999712601304054
Minibatch loss at step 4000: 0.306253
Minibatch accuracy: 91.4%
Validation accuracy: 88.2%
Current learning rate: 0.006927954498678446
Minibatch loss at step 8000: 0.333061
Minibatch accuracy: 89.8%
Validation accuracy: 89.9%
Current learning rate: 0.005999784450978041
Minibatch loss at step 12000: 0.216465
Minibatch accuracy: 93.8%
Validation accuracy: 90.6%
Current learning rate: 0.005195965524762869
Minibatch loss at step 16000: 0.308920
Minibatch accuracy: 89.8%
Validation accuracy: 91.1%
Current learning rate: 0.004499838221818209
Minibatch loss at step 20000: 0.314924
Minibatch accuracy: 89.8%
Validation accuracy: 91.4%
Current learning rate: 0.0038969742599874735
Minibatch loss at step 24000: 0.263867
Minibatch accuracy: 89.8%
Validation accuracy: 91.7%
Current learning rate: 0.0033748787827789783
Minibatch loss at step 28000: 0.265719
Mi

Close to top notch performance !!

Note: Need to verify from theory standpoint if this approach has impact - and what kind of impact?