Machine Learning and Deep Learning with TensorFlow
=============

Regularization, dropout, and more layers
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.
The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later.
# Make sure you can import them before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First, reload the data we generated in _1_notmnist.ipynb_.

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
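
To see what the broadcasting comparison inside `reformat` does, here is a small sketch on three made-up labels. The array `toy_labels` is hypothetical and is not part of the assignment data.

```python
import numpy as np

num_labels = 10

# Hypothetical integer class labels, standing in for train_labels
# before reformatting.
toy_labels = np.array([0, 2, 9])

# toy_labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (10,).
# Broadcasting the == comparison yields a (3, 10) boolean matrix with a single
# True per row, in the column whose index equals the label.
one_hot = (np.arange(num_labels) == toy_labels[:, None]).astype(np.float32)

print(one_hot.shape)           # (3, 10)
print(one_hot.argmax(axis=1))  # recovers the original integer labels
```

Taking `argmax` along axis 1 inverts the encoding, which is what the `accuracy` helper relies on.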


In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

def weight_variable(shape, stddev, name):
    initial = tf.truncated_normal(stddev=stddev, shape=shape)
    return tf.Variable(initial, name = name)

def bias_variable(shape, bias_init, name):
    initial = tf.constant(bias_init, shape=shape)
    return tf.Variable(initial, name = name)
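
As a quick sanity check of `accuracy`, the sketch below applies the same formula to made-up predictions. Both toy arrays are hypothetical. Three of the four rows put the highest score on the correct class, so the result is 75%.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose highest-scoring class matches the 1-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Hypothetical softmax outputs for 4 examples and 3 classes.
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
# Hypothetical 1-hot labels; the last example is misclassified.
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]], dtype=np.float32)

print(accuracy(toy_predictions, toy_labels))  # 75.0
```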

---
Part 1
---------

Let's introduce and tune L2 regularization for both the logistic regression and neural network models. Recall that L2 regularization amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `tf.nn.l2_loss(t)`. The right amount of regularization should improve your validation and test accuracy.

We will also introduce dropout on the hidden layer of the neural network. Dropout should only be applied during training, not during evaluation; otherwise your evaluation results would be stochastic as well. TensorFlow provides `tf.nn.dropout()` for that, but you have to make sure it is only inserted during training.
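
To make the penalty concrete, here is a minimal NumPy sketch of the quantity that `tf.nn.l2_loss` computes (half the sum of squared entries) and of how it is added to a data loss. The weight matrix and the data loss value are made up for illustration.

```python
import numpy as np

def l2_loss(t):
    # Same quantity as tf.nn.l2_loss: half the sum of the squared entries.
    return np.sum(np.square(t)) / 2.0

# Hypothetical weight matrix and data loss, for illustration only.
toy_weights = np.array([[1.0, -2.0],
                        [0.5,  0.0]])
data_loss = 0.40
beta = 0.001  # regularization strength

penalty = l2_loss(toy_weights)         # (1 + 4 + 0.25 + 0) / 2 = 2.625
loss_reg = data_loss + beta * penalty  # 0.40 + 0.001 * 2.625 = 0.402625

print(penalty, loss_reg)
```

Because the gradient of `beta * l2_loss(w)` with respect to `w` is `beta * w`, every gradient step also shrinks the weights slightly toward zero.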

---

In [16]:
# Some constants used throughout the model
batch_size = 128
hidden_nodes = 1024
bias_init = 0.1
norm_init = 0.1
learning_rate = 0.3
beta = 0.001

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables initialization
    weights_logit = weight_variable([image_size * image_size, hidden_nodes], norm_init, name = "weights_logit")
    weights_linear = weight_variable([hidden_nodes, num_labels], norm_init, name = "weights_linear")
    biases_logit = bias_variable([hidden_nodes], bias_init, name = "biases_logit")
    biases_linear = bias_variable([num_labels], bias_init, name = "biases_linear")

    # Training computation
    logits = tf.matmul(tf_train_dataset, weights_logit) + biases_logit
    hidden_outs = tf.nn.relu(logits)
    
    # Adding dropout to the hidden layer
    keep_prob = tf.placeholder(tf.float32)
    hidden_drop = tf.nn.dropout(hidden_outs, keep_prob)
    
    linear = tf.matmul(hidden_drop, weights_linear) + biases_linear
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=linear, labels=tf_train_labels))
    
    # L2 regularization
    loss_reg = loss + beta*(tf.nn.l2_loss(weights_logit) + tf.nn.l2_loss(weights_linear))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss_reg)

    # Predictions for the training, validation, and test data
    train_prediction = tf.nn.softmax(linear)
    valid_prediction = tf.nn.softmax(
        tf.matmul(
            tf.nn.relu(tf.matmul(tf_valid_dataset, weights_logit) + biases_logit), weights_linear) + biases_linear) 
    test_prediction = tf.nn.softmax(
        tf.matmul(
            tf.nn.relu(tf.matmul(tf_test_dataset, weights_logit) + biases_logit), weights_linear) + biases_linear)
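
The behaviour of `tf.nn.dropout` can be sketched in NumPy as follows: each element is kept with probability `keep_prob` and the survivors are scaled by `1 / keep_prob`. The helper below is an illustrative re-implementation, not TensorFlow's code, and the input array is made up.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    # Keep each element with probability keep_prob and scale survivors by
    # 1 / keep_prob, so the expected value of every element is unchanged.
    mask = rng.random_sample(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)      # fixed seed for a repeatable illustration
activations = np.ones((1000, 100))  # hypothetical hidden-layer outputs

dropped = dropout(activations, keep_prob=0.5, rng=rng)
print(np.unique(dropped))  # only zeros and survivors scaled to 2.0
print(dropped.mean())      # close to 1.0, the original mean

# With keep_prob = 1.0 nothing is dropped, which is what evaluation needs.
same = dropout(activations, keep_prob=1.0, rng=rng)
print(np.array_equal(same, activations))
```

The rescaling is what lets the validation and test predictions reuse the trained weights without any dropout node and without further correction.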

In [17]:
num_steps = 3001
step = 0
prob = 0.5

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : prob}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 6.659443
Minibatch accuracy: 3.1%
Validation accuracy: 26.6%
Test accuracy: 28.4%
Minibatch loss at step 500: 0.529406
Minibatch accuracy: 83.6%
Validation accuracy: 84.6%
Test accuracy: 91.4%
Minibatch loss at step 1000: 0.615670
Minibatch accuracy: 85.2%
Validation accuracy: 85.4%
Test accuracy: 92.2%
Minibatch loss at step 1500: 0.337439
Minibatch accuracy: 89.8%
Validation accuracy: 86.2%
Test accuracy: 93.0%
Minibatch loss at step 2000: 0.362673
Minibatch accuracy: 93.0%
Validation accuracy: 86.7%
Test accuracy: 93.1%
Minibatch loss at step 2500: 0.440786
Minibatch accuracy: 85.2%
Validation accuracy: 86.9%
Test accuracy: 93.5%
Minibatch loss at step 3000: 0.472700
Minibatch accuracy: 86.7%
Validation accuracy: 87.0%
Test accuracy: 93.3%
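
The offset arithmetic in the training loop is easy to misread, so here is a minimal pure-Python sketch of it. The helper name `batch_offset` is made up for this example; the constants match the ones used above.

```python
batch_size = 128
num_examples = 200000  # size of the training set loaded above

def batch_offset(step):
    # Same formula as in the training loop: slide a window of batch_size
    # examples through the data and wrap around before running off the end.
    return (step * batch_size) % (num_examples - batch_size)

for step in [0, 1, 2, 1561, 1562, 3000]:
    offset = batch_offset(step)
    # Print the step and the half-open slice [offset, offset + batch_size).
    print(step, offset, offset + batch_size)
```

With 3001 steps of 128 examples each, the loop consumes 384,128 examples, which is a little under two passes over the 200,000 training examples.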


So, my accuracy on the test set hovered around 93% using L2 regularization and dropout on the hidden layer. I decreased the learning rate from 0.5 to 0.1, but that only marginally improved my results. It is now time to add layers, exponential decay of the learning rate, and better weight initialization.
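
As an illustration of why weight initialization matters once more ReLU layers are stacked, the NumPy sketch below pushes random inputs through several layers and compares a fixed standard deviation of 0.1 with a standard deviation of `sqrt(2 / fan_in)`, a scaling often called He initialization. The layer widths, the helper name `forward_rms`, and the use of plain normal samples instead of TensorFlow's truncated normal are all assumptions made for this example.

```python
import numpy as np

def forward_rms(stddev_fn, layer_sizes, rng):
    # Push random inputs through a stack of ReLU layers (no biases) and
    # return the root-mean-square activation after each layer.
    x = rng.randn(256, layer_sizes[0])
    rms = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.randn(fan_in, fan_out) * stddev_fn(fan_in)
        x = np.maximum(x.dot(w), 0.0)
        rms.append(float(np.sqrt(np.mean(x ** 2))))
    return rms

layer_sizes = [784, 1024, 512, 256, 128]  # illustrative widths
rng = np.random.RandomState(0)

fixed = forward_rms(lambda fan_in: 0.1, layer_sizes, rng)
scaled = forward_rms(lambda fan_in: np.sqrt(2.0 / fan_in), layer_sizes, rng)

print("stddev = 0.1           :", np.round(fixed, 2))
print("stddev = sqrt(2/fan_in):", np.round(scaled, 2))
```

With the fixed standard deviation, the activation scale grows from layer to layer, while the `sqrt(2 / fan_in)` scaling keeps it roughly constant. That makes the softmax inputs, and therefore the initial loss, better behaved in a deep network.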

---
Part 2
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).
 
---


In [65]:
# Some constants used throughout the model
batch_size = 128
h1_nodes = 1024
h2_nodes = 512
h3_nodes = 256
h4_nodes = 128
bias_init = 0.001
init_learning_rate = 0.5
decay_rate = 0.8
beta = 0.001
momentum = 0.1

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    global_step = tf.Variable(0)

    # Variable initialization with He-style scaling: stddev = sqrt(2 / fan_in)
    # Hidden layers
    weights_h1 = tf.Variable(tf.truncated_normal([image_size*image_size, h1_nodes],
                                               stddev=np.sqrt(2.0 / (image_size*image_size))))
    weights_h2 = tf.Variable(tf.truncated_normal([h1_nodes, h2_nodes], 
                                               stddev=np.sqrt(2.0 / h1_nodes)))
    weights_h3 = tf.Variable(tf.truncated_normal([h2_nodes, h3_nodes], 
                                               stddev=np.sqrt(2.0 / h2_nodes)))
    weights_h4 = tf.Variable(tf.truncated_normal([h3_nodes, h4_nodes], 
                                               stddev=np.sqrt(2.0 / h3_nodes)))    
    biases_h1 = tf.Variable(tf.constant(bias_init, shape=[h1_nodes]))
    biases_h2 = tf.Variable(tf.constant(bias_init, shape=[h2_nodes]))
    biases_h3 = tf.Variable(tf.constant(bias_init, shape=[h3_nodes]))
    biases_h4 = tf.Variable(tf.constant(bias_init, shape=[h4_nodes]))
    # Linear Layer
    weights_linear = tf.Variable(tf.truncated_normal([h4_nodes, num_labels], 
                                               stddev=np.sqrt(2.0 / h4_nodes)))
    biases_linear = tf.Variable(tf.constant(bias_init, shape=[num_labels]))
    
    
    # Keep probability for dropout on the hidden layers
    keep_prob = tf.placeholder(tf.float32)

    # Training computation with dropout on all four hidden layers
    hidden_out_1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights_h1) + biases_h1)
    hidden_drop_1 = tf.nn.dropout(hidden_out_1, keep_prob)
    hidden_out_2 = tf.nn.relu(tf.matmul(hidden_drop_1, weights_h2) + biases_h2)
    hidden_drop_2 = tf.nn.dropout(hidden_out_2, keep_prob)
    hidden_out_3 = tf.nn.relu(tf.matmul(hidden_drop_2, weights_h3) + biases_h3)
    hidden_drop_3 = tf.nn.dropout(hidden_out_3, keep_prob)
    hidden_out_4 = tf.nn.relu(tf.matmul(hidden_drop_3, weights_h4) + biases_h4)
    hidden_drop_4 = tf.nn.dropout(hidden_out_4, keep_prob)
    linear = tf.matmul(hidden_drop_4, weights_linear) + biases_linear
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=linear, labels=tf_train_labels))
    
    # L2 regularization 
    loss_reg = loss + beta*(
        tf.nn.l2_loss(weights_h1) + tf.nn.l2_loss(weights_h2) + tf.nn.l2_loss(weights_h3) +
        tf.nn.l2_loss(weights_h4) + tf.nn.l2_loss(weights_linear))

    # Learning Rate and Optimizer (using Nesterov momentum)
    learning_rate = tf.train.exponential_decay(init_learning_rate, global_step, 
                                               1000, decay_rate, staircase=True)
    optimizer = tf.train.MomentumOptimizer(learning_rate = learning_rate, 
                                           use_nesterov = True, 
                                           momentum = momentum).minimize(loss_reg, global_step)

    # Predictions for the training data
    train_prediction = tf.nn.softmax(linear)
    
    # Predictions for validation data   
    hidden_out_1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights_h1) + biases_h1)
    hidden_out_2_valid = tf.nn.relu(tf.matmul(hidden_out_1_valid, weights_h2) + biases_h2)
    hidden_out_3_valid = tf.nn.relu(tf.matmul(hidden_out_2_valid, weights_h3) + biases_h3)
    hidden_out_4_valid = tf.nn.relu(tf.matmul(hidden_out_3_valid, weights_h4) + biases_h4)
    linear_valid = tf.matmul(hidden_out_4_valid, weights_linear) + biases_linear
    valid_prediction = tf.nn.softmax(linear_valid)
    
    # Predictions for test data
    hidden_out_1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights_h1) + biases_h1)
    hidden_out_2_test = tf.nn.relu(tf.matmul(hidden_out_1_test, weights_h2) + biases_h2)
    hidden_out_3_test = tf.nn.relu(tf.matmul(hidden_out_2_test, weights_h3) + biases_h3)
    hidden_out_4_test = tf.nn.relu(tf.matmul(hidden_out_3_test, weights_h4) + biases_h4)
    linear_test = tf.matmul(hidden_out_4_test, weights_linear) + biases_linear
    test_prediction = tf.nn.softmax(linear_test)
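
For readers unfamiliar with Nesterov momentum, here is a pure-Python sketch of one common formulation of the update, applied to a toy one-dimensional objective. Whether this matches the internals of `tf.train.MomentumOptimizer` with `use_nesterov = True` exactly is an assumption; the helper name `nesterov_step` and the toy objective are made up for this example.

```python
def nesterov_step(w, velocity, grad, learning_rate, momentum):
    # Accumulate the gradient into the velocity, then step along the
    # gradient plus a momentum-weighted look-ahead on the new velocity.
    velocity = momentum * velocity + grad
    w = w - learning_rate * (grad + momentum * velocity)
    return w, velocity

# Minimize the toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, velocity = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, velocity = nesterov_step(w, velocity, grad,
                                learning_rate=0.1, momentum=0.1)

print(w)  # approaches the minimizer at 3.0
```

With `momentum = 0.1`, as in the constants above, the velocity term contributes only a small correction to plain gradient descent; values closer to 0.9 are a common alternative to try.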

In [67]:
num_steps = 40001
step = 0
prob = 0.5

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : prob}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
    
        if (step in [250, 500, 750, 1500] or step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3.272139
Minibatch accuracy: 10.2%
Validation accuracy: 12.8%
Test accuracy: 13.4%
Minibatch loss at step 250: 0.680989
Minibatch accuracy: 83.6%
Validation accuracy: 82.0%
Test accuracy: 89.1%
Minibatch loss at step 500: 0.478473
Minibatch accuracy: 86.7%
Validation accuracy: 83.9%
Test accuracy: 90.6%
Minibatch loss at step 750: 0.627182
Minibatch accuracy: 82.8%
Validation accuracy: 83.6%
Test accuracy: 90.3%
Minibatch loss at step 1000: 0.669082
Minibatch accuracy: 81.2%
Validation accuracy: 84.3%
Test accuracy: 91.0%
Minibatch loss at step 1500: 0.525709
Minibatch accuracy: 85.2%
Validation accuracy: 85.0%
Test accuracy: 91.7%
Minibatch loss at step 2000: 0.404569
Minibatch accuracy: 91.4%
Validation accuracy: 85.6%
Test accuracy: 92.4%
Minibatch loss at step 3000: 0.540352
Minibatch accuracy: 85.2%
Validation accuracy: 86.0%
Test accuracy: 92.6%
Minibatch loss at step 4000: 0.458665
Minibatch accuracy: 87.5%
Validation accuracy: 86.7%
[remaining output truncated]

After step 40,000, we have an accuracy of 96.2%. I suspect that the model could continue to improve if we ran it for longer.
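
To show how small the step size has become by the end of the run, here is a pure-Python sketch of the staircase formula documented for `tf.train.exponential_decay`, using the constants from the Part 2 graph. The helper name `decayed_learning_rate` is made up for this example.

```python
def decayed_learning_rate(step, init_learning_rate=0.5, decay_rate=0.8,
                          decay_steps=1000, staircase=True):
    # Exponential decay: init * decay_rate ** (step / decay_steps).
    # With staircase=True the exponent is floored, so the rate drops in
    # discrete jumps every decay_steps steps.
    exponent = step // decay_steps if staircase else step / float(decay_steps)
    return init_learning_rate * decay_rate ** exponent

for step in [0, 999, 1000, 10000, 40000]:
    print(step, decayed_learning_rate(step))
```

By step 40,000 the rate has been multiplied by 0.8 forty times and is below 1e-4, so if the run is extended, a slower decay may be worth trying alongside more steps.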