Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

# my additional imports
import timeit

First reload the data we generated in `1_notmnist.ipynb`.

In [2]:
pickle_file = 'notMNIST.pickle'
sanitized_pickle_file = 'notMNIST_sanitized.pickle' # additional work (it includes cleaned test and valid datasets)

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape, '\n')

with open(sanitized_pickle_file, 'rb') as f:
  save = pickle.load(f)
  sanit_valid_dataset = save['valid_dataset']
  sanit_valid_labels = save['valid_labels']
  sanit_test_dataset = save['test_dataset']
  sanit_test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('sanit_Validation set', sanit_valid_dataset.shape, sanit_valid_labels.shape)
  print('sanit_Test set', sanit_test_dataset.shape, sanit_test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,) 

sanit_Validation set (8984, 28, 28) (8984,)
sanit_Test set (8709, 28, 28) (8709,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)

sanit_valid_dataset, sanit_valid_labels = reformat(sanit_valid_dataset, sanit_valid_labels)
sanit_test_dataset, sanit_test_labels = reformat(sanit_test_dataset, sanit_test_labels)

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape, '\n')

print('sanit_Validation set', sanit_valid_dataset.shape, sanit_valid_labels.shape)
print('sanit_Test set', sanit_test_dataset.shape, sanit_test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10) 

sanit_Validation set (8984, 784) (8984, 10)
sanit_Test set (8709, 784) (8709, 10)
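
The one-hot step in `reformat` relies on NumPy broadcasting. As a small illustration, the sketch below applies the same comparison to three made-up labels (the `toy_labels` array is hypothetical and not part of the notMNIST data):

```python
import numpy as np

num_labels = 10

# Hypothetical labels, used only for illustration.
toy_labels = np.array([1, 2, 9])

# toy_labels[:, None] has shape (3, 1); np.arange(num_labels) has shape (10,).
# Broadcasting the comparison yields a (3, 10) boolean matrix with exactly
# one True per row, at the column index equal to the label.
one_hot = (np.arange(num_labels) == toy_labels[:, None]).astype(np.float32)

print(one_hot.shape)
print(one_hot[0])
```

Each row sums to 1, and `argmax` along axis 1 recovers the original label, which is what the `accuracy` helper depends on.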


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---
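
As a rough sketch of what the penalty does: `tf.nn.l2_loss(t)` returns half the sum of the squared entries of `t`, so the regularized objective is the data loss plus `beta * sum(w ** 2) / 2`. The NumPy version below mirrors that arithmetic. The names `l2_loss_np` and `beta`, and all the toy numbers, are assumptions made for illustration only.

```python
import numpy as np

def l2_loss_np(t):
    """NumPy counterpart of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return 0.5 * np.sum(np.square(t))

# Toy values, chosen only to make the arithmetic easy to follow.
weights = np.array([[1.0, -2.0], [0.5, 0.0]])
data_loss = 0.30   # stand-in for the cross-entropy on a minibatch
beta = 0.01        # regularization strength (hypothetical value)

penalty = l2_loss_np(weights)            # (1 + 4 + 0.25 + 0) / 2 = 2.625
total_loss = data_loss + beta * penalty  # 0.30 + 0.02625 = 0.32625
print(penalty, total_loss)
```

The gradient of the penalty with respect to each weight is `beta * w`, so under plain gradient descent every step shrinks the weights slightly toward zero.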

## Solution 1

### Logistic regression

First, the graph for logistic regression:

In [5]:
batch_size = 128
L2_reg = 0.005 # ratio for L2 regularization

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels]))
  biases = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  logits = tf.matmul(tf_train_dataset, weights) + biases
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    + L2_reg * tf.nn.l2_loss(weights))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(
    tf.matmul(tf_valid_dataset, weights) + biases)
  test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Next, the optimization loop for logistic regression:

In [6]:
num_steps = 8001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 32.426292
Minibatch accuracy: 9.4%
Validation accuracy: 8.1%
Minibatch loss at step 500: 10.458254
Minibatch accuracy: 69.5%
Validation accuracy: 72.6%
Minibatch loss at step 1000: 6.708092
Minibatch accuracy: 76.6%
Validation accuracy: 75.4%
Minibatch loss at step 1500: 3.944610
Minibatch accuracy: 73.4%
Validation accuracy: 77.5%
Minibatch loss at step 2000: 2.424660
Minibatch accuracy: 78.1%
Validation accuracy: 79.2%
Minibatch loss at step 2500: 1.748458
Minibatch accuracy: 82.0%
Validation accuracy: 80.6%
Minibatch loss at step 3000: 1.347484
Minibatch accuracy: 82.8%
Validation accuracy: 81.5%
Minibatch loss at step 3500: 1.001347
Minibatch accuracy: 83.6%
Validation accuracy: 82.1%
Minibatch loss at step 4000: 0.867111
Minibatch accuracy: 82.8%
Validation accuracy: 82.3%
Minibatch loss at step 4500: 0.654794
Minibatch accuracy: 85.2%
Validation accuracy: 82.6%
Minibatch loss at step 5000: 0.804653
Minibatch accuracy: 80.5%
Validation accurac

### The Neural Network

Graph for neural network

In [7]:
n_hidden = 1024 # Number of hidden nodes
batch_size = 128
L2_reg = 0.001 # ratio for L2 regularization

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Define Weights and biases as variables for hidden layer
    W_1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, n_hidden])
    )
    b_1 = tf.Variable(tf.zeros([n_hidden]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.Variable(
        tf.truncated_normal([n_hidden, num_labels]))
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_1 = tf.nn.relu(tf.matmul(tf_train_dataset, W_1) + b_1) # shape will be (batch_size, n_hidden)
    
    logits_0 = tf.matmul(logits_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1)))
    
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions
    train_prediction = tf.nn.softmax(logits_0)
    valid_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W_1) + b_1), W_0) + b_0)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W_1) + b_1), W_0) + b_0)

And optimize the neural network

In [8]:
n_steps = 5001 # number of steps to be taken

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(n_steps):
        
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Prepare minibatch data
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = sess.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                % (accuracy(predictions, batch_labels), 
                accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 652.911011
Minibatch  train and validation accuracy: 10.2%, 29.7%
Minibatch loss at step 500: 194.107559
Minibatch  train and validation accuracy: 77.3%, 80.1%
Minibatch loss at step 1000: 115.058029
Minibatch  train and validation accuracy: 82.8%, 81.6%
Minibatch loss at step 1500: 69.195892
Minibatch  train and validation accuracy: 80.5%, 82.5%
Minibatch loss at step 2000: 41.897491
Minibatch  train and validation accuracy: 83.6%, 84.3%
Minibatch loss at step 2500: 25.331303
Minibatch  train and validation accuracy: 89.1%, 85.9%
Minibatch loss at step 3000: 15.448248
Minibatch  train and validation accuracy: 87.5%, 86.3%
Minibatch loss at step 3500: 9.673589
Minibatch  train and validation accuracy: 88.3%, 87.0%
Minibatch loss at step 4000: 5.992871
Minibatch  train and validation accuracy: 88.3%, 87.0%
Minibatch loss at step 4500: 3.769463
Minibatch  train and validation accuracy: 89.1%, 87.5%
Minibatch loss at step 5000: 2.547562
Minibatch  tra

---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

## Solution 2

In [9]:
n_hidden = 1024 # Number of hidden nodes
batch_size = 128
L2_reg = 0.000 # ratio for L2 regularization (use no regularization here)

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Define Weights and biases as variables for hidden layer
    W_1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, n_hidden])
    )
    b_1 = tf.Variable(tf.zeros([n_hidden]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.Variable(
        tf.truncated_normal([n_hidden, num_labels]))
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_1 = tf.nn.relu(tf.matmul(tf_train_dataset, W_1) + b_1) # shape will be (batch_size, n_hidden)
    
    logits_0 = tf.matmul(logits_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1)))
    
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions
    train_prediction = tf.nn.softmax(logits_0)
    valid_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W_1) + b_1), W_0) + b_0)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W_1) + b_1), W_0) + b_0)

In [10]:
n_steps = 5001 # number of steps to be taken
mod_batches = 3 # reuse only the first few batches over and over again

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(n_steps):
        
        offset = (step * batch_size) % (mod_batches * batch_size - batch_size) # wrap around so only the first few batches are reused
        
        # Prepare minibatch data
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = sess.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                % (accuracy(predictions, batch_labels), 
                accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 342.359314
Minibatch  train and validation accuracy: 7.8%, 27.7%
Minibatch loss at step 500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 1000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 1500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 2000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 2500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 3000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 3500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 4000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 4500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 69.3%
Minibatch loss at step 5000: 0.000000
Minibatch  tr

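The model memorizes the few training batches it sees: the minibatch loss falls to zero and minibatch accuracy reaches 100%, while validation accuracy stalls at 69.3%, well below the roughly 87% that the regularized network reached on the full training set. Test accuracy drops accordingly.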
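
To see how little data this run actually touches, the offset formula from the training loop can be evaluated on its own. The sketch below only replays the index arithmetic (no training) with the same `batch_size`, `mod_batches`, and `n_steps` values:

```python
batch_size = 128
mod_batches = 3
n_steps = 5001

# Same offset expression as in the training loop.
offsets = sorted({(step * batch_size) % (mod_batches * batch_size - batch_size)
                  for step in range(n_steps)})

print(offsets)
print("distinct training examples:", len(offsets) * batch_size)
```

Because the modulus is `(mod_batches - 1) * batch_size`, the loop alternates between two minibatches (256 examples in total). A network with roughly 800,000 weights in its hidden layer alone can fit that many examples exactly.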

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---
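
A minimal NumPy sketch of the idea, assuming the inverted-dropout convention that `tf.nn.dropout` follows (kept activations are scaled by `1 / keep_prob`, so no rescaling is needed at evaluation time). The helper `dropout_np` and its `training` flag are hypothetical names used for illustration:

```python
import numpy as np

def dropout_np(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob during training.

    Survivors are scaled by 1 / keep_prob so the expected value is unchanged.
    At evaluation time the input is returned untouched.
    """
    if not training:
        return activations
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((4, 1000), dtype=np.float32)

train_out = dropout_np(hidden, keep_prob=0.5, rng=rng, training=True)
eval_out = dropout_np(hidden, keep_prob=0.5, rng=rng, training=False)

# Training output contains only 0.0 and 2.0; its mean stays close to 1.0.
print(np.unique(train_out), train_out.mean())
# Evaluation output is deterministic and identical to the input.
print(np.array_equal(eval_out, hidden))
```

In a TensorFlow graph the same separation is obtained by building the evaluation predictions from the hidden activations directly, without passing them through the dropout op.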

## Solution 3

Neural network with dropout applied to hidden layer

In [11]:
n_hidden = 1024 # Number of hidden nodes
batch_size = 128
L2_reg = 0.000 # ratio for L2 regularization (use no regularization here)

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Define Weights and biases as variables for hidden layer
    W_1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, n_hidden])
    )
    b_1 = tf.Variable(tf.zeros([n_hidden]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.Variable(
        tf.truncated_normal([n_hidden, num_labels]))
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    # Hidden layer activations
    logits_1 = tf.nn.relu(tf.matmul(tf_train_dataset, W_1) + b_1) # shape will be (batch_size, n_hidden)
    
    # Dropout (keep_prob = 0.5) is applied on the training path only
    dropout_1 = tf.nn.dropout(logits_1, 0.5)
    
    logits_0 = tf.matmul(dropout_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1)))
    
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions
    train_prediction = tf.nn.softmax(tf.matmul(logits_1, W_0) + b_0)
    valid_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W_1) + b_1), W_0) + b_0)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W_1) + b_1), W_0) + b_0)

In [12]:
n_steps = 5001 # number of steps to be taken
mod_batches = 3 # reuse only the first few batches over and over again

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(n_steps):
        
        offset = (step * batch_size) % (mod_batches * batch_size - batch_size) # wrap around so only the first few batches are reused
        
        # Prepare minibatch data
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = sess.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                % (accuracy(predictions, batch_labels), 
                accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 511.439087
Minibatch  train and validation accuracy: 7.8%, 21.6%
Minibatch loss at step 500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.4%
Minibatch loss at step 1000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.2%
Minibatch loss at step 1500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.5%
Minibatch loss at step 2000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.3%
Minibatch loss at step 2500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.9%
Minibatch loss at step 3000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.8%
Minibatch loss at step 3500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 73.5%
Minibatch loss at step 4000: 0.000000
Minibatch  train and validation accuracy: 100.0%, 74.3%
Minibatch loss at step 4500: 0.000000
Minibatch  train and validation accuracy: 100.0%, 74.1%
Minibatch loss at step 5000: 0.000000
Minibatch  tr

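With only a few training batches, dropout improves the test accuracy noticeably (by about 5.3 percentage points). Validation accuracy likewise rises from 69.3% to roughly 74%, although the model still fits the training batches perfectly.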

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---


## Solution 4

### Solution 4.a Model with 1 hidden layer

Our best test accuracy was ~94% for a network with 1 hidden layer.

Let's build a neural network with one hidden layer of 1024 units, combining L2 regularization, dropout, and learning rate decay.

In [13]:
n_hidden = 1024 # Number of hidden nodes
batch_size = 500
L2_reg = 0.001 # ratio for L2 regularization

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Use learning rate decay
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.5, global_step, 1000, 0.95, staircase=True)
    
    # Define Weights and biases as variables for hidden layer
    W_1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, n_hidden])
    )
    b_1 = tf.Variable(tf.zeros([n_hidden]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.Variable(
        tf.truncated_normal([n_hidden, num_labels]))
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_1 = tf.nn.relu(tf.matmul(tf_train_dataset, W_1) + b_1) # shape will be (batch_size, n_hidden)
    
    dropout_1 = tf.nn.dropout(logits_1, 0.5)
    
    logits_0 = tf.matmul(dropout_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1)))
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions
    train_prediction = tf.nn.softmax(tf.matmul(logits_1, W_0) + b_0)
    valid_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W_1) + b_1), W_0) + b_0)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W_1) + b_1), W_0) + b_0)
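
For reference, the decay schedule configured in this graph can be reproduced with plain arithmetic. With `staircase=True`, `tf.train.exponential_decay` computes `initial_rate * decay_rate ** floor(global_step / decay_steps)`. The function below is a hypothetical pure-Python stand-in for that formula, not TensorFlow's implementation:

```python
def staircase_decay(initial_rate, global_step, decay_steps, decay_rate):
    """Learning rate after global_step updates with staircase decay."""
    return initial_rate * decay_rate ** (global_step // decay_steps)

# Same settings as the graph: start at 0.5, multiply by 0.95 every 1000 steps.
for step in (0, 999, 1000, 2000, 5000):
    print(step, staircase_decay(0.5, step, 1000, 0.95))
```

These values agree with the rates printed in the training log, for example 0.475 once 1000 updates have been made and 0.45125 after 2000.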

Let's do the optimization with early stopping

In [14]:
n_epochs = 200 # number of epochs

# early-stopping parameters
patience = 5000 # run at least this many iterations (minibatches) regardless
patience_increase = 2 # extend patience to iter * patience_increase on a significant improvement
improvement_threshold = 0.995 # a relative improvement of
                              # this much is considered significant
best_valid_loss = np.inf
start_time = timeit.default_timer()

n_train_batches = train_dataset.shape[0] // batch_size

valid_freq = min(n_train_batches, patience // 2)

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):
            
            batch_data = \
                train_dataset[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            batch_labels = \
                train_labels[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = sess.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
            iter = (epoch - 1) * n_train_batches + minibatch_index # cumulative iteration number
        
            if (iter + 1) % valid_freq == 0:
                this_valid_loss = 1. - (accuracy(valid_prediction.eval(), valid_labels) / 100.)
            
                print("Minibatch loss at epoch %i and iter %i: %f and learning rate: %f" % 
                      (epoch, iter, l, sess.run(learning_rate)))
                print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                    % (accuracy(predictions, batch_labels), 
                    accuracy(valid_prediction.eval(), valid_labels)))
            
                if this_valid_loss < best_valid_loss:
                    if this_valid_loss < best_valid_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    
                    best_valid_loss = this_valid_loss
                    
                    params = (sess.run(W_0), sess.run(b_0),
                              sess.run(W_1), sess.run(b_1))
                    
                    # save the best model
                    with open('best_model_params.pkl', 'wb') as f:
                        pickle.dump(params, f)
                    print('Model saved')
        
            if patience <= iter:
                done_looping = True
                break
                    
    print("Final Test accuracy: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))
    
    with open('best_model_params.pkl', 'rb') as f:
        W_0_best, b_0_best, W_1_best, b_1_best = pickle.load(f)
        
    W_0_init, b_0_init = tf.assign(W_0, W_0_best), tf.assign(b_0, b_0_best)
    W_1_init, b_1_init = tf.assign(W_1, W_1_best), tf.assign(b_1, b_1_best)
    
    sess.run([W_0_init, b_0_init, W_1_init, b_1_init])
    
    print("Test accuracy with the best model: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))

    end_time = timeit.default_timer()

    print('Total run time %.4f minutes' % ((end_time - start_time) / 60.))

Initialized
Minibatch loss at epoch 1 and iter 399: 215.319443 and learning rate: 0.500000
Minibatch  train and validation accuracy: 81.2%, 80.8%
Model saved
Minibatch loss at epoch 2 and iter 799: 140.936600 and learning rate: 0.500000
Minibatch  train and validation accuracy: 78.8%, 81.4%
Model saved
Minibatch loss at epoch 3 and iter 1199: 94.633652 and learning rate: 0.475000
Minibatch  train and validation accuracy: 77.4%, 81.2%
Minibatch loss at epoch 4 and iter 1599: 63.851974 and learning rate: 0.475000
Minibatch  train and validation accuracy: 83.0%, 83.8%
Model saved
Minibatch loss at epoch 5 and iter 1999: 43.597389 and learning rate: 0.451250
Minibatch  train and validation accuracy: 83.2%, 84.6%
Model saved
Minibatch loss at epoch 6 and iter 2399: 30.428461 and learning rate: 0.451250
Minibatch  train and validation accuracy: 84.2%, 85.9%
Model saved
Minibatch loss at epoch 7 and iter 2799: 21.329872 and learning rate: 0.451250
Minibatch  train and validation accuracy: 86.

Minibatch  train and validation accuracy: 91.4%, 89.4%
Minibatch loss at epoch 60 and iter 23999: 0.461299 and learning rate: 0.145994
Minibatch  train and validation accuracy: 91.6%, 89.4%
Minibatch loss at epoch 61 and iter 24399: 0.458182 and learning rate: 0.145994
Minibatch  train and validation accuracy: 91.6%, 89.4%
Minibatch loss at epoch 62 and iter 24799: 0.452661 and learning rate: 0.145994
Minibatch  train and validation accuracy: 91.6%, 89.4%
Minibatch loss at epoch 63 and iter 25199: 0.448144 and learning rate: 0.138695
Minibatch  train and validation accuracy: 91.4%, 89.4%
Minibatch loss at epoch 64 and iter 25599: 0.469139 and learning rate: 0.138695
Minibatch  train and validation accuracy: 91.4%, 89.4%
Minibatch loss at epoch 65 and iter 25999: 0.469924 and learning rate: 0.131760
Minibatch  train and validation accuracy: 91.4%, 89.4%
Minibatch loss at epoch 66 and iter 26399: 0.463965 and learning rate: 0.131760
Minibatch  train and validation accuracy: 91.6%, 89.5%
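
The patience rule in the training loop can be replayed in isolation on a made-up sequence of validation losses. This is a simplified sketch: `replay_patience` is a hypothetical helper, the loss values are invented, and no model is trained or saved.

```python
def replay_patience(valid_losses, valid_freq, patience,
                    patience_increase=2, improvement_threshold=0.995):
    """Replay the early-stopping rule over precomputed validation losses.

    valid_losses[k] is the loss seen at the k-th validation check, which
    happens at iteration (k + 1) * valid_freq - 1.
    Returns (best_loss, best_iter, stop_iter); stop_iter is None if the
    losses run out before patience is exhausted.
    """
    best_loss = float("inf")
    best_iter = None
    for it in range(len(valid_losses) * valid_freq):
        if (it + 1) % valid_freq == 0:
            loss = valid_losses[(it + 1) // valid_freq - 1]
            if loss < best_loss:
                # A significant improvement pushes the stopping point out.
                if loss < best_loss * improvement_threshold:
                    patience = max(patience, it * patience_increase)
                best_loss, best_iter = loss, it
        if patience <= it:
            return best_loss, best_iter, it
    return best_loss, best_iter, None

# Invented losses: one real improvement, then a plateau.
result = replay_patience([0.20, 0.15, 0.15, 0.15, 0.15, 0.15],
                         valid_freq=10, patience=25)
print(result)
```

In this toy run the last significant improvement occurs at iteration 19, which extends patience to 38, so the loop stops at iteration 38 and the best parameters are those from iteration 19.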


---
### Solution 4.b Model with two hidden layers

Now let's try a model with two hidden layers. Initializing the weights with `tf.truncated_normal` at its default standard deviation of 1.0 does not work here, so we initialize them with [Xavier initialization](http://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Fall.2016/pdfs/1111/AISTATS2010_Glorot.pdf) instead.
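
For intuition, here is a NumPy sketch of Glorot (Xavier) uniform initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The helper name `xavier_uniform_np` is made up for illustration; the actual graph obtains its weights from `tf.contrib.layers.xavier_initializer()`.

```python
import numpy as np

def xavier_uniform_np(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out)).astype(np.float32)

rng = np.random.RandomState(0)
W = xavier_uniform_np(784, 1024, rng)  # same shape as the first weight matrix

limit = np.sqrt(6.0 / (784 + 1024))
print(W.shape, float(limit), float(W.std()))
```

For this layer the limit is about 0.058 and the standard deviation about 0.033, far smaller than the `stddev=1.0` default of `tf.truncated_normal`. Smaller initial weights are a plausible reason the deeper network trains more reliably with this scheme.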

In [15]:
n_hidden_2 = 1024 # Number of nodes in hidden layer 2 (the one fed by the input)
n_hidden_1 = 500  # Number of nodes in hidden layer 1 (the one feeding the output layer)
batch_size = 128
L2_reg = 0.000 # ratio for L2 regularization

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Use learning rate decay
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.05, global_step, 5000, 0.95, staircase=True)
    
    # Define Weights and biases as variables for hidden layers
    W_2 = tf.get_variable("W2", shape=[image_size * image_size, n_hidden_2],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_2 = tf.Variable(tf.zeros([n_hidden_2]))
    
    W_1 = tf.get_variable("W1", shape=[n_hidden_2, n_hidden_1],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_1 = tf.Variable(tf.zeros([n_hidden_1]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.get_variable("W0", shape=[n_hidden_1, num_labels],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_2 = tf.nn.relu(tf.matmul(tf_train_dataset, W_2) + b_2) # shape will be (batch_size, n_hidden_2)
    
    dropout_2 = tf.nn.dropout(logits_2, 0.5)
    
    logits_1 = tf.nn.relu(tf.matmul(dropout_2, W_1) + b_1) # shape will be (batch_size, n_hidden_1)
    
    dropout_1 = tf.nn.dropout(logits_1, 0.5)
    
    logits_0 = tf.matmul(dropout_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1) + tf.nn.l2_loss(W_2)))
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions
    def get_logits(dataset):
        get_logits_2 = tf.nn.relu(tf.matmul(dataset, W_2) + b_2)
        get_logits_1 = tf.nn.relu(tf.matmul(get_logits_2, W_1) + b_1)
        get_logits_0 = tf.matmul(get_logits_1, W_0) + b_0
        return get_logits_0

    train_prediction = tf.nn.softmax(get_logits(tf_train_dataset))
    valid_prediction = tf.nn.softmax(get_logits(tf_valid_dataset))
    test_prediction = tf.nn.softmax(get_logits(tf_test_dataset))

In [16]:
n_epochs = 200 # number of epochs

# early-stopping parameters
patience = 5000 # run at least this many iterations (minibatches) regardless
patience_increase = 2 # extend patience to iter * patience_increase on a significant improvement
improvement_threshold = 0.995 # a relative improvement of
                              # this much is considered significant
best_valid_loss = np.inf
start_time = timeit.default_timer()

n_train_batches = train_dataset.shape[0] // batch_size

valid_freq = min(n_train_batches, patience // 2)

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):
            
            batch_data = \
                train_dataset[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            batch_labels = \
                train_labels[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = sess.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
            iter = (epoch - 1) * n_train_batches + minibatch_index # cumulative iteration number
        
            if (iter + 1) % valid_freq == 0:
                this_valid_loss = 1. - (accuracy(valid_prediction.eval(), valid_labels) / 100.)
            
                print("Minibatch loss at epoch %i and iter %i: %f and learning rate: %f" % 
                      (epoch, iter, l, sess.run(learning_rate)))
                print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                    % (accuracy(predictions, batch_labels), 
                    accuracy(valid_prediction.eval(), valid_labels)))
            
                if this_valid_loss < best_valid_loss:
                    if this_valid_loss < best_valid_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    
                    best_valid_loss = this_valid_loss
                    
                    # fetch all parameters in a single session call
                    params = tuple(sess.run([W_0, b_0, W_1, b_1, W_2, b_2]))
                    
                    # save the best model
                    with open('best_model_params.pkl', 'wb') as f:
                            pickle.dump(params, f)
                    print('Model saved')
        
            if patience <= iter:
                    done_looping = True
                    break
                    
    print("Final Test accuracy: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))
    
    with open('best_model_params.pkl', 'rb') as f:
        W_0_best, b_0_best, W_1_best, b_1_best, W_2_best, b_2_best = pickle.load(f)
        
    W_0_init, b_0_init = tf.assign(W_0, W_0_best), tf.assign(b_0, b_0_best)
    W_1_init, b_1_init = tf.assign(W_1, W_1_best), tf.assign(b_1, b_1_best)
    W_2_init, b_2_init = tf.assign(W_2, W_2_best), tf.assign(b_2, b_2_best)
    
    sess.run([W_0_init, b_0_init, W_1_init, b_1_init, W_2_init, b_2_init])
    
    print("Test accuracy with the best model: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))

    end_time = timeit.default_timer()

    print('Total run time %.4f minutes' % ((end_time - start_time) / 60.))

Initialized
Minibatch loss at epoch 1 and iter 1561: 0.441507 and learning rate: 0.050000
Minibatch  train and validation accuracy: 89.1%, 84.7%
Model saved
Minibatch loss at epoch 2 and iter 3123: 0.406085 and learning rate: 0.050000
Minibatch  train and validation accuracy: 90.6%, 86.0%
Model saved
Minibatch loss at epoch 3 and iter 4685: 0.360064 and learning rate: 0.050000
Minibatch  train and validation accuracy: 90.6%, 86.6%
Model saved
Minibatch loss at epoch 4 and iter 6247: 0.351320 and learning rate: 0.047500
Minibatch  train and validation accuracy: 91.4%, 87.4%
Model saved
Minibatch loss at epoch 5 and iter 7809: 0.314280 and learning rate: 0.047500
Minibatch  train and validation accuracy: 92.2%, 87.8%
Model saved
Minibatch loss at epoch 6 and iter 9371: 0.299480 and learning rate: 0.047500
Minibatch  train and validation accuracy: 92.2%, 88.1%
Model saved
Minibatch loss at epoch 7 and iter 10933: 0.289495 and learning rate: 0.045125
Minibatch  train and validation accurac

Minibatch loss at epoch 59 and iter 92157: 0.103944 and learning rate: 0.019861
Minibatch  train and validation accuracy: 99.2%, 91.0%
Minibatch loss at epoch 60 and iter 93719: 0.079282 and learning rate: 0.019861
Minibatch  train and validation accuracy: 99.2%, 91.1%
Minibatch loss at epoch 61 and iter 95281: 0.104421 and learning rate: 0.018868
Minibatch  train and validation accuracy: 99.2%, 91.1%
Minibatch loss at epoch 62 and iter 96843: 0.139491 and learning rate: 0.018868
Minibatch  train and validation accuracy: 99.2%, 91.1%
Minibatch loss at epoch 63 and iter 98405: 0.101412 and learning rate: 0.018868
Minibatch  train and validation accuracy: 99.2%, 91.1%
Minibatch loss at epoch 64 and iter 99967: 0.132190 and learning rate: 0.018868
Minibatch  train and validation accuracy: 99.2%, 91.1%
Minibatch loss at epoch 65 and iter 101529: 0.108443 and learning rate: 0.017924
Minibatch  train and validation accuracy: 99.2%, 91.2%
Model saved
Minibatch loss at epoch 66 and iter 103091

Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 119 and iter 185877: 0.076099 and learning rate: 0.007495
Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 120 and iter 187439: 0.065206 and learning rate: 0.007495
Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 121 and iter 189001: 0.141898 and learning rate: 0.007495
Minibatch  train and validation accuracy: 99.2%, 91.2%
Minibatch loss at epoch 122 and iter 190563: 0.137284 and learning rate: 0.007120
Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 123 and iter 192125: 0.078145 and learning rate: 0.007120
Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 124 and iter 193687: 0.056069 and learning rate: 0.007120
Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 125 and iter 195249: 0.101927 and learning rate: 0.006764
Minibatch  train and validation accuracy:

Minibatch  train and validation accuracy: 99.2%, 91.3%
Minibatch loss at epoch 179 and iter 279597: 0.064470 and learning rate: 0.002977
Minibatch  train and validation accuracy: 99.2%, 91.4%
Minibatch loss at epoch 180 and iter 281159: 0.081744 and learning rate: 0.002828
Minibatch  train and validation accuracy: 99.2%, 91.4%
Minibatch loss at epoch 181 and iter 282721: 0.071899 and learning rate: 0.002828
Minibatch  train and validation accuracy: 99.2%, 91.5%
Model saved
Minibatch loss at epoch 182 and iter 284283: 0.102359 and learning rate: 0.002828
Minibatch  train and validation accuracy: 99.2%, 91.4%
Minibatch loss at epoch 183 and iter 285845: 0.054004 and learning rate: 0.002687
Minibatch  train and validation accuracy: 99.2%, 91.4%
Minibatch loss at epoch 184 and iter 287407: 0.074226 and learning rate: 0.002687
Minibatch  train and validation accuracy: 99.2%, 91.4%
Minibatch loss at epoch 185 and iter 288969: 0.066165 and learning rate: 0.002687
Minibatch  train and validati
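
The early-stopping bookkeeping in the training loop above can be hard to follow when it is interleaved with the TensorFlow calls. The following is a minimal pure-Python sketch of the same patience rule, replayed on a precomputed list of validation losses. The function name `run_early_stopping` and the toy loss values are hypothetical and only serve as illustration; the default parameters mirror the ones used in the cell.

```python
def run_early_stopping(valid_losses, valid_freq, patience=5000,
                       patience_increase=2, improvement_threshold=0.995):
    """Replay the patience rule on a list of validation losses.

    valid_losses[k] is the loss measured at the k-th validation check,
    which happens every `valid_freq` iterations.
    Returns (best_loss, best_iter, last_iter).
    """
    best_loss = float('inf')
    best_iter = None
    n_iters = len(valid_losses) * valid_freq
    for it in range(n_iters):
        if (it + 1) % valid_freq == 0:
            loss = valid_losses[(it + 1) // valid_freq - 1]
            if loss < best_loss:
                # a relative improvement beyond the threshold extends the budget
                if loss < best_loss * improvement_threshold:
                    patience = max(patience, it * patience_increase)
                best_loss, best_iter = loss, it
        if patience <= it:
            # budget exhausted: stop training
            return best_loss, best_iter, it
    return best_loss, best_iter, n_iters - 1

# Toy example: the loss improves twice and then stalls.
print(run_early_stopping([0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
                         valid_freq=10, patience=25))
```

In the toy run the second check (iteration 19) raises `patience` to 38, and since no later check improves the loss, training stops at iteration 38. In the notebook the best parameters are pickled whenever the loss improves, which is why "Model saved" only appears after some of the validation checks.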

---

### Solution 4.c Model with one hidden layer (Xavier initialization)

Let's see whether initializing the weights with the Xavier method makes any difference for the network with one hidden layer.
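
As a rough illustration of what the initializer does, here is a minimal NumPy sketch of Xavier (Glorot) uniform initialization. It assumes the uniform variant with limit `sqrt(6 / (fan_in + fan_out))`; the helper name `xavier_uniform` is hypothetical, and the layer sizes (28 * 28 inputs, 1024 hidden nodes) are taken from this notebook.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out)).astype(np.float32)

rng = np.random.RandomState(0)
W1 = xavier_uniform(28 * 28, 1024, rng)

# The variance of U(-limit, limit) is limit**2 / 3 = 2 / (fan_in + fan_out),
# so the sample standard deviation should be close to the second number.
print(W1.shape, W1.std(), np.sqrt(2.0 / (28 * 28 + 1024)))
```

The point of this scaling is to keep the variance of the activations roughly constant from layer to layer, so wider layers get smaller initial weights.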

In [17]:
# The graph

n_hidden = 1024 # Number of hidden nodes
batch_size = 500
L2_reg = 0.001 # ratio for L2 regularization

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels))
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Use learning rate decay
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.5, global_step, 1000, 0.95, staircase=True)
    
    # Define Weights and biases as variables for hidden layer
    W_1 = tf.get_variable("W1", shape=[image_size * image_size, n_hidden],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_1 = tf.Variable(tf.zeros([n_hidden]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.get_variable("W0", shape=[n_hidden, num_labels],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_1 = tf.nn.relu(tf.matmul(tf_train_dataset, W_1) + b_1) # shape will be (batch_size, n_hidden)
    
    dropout_1 = tf.nn.dropout(logits_1, 0.5)
    
    logits_0 = tf.matmul(dropout_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1)))
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions
    train_prediction = tf.nn.softmax(tf.matmul(logits_1, W_0) + b_0)
    valid_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W_1) + b_1), W_0) + b_0)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W_1) + b_1), W_0) + b_0)

In [18]:
# Optimization part

n_epochs = 200 # number of epochs

# early-stopping parameters
patience = 5000 # run at least this many minibatch iterations regardless
patience_increase = 2
improvement_threshold = 0.995 # a relative improvement of 
                                  # this much is considered significant
best_valid_loss = np.inf
start_time = timeit.default_timer()

n_train_batches = train_dataset.shape[0] // batch_size

valid_freq = min(n_train_batches, patience // 2)

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):
            
            batch_data = \
                train_dataset[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            batch_labels = \
                train_labels[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = sess.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
        
            iter = (epoch - 1) * n_train_batches + minibatch_index # cumulative iteration number
        
            if (iter + 1) % valid_freq == 0:
                this_valid_loss = 1. - (accuracy(valid_prediction.eval(), valid_labels) / 100.)
            
                print("Minibatch loss at epoch %i and iter %i: %f and learning rate: %f" % 
                      (epoch, iter, l, sess.run(learning_rate)))
                print("Minibatch  train and validation accuracy: %.1f%%, %.1f%%" 
                    % (accuracy(predictions, batch_labels), 
                    accuracy(valid_prediction.eval(), valid_labels)))
            
                if this_valid_loss < best_valid_loss:
                    if this_valid_loss < best_valid_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    
                    best_valid_loss = this_valid_loss
                    
                    # fetch all parameters in a single session call
                    params = tuple(sess.run([W_0, b_0, W_1, b_1]))
                    
                    # save the best model
                    with open('best_model_params.pkl', 'wb') as f:
                            pickle.dump(params, f)
                    print('Model saved')
        
            if patience <= iter:
                    done_looping = True
                    break
                    
    print("Final Test accuracy: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))
    
    with open('best_model_params.pkl', 'rb') as f:
        W_0_best, b_0_best, W_1_best, b_1_best = pickle.load(f)
        
    W_0_init, b_0_init = tf.assign(W_0, W_0_best), tf.assign(b_0, b_0_best)
    W_1_init, b_1_init = tf.assign(W_1, W_1_best), tf.assign(b_1, b_1_best)
    
    sess.run([W_0_init, b_0_init, W_1_init, b_1_init])
    
    print("Test accuracy with the best model: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))

    end_time = timeit.default_timer()

    print('Total run time %.4f minutes' % ((end_time - start_time) / 60.))

Initialized
Minibatch loss at epoch 1 and iter 399: 0.867855 and learning rate: 0.500000
Minibatch  train and validation accuracy: 84.6%, 86.0%
Model saved
Minibatch loss at epoch 2 and iter 799: 0.729239 and learning rate: 0.500000
Minibatch  train and validation accuracy: 86.8%, 87.0%
Model saved
Minibatch loss at epoch 3 and iter 1199: 0.650370 and learning rate: 0.475000
Minibatch  train and validation accuracy: 88.2%, 87.4%
Model saved
Minibatch loss at epoch 4 and iter 1599: 0.612626 and learning rate: 0.475000
Minibatch  train and validation accuracy: 88.8%, 87.6%
Model saved
Minibatch loss at epoch 5 and iter 1999: 0.567722 and learning rate: 0.451250
Minibatch  train and validation accuracy: 88.8%, 87.9%
Model saved
Minibatch loss at epoch 6 and iter 2399: 0.536323 and learning rate: 0.451250
Minibatch  train and validation accuracy: 89.8%, 88.1%
Model saved
Minibatch loss at epoch 7 and iter 2799: 0.545947 and learning rate: 0.451250
Minibatch  train and validation accuracy: 

Minibatch  train and validation accuracy: 91.8%, 89.7%
Minibatch loss at epoch 60 and iter 23999: 0.425387 and learning rate: 0.145994
Minibatch  train and validation accuracy: 92.2%, 89.8%
Model saved
Minibatch loss at epoch 61 and iter 24399: 0.437827 and learning rate: 0.145994
Minibatch  train and validation accuracy: 92.2%, 89.7%
Minibatch loss at epoch 62 and iter 24799: 0.425335 and learning rate: 0.145994
Minibatch  train and validation accuracy: 92.2%, 89.7%
Minibatch loss at epoch 63 and iter 25199: 0.416786 and learning rate: 0.138695
Minibatch  train and validation accuracy: 92.0%, 89.8%
Minibatch loss at epoch 64 and iter 25599: 0.434321 and learning rate: 0.138695
Minibatch  train and validation accuracy: 91.6%, 89.8%
Model saved
Minibatch loss at epoch 65 and iter 25999: 0.416388 and learning rate: 0.131760
Minibatch  train and validation accuracy: 91.8%, 89.6%
Minibatch loss at epoch 66 and iter 26399: 0.405490 and learning rate: 0.131760
Minibatch  train and validation

Minibatch  train and validation accuracy: 93.0%, 89.9%
Minibatch loss at epoch 120 and iter 47999: 0.423429 and learning rate: 0.042629
Minibatch  train and validation accuracy: 92.6%, 90.0%
Minibatch loss at epoch 121 and iter 48399: 0.401812 and learning rate: 0.042629
Minibatch  train and validation accuracy: 92.4%, 90.0%
Minibatch loss at epoch 122 and iter 48799: 0.407014 and learning rate: 0.042629
Minibatch  train and validation accuracy: 92.6%, 89.9%
Minibatch loss at epoch 123 and iter 49199: 0.415623 and learning rate: 0.040497
Minibatch  train and validation accuracy: 92.2%, 89.9%
Minibatch loss at epoch 124 and iter 49599: 0.437810 and learning rate: 0.040497
Minibatch  train and validation accuracy: 92.6%, 89.9%
Minibatch loss at epoch 125 and iter 49999: 0.410962 and learning rate: 0.038472
Minibatch  train and validation accuracy: 92.6%, 89.9%
Minibatch loss at epoch 126 and iter 50399: 0.407475 and learning rate: 0.038472
Minibatch  train and validation accuracy: 92.8%,

Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 180 and iter 71999: 0.405200 and learning rate: 0.012447
Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 181 and iter 72399: 0.406954 and learning rate: 0.012447
Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 182 and iter 72799: 0.420032 and learning rate: 0.012447
Minibatch  train and validation accuracy: 93.0%, 90.0%
Minibatch loss at epoch 183 and iter 73199: 0.409864 and learning rate: 0.011825
Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 184 and iter 73599: 0.401868 and learning rate: 0.011825
Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 185 and iter 73999: 0.419152 and learning rate: 0.011234
Minibatch  train and validation accuracy: 92.8%, 90.0%
Minibatch loss at epoch 186 and iter 74399: 0.408246 and learning rate: 0.011234
Minibatch  train and validation accuracy: 92.8%,

Initializing the weights with the Xavier method appears to have improved the test accuracy.

---

### Solution 4.d Model with three hidden layers

Finally, let's try a network with three hidden layers. This time we will also check the accuracy on the sanitized validation and sanitized test sets that we produced in assignment 1.
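
The graph in the next cell applies `tf.nn.dropout(..., 0.5)` after each hidden layer on the training path, while `get_logits` (used for predictions) skips dropout. The following is a minimal NumPy sketch of "inverted" dropout, in which kept activations are scaled by `1 / keep_prob` so that their expected value is unchanged. The helper name `inverted_dropout` and the toy input are hypothetical.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 50), dtype=np.float32)   # stand-in for a hidden layer output
h_drop = inverted_dropout(h, 0.5, rng)

print(np.unique(h_drop))   # kept units become 2.0, dropped units 0.0
print(h_drop.mean())       # stays close to the original mean of 1.0
```

Because of the rescaling at training time, the prediction path can use the same weights without any correction factor.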

In [19]:
# The graph

n_hidden_3 = 1024  # Number of hidden nodes for hidden layer 3
n_hidden_2 = 300   # Number of hidden nodes for hidden layer 2
n_hidden_1 = 50    # Number of hidden nodes for hidden layer 1
batch_size = 128
L2_reg = 0.000 # coefficient for L2 regularization (0 disables the L2 penalty in this run)

g = tf.Graph()
with g.as_default():
    
    # Placeholders for training dataset minibatches
    tf_train_dataset = tf.placeholder(np.float32, shape=(batch_size, image_size * image_size), name='ph1')
    tf_train_labels = tf.placeholder(np.float32, shape=(batch_size, num_labels), name='ph2')
    
    # Constants for validation and test datasets
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    tf_sanit_valid_dataset = tf.constant(sanit_valid_dataset)
    tf_sanit_test_dataset = tf.constant(sanit_test_dataset)
    
    # Use learning rate decay
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.05, global_step, 5000, 0.95, staircase=True)
    
    # Define Weights and biases as variables for hidden layers
    W_3 = tf.get_variable("W3", shape=[image_size * image_size, n_hidden_3],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_3 = tf.Variable(tf.zeros([n_hidden_3]))
    
    W_2 = tf.get_variable("W2", shape=[n_hidden_3, n_hidden_2],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_2 = tf.Variable(tf.zeros([n_hidden_2]))
    
    W_1 = tf.get_variable("W1", shape=[n_hidden_2, n_hidden_1],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_1 = tf.Variable(tf.zeros([n_hidden_1]))
    
    # Define Weights and biases as variables for logistic regression layer
    W_0 = tf.get_variable("W0", shape=[n_hidden_1, num_labels],
                                initializer=tf.contrib.layers.xavier_initializer())
    b_0 = tf.Variable(tf.zeros([num_labels]))
    
    logits_3 = tf.nn.relu(tf.matmul(tf_train_dataset, W_3) + b_3) # shape will be (batch_size, n_hidden_3)
    
    dropout_3 = tf.nn.dropout(logits_3, 0.5)
    
    logits_2 = tf.nn.relu(tf.matmul(dropout_3, W_2) + b_2) # shape will be (batch_size, n_hidden_2)
    
    dropout_2 = tf.nn.dropout(logits_2, 0.5)
    
    logits_1 = tf.nn.relu(tf.matmul(dropout_2, W_1) + b_1) # shape will be (batch_size, n_hidden_1)
    
    dropout_1 = tf.nn.dropout(logits_1, 0.5)
    
    logits_0 = tf.matmul(dropout_1, W_0) + b_0 # now the shape is (batch_size, num_labels)
    
    loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_0)
            + L2_reg * (tf.nn.l2_loss(W_0) + tf.nn.l2_loss(W_1) 
                        + tf.nn.l2_loss(W_2) + tf.nn.l2_loss(W_3)))
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions
    def get_logits(dataset):
        get_logits_3 = tf.nn.relu(tf.matmul(dataset, W_3) + b_3)
        get_logits_2 = tf.nn.relu(tf.matmul(get_logits_3, W_2) + b_2)
        get_logits_1 = tf.nn.relu(tf.matmul(get_logits_2, W_1) + b_1)
        get_logits_0 = tf.matmul(get_logits_1, W_0) + b_0
        return get_logits_0

    train_prediction = tf.nn.softmax(get_logits(tf_train_dataset))
    valid_prediction = tf.nn.softmax(get_logits(tf_valid_dataset))
    test_prediction = tf.nn.softmax(get_logits(tf_test_dataset))
    
    sanit_valid_prediction = tf.nn.softmax(get_logits(tf_sanit_valid_dataset))
    sanit_test_prediction = tf.nn.softmax(get_logits(tf_sanit_test_dataset))

In [20]:
n_epochs = 200 # number of epochs

# early-stopping parameters
patience = 5000 # run at least this many minibatch iterations regardless
patience_increase = 2
improvement_threshold = 0.995 # a relative improvement of 
                                  # this much is considered significant
best_valid_loss = np.inf
start_time = timeit.default_timer()

n_train_batches = train_dataset.shape[0] // batch_size

valid_freq = min(n_train_batches, patience // 2)

with tf.Session(graph=g) as sess:
    tf.global_variables_initializer().run()
    print('Initialized')
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):
            
            batch_data = \
                train_dataset[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            batch_labels = \
                train_labels[minibatch_index * batch_size:(minibatch_index + 1) * batch_size, :]
            
            feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            _, l, predictions = sess.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
            
            iter = (epoch - 1) * n_train_batches + minibatch_index # cumulative iteration number
        
            if (iter + 1) % valid_freq == 0:
                this_valid_loss = 1. - (accuracy(valid_prediction.eval(), valid_labels) / 100.)
            
                print("Minibatch loss at epoch %i and iter %i: %f and learning rate: %f" % 
                      (epoch, iter, l, sess.run(learning_rate)))
                print("Minibatch  train, validation and sanitized validation accuracy: %.1f%%, %.1f%%, %.1f%%" 
                    % (accuracy(predictions, batch_labels), 
                    accuracy(valid_prediction.eval(), valid_labels),
                    accuracy(sanit_valid_prediction.eval(), sanit_valid_labels)))
            
                if this_valid_loss < best_valid_loss:
                    if this_valid_loss < best_valid_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    
                    best_valid_loss = this_valid_loss
                    
                    # fetch all parameters in a single session call
                    params = tuple(sess.run([W_0, b_0, W_1, b_1, W_2, b_2, W_3, b_3]))
                    
                    # save the best model
                    with open('best_model_params.pkl', 'wb') as f:
                            pickle.dump(params, f)
                    print('Model saved')
        
            if patience <= iter:
                    done_looping = True
                    break
                    
    print("Final Test accuracy: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))
    
    with open('best_model_params.pkl', 'rb') as f:
        W_0_best, b_0_best, W_1_best, b_1_best, W_2_best, b_2_best, W_3_best, b_3_best = pickle.load(f)
        
    W_0_init, b_0_init = tf.assign(W_0, W_0_best), tf.assign(b_0, b_0_best)
    W_1_init, b_1_init = tf.assign(W_1, W_1_best), tf.assign(b_1, b_1_best)
    W_2_init, b_2_init = tf.assign(W_2, W_2_best), tf.assign(b_2, b_2_best)
    W_3_init, b_3_init = tf.assign(W_3, W_3_best), tf.assign(b_3, b_3_best)
    
    sess.run([W_0_init, b_0_init, W_1_init, b_1_init, W_2_init, b_2_init, W_3_init, b_3_init])
    
    print("Test accuracy with the best model: %.1f%%" 
                  % accuracy(test_prediction.eval(), test_labels))
    
    print("Sanitized test accuracy with the best model: %.1f%%" 
                  % accuracy(sanit_test_prediction.eval(), sanit_test_labels))

    end_time = timeit.default_timer()

    print('Total run time %.4f minutes' % ((end_time - start_time) / 60.))

Initialized
Minibatch loss at epoch 1 and iter 1561: 0.630508 and learning rate: 0.050000
Minibatch  train, validation and sanitized validation accuracy: 89.8%, 84.3%, 83.3%
Model saved
Minibatch loss at epoch 2 and iter 3123: 0.466348 and learning rate: 0.050000
Minibatch  train, validation and sanitized validation accuracy: 90.6%, 85.4%, 84.5%
Model saved
Minibatch loss at epoch 3 and iter 4685: 0.451205 and learning rate: 0.050000
Minibatch  train, validation and sanitized validation accuracy: 89.8%, 86.3%, 85.4%
Model saved
Minibatch loss at epoch 4 and iter 6247: 0.441148 and learning rate: 0.047500
Minibatch  train, validation and sanitized validation accuracy: 89.8%, 86.6%, 85.8%
Model saved
Minibatch loss at epoch 5 and iter 7809: 0.377883 and learning rate: 0.047500
Minibatch  train, validation and sanitized validation accuracy: 91.4%, 87.1%, 86.3%
Model saved
Minibatch loss at epoch 6 and iter 9371: 0.371348 and learning rate: 0.047500
Minibatch  train, validation and sanitiz

Minibatch loss at epoch 49 and iter 76537: 0.185925 and learning rate: 0.023165
Minibatch  train, validation and sanitized validation accuracy: 96.9%, 90.4%, 89.7%
Minibatch loss at epoch 50 and iter 78099: 0.144480 and learning rate: 0.023165
Minibatch  train, validation and sanitized validation accuracy: 97.7%, 90.5%, 89.8%
Model saved
Minibatch loss at epoch 51 and iter 79661: 0.138634 and learning rate: 0.023165
Minibatch  train, validation and sanitized validation accuracy: 96.9%, 90.5%, 89.8%
Model saved
Minibatch loss at epoch 52 and iter 81223: 0.223114 and learning rate: 0.022006
Minibatch  train, validation and sanitized validation accuracy: 96.9%, 90.3%, 89.5%
Minibatch loss at epoch 53 and iter 82785: 0.120496 and learning rate: 0.022006
Minibatch  train, validation and sanitized validation accuracy: 97.7%, 90.5%, 89.8%
Minibatch loss at epoch 54 and iter 84347: 0.157761 and learning rate: 0.022006
Minibatch  train, validation and sanitized validation accuracy: 97.7%, 90.5%

Minibatch loss at epoch 98 and iter 153075: 0.125935 and learning rate: 0.010732
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.0%, 90.3%
Model saved
Minibatch loss at epoch 99 and iter 154637: 0.119884 and learning rate: 0.010732
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.0%, 90.3%
Minibatch loss at epoch 100 and iter 156199: 0.156949 and learning rate: 0.010195
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 90.9%, 90.2%
Minibatch loss at epoch 101 and iter 157761: 0.116754 and learning rate: 0.010195
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.0%, 90.2%
Minibatch loss at epoch 102 and iter 159323: 0.119325 and learning rate: 0.010195
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.0%, 90.3%
Minibatch loss at epoch 103 and iter 160885: 0.147021 and learning rate: 0.009686
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.1%, 

Minibatch loss at epoch 147 and iter 229613: 0.110010 and learning rate: 0.004972
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.2%, 90.5%
Minibatch loss at epoch 148 and iter 231175: 0.100133 and learning rate: 0.004723
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.2%, 90.5%
Minibatch loss at epoch 149 and iter 232737: 0.091111 and learning rate: 0.004723
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.2%, 90.5%
Minibatch loss at epoch 150 and iter 234299: 0.065397 and learning rate: 0.004723
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.1%, 90.5%
Minibatch loss at epoch 151 and iter 235861: 0.115737 and learning rate: 0.004487
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.2%, 90.5%
Minibatch loss at epoch 152 and iter 237423: 0.135689 and learning rate: 0.004487
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.2%, 90.5%
Mini

Minibatch loss at epoch 196 and iter 306151: 0.128146 and learning rate: 0.002188
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.3%, 90.7%
Minibatch loss at epoch 197 and iter 307713: 0.158499 and learning rate: 0.002188
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.3%, 90.6%
Minibatch loss at epoch 198 and iter 309275: 0.082163 and learning rate: 0.002188
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.3%, 90.7%
Minibatch loss at epoch 199 and iter 310837: 0.080006 and learning rate: 0.002079
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.4%, 90.7%
Minibatch loss at epoch 200 and iter 312399: 0.136036 and learning rate: 0.002079
Minibatch  train, validation and sanitized validation accuracy: 98.4%, 91.3%, 90.6%
Final Test accuracy: 96.5%
Test accuracy with the best model: 96.5%
Sanitized test accuracy with the best model: 96.1%
Total run time 122.5897 minutes
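
The learning rates printed in the log above can be checked against the staircase schedule configured in the graph (`0.05` initial rate, `5000` decay steps, `0.95` decay rate). The following is a minimal sketch of that schedule in plain Python; the function name `staircase_decay` is hypothetical, and it assumes that `global_step` equals `iter + 1` when the rate is printed, because the optimizer has already incremented it.

```python
def staircase_decay(initial_rate, global_step, decay_steps, decay_rate):
    """Staircase exponential decay: the rate drops once every `decay_steps` steps."""
    return initial_rate * decay_rate ** (global_step // decay_steps)

# `iter` values copied from the log above.
for it in (1561, 6247, 76537, 153075, 312399):
    print(it, '%.6f' % staircase_decay(0.05, it + 1, 5000, 0.95))
```

After 200 epochs the rate has been multiplied by 0.95 a total of 62 times, leaving it at roughly 4% of its starting value, which matches the slow late-stage changes in validation accuracy.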


---