# Overview

Now that you understand the intuition behind CNNs as well as the key building components, it is time to get some in-depth, hands-on experience with training a network (making it learn). The key is to understand the effects of **hyperparameters** on training and validation accuracy.

## Restarting your Virtual Machine
​
If at any point during this assignment you accidentally execute code or do something that you cannot undo and need to "restart" the system (including deleting all temporary folders), run the following single line of code. The restart takes about 1 minute. Afterwards, you will have to go back to the beginning of the assignment to re-download the data and re-run the code you have written. Note that the code you have already written will **not** be deleted; you simply need to execute it again from the start.

In [0]:
!kill -9 -1 # Warning this restarts your machine

## Downloading the Data

Since we have just about mastered the MNIST dataset, it's time to move on to something (slightly) more challenging. In this assignment we will be using the CIFAR dataset. Let's download and import it now.

In [0]:
from keras.datasets import cifar10
(x, y), (x_test, y_test) = cifar10.load_data()

## Loading the Data

The CIFAR dataset consists of many small thumbnail RGB images (32 x 32 x 3) across many classes of everyday objects. In this assignment we will specifically be using a subset of CIFAR known as CIFAR-10 containing just 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. For each class, 6,000 examples are provided for a total of 60,000 images (50,000 for training, 10,000 for validation).



In [0]:
print(x.shape)
print(y.shape)

import numpy as np
import pylab

# View 16 random examples
fig = pylab.figure()
for i in range(16):
    index = np.random.randint(50000)
    im = x[index]
    fig.add_subplot(4, 4, i + 1)
    pylab.imshow(im)
    pylab.axis('off')
    
pylab.show()

# Hyperparameters

Recall that while **parameters** represent the actual values of the trainable weights, the term **hyperparameters** represents values that dictate the strategy for weight training. In the previous few assignments, we have already been exposed to various *optimizer types* and initial *learning rates*. Recall that generally the syntax is:
```
train_op = tf.train.[OptimizerType](learning_rate).minimize(loss)
```
Here `OptimizerType` is the particular optimization strategy (`AdamOptimizer` was the default in many previous assignments) and `learning_rate` is a floating-point value, commonly starting between 1e-2 and 2e-4. Additionally, recall that the *batch size* is the number of images used in each iteration to determine the update step. While a smaller batch size leads to faster updates (more steps per unit time), a larger batch size averages the gradient over more images and therefore gives a less noisy estimate of each update.

![Batch Size](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_06-01.png)
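
To make the batch-size trade-off concrete, here is a minimal NumPy sketch. It does not use a real network: the "per-example gradients" are simply random numbers drawn around a made-up true value of 1.0 (an assumption for illustration), and `batch_gradient_spread` is a hypothetical helper. It measures how much the batch-averaged value fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example "gradients": noisy measurements of a true value of 1.0
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=50000)

def batch_gradient_spread(batch_size, trials=2000):
    """Standard deviation of the batch-averaged gradient across many random batches."""
    idx = rng.integers(0, per_example_grads.size, size=(trials, batch_size))
    batch_means = per_example_grads[idx].mean(axis=1)
    return batch_means.std()

spread_small = batch_gradient_spread(16)
spread_large = batch_gradient_spread(256)
print('batch_size=16  spread = %0.3f' % spread_small)
print('batch_size=256 spread = %0.3f' % spread_large)
```

The spread shrinks roughly with the square root of the batch size, so the larger batch gives a steadier update at the cost of processing more images per step.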

## Preventing Overfitting

While optimizer type, learning rate and batch size are commonly tuned hyperparameters for improving optimization, quite often an algorithm will in fact become *too accurate* on the training data and require separate hyperparameters or strategies to prevent overfitting. Recall that one easy strategy that we have already been exposed to in earlier lessons is the concept of *early stopping*. Several others that we will discuss include L2 regularization, dropout and batch normalization.
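
As a reminder of how early stopping can work, here is a minimal sketch in plain Python. The function name `early_stopping_step`, the `patience` parameter and the list of validation losses are all assumptions made up for illustration; this is one common variant of the rule, not necessarily the one used in earlier lessons.

```python
def early_stopping_step(val_losses, patience=3):
    """Return the index at which training would stop, or None if it never triggers.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive checks.
    """
    best = float('inf')
    checks_without_improvement = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step
    return None

# Hypothetical validation losses: improving at first, then getting worse
val_losses = [2.0, 1.6, 1.4, 1.3, 1.35, 1.4, 1.5, 1.7]
stop_at = early_stopping_step(val_losses, patience=3)
print(stop_at)
```

In this toy sequence the best loss occurs at index 3, and the rule triggers three checks later.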

### L2 regularization

L2 regularization is a strategy to limit the reliance on any given neuron. It is implemented by adding a second term to the loss function whereby the optimizer concurrently attempts to limit the squared magnitude of all weights in the network (in addition to trying to find weights that will result in high network accuracy / low loss).

![L2 Regularization](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_06-02.png)

Practically, this is implemented by simply taking the sum of all weights squared, then multiplying that total by a constant that indicates how much the loss function should be affected by the L2 regularization component. Recall that the `**` operator raises a value to an exponent in Python. Also keep in mind that `tf.reduce_sum()` is the TensorFlow function that adds up all elements in a tensor and returns the sum total (similar to `np.sum()` in NumPy).

In [0]:
import tensorflow as tf

w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
w2 = tf.placeholder(tf.float32, [5, 5, 16, 32])

# Define L2 regularization
l2_reg = tf.reduce_sum(w1 ** 2) + tf.reduce_sum(w2 ** 2) 

# Determine weighting of L2 regularization
l2_reg_constant = 0.1   # Use less L2 regularization
l2_reg_constant = 5.0   # Use more L2 regularization (overrides the line above)
l2_reg = l2_reg * l2_reg_constant

print(l2_reg)
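
Because the cell above uses placeholders, it only prints a symbolic tensor. To see actual numbers, here is a small NumPy sketch of the same arithmetic. The weight values and the `data_loss` value are made up for illustration, and `l2_penalty` is a hypothetical helper name.

```python
import numpy as np

def l2_penalty(weights, l2_reg_constant):
    """Sum of squared weights across all matrices, scaled by a constant."""
    return l2_reg_constant * sum(np.sum(w ** 2) for w in weights)

w1 = np.array([[1.0, -2.0], [0.5, 0.0]])   # squares sum to 1 + 4 + 0.25 = 5.25
w2 = np.array([3.0, -1.0])                 # squares sum to 9 + 1 = 10

data_loss = 0.8  # hypothetical cross-entropy loss value
penalty = l2_penalty([w1, w2], l2_reg_constant=0.1)
total_loss = data_loss + penalty           # the quantity the optimizer would minimize
print(penalty, total_loss)
```

With a larger constant the penalty dominates the total, so the optimizer is pushed harder toward small weights.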

### Dropout

Dropout is another strategy to limit the reliance on any given neuron. The idea is that if an algorithm's prediction is being driven almost entirely by one neuron / image feature (e.g. one portion of the image), it is unlikely to be a pattern that generalizes to other images. By randomly "turning off" neurons during the training process, the algorithm is never allowed to rely too much on any given neuron output. In the figure below, the neurons marked in red are "turned off" and do *not* contribute to the final prediction.

![Dropout](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_06-03.png)

Recall that while the technique can be used anywhere in your network, it is most commonly used in the fully connected (non-convolutional) portion.

In [0]:
# Define an arbitrary feature map and first hidden layer weights
feature_map = tf.placeholder(tf.float32, [None, 8, 8, 32])
w1 = tf.placeholder(tf.float32, [8 * 8 * 32, 128])

# Reshape and matmul
feature_map = tf.reshape(feature_map, shape=[-1, 8 * 8 * 32])
hidden = tf.matmul(feature_map, w1)
dropped = tf.nn.dropout(hidden, keep_prob=0.75) # 75% of neurons kept / 25% dropped

print(dropped.shape)
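
To see what "turning off" neurons does to the values themselves, here is a NumPy sketch of the masking step. It is an illustration rather than TensorFlow's actual implementation; it mirrors the documented behavior of `tf.nn.dropout`, which scales the kept elements up by `1 / keep_prob` so that the expected activation stays the same. The `dropout` helper below is hypothetical.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Zero each element with probability (1 - keep_prob); rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0), mask

rng = np.random.default_rng(0)
hidden = np.ones((1000, 128), dtype=np.float32)   # stand-in for hidden activations
dropped, mask = dropout(hidden, keep_prob=0.75, rng=rng)

print('fraction kept: %0.3f' % mask.mean())
print('mean activation after dropout: %0.3f' % dropped.mean())
```

Roughly 75% of the entries survive, and because the survivors are scaled up, the average activation stays close to its original value of 1.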

### Batch Normalization

Batch normalization is a technique whereby the feature maps for *all* images in the batch are normalized together (a single mean / SD per feature channel across the entire batch). As you recall, this means that the network's prediction for an image depends slightly on the other images in the batch (e.g. an image will be interpreted slightly differently every time it is passed to the network as part of a different batch). Overall, this increases the "diversity" of your training examples.

The only thing to keep in mind is that while **during training** the mean and SD of the current batch are used at each layer, **during inference (or validation)** the *population* mean and SD are used instead. This makes sense in that during inference or validation you may occasionally want to test an image by itself in isolation, but in such a situation no other images would be available to calculate a mini-batch mean and SD. To get around this, the algorithm keeps track of the average mean and SD of feature maps for all images used for training, and simply uses these values to approximate a "virtual" mini-batch for the purposes of batch normalization.

In [0]:
# Define an arbitrary input image and first convolutional weights
im = tf.placeholder(tf.float32, [None, 32, 32, 3])
w1 = tf.placeholder(tf.float32, [3, 3, 3, 16])

# Convolution and batch-norm
output = tf.nn.conv2d(im, filter=w1, strides=[1, 1, 1, 1], padding='SAME')
print(output.shape)
output = tf.layers.batch_normalization(output, training=True)   # During training
print(output.shape)
output = tf.layers.batch_normalization(output, training=False)  # During validation

print(output.shape)
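
The difference between the two modes can be sketched in NumPy. This is a simplified illustration, not the TensorFlow implementation: it leaves out the learned scale and shift parameters, and the population statistics are hypothetical values supplied by hand rather than accumulated during training. The helper name `batch_norm` and the `eps` value are assumptions.

```python
import numpy as np

def batch_norm(x, training, pop_mean, pop_var, eps=1e-3):
    """Normalize each channel (last axis) of x.

    training=True  -> use the mean / variance of the current batch
    training=False -> use the supplied population mean / variance
    """
    if training:
        axes = tuple(range(x.ndim - 1))   # every axis except channels
        mean = x.mean(axis=axes)
        var = x.var(axis=axes)
    else:
        mean, var = pop_mean, pop_var
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(64, 8, 8, 16))

# Hypothetical population statistics (in practice accumulated during training)
pop_mean = np.full(16, 5.0)
pop_var = np.full(16, 4.0)

train_out = batch_norm(batch, True, pop_mean, pop_var)
single = batch_norm(batch[:1], False, pop_mean, pop_var)
print(train_out.mean(), train_out.std())
print(single.shape)
```

Note that the inference call works on a single image, since it never needs statistics from the other images in a batch.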

# Training CIFAR-10

It's now time to put together a basic CNN to tackle the CIFAR-10 dataset. Let us first define a basic building block for our network:

Block =

* convolution with 3 x 3 filter, stride 1
* ReLU non-linearity
* convolution with 3 x 3 filter, stride 2
* ReLU non-linearity

Based off of this definition, let us create a CNN with the following architecture:

* Block 1 (filter depth of 32) - output 16 x 16 x 32
* Block 2 (filter depth of 64) - output 8 x 8 x 64
* Reshape - output 4096
* Hidden layer - output 128
* Logits - output 10
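
The output sizes listed above can be checked with a few lines of arithmetic. The sketch below assumes `'SAME'` padding (as used in the code that follows), for which the spatial output size is the input size divided by the stride, rounded up. The helper names are made up for illustration.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a convolution with 'SAME' padding."""
    return math.ceil(size / stride)

def block_output(size, depth):
    """Apply one block: a stride-1 convolution followed by a stride-2 convolution."""
    size = same_padding_output(size, stride=1)
    size = same_padding_output(size, stride=2)
    return size, depth

size = 32
shapes = []
for depth in (32, 64):
    size, depth = block_output(size, depth)
    shapes.append((size, size, depth))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # [(16, 16, 32), (8, 8, 64)]
print(flattened)  # 4096
```

Only the stride-2 convolution in each block changes the spatial size, which is why each block halves the width and height.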

In [0]:
def create_model():
    """
    Method to create the following CNN architecture:
    
      - BLOCK 1 (filter depth of 32)
      - BLOCK 2 (filter depth of 64)
      - RESHAPE
      - HIDDEN LAYER (128 nodes)
      - LOGIT SCORES (10 nodes)
      
    """
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    #
    # 1. Define images
    # 
    #  - type: float32
    #  - size: [None, 32, 32, 3] so that we can feed in as many images as we need
    # 
    # 2. Define labels
    # 
    #  - type: int64
    #  - size: [None] so that we can feed in as many labels as we need
    # 
    # ------------------------------------------------------------------------

    im = tf.placeholder(?)
    labels = tf.placeholder(?)

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------
    #
    # As in previous assignments, we will use the tf.get_variable(...) method
    # to create matrix variables initialized to random values.
    # 
    # ------------------------------------------------------------------------

    # Block 1 (use a filter depth of 32)
    w1 = tf.get_variable('w1', ?, dtype=tf.float32)
    w2 = tf.get_variable('w2', ?, dtype=tf.float32)
    layer = tf.nn.relu(tf.nn.conv2d(im, w1, strides=?, padding='SAME'))
    layer = tf.nn.relu(tf.nn.conv2d(layer, w2, strides=?, padding='SAME'))

    # Block 2 (use a filter depth of 64)
    w3 = tf.get_variable('w3', ?, dtype=tf.float32)
    w4 = tf.get_variable('w4', ?, dtype=tf.float32)
    layer = tf.nn.relu(tf.nn.conv2d(layer, w3, strides=?, padding='SAME'))
    layer = tf.nn.relu(tf.nn.conv2d(layer, w4, strides=?, padding='SAME'))
    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    
    flattened = tf.reshape(layer, [-1, 8 * 8 * 64])

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer (128 nodes)
    w5 = tf.get_variable('w5', ?, dtype=tf.float32)
    h1 = tf.nn.relu(tf.matmul(flattened, w5))
    
    # Logits (10 nodes)
    w6 = tf.get_variable('w6', ?, dtype=tf.float32)
    logits = tf.matmul(h1, w6)

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    # 
    # HINT: use tf.losses.sparse_softmax_cross_entropy() as above
    #
    # ------------------------------------------------------------------------

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------

    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, [w1, w2, w3, w4, w5, w6], logits, loss, train_op

# ------------------------------------------------------------------------
# Test our model
# ------------------------------------------------------------------------
# 
# If the graph was defined properly, we should be able to check what the
# model outputs look like. Can you guess what the shapes of our logits
# and loss will be?
# 
# ------------------------------------------------------------------------

im, labels, weights, logits, loss, train_op = create_model()
print(logits.shape)
print(loss.shape)

## Initialization

Now we set up some code to initialize our network graph, variables and new saver object. This code is identical to the earlier assignments.

In [0]:
# ------------------------------------------------------------------------
# Create our model
# ------------------------------------------------------------------------

im, labels, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections
# ------------------------------------------------------------------------
# 
# Collections are used by TensorFlow to keep track of certain intermediate 
# values for quick access during save/load functions.
# 
# ------------------------------------------------------------------------

tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)

# ------------------------------------------------------------------------
# Initialize our test graph
# ------------------------------------------------------------------------
# 
# What two things do we need to initialize our graph?
# 
# ------------------------------------------------------------------------

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# ------------------------------------------------------------------------
# Initialize our saver
# ------------------------------------------------------------------------
# 
# Initialize a Saver object
# 
# ------------------------------------------------------------------------

saver = tf.train.Saver()

## Training

Let's train our algorithm!

In [0]:
# ------------------------------------------------------------------------
# Train our algorithm 
# ------------------------------------------------------------------------
# 
# Let's set up a loop to train our algorithm by feeding it data iteratively.
# For each iteration, we will feed a batch_size number of images into our 
# model and let it readjust its neuronal weights.
# 
# ------------------------------------------------------------------------

def train_model(iterations=1000, batch_size=256):
    
    accuracies = []
    losses = []

    for i in range(iterations):

        # --------------------------------------------------------------------
        # Grab a total of batch_size number of random images and labels 
        # --------------------------------------------------------------------
        # 
        # 1. Pick batch_size number of random indices between 0 and 50,000
        # 2. Select those images / labels
        #
        # --------------------------------------------------------------------

        rand_indices = np.random.randint(?, size=(batch_size))
        x_batch = x[?]
        y_batch = y[?]
       
        y_batch = y_batch[..., 0]
      
        # --------------------------------------------------------------------
        # Normalize x_batch
        # --------------------------------------------------------------------
        # 
        # Currently, values in x range from 0 to 255. If we normalize these values
        # to a mean of 0 and SD of 1 we will improve the stability of training
        # and furthermore improve interpretation of learned weights. Use the
        # following code to normalize your batch:
        # 
        # --------------------------------------------------------------------

        x_batch = (x_batch - np.mean(x_batch)) / np.std(x_batch)

        # Convert to types matching our defined placeholders
        x_batch = x_batch.astype('float32')
        y_batch = y_batch.astype('int64')

        # Prepare feed_dict
        feed_dict = {?}

        # --------------------------------------------------------------------
        # Run training iteration via sess.run()
        # --------------------------------------------------------------------
        # 
        # Here, in addition to whichever outputs we wish to extract, we need to
        # also include the train_op variable. Including train_op will tell 
        # TensorFlow that in addition to calculating the intermediates of our graph,
        # we also need to readjust the variables so that the overall loss goes
        # down.
        # 
        # --------------------------------------------------------------------

        outputs = sess.run([logits, loss, train_op], feed_dict)

        # --------------------------------------------------------------------
        # Use argmax to determine highest logit (model guess)
        # --------------------------------------------------------------------
        # 
        # Keep in mind our logits matrix is (batch_size x 10) in size representing
        # a total of batch_size number of predictions. How do we process this matrix
        # with the np.argmax() to find the highest logit along each row of the matrix
        # (e.g. find the prediction for each of our images)?
        # 
        # HINT: what does the axis parameter in np.argmax(a, axis) specify?
        # 
        # --------------------------------------------------------------------

        predictions = np.argmax(?)

        # --------------------------------------------------------------------
        # Calculate accuracy 
        # --------------------------------------------------------------------
        # 
        # Consider the following:
        # 
        # - predictions = the predicted classes
        # - y_batch = the ground-truth classes
        # 
        # How do I calculate an accuracy % with this data?
        # 
        # --------------------------------------------------------------------

        accuracy = ?

        # --------------------------------------------------------------------
        # Accumulate and print iteration, loss and accuracy 
        # --------------------------------------------------------------------

        print('Iteration %05i | Loss = %07.3f | Accuracy = %0.4f' %
            (i + 1, outputs[1], accuracy))

        losses.append(outputs[1])
        accuracies.append(accuracy)
        
    return losses, accuracies
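
If the `axis` argument of `np.argmax()` is unclear, it can help to try it on a tiny example first. The sketch below uses a made-up logits matrix for 4 images and 3 classes (the `toy_` names and values are purely illustrative), then compares the per-row predictions with made-up labels.

```python
import numpy as np

# Toy logits for 4 images and 3 classes (rows = images, columns = classes)
toy_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [-0.5, 0.9, 0.0],
    [1.1, 1.0, 0.2],
])
toy_labels = np.array([0, 2, 0, 0])

# axis=1 searches across the columns of each row, giving one class per image
toy_predictions = np.argmax(toy_logits, axis=1)

# Fraction of images whose predicted class matches the label
toy_accuracy = np.mean(toy_predictions == toy_labels)

print(toy_predictions)
print(toy_accuracy)
```

Three of the four toy predictions match their labels here; the third image is misclassified.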

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=1000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------

import pylab
pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
# 
# In this step, all model variables and the underlying graph structure
# are saved so that they can be reloaded. Although it looks like just one
# file is saved here, this single line of code in fact writes several
# files, including the *.ckpt checkpoint data and a *.ckpt.meta graph file.
#  
# --------------------------------------------------------------------

import os
model_file = './model_cnn_32_64_128/model.ckpt'
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

# Running inference

Now that we have a trained model, let's go ahead and see how it performs! We will use the same procedure as before to load up a trained network and then feed in random images to see how it fares.

In [0]:
# Load the saved model
tf.reset_default_graph()
sess = tf.InteractiveSession()
saver = tf.train.import_meta_graph('./model_cnn_32_64_128/model.ckpt.meta')
saver.restore(sess, './model_cnn_32_64_128/model.ckpt')

# Find our placeholders
im = tf.get_collection('im')[0]
logits = tf.get_collection('logits')[0]

# Pick a random image (note: x holds the training images)
i = int(np.random.randint(50000))
image = x[i].reshape(1, 32, 32, 3)
label = y[i, 0]

# Normalize the image
image = (image - np.mean(image)) / np.std(image)

# Create a feed_dict
feed_dict = {im: image}

# Pass data through the network
l = sess.run(logits, feed_dict)

# Convert logits to predictions
prediction = np.argmax(l)

# Pred dictionary
pred_dict = {
    0: 'airplane',
    1: 'automobile',
    2: 'bird',
    3: 'cat',
    4: 'deer',
    5: 'dog',
    6: 'frog',
    7: 'horse',
    8: 'ship',
    9: 'truck'
}

# Visualize
pylab.imshow(x[i])
pylab.axis('off')
pylab.title('My prediction is %s' % pred_dict[prediction])
pylab.show()

# Model validation

Using the template code to run inference shown above, let us now write code to:

* load our saved model 
* create a `feed_dict` with new test data 
* pass through network using `sess.run()`
* convert the `logits` to predictions
* calculate overall network accuracy

In [0]:
def validate_model(model_file):
    """
    Method to test the validation performance of a model using the 
    test set data.
    
    :params
    
      (str) model_file : name of model file saved by saver object
      
    """
    # Load saved model
    tf.reset_default_graph()
    sess = tf.InteractiveSession()
    saver = tf.train.import_meta_graph('%s.meta' % model_file )
    saver.restore(sess, model_file)

    # Find our placeholders
    im = tf.get_collection('im')[0]
    logits = tf.get_collection('logits')[0]

    # Normalize our input data x_test
    input_data = (x_test - np.mean(x_test)) / np.std(x_test)
    
    # -------------------------------------------------------
    # Create a feed_dict
    # -------------------------------------------------------
    # 
    # HINT: What do we need to do to properly format this image
    # for input into the CNN?
    # 
    # -------------------------------------------------------

    feed_dict = {?}

    # Pass data through the network using sess.run() to get our logits 
    output = sess.run(?)

    # Convert logits to predictions
    predictions = np.argmax(output, axis=1)

    # Compare predictions to ground-truth to find accuracy
    accuracy = np.sum(predictions == y_test[..., 0]) / x_test.shape[0]
    
    print('Network test-set accuracy: %0.4f' % accuracy)
    
# Pass our model_file
model_file = './model_cnn_32_64_128/model.ckpt' 
validate_model(model_file)
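
Passing all 10,000 test images through the network in a single `sess.run()` call may use a lot of memory. One possible alternative, sketched below as an assumption rather than as part of the assignment, is to evaluate in chunks and add up the number of correct predictions. The `chunked_accuracy` helper, the `chunk_size` value and the stand-in `fake_predict_fn` are all hypothetical; with the real model, `predict_fn` could be something like `lambda batch: sess.run(logits, {im: batch})`.

```python
import numpy as np

def chunked_accuracy(predict_fn, images, labels, chunk_size=1000):
    """Accuracy computed by passing the images through predict_fn in chunks."""
    correct = 0
    for start in range(0, len(images), chunk_size):
        chunk_logits = predict_fn(images[start:start + chunk_size])
        predictions = np.argmax(chunk_logits, axis=1)
        correct += np.sum(predictions == labels[start:start + chunk_size])
    return correct / len(images)

# Stand-in for the network: the "logits" are just the first 10 values of each row
def fake_predict_fn(batch):
    return batch[:, :10]

rng = np.random.default_rng(0)
fake_images = rng.normal(size=(2500, 32))
fake_labels = np.argmax(fake_images[:, :10], axis=1)
fake_labels[:500] = (fake_labels[:500] + 1) % 10   # deliberately corrupt 500 labels

acc = chunked_accuracy(fake_predict_fn, fake_images, fake_labels)
print(acc)
```

The last chunk is allowed to be smaller than `chunk_size`, so the total number of images does not need to divide evenly.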

## Notes

How did the algorithm perform? Did it finish converging? Train a few more thousand iterations, and feel free to change the learning rate and/or batch size as needed to maximize overall algorithm accuracy. What is the final performance on training and test data? 

If you've tested a few different hyperparameter configurations, you will likely conclude that the algorithm seems to be performing a bit worse on the test data compared to the training data; in other words the algorithm is *overfitting*. Let's go through the following exercises to see what we can do about that.

# Exercises

In these exercises, we explore the effect of several strategies used to limit overfitting.

## Exercise 1

Re-train the same network, but this time use batch normalization after the convolutional operation and before the ReLU non-linearity for all layers:

```
conv > batch-norm > ReLU > ...
```

Keep in mind that because training mode needs to be specified for batch normalization, we need to create a **new** placeholder. What do you expect to happen to training accuracy? What do you expect to happen to validation accuracy?

In [0]:
def create_model():
    """
    Method to create the following CNN architecture:
    
      - BLOCK 1 (filter depth of 32)
      - BLOCK 2 (filter depth of 64)
      - RESHAPE
      - HIDDEN LAYER (128 nodes)
      - LOGIT SCORES (10 nodes)
      
    """
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    #
    # 1. Define images
    # 
    #  - type: float32
    #  - size: [None, 32, 32, 3] so that we can feed in as many images as we need
    # 
    # 2. Define labels
    # 
    #  - type: int64
    #  - size: [None] so that we can feed in as many labels as we need
    #
    # 3. Define training mode placeholder (for batch normalization)
    # 
    # ------------------------------------------------------------------------

    im = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])
    labels = tf.placeholder(tf.int64, shape=[None])
    training = tf.placeholder(tf.bool)

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------
    #
    # As in previous assignments, we will use the tf.get_variable(...) method
    # to create matrix variables initialized to random values.
    # 
    # ------------------------------------------------------------------------

    # Block 1 (use a filter depth of 32)
    w1 = tf.get_variable('w1', shape=[3, 3, 3, 32], dtype=tf.float32)
    w2 = tf.get_variable('w2', shape=[3, 3, 32, 32], dtype=tf.float32)
    layer = tf.nn.conv2d(im,w1,strides=[1, 1, 1, 1], padding='SAME')
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))
    layer = tf.nn.conv2d(layer, w2, strides=[1, 2, 2, 1], padding='SAME')
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))

    # Block 2 (use a filter depth of 64)
    w3 = tf.get_variable('w3', shape=[3, 3, 32, 64], dtype=tf.float32)
    w4 = tf.get_variable('w4', shape=[3, 3, 64, 64], dtype=tf.float32)
    layer = tf.nn.conv2d(layer, w3, strides = [1,1,1,1], padding='SAME')
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))
    layer = tf.nn.conv2d(layer, w4, strides = [1,2,2,1], padding='SAME')
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))

    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    
    flattened = tf.reshape(layer, [-1, 8 * 8 * 64])

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer (128 nodes)
    w5 = tf.get_variable('w5', shape=[8*8*64,128], dtype=tf.float32)
    h1 = tf.matmul(flattened, w5)
    h1 = tf.nn.relu(tf.layers.batch_normalization(h1, training=training))
    
    # Logits (10 nodes)
    w6 = tf.get_variable('w6', shape=[128,10], dtype=tf.float32)
    logits = tf.matmul(h1, w6)

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, training, [w1, w2, w3, w4, w5, w6], logits, loss, train_op
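
The `update_ops` lines in the optimizer section deserve a note. In TensorFlow 1.x, `tf.layers.batch_normalization` places the operations that update its moving mean and variance in the `tf.GraphKeys.UPDATE_OPS` collection, and wrapping `train_op` in `tf.control_dependencies(update_ops)` makes sure they run on every training step. The plain-Python sketch below illustrates the kind of moving-average update involved. It assumes the default momentum of 0.99 and a moving mean that starts at zero; the constant batch mean of 5.0 and the helper name are made up for illustration.

```python
def update_moving_average(moving, batch_value, momentum=0.99):
    """One exponential-moving-average step toward the current batch statistic."""
    return momentum * moving + (1.0 - momentum) * batch_value

moving_mean = 0.0          # the moving mean starts at zero
true_batch_mean = 5.0      # hypothetical (constant) batch mean of a feature map

history = []
for step in range(1000):
    moving_mean = update_moving_average(moving_mean, true_batch_mean)
    history.append(moving_mean)

print('after   10 updates: %0.3f' % history[9])
print('after  100 updates: %0.3f' % history[99])
print('after 1000 updates: %0.3f' % history[999])
```

If the update operations never run, the moving statistics stay at their initial values, and validation mode (`training=False`) normalizes with the wrong mean and SD. The slow convergence also suggests why a model trained for only a few iterations may validate poorly even when its training accuracy looks reasonable.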

In [0]:
# ------------------------------------------------------------------------
# Create model 
# ------------------------------------------------------------------------
im, labels, training, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections, initialize graph and saver
# ------------------------------------------------------------------------
tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)
tf.add_to_collection('training', training)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
saver = tf.train.Saver()

### Updating train_model() 

Here we need to define a slightly different `train_model()` function to account for the extra placeholder used to tell the algorithm whether batch normalization is in training or validation mode.

In [0]:
# ------------------------------------------------------------------------
# Train our algorithm 
# ------------------------------------------------------------------------
# 
# Let's set up a loop to train our algorithm by feeding it data iteratively.
# For each iteration, we will feed a batch_size number of images into our 
# model and let it readjust its neuronal weights.
# 
# ------------------------------------------------------------------------

def train_model(iterations=1000, batch_size=256):
    
    accuracies = []
    losses = []

    for i in range(iterations):

        # --------------------------------------------------------------------
        # Grab a total of batch_size number of random images and labels 
        # --------------------------------------------------------------------

        rand_indices = np.random.randint(50000, size=(batch_size))
        x_batch = x[rand_indices]
        y_batch = y[rand_indices]
        y_batch = y_batch[..., 0]

        # --------------------------------------------------------------------
        # Normalize x_batch
        # --------------------------------------------------------------------
        x_batch = (x_batch - np.mean(x_batch)) / np.std(x_batch)

        # Convert to types matching our defined placeholders
        x_batch = x_batch.astype('float32')
        y_batch = y_batch.astype('int64')

        # --------------------------------------------------------------------
        # Prepare feed_dict
        # --------------------------------------------------------------------
        # 
        # What extra piece of information (which extra placeholder) do we need
        # to fill in when using batch normalization?
        #
        # --------------------------------------------------------------------
        feed_dict = {im: x_batch, labels: y_batch, training: True}

        # --------------------------------------------------------------------
        # Run training iteration via sess.run()
        # --------------------------------------------------------------------
        outputs = sess.run([logits, loss, train_op], feed_dict)

        # --------------------------------------------------------------------
        # Use argmax to determine highest logit (model guess)
        # --------------------------------------------------------------------
        predictions = np.argmax(outputs[0], axis=1)

        # --------------------------------------------------------------------
        # Calculate accuracy 
        # --------------------------------------------------------------------
        accuracy = np.sum(predictions == y_batch) / batch_size

        # --------------------------------------------------------------------
        # Accumulate and print iteration, loss and accuracy 
        # --------------------------------------------------------------------
        print('Iteration %05i | Loss = %07.3f | Accuracy = %0.4f' %
            (i + 1, outputs[1], accuracy))

        losses.append(outputs[1])
        accuracies.append(accuracy)
        
    return losses, accuracies

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=3000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------
pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
import os
model_file = './model_cnn_32_64_128_bn/model.ckpt'
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

### Updating validate_model() 

Here we need to define a slightly different `validate_model()` function to account for the extra placeholder that tells the network whether batch normalization is running in training or validation mode.
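
To see why the mode matters, the following is a minimal NumPy sketch of the two behaviors, not the actual `tf.layers.batch_normalization` implementation. The helper name `batch_norm`, the `momentum` and `eps` values, and the toy data are assumptions made for illustration, and the learned scale and offset are left out.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, momentum=0.99, eps=1e-3):
    """Illustrative batch normalization over axis 0 (no learned scale/offset)."""
    if training:
        # Training mode: normalize with the statistics of the current batch
        mean, var = x.mean(axis=0), x.var(axis=0)
        # ...and update the moving averages that are used later for validation
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Validation mode: normalize with the stored moving averages
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps), moving_mean, moving_var

# Toy batch: 256 examples with 4 features each (made-up data)
rng = np.random.RandomState(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(256, 4))
moving_mean, moving_var = np.zeros(4), np.ones(4)

out_train, moving_mean, moving_var = batch_norm(
    batch, moving_mean, moving_var, training=True)
out_valid, _, _ = batch_norm(
    batch, moving_mean, moving_var, training=False)

print('train-mode output mean:', out_train.mean(axis=0))
print('valid-mode output mean:', out_valid.mean(axis=0))
```

After a single update the moving averages are still far from the true statistics, so the validation-mode output is not yet centered. This is one reason the same input can produce different logits depending on the value fed to the `training` placeholder.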

In [0]:
def validate_model(model_file):
    """
    Method to test the validation performance of a model using the 
    test set data.
    
    :params
    
      (str) model_file : name of model file saved by saver object
      
    """
    # Load saved model
    tf.reset_default_graph()
    sess = tf.InteractiveSession()
    saver = tf.train.import_meta_graph('%s.meta' % model_file )
    saver.restore(sess, model_file)

    # Find our placeholders
    im = tf.get_collection('im')[0]
    logits = tf.get_collection('logits')[0]
    training = tf.get_collection('training')[0]

    # Normalize our input data x_test
    input_data = (x_test - np.mean(x_test)) / np.std(x_test)
    
    # --------------------------------------------------------------------
    # Prepare feed_dict
    # --------------------------------------------------------------------
    # 
    # What extra piece of information (which extra placeholder) do we need
    # to fill in when using batch normalization?
    #
    # --------------------------------------------------------------------
    feed_dict = {im: input_data, ?}

    # Pass data through the network using sess.run() to get our logits 
    output = sess.run(logits, feed_dict)

    # Convert logits to predictions
    predictions = np.argmax(output, axis=1)

    # Compare predictions to ground-truth to find accuracy
    accuracy = np.sum(predictions == y_test[..., 0]) / x_test.shape[0]
    
    print('Network test-set accuracy: %0.4f' % accuracy)

In [0]:
# ------------------------------------------------------------------------
# Validate model 
# ------------------------------------------------------------------------
validate_model(model_file)

## Exercise 2

Re-train the same network, but this time use 50% dropout in the hidden layer after the ReLU non-linearity. What do you expect to happen to training accuracy? What do you expect to happen to validation accuracy? What do you expect to happen to the speed of convergence?
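
As a reminder of what dropout does mechanically, here is a minimal NumPy sketch of "inverted" dropout, the scheme in which surviving activations are rescaled by `1 / keep_prob`. The helper name `dropout` and the toy activations are assumptions for illustration. This is not the TensorFlow implementation.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if keep_prob >= 1.0:
        # keep_prob of 100% (validation or inference): nothing is dropped
        return activations
    # Each unit survives independently with probability keep_prob
    mask = rng.uniform(size=activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 128), dtype='float32')  # stand-in for hidden layer outputs

h_train = dropout(h, keep_prob=0.5, rng=rng)  # 50% dropout during training
h_valid = dropout(h, keep_prob=1.0, rng=rng)  # no dropout during validation

print('fraction kept during training: %0.3f' % np.mean(h_train > 0))
print('mean activation (train): %0.3f' % h_train.mean())
print('mean activation (valid): %0.3f' % h_valid.mean())
```

Roughly half of the units are zeroed during training, yet the mean activation stays close to the validation-time value because of the rescaling.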

In [0]:
def create_model():
    """
    Method to create the following CNN architecture:
    
      - BLOCK 1 (filter depth of 32)
      - BLOCK 2 (filter depth of 64)
      - RESHAPE
      - HIDDEN LAYER (128 nodes)
      - LOGIT SCORES (10 nodes)
      
    """
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    #
    # 1. Define images
    # 
    #  - type: float32
    #  - size: [None, 32, 32, 3] so that we can feed in as many images as we need
    # 
    # 2. Define labels
    # 
    #  - type: int64
    #  - size: [None] so that we can feed in as many labels as we need
    # 
    # 3. Define training mode placeholder (for batch normalization)
    # 
    # 4. Define keep_prob placeholder (for dropout)
    # 
    # ------------------------------------------------------------------------

    im = tf.placeholder(tf.float32, shape=?)
    labels = tf.placeholder(tf.int64, shape=?)
    training = tf.placeholder(tf.bool)
    keep_prob = tf.placeholder(tf.float32)

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------
    #
    # As in previous assignments, we will use the tf.get_variable(...) method
    # to create matrix variables initialized to random values.
    # 
    # ------------------------------------------------------------------------

    # Block 1 (use a filter depth of 32)
    w1 = tf.get_variable('w1', shape=?, dtype=tf.float32)
    w2 = tf.get_variable('w2', shape=?, dtype=tf.float32)
    layer = tf.nn.conv2d(?)
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))
    layer = tf.nn.conv2d(?)
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))

    # Block 2 (use a filter depth of 64)
    w3 = tf.get_variable('w3', shape=?, dtype=tf.float32)
    w4 = tf.get_variable('w4', shape=?, dtype=tf.float32)
    layer = tf.nn.conv2d(?)
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))
    layer = tf.nn.conv2d(?)
    layer = tf.nn.relu(tf.layers.batch_normalization(layer, training=training))

    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    
    flattened = tf.reshape(layer, [-1, 8 * 8 * 64])

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer (128 nodes)
    w5 = tf.get_variable('w5', shape=?, dtype=tf.float32)
    h1 = tf.matmul(flattened, w5)
    h1 = tf.nn.relu(tf.layers.batch_normalization(h1, training=training))
    h1 = tf.nn.dropout(?)
    
    # Logits (10 nodes)
    w6 = tf.get_variable('w6', shape=?, dtype=tf.float32)
    logits = tf.matmul(h1, w6)

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, training, keep_prob, [w1, w2, w3, w4, w5, w6], logits, loss, train_op

In [0]:
# ------------------------------------------------------------------------
# Create model 
# ------------------------------------------------------------------------
im, labels, training, keep_prob, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections, initialize graph and saver
# ------------------------------------------------------------------------
tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)
tf.add_to_collection('training', training)
tf.add_to_collection('keep_prob', keep_prob)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
saver = tf.train.Saver()

### Updating train_model() 

Here we need to define a slightly different `train_model()` algorithm to account for the extra placeholder used to tell the algorithm what the keep_prob is (50% during training, 100% during validation or inference):

In [0]:
# ------------------------------------------------------------------------
# Train our algorithm 
# ------------------------------------------------------------------------
# 
# Let's set up a loop to train our algorithm by feeding it data iteratively.
# For each iteration, we will feed a batch_size number of images into our 
# model and let it readjust its neuronal weights.
# 
# ------------------------------------------------------------------------

def train_model(iterations=1000, batch_size=256):
    
    accuracies = []
    losses = []

    for i in range(iterations):

        # --------------------------------------------------------------------
        # Grab a total of batch_size number of random images and labels 
        # --------------------------------------------------------------------
        rand_indices = np.random.randint(50000, size=(batch_size))
        x_batch = x[rand_indices]
        y_batch = y[rand_indices]
        y_batch = y_batch[..., 0]

        # --------------------------------------------------------------------
        # Normalize x_batch
        # --------------------------------------------------------------------
        x_batch = (x_batch - np.mean(x_batch)) / np.std(x_batch)

        # Convert to types matching our defined placeholders
        x_batch = x_batch.astype('float32')
        y_batch = y_batch.astype('int64')

        # --------------------------------------------------------------------
        # Prepare feed_dict
        # --------------------------------------------------------------------
        # 
        # What extra pieces of information (which extra placeholders) do we need
        # to fill in when using batch normalization and dropout?
        #
        # --------------------------------------------------------------------
        feed_dict = {im: x_batch, labels: y_batch, training: True, keep_prob: 0.5}

        # --------------------------------------------------------------------
        # Run training iteration via sess.run()
        # --------------------------------------------------------------------
        outputs = sess.run([logits, loss, train_op], feed_dict)

        # --------------------------------------------------------------------
        # Use argmax to determine highest logit (model guess)
        # --------------------------------------------------------------------
        predictions = np.argmax(outputs[0], axis=1)

        # --------------------------------------------------------------------
        # Calculate accuracy 
        # --------------------------------------------------------------------
        accuracy = np.sum(predictions == y_batch) / batch_size

        # --------------------------------------------------------------------
        # Accumulate and print iteration, loss and accuracy 
        # --------------------------------------------------------------------
        print('Iteration %05i | Loss = %07.3f | Accuracy = %0.4f' %
            (i + 1, outputs[1], accuracy))

        losses.append(outputs[1])
        accuracies.append(accuracy)
        
    return losses, accuracies

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=3000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------

pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
import os
model_file = './model_cnn_32_64_128_bn_drop/model.ckpt'
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

### Updating validate_model() 

Here we need to define a slightly different `validate_model()` function to account for the extra placeholder that tells the network what `keep_prob` should be (100% during validation, so that no units are dropped).

In [0]:
def validate_model(model_file):
    """
    Method to test the validation performance of a model using the 
    test set data.
    
    :params
    
      (str) model_file : name of model file saved by saver object
      
    """
    # Load saved model
    tf.reset_default_graph()
    sess = tf.InteractiveSession()
    saver = tf.train.import_meta_graph('%s.meta' % model_file )
    saver.restore(sess, model_file)

    # Find our placeholders
    im = tf.get_collection('im')[0]
    logits = tf.get_collection('logits')[0]
    training = tf.get_collection('training')[0]
    keep_prob = tf.get_collection('keep_prob')[0]

    # Normalize our input data x_test
    input_data = (x_test - np.mean(x_test)) / np.std(x_test)
    
    # --------------------------------------------------------------------
    # Prepare feed_dict
    # --------------------------------------------------------------------
    # 
    # What extra pieces of information (which extra placeholders) do we need
    # to fill in when using batch normalization and dropout?
    #
    # --------------------------------------------------------------------
    feed_dict = {im: input_data, training: False, keep_prob: 1.0}

    # Pass data through the network using sess.run() to get our logits 
    output = sess.run(logits, feed_dict)

    # Convert logits to predictions
    predictions = np.argmax(output, axis=1)

    # Compare predictions to ground-truth to find accuracy
    accuracy = np.sum(predictions == y_test[..., 0]) / x_test.shape[0]
    
    print('Network test-set accuracy: %0.4f' % accuracy)

In [0]:
# ------------------------------------------------------------------------
# Validate model 
# ------------------------------------------------------------------------
validate_model(model_file)

# Advanced Exercises


## Exercise 1

Add L2 regularization to the model. Keep in mind the following necessary steps:

* add together the summed (`tf.reduce_sum()`) squared version of all the weights
* multiply this value by some constant
* combine L2 regularization loss + softmax cross-entropy loss
* pass the combined loss to the optimizer `minimize()` function

Try training different models with different L2 regularization. What do you expect to happen to training accuracy? Validation accuracy?
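
The arithmetic behind the steps listed above can be sketched in plain NumPy as follows. The helper name `l2_penalty`, the toy weight arrays, the cross-entropy value and the constant `l2_scale` are all made up for illustration. In the notebook's graph the same sum would be built from the weights list returned by `create_model()` using `tf.reduce_sum()`.

```python
import numpy as np

def l2_penalty(weights, scale):
    """Sum of squared entries over all weight arrays, times a constant."""
    total = sum(np.sum(np.square(w)) for w in weights)
    return scale * total

# Toy stand-ins for the network weights (hypothetical values)
weights = [np.full((3, 3), 0.5), np.full((2,), 2.0)]

cross_entropy = 1.25  # made-up softmax cross-entropy loss for one batch
l2_scale = 0.01       # hypothetical regularization constant

# Combined loss: this is the value that would be handed to the optimizer
combined_loss = cross_entropy + l2_penalty(weights, l2_scale)

print('L2 penalty:    %0.4f' % l2_penalty(weights, l2_scale))
print('Combined loss: %0.4f' % combined_loss)
```

A larger constant makes the penalty dominate the combined loss, which pushes the optimizer toward smaller weights.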

## Exercise 2

One additional very useful strategy to prevent overfitting is network architecture design. Which components of the current CNN architecture could be improved in this regard?

Some thoughts:

* add extra convolutional layers to decrease the feature map to 4 x 4 x N (instead of 8 x 8) before reshaping and applying matrix multiplications
* decrease the size of the hidden layer
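
To get a feel for how much these two changes matter, the short sketch below counts the weights in the first matrix multiplication (the flattened feature map times the hidden layer). The helper name `dense_weight_count` is made up, and `N = 128` and a 64-node hidden layer are assumed values chosen only for illustration.

```python
def dense_weight_count(height, width, channels, hidden_nodes):
    """Number of weights connecting a flattened feature map to a hidden layer."""
    return height * width * channels * hidden_nodes

# Current architecture: 8 x 8 x 64 feature map, 128 hidden nodes
current = dense_weight_count(8, 8, 64, 128)

# Extra convolutional block down to 4 x 4 x N, assuming N = 128
smaller_map = dense_weight_count(4, 4, 128, 128)

# Same 4 x 4 x 128 map, with the hidden layer reduced to 64 nodes (assumed)
smaller_hidden = dense_weight_count(4, 4, 128, 64)

print('current:        %i' % current)         # 524288
print('4 x 4 x 128:    %i' % smaller_map)     # 262144
print('64 hidden nodes: %i' % smaller_hidden)  # 131072
```

Under these assumptions each change halves the number of weights in this layer, so there are fewer parameters available to memorize the training set.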