# Overview

So far, in the first three assignments, you have followed the evolution of AI algorithms from expert (rule-based) systems to linear classifiers to regular artificial neural networks (ANNs). To finish off the series, we will finally implement a **convolutional** neural network (CNN), a special type of neural network that is customized specifically for imaging data and is now widely recognized as the state of the art for nearly all image recognition tasks.

## Outline

* convolutional operations
* non-linearities
* spatial sub-sampling
* fully-connected layers
* training a CNN
* inference
* validation

## Restarting your Virtual Machine

If at any point during this assignment you accidentally execute code or do something that you cannot seem to undo and need to "restart" the system (including deleting all temporary folders), go ahead and run the following single line of code. It will take about 1 minute to restart. Following this, you will have to start again from the beginning of the assignment to re-download the data and re-run the code you have written. Note that the code you have already written will **not** be deleted; you simply need to execute it once again from the start.

In [0]:
!kill -9 -1 # Warning this restarts your machine

## Downloading the Data

The following commands copy the assignment materials to your local Colaboratory instance and unzip them in preparation for your assignment:

In [0]:
!git clone https://github.com/CAIDMRes/lecture_02
!unzip lecture_02/data.zip
!rm -r lecture_02
!ls

## Loading the Data

In [0]:
# Loading a pickle (*.pkl) file
import pickle
x = pickle.load(open('x.pkl', 'rb'))

# x is a NumPy array with (flattened) image data
print(type(x))
print(x.shape)

# y is a NumPy array with labels
y = pickle.load(open('y.pkl', 'rb'))
print(y.shape) 

# Convolutions

The primary difference between a regular ANN and a CNN is the use of *convolutional* filters (kernels). Compared to the global filters used in a regular ANN, CNNs use local convolutional filters that are applied uniformly at many different regions across the entire image. Whereas an ANN tests for the presence of a relatively *dense* (high-resolution) pattern throughout the image, resulting in a single activation value per node, a CNN tests for the presence of a relatively *small* (low-resolution) pattern at many different image subregions, resulting in a **feature map** per node.

![Convolutions](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_05-01.png)

## What are convolutions?

Convolutions are a mathematical construct for multiplying a given smaller matrix (filter or kernel) at various locations within a larger matrix. The intuitive explanation is that a convolution checks how well a particular subregion of the matrix matches the target filter or kernel. If there is a strong match, the overall output value (element-wise multiplication followed by summation) is high.

In the following figure, the original 4 x 4 matrix (blue) is being convolved with a 3 x 3 kernel (dark blue), with the result of the operation stored in the top-left corner of the output matrix (dark green).

![Convolution](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_05-02.png)

For an excellent, highly-recommended (visual) refresher for convolutional operations, see the following [link](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html).
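
The multiply-and-sum operation described above can be sketched in a few lines of plain Python. This is only an illustration, not how TensorFlow implements it: the function name `correlate2d_valid` and the toy matrices are made up for this example, no padding is used, and the stride is fixed at 1. Note that, as in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    multiply-and-sum result at every location."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 4 x 4 "image" containing a diagonal line
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# A 3 x 3 kernel that looks for the same diagonal pattern
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

result = correlate2d_valid(image, kernel)
print(result)  # [[3, 0], [0, 3]]
```

The output is high (3) exactly where the 3 x 3 subregion matches the diagonal pattern of the kernel, and 0 where it does not.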

## Convolutional Math

Recall that there are three main parameters that you must define with convolutional operations. These include:

### Filter size

Recall that convolution of 2D images requires the use of a 4D convolutional kernel of size `i x j x c0 x c1`. This kernel size has two main components:

* `i x j`: this is the 2D size of the filter (convolutional kernel) you want to apply to your original matrix; common sizes here include 3 x 3, 5 x 5 or 7 x 7.
* `c0 x c1`: this is the size of the input image depth (`c0`) and the output image depth (`c1`); recall that our original greyscale images are the equivalent of a single image with a depth of 1, whereas for every other layer in our CNN this depth corresponds to the number of feature maps in a given layer.

### Stride
 
The stride defines whether to apply the filter to every location (stride 1), every other location (stride 2), etc. Recall that a stride > 1 results in an output matrix that is smaller than the input by a factor roughly equal to the stride along each strided dimension (e.g. a stride of 2 results in an output matrix that is 1/2 the original height and width).

### Padding

The padding strategy determines whether or not to pad the boundary of the image with 0s (or some other value) prior to convolution. Recall that a **same** padding strategy pads the image with enough 0s such that, at stride 1, the output matrix is the *same size* as the input matrix, whereas a **valid** padding strategy does *not* pad the image at all and instead only performs convolutions on the *valid* portion of the image.
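
To make the interaction between filter size, stride and padding concrete, here is a small helper that computes the output size along one spatial dimension. It is a sketch under stated assumptions: `conv_output_size` is a hypothetical name introduced for this example, and the rules follow the convention documented for TensorFlow's `'SAME'` and `'VALID'` padding modes. The 32-pixel input and 3-pixel kernel are arbitrary illustration values.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Output length along one spatial dimension (height or width)."""
    if padding == 'SAME':
        # Zero-padding is added so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        # No padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")

# A 32-pixel dimension with a 3-pixel kernel (illustrative values only)
print(conv_output_size(32, 3, stride=1, padding='SAME'))   # 32
print(conv_output_size(32, 3, stride=2, padding='SAME'))   # 16
print(conv_output_size(32, 3, stride=1, padding='VALID'))  # 30
print(conv_output_size(32, 3, stride=2, padding='VALID'))  # 15
```

The same function applies independently to the height and width dimensions, so asymmetric strides or kernels are handled by calling it once per dimension.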

## Convolutions with TensorFlow

Now that we understand the key parameters needed to define a convolution operation, implementation in TensorFlow is quite simple. The key consideration is understanding the four different dimensions of our matrices, `N x H x W x C`. To begin, `H x W` simply corresponds to the height and width of our image or feature maps (for our 784-element MNIST digits, `H x W` = 28 x 28). Next, `C` represents the depth (number of "channels") of our image or feature maps (for MNIST digits `C` = 1, but for every intermediate feature map it represents the total number of maps that are stacked together). Finally, `N`, as in our previous assignments, represents the number of images we pass into the network; for now we will leave it undefined (`None`) so that the network can be flexible.

This standard notation for variable matrix size is used throughout TensorFlow and other deep learning libraries, so it is useful to commit to memory. In addition to the images and feature maps, the same 4D convention is used to define operation stride. For all except the most specialized use cases, stride will be equal to 1 along both the `N` and `C` dimensions, with variations in stride primarily defined along the `H x W` dimensions (e.g., applying an operation that strides by 2 along the height and width directions requires a stride of `[1, 2, 2, 1]`).

Let's take a look at all this here:

In [0]:
import tensorflow as tf
import numpy as np

# (1) Define a placeholder for the original image
im = tf.placeholder(tf.float32, [None, 28, 28, 1])

# (2) Define a placeholder for the first convolutional kernel
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])

# (3) Apply a convolution
output = tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME')

# What is the size of our output matrix?
print(output.shape)

Does the size of our output matrix make sense? Keep in mind that in this particular example with a `5 x 5 x 1 x 16` convolutional kernel, we are creating a "bank" of 16 different filters, each of size `5 x 5`, and applying them to our input matrix (which has a depth of 1). Accordingly, this operation results in an output of 16 total feature maps, each of size `28 x 28`.
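
For readers who want to see the filter-bank idea without TensorFlow, the following NumPy sketch applies a `5 x 5 x 1 x 16` kernel to a batch of `28 x 28 x 1` images with zero padding and a stride of 1. The function `conv2d_same` is a hypothetical helper written for this illustration (it is slow and only handles stride 1); it is not TensorFlow's implementation.

```python
import numpy as np

def conv2d_same(images, kernel):
    """Naive stride-1, zero-padded convolution.

    images: array of shape (N, H, W, C0)
    kernel: array of shape (i, j, C0, C1)
    returns: array of shape (N, H, W, C1)
    """
    n, h, w, c0 = images.shape
    i, j, k_c0, c1 = kernel.shape
    assert c0 == k_c0, "kernel input depth must match image depth"

    # Pad so that the output keeps the input height and width
    pad_top, pad_left = (i - 1) // 2, (j - 1) // 2
    pad_bottom, pad_right = (i - 1) - pad_top, (j - 1) - pad_left
    padded = np.pad(
        images,
        ((0, 0), (pad_top, pad_bottom), (pad_left, pad_right), (0, 0)),
        mode='constant')

    out = np.zeros((n, h, w, c1), dtype=images.dtype)
    for r in range(h):
        for c in range(w):
            patch = padded[:, r:r + i, c:c + j, :]  # shape (N, i, j, C0)
            # Multiply-and-sum the patch against all C1 filters at once
            out[:, r, c, :] = np.tensordot(
                patch, kernel, axes=([1, 2, 3], [0, 1, 2]))
    return out

images = np.ones((2, 28, 28, 1))   # two dummy images
kernel = np.ones((5, 5, 1, 16))    # bank of 16 filters, each 5 x 5
out = conv2d_same(images, kernel)
print(out.shape)  # (2, 28, 28, 16)
```

Each of the 16 slices along the last axis of `out` is one feature map, produced by one of the 16 filters in the bank.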

Let's see if you can work through the following examples:

In [0]:
# ==============================================================================
# Exercise 1 
# ==============================================================================
# 
# Using the same inputs, define an operation with a stride length of 2 in both
# the height (H) and width (W) directions (but 1 in all remaining directions).
# What is the output shape?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
output = tf.nn.conv2d(?) 
print(output.shape)

In [0]:
# ==============================================================================
# Exercise 2 
# ==============================================================================
# 
# Using the same inputs, define an operation with a padding operation of 'VALID'
# instead of `SAME` (stride length of 1). What is the output shape?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
output = tf.nn.conv2d(?) 
print(output.shape)

In [0]:
# ==============================================================================
# Exercise 3 
# ==============================================================================
# 
# Using the same inputs, define an operation with a filter size of 7 x 7 instead
# of 5 x 5 ('SAME' padding, stride length of 1). What is the output shape?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [7, 7, 1, 16])
output = tf.nn.conv2d(?) 
print(output.shape)

In [0]:
# ==============================================================================
# Exercise 4 
# ==============================================================================
# 
# Create a new input that has an image depth of 4 instead of 1. In medical
# imaging, this scenario is common when we would like to use different MR
# series (T1, T2, etc.) of the same anatomy all at the same time as input to
# an algorithm. Assuming a 5 x 5 convolutional kernel with 16 output maps, what
# is the full convolutional kernel size required for this operation?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 4])
w1 = tf.placeholder(tf.float32, ?) # What size is needed here?
output = tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME') 
print(output.shape)

In [0]:
# ==============================================================================
# **ADVANCED** 
# ==============================================================================
# 
# (1) Define an operation with asymmetric striding (1 along H, 2 along W). What
#     is the resulting output shape?
# 
# (2) Define an operation with `VALID` padding and stride 2 along H and W. What
#     is the resulting output shape? What is the formula for resulting output 
#     shape when both `VALID` padding strategy and stride > 1 are used?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])

# Non-linearities

As in regular (non-convolutional) neural networks, the same ReLU nonlinearity function is used for CNNs. Just like the ReLU is applied after a matrix multiply operation in ANNs, a ReLU is applied after a convolutional operation in CNNs:

In [0]:
# Define inputs
im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])

# Perform convolution + ReLU
output = tf.nn.relu(tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME'))

# What is the size of the output matrix?
print(output.shape)

# Spatial Downsampling

As we progress through a series of convolutional operations, each new stack of feature maps represents more abstract and higher-level features than the layer before. Because of the increasingly larger scale of these features "deeper" in the network, we can decrease the resolution of the feature maps and still retain quite a bit of information about the original image.

## Pooling operation

The simplest method to downsample feature maps is the max-pool operation. Similar to a convolutional filter, a max-pool operation checks predefined subregions of an input matrix and simply outputs the maximum value (while discarding all other values). Recall that this operation must be used in combination with a stride value > 1 to actually downsample a given input matrix. An example of a 2 x 2 max-pool operation with a stride of 2 is shown below:

![Max-Pool](https://raw.githubusercontent.com/CAIDMRes/images/master/assignment_05-03.png)
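
The 2 x 2, stride-2 max-pool described above can be written out in plain Python as a rough illustration. The helper `max_pool_2d` and the sample values are hypothetical, and the sketch assumes the window tiles the input exactly, so no padding is needed.

```python
def max_pool_2d(matrix, size=2, stride=2):
    """Keep only the maximum value inside each size x size window."""
    pooled = []
    for r in range(0, len(matrix) - size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]

pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

The 4 x 4 input is reduced to 2 x 2: each output value is the largest entry of one non-overlapping 2 x 2 block, and the other three values in that block are discarded.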

To define a max-pool operation in Tensorflow, simply keep in mind the standard `N x H x W x C` convention for describing kernel sizes and strides. As before, `N` and `C` will almost always be 1 except for the most specialized use cases, with variations in kernel size primarily being defined along the `H x W` dimensions. Again, keep in mind that stride value > 1 is also required to perform downsampling:

In [0]:
# Define inputs
im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])

# Perform convolution + ReLU
output = tf.nn.relu(tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME'))

# Perform max-pool
output = tf.nn.max_pool(output, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# What is the size of the output matrix?
print(output.shape)

Let's see if you can work through these additional examples:

In [0]:
# ==============================================================================
# Exercise 1 
# ==============================================================================
# 
# Define a max-pool operation with a kernel size of [1, 4, 4, 1] and a stride
# of [1, 4, 4, 1]. What is the output matrix shape?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
output = tf.nn.relu(tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME'))
output = tf.nn.max_pool(?)

print(output.shape)

In [0]:
# ==============================================================================
# Exercise 2 
# ==============================================================================
# 
# Define a max-pool operation with a kernel size of [1, 4, 4, 1] and a stride
# of [1, 2, 2, 1]. What is the output matrix shape?
#
# ==============================================================================

im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
output = tf.nn.relu(tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME'))
output = tf.nn.max_pool(?)

print(output.shape)

In [0]:
# ==============================================================================
# **ADVANCED** 
# ==============================================================================
# 
# Define an average pool operation with kernel size of [1, 2, 2, 1] and a stride
# of [1, 2, 2, 1]. What is the name of the function in Tensorflow for this 
# operation? How does this operation differ from a standard max-pool operation?
# If we change the kernel size to [1, 4, 4, 1] while keeping the stride unchanged,
# what happens to the output matrix shape? What is the primary difference from
# the first operation?
#
# ==============================================================================


im = tf.placeholder(tf.float32, [None, 28, 28, 1])
w1 = tf.placeholder(tf.float32, [5, 5, 1, 16])
output = tf.nn.relu(tf.nn.conv2d(im, w1, strides=[1, 1, 1, 1], padding='SAME'))
output = ? 

print(output.shape)

## Strided convolutions

A second method to downsample feature maps is the strided convolution. Just as a max-pool (or avg-pool) operation with stride 2 downsamples a feature map by a factor of 2 in the height and width dimensions, a convolution (of any kernel size) with stride 2 can be used to accomplish the same effect. See the discussion above regarding convolutional math, specifically Exercise 1 in the convolutions section, for further information.
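
One way to build intuition for strided convolutions is to notice that a stride-2 convolution produces the same values as a stride-1 convolution in which every other row and column is thrown away. The plain-Python sketch below checks this on a toy example; the `correlate2d` helper, the 6 x 6 input and the 2 x 2 kernel are all made up for illustration, and no padding is used.

```python
def correlate2d(image, kernel, stride=1):
    """Unpadded multiply-and-sum of `kernel` over `image` with a given stride."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output

# 6 x 6 toy input with values 0..35, and a 2 x 2 kernel
image = [[r * 6 + c for c in range(6)] for r in range(6)]
kernel = [[1, 0], [0, 1]]

full = correlate2d(image, kernel, stride=1)      # 5 x 5 output
strided = correlate2d(image, kernel, stride=2)   # 3 x 3 output

# Keep every other row and column of the stride-1 result
subsampled = [row[::2] for row in full[::2]]

print(len(full), len(strided))  # 5 3
print(strided == subsampled)    # True
```

In practice the strided version is preferred because it never computes the values that would be discarded.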

## Putting it all together

Let's put a series of convolutions, non-linearities and downsampling operations together. Starting with a `[None, 28, 28, 1]` input matrix, let us perform the following:

**Block 1**
* convolution via a total of 16 filters each of size 5 x 5
* ReLU non-linearity
* max-pool operation with a `[1, 2, 2, 1]` kernel and `[1, 2, 2, 1]` stride

**Block 2**
* convolution via a total of 32 filters each of size 5 x 5
* ReLU non-linearity
* max-pool operation with a `[1, 2, 2, 1]` kernel and `[1, 2, 2, 1]` stride


In [0]:
def create_blocks():
    
    # Define inputs
    im = tf.placeholder(tf.float32, [None, 28, 28, 1])
    w1 = tf.placeholder(tf.float32, ?)
    w2 = tf.placeholder(tf.float32, ?)

    # Block 1
    output = tf.nn.relu(tf.nn.conv2d(?))
    output = tf.nn.max_pool(?)        

    # Block 2 
    output = tf.nn.relu(tf.nn.conv2d(?))
    output = tf.nn.max_pool(?)        
    
    return im, output, [w1, w2]
                    
im, output, weights = create_blocks()

# What is the size of the output matrix?
print(output.shape)

# Fully Connected Layers

After a series of convolutions, non-linearities and downsampling operations, the original 28 x 28 MNIST image will be collapsed to a number of small feature maps (in the above example a 7 x 7 x 32 matrix). However, recall that the goal remains to collapse the image even further, specifically to a total of 10 logit scores for digit prediction. How do we convert a 7 x 7 x 32 feature map to 10 different logit scores?

While there are a handful of strategies to do this, a popular approach is to convert the 2D feature maps into a single vector and simply use matrix multiplications until we reach our final logit scores (just like regular neural networks). In other words, we can reshape, for example, a 7 x 7 x 32 matrix into a 1 x 1568-element vector and multiply it by a 1568 x 10 matrix to arrive at our final 10-element logit score. By convention, every layer after the point at which a CNN "converts" into a regular neural network is known as a **fully connected** layer because, from this point on, every node in a layer is connected to every node in the layer before it. This is in contrast to convolutional layers, where each element in a feature map arises only from the neurons in a small receptive field within the image or feature map that came before it.
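
The shape bookkeeping in this paragraph can be checked with a few lines of NumPy. The arrays below are filled with random numbers purely to illustrate the shapes involved; the batch size of 4 and the variable names are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 stacks of feature maps, each 7 x 7 x 32
feature_maps = rng.standard_normal((4, 7, 7, 32))

# Flatten every 7 x 7 x 32 stack into one 1568-element row vector
flattened = feature_maps.reshape(-1, 7 * 7 * 32)

# A fully connected layer is a single matrix multiply
weights = rng.standard_normal((7 * 7 * 32, 10))
logits = flattened @ weights

print(flattened.shape)  # (4, 1568)
print(logits.shape)     # (4, 10)
```

Using `-1` for the first dimension of the reshape lets NumPy infer the batch size, which mirrors leaving `N` undefined in the graph.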

Let's see an example below:

In [0]:
# Use create_blocks()
im, output, weights = create_blocks()

# Reshape ("flatten") matrix
flattened = tf.reshape(output, shape=[-1, 7 * 7 * 32])

# Matrix multiply
w3 = tf.placeholder(tf.float32, [7 * 7 * 32, 10])
logits = tf.matmul(flattened, w3)

# What is the size of the output matrix?
print(logits.shape)

How would you add one "hidden" layer to this model? Don't forget to add a ReLU non-linearity:

In [0]:
# Use create_blocks()
im, output, weights = create_blocks()

# Reshape ("flatten") matrix
flattened = tf.reshape(?)

# Matrix multiply #1, hidden layer size 128 (don't forget ReLU)
w3 = tf.placeholder(tf.float32, ?)
hidden = ? 

# Matrix multiply #2, output logits size 10 (no ReLU for logits)
w4 = tf.placeholder(tf.float32, ?)
logits = ? 

# What is the size of the output matrix?
print(logits.shape)

# Training a Convolutional Neural Network

Congratulations! At this point you're ready to train your neural network. We will use the same base architecture as above: two serial Conv-ReLU-MaxPool blocks followed by a matrix reshape, single 128-node hidden layer and 10-element logit score. Once we collapse our image into a 10-element logit score, we can use our basic softmax cross-entropy loss function and Adam optimizer to train the network. Let's do it!

In [0]:
def create_model():
    """
    Method to create the following CNN architecture:
    
      - BLOCK 1 (filter depth of 16)
      - BLOCK 2 (filter depth of 32)
      - RESHAPE
      - HIDDEN LAYER (128 nodes)
      - LOGIT SCORES (10 nodes)
      
    Note that a BLOCK consists of a Conv-ReLU-MaxPool combination where
    convolutional kernels are all 5 x 5 in shape.
    
    """
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    #
    # 1. Define images
    # 
    #  - type: float32
    #  - size: [None, 28, 28, 1] so that we can feed in as many images as we need
    # 
    # 2. Define labels
    # 
    #  - type: int64
    #  - size: [None] so that we can feed in as many labels as we need
    # 
    # ------------------------------------------------------------------------

    im = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.int64, [None])

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------
    # 
    # A convolutional block is defined by a series of convolution, ReLU and 
    # max-pool operations performed consecutively. Each block is defined by a set 
    # number of filters (feature map depth) that is set by the convolutional
    # kernel size. Recall above that the convolutional kernel size must reflect
    # both the number of feature maps in the layer before as well as the number of 
    # feature maps in the layer afterwards.
    #
    # As in previous assignments, we will use the tf.get_variable(...) method
    # to create matrix variables initialized to random values.
    # 
    # ------------------------------------------------------------------------

    # Block 1 (use a filter depth of 16)
    w1 = tf.get_variable('w1', shape=?, dtype=tf.float32)
    layer = tf.nn.relu(tf.nn.conv2d(im, w1, strides=?, padding='SAME'))
    layer = tf.nn.max_pool(layer, ksize=?, strides=?, padding='SAME')        

    # Block 2 (use a filter depth of 32)
    w2 = tf.get_variable('w2', shape=?, dtype=tf.float32)
    layer = tf.nn.relu(tf.nn.conv2d(layer, w2, strides=?, padding='SAME'))
    layer = tf.nn.max_pool(layer, ksize=?, strides=?, padding='SAME')        
    
    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    
    flattened = tf.reshape(layer, [?])

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer (128 nodes)
    w3 = tf.get_variable('w3', shape=?, dtype=tf.float32)
    h1 = tf.nn.relu(tf.matmul(flattened, w3))
    
    # Logits (10 nodes)
    w4 = tf.get_variable('w4', shape=?, dtype=tf.float32)
    logits = tf.matmul(h1, w4)

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    # 
    # HINT: use tf.losses.sparse_softmax_cross_entropy() as above
    #
    # ------------------------------------------------------------------------

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------
    # 
    # An optimizer is a special TensorFlow class that takes your model weights and 
    # adjusts them ever so slightly so that they will make a better prediction the
    # next time around. They are implemented with a technique known as 
    # backpropagation which we will learn about in further detail during later
    # lectures. For now, just know that this is what we are using here.
    #
    # ------------------------------------------------------------------------

    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, [w1, w2, w3, w4], logits, loss, train_op

# ------------------------------------------------------------------------
# Test our model
# ------------------------------------------------------------------------
# 
# If the graph was defined properly, we should be able to check what the
# model outputs look like. Can you guess what the shapes of our logits
# and loss will be?
# 
# ------------------------------------------------------------------------

im, labels, weights, logits, loss, train_op = create_model()
print(logits.shape)
print(loss.shape)
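
To make the loss less of a black box, here is a rough NumPy sketch of a softmax cross-entropy computed from integer labels and averaged over the batch. The function below is a hand-written illustration with made-up inputs; it is intended to mirror what a sparse softmax cross-entropy loss computes conceptually, not to reproduce TensorFlow's implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    # Subtract the row-wise max for numerical stability (softmax is unchanged)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to the correct class of each row
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()

# Two examples, three classes (illustrative values only)
logits = np.array([[2.0, 0.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no preference at all
labels = np.array([0, 1])

loss = sparse_softmax_cross_entropy(logits, labels)
print(loss)
```

The first example contributes a small loss because the correct class has the highest logit, while the second contributes `log(3)` because the model spreads its probability evenly over three classes.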

## Initialization

Now we set up some code to initialize our network graph, variables and new saver object. This code is identical to the earlier assignments.

In [0]:
# ------------------------------------------------------------------------
# Create our model
# ------------------------------------------------------------------------

im, labels, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections
# ------------------------------------------------------------------------
# 
# Collections are used by TensorFlow to keep track of certain intermediate 
# values for quick access during save/load functions.
# 
# ------------------------------------------------------------------------

tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)

# ------------------------------------------------------------------------
# Initialize our test graph
# ------------------------------------------------------------------------
# 
# What two things do we need to initialize our graph?
# 
# ------------------------------------------------------------------------

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# ------------------------------------------------------------------------
# Initialize our saver
# ------------------------------------------------------------------------
# 
# Initialize a Saver object
# 
# ------------------------------------------------------------------------

saver = tf.train.Saver()

## Training

Let's train our algorithm! The code here is nearly identical to our earlier method except that in this case we must reshape our input 784-element vector to a 28 x 28 x 1 image.
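
As a quick illustration of the reshape mentioned above, the NumPy sketch below converts a batch of flattened 784-element vectors into `28 x 28 x 1` images and normalizes the batch to a mean of 0 and SD of 1. The array `x_fake` is random stand-in data invented for this example, not the assignment's `x`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 8 rows of x: pixel values between 0 and 255
x_fake = rng.integers(0, 256, size=(8, 784)).astype('float32')

# Reshape each 784-element row into a 28 x 28 image with a depth of 1
images = x_fake.reshape(8, 28, 28, 1)

# Normalize the whole batch to zero mean and unit standard deviation
batch = (images - np.mean(images)) / np.std(images)

print(batch.shape)  # (8, 28, 28, 1)

# Row-major order: pixel 28 of the flat vector is row 1, column 0 of the image
print(images[0, 1, 0, 0] == x_fake[0, 28])  # True
```

The reshape does not move any data; it only changes how the same 784 values per image are indexed.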

In [0]:
# ------------------------------------------------------------------------
# Train our algorithm 
# ------------------------------------------------------------------------
# 
# Let's set up a loop to train our algorithm by feeding it data iteratively.
# For each iteration, we will feed a batch_size number of images into our 
# model and let it readjust its neuronal weights.
# 
# ------------------------------------------------------------------------

def train_model(iterations=1000, batch_size=256):
    
    accuracies = []
    losses = []

    for i in range(iterations):

        # --------------------------------------------------------------------
        # Grab a total of batch_size number of random images and labels 
        # --------------------------------------------------------------------
        # 
        # 1. Pick batch_size number of random indices between 0 and 60,000
        # 2. Select those images / labels
        #
        # --------------------------------------------------------------------

        rand_indices = np.random.randint(60000, size=(batch_size))
        x_batch = x[?].reshape(batch_size, 28, 28, 1)
        y_batch = y[?]

        # --------------------------------------------------------------------
        # Normalize x_batch
        # --------------------------------------------------------------------
        # 
        # Currently, values in x range from 0 to 255. If we normalize these values
        # to a mean of 0 and SD of 1 we will improve the stability of training
        # and furthermore improve interpretation of learned weights. Use the
        # following code to normalize your batch:
        # 
        # --------------------------------------------------------------------

        x_batch = (x_batch - np.mean(x_batch)) / np.std(x_batch)

        # Convert to types matching our defined placeholders
        x_batch = x_batch.astype('float32')
        y_batch = y_batch.astype('int64')

        # Prepare feed_dict
        feed_dict = {im: x_batch, labels: y_batch}

        # --------------------------------------------------------------------
        # Run training iteration via sess.run()
        # --------------------------------------------------------------------
        # 
        # Here, in addition to whichever outputs we wish to extract, we need to
        # also include the train_op variable. Including train_op will tell 
        # Tensorflow that in addition to calculating the intermediates of our graph,
        # we also need to readjust the variables so that the overall loss goes
        # down.
        # 
        # --------------------------------------------------------------------

        outputs = sess.run([logits, loss, train_op], ?)

        # --------------------------------------------------------------------
        # Use argmax to determine highest logit (model guess)
        # --------------------------------------------------------------------
        # 
        # Keep in mind our logits matrix is (batch_size x 10) in size representing
        # a total of batch_size number of predictions. How do we process this matrix
        # with the np.argmax() to find the highest logit along each row of the matrix
        # (e.g. find the prediction for each of our images)?
        # 
        # HINT: what does the axis parameter in np.argmax(a, axis) specify?
        # 
        # --------------------------------------------------------------------

        predictions = np.argmax(outputs[0], axis=1)

        # --------------------------------------------------------------------
        # Calculate accuracy 
        # --------------------------------------------------------------------
        # 
        # Consider the following:
        # 
        # - predictions = the predicted digits
        # - y_batch = the ground-truth digits
        # 
        # How do I calculate an accuracy % with this data?
        # 
        # --------------------------------------------------------------------

        accuracy = np.sum(predictions == y_batch) / batch_size

        # --------------------------------------------------------------------
        # Accumulate and print iteration, loss and accuracy 
        # --------------------------------------------------------------------

        print('Iteration %05i | Loss = %07.3f | Accuracy = %0.4f' %
            (i + 1, outputs[1], accuracy))

        losses.append(outputs[1])
        accuracies.append(accuracy)
        
    return losses, accuracies
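
The argmax and accuracy steps used inside the training loop can be tried in isolation on a tiny hand-made example. The logits and labels below are invented for illustration (4 examples, 3 classes rather than 10).

```python
import numpy as np

# One row of logits per example (4 examples, 3 classes)
logits = np.array([
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.2, 0.4, 0.1],
])
y_true = np.array([1, 0, 2, 1])

# axis=1 searches along each row, giving one predicted class per example
predictions = np.argmax(logits, axis=1)

# Fraction of predictions that match the ground-truth labels
accuracy = np.mean(predictions == y_true)

print(predictions)  # [1 0 2 0]
print(accuracy)     # 0.75
```

Taking the mean of the boolean comparison is equivalent to dividing the number of correct predictions by the batch size.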

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=1000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------
import pylab
pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
# 
# In this step, all model variables and the underlying graph structure
# are saved so that they can be reloaded. Although it looks like just one
# file is saved here, this single line of code in fact writes several files
# that share the model.ckpt prefix, including a *.ckpt.meta file that stores
# the graph structure.
#  
# --------------------------------------------------------------------

import os
model_file = './model_cnn_16_32_128/model.ckpt'
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

# Running inference

Now that we have a trained model, let's go ahead and see how it performs! We will use the same procedure as before to load up a trained network and then feed in random digits to see how it fares.

In [0]:
# Load the saved model
tf.reset_default_graph()
sess = tf.InteractiveSession()
saver = tf.train.import_meta_graph('./model_cnn_16_32_128/model.ckpt.meta')
saver.restore(sess, './model_cnn_16_32_128/model.ckpt')

# Find our placeholders
im = tf.get_collection('im')[0]
logits = tf.get_collection('logits')[0]

# Pick a random image (drawn here from the training data x)
i = int(np.random.randint(60000))
image = x[i].reshape(1, 28, 28, 1)
label = y[i]

# Normalize the image
image = (image - np.mean(image)) / np.std(image)

# Create a feed_dict
feed_dict = {im: image}

# Pass data through the network
l = sess.run(logits, feed_dict)

# Convert logits to predictions
prediction = np.argmax(l)

# Visualize
pylab.imshow(image.reshape(28, 28))
pylab.axis('off')
pylab.title('My prediction is %i' % prediction)
pylab.show()
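
The `argmax` above tells us which digit the network picked, but not how confident it was. As a minimal sketch, assuming logits shaped (1, 10) like those returned by `sess.run()` above, the logits can be converted to softmax probabilities with NumPy. The helper name `softmax` and the logit values below are made up for illustration and are not real network output.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into probabilities along the last axis."""
    # Subtract the row-wise max for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for a single image (1 x 10), for illustration only
example_logits = np.array([[1.0, 0.5, -2.0, 8.0, 0.0, 1.5, -1.0, 0.2, 3.0, 0.1]])
probs = softmax(example_logits)

print('Predicted digit: %i' % np.argmax(probs))
print('Confidence: %0.4f' % np.max(probs))
```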

# Model validation

As we learned in lecture #4, a model with *too much* learning capacity can potentially memorize the dataset without learning anything too useful. This was certainly a small but definite problem with our regular non-convolutional neural networks. How do we fare with CNNs? To test for this phenomenon we need to evaluate the model on new data that the algorithm has never seen before. Let's go ahead and download this new data now:

In [0]:
!git clone https://github.com/CAIDMRes/lecture_03
!unzip lecture_03/data.zip
!rm -r lecture_03 
!ls

## Loading the data

The two new files we downloaded are `x_test.npy` and `y_test.npy` corresponding to our test set data and labels.  The format is identical to before. We have a total of 10,000 examples to test. Let us load them now:

In [0]:
import numpy as np
x_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')

print(x_test.shape)
print(y_test.shape)

## Validating

Using the template code to run inference shown above, let us now write code to:

* load our saved model 
* create a `feed_dict` with new test data 
* pass through network using `sess.run()`
* convert the `logits` to predictions
* calculate overall network accuracy

This is nearly identical to our previous assignments, with the exception that our input images are now 2D (28 x 28 x 1) instead of 1D vectors.
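
One detail worth checking before filling in the template: the validation code normalizes each test image separately, using `axis=1` and `keepdims=True` on the flattened data. The sketch below shows what those arguments do on a tiny made-up array (two "images" of four pixels each); the array `toy` is purely illustrative.

```python
import numpy as np

# Two hypothetical flattened "images" with 4 pixels each
toy = np.array([[0., 2., 4., 6.],
                [10., 10., 20., 20.]])

# keepdims=True keeps the result as shape (2, 1) so that it
# broadcasts against the (2, 4) data one row (image) at a time
mean = np.mean(toy, axis=1, keepdims=True)
std = np.std(toy, axis=1, keepdims=True)
normalized = (toy - mean) / std

print(mean.shape)                  # shape of the per-image means
print(normalized.mean(axis=1))     # each row is now centered near 0
print(normalized.std(axis=1))      # each row now has unit spread
```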

In [0]:
def validate_model(model_file):
    """
    Method to test the validation performance of a model using the 
    test set data.
    
    :params
    
      (str) model_file : name of model file saved by saver object
      
    """
    # Load saved model
    tf.reset_default_graph()
    sess = tf.InteractiveSession()
    saver = tf.train.import_meta_graph('%s.meta' % model_file )
    saver.restore(sess, model_file)

    # Find our placeholders
    im = tf.get_collection('im')[0]
    logits = tf.get_collection('logits')[0]

    # Normalize our input data x_test
    input_data = (x_test - np.mean(x_test, axis=1, keepdims=True)) / \
        np.std(x_test, axis=1, keepdims=True)
    
    # -------------------------------------------------------
    # Create a feed_dict
    # -------------------------------------------------------
    # 
    # HINT: What do we need to do to properly format this image
    # for input into the CNN?
    # 
    # -------------------------------------------------------

    feed_dict = {im: ?}

    # Pass data through the network using sess.run() to get our logits 
    output = sess.run(logits, feed_dict)

    # Convert logits to predictions
    predictions = np.argmax(output, axis=1)

    # Compare predictions to ground-truth to find accuracy
    accuracy = np.sum(predictions == y_test) / x_test.shape[0]
    
    print('Network test-set accuracy: %0.4f' % accuracy)
    
# Pass our model_file
model_file = './model_cnn_16_32_128/model.ckpt' 
validate_model(model_file)

## Notes

How did the algorithm perform? Better or worse than our non-convolutional neural network? Were you surprised by the result? In the remainder of this tutorial, let's test a handful of different architectures.
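
A single accuracy number hides *which* digits get confused with each other. As a rough sketch, assuming you have an array of predictions and the matching ground-truth labels (for example by returning `predictions` from `validate_model()`, which the template above does not do), a confusion matrix can be tallied with NumPy. The helper name `confusion_matrix` and the small label arrays are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Count how often true class r (row) was predicted as class c (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels and predictions for six images
y_true = np.array([0, 1, 2, 2, 9, 9])
y_pred = np.array([0, 1, 2, 7, 9, 4])

cm = confusion_matrix(y_true, y_pred)

# Per-class accuracy: correct counts divided by the number of true examples
# (np.maximum avoids division by zero for classes with no examples)
per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
print(cm)
print(per_class)
```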

# Exercises

For the following exercises, we will evaluate a number of different variations in CNN architecture. As before, the steps will include:

* writing a new model in the `create_model()` method
* initializing the training variables
* using the `train_model()` method defined above to run training (repeated as many times as needed to converge)
* saving the model
* validating the model on test set data

The goal is to get a sense of which combinations work better than others. Keep in mind we are already at 99%+ accuracy, so we're not expecting any dramatic changes, but the process of fine-tuning a neural network is an extremely valuable experience to gain first-hand.

## Exercise 1

Re-train several neural networks, this time with either more or fewer convolutional filters (e.g. 32-64 or 8-16). What do you expect to happen to your algorithm accuracy?

In [0]:
def create_model():
    
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    im = ? 
    labels = ?

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------

    # Block 1
    w1 = ?
    layer = tf.nn.relu(tf.nn.conv2d(?))
    layer = tf.nn.max_pool(?)        

    # Block 2
    w2 = ?
    layer = tf.nn.relu(tf.nn.conv2d(?))
    layer = tf.nn.max_pool(?)        
    
    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    flattened = ?

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer (128 nodes)
    w3 = ?
    h1 = ?
    
    # Logits (10 nodes)
    w4 = ?
    logits = ?

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------
    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, [w1, w2, w3, w4], logits, loss, train_op

In [0]:
# ------------------------------------------------------------------------
# Create model 
# ------------------------------------------------------------------------
im, labels, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections, initialize graph and saver
# ------------------------------------------------------------------------
tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
saver = tf.train.Saver()

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=1000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------

pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
# 
# In this step, all model variables and the underlying graph structure
# are saved so that they can be reloaded. Although it looks like just one
# file is saved here, this single line of code in fact writes several files
# that share the model.ckpt prefix, including model.ckpt.meta (the graph).
#  
# --------------------------------------------------------------------

import os
model_file = ?
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

In [0]:
# ------------------------------------------------------------------------
# Validate model 
# ------------------------------------------------------------------------
validate_model(model_file)

## Exercise 2

Re-train several neural networks this time with either increased or decreased size of hidden layers (perhaps 196, 256, or 32, 64). Alternatively, try increasing the number of hidden layers, or completely remove the hidden layer. What do you expect to happen to your algorithm accuracy?

In [0]:
def create_model():
    
    # Reset our graph to build a new one
    tf.reset_default_graph()

    # ------------------------------------------------------------------------
    # Define placeholders for our images and labels
    # ------------------------------------------------------------------------
    im = ? 
    labels = ?

    # ------------------------------------------------------------------------
    # Define convolutional blocks 
    # ------------------------------------------------------------------------

    # Block 1
    w1 = ?
    layer = tf.nn.relu(tf.nn.conv2d(?))
    layer = tf.nn.max_pool(?)        

    # Block 2
    w2 = ?
    layer = tf.nn.relu(tf.nn.conv2d(?))
    layer = tf.nn.max_pool(?)        
    
    # ------------------------------------------------------------------------
    # Reshape to 1D vector 
    # ------------------------------------------------------------------------
    flattened = ?

    # ------------------------------------------------------------------------
    # Define our matmul operations
    # ------------------------------------------------------------------------

    # Hidden layer
    w3 = ?
    h1 = ?
    
    # Logits (10 nodes)
    w4 = ?
    logits = ?

    # ------------------------------------------------------------------------
    # Define our softmax cross-entropy loss
    # ------------------------------------------------------------------------
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # ------------------------------------------------------------------------
    # Define our optimizer
    # ------------------------------------------------------------------------
    train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
    
    return im, labels, [w1, w2, w3, w4], logits, loss, train_op

In [0]:
# ------------------------------------------------------------------------
# Create model 
# ------------------------------------------------------------------------
im, labels, weights, logits, loss, train_op = create_model()

# ------------------------------------------------------------------------
# Add to collections, initialize graph and saver
# ------------------------------------------------------------------------
tf.add_to_collection('im', im)
tf.add_to_collection('logits', logits)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
saver = tf.train.Saver()

In [0]:
# --------------------------------------------------------------------
# Train model
# --------------------------------------------------------------------
losses, accuracies = train_model(iterations=1000, batch_size=256)

# --------------------------------------------------------------------
# Graph outputs and accuracy
# --------------------------------------------------------------------

pylab.plot(losses)
pylab.title('Model loss over time')
pylab.show()

pylab.plot(accuracies)
pylab.title('Model accuracy over time')
pylab.show()

# --------------------------------------------------------------------
# Save model
# --------------------------------------------------------------------
# 
# In this step, all model variables and the underlying graph structure
# are saved so that they can be reloaded. Although it looks like just one
# file is saved here, this single line of code in fact writes several files
# that share the model.ckpt prefix, including model.ckpt.meta (the graph).
#  
# --------------------------------------------------------------------

import os
model_file = ?
os.makedirs(os.path.dirname(model_file), exist_ok=True)
print('Saving model')
saver.save(sess, model_file)

In [0]:
# ------------------------------------------------------------------------
# Validate model 
# ------------------------------------------------------------------------
validate_model(model_file)

# Advanced Exercises

Up for a challenge? Go ahead and try some (or all) of these out.

## Exercise 1

In this example, we defined a convolutional block as a single series of convolution, ReLU and down-sampling. What happens if we use two convolutional operations, in other words convolution, ReLU, convolution, ReLU, down-sampling?
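
To picture the modified block, here is a minimal NumPy sketch (not the TensorFlow graph used in this assignment) that applies convolution, ReLU, convolution, ReLU and then 2 x 2 max-pooling to a single image and tracks the resulting shape. The 3 x 3 kernel size, the 16 filters, the random weights and the helper names `conv2d_same`, `relu` and `max_pool_2x2` are all assumptions chosen for illustration.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive stride-1 convolution with zero padding so output size == input size.
    x: (H, W, C_in), w: (k, k, C_in, C_out) with odd k."""
    k = w.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    height, width = x.shape[:2]
    out = np.zeros((height, width, w.shape[3]))
    for i in range(height):
        for j in range(width):
            patch = padded[i:i + k, j:j + k, :]          # (k, k, C_in)
            out[i, j] = np.tensordot(patch, w, axes=3)   # (C_out,)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """2 x 2 max-pooling with stride 2 (H and W must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

rng = np.random.RandomState(0)
image = rng.rand(28, 28, 1)             # stand-in for one normalized digit
w1a = rng.randn(3, 3, 1, 16) * 0.1      # first convolution of the block
w1b = rng.randn(3, 3, 16, 16) * 0.1     # second convolution of the block

layer = relu(conv2d_same(image, w1a))   # 28 x 28 x 16
layer = relu(conv2d_same(layer, w1b))   # 28 x 28 x 16
layer = max_pool_2x2(layer)             # 14 x 14 x 16
print(layer.shape)
```

Note that the second convolution needs its own weight tensor whose input channel count matches the output channel count of the first.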

## Exercise 2

In this example, the method to collapse our intermediate 7 x 7 x 28 feature map was to reshape it and treat the rest of the network as a regular ANN. Can you think of alternatives to this approach? Remember the only rule is that we must end up with a 10-element logit score.

Some thoughts:

* additional application of a single convolution block, average pool and/or strided convolution to downsample the feature map to 3 x 3 x N before proceeding with the matrix reshape
* continued application of convolution blocks, average pools and/or strided convolutions to downsample the feature map to 1 x 1 x N, and subsequently treating the end result as a single dimensional hidden layer
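
One simple variant of the second idea is to average each channel over its full spatial extent, which collapses a 7 x 7 x N map to N values in a single step. The NumPy sketch below shows only the shape bookkeeping; the batch size of 4, the 32 channels and the random values are assumptions for illustration, not outputs of the assignment's network.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical batch of feature maps: (batch, height, width, channels)
feature_map = rng.rand(4, 7, 7, 32)

# Average over the two spatial axes, leaving one value per channel
pooled = feature_map.mean(axis=(1, 2))       # (4, 32)

# A single weight matrix then maps the pooled vector to 10 logits
w_out = rng.randn(32, 10) * 0.01
logits_example = pooled @ w_out              # (4, 10)

print(pooled.shape, logits_example.shape)
```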

## Exercise 3

Why does the CNN perform better than the ANN? Specifically, why does the algorithm overfit less? Calculate the total number of **trainable parameters** (total elements in all weights combined) for a CNN vs. ANN approach. Any other thoughts?
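
For the counting step, a small helper can do the arithmetic once you have written down the shape of every weight tensor. This is only a sketch: the helper name `count_params` and the two example shapes (a 3 x 3 convolution kernel with 1 input and 16 output channels, and a 128 x 10 matrix) are illustrative assumptions, so substitute the shapes from your own models.

```python
import numpy as np

def count_params(shapes):
    """Total number of elements across a list of weight shapes."""
    return int(sum(np.prod(shape) for shape in shapes))

# Illustrative shapes only; replace with the shapes from your own models
example_shapes = [
    (3, 3, 1, 16),   # convolution kernel: height x width x in x out
    (128, 10),       # matrix multiplication weights: in x out
]

for shape in example_shapes:
    print(shape, count_params([shape]))
print('Total:', count_params(example_shapes))
```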

## Exercise 4

As many of you recognized in the first assignment, an *ensemble* of algorithms (e.g. multiple algorithms trained with slightly different architectures) tends to perform better than any single model. Create an ensemble of your top CNNs (and/or ANNs), and see if you can push your model accuracy even further.
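
One common way to combine models, sketched here under the assumption that you have collected the logits from each model for the same images, is to average their softmax probabilities and take the `argmax` of the average. The helper names `softmax` and `ensemble_predict` and the three tiny logit arrays (2 images, 3 classes for brevity) are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def ensemble_predict(logits_list):
    """Average the class probabilities of several models, then pick the top class."""
    mean_probs = np.mean([softmax(l) for l in logits_list], axis=0)
    return np.argmax(mean_probs, axis=1)

# Made-up logits from three models for the same two images
model_a = np.array([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_b = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
model_c = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]])

ensemble_predictions = ensemble_predict([model_a, model_b, model_c])
print(ensemble_predictions)
```

Averaging probabilities rather than raw logits keeps a single model with unusually large logits from dominating the vote, although other combination rules (such as majority voting) are equally reasonable to try.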