Deep Learning
=============

Assignment 4
------------

Previously in `2_fullyconnected.ipynb` and `3_regularization.ipynb`, we trained fully connected networks to classify [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) characters.

The goal of this assignment is to make the neural network convolutional.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
from six.moves import range

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a TensorFlow-friendly shape:
- convolutions need the image data formatted as a cube (width by height by #channels)
- labels as float 1-hot encodings.
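
As a small illustration of the two reformatting steps, here is a NumPy sketch on made-up data. The three fake images and their labels are hypothetical; only the reshape and the broadcasting comparison mirror what the `reformat` cell does.

```python
import numpy as np

image_size = 28
num_labels = 10
num_channels = 1  # grayscale

# Hypothetical stand-ins for the real dataset: 3 blank images, 3 class ids.
images = np.zeros((3, image_size, image_size))
labels = np.array([0, 3, 9])

# Add a trailing channel axis so each image becomes a 28 x 28 x 1 cube.
cubes = images.reshape((-1, image_size, image_size, num_channels)).astype(np.float32)

# labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (10,).
# Broadcasting the comparison gives a (3, 10) boolean matrix with exactly one
# True per row, which is then cast to float32.
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)

print(cubes.shape, one_hot.shape)
```

Taking `one_hot.argmax(axis=1)` recovers the original integer labels, which is what the `accuracy` function relies on.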

In [3]:
image_size = 28
num_labels = 10
num_channels = 1 # grayscale

import numpy as np

def reformat(dataset, labels):
  dataset = dataset.reshape(
    (-1, image_size, image_size, num_channels)).astype(np.float32)
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28, 1) (200000, 10)
Validation set (10000, 28, 28, 1) (10000, 10)
Test set (10000, 28, 28, 1) (10000, 10)


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])
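
A quick check of `accuracy` on hand-made inputs may help. The prediction and label arrays below are hypothetical; the function body is copied from the cell above so the sketch runs on its own.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows where the arg-max class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Hypothetical softmax outputs for 4 examples over 3 classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.6, 0.3, 0.1]])
# One-hot labels; the last example is deliberately misclassified.
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype=np.float32)

print(accuracy(predictions, labels))  # 3 of 4 correct -> 75.0
```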

Let's build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more expensive computationally, so we'll limit its depth and number of fully connected nodes.

Key to this section is understanding how tf.nn.conv2d computes a 2-D convolution given 4-D input and filter tensors. 

tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, name=None)

Args:
input: A Tensor. Must be one of the following types: float32, float64.
filter: A Tensor. Must have the same type as input.
strides: A list of ints. 1-D of length 4. The stride of the sliding window for each dimension of input.
padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
use_cudnn_on_gpu: An optional bool. Defaults to True.
name: A name for the operation (optional).

So, it takes:
1. an input tensor ("input" above) of shape [batch, in_height, in_width, in_channels]
    + this is variously data (to start), hidden (before pooling added) or pool
2. a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]
    + these are the weights
    + it expects the layout to be: [filter_height, filter_width, in_channels, out_channels]
    + this is why we're seeing weights initialized as [patch_size, patch_size, num_channels, depth]
    + the filter's last dimension (out_channels) sets the depth of the output, i.e. the shape the data is transformed to.
3. strides
    + [1,2,2,1] w/out pooling, [1,1,1,1] w/ pooling
    + with stride 1 and "SAME" padding (i.e. pre-pooling), the output keeps the input's height and width; the data is re-shaped, but not reduced.
4. padding = "SAME"
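
To see how strides and padding determine the output height and width, here is a minimal sketch. The helper name `output_size` is hypothetical; the two formulas are the ones the TensorFlow documentation gives for "SAME" and "VALID" padding.

```python
import math

def output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution or pooling op along one axis."""
    if padding == "SAME":
        # Input is zero-padded as needed, so the filter size drops out.
        return int(math.ceil(in_size / float(stride)))
    if padding == "VALID":
        # No padding: the window must fit entirely inside the input.
        return int(math.ceil((in_size - filter_size + 1) / float(stride)))
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Two stride-2 "SAME" stages, as in the small network in this notebook.
size = 28
for _ in range(2):
    size = output_size(size, filter_size=2, stride=2, padding="SAME")
print(size)
```

With a 28-pixel input this gives 28 -> 14 -> 7, which is where the `image_size // 4` factor in the first fully connected layer's input size comes from.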
    
It then: 
1. Flattens the FILTER to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
    + so, it takes the filter matrix and flattens it out.
2. Extracts image patches from the input tensor to form a virtual tensor of shape:
        [batch, out_height, out_width, filter_height * filter_width * in_channels].
    + so, note that both have filter_height * filter_width * in_channels components
3. For each patch, right-multiplies the filter matrix and the image patch vector.

It does this by:
output[b, i, j, k] =
    sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] * # reducing the image
                    filter[di, dj, q, k] # applying the filter (weights) through matrix multiplication.
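
Both descriptions above (the three steps and the summation formula) can be sketched in NumPy. This is an illustration under simplifying assumptions, not how TensorFlow implements it: there is no padding (equivalent to "VALID"), and the function names `conv2d_direct` and `conv2d_patches` are made up for this example.

```python
import numpy as np

def conv2d_direct(x, w, strides):
    """Literal translation of the summation formula, without padding."""
    batch, in_h, in_w, in_c = x.shape
    f_h, f_w, _, out_c = w.shape
    out_h = (in_h - f_h) // strides[1] + 1
    out_w = (in_w - f_w) // strides[2] + 1
    out = np.zeros((batch, out_h, out_w, out_c))
    for i in range(out_h):
        for j in range(out_w):
            for di in range(f_h):
                for dj in range(f_w):
                    # pixels is [batch, in_c]; w[di, dj] is [in_c, out_c].
                    # The dot product performs the sum over q.
                    pixels = x[:, strides[1] * i + di, strides[2] * j + dj, :]
                    out[:, i, j, :] += pixels.dot(w[di, dj])
    return out

def conv2d_patches(x, w, strides):
    """Same result via the flatten / extract patches / multiply steps."""
    batch, in_h, in_w, in_c = x.shape
    f_h, f_w, _, out_c = w.shape
    out_h = (in_h - f_h) // strides[1] + 1
    out_w = (in_w - f_w) // strides[2] + 1
    # Step 1: flatten the filter to [f_h * f_w * in_c, out_c].
    w_mat = w.reshape(f_h * f_w * in_c, out_c)
    # Step 2: collect patches into [batch, out_h, out_w, f_h * f_w * in_c].
    patches = np.empty((batch, out_h, out_w, f_h * f_w * in_c))
    for i in range(out_h):
        for j in range(out_w):
            top, left = strides[1] * i, strides[2] * j
            patch = x[:, top:top + f_h, left:left + f_w, :]
            patches[:, i, j, :] = patch.reshape(batch, -1)
    # Step 3: multiply every patch vector by the filter matrix.
    return patches.dot(w_mat)

rng = np.random.RandomState(0)
x = rng.rand(2, 8, 8, 1)   # [batch, in_height, in_width, in_channels]
w = rng.rand(3, 3, 1, 4)   # [filter_height, filter_width, in_channels, out_channels]
a = conv2d_direct(x, w, [1, 2, 2, 1])
b = conv2d_patches(x, w, [1, 2, 2, 1])
print(a.shape, np.allclose(a, b))
```

The two functions agree because flattening a patch and flattening the filter use the same (di, dj, q) ordering, so the matrix product computes exactly the sum in the formula.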
                    

In [62]:
import math

batch_size = 64 # was 16

# convolutional layers
patch_size = 10 # was 5, picked up 1% w/ 10
depth = 16 # how was this determined? not related to batch size?

# connected layers
con1_num_hidden = 128
con2_num_hidden = 64 # how was this determined?

# dropout
dropout = False
position1_keep_prob = 0.7
position2_keep_prob = 0.6

# learning
learning = True
start_learn = 0.1
learn_decay = 0.9
learn_step = 200
stair = True

graph = tf.Graph()

with graph.as_default():
    
    # Input data.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Variables. building 4D tensors w/ dims: [batch, height, width, channels] or [b, y, x, c]
    ## layer 1: convolutional. this is the target structure: 2d patch x patch, but w/ much greater depth.
    layer1_weights = tf.Variable(tf.truncated_normal([patch_size, patch_size, num_channels, depth], stddev=0.1))
    layer1_biases = tf.Variable(tf.zeros([depth])) 
    ## biases initialized w/ "depth" dimension
    
    ## layer 2: convolutional
    layer2_weights = tf.Variable(tf.truncated_normal([patch_size, patch_size, depth, depth], stddev=0.1))
    layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
    
    ## layer 3: fully connected net whose input is 1. flattened (2d) and 2. downsampled 4x per side (two 2x poolings)
    in_size = (image_size // 4) * (image_size // 4) * depth
    layer3_weights = tf.Variable(tf.truncated_normal([in_size, con1_num_hidden], stddev = math.sqrt(2.0/in_size)))
    layer3_biases = tf.Variable(tf.constant(1.0, shape=[con1_num_hidden]))
    
    ## layer 4: fully connected
    layer4_weights = tf.Variable(tf.truncated_normal([con1_num_hidden,con2_num_hidden],
                                                     stddev = math.sqrt(2.0/con1_num_hidden)))
    layer4_biases = tf.Variable(tf.constant(1.0, shape=[con2_num_hidden]))
    
    ## layer 5 or "output" layer maps from hidden nodes (64) to output nodes (10)
    layer5_weights = tf.Variable(tf.truncated_normal([con2_num_hidden, num_labels], 
                                                     stddev = math.sqrt(2.0/con2_num_hidden)))
    layer5_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    
    # Model.
    def model(data): # interesting that no inputs for regularization, dropout
        
        # could try pooling here.
        conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
        # telling it to build convnet taking input data record, mapping it to layer1.
        ## vector of dims seems to be communicating stride for x and y axis in center of 4d object.
        hidden = tf.nn.relu(conv + layer1_biases) ## conv is the convolution output, to which you add biases and run relu.
        
        pool = tf.nn.max_pool(hidden, ksize=[1,2,2,1], strides=[1,2,2,1], padding="SAME")     
        conv = tf.nn.conv2d(pool, layer2_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer2_biases)
        
        pool = tf.nn.max_pool(hidden, ksize=[1,2,2,1], strides=[1,2,2,1], padding="SAME") 
        shape = pool.get_shape().as_list() # was hidden
        reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]]) # collapsing dims 1-3 (sparse result?)
        
        if dropout:
            reshape = tf.nn.dropout(reshape,position1_keep_prob)
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        
        if dropout:
            hidden = tf.nn.dropout(hidden,position2_keep_prob)
        hidden = tf.nn.relu(tf.matmul(hidden, layer4_weights) + layer4_biases)
        
        return tf.matmul(hidden, layer5_weights) + layer5_biases
    
    # Training computation.
    logits = model(tf_train_dataset)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
    ## insert regularization here
    
    # Learning and Optimization
    if learning: 
        global_step = tf.Variable(0)  # count the number of steps taken.
        learning_rate = tf.train.exponential_decay(start_learn, global_step, learn_step, learn_decay, staircase=stair)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    else:
        optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(model(tf_valid_dataset))
    test_prediction = tf.nn.softmax(model(tf_test_dataset))
    
    ## inception:
    ## http://arxiv.org/pdf/1409.4842v1.pdf
    ## https://www.tensorflow.org/versions/0.6.0/tutorials/image_recognition/index.html
    

In [59]:
num_steps = 4001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :] # one more dimension. 4 total
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 250 == 0):
            print('Minibatch loss at step %d: %f' % (step, l))
            print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            print('Validation accuracy: %.1f%%' % accuracy(valid_prediction.eval(), valid_labels))
            
    print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 5.699503
Minibatch accuracy: 6.2%
Validation accuracy: 10.0%
Minibatch loss at step 250: 0.565993
Minibatch accuracy: 84.4%
Validation accuracy: 79.7%
Minibatch loss at step 500: 0.808567
Minibatch accuracy: 68.8%
Validation accuracy: 83.0%
Minibatch loss at step 750: 0.662197
Minibatch accuracy: 82.8%
Validation accuracy: 83.8%
Minibatch loss at step 1000: 0.500157
Minibatch accuracy: 85.9%
Validation accuracy: 84.7%
Minibatch loss at step 1250: 0.317431
Minibatch accuracy: 90.6%
Validation accuracy: 85.7%
Minibatch loss at step 1500: 0.639241
Minibatch accuracy: 78.1%
Validation accuracy: 85.9%
Minibatch loss at step 1750: 0.529561
Minibatch accuracy: 81.2%
Validation accuracy: 86.2%
Minibatch loss at step 2000: 0.187477
Minibatch accuracy: 93.8%
Validation accuracy: 86.3%
Minibatch loss at step 2250: 0.506581
Minibatch accuracy: 82.8%
Validation accuracy: 86.5%
Minibatch loss at step 2500: 0.329775
Minibatch accuracy: 90.6%
Validation accuracy: 

---
Problem 1
---------

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides by a max pooling operation (`nn.max_pool()`) of stride 2 and kernel size 2.
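
For intuition about what a max pooling operation of stride 2 and kernel size 2 computes, here is a NumPy sketch. The function name `max_pool_2x2` is hypothetical, and the sketch assumes the height and width are even, so no padding is needed.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a [batch, height, width, channels] array.

    Assumes height and width are even.
    """
    batch, h, w, c = x.shape
    # Split each spatial axis into (number of blocks, 2), then take the
    # maximum inside every 2x2 block.
    blocks = x.reshape(batch, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))

x = np.arange(16).reshape(1, 4, 4, 1)
print(max_pool_2x2(x)[0, :, :, 0])
```

Unlike a stride-2 convolution, which skips every other position, pooling looks at every activation and keeps the largest one in each 2x2 window.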

---

1. baseline conv2 w/ stride 2: 88.x%
2. pre-process 1st conv w/ max pool of stride = 2 and kernel = 2: 89%
3. avg_pool: 89.4%
4. boosted patch size to 10 from 5: 89.7%
5. re-configured pool. now two pools after each convolution: 89.8%
6. three pools: 90.2%
7. dropout position 1, 0.5 keep prob: 86.7%
8. dropout at position 2, 0.5 keep prob: 80.4% - CONCLUSION: retry dropout in position1 when you get to more records.
9. learning (0.25, 0.75, 100, True): 10%
    100 0.25 * 0.75^(100/100) = 0.188
    ...
    1000 0.25 * 0.75^(1000/100) = 0.014
    CONCLUSION: learning unstable. retry more gradual and/or boost steps.  
10. learning (0.1, 0.95, 200, True): 89%
    200 0.1 * 0.95^(200/200) = 0.095
    ...
    1000 0.1 * 0.95^(1000/200) = 0.077
11. doubled steps and smoothed learning: 91.8%
    200 0.1 * 0.90^(200/200) = 0.090
    ...
    2000 0.1 * 0.90^(2000/200) = 0.035
12. back to dropout, position 1, 0.7 keep prob. 92.2% - CONCLUSION: dropout is additive when you get to appropriate scale. key here is that validation set still improving at 2000 steps indicating that we're not overfitting. 
13. get baseline for dropout / learning before testing second dropout. 3000 steps, 64 batch size: 94.5% 
    NOTE: it's still underfitting.  
14. try w/ dropout position1 and position2: 93.3% - CONCLUSION: dropout before return hurts performance. likely b/c there's no opportunity for model to adjust to missing readings before it's input to next convolution w/ new data. 
15. adding 2nd fully connected layer. so, first layer 64->128, second layer 64: 94.7%
16. misc: 93.9%
    a. initialize using 2/N thing
    b. declining keep probs
    c. layer size: layer 1 128->320 (10*16*2):
    "net architectures usually sharpen towards the output layer so you force them to learn the main classification features, they are pyramid like, the more so for convolutional layers. The rest of the parameters was chosen by trial and error, choose whatever works better on the validation set."
    
17. backing up: layer size back to 128. 94.8%
18. patch back to 5: 93.8%
19. 2nd convolution from 16 to 32?
20. avg vs. max: 94.5%
21. pooling 3 or 4 vs 2?
22. 4000 steps
23. inception
24. using endri.deliu's model achieved 95.3% on 5,001 steps.
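
The learning-rate arithmetic in the notes above can be reproduced with a few lines of plain Python. This is a sketch of the documented formula behind `tf.train.exponential_decay`, namely `start * decay_rate ^ (step / decay_steps)` with integer division when `staircase` is true; the standalone function here is a stand-in, not the TensorFlow op.

```python
def exponential_decay(start, step, decay_steps, decay_rate, staircase=True):
    """Learning rate after `step` training steps."""
    if staircase:
        exponent = step // decay_steps          # rate drops in discrete jumps
    else:
        exponent = step / float(decay_steps)    # rate decays smoothly
    return start * decay_rate ** exponent

# Schedules tried in the notes: (start, decay_rate, decay_steps).
for start, rate, every in [(0.25, 0.75, 100), (0.1, 0.95, 200), (0.1, 0.90, 200)]:
    schedule = [exponential_decay(start, s, every, rate) for s in (0, 1000, 2000)]
    print(start, rate, every, ["%.3f" % lr for lr in schedule])
```

Printing the rate at a few steps before training makes it easy to spot a schedule that shrinks the learning rate too quickly.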


Optimization dimensions:
1. conv layers
    a. number
    b. size
    c. max/ave stride
    d. pooling
    e. 
2. nnet
    a. layers
    b. learning
    c. dropout
3. other
    a. batch size
    b. steps
    

@endri.deliu's ELU code. One thing I did play with was ELU neurons (as opposed to ReLU). ELUs are similar in shape to ReLUs, but they use an exponential function for negative inputs, so they can have negative values, and they really start shining in deeper nets. Using ELUs on my deep net of 5 conv layers and 2 fully connected layers made the model converge to the final result much sooner (about 2-3x faster), and I was able to achieve 93.4% on the validation set and 97.8% on the test set. ELUs make sense if you have a lot of convolutions and/or fully connected layers and are also cheaper computationally. My previous conv net was using batch normalization and got me about 97.7%. Using ELUs did speed things up significantly. The ELU paper: http://arxiv.org/abs/1511.07289. If anyone is interested I can post the code of my conv net. I am pretty sure with further tweaking of the hyperparameters you can get even better accuracy. Again, I had to use just 12k images (out of 18k) to calculate the test set error, otherwise my computer would throw an out of memory error.
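
To make the difference between the two activations concrete, here is a NumPy sketch of ReLU and ELU. It assumes the standard ELU definition with `alpha = 1` (x for positive inputs, `alpha * (exp(x) - 1)` otherwise); the function names are only for this example.

```python
import numpy as np

def relu(x):
    # Negative inputs are clipped to exactly zero.
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=np.float64)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # whose branch is discarded by np.where anyway.
    negative_branch = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x > 0, x, negative_branch)

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(xs))
print(elu(xs))
```

For positive inputs the two are identical. For negative inputs ELU saturates smoothly towards `-alpha` instead of cutting off at zero, so those units still pass a gradient.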

In [64]:
#Below the code using ELU-s. The batch normalization steps are commented out.

batch_size = 16
patch_size = 3
depth = 16
num_hidden = 705
num_hidden_last = 205

graph = tf.Graph()

with graph.as_default():
    
    # Input data.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Variables. 5 convolutional and 3 fully connected layers. # conv: 4D tensors [batch, height, width, channels]
    layerconv1_weights = tf.Variable(tf.truncated_normal( # batch = 16, patch = 3, channels = 1, depth = 16 
            [patch_size, patch_size, num_channels, depth], stddev=0.1)) 
            # so, each filter is a 3x3 patch over 1 channel, w/ depth 16 i.e., take each 3x3x1 patch and map it to 16 outputs.
    layerconv1_biases = tf.Variable(tf.zeros([depth]))
    
    # so, one key to his implementation is num_channels -> depth -> 2depth -> 4depth ->4depth, 
    # but then ending on 16depth. so, really focusing the information in each pixel into new dimension
    layerconv2_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, depth, depth * 2], stddev=0.1)) # so, this is how he modified depth in layer 2
            # only thing that changes. so, just forcing data into longer tube
    layerconv2_biases = tf.Variable(tf.zeros([depth * 2]))
    
    layerconv3_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, depth * 2, depth * 4], stddev=0.03)) # doubling depth again
    layerconv3_biases = tf.Variable(tf.zeros([depth * 4]))
    
    layerconv4_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, depth * 4, depth * 4], stddev=0.03)) # stable depth
    layerconv4_biases = tf.Variable(tf.zeros([depth * 4]))
    
    layerconv5_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, depth * 4, depth * 16], stddev=0.03)) # 4x depth jump. 256 output channels
    layerconv5_biases = tf.Variable(tf.zeros([depth * 16]))

    # the other is knowing what size to make the convolutional layer given later pooling.
    layer3_weights = tf.Variable(tf.truncated_normal( # first fully connected layer
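            [image_size // 7 * image_size // 7 * (depth * 4), num_hidden], stddev=0.03))
            # integer division so the size stays an int; evaluates to 1024 = 2 * 2 * 256, the flattened pool output.
            # the shape of each input (image) has changed from 784 to 1024 input nodes. 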
    layer3_biases = tf.Variable(tf.zeros([num_hidden]))
    
    layer4_weights = tf.Variable(tf.truncated_normal(
            [num_hidden, num_hidden_last], stddev=0.0532))
    layer4_biases = tf.Variable(tf.zeros([num_hidden_last]))
    
    layer5_weights = tf.Variable(tf.truncated_normal(
            [num_hidden_last, num_labels], stddev=0.1))
    layer5_biases = tf.Variable(tf.zeros([num_labels]))
    
    # Model.
    def model(data, use_dropout=False):
        # no initial pool here.
        conv = tf.nn.conv2d(data, layerconv1_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.elu(conv + layerconv1_biases)
        pool = tf.nn.max_pool(hidden, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        
        conv = tf.nn.conv2d(pool, layerconv2_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.elu(conv + layerconv2_biases)
        #pool = tf.nn.max_pool(hidden, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        
        conv = tf.nn.conv2d(hidden, layerconv3_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.elu(conv + layerconv3_biases)
        pool = tf.nn.max_pool(hidden, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        # norm1: Normalization is useful to prevent neurons from saturating when inputs may have 
        # varying scale, and to aid generalization.
        # norm1 = tf.nn.lrn(pool, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)
        
        conv = tf.nn.conv2d(pool, layerconv4_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.elu(conv + layerconv4_biases)
        pool = tf.nn.max_pool(hidden, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        # norm1 = tf.nn.lrn(pool, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)
        
        conv = tf.nn.conv2d(pool, layerconv5_weights, [1, 1, 1, 1], padding='SAME')
        hidden = tf.nn.elu(conv + layerconv5_biases)
        pool = tf.nn.max_pool(hidden, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        # norm1 = tf.nn.lrn(pool, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)
        
        shape = pool.get_shape().as_list()
        print(shape)
        reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
        hidden = tf.nn.elu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        
        if use_dropout:
            hidden = tf.nn.dropout(hidden, 0.75)
            
        nn_hidden_layer = tf.matmul(hidden, layer4_weights) + layer4_biases
        hidden = tf.nn.elu(nn_hidden_layer)
        
        if use_dropout:
            hidden = tf.nn.dropout(hidden, 0.75)
            
        return tf.matmul(hidden, layer5_weights) + layer5_biases
    
    # Training computation.
    logits = model(tf_train_dataset, True)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
    
    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.1, global_step, 3000, 0.86, staircase=True)
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(model(tf_valid_dataset))
    test_prediction = tf.nn.softmax(model(tf_test_dataset))

[16, 2, 2, 256]
[10000, 2, 2, 256]
[10000, 2, 2, 256]


In [68]:
num_steps = 5001 # was 95001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step", step, ":", l)
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            #print(time.ctime())
    print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0 : 2.31148
Minibatch accuracy: 0.0%
Validation accuracy: 10.0%
Minibatch loss at step 500 : 0.504193
Minibatch accuracy: 81.2%
Validation accuracy: 84.4%
Minibatch loss at step 1000 : 0.742254
Minibatch accuracy: 75.0%
Validation accuracy: 83.2%
Minibatch loss at step 1500 : 0.704914
Minibatch accuracy: 87.5%
Validation accuracy: 86.7%
Minibatch loss at step 2000 : 0.746453
Minibatch accuracy: 75.0%
Validation accuracy: 86.5%
Minibatch loss at step 2500 : 0.370096
Minibatch accuracy: 93.8%
Validation accuracy: 87.8%
Minibatch loss at step 3000 : 0.458386
Minibatch accuracy: 87.5%
Validation accuracy: 87.5%
Minibatch loss at step 3500 : 0.255316
Minibatch accuracy: 87.5%
Validation accuracy: 88.8%
Minibatch loss at step 4000 : 0.257595
Minibatch accuracy: 93.8%
Validation accuracy: 89.1%
Minibatch loss at step 4500 : 0.286413
Minibatch accuracy: 93.8%
Validation accuracy: 88.9%
Minibatch loss at step 5000 : 0.675464
Minibatch accuracy: 75.0%
Validatio

In [None]:
# End nn used two convolution layers, then the inception module, then three fully connected layers. 
# Result was 96.8% on the test set. Happy to hear if this is set up correct, thanks.
# https://discussions.udacity.com/t/assignment-4-problem-2/46525/41

def inception_layer1(data):
    # Inception 1x1
    conv_1x1 = tf.nn.conv2d(data, inception_1x1_weights, [1, 1, 1, 1], padding='SAME')
    conv_1x1 = tf.nn.relu(conv_1x1 + inception_1x1_biases)
    ## 1x1 - before the bigger patches
    conv_pre = tf.nn.conv2d(data, pre_inception_1x1_weights, [1, 1, 1, 1], padding='SAME')
    conv_pre = tf.nn.relu(conv_pre + pre_inception_1x1_biases)
    # Pooling 3x3
    ## average pool followed by a 1x1
    conv_pool = tf.nn.avg_pool(data, [1, 3, 3, 1], [1, 1, 1, 1], padding='SAME')
    conv_pool = tf.nn.conv2d(conv_pool, inception_1x1_pool_weights, [1, 1, 1, 1], padding='SAME')
    conv_pool = tf.nn.relu(conv_pool + inception_1x1_pool_biases)
    # Inception 3x3
    ## 1x1 followed by a 3x3 (i actually read it in his voice)
    conv_3x3 = tf.nn.conv2d(conv_pre, inception_3x3_weights, [1, 1, 1, 1], padding='SAME')
    conv_3x3 = tf.nn.relu(conv_3x3 + inception_3x3_biases)
    # Inception 5x5
    ## 1x1 followed by a 5x5
    conv_5x5 = tf.nn.conv2d(conv_pre, inception_5x5_weights, [1, 1, 1, 1], padding='SAME')
    conv_5x5 = tf.nn.relu(conv_5x5 + inception_5x5_biases)
    return tf.concat(3, [conv_1x1, conv_3x3, conv_5x5, conv_pool])
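
The final `tf.concat(3, ...)` joins the four branches along the channel axis, which only works if every branch keeps the same batch size, height, and width (hence stride 1 and `padding='SAME'` throughout). A NumPy sketch of the shape bookkeeping, with hypothetical channel counts since the weight shapes are not shown above:

```python
import numpy as np

batch, height, width = 16, 7, 7

# Hypothetical output depth of each branch; the real values depend on the
# inception_*_weights variables, which are not defined in this notebook.
branch_channels = [("conv_1x1", 16), ("conv_3x3", 32),
                   ("conv_5x5", 8), ("conv_pool", 8)]

branches = [np.zeros((batch, height, width, channels))
            for _, channels in branch_channels]

# Axis 3 is the channel axis of a [batch, height, width, channels] array.
merged = np.concatenate(branches, axis=3)
print(merged.shape)
```

The merged depth is the sum of the branch depths, so the layer that follows the inception module must expect that many input channels.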

---
Problem 2
---------

Try to get the best performance you can using a convolutional net. Look for example at the classic [LeNet5](http://yann.lecun.com/exdb/lenet/) architecture, adding Dropout, and/or adding learning rate decay.
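
As a reminder of what dropout does to the activations, here is a NumPy sketch of inverted dropout, the variant where kept values are scaled by `1 / keep_prob` (which is also how `tf.nn.dropout` behaves). The function name and the random generator argument are assumptions made for this example.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob and rescale the rest."""
    mask = rng.uniform(size=x.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged, so no
    # rescaling is needed when dropout is switched off at evaluation time.
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
activations = np.ones(1000)
dropped = dropout(activations, keep_prob=0.7, rng=rng)
print(np.unique(dropped), dropped.mean())
```

Because of the rescaling, the validation and test predictions can use the same weights with dropout simply left out, which is how the models in this notebook are evaluated.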

---