# Lesson 11: Convolutional Networks

## 6. Filters

### Breaking up an Image
The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and height that defines a filter.

The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.

<img src="./screenshots/conv1.png" width='500'>

As shown in the previous video, a CNN uses __filters__ to split an image into smaller __patches__. The size of these patches matches the filter size.

We then simply slide this filter horizontally or vertically to focus on a different piece of the image.

The amount by which the filter slides is referred to as the __'stride'__. The stride is a hyperparameter which you, the engineer, can tune. Increasing the stride reduces the size of your model by reducing the number of total patches each layer observes. However, this usually comes with a reduction in accuracy.

Let’s look at an example. In this zoomed in image of the dog, we first start with the patch outlined in red. The width and height of our filter define the size of this square.

<img src="./screenshots/retriever-patch.png" width='300'>
One patch of the Golden Retriever image.

We then move the square over to the right by a given stride (2 in this case) to get another patch.

<img src="./screenshots/retriever-patch-shifted.png" width='300'>
We move our square to the right by two pixels to create another patch.

What's important here is that we are grouping together adjacent pixels and treating them as a collective.

In a normal, non-convolutional neural network, we would have ignored this adjacency. In a normal network, we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning.

By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.

### Filter Depth
It's common to have __more than one filter__. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for a kind of object of a specific shape. The amount of filters in a convolutional layer is called the _filter depth_.


<img src="./screenshots/neilsen-pic.png" width='400'>
In the above example, a patch is connected to a neuron in the next layer. Source: MIchael Nielsen.

How many neurons does each patch connect to?

That’s dependent on our filter depth. If we have a depth of __k__, __we connect each patch of pixels to k neurons in the next layer__. This gives us the height of k in the next layer, as shown below. In practice, k is a hyperparameter we tune, and most CNNs tend to pick the same starting values.

<img src="./screenshots/filter-depth.png" width='100'>
Choosing a filter depth of `k` connects each patch to `k` neurons in the next layer.

But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture.

For example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue. In that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue.

<img src="./screenshots/teeth-whiskers-tongue.png" width='300'>
This patch of the dog has many interesting features we may want to capture. These include the presence of teeth, the presence of whiskers, and the pink color of the tongue.

Having multiple neurons for a given patch ensures that our CNN can learn to capture whatever characteristics the CNN learns are important.

Remember that the CNN isn't "programmed" to look for certain characteristics. Rather, it learns on its own which characteristics to notice.

## 9. Parameters

### Parameter Sharing

<img src="./screenshots/convnets.png" width='400'>

The weights, `w`, are shared across patches for a given layer in a CNN to detect the cat above regardless of where in the image it is located.
The weights, w, are shared across patches for a given layer in a CNN to detect the cat above regardless of where in the image it is located.
When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?

As we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch.

If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.

### Padding

<img src="./screenshots/55_2.png" width='400'>

A `5x5` grid with a `3x3` filter. Source: Andrej Karpathy.

Let's say we have a 5x5 grid (as shown above) and a filter of size 3x3 with a stride of 1. What's the width and height of the next layer? We see that we can fit at most three patches in each direction, giving us a dimension of 3x3 in our next layer. As we can see, the width and height of each subsequent layer decreases in such a scheme.

In an ideal world, we'd be able to maintain the same width and height across layers so that we can continue to add layers without worrying about the dimensionality shrinking and so that we have consistency. How might we achieve this? One way is to simply add a border of 0s to our original 5x5 image. You can see what this looks like in the below image.


<img src="./screenshots/55.png" width='400'>
The same grid with `0` padding. Source: Andrej Karpathy.

This would expand our original image to a 7x7. With this, we now see how our next layer's size is again a 5x5, keeping our dimensionality consistent.

### Dimensionality

From what we've learned so far, how can we calculate the number of neurons of each layer in our CNN?

Given:

- our input layer has a width of W and a height of H
- our convolutional layer has a filter size F
- we have a stride of S
- a padding of P
- and the number of filters K,

the following formula gives us the __width of the next layer__: $W_{out} = \frac{(W+2P−width_{filter})}S + 1$.

The __output height__ would be $H_{out} = \frac{(H+2P−height_{filter})}S + 1$.

And the __output depth__ would be equal to the number of filters `D_out = K`.

The __output volume__ would be `W_out * H_out * D_out`.

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network.

## 10. Quiz: Convolution Output Shape

Introduction
For the next few quizzes we'll test your understanding of the dimensions in CNNs. Understanding dimensions will help you make accurate tradeoffs between model size and performance. As you'll see, some parameters have a much bigger impact on model size than others.

Setup
- H = height
- W = width
- D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- With padding of size 1 (P)


__What's the shape of the output?__
The answer format is HxWxD, so if you think the new height is 9, new width is 9, and new depth is 5, then type 9x9x5.

In [2]:
H = W = 32
K = 20
F_h = F_w = 8
S = 2
P = 1

h_out = (H + 2*P - F_h) / S +1
print(h_out)
w_out = (H  + 2*P - F_w) / S +1
print(w_out)
depth = K
print(depth)

14.0
14.0
20


This would correspond to the following code:



In [3]:
import tensorflow as tf
input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1] # (batch, height, width, depth)
padding = 'SAME'
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias

Note the output shape of conv will be `[1, 16, 16, 20]`. It's 4D to account for batch size, but more importantly, it's not [1, 14, 14, 20]. This is because the padding algorithm TensorFlow uses is not exactly the same as the one above. An alternative algorithm is to switch __padding__ from __'SAME'__ to __'VALID'__ which would result in an output shape of `[1, 13, 13, 20]`. If you're curious how padding works in TensorFlow, read [this document](https://www.tensorflow.org/api_guides/python/nn#Convolution).

In summary TensorFlow uses the following equation for 'SAME' vs 'PADDING'

__SAME Padding__, the output height and width are computed as:

$out_{height} = ceil(float(in_{height}) / float(strides1))$

$out_{width} = ceil(float(in_{width}) / float(strides[2]))$

__VALID Padding__, the output height and width are computed as:

$out_{height} = ceil(float(in_{height} - filter_{height} + 1) / float(strides1))$

$out_{width} = ceil(float(in_{width} - filter_{width} + 1) / float(strides[2]))$

## Quiz: Number of Parameters

We're now going to calculate the number of parameters of the convolutional layer. The answer from the last quiz will come into play here!

Being able to calculate the number of parameters in a neural network is useful since we want to have control over how much memory a neural network uses.

Setup
- H = height
- W = width
- D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)
- Output Layer 14x14x20 (HxWxD)

__Hint:__
Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. In addition, each neuron in the output layer must also connect to a single bias neuron.

__How many parameters does the convolutional layer have (without parameter sharing)?__

In [5]:
(8 * 8 * 3 + 1) * (14 * 14 * 20)

756560

Now we'd like you to calculate the number of parameters in the convolutional layer, if every neuron in the output layer shares its parameters with every other neuron in its same channel.

This is the number of parameters actually used in a convolution layer (tf.nn.conv2d()).

__Hint__
With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel. So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer.

__How many parameters does the convolution layer have (with parameter sharing)?__

In [6]:
(8 * 8 * 3 + 1) * 20

3860

That's 3840 weights and 20 biases. This should look similar to the answer from the previous quiz. The difference being it's just 20 instead of $(14 * 14 * 20)$. Remember, with weight sharing we use the same filter for an entire depth slice. Because of this we can get rid of 14 * 14 and be left with only 20.

## 17. TensorFlow Convolution Layer

Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions to create your own convolutional layers.

In [10]:
import tensorflow as tf
# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)

The code above uses the `tf.nn.conv2d()` function to compute the convolution with weight as the filter and `[1, 2, 2, 1]` for the strides. TensorFlow uses a stride for each input dimension, `[batch, input_height, input_width, input_channels]`. We are generally always going to set the stride for __batch__ and __input_channels__ (i.e. the first and fourth element in the strides array) to be 1.

You'll focus on changing __input_height__ and __input_width__ while setting batch and input_channels to 1. The input_height and input_width strides are for striding the filter over input. This example code uses a stride of 2 with 5x5 filter over input.

The `tf.nn.bias_add()` function adds a 1-d bias to the last dimension in a matrix.

## 18. Explore the design space

Generally, one build a NN for image recognition like so:

<img src="./screenshots/pyra.png" width='300'>




## 19. TensorFlow Max Pooling

<img src="./screenshots/max-pooling.png" width='300'>
By Aphex34 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

The image above is an example of max pooling with a 2x2 filter and stride of 2. The four 2x2 colors represent each time the filter was applied to find the maximum value.

For example, `[[1, 0], [4, 6]]` becomes 6, because 6 is the maximum value in this set. Similarly, `[[2, 3], [6, 8]]` becomes 8.

Conceptually, the benefit of the max pooling operation is to __reduce the size of the input__, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.

TensorFlow provides the __`tf.nn.max_pool()`__ function to apply max pooling to your convolutional layers.
```
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)

# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')
```
The __`tf.nn.max_pool()`__ function performs max pooling with the ksize parameter as the size of the filter and the strides parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`). For both ksize and strides, the batch and channel dimensions are typically set to 1.

## Pooling Intuition

A pooling layer is generally used to ...
- Decrease the size of the ouput
- prevent overfitting

__Solution__
The correct answer is decrease the size of the output and prevent overfitting. Preventing overfitting is a consequence of reducing the output size, which in turn, reduces the number of parameters in future layers.
Recently, __pooling layers have fallen out of favor__.

Some reasons are:
- Recent datasets are so big and complex we're more concerned about underfitting.
- Dropout is a much better regularizer.
- Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.

## Pooling Mechanics 

Setup
H = height, W = width, D = depth

- We have an input of shape 4x4x5 (HxWxD)
- Filter of shape 2x2 (HxW)
- A stride of 2 for both the height and width (S)

Recall the formula for calculating the new height or width:
```
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1
```

__NOTE:__ For a pooling layer the output depth is the same as the input depth. Additionally, the pooling operation is applied individually for each depth slice.

The image below gives an example of how a max pooling layer works. In this case, the max pooling filter has a shape of 2x2. As the max pooling filter slides across the input layer, the filter will output the maximum value of the 2x2 square.

<img src="./screenshots/convolutionalnetworksquiz.png" width='500'>

__What's the shape of the output? Format is HxWxD.__

In [12]:
input_height = 4
input_width = 4
filter_height = 2
filter_width = 2
S = 2
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1

print(new_height)
print(new_width)

2.0
2.0


Here's the corresponding code:
```
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)
```
The output shape of pool will be `[1, 2, 2, 5]`, even if padding is changed to `'SAME'`.

In [13]:
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)

In [17]:
print(pool.shape)

(?, 2, 2, 5)


## Pooling Practice

__Max Pooling__

What's the result of a max pooling operation on the input:
```
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```   
Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

The answering format will be 4 numbers, each separated by a comma, such as: 1,2,3,4.

Work from the top left to the bottom right

2.5,10,15,6

__Mean Pooling__

What's the result of a average (or mean) pooling?
```
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
``` 

Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

The answering format will be 4 numbers, each separated by a comma, such as: 1,2,3,4.

Answer to 3 decimal places. Work from the top left to the bottom right


In [24]:
import numpy as np
print(np.mean([0,1,2,2.5]), np.mean([0.5, 10, 1, -8]), np.mean([4, 0, 15, 1]), np.mean([5, 6, 2, 3]))

1.375 0.875 5.0 4.0


## 28. 1x1 Convolutions

<img src="./screenshots/1x1.png" width='500'>

Add a 1x1 convolution in the middle introduces a mini-neural network running over the patch, instead of only a linear classifier

## 29. Inception Module

<img src="./screenshots/avpool.png" width='500'>

instead of deciding between pooling or different convolutions, we can simply use them all together!

## 30. Convolutional Network in TensorFlow

It's time to walk through an example Convolutional Neural Network (CNN) in TensorFlow.

The structure of this network follows the classic structure of CNNs, which is a mix of convolutional layers and max pooling, followed by fully-connected layers.

The code you'll be looking at is similar to what you saw in the segment on Deep Neural Network in TensorFlow, except we restructured the architecture of this network as a CNN.

Just like in that segment, here you'll study the line-by-line breakdown of the code. If you want, you can even download the code and run it yourself.

Thanks to Aymeric Damien for providing the original TensorFlow model on which this segment is based.

Time to dive in!

### Dataset

You've seen this section of code from previous lessons. Here we're importing the MNIST dataset and using a convenient TensorFlow function to batch, scale, and One-Hot encode the data.

In [8]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

#Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

#Number of samples to calculate validation and accuracy
# Decrease this if you're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10   #MNIST total classes (0-9 digits)
dropout = 0.75   #Dropout, probability to keep units

Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz


### Weights and Biases

In [9]:
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}

### Convolutions

<img src="./screenshots/convolution-schematic.gif" weight='200'>

Convolution with 3×3 Filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

__The above is an example of a convolution with a 3x3 filter and a stride of 1 being applied to data with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight, [[1, 0, 1], [0, 1, 0], [1, 0, 1]], then a bias is added to create the convolved feature on the right. In this case, the bias is zero. In TensorFlow, this is all done using tf.nn.conv2d() and tf.nn.bias_add().__

In [10]:
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

The `tf.nn.conv2d()` function computes the convolution against weight W as shown above.

In TensorFlow, strides is an array of 4 elements; the first element in this array indicates the stride for batch and last element indicates stride for features. It's good practice to remove the batches or features you want to skip from the data set rather than use a stride to skip them. You can always set the first and last element to 1 in strides in order to use all batches and features.

The middle two elements are the strides for height and width respectively. I've mentioned stride as one number because you usually have a square stride where height = width. When someone says they are using a stride of 3, they usually mean tf.nn.conv2d(x, W, strides=[1, 3, 3, 1]).

To make life easier, the code is using tf.nn.bias_add() to add the bias. Using tf.add() doesn't work when the tensors aren't the same shape.

### Max Pooling

In [11]:
def maxpool2d(x, k=2):
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

The __`tf.nn.max_pool()`__ function does exactly what you would expect, it performs max pooling with the ksize parameter as the size of the filter.

### Model
In the code below, we're creating 3 layers alternating between convolutions and max pooling followed by a fully connected and output layer. The transformation of each layer to new dimensions are shown in the comments. For example, the first layer shapes the images from 28x28x1 to 28x28x32 in the convolution step. Then next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from conv1 to output, producing 10 class predictions.

In [12]:
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

### Session
Now let's run it!

In [13]:
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(\
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf. global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3} -'
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))

Epoch  1, Batch   1 -Loss: 69177.2344 Validation Accuracy: 0.097656
Epoch  1, Batch   2 -Loss: 48730.2891 Validation Accuracy: 0.109375
Epoch  1, Batch   3 -Loss: 38989.3086 Validation Accuracy: 0.125000
Epoch  1, Batch   4 -Loss: 29837.5469 Validation Accuracy: 0.117188
Epoch  1, Batch   5 -Loss: 28970.7383 Validation Accuracy: 0.140625
Epoch  1, Batch   6 -Loss: 23875.5820 Validation Accuracy: 0.156250
Epoch  1, Batch   7 -Loss: 22889.8223 Validation Accuracy: 0.167969
Epoch  1, Batch   8 -Loss: 19823.1172 Validation Accuracy: 0.179688
Epoch  1, Batch   9 -Loss: 19118.0566 Validation Accuracy: 0.171875
Epoch  1, Batch  10 -Loss: 19895.7812 Validation Accuracy: 0.183594
Epoch  1, Batch  11 -Loss: 17712.9219 Validation Accuracy: 0.203125
Epoch  1, Batch  12 -Loss: 18428.0254 Validation Accuracy: 0.234375
Epoch  1, Batch  13 -Loss: 15273.7334 Validation Accuracy: 0.238281
Epoch  1, Batch  14 -Loss: 18849.2031 Validation Accuracy: 0.234375
Epoch  1, Batch  15 -Loss: 17513.8125 Validation

KeyboardInterrupt: 