# Convolutional Neural Networks (ConvNets or CNNs)

> **Convolutional Networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.** ~ Deep Learning Book


# 1. Intuition
Let's develop better intuition for how Convolutional Neural Networks (CNNs) work. We'll examine how humans classify images, and then see how CNNs use similar approaches.

Let’s say we wanted to classify the following image of a dog as a Golden Retriever.
<img src="assets/golden.jpg"  width="200" height="40" alt="">

As humans, how do we do this?

One thing we do is identify certain parts of the dog, such as the nose, the eyes, and the fur. We essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those pieces to get an idea of the overall dog.

In this case, we might break down the image into a combination of the following:

- A nose
- Two eyes
- Golden fur


These pieces can be seen below:
<img src="assets/eye.png"  width="200" height="40" alt="">
<img src="assets/nose.png"  width="200" height="40" alt="">
<img src="assets/fur.png"  width="200" height="40" alt="">

**Going One Step Further**

But let’s take this one step further. How do we determine what exactly a nose is? A Golden Retriever nose can be seen as an oval with two black holes inside it. Thus, one way of classifying a Retriever’s nose is to break it up into smaller pieces and look for black holes (nostrils) and curves that define an oval, as shown below.
<img src="assets/curve.png"  width="200" height="40" alt="">
<img src="assets/nostril.png"  width="200" height="40" alt="">

Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves, then shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN classifies the image by combining the larger, more complex objects.

In our case, the levels in the hierarchy are:

- Simple shapes, like ovals and dark circles
- Complex objects (combinations of simple shapes), like eyes, nose, and fur
- The dog as a whole (a combination of complex objects)

With deep learning, we don't actually program the CNN to recognize these specific features. Rather, the CNN learns on its own to recognize such objects through forward propagation and backpropagation!

It's amazing how well a CNN can learn to classify images, even though we never program the CNN with information about specific features to look for.
![](assets/heirarchy-diagram.jpg)

A CNN might have several layers, and each layer might capture a different level in the hierarchy of objects. The first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of colors. The subsequent layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes (combinations of lines), and eventually full objects like dogs.

Once again, the CNN **learns all of this on its own**. We don't ever have to tell the CNN to go looking for lines or curves or noses or fur. The CNN just learns from the training set and discovers which characteristics of a Golden Retriever are worth looking for.

That's a good start! Hopefully you've developed some intuition about how CNNs work.


# 2. Architecture Overview

Convolutional Neural Networks (CNNs) take advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth.**

1) ![](assets/neural_net2.jpeg)
<br><br>
2) ![](assets/cnn.jpeg)

> 1: A regular 3-layer Neural Network. 2: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).


# 3. Layers used to build ConvNets
A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through 
a differentiable function. 

ConvNet architectures are built from three main types of layers:
- **Convolutional Layer**
- **Pooling Layer**
- **Fully-Connected Layer**

*Example Architecture:*

A simple ConvNet for [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) Classification could have the architecture 
**[INPUT - CONV - RELU - POOL - FC].**

- INPUT[32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R, G, B.
- CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters. (Filters are explained in a later section.)
- RELU layer will apply an elementwise activation function, such as the $\max(0, x)$ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
- FC (fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, one for each of the 10 categories of [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html).

Notice that some layers contain parameters and others don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons).

On the other hand, RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.
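To make the volume bookkeeping concrete, here is a minimal sketch in plain Python that traces the shapes through the example [INPUT - CONV - RELU - POOL - FC] stack. The helper name `trace_shapes` and its defaults (a convolution that preserves width and height, 2x2 pooling with a stride of 2) are illustrative assumptions chosen to reproduce the sizes quoted above; they are not part of any library.

```python
def trace_shapes(input_shape=(32, 32, 3), n_filters=12, pool=2, n_classes=10):
    """Return (layer name, output shape, has_parameters) for each layer.

    Assumes the CONV layer pads its input so width and height are preserved,
    and that POOL divides width and height by `pool`.
    """
    h, w, d = input_shape
    layers = [("INPUT", (h, w, d), False)]
    # CONV: one output channel per filter, spatial size preserved (assumption).
    layers.append(("CONV", (h, w, n_filters), True))
    # RELU: elementwise max(0, x), so the shape is unchanged.
    layers.append(("RELU", (h, w, n_filters), False))
    # POOL: downsample along width and height only.
    layers.append(("POOL", (h // pool, w // pool, n_filters), False))
    # FC: one score per class.
    layers.append(("FC", (1, 1, n_classes), True))
    return layers


shapes = trace_shapes()
for name, shape, has_params in shapes:
    print(f"{name:5s} -> {shape}  parameters: {has_params}")
```

Only the CONV and FC entries are flagged as having parameters, which matches the distinction drawn above.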

In summary:


- A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)

![](assets/convnet.jpeg)

# 4. Convolutional Layer
The Convolutional layer is the core building block of a ConvNet that does most of the computational heavy lifting.

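The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.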





# 5. Filters
The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and height that define a filter.

The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.
![](assets/filter.png)


We then simply slide this filter horizontally or vertically to focus on a different piece of the image. 

The amount by which the filter slides is referred to as the **stride**. The stride is a **hyperparameter** which you, the engineer, can tune. Increasing the stride reduces the size of your model by reducing the total number of patches each layer observes. However, this usually comes with a reduction in accuracy.

Example:

We first start with the patch outlined in red. The width and height of our filter define the size of this square.

<img src="assets/dog.png"  width="500" height="400" alt="">


We then move the square over to the right by a given stride (2 in this case) to get another patch.

<img src="assets/dog2.png"  width="500" height="400" alt="">

What is important here is that we are **grouping together adjacent pixels** and treating them as a collective.

In a normal, non-convolutional network, we would have ignored this adjacency: we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning.

By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.
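The sliding-window idea can be sketched in a few lines of plain Python. The function name `patch_origins` and the 6x6 image size are hypothetical, chosen only for illustration; the sketch lists the top-left corner of every patch a filter visits when no padding is used.

```python
def patch_origins(image_height, image_width, filter_height, filter_width, stride):
    """List the (row, col) top-left corner of each patch the filter visits.

    No padding is applied, so the filter must fit entirely inside the image.
    """
    origins = []
    for row in range(0, image_height - filter_height + 1, stride):
        for col in range(0, image_width - filter_width + 1, stride):
            origins.append((row, col))
    return origins


# A 6x6 image scanned with a 2x2 filter.
stride_1 = patch_origins(6, 6, 2, 2, stride=1)
stride_2 = patch_origins(6, 6, 2, 2, stride=2)
print(len(stride_1), len(stride_2))  # prints: 25 9
```

Doubling the stride cuts the number of patches from 25 to 9 in this example, which is why a larger stride yields a smaller next layer.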

### Filter Depth
It's common to have more than one filter. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for a kind of object of a specific shape. The number of filters in a ConvNet layer is called the **filter depth**.
![](assets/depth.png)
In the above example, a patch is connected to a neuron in the next layer. Source: Michael Nielsen.


**How many neurons does each patch connect to?**

That's dependent on our filter depth. If we have a depth of **k**, we connect each patch of pixels to **k** neurons in the next layer. This gives the next layer a depth of **k**, as shown below. In practice, **k** is a hyperparameter we tune, and most ConvNets tend to pick the same starting values.

<img src="assets/filter-depth.png"  width="300" height="250" alt="">

But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture.

For example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue. In that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue.

![](assets/teeth-whiskers-tongue.png)

This patch of the dog has many interesting features we may want to capture. These include the presence of teeth, the presence of whiskers, and the pink color of the tongue.
Having multiple neurons for a given patch ensures that our ConvNet can learn to capture whatever characteristics the CNN learns are important.

Remember that the ConvNet isn't "programmed" to look for certain characteristics. Rather, it learns on its own which characteristics to notice.
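As a rough illustration of filter depth, the sketch below applies k = 3 hand-written filters to one 2x2 grayscale patch and gets one number per filter. The patch values, the filters, and the function name `apply_filters` are made up for this example; in a real ConvNet the filter weights are learned.

```python
def apply_filters(patch, filters):
    """Return one activation per filter: the dot product of filter and patch."""
    activations = []
    for weights in filters:
        total = 0.0
        for patch_row, weight_row in zip(patch, weights):
            for pixel, weight in zip(patch_row, weight_row):
                total += pixel * weight
        activations.append(total)
    return activations


# Hypothetical 2x2 patch: bright on the left, dark on the right.
patch = [[1.0, 0.0],
         [1.0, 0.0]]

# Three hypothetical filters, so the filter depth k is 3.
filters = [
    [[1.0, -1.0], [1.0, -1.0]],    # responds to a left/right brightness change
    [[1.0, 1.0], [-1.0, -1.0]],    # responds to a top/bottom brightness change
    [[0.25, 0.25], [0.25, 0.25]],  # responds to overall brightness
]

activations = apply_filters(patch, filters)
print(activations)  # prints: [2.0, 0.0, 0.5]
```

The same patch produces three different responses, one per filter, which is what gives the next layer its depth of k.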

**Some useful formulas to calculate feature map sizes** 

SAME padding equation:
```python
out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))
```
VALID padding equation:
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```
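The two equations above can be wrapped in a small helper for quick checks. This is a sketch only: `conv_output_size` is a hypothetical name, it handles one spatial dimension at a time, and the sizes in the usage lines are arbitrary.

```python
from math import ceil


def conv_output_size(in_size, filter_size, stride, padding):
    """Output height (or width) of a convolution for 'SAME' or 'VALID' padding."""
    if padding == "SAME":
        return ceil(in_size / stride)
    if padding == "VALID":
        return ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# A 10-pixel-wide input, a 5-pixel-wide filter, and a stride of 2.
same_size = conv_output_size(10, 5, 2, "SAME")
valid_size = conv_output_size(10, 5, 2, "VALID")
print(same_size, valid_size)  # prints: 5 3
```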

# 6. Parameters
Convolution leverages three important ideas that can help improve a machine learning system:
- Sparse interactions
- Parameter sharing
- Equivariant representations


Traditional NN layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. CNNs, however, typically have **sparse interactions**. This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels.

Therefore, we can store fewer parameters and reduce memory requirements of the model.

### Parameter Sharing
<img src="assets/kitten.png"  width="400" height="300" alt="">


**Parameter sharing** refers to using the same parameter for more than one function in a model. In a CNN, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). This reduces the storage requirements of the model to $k$ parameters, where $k$ is the number of elements in the kernel.


In a CNN, the particular form of parameter sharing causes the layer to have a property called **equivariance** to translation. A function is equivariant if, when the input changes, the output changes in the same way. For example, if a kitten moves from the middle of a picture to a corner, the features the layer detects for the kitten move to the corresponding corner of its output. Because the same filter is applied everywhere, the CNN can recognize the kitten as a kitten regardless of where it appears.

When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?


If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.

**Note:**
- Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
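Equivariance to translation can be checked on a tiny one-dimensional example. The sketch below is plain Python with made-up values; `correlate_1d` and `shift_right` are hypothetical helper names. It slides one shared kernel over a signal, then repeats the operation on a shifted copy of the signal.

```python
def correlate_1d(signal, kernel):
    """Slide the kernel over the signal (stride 1, no padding)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def shift_right(values, steps):
    """Shift a list to the right, filling the vacated positions with zeros."""
    return [0] * steps + values[:len(values) - steps]


kernel = [1, 2, 1]                       # the same weights are reused at every position
signal = [0, 0, 3, 5, 0, 0, 0, 0, 0, 0]  # a small "feature" away from the borders

output = correlate_1d(signal, kernel)
output_of_shifted = correlate_1d(shift_right(signal, 2), kernel)

print(output)             # prints: [3, 11, 13, 5, 0, 0, 0, 0]
print(output_of_shifted)  # prints: [0, 0, 3, 11, 13, 5, 0, 0]
```

Shifting the input by two positions shifts the output by two positions. The match is exact here because the feature stays away from the borders; near the edges the two results can differ.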

### Padding

**A 5x5 grid with a 3x3 filter**
![](assets/padding.png)


Image Courtesy: Andrej Karpathy.

Let's say we have a 5x5 grid and a filter of size 3x3 with a stride of 1. What's the width and height of the next layer? We see that we can fit at most three patches in each direction, giving us a dimension of 3x3 in our next layer. As we can see, the width and height of each subsequent layer decreases in such a scheme.


In an ideal world, we'd be able to maintain the same width and height across layers so that we can continue to add layers without worrying about the dimensionality shrinking and so that we have consistency. How might we achieve this? One way is to simply add a border of 0s to our original 5x5 image.
Below is the same grid with 0 padding:

![](assets/padding2.png)

Image Courtesy: Andrej Karpathy.


This would expand our original image to a 7x7. With this, we now see how our next layer's size is again a 5x5, keeping our dimensionality consistent.
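The effect of the zero border can be reproduced with a short plain-Python sketch. The helper names `zero_pad` and `patches_per_side` are hypothetical, and the grid is filled with ones purely as placeholder values.

```python
def zero_pad(grid, pad):
    """Surround a 2D grid (list of rows) with `pad` rings of zeros."""
    width = len(grid[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def patches_per_side(size, filter_size, stride=1):
    """Number of filter positions that fit along one side of the grid."""
    return (size - filter_size) // stride + 1


grid = [[1] * 5 for _ in range(5)]  # a 5x5 grid of placeholder values
padded = zero_pad(grid, pad=1)      # 7x7 after adding a border of zeros

print(len(grid), patches_per_side(len(grid), 3))      # prints: 5 3
print(len(padded), patches_per_side(len(padded), 3))  # prints: 7 5
```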

### Dimensionality
From what we've learned so far, how can we calculate the number of neurons of each layer in our CNN?

Given an input layer of width (or height) W, a filter of width (or height) F, a stride of S, and a padding of P, the following formula gives us the width (or height) of the next layer: (W−F+2P)/S+1. The depth of the next layer equals the number of filters.

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network.


**1. Convolutional Layer Output Shape**

*Introduction*

Understanding dimensions will help you make accurate tradeoffs between model size and performance. As you'll see, some parameters have a much bigger impact on model size than others.

*Setup*

H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)
- Recall the formula for calculating the new height or width:

```python 
new_height = (input_height - filter_height + 2 * P)/S + 1
new_width = (input_width - filter_width + 2 * P)/S + 1
```

What's the shape of the output?

We can get the new height and width with the above formula resulting in:

(32 - 8 + 2 * 1)/2 + 1 = 14

(32 - 8 + 2 * 1)/2 + 1 = 14

The new depth is equal to the number of filters, which is 20.

This would correspond to the following implementation in TensorFlow:

```python
import tensorflow as tf  # TensorFlow 1.x API

input = tf.placeholder(tf.float32, (None, 32, 32, 3))

# (height, width, input_depth, output_depth)
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20)))

filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1]  # (batch, height, width, depth)
padding = 'VALID'
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
```

**Note:**

The output shape of `conv` will be [1, 13, 13, 20]. It's 4D to account for batch size, but more importantly, it's not [1, 14, 14, 20]. This is because TensorFlow's 'VALID' option adds no padding at all, so it does not match the formula above with P = 1. Switching the padding from 'VALID' to 'SAME' would result in an output shape of [1, 16, 16, 20]. If you're curious how padding works in TensorFlow, read this [document](https://www.tensorflow.org/api_docs/python/tf/nn/convolution).
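The three sizes discussed here (14, 13, and 16) can be reproduced side by side. This is a small arithmetic sketch that uses the formulas given earlier in this document; the variable names are chosen for illustration.

```python
from math import ceil

in_size, filter_size, stride = 32, 8, 2

# Explicit zero padding of size P, as in the formula (W - F + 2P)/S + 1.
P = 1
explicit = (in_size - filter_size + 2 * P) // stride + 1

# The 'VALID' equation: no padding is added.
valid = ceil((in_size - filter_size + 1) / stride)

# The 'SAME' equation: only the stride shrinks the output.
same = ceil(in_size / stride)

print(explicit, valid, same)  # prints: 14 13 16
```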


**2. Number of Parameters**

We're now going to calculate the number of parameters of the convolutional layer. The output shape calculated in the previous section will come into play here!

Being able to calculate the number of parameters in a neural network is useful since we want to have control over how much memory a neural network uses.

*Setup*

H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

*Output Layer*

- 14x14x20 (HxWxD)

*Hint*

- Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. In addition, each neuron in the output layer must also connect to a single bias neuron.
- Without weight sharing every parameter in the filter has a connection with every neuron in the output. So, what we need to do is calculate the total number of parameters in the filter and the total number of neurons in the output.

**Convolution Layer Parameters 1**

How many parameters does the convolutional layer have (without parameter sharing)?


Solution:

There are 756560 total parameters. That's a HUGE amount! Here's how we calculate it:

(8 x 8 x 3 + 1) x (14 x 14 x 20) = 756560 

8 x 8 x 3 is the number of weights, and we add 1 for the bias. Remember, each weight is assigned to every single part of the output (14 x 14 x 20). So we multiply these two numbers together and we get the final answer.



**3. Parameter Sharing**

Now we'd like you to calculate the number of parameters in the convolutional layer, if every neuron in the output layer shares its parameters with every other neuron in its same channel.

This is the number of parameters actually used in a convolution layer (`tf.nn.conv2d()`).

*Setup*

H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

*Output Layer*

- 14x14x20 (HxWxD)

*Hint*

- With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel. So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer.

**Convolution Layer Parameters 2**

How many parameters does the convolution layer have (with parameter sharing)?

Solution

There are 3860 total parameters. That's 196 times fewer parameters! Here's how the answer is calculated:

(8 x 8 x 3 + 1) x 20 = 3840 + 20 = 3860

That's 3840 weights and 20 biases. This should look similar to the answer from the previous problem. The difference being it's just 20 instead of (14 x 14 x 20). Remember, with weight sharing we use the same filter for an entire depth slice. Because of this we can get rid of 14 x 14 and be left with only 20.
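Both parameter counts, and the ratio between them, can be verified with a few lines of arithmetic. The variable names below are illustrative; the numbers come from the setup above.

```python
filter_height, filter_width, input_depth = 8, 8, 3
n_filters = 20
out_height, out_width = 14, 14

weights_per_filter = filter_height * filter_width * input_depth  # 192
params_per_neuron = weights_per_filter + 1                       # plus one bias

# Without sharing: every output neuron owns its own weights and bias.
without_sharing = params_per_neuron * (out_height * out_width * n_filters)

# With sharing: one set of weights and one bias per output channel.
with_sharing = params_per_neuron * n_filters

print(without_sharing, with_sharing, without_sharing // with_sharing)
# prints: 756560 3860 196
```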

# 7. Visualizing ConvNets

Let's look at an example ConvNet to see how it works in action.

The CNN we will look at is trained on ImageNet as described in [this paper by Zeiler and Fergus](http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf). In the images below (from the same paper), we’ll see what each layer in this network detects and see how each layer detects more and more complex ideas.

**Layer 1** 
![](assets/layer-1-grid.png)

The images above are from [Matthew Zeiler and Rob Fergus' deep visualization toolbox](https://www.youtube.com/watch?v=ghEmQSxT6tw), which lets us visualize what each layer in a CNN focuses on.

Each image in the above grid represents a pattern that causes the neurons in the first layer to activate - in other words, they are patterns that the first layer recognizes. The top left image shows a -45 degree line, while the middle top square shows a +45 degree line. 

Let's now see some example images that cause such activations. The below grid of images all activated the -45 degree line. Notice how they are all selected despite the fact that they have different colors, gradients, and patterns.
![](assets/grid-layer-1.png)

So, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs.


**Layer 2** 
![](assets/layer2.png)
A visualization of the second layer in the CNN. Notice how we are picking up more complex ideas like circles and stripes. The gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.

We'll skip layers 3 and 4, which continue this progression, and jump right to the fifth and final layer of this CNN.

The last layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles.

**On to TensorFlow**

This concludes our high-level discussion of Convolutional Neural Networks.

Next we'll practice actually building these networks in TensorFlow.


# 8. TensorFlow Convolution Layer

Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions for creating your own convolutional layers.


```python
import tensorflow as tf

# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image, shaped (batch, height, width, channels)
input = tf.placeholder(tf.float32, shape=[None, image_height, image_width, color_channels])

# Weight and bias; the weight is shaped (height, width, input_depth, output_depth)
weight = tf.Variable(tf.truncated_normal([filter_size_height, filter_size_width, color_channels, k_output]))

bias = tf.Variable(tf.zeros(k_output))


# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')

# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)

# Apply activation function
conv_layer = tf.nn.relu(conv_layer)
```

Let's look at one more example of how to set up the dimensions of the ConvNet filters, weights, and biases. This is in many ways the trickiest part of using ConvNets in TensorFlow. Once you have a sense of how to set up the dimensions of these attributes, applying CNNs will be far more straightforward.

**Objective:**

Our input has shape (1, 4, 4, 1).

Set up the strides, padding and filter weight/bias (F_W and F_b) such that the output shape is (1, 2, 2, 3). Note that the filter weight and bias should be TensorFlow variables.

Solution:

Since we want to transform the input shape (1, 4, 4, 1) to (1, 2, 2, 3), we choose 'VALID' for the padding algorithm. We find it simpler to understand and it achieves the result we are looking for.

```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

Plugging in the values:

```python
from math import ceil
out_height = ceil(float(4 - 2 + 1) / float(2))  # ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2))  # ceil(1.5) = 2
```

In order to change the depth from 1 to 3, we have to set the output depth of our filter appropriately:

```python
F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3))) # (height, width, input_depth, output_depth)
F_b = tf.Variable(tf.zeros(3)) # (output_depth)
```

The input has a depth of 1, so we set that as the input_depth of the filter.

The depth doesn't change during a pooling operation so we don't have to worry about that.


```python
import numpy as np

# Input (1, 4, 4, 1)
x = np.array([
        [0, 1, 0.5, 10],
        [2, 2.5, 1, -8],
        [4, 0, 5, 6],
        [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))

X = tf.constant(x)

def conv2d(input):
    # Filter (weights and bias)
    # The shape of the filter weight is (height, width, input_depth, output_depth)
    # The shape of the filter bias is (output_depth,)
    # Both are wrapped in `tf.Variable`; they are trainable parameters after all.
    F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3)))
    F_b = tf.Variable(tf.zeros(3))
    # Stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # Padding, either 'VALID' or 'SAME'
    padding = 'VALID'
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d
    # `tf.nn.conv2d` does not include the bias computation so we have to add it ourselves after.
    return tf.nn.conv2d(input, F_W, strides, padding) + F_b

out = conv2d(X)
print(out)
```

```
Tensor("add:0", shape=(1, 2, 2, 3), dtype=float32)
```

```python
def maxpool(input):
    # ksize (filter size) for each dimension (batch_size, height, width, depth)
    ksize = [1, 2, 2, 1]
    # Stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # Padding, either 'VALID' or 'SAME'
    padding = 'VALID'
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#max_pool
    return tf.nn.max_pool(input, ksize, strides, padding)

print(out)
```

```
Tensor("add:0", shape=(1, 2, 2, 3), dtype=float32)
```


# 9. TensorFlow Pooling
![](assets/pooling.JPG)

A typical layer of a convolutional network consists of three stages.
- In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations.
- In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called **the detector stage**.
- In the third stage, we use a **pooling function** to modify the output of the layer further.

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the *max pooling* operation reports the maximum output within a rectangular neighborhood. 

Pooling helps to make the representation become approximately *invariant* to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.

Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced $k$ pixels rather than $1$ pixel apart.

![](assets/max-pooling.png)

The image above is an example of **max pooling** with a 2x2 filter and stride of 2. The four 2x2 colored regions show each position where the filter was applied to find the maximum value.

For example, [[1, 0], [4, 6]] becomes 6, because 6 is the maximum value in this set. Similarly, [[2, 3], [6, 8]] becomes 8.

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.
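Max pooling itself is simple enough to write out by hand. The sketch below is plain Python, not TensorFlow; the function name `max_pool_2d` is hypothetical, and the bottom two rows of the grid are placeholder values added so that the two patches quoted above sit inside a complete 4x4 input.

```python
def max_pool_2d(grid, size=2, stride=2):
    """Max pooling over a 2D grid (list of rows), with no padding."""
    pooled = []
    for top in range(0, len(grid) - size + 1, stride):
        pooled_row = []
        for left in range(0, len(grid[0]) - size + 1, stride):
            window = [grid[top + i][left + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# The top two rows contain the two patches quoted above; the bottom two rows
# are placeholder values chosen for this example.
grid = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

pooled = max_pool_2d(grid)
print(pooled)  # prints: [[6, 8], [3, 4]]
```

Sixteen input values are reduced to four, and only the largest value in each 2x2 window survives.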

TensorFlow provides the *tf.nn.max_pool()* function to apply max pooling to your convolutional layers.

```python
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')
```

The *tf.nn.max_pool()* function performs max pooling with the ksize parameter as the size of the filter and the strides parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The ksize and strides parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor ([batch, height, width, channels]). For both ksize and strides, the batch and channel dimensions are typically set to 1.


**Note:**

A pooling layer is generally used to decrease the size of the output and prevent overfitting. Reducing overfitting is a consequence of reducing the output size, which in turn reduces the number of parameters in future layers.


Recently, pooling layers have fallen out of favor. Some reasons are:

- Recent datasets are so big and complex we're more concerned about underfitting.
- Dropout is a much better regularizer.
- Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.



# 10. Convolutional Network in TensorFlow

It's time to walk through an example ConvNet in TensorFlow. 

The structure of this network follows the classic structure of CNN, which is a mix of convolutional layers and max pooling, followed by fully-connected layers.

We are going to use the MNIST dataset. Let's import the MNIST dataset and use a convenient TensorFlow function to batch, scale, and one-hot encode the data.

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)
```

```
Extracting .\train-images-idx3-ubyte.gz
Extracting .\train-labels-idx1-ubyte.gz
Extracting .\t10k-images-idx3-ubyte.gz
Extracting .\t10k-labels-idx1-ubyte.gz
```


```python
# Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if you are running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10   # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units


# Weights and Biases

# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}


biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
```
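To get a feel for where the parameters of this network live, the following sketch counts them from the shapes in the `weights` and `biases` dictionaries. It is plain Python and does not touch TensorFlow; the dictionary names `weight_shapes` and `bias_shapes` are introduced here for illustration.

```python
from math import prod

# Shapes copied from the `weights` and `biases` dictionaries above.
weight_shapes = {
    "wc1": (5, 5, 1, 32),
    "wc2": (5, 5, 32, 64),
    "wd1": (7 * 7 * 64, 1024),
    "out": (1024, 10),
}
bias_shapes = {"bc1": (32,), "bc2": (64,), "bd1": (1024,), "out": (10,)}

weight_counts = {name: prod(shape) for name, shape in weight_shapes.items()}
bias_count = sum(prod(shape) for shape in bias_shapes.values())
total = sum(weight_counts.values()) + bias_count

for name, count in weight_counts.items():
    print(f"{name}: {count}")
print(f"biases: {bias_count}")
print(f"total: {total}")
```

Nearly all of the roughly 3.27 million parameters sit in the first fully connected layer (`wd1`), while the two convolutional layers together account for about 52 thousand weights.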

**Convolutions**
![](assets/convolution-schematic.gif)

Image courtesy of [UFLDL Tutorial](http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution)

The above is an example of a convolution with a 3x3 filter and a stride of 1 being applied to data with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight, [[1, 0, 1], [0, 1, 0], [1, 0, 1]], then a bias is added to create the convolved feature on the right. In this case, the bias is zero. In TensorFlow, this is all done using *tf.nn.conv2d()* and *tf.nn.bias_add()*.

```python
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)
```

The tf.nn.conv2d() function computes the convolution against weight W as shown above.

In TensorFlow, stride is an array of 4 elements; the first element in the stride array indicates the stride for batch and the last element indicates the stride for features. It's better practice to remove the batches or features you want to skip from the dataset than to skip them with the stride. You can always set the first and last elements to 1 in order to use all batches and features.

The middle two elements are the strides for height and width respectively. I've mentioned stride as one number because you usually have a square stride where height = width. When someone says they are using a stride of 3, they usually mean tf.nn.conv2d(x, W, strides=[1, 3, 3, 1]).
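
How the stride and padding determine the output height and width can be sketched with the formulas TensorFlow documents for its two padding modes. The helper name `conv_output_size` is hypothetical and only models one spatial dimension; it is a sketch for checking shapes by hand, not part of the TensorFlow API.

```python
import math


def conv_output_size(input_size, filter_size, stride, padding):
    """Output length along one spatial dimension of a convolution or pooling op."""
    if padding == 'SAME':
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        # No padding: the filter must fit entirely inside the input.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


print(conv_output_size(28, 5, 1, 'SAME'))   # 28
print(conv_output_size(28, 5, 1, 'VALID'))  # 24
print(conv_output_size(28, 5, 3, 'SAME'))   # 10
print(conv_output_size(28, 5, 3, 'VALID'))  # 8
```

With `padding='SAME'` and a stride of 1, as in the `conv2d()` helper above, the output keeps the input's height and width.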

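To make life easier, the code uses tf.nn.bias_add() to add the bias. It is a special case of tf.add() that requires the bias to be a 1-D tensor whose size matches the last (channel) dimension of the input, so each output channel gets its own bias value.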
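
As a rough picture of what adding a per-channel bias means, here is a plain-Python sketch on a nested list laid out as [height][width][channels]. The function name `bias_add` and the sample values are assumptions made for illustration; this is not TensorFlow's implementation.

```python
def bias_add(feature_maps, bias):
    """Add bias[c] to channel c of a [height][width][channels] nested list."""
    return [
        [[value + bias[c] for c, value in enumerate(pixel)] for pixel in row]
        for row in feature_maps
    ]


# A 2x2 feature map with 2 channels (illustrative values).
x = [
    [[1, 10], [2, 20]],
    [[3, 30], [4, 40]],
]
b = [100, 200]  # one bias value per channel

result = bias_add(x, b)
print(result)  # [[[101, 210], [102, 220]], [[103, 230], [104, 240]]]
```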

**Max Pooling**

![](assets/maxpool.jpeg)

The above is an example of max pooling with a 2x2 filter and a stride of 2. The left square is the input and the right square is the output. The four colored 2x2 regions in the input mark each position where the filter was applied to produce the max on the right side. For example, [[1, 1], [5, 6]] becomes 6 and [[3, 2], [1, 2]] becomes 3.

In [7]:
def maxpool2d(x, k=2):
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

The tf.nn.max_pool() function does exactly what you would expect: it performs max pooling, with the ksize parameter giving the size of the filter and the strides parameter giving the step between windows.
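
The pooling step can be reproduced by hand with a short plain-Python sketch. The function name `max_pool2d` is hypothetical, and the sketch assumes no padding. The two left-hand 2x2 regions of the input match the values quoted above; the right-hand values are assumed for illustration.

```python
def max_pool2d(image, k=2, stride=2):
    """Return the maximum of each k x k window, moving `stride` cells at a time."""
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            max(
                image[i * stride + di][j * stride + dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(image, k=2, stride=2)
print(pooled)  # [[6, 8], [3, 4]]
```

Because the stride equals the filter size, the windows do not overlap and each spatial dimension is halved.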

**Model**

![](assets/arch.png)

In the code below, we're creating two layers that each apply a convolution followed by max pooling, then a fully connected layer and an output layer. The transformation of each layer to new dimensions is shown in the comments. For example, the first layer shapes the images from 28x28x1 to 28x28x32 in the convolution step. The next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from conv1 to output, producing 10 class predictions.
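
The shape changes described above, along with the number of trainable parameters, can be checked with a few lines of arithmetic. This is a sketch that assumes the filter shapes from the weights dictionary, `padding='SAME'`, a convolution stride of 1, and 2x2 pooling with a stride of 2; the helper name `same_padding_size` is hypothetical.

```python
import math


def same_padding_size(size, stride):
    """Output length along one dimension for a 'SAME'-padded conv or pooling op."""
    return math.ceil(size / stride)


# Filter shapes copied from the weights dictionary: [height, width, in, out].
conv_filters = [("conv1", [5, 5, 1, 32]), ("conv2", [5, 5, 32, 64])]

height, width, channels = 28, 28, 1
shapes = [("input", (height, width, channels))]
n_params = 0

for name, (f_h, f_w, f_in, f_out) in conv_filters:
    # Convolution with stride 1 keeps height and width; channels become f_out.
    height, width = same_padding_size(height, 1), same_padding_size(width, 1)
    channels = f_out
    shapes.append((name, (height, width, channels)))
    # 2x2 max pooling with stride 2 halves height and width.
    height, width = same_padding_size(height, 2), same_padding_size(width, 2)
    shapes.append((name + " + maxpool", (height, width, channels)))
    n_params += f_h * f_w * f_in * f_out + f_out  # filter weights + biases

flat = height * width * channels   # 7 * 7 * 64 = 3136 inputs to 'wd1'
n_params += flat * 1024 + 1024     # fully connected layer
n_params += 1024 * 10 + 10         # output layer

for name, shape in shapes:
    print(name, shape)
print("flattened:", flat)               # flattened: 3136
print("trainable parameters:", n_params)  # trainable parameters: 3274634
```

Nearly all of the parameters sit in the fully connected layer (3136 x 1024 weights), which is why the flattened size of 7*7*64 appears in the shape of `wd1`.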

In [8]:
def conv_net(x, weights, biases, dropout):
    
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)
    
    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)
    
    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)
    
    # Output Layer - Class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out
    

In [9]:
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)


# Model
logits = conv_net(x, weights, biases, keep_prob)


# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)


# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))


# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                    x: batch_x,
                    y: batch_y,
                    keep_prob: dropout  # apply dropout only during the training step
                })
            
            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                    x: batch_x,
                    y: batch_y, 
                    keep_prob: 1.
                })
            
            valid_acc = sess.run(accuracy, feed_dict={
                    x: mnist.validation.images[:test_valid_size],
                    y: mnist.validation.labels[:test_valid_size],
                    keep_prob: 1.
                })
            
            print('Epoch {:>2}, Batch {:>3} - Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))
            
            # Calculate Test Accuracy
            test_acc = sess.run(accuracy, feed_dict={
                    x: mnist.test.images[:test_valid_size],
                    y: mnist.test.labels[:test_valid_size],
                    keep_prob: 1.
                })
            
            print('Test Accuracy: {}'.format(test_acc))

Epoch  1, Batch   1 - Loss: 71721.0625 Validation Accuracy: 0.097656
Test Accuracy: 0.09765625
Epoch  1, Batch   2 - Loss: 43279.5625 Validation Accuracy: 0.066406
Test Accuracy: 0.078125
Epoch  1, Batch   3 - Loss: 34834.5430 Validation Accuracy: 0.058594
Test Accuracy: 0.08984375
Epoch  1, Batch   4 - Loss: 29371.1641 Validation Accuracy: 0.093750
Test Accuracy: 0.08984375
Epoch  1, Batch   5 - Loss: 26687.7305 Validation Accuracy: 0.085938
Test Accuracy: 0.09765625
Epoch  1, Batch   6 - Loss: 26890.7109 Validation Accuracy: 0.113281
Test Accuracy: 0.12109375
Epoch  1, Batch   7 - Loss: 27383.0977 Validation Accuracy: 0.105469
Test Accuracy: 0.1328125
Epoch  1, Batch   8 - Loss: 23837.0684 Validation Accuracy: 0.085938
Test Accuracy: 0.13671875
Epoch  1, Batch   9 - Loss: 21657.1895 Validation Accuracy: 0.113281
Test Accuracy: 0.125
Epoch  1, Batch  10 - Loss: 21153.0195 Validation Accuracy: 0.113281
Test Accuracy: 0.13671875
Epoch  1, Batch  11 - Loss: 23806.5664 Validation Accuracy

**References:**

- [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/index.html)
- [Goodfellow et al. (2016) - Deep Learning](http://www.deeplearningbook.org/)
- [Udacity - Deep Learning](https://www.udacity.com/course/deep-learning--ud730)