In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../../notebook_format')
from formats import load_style
load_style( css_style = 'custom2.css' )

In [2]:
os.chdir(path)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8, 6 # change default figure size

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib

Ethen 2016-08-04 20:31:49 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
pandas 0.18.1
matplotlib 1.5.1
tensorflow 0.9.0


# **Convolutional Networks**

When we hear about Convolutional Neural Network (CNN or ConvNet), we typically think of Computer Vision. CNNs were responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today, from Facebook’s automated photo tagging to self-driving cars.

## **Motivation**

**Convolutional Neural Networks (ConvNet)** are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product with its weights, adds a bias, then follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, hence most of the tips/tricks we developed for learning regular Neural Networks still apply.

The thing is: Regular Neural Nets don’t scale well to full images. For the CIFAR-10 image dataset, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in the first hidden layer of a regular Neural Network would have $32 * 32 * 3 = 3072$ weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have $200 * 200 * 3 = 120,000$ weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
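To make the arithmetic above concrete, here is a minimal sketch that counts the weights of a single fully-connected neuron for both image sizes. The helper name `fc_weights_per_neuron` and the layer width of 100 neurons are assumptions made purely for illustration.

```python
def fc_weights_per_neuron(width, height, channels):
    """Number of weights for one neuron that is connected to every input pixel."""
    return width * height * channels

cifar_weights = fc_weights_per_neuron(32, 32, 3)
large_weights = fc_weights_per_neuron(200, 200, 3)
print(cifar_weights)  # 3072
print(large_weights)  # 120000

# a hidden layer holds many such neurons, so the count multiplies;
# 100 is a hypothetical layer width, not a recommendation
hidden_units = 100
print(large_weights * hidden_units)  # 12000000
```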

Thus unlike a regular Neural Network, ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

More explicitly, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth**. (Note that the word depth here refers to the third dimension of an activation volume, not the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are input volumes of 32x32x3 (width, height, depth respectively).

We use three main types of layers to build ConvNet architectures: **Convolutional Layer**, **Pooling Layer**, and **Fully-Connected Layer** (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.

# **Convolutional Layer**

The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting. 

## **Convolutional Layer-1 Basics**

Each layer can be visualized in the form of a block. For instance in the case of CIFAR-10 data, the input layer would have the following form:

<img src='images/conv1.png', width=40% height=40%>

Here you can see the original image, which is 32×32 in height and width. The depth is 3, which corresponds to the Red, Green and Blue (RGB) color channels. A convolution layer is formed by convolving a **filter, also called a kernel**, over it. **A filter is another block with a smaller height and width, usually 3x3, 5x5 or something like that, but it must have the SAME depth as its input**, and it is swept over the input block. Let’s consider a filter of size 5x5x3 and see how it is done:

<img src='images/conv2.png', width=80% height=80%>

Convolving means that we take the filter and start sliding it over all possible spatial locations of the input from the top left corner to the bottom right corner. This filter is a set of weights, i.e. 5x5x3 = 75 weights + 1 bias = 76 parameters in total (these parameters are learned during training via backpropagation). At each position, the weighted sum of the pixels under the filter is calculated as $W^TX + b$ and a new value is obtained.

After the computation for a single filter, we end up with a volume of size 28x28x1 (more on this later) as shown above. This is often referred to as the **activation map**.

Another way to think of this process is as a sliding window function. Consider the following gif:

![Convolution with 3×3 Filter](images/convolution.gif)

Imagine that the matrix on the left represents a black and white image. Each entry corresponds to one pixel, 0 for black and 1 for white (for grayscale images, values typically range between 0 and 255). Here we use a 3×3 filter, multiply its values element-wise with the patch of the original matrix that it covers, then sum them up. To get the full convolution we do this at each position by sliding the filter over the whole matrix.
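The sliding window idea can be sketched in a few lines of NumPy. The 5×5 binary image and the 3×3 filter below are illustrative values chosen for this sketch (they are not guaranteed to match the animation), and `sliding_window_sum` is a hypothetical helper name.

```python
import numpy as np

def sliding_window_sum(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and, at each
    position, sum the element-wise product of the kernel and the patch."""
    f = kernel.shape[0]
    out_h = image.shape[0] - f + 1
    out_w = image.shape[1] - f + 1
    output = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + f, j:j + f]
            output[i, j] = np.sum(patch * kernel)
    return output

# illustrative binary image: 0 for black, 1 for white
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

convolved = sliding_window_sum(image, kernel)
print(convolved)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```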

For this step of the process, we usually want to have multiple filters (each independent of the others). Therefore, if 10 filters (the number is a hyperparameter that we can tune) are used, the output would look like:

<img src='images/conv3.png', width=80% height=80%>

The result of this process is that we've represented the original image with a 28x28x10 volume: 10 activation maps stacked together along the depth dimension. This volume will be fed into later layers.
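As a rough sketch of the whole step, the naive NumPy loop below convolves a 32x32x3 input with 10 filters of size 5x5x3 and stacks the resulting activation maps along the depth dimension. The function name `conv_forward`, the (height, width, depth) memory layout and the random stand-in data are assumptions for illustration; real libraries use far faster implementations.

```python
import numpy as np

def conv_forward(volume, filters, biases, stride=1):
    """Naive convolution of a (H, W, D) volume with K filters of shape (F, F, D),
    without padding. Returns a (H_out, W_out, K) volume of activation maps."""
    height, width, depth = volume.shape
    n_filters, f, _, filter_depth = filters.shape
    assert filter_depth == depth, 'filter depth must match input depth'
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    output = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = volume[r:r + f, c:c + f, :]
            # weighted sum of the patch under every filter at once, plus the biases
            output[i, j, :] = np.tensordot(filters, patch, axes=3) + biases
    return output

rng = np.random.RandomState(0)
image = rng.rand(32, 32, 3)          # random stand-in for a CIFAR-10 image
filters = rng.randn(10, 5, 5, 3)     # 10 filters, each 5x5x3
biases = np.zeros(10)                # one bias per filter

activations = conv_forward(image, filters, biases)
print(activations.shape)  # (28, 28, 10)
```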

## **Convolutional Layer-2 Spatial Dimensions**

You might have noticed that we got a block of size 28×28 as the output when the input was 32×32. Why so? Let’s look at a simpler case. Suppose the initial image has size 6x6xd and the filter has size 3x3xd. Since the depth of the input and the filter is always the same, we can look at it from a top-down perspective (leaving out the depth).

<img src='images/conv4.png', width=80% height=80%>

Here we can see that each filter produces a 4x4 output (4 distinct positions in one row and 4 rows in total). The depth of the output equals the number of filters, not d.

Let’s define a generic case where the image has dimension $N \times N \times d$ and the filter has dimension $F \times F \times d$. Let’s also define another hyperparameter, the **stride (S)**, which is the number of cells (in the matrix above) the filter moves in each step. We used a stride of 1, but it can be larger. Given this information, the output size can be computed with the following formula:

$$(N - F)/S + 1$$

You can validate the formula with the example above, or with the CIFAR-10 example where $N=32$, $F=5$, $S=1$ and the output is 28.

<p>
<div class="alert alert-warning">
Note that some $S$ values might lead to a non-integer result. We generally don’t use such values, and different libraries may handle them differently (e.g. by throwing an error).
</div>
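The formula translates directly into a small helper. This is a minimal sketch: the name `conv_output_size` is hypothetical, and raising a `ValueError` when the stride does not fit is just one possible way to handle the non-integer case.

```python
def conv_output_size(n, f, stride=1):
    """Output width/height for an n x n input and an f x f filter, no padding."""
    span = n - f
    if span < 0:
        raise ValueError('filter is larger than the input')
    if span % stride != 0:
        # (n - f) / stride + 1 would not be an integer
        raise ValueError('stride does not fit: (n - f) is not divisible by stride')
    return span // stride + 1

print(conv_output_size(6, 3))    # 4
print(conv_output_size(32, 5))   # 28
```

With these definitions, `conv_output_size(32, 5, stride=2)` raises an error because $(32 - 5)/2 + 1 = 14.5$ is not an integer.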

Let’s consider another example to consolidate our understanding. Starting with the same 32×32 image as before, we apply two convolutional layers consecutively: first 10 filters of size 7 with stride 1, then 6 filters of size 5 with stride 2. Before looking at the solution below, think about two things:

- What should be the depth of each filter?
- What will be the resulting size of the volume at each step?

---

Answer:

<img src='images/conv5.png', width=80% height=80%>

<p>
<div class="alert alert-info">
Notice in the second layer, the filters are now 5x5x**10** because **10** is the depth of its input.
</div>
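The same bookkeeping can be sketched in code. Two assumptions are made here: the input has depth 3 (an RGB image), and a non-integer result is rounded down. The second layer needs this because $(26 - 5)/2 + 1 = 11.5$; rounding down is only one convention, and some libraries would reject this configuration instead. The helper name `conv_layer_shape` is hypothetical.

```python
def conv_layer_shape(input_shape, n_filters, f, stride):
    """Return (output_shape, filter_shape) for a conv layer without padding.
    Non-integer sizes are rounded down (an assumption of this sketch)."""
    height, width, depth = input_shape
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    filter_shape = (f, f, depth)   # filter depth always matches input depth
    return (out_h, out_w, n_filters), filter_shape

shape = (32, 32, 3)
layers = [(10, 7, 1),   # 10 filters of size 7, stride 1
          (6, 5, 2)]    # 6 filters of size 5, stride 2

results = []
for n_filters, f, stride in layers:
    shape, filter_shape = conv_layer_shape(shape, n_filters, f, stride)
    results.append((filter_shape, shape))
    print('filter', filter_shape, '-> output', shape)
# filter (7, 7, 3) -> output (26, 26, 10)
# filter (5, 5, 10) -> output (11, 11, 6)
```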

This [github repo](https://github.com/vdumoulin/conv_arithmetic) contains animations of different types of convolutional layers if you're interested.

## **Convolutional Layer-3 Paddings**

Notice from the picture above that the size of the input shrinks at every layer (from 32 to 26 to 11; note that the second step evaluates to $(26 - 5)/2 + 1 = 11.5$, so it only reaches 11 if the result is rounded down). This is undesirable in deep networks, where the size becomes very small too early and we lose a lot of the representation of the original input. It also restricts the use of large filters, as they shrink the size even faster.

To prevent this, we generally use a stride of 1 along with **zero-padding** of size $(F-1)/2$. **Zero-padding** is nothing but adding additional zero-value pixels along every border of the image.

Consider the example we saw above with 6×6 image and 3×3 filter. The required padding is (3-1)/2=1. We can visualize the padding as:

<img src='images/conv6.png', width=40% height=40%>

Here you can see that the image now becomes 8×8 because of the padding of 1 on each side. Plugging these numbers into the output size formula, $(N - F)/S + 1$, we see the output will be of size 6×6, since ( 8 - 3 ) / 1 + 1 = 6, which is the same as the original input size.
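A quick way to check this is to pad a small array with `np.pad` and apply the output size formula to the padded size. The 6×6 array of consecutive integers is only a stand-in for an image.

```python
import numpy as np

image = np.arange(36).reshape(6, 6)   # stand-in for a 6x6 image
f, stride = 3, 1
pad = (f - 1) // 2                    # 1 for a 3x3 filter

# add `pad` rows/columns of zeros along every border
padded = np.pad(image, pad_width=pad, mode='constant', constant_values=0)
print(padded.shape)   # (8, 8)

# output size formula applied to the padded input
output_size = (padded.shape[0] - f) // stride + 1
print(output_size)    # 6
```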

Given this new information, we can write down the final formula for calculating the output size. The output width and height is a function of the input volume size (N), the filter size of the Conv Layer (F), the stride with which they are applied (S) and the amount of zero padding used (P) on the border:

$$(N−F+2P)/S+1$$

And the depth is simply the number of filters that we have.

> Example: Given an input volume of 32x32x3, what is the output size if we apply 10 5x5 filters with stride 1 and pad 2? And what is the total number of weights for this layer?

- The output width and height will be ( 32 - 5 + 2 * 2 ) / 1 + 1 = 32 and the depth will be 10.
- The total number of weights will be the filter's width times height times depth, plus one for the bias, multiplied by the total number of filters: ( 5 x 5 x 3 + 1 ) * 10 = 760.
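The worked example can be reproduced with a short helper. This is a sketch under the assumption of square inputs and filters; the name `conv_layer_summary` is hypothetical.

```python
def conv_layer_summary(n, depth, n_filters, f, stride, pad):
    """Output shape and parameter count of a conv layer on an n x n x depth input."""
    out = (n - f + 2 * pad) // stride + 1
    output_shape = (out, out, n_filters)
    # each filter has f * f * depth weights plus one bias
    n_params = (f * f * depth + 1) * n_filters
    return output_shape, n_params

output_shape, n_params = conv_layer_summary(
    n=32, depth=3, n_filters=10, f=5, stride=1, pad=2)
print(output_shape)  # (32, 32, 10)
print(n_params)      # 760
```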

# **Pooling Layer**

Recall that we use padding in the convolution layer to preserve the input size. The **pooling layers** are used to REDUCE the size of the input. They're in charge of downsampling the network's activations to make them more manageable. The most common form of downsampling is **max-pooling**.

Consider the following 4×4 layer. So if we use a 2×2 filter with stride 2 and max-pooling, we get the following response:

<img src='images/pool1.jpeg', width=60% height=60%>

Here we can see that each 2×2 block is combined into a single number by taking its maximum value. Max-pooling is the most common choice and tends to work well in practice, but there are other options such as average pooling.
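Max-pooling on a single depth slice can be sketched as follows. The 4×4 values are illustrative and may differ from the ones in the figure; `max_pool` is a hypothetical helper name.

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max-pooling over a 2-D array with an f x f window and the given stride."""
    height, width = x.shape
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = x[r:r + f, c:c + f].max()
    return pooled

layer = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])
pooled = max_pool(layer)
print(pooled)
# [[6 8]
#  [3 4]]
```

For a full volume, the same operation would be applied independently to every depth slice.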

It is common to periodically insert a **pooling layer** in-between successive **Conv layers** in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the number of parameters and the amount of computation in the network, and hence to also control overfitting.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2. Every max-pooling operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth size remains unchanged for this operation.

**Summary of Conv and Pool Layer**

**The Conv Layer:**

- Accepts a volume of size \\(N_1 \times H_1 \times D_1\\)
- Requires four hyperparameters: 
  - Number of filters \\(K\\) 
  - The filter's size \\(F\\)
  - The stride \\(S\\)
  - The amount of zero padding \\(P\\)
- Produces a volume of size \\(N_2 \times H_2 \times D_2\\) where:
  - \\(N_2 = (N_1 - F + 2P)/S + 1\\)
  - \\(H_2 = (H_1 - F + 2P)/S + 1\\) (i.e. width and height are computed equally)
  - \\(D_2 = K\\)
- Each layer introduces \\(F \cdot F \cdot D_1\\) weights per filter, for a total of \\((F \cdot F \cdot D_1) \cdot K\\) weights and \\(K\\) biases.

Some common settings for these hyperparameters are  \\(K = \text{ power of 2, e.g. 32, 64, 128, } F = 3, S = 1, P = 1\\).

**The Pooling Layer:**

- Accepts a volume of size \\(N_1 \times H_1 \times D_1\\)
- Requires two hyperparameters: 
  - The filter's size \\(F\\)
  - The stride \\(S\\)
- Produces a volume of size \\(N_2 \times H_2 \times D_2\\) where:
  - \\(N_2 = (N_1 - F)/S + 1\\)
  - \\(H_2 = (H_1 - F)/S + 1\\)
  - \\(D_2 = D_1\\)
- Introduces zero parameters since it computes a fixed function of the input
- Note that it is not common to use zero-padding for Pooling layers

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with \\(F = 3, S = 2\\) (also called overlapping pooling), and more commonly \\(F = 2, S = 2\\). Pooling sizes with larger receptive fields are too destructive.

# **Fully Connected Layer**

At the end of convolution and pooling layers, networks generally use fully-connected layers in which each pixel is considered as a separate neuron just like a regular neural network. The last fully-connected layer will contain as many neurons as the number of classes to be predicted. For instance, in CIFAR-10 case, the last fully-connected layer will have 10 neurons.

# **Example ConvNet Architecture**

Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern. At some point, it is common to transition to FC layers and the last fully-connected layer holds the output, such as the class scores (by applying softmax to it). In other words, the most common ConvNet architecture follows the pattern:

<p>
<div class="alert alert-info">
INPUT -> [ [CONV -> RELU] x N -> POOL?] x M -> [FC -> RELU] x K -> FC
</div>

where the `x` indicates repetition, and `POOL?` indicates an optional pooling layer. Moreover, `N >= 0` (and usually `N <= 3`), `M >= 0`, `K >= 0` (and usually `K < 3`). For example, here is one simple architecture that follows this pattern:

A simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:

- **INPUT layer** [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
- **CONV layer** will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.
- **RELU layer** will apply an elementwise activation function, such as the $max(0,x)$ thresholding at zero. This leaves the size of the volume unchanged [32x32x12].
- **POOL layer** will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].
- **FC (i.e. fully-connected) layer** will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
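The shapes listed above can be traced with a toy forward pass in NumPy. This is a sketch with random, untrained weights; it assumes 3x3 filters with zero-padding 1 (so that the CONV layer keeps the 32x32 spatial size) and 2x2 max-pooling with stride 2. All function names are hypothetical.

```python
import numpy as np

def conv_same(volume, filters, biases):
    """Stride-1 convolution with zero-padding that preserves width and height."""
    n_filters, f, _, _ = filters.shape
    pad = (f - 1) // 2
    padded = np.pad(volume, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    height, width, _ = volume.shape
    output = np.zeros((height, width, n_filters))
    for i in range(height):
        for j in range(width):
            patch = padded[i:i + f, j:j + f, :]
            output[i, j, :] = np.tensordot(filters, patch, axes=3) + biases
    return output

def relu(x):
    """Elementwise max(0, x)."""
    return np.maximum(0, x)

def max_pool_2x2(volume):
    """2x2 max-pooling with stride 2 on every depth slice (even height/width)."""
    height, width, depth = volume.shape
    blocks = volume.reshape(height // 2, 2, width // 2, 2, depth)
    return blocks.max(axis=(1, 3))

rng = np.random.RandomState(0)
image = rng.rand(32, 32, 3)                   # INPUT [32x32x3], random stand-in
filters = rng.randn(12, 3, 3, 3) * 0.01       # 12 filters of size 3x3x3

conv_out = conv_same(image, filters, np.zeros(12))   # CONV [32x32x12]
relu_out = relu(conv_out)                            # RELU [32x32x12]
pool_out = max_pool_2x2(relu_out)                    # POOL [16x16x12]

fc_weights = rng.randn(pool_out.size, 10) * 0.01     # every input connects to every class
scores = pool_out.reshape(-1).dot(fc_weights) + np.zeros(10)   # FC, 10 class scores

for name, arr in [('conv', conv_out), ('relu', relu_out),
                  ('pool', pool_out), ('fc', scores)]:
    print(name, arr.shape)
```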


## **Layer Sizing**

The **input layer** (that contains the image) should have a spatial size that is divisible by 2 many times. Common sizes include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), 224 (e.g. common ImageNet ConvNets), 384, and 512. Images are typically resized to a square shape if they weren't already.

The **conv layers** should be using small filters (e.g. 3x3 or at most 5x5), using a stride of \\(S = 1\\), and crucially, padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when \\(F = 3\\), then using \\(P = 1\\) will retain the original size of the input. When \\(F = 5\\), \\(P = 2\\). For a general \\(F\\), it can be seen that \\(P = (F - 1) / 2\\) preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.

The **pool layers** are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. filter size \\(F = 2\\)), and with a stride of 2 (i.e. \\(S = 2\\)). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and aggressive. This usually leads to worse performance.

*Getting rid of pooling.* Many people dislike the pooling operation because it loses information, and think that we can get away without it (in favor of architectures that only consist of repeated CONV layers). To reduce the size of the representation, they suggest occasionally using a larger stride in a CONV layer. Discarding pooling layers has also been found to be important in training good generative models and it seems likely that future architectures will feature very few to no pooling layers.

*Reducing sizing headaches.* The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired.

*Why use stride of 1 in CONV?* Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.

*Why use padding?* In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.

*Compromising based on memory constraints.* In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, AlexNet uses filter sizes of 11x11 and stride of 4.
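The memory estimate can be reproduced with a back-of-the-envelope calculation. This sketch assumes 4 bytes per value (32-bit floats) and one gradient value per activation; under those assumptions the result lands close to the figure quoted above, with the exact number depending on rounding and on whether a megabyte is taken as $10^6$ or $2^{20}$ bytes. The helper name is hypothetical.

```python
def activation_memory(height, width, n_filters, n_layers,
                      bytes_per_value=4, with_gradients=True):
    """Count activations across identical conv layers and the bytes they occupy."""
    n_activations = height * width * n_filters * n_layers
    copies = 2 if with_gradients else 1   # activations plus their gradients
    n_bytes = n_activations * bytes_per_value * copies
    return n_activations, n_bytes

n_activations, n_bytes = activation_memory(224, 224, 64, n_layers=3)
print(n_activations)       # 9633792, i.e. about 10 million
print(n_bytes / 2 ** 20)   # 73.5 (in units of 2**20 bytes)
```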

## Reference

- [Analytics Vidhya: Deep Learning for Computer Vision – Introduction to Convolution Neural Networks](https://www.analyticsvidhya.com/blog/2016/04/deep-learning-computer-vision-introduction-convolution-neural-networks/)
- Stanford cs231n
    - [Course Note: convolutional-networks](http://cs231n.github.io/convolutional-networks/) (includes case study of different network architecture)
    - [Youtube Video Lecture](https://www.youtube.com/watch?v=V8JDMkARdfU&list=PLlJy-eBtNFt6EuMxFYRiNRS07MCWN5UIA&index=7)