# **03 - Convolutional Neural Network (CNN)**

A Convolutional Neural Network (ConvNet/CNN) is a deep learning model that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and learns to differentiate one from another. A ConvNet requires much less pre-processing than other classification algorithms: whereas in earlier methods the filters are hand-engineered, a ConvNet can learn these filters/characteristics given enough training.

The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlaps to cover the entire visual area.

<img src="03_images/03_CNN_Overview.jpeg" width=700px/>

## Convolution

Suppose we have two **signals** $x$ and $w$, which you can think of as arrays, with elements
denoted as $x[t]$ and so on. As you can guess based on the letters, you can think of $x$ as an input signal (such as a waveform or an image) and $w$ as a set of weights, which we’ll refer to as a **filter** or **kernel**. Normally the
signals we work with are finite in extent, but it is sometimes convenient to treat them as infinitely large by treating the values as zero everywhere else; this is known as **zero padding**.

For continuous signals, the **convolution** of $x$ and $w$ is defined as

$$
s(t) = \int x(a)w(t-a)da \\
$$

The convolution operation is typically denoted with an asterisk:

$$
s(t) = (x * w)(t) \\
$$

For discrete signals, let’s start with the one-dimensional case. The convolution of $x$ and $w$, denoted $x * w$, is a signal with entries given by

$$
s[t] = (x * w)[t] = \sum_{a}x[a]w[t-a] \\
$$

There are two ways to think about this equation. The first is **translate-and-scale**: the signal $x * w$ is composed of multiple copies of $x$, translated and scaled by various amounts according to the entries of $w$.

<img src="03_images/03_Figure_1.PNG" width=500px/>

A second way to think about it is **flip-and-filter**. Here we generate each of the entries of $x * w$ by flipping $w$, shifting it, and taking the dot product with $x$.

<img src="03_images/03_Figure_2.PNG" width=500px/>
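The two views above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation; the function names `conv_translate_and_scale` and `conv_flip_and_filter` and the example signals are assumptions made for illustration. Both functions treat the signals as zero outside their stored entries.

```python
import numpy as np

def conv_translate_and_scale(x, w):
    """Sum copies of x, each translated by a and scaled by w[a]."""
    out = np.zeros(len(x) + len(w) - 1)
    for a, w_a in enumerate(w):
        out[a:a + len(x)] += w_a * x
    return out

def conv_flip_and_filter(x, w):
    """Flip w, slide it over the zero-padded x, and take dot products."""
    k = len(w)
    padded = np.pad(x, (k - 1, k - 1))  # zero padding on both sides
    flipped = w[::-1]
    return np.array([padded[t:t + k] @ flipped
                     for t in range(len(x) + k - 1)])

# Hypothetical example signals
x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.0, -1.0])

view_1 = conv_translate_and_scale(x, w)
view_2 = conv_flip_and_filter(x, w)
reference = np.convolve(x, w)  # NumPy's "full" discrete convolution

print(view_1)  # [ 1.  2.  2. -2. -3.]
```

Both views agree with `np.convolve`, which computes the same sum $\sum_{a}x[a]w[t-a]$ over every offset where the signals overlap.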

In practice, we often use convolutions over more than one axis at a time. For a two-dimensional image $I$ and a two-dimensional kernel $K$:

$$
s[i,j] = (I * K)[i,j]=\sum_{m}\sum_{n}I[m,n]K[i-m,j-n] \\
$$

* The input is usually a multidimensional array of data.
* The kernel is usually a multidimensional array of parameters that should be learned.
* We assume that these functions are zero everywhere but the finite set of points for which we store the values.
* We can implement the infinite summation as a summation over a finite number of array elements.

Convolution is commutative, so we can equivalently write:

$$
s[i,j] = (I * K)[i,j]=\sum_{m}\sum_{n}I[i-m,j-n]K[m,n] \\
$$

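A closely related operation is **cross-correlation**, which is the same as convolution but without flipping the kernel:

$$
s[i,j] = (I \star K)[i,j]=\sum_{m}\sum_{n}I[i+m,j+n]K[m,n] \\
$$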

<img src="03_images/03_Convolved_Feature.gif" width=400px/>
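To make the difference between the two formulas concrete, here is a small NumPy sketch. The helper names `cross_correlate2d` and `convolve2d` are hypothetical, and the sketch only evaluates positions where the kernel fits entirely inside the image (so the output indices are offset relative to the infinite-sum definition).

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """s[i, j] = sum_{m, n} I[i + m, j + n] * K[m, n], kernel fully inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution as cross-correlation with the kernel flipped along both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

# Hypothetical 4x4 image and 2x2 kernel
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

xcorr = cross_correlate2d(image, kernel)  # every entry is I[i, j] - I[i+1, j+1] = -5
conv = convolve2d(image, kernel)          # flipped kernel gives the opposite sign: +5
```

For a symmetric kernel the two results coincide; they differ only when flipping the kernel changes it, as in this example.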

### Properties of convolution

Now that we’ve seen some examples of convolution, let’s note some useful properties. First of all, it behaves like multiplication, in that it’s **commutative** and associative: 

$$
u * v = v * u \\
(u * v) * w = u * (v * w) \\
$$

Another useful property of convolution is that it is **linear**:

$$
(ax + bx') * w = ax * w + bx' * w \\
x * (aw + bw') = ax * w + bx * w' \\
$$

This is convenient, because linear operations are often easier to deal with. But it also shows an inherent limit to convolution: if you have a neural net which computes lots of convolutions in sequence, it can still only compute linear functions. In order to compute more complex operations, we’ll need to apply some sort of nonlinear activation function in each layer.

One last property of convolution is that it’s **equivariant** to translation. This means that if we shift, or translate, $x$ by some amount, then the output $x * w$ is shifted by the same amount. This is a useful property in the context of neural nets, because it means the network’s computations behave in a well-defined way as we transform the inputs.
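These properties can be spot-checked numerically with `np.convolve`. The sketch below uses randomly generated signals (an assumption for illustration) and is a sanity check on a few examples, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, w = rng.normal(size=5), rng.normal(size=4), rng.normal(size=3)
x, x2 = rng.normal(size=6), rng.normal(size=6)
a, b = 2.0, -0.5

# Commutativity: u * v = v * u
commutative = np.allclose(np.convolve(u, v), np.convolve(v, u))

# Associativity: (u * v) * w = u * (v * w)
associative = np.allclose(np.convolve(np.convolve(u, v), w),
                          np.convolve(u, np.convolve(v, w)))

# Linearity in the first argument: (a x + b x') * w = a (x * w) + b (x' * w)
linear = np.allclose(np.convolve(a * x + b * x2, w),
                     a * np.convolve(x, w) + b * np.convolve(x2, w))

# Translation equivariance: delaying x by `shift` samples delays x * w by `shift`
shift = 2
x_shifted = np.concatenate([np.zeros(shift), x])
y = np.convolve(x, w)
y_shifted = np.convolve(x_shifted, w)
equivariant = (np.allclose(y_shifted[shift:], y)
               and np.allclose(y_shifted[:shift], 0.0))

print(commutative, associative, linear, equivariant)
```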

### Convolutional feature detection

As alluded to above, convolutions are even more powerful when they’re paired with nonlinearities. A sequence of convolutions can only compute a linear function, but a sequence of convolutions alternated with nonlinearities can do fancier things. E.g., consider the following sequence of operations:

1. Convolve the image with a horizontal edge filter.
2. Apply the linear rectification nonlinearity

$$
\phi(z) = \begin{cases}
z & \text{if } z > 0 \\
0 & \text{if } z \leq 0
\end{cases}
$$

3. Blur the result

This sequence of steps gives a map of horizontalness in various parts of an image; the same can be done for verticalness. You can hopefully imagine this being a useful feature for further processing. Because the resulting output can be thought of as a map of the feature strength over parts of an image, we refer to it as a **feature map**.

<img src="03_images/03_Feature_Map.PNG" width=400px/>
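The three steps can be sketched in NumPy as follows. The specific edge filter (a Prewitt-style kernel), the 3 $\times$ 3 box blur, the toy image, and the helper name `filter2d` are all assumptions chosen for illustration; the text does not prescribe particular filters.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide the kernel over the image without flipping it (kernel fully inside)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(z):
    """Linear rectification: z if z > 0, else 0."""
    return np.maximum(z, 0.0)

# Toy image: dark top half, bright bottom half, i.e. one horizontal edge
image = np.zeros((8, 8))
image[4:, :] = 1.0

# Assumed horizontal edge filter: responds to dark-above, bright-below
edge_filter = np.array([[-1.0, -1.0, -1.0],
                        [ 0.0,  0.0,  0.0],
                        [ 1.0,  1.0,  1.0]])
box_blur = np.full((3, 3), 1.0 / 9.0)

edges = filter2d(image, edge_filter)       # step 1: edge filter
rectified = relu(edges)                    # step 2: rectification
feature_map = filter2d(rectified, box_blur)  # step 3: blur

# The same pipeline on a vertical edge gives no response
vertical_response = filter2d(image.T, edge_filter)
```

The resulting `feature_map` is largest in the rows around the edge and falls off away from it, while the transposed image (a vertical edge) produces an all-zero response to this filter.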

## Convolution layers

We just saw that a convolution, followed by a nonlinear activation function, followed by another convolution, could compute something interesting. This motivates the **convolution layer**, a neural net layer which computes convolutions followed by a nonlinear activation function. Since convolution layers can be thought of as doing feature detection, they’re sometimes referred to as **detection layers**. First, let’s see how we can think about convolution in terms of units and connections.

Confusingly, the way they’re standardly defined, convolution layers don’t actually compute convolutions, but a closely related operation called **filtering**:

$$
(x \star w)[t] = \sum_{a}x[t+a]w[a] \\
$$

As the name suggests, filtering is essentially like flip-and-filter, but without the flipping (i.e., $x * w = x \star \text{flip}(w)$). The two operations are basically equivalent; the difference is just a matter of how the filter (or kernel) is represented.

In the above example, we computed a single feature map, but just as we normally use more than one hidden unit in fully connected layers, convolution layers normally compute multiple feature maps $z_{1}, \ldots, z_{M}$. The input to the layer also consists of multiple feature maps $x_{1}, \ldots, x_{D}$; these could be different color channels of an RGB image, or feature maps computed by another convolution layer. There is a separate filter $w_{ij}$ associated with each pair of an input and output feature map. The activations are computed as follows:

$$
z_{i} = \sum_{j}x_{j} \star w_{ij} \\
h_{i} = \phi(z_{i}) \\
$$

The activation function $\phi$ is applied elementwise.
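A direct, unoptimized translation of these two equations into NumPy might look like the sketch below. The array layout (`x` of shape `(D, H, W)`, `w` of shape `(M, D, kh, kw)`), the function names, and the random inputs are assumptions for illustration; like the equations above, the sketch has no bias term.

```python
import numpy as np

def filter2d(x, w):
    """Single-channel filtering: (x ⋆ w)[i, j] = sum_{m, n} x[i + m, j + n] * w[m, n]."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv_layer(x, w):
    """x: (D, H, W) input maps; w: (M, D, kh, kw) filters. Returns (M, H', W') activations."""
    n_out, n_in = w.shape[:2]
    # z_i = sum_j x_j ⋆ w_ij
    z = np.stack([sum(filter2d(x[j], w[i, j]) for j in range(n_in))
                  for i in range(n_out)])
    # h_i = phi(z_i), with phi the elementwise linear rectification
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6, 6))     # e.g. D = 3 color channels of a 6x6 image
w = rng.normal(size=(4, 3, 3, 3))  # M = 4 output maps, one 3x3 filter per (i, j) pair
h = conv_layer(x, w)
print(h.shape)  # (4, 4, 4)
```

Deep learning libraries compute the same quantity with heavily optimized routines; the nested loops here are only meant to mirror the notation.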

We can think about filtering as a layer of a neural network by thinking of the elements of $x$ and $x \star w$ as units, and the elements of $w$ as connection weights. Such an interpretation is visualized in the figure below for a one-dimensional example. Each of the units in this network computes its activations in the standard way, i.e. by summing up each of the incoming units multiplied by their connection weights. This shows that a convolution layer is like a fully connected layer, except with two additional features:

* **Sparse connectivity**: not every input unit is connected to every output unit.
* **Weight sharing**: the network’s weights are each shared between multiple connections.

<img src="03_images/03_Convolution_Layer.PNG" width=600px/>

Missing connections can be thought of as connections with weight 0. This highlights an important fact: any function computed by a convolution layer can be computed by a fully connected layer. This means convolution layers don’t increase the representational capacity, relative to a fully connected layer with the same number of input and output units. But they can reduce the numbers of weights and connections.
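The equivalence can be illustrated for a one-dimensional example by writing the filtering operation as a dense weight matrix. This is a sketch under assumed example values for `x` and `w`; `np.correlate` is used as the reference because it computes $\sum_{a}x[t+a]w[a]$ without flipping the filter.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # 6 input units
w = np.array([1.0, 2.0, -1.0])                # 3 shared weights
n_out = len(x) - len(w) + 1                   # 4 output units

# Fully connected weight matrix: row t holds w at columns t..t+2, zeros elsewhere
W = np.zeros((n_out, len(x)))
for t in range(n_out):
    W[t, t:t + len(w)] = w

dense_out = W @ x                             # fully connected computation
conv_out = np.correlate(x, w, mode="valid")   # filtering, no flip

n_dense_weights = W.size             # weights a general fully connected layer would need
n_connections = np.count_nonzero(W)  # sparse connectivity: only these are non-zero
n_parameters = len(w)                # weight sharing: distinct learnable values

print(dense_out)  # [2. 4. 6. 8.]
```

Here the fully connected layer would need 24 independent weights, while the convolution layer uses 12 connections built from only 3 shared parameters.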

## Pooling layers

We observed that a neural network’s classifications ought to be invariant to small transformations of an image, such as shifting it by a few pixels. In order to achieve invariance, we introduce another kind of layer: the pooling layer. Pooling layers summarize (or compress) the feature maps of the previous layer by computing a simple function over small regions of the image. Most commonly, this function is taken to be the maximum, so the operation is known as max-pooling.

Suppose we have input feature maps $x_{1}, \ldots, x_{N}$. Each unit of the output map computes the maximum over some region (called a pooling group) of the input map. (A typical region size is 3 $\times$ 3.) In order to shrink the representation, we don’t consider all offsets, but instead we space them by a stride $S$ along each dimension. This results in the representation being shrunk by a factor of approximately $S$ along each dimension. (A typical value for the stride is 2.) Figure 7 shows an example of how pooling can provide partial invariance to translations of the input.

<img src="03_images/03_Pooling_Figure7.PNG" width=550px/>
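A minimal NumPy sketch of max-pooling on a single feature map is given below. The function name `max_pool2d` and the toy inputs are assumptions for illustration; the defaults follow the typical values mentioned above (3 $\times$ 3 regions, stride 2), and only regions that fit entirely inside the map are pooled.

```python
import numpy as np

def max_pool2d(x, size=3, stride=2):
    """Max over size x size pooling groups spaced `stride` apart."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

# A 7x7 map shrinks to 3x3 with 3x3 groups and stride 2
feature_map = np.arange(49, dtype=float).reshape(7, 7)
pooled = max_pool2d(feature_map)

# Partial translation invariance, shown with non-overlapping 2x2 groups:
# a single active unit moved by one pixel within the same pooling group
a = np.zeros((6, 6))
a[2, 2] = 1.0
b = np.zeros((6, 6))
b[3, 3] = 1.0
same_after_pooling = np.array_equal(max_pool2d(a, size=2, stride=2),
                                    max_pool2d(b, size=2, stride=2))
```

The invariance is only partial: a shift that moves the active unit into a different pooling group does change the pooled output.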

Pooling also has the effect of increasing the size of units’ receptive fields, or the regions of the input image which influence their activations. For instance, consider the network architecture in Figure 8, which alternates between convolution and pooling layers. Suppose all the filters are 5 $\times$ 5 and the pooling layer uses a stride of 2. Then each unit in the first convolution layer has a receptive field of size 5 $\times$ 5. But each unit in the second convolution layer has a receptive field of size approximately 10 $\times$ 10, since it does 5 $\times$ 5 filtering over a representation which was shrunk by a factor of 2 along each dimension. A third convolution layer would have receptive fields of approximately 20 $\times$ 20. Hence, pooling allows small filters to account for information over large regions of an image.

<img src="03_images/03_Pooling_Figure8.PNG" width=550px/>

## Fully Connected Layer (FC Layer)

Adding a Fully-Connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features as represented by the output of the convolutional layer. The Fully-Connected layer is learning a possibly non-linear function in that space.

Now that we have converted our input image into a suitable form for our Multi-Layer Perceptron, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied at every training iteration. Over a series of epochs, the model learns to distinguish between dominating and certain low-level features in images and to classify them using the Softmax classification technique.
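The forward pass of this final stage can be sketched as follows. Everything specific here is an assumption for illustration: the feature-map shape, the hidden layer size, the three output classes, and the randomly initialized (untrained) weights. Training by backpropagation is not shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)

# Hypothetical output of the last convolution/pooling stage: 4 maps of 5x5
feature_maps = rng.normal(size=(4, 5, 5))
flat = feature_maps.reshape(-1)          # flatten into a vector of length 100

# Untrained weights for one hidden layer and a 3-class output layer
W1 = rng.normal(scale=0.1, size=(16, flat.size))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(3, 16))
b2 = np.zeros(3)

hidden = np.maximum(W1 @ flat + b1, 0.0)  # fully connected layer + rectification
probs = softmax(W2 @ hidden + b2)         # class probabilities
predicted_class = int(np.argmax(probs))
```

The softmax output is a probability distribution over the classes; during training, the weights would be adjusted so that the probability of the correct class increases.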

## Reference

[1] Saha, S. (2018, December 17). A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way. Medium. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. 

[2] Grosse, R. (2019, February 5). Lecture 9: Convolutional Networks. http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L09%20Convolutional%20Networks.pdf. 