# CNNs

Feedforward (or fully connected) networks need a very large number of parameters when applied to images, since every pixel connects to every unit. Inspired by the visual cortex of humans (or, in fact, of cats), convolutional neural networks (CNNs) are primarily used for computer vision tasks.



## Convolution Operation

The convolution operation comes from digital signal processing (it's OK if you don't know about it), where we convolve a filter with the input signal.

Luckily, it's much easier to understand for 2D images.

![](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif)



There are a number of ways of performing the convolution. Regardless of the type, in convolution, we have:

- Input image ($I$) having dimensions $m \times n$.
- Convolution filter ($F$) having dimensions $f \times f$. Filters are usually square, and it's common for them to have an odd size: $3 \times 3$, $5 \times 5$, etc.
- Output/resultant image ($O$) having dimensions $x \times y$. We will look into it soon.

To apply convolution, we place the filter on the top-left corner of the image and calculate the weighted sum of the underlying (input image's) pixels. More generally, with the filter's top-left corner sitting on pixel $(a, b)$ of the input, the output pixel is:

$$O_{a,b} = \sum_{i=0}^{f-1}\sum_{j=0}^{f-1} I_{a+i,\,b+j}F_{i,j}$$

Don't be worried by the above equation. It just means that we multiply the first pixel under the filter by the filter's first value, the second pixel by the filter's second value and so on, and then add all the products up.

As an example: We have a filter matrix as:

<TODO: insert matrices for both filter and image>
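The weighted sum described above can be sketched in plain Python. This is only an illustration: the function name `convolve2d` and the image and filter values are made up, and the filter is applied as-is (no flipping), just like in the equation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    m, n = len(image), len(image[0])   # input rows and columns
    f = len(kernel)                    # square f x f filter
    output = []
    for a in range(m - f + 1):         # top-left row of the current window
        row = []
        for b in range(n - f + 1):     # top-left column of the current window
            total = 0
            for i in range(f):
                for j in range(f):
                    total += image[a + i][b + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Made-up 4x4 image and 3x3 filter, purely for illustration.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

result = convolve2d(image, kernel)
print(result)  # [[-2, -2], [2, -2]]
```

The $4 \times 4$ input and $3 \times 3$ filter give a $2 \times 2$ output, the same shapes as in the GIF above.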



### Output image has smaller resolution

As we can see in the GIF above, we can't convolve beyond the last column, which means that the output image will be _slimmer_ than the input one. Similarly, we can't convolve beyond the last row, which means it will be _shorter_ as well.

If we look closely, with the $3 \times 3$ filter in the GIF, the output has 2 fewer columns than the input (in general, $f - 1$ fewer). The same goes for the rows.

In other words:

$$x = m - f +1$$

And similarly,

$$y = n -f +1$$

Collectively,

$$(x,y) = (m-f+1,n-f+1)$$

#### Example

We can better understand it with an example:

Suppose we have a $28 \times 28$ image (a standard MNIST image) and want to apply a $3 \times 3$ filter to it. The resultant image's dimensions would be:

$$(x,y) = (28-3+1,28-3+1)$$

$$ = (26,26)$$

### Padding

Often we need to add padding around the input image so that the filter is also applied at the borders (and hence there is less or no shrinkage in size).

![](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/arbitrary_padding_no_strides.gif)

In such a case, the input is effectively _stretched_ on both left and right (for width) and top and bottom (for height), and the resultant image grows accordingly.

If we have a padding of $1$ on each side, it means we will have an addition of $2$ in $x$ and $2$ in $y$ as well. Generalizing it for a padding of size $p$, we get:

$$(x,y) = (m-f+2p+1,n-f+2p+1)$$

#### Example

Applying a padding of 2 for the above example, we get:

$$(x,y) = (28-3+2(2)+1,28-3+2(2)+1)$$

$$ = (30,30)$$
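To see where the extra $2p$ comes from, here is a minimal sketch that pads an image by hand. It assumes zero padding (other fill values are possible), and `zero_pad` is a hypothetical helper name, not a library function.

```python
def zero_pad(image, p):
    """Return a copy of `image` with `p` rows/columns of zeros added on every side."""
    n = len(image[0])
    blank = [0] * (n + 2 * p)                     # one full-width row of zeros
    padded = [list(blank) for _ in range(p)]      # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend(list(blank) for _ in range(p))  # bottom border
    return padded


image = [[1] * 28 for _ in range(28)]   # stand-in for a 28x28 MNIST image
padded = zero_pad(image, 2)
f = 3

print(len(padded), len(padded[0]))                  # 32 32
print(len(padded) - f + 1, len(padded[0]) - f + 1)  # 30 30
```

The padded image is $32 \times 32$, and applying the unpadded formula to it gives the same $(30, 30)$ as above.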

Whoa! The output is bigger than the input. Rare, but theoretically it's still possible.

### Strides

There can be scenarios where we have low computational resources and bigger images. In such cases, it's useful to shrink the size considerably. For that, we can move the filter in jumps, or strides, rather than applying it at every consecutive position.

![](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_strides.gif)


In such cases, the size decreases by roughly a factor of the stride. If we take a stride of 2 (like in the example above), the output shrinks to about half. Similarly, with a stride of 3, it will be about one-third, and so on. Mathematically, for a stride $s$ (rounding down when the division is not exact):

$$(x,y) = \left(\left\lfloor\frac{m-f}{s}\right\rfloor+1,\left\lfloor\frac{n-f}{s}\right\rfloor+1\right)$$

**Note:** The stride is applied in both the horizontal and the vertical direction.
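One way to see the effect of a stride is to list where the filter's top-left corner lands along one axis. The sketch below uses a hypothetical helper, `window_starts`. The number of positions it returns is `(size - f) // s + 1`, so any leftover pixels that don't fit a full jump are simply never visited.

```python
def window_starts(size, f, s):
    """Top-left indices along one axis where an f-wide filter fits, stepping by s."""
    return list(range(0, size - f + 1, s))


print(window_starts(5, 3, 1))  # [0, 1, 2]
print(window_starts(5, 3, 2))  # [0, 2]
print(window_starts(6, 3, 2))  # [0, 2]
```

With a size of 5, a filter of 3 and a stride of 2 (the setup in the GIF above), there are 2 positions per axis. With a size of 6, the last column is left over and the count stays at 2.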

#### Strides with Padding

We can make it insane by having both strides and padding combined (doesn't make any sense to me, though).

![](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/padding_strides.gif)

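In this scenario, we get the full equation (rounding down when the division is not exact):

$$(x,y) = \left(\left\lfloor\frac{m-f+2p}{s}\right\rfloor+1,\left\lfloor\frac{n-f+2p}{s}\right\rfloor+1\right)$$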
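The full equation is easy to wrap in a small helper for checking the earlier examples. This is a sketch: `conv_output_size` is a made-up name, and it assumes the division is rounded down (integer division) when the stride doesn't divide evenly.

```python
def conv_output_size(m, n, f, p=0, s=1):
    """Output (x, y) for an m x n input, an f x f filter, padding p and stride s."""
    x = (m - f + 2 * p) // s + 1
    y = (n - f + 2 * p) // s + 1
    return x, y


print(conv_output_size(28, 28, 3))             # (26, 26)
print(conv_output_size(28, 28, 3, p=2))        # (30, 30)
print(conv_output_size(5, 5, 3, p=1, s=2))     # (3, 3)
print(conv_output_size(28, 32, 3, p=1, s=2))   # (14, 16)
```

The first two calls reproduce the MNIST examples from the earlier sections, and the last one shows a non-square input.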

> **Note:** The equation [on CS231n](https://cs231n.github.io/convolutional-networks/) assumes a square input, which is not always the case.

## Pooling

Usually, pixels in a local region are pretty similar (even more so for bigger images), hence we can simplify things further by just summarizing them without any weights. In this case, our filter is a fixed function and doesn't need to learn any weights, which makes backprop lighter.

We can have these pooling operations:

- Average – Take the mean of the pixels under the filter.
- Max – Pick the maximum of the pixels under the filter.

If you can think of any other pooling operator, feel free to try it. Actually, it would be good practice.

**Note:** Pooling is usually applied in a non-overlapping way, i.e. with a stride equal to the window size. JAX allows us to set the strides manually too, using the `strides` argument.
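Both pooling operators can be sketched in plain Python. This is not the JAX API, just an illustration: `pool2d`, `mean` and the sample image are made up here, and the stride defaults to the window size to get the non-overlapping behaviour.

```python
def pool2d(image, window, stride=None, reduce_fn=max):
    """Apply `reduce_fn` to each window x window patch of `image`."""
    if stride is None:
        stride = window            # non-overlapping pooling by default
    m, n = len(image), len(image[0])
    output = []
    for a in range(0, m - window + 1, stride):
        row = []
        for b in range(0, n - window + 1, stride):
            patch = [image[a + i][b + j]
                     for i in range(window) for j in range(window)]
            row.append(reduce_fn(patch))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


# Made-up 4x4 image, purely for illustration.
image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 3, 1],
    [8, 6, 7, 5],
]

max_pooled = pool2d(image, 2)                  # [[7, 8], [8, 7]]
avg_pooled = pool2d(image, 2, reduce_fn=mean)  # [[4.0, 5.0], [5.0, 4.0]]
overlapping = pool2d(image, 2, stride=1)       # 3x3 output, windows overlap
print(max_pooled, avg_pooled, overlapping)
```

Since `reduce_fn` is just a function over the patch, trying out another pooling operator (say, `min`) is a one-argument change.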

---

<To be continued>