# Convolutional Neural Networks 🧊

As discussed previously, the primary difference between a convolutional neural network (CNN) and a fully connected multi-layer perceptron (MLP) lies in how the input data is encoded into the network.

- MLPs treat input features as independent and flat, losing information about spatial relationships.

- **CNNs preserve and exploit the locality of 2D input data** (such as images), enabling the network to learn spatial hierarchies of features.

- This matters most for images, where nearby pixels are strongly related and their arrangement carries meaning.

CNNs are essentially stacked layers of pattern matching. The matching is done by *kernels*: trainable weighted-sum functions (similar in role to a perceptron) that sweep across the image in *patches*.

- **Each kernel essentially acts as an eye scanning the image for the presence of a particular feature, based on the parameters of the kernel!**

- - The weighted sums computed over all patches together form the feature map associated with that kernel.

- - The result of this scanning is a *feature map*: **a spatial map where higher values indicate stronger evidence of the learned pattern in specific regions, and lower (or negative) values indicate absence or contrast of that pattern.**
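The scanning described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a square image, a square kernel, a stride of 1, and no padding, and both the `conv2d` helper and the hand-picked `vertical_edge` kernel are hypothetical (in a real CNN the kernel values are learned). Like most deep learning code, it slides the kernel without flipping it.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a square image one pixel at a time (no padding)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    feature_map = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Weighted sum of the patch whose top-left corner is (row, col).
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Hypothetical 4x4 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel (an assumption for illustration) that responds to a
# dark-to-bright transition from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 2, 0]
# [0, 2, 0]
# [0, 2, 0]
```

The middle column of the feature map is high because that is where the patches straddle the edge; the flat regions on either side produce 0.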

When we last left off we covered how the kernel is applied to each patch:

Ex:

<img src="./media/CNN_visualization.gif" width="500px">


## Understanding Stride and Kernel Movement

<img src="./media/CNN_visualization.gif" width="500px">

Note how the kernel moves:

- The kernel does not visit completely disjoint patches; each patch overlaps with the one before it and the one after it.
- The kernel moves, or *strides*, one pixel at a time, sweeping to the right and then down.
- - Therefore we say that the **stride of this kernel's application is 1**.
- If the stride equaled the kernel width (for example, a stride of *3* for a 3x3 kernel), the patches would be completely disjoint (non-overlapping). However, the last patch in a row or column could then extend past the edge of the image **unless we add some padding to the image that prevents this**.

**A stride of 1 ensures that the kernel covers the image densely, without skipping positions, and without risking incomplete patches at the edges (assuming appropriate padding is applied).**

Assuming square kernels and a square input image we can actually use a formula to compute the dimensions of the feature map produced by the kernel:

$$\text{feature map size} = \left\lfloor\frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{stride}}\right\rfloor + 1$$

*Note the floor operation applied to the fraction.*

- **The padding here is the number of pixels added to each side of the image.**

By size we mean the width or height of the respective item (which are identical, since we assume everything is square).
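As a small sketch of this calculation, the hypothetical helper below (the name `feature_map_size` and its defaults are assumptions for illustration) takes the floor of the fraction and adds 1 to count the kernel's starting position. It assumes a square input and a square kernel.

```python
def feature_map_size(input_size, kernel_size, stride=1, padding=0):
    """Side length of the feature map for a square input and a square kernel."""
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # Integer division performs the floor; + 1 counts the first position.
    return span // stride + 1


print(feature_map_size(7, 3))                       # 5
print(feature_map_size(7, 3, stride=2))             # 3
print(feature_map_size(7, 3, stride=3))             # 2
print(feature_map_size(7, 3, stride=3, padding=1))  # 3
```

In the third call the floor silently discards the last column: with a 7-pixel input, a 3x3 kernel, and a stride of 3, the kernel only fits at offsets 0 and 3, so pixel 6 is never covered. Adding one pixel of padding (fourth call) makes a third position fit.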

- **While it may be tempting to rearrange the formula such that it solves for stride, this isn't necessarily helpful.**

Ignoring the floor, we could solve for stride as follows:

$$\text{stride} = \frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{feature map size} - 1}$$

- One issue is that this formula sometimes gives a fractional value for the stride, which is not possible: the kernel cannot move by a "partial pixel".

Example:

Suppose we know all of the following and want to calculate the stride:

- Input size = 7
- Kernel size = 3
- (desired) Output size = 4
- Padding = 0

$$\text{stride} = \frac{7 + 2 \cdot 0 - 3}{4 - 1} = \frac{4}{3} \approx 1.333$$

**There are several potential solutions in this case:**

- Add some padding
- Change the kernel size
- Change the desired output size
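One way to explore the first option is a brute-force search over small padding values and whole-number strides. The sketch below is only an illustration: `find_configs` and its `max_padding` limit are hypothetical names, and it assumes a square input and kernel with the output size computed as the floor of the fraction plus 1.

```python
def find_configs(input_size, kernel_size, output_size, max_padding=2):
    """Return (padding, stride) pairs that produce exactly `output_size`."""
    configs = []
    for padding in range(max_padding + 1):
        span = input_size + 2 * padding - kernel_size
        if span < 0:
            continue  # kernel does not fit even with this padding
        # Strides larger than span + 1 all give an output size of 1.
        for stride in range(1, span + 2):
            if span // stride + 1 == output_size:
                configs.append((padding, stride))
    return configs


# The example above: input 7, kernel 3, desired output 4.
print(find_configs(7, 3, 4))  # [(1, 2)]
```

With no padding, no whole-number stride gives an output of 4, but one pixel of padding with a stride of 2 does.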

Another option, which is more mathematically involved, is **transposed convolution**, which we will not cover here.

When we actually toss data into the models, **kernels in the same level of the neural network (i.e. adjacent kernels) are independent of each other**.

- Kernels act on the same underlying input but act as different lenses and filters on the data.
- Since they are computed independently, **kernels on the same layer can be convolved in parallel over the same input data.**
- This is why GPUs are so useful: they have thousands of tiny computing cores that can be delegated to compute kernels in parallel with a massive amount of data throughput!

A **smaller stride (such as 1) maximizes the resolution of the feature map by ensuring that every possible local pattern is examined with significant overlap.** This overlap allows the network to build a detailed and high-resolution representation of the spatial relationships in the image.

- In contrast, a larger stride skips positions, reducing overlap and producing a coarser, lower-resolution summary of the input, as is evident in the smaller feature maps produced as a result.
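Both ideas, independent kernels and the effect of stride on resolution, can be seen in one small sketch. Everything here is an assumption for illustration: the `conv2d` helper, the 5x5 input, and the two hand-picked kernels (real kernels are learned). The kernels are applied one after another in a loop, but since none of them reads another's output, they could just as well run in parallel.

```python
def conv2d(image, kernel, stride=1):
    """No-padding convolution of a square image with a square kernel."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


# Hypothetical 5x5 input whose values increase left to right, top to bottom.
image = [[r * 5 + c for c in range(5)] for r in range(5)]

# Two hand-picked kernels acting as different "lenses" on the same input.
kernels = {
    "sum": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "horizontal_diff": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}

maps_by_stride = {}
for stride in (1, 2):
    # Each kernel only reads the shared image, never another kernel's output.
    maps_by_stride[stride] = {
        name: conv2d(image, kernel, stride) for name, kernel in kernels.items()
    }
    for name, fmap in maps_by_stride[stride].items():
        print(f"stride={stride} {name}: {len(fmap)}x{len(fmap[0])}")
# stride=1 sum: 3x3
# stride=1 horizontal_diff: 3x3
# stride=2 sum: 2x2
# stride=2 horizontal_diff: 2x2
```

Each kernel yields its own feature map, and moving from a stride of 1 to a stride of 2 shrinks every map from 3x3 to 2x2.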

## Downsampling

What if we intentionally wanted to reduce the resolution of an image? What are the ways in which we could do this?

- As discussed before, increasing the stride length will substantially reduce the size of the resulting feature map according to the formula:

$$\text{feature map size} = \left\lfloor\frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{stride}}\right\rfloor + 1$$

In general, **doubling the stride length roughly halves the side length of the resulting feature map, tripling it cuts it to roughly a third, and so on.**

- By creating a smaller feature map, we are *downsampling* our representation of the original image.

- Intentional downsampling is good practice: it helps generalize our insights from the input data while also improving the computational performance of the model (a result of needing to compute smaller feature maps).
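As a quick check of this rule of thumb, take a hypothetical 32-pixel-wide input with a 3x3 kernel and a padding of 1 (values chosen only for illustration). Taking the floor of the fraction and adding 1 for the kernel's first position gives:

$$
\begin{aligned}
\text{stride} = 1:&\quad \left\lfloor\frac{32 + 2 \cdot 1 - 3}{1}\right\rfloor + 1 = 31 + 1 = 32 \\
\text{stride} = 2:&\quad \left\lfloor\frac{32 + 2 \cdot 1 - 3}{2}\right\rfloor + 1 = 15 + 1 = 16 \\
\text{stride} = 3:&\quad \left\lfloor\frac{32 + 2 \cdot 1 - 3}{3}\right\rfloor + 1 = 10 + 1 = 11
\end{aligned}
$$

Doubling the stride halves the side length exactly in this case, while tripling it gives 11, which is close to but not exactly a third of 32. The scaling is approximate because of the floor and the kernel size term.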

Another option is to use a technique called *pooling*, which is a form of post-processing on the feature map itself, somewhat similar to convolution.

Here is an example:

<img src="./media/convolution_and_max_pooling_example.png" width="500px">

In the example, the kernel is:

$$
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1 \\
\end{bmatrix}
$$

**Max pooling is a technique that splits the feature map into patches and keeps only the highest value in each patch to represent the entire patch.**

- Max pooling can be thought of as a technique similar to convolution. The example shows how the max pooling process was done "with a 2x2 filter and a stride of 2". But you could reinterpret that as simply chunking up the feature map into 2 by 2 patches and then reducing each 2 by 2 patch to its greatest value.

- Alternatively, we could apply **average pooling** to reduce each patch to the average of its values. Note that this value is not supposed to directly represent a pixel, so it is perfectly okay for it to be fractional.
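Both kinds of pooling can be sketched with a single helper that differs only in how each patch is reduced. This is a minimal illustration under stated assumptions: `pool2d`, `mean`, and the 4x4 feature map values are hypothetical and are not taken from the figure above, and the feature map is assumed to be square.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Reduce each size x size patch of a square feature map to one value."""
    out_size = (len(feature_map) - size) // stride + 1
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Collect the values of the patch starting at (r * stride, c * stride).
            patch = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reducer(patch))
        pooled.append(row)
    return pooled


def mean(values):
    """Arithmetic mean, used as the reducer for average pooling."""
    return sum(values) / len(values)


# Hypothetical 4x4 feature map, values chosen only for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

max_pooled = pool2d(feature_map)                # 2x2 filter, stride 2
avg_pooled = pool2d(feature_map, reducer=mean)  # same patches, averaged

print(max_pooled)  # [[4, 2], [2, 8]]
print(avg_pooled)  # [[2.5, 1.0], [1.25, 6.5]]
```

Note that nothing in `pool2d` is trainable: the reducer is a fixed rule, so pooling has no parameters to learn.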

The biggest downside of this technique is that, unlike general convolution, it applies a fixed rule rather than a tunable (learned) function.

- Strided convolution (stride > 1) generates a new learned feature map while simultaneously reducing spatial resolution. It therefore integrates downsampling with feature extraction, unlike pooling, which applies a fixed post-processing rule.

- Pooling can still serve as a supplementary technique, applied after standard or strided convolution, to further reduce spatial size or introduce robustness to local variations.

**Ultimately, you can think of pooling as a sort of plug-and-play post-processing tweak that may slightly improve the model, whereas strided convolution produces a different, learned feature map altogether.**

**Note that if you do choose to apply pooling, you can set the stride length of the pooling process to be greater than the size of the filter to skip portions of the feature map entirely, or choose a stride length smaller than the size of the filter to perform overlapping pooling.**
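To make the three cases concrete, the sketch below lists which positions each pooling window covers along one axis. The helper `pool_windows` is a hypothetical name introduced for illustration, and the window count assumes no padding.

```python
def pool_windows(length, size, stride):
    """Half-open index ranges (start, end) covered by each window along one axis."""
    count = (length - size) // stride + 1
    return [(i * stride, i * stride + size) for i in range(count)]


# Stride equal to the filter size: windows tile the axis exactly.
print(pool_windows(8, size=2, stride=2))  # [(0, 2), (2, 4), (4, 6), (6, 8)]

# Stride smaller than the filter size: neighboring windows overlap.
print(pool_windows(8, size=2, stride=1))

# Stride larger than the filter size: indices 2 and 5 are never read.
print(pool_windows(8, size=2, stride=3))  # [(0, 2), (3, 5), (6, 8)]
```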


