# Convolutional Neural Networks 🧊

As discussed previously, the primary difference between a convolutional neural network (CNN) and a fully connected multi-layer perceptron (MLP) lies in how the input data is encoded into the network.

- MLPs treat input features as independent and flat, losing information about spatial relationships.

- **CNNs preserve and exploit the locality of 2D input data** (such as images), enabling the network to learn spatial hierarchies of features.

- This makes CNNs particularly well suited to 2D inputs such as images, where nearby pixels are strongly related.

CNNs are basically stacked layers of pattern matching. The matching is done by *kernels*: trainable weighted-sum functions (similar in role to a perceptron) that sweep across the image one *patch* at a time.

- **Each kernel essentially acts as an eye scanning the image for the presence of a particular feature, based on the parameters of the kernel!**

- - The weighted sum computed for each patch becomes one entry of the feature map associated with that kernel.

- - The result of this scanning is a *feature map*: **a spatial map where higher values indicate stronger evidence of the learned pattern in specific regions, and lower (or negative) values indicate absence or contrast of that pattern.**

When we last left off, we covered how the kernel is applied to each patch, the general movement of a kernel across an image, and the intuition behind how kernels can be structured to learn some low-level features.


## Understanding Stride and Kernel Movement

<img src="./media/CNN_visualization.gif" width="500px">

Note how the kernel moves:

- The patches are not disjoint: each patch overlaps with the ones before and after it.
- The kernel moves or *strides* one pixel at a time, moving to the right and then down.
- - Therefore we say that the **stride of this kernel's application is 1**.
- If the stride were *3* (equal to the kernel width), the patches would be completely disjoint (non-overlapping). However, the kernel could then run past the edge of the image **unless we add some padding to the image that prevents this**.

**A stride of 1 ensures that the kernel covers the image densely, without skipping positions, and without risking incomplete patches at the edges (assuming appropriate padding is applied).**
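
To make the movement concrete, here is a minimal NumPy sketch of a kernel sweeping right and then down. The function name `conv2d`, the 7x7 test image, and the all-ones kernel are assumptions chosen for illustration; real libraries implement this far more efficiently.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square `kernel` over a 2D `image` and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding)  # zero padding on all four sides
    k = kernel.shape[0]  # assumes a square kernel
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):        # move down
        for j in range(out_w):    # move right
            top, left = i * stride, j * stride
            patch = image[top:top + k, left:left + k]
            feature_map[i, j] = np.sum(patch * kernel)  # weighted sum of the patch
    return feature_map

image = np.arange(49, dtype=float).reshape(7, 7)  # hypothetical 7x7 "image"
kernel = np.ones((3, 3))                          # hypothetical 3x3 kernel

dense = conv2d(image, kernel, stride=1)   # overlapping patches
coarse = conv2d(image, kernel, stride=3)  # disjoint patches; the last row and column are never visited
print(dense.shape, coarse.shape)          # (5, 5) (2, 2)
```

With stride 3 and no padding, this sketch simply stops before the kernel would leave the image, so part of the input is skipped. Padding is the alternative to dropping those positions.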

Assuming square kernels and a square input image we can actually use a formula to compute the dimensions of the feature map produced by the kernel:

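$$\text{feature map size} = \left\lfloor\frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{stride}}\right\rfloor + 1$$

*Note the floor operation applied to the fraction; the $+1$ counts the kernel's starting position. For example, a 3x3 kernel on a 7x7 image with stride 1 and no padding yields a 5x5 feature map.*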

- **The padding here is the number of pixels added to each side of the image.**

By size we mean the width or height of the respective item (which are identical, since we assume everything is square).
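
The size calculation can be wrapped in a small helper. This is a sketch under the square-input, square-kernel assumption stated above; the name `feature_map_size` is made up for this example. Note the `+ 1` after the floor division, which counts the kernel's starting position.

```python
def feature_map_size(input_size, kernel_size, stride=1, padding=0):
    """Side length of the feature map for a square input and a square kernel."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(feature_map_size(7, 3))               # 5
print(feature_map_size(7, 3, stride=2))     # 3
print(feature_map_size(32, 3, padding=1))   # 32: padding of 1 keeps the size with a 3x3 kernel
print(feature_map_size(8, 3, stride=2))     # 3: floor(5 / 2) + 1
```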

- **While it may be tempting to rearrange the formula such that it solves for stride, this isn't necessarily helpful.**

We could solve for stride as follows:

$$\text{stride} = \frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{feature map size} - 1}$$

- One issue is that this formula sometimes gives a fractional value for the stride, which is not possible: a kernel cannot move by a "partial pixel".

Example:

Suppose we know all the following and want to calculate stride

- Input size = 7
- Kernel size = 3
- (desired) Output size = 4
- Padding = 0

$$\text{stride} = \frac{7 + 2 \cdot 0 - 3}{4 - 1} = \frac{4}{3} \approx 1.333$$

**There are several potential solutions in this case:**

- Add some padding
- Change the kernel size
- Change the desired output size

Another, more mathematically involved option is **transposed convolution**, which we will not cover here.
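
Rather than solving for stride algebraically, one practical approach is to try small integer settings and keep those that produce the desired output size. The sketch below does this for the example above (input 7, kernel 3, target 4). The helper name and the search ranges are assumptions for illustration.

```python
def feature_map_size(input_size, kernel_size, stride, padding):
    return (input_size + 2 * padding - kernel_size) // stride + 1

input_size, kernel_size, target = 7, 3, 4

# Try small integer strides and paddings and keep the combinations that hit the target.
matches = [
    (stride, padding)
    for stride in range(1, 4)
    for padding in range(0, 3)
    if feature_map_size(input_size, kernel_size, stride, padding) == target
]
print(matches)  # [(2, 1)] -> stride 2 with 1 pixel of padding
```

Within these ranges, the only match is stride 2 with 1 pixel of padding, which corresponds to the "add some padding" option.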

When we actually toss data into the models, **kernels in the same level of the neural network (i.e. adjacent kernels) are independent of each other**.

- Kernels act on the same underlying input but act as different lenses and filters on the data.
- Since they are computed independently, **kernels on the same layer are convolved in parallel on the same input data.**
- This is why GPUs are so useful: they have thousands of tiny computing cores that can be delegated to compute kernels in parallel with a massive amount of data throughput!
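
The independence of kernels can be illustrated with NumPy. This runs on the CPU, so it only sketches the idea rather than showing real GPU parallelism. The random image and the four random kernels are placeholders, not learned values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((6, 6))                 # hypothetical single-channel input
kernels = rng.standard_normal((4, 3, 3))   # 4 independent 3x3 kernels

# One kernel at a time: no kernel needs another kernel's result.
one_by_one = np.stack([
    np.array([[np.sum(image[i:i + 3, j:j + 3] * k) for j in range(4)]
              for i in range(4)])
    for k in kernels
])

# All kernels at once: every patch is matched against every kernel in a single call.
patches = sliding_window_view(image, (3, 3))                 # shape (4, 4, 3, 3)
all_at_once = np.einsum("ijab,kab->kij", patches, kernels)   # shape (4, 4, 4)

print(all_at_once.shape)
print(np.allclose(one_by_one, all_at_once))  # True
```

Both versions give the same four feature maps, which is what lets hardware compute them in any order or all at the same time.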

A **smaller stride (such as 1) maximizes the resolution of the feature map by ensuring that every possible local pattern is examined with significant overlap.** This overlap allows the network to build a detailed and high-resolution representation of the spatial relationships in the image.

- In contrast, a larger stride skips positions, reducing overlap and producing a coarser, lower-resolution summary of the input, as is evident in the smaller feature maps produced as a result.

## Downsampling

What if we intentionally wanted to reduce the resolution of an image? What are the ways in which we could do this?

- As discussed before, increasing the stride length will substantially reduce the size of the resulting feature map according to the formula:

$$\text{feature map size} = \left\lfloor\frac{\text{input size} + 2 \cdot \text{padding} - \text{kernel size}}{\text{stride}}\right\rfloor + 1$$

In general, **doubling the stride length roughly halves the side length of the resulting feature map, tripling it cuts it to roughly a third, and so on.**

- By creating a smaller feature map, we are *downsampling* our representation of the original image.

- Intentional downsampling can help the model generalize from the input data while also reducing its computational cost, since the feature maps it has to compute are smaller.

Another option is to use a technique called *pooling*, which is a form of postprocessing applied to the feature map itself and is somewhat similar to convolution.

Here is an example:

<img src="./media/convolution_and_max_pooling_example.png" width="500px">

In the example, the kernel is:

$$
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1 \\
\end{bmatrix}
$$

**Max pooling is a technique that splits the feature map into patches and keeps only the highest value in each patch to represent that entire patch.**

- Max pooling can be thought of as a technique similar to convolution. The example shows how the max pooling process was done "with a 2x2 filter and a stride of 2". But you could reinterpret that as simply chunking up the feature map into 2 by 2 patches and then reducing each 2 by 2 patch to its greatest value.

- Alternatively we could apply **average pooling** to reduce each patch to the average of its values. Note that this value is not supposed to directly represent a pixel, so it is perfectly okay for it to be fractional.
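
Both pooling variants can be sketched with one helper that takes the reduction as a parameter. The function name `pool2d` and the 4x4 feature map are made up for this example and are not the values from the figure.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, reduce=np.max):
    """Reduce each size x size patch of `feature_map` to one value using `reduce`."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            pooled[i, j] = reduce(feature_map[top:top + size, left:left + size])
    return pooled

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
], dtype=float)

max_pooled = pool2d(feature_map)                  # rows: [4, 2] and [2, 8]
avg_pooled = pool2d(feature_map, reduce=np.mean)  # rows: [2.5, 1.0] and [1.25, 6.5]
print(max_pooled)
print(avg_pooled)
```

There are no trainable weights anywhere in `pool2d`: the rule (`np.max` or `np.mean`) is fixed in advance.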

The biggest downside of this technique is that unlike general convolution, it is an application of a fixed rule as opposed to a tuneable function.

- Strided convolution (stride > 1) generates a new learned feature map while simultaneously reducing spatial resolution. It integrates downsampling with feature extraction, unlike pooling, which applies a fixed post-processing rule.

- Pooling can still serve as a supplementary technique, applied after standard or strided convolution, to further reduce spatial size or introduce robustness to local variations.

**Ultimately you can think of pooling as a sort of plug-and-play post-processing tweak that may slightly improve the model, whereas strided convolution produces a different, learned feature map altogether.**

**Note that if you do choose to apply pooling, you can set the stride length of the pooling process to be greater than the size of the filter to skip portions of the feature map entirely, or choose a stride length smaller than the size of the filter to perform overlapping pooling.**




## Adding Channels and Activations

So far we have been discussing kernels as 2D 3x3 matrices, but in reality there is a third dimension that must be accounted for when images have multiple channels.

- Most images have 3 or 4 channels: red, green, blue, and optionally alpha (transparency).

- In the context of deep learning, we can view the kernel as a 3-dimensional cube, where each slice of the cube is applied to the corresponding channel (slice) of the input image.

**In the context of deep learning and libraries like TensorFlow, this kernel is an example of a tensor.**

- Simply put, **a tensor is a multidimensional data structure with well-defined rules that allow it to take part in linear algebra operations like dot products and matrix multiplications.**

- A 3D array such as a kernel is a rank-3 tensor.

However, when we apply the 3 by 3 by 3 kernel to a 3 by 3 by 3 image patch, the resulting feature is still a single number.

---

### Example of kernel application with 3 channels

Input patches:

$$
\text{patch}_R =
\begin{bmatrix}
1 & 2 & 1 \\
0 & 1 & 0 \\
2 & 1 & 2
\end{bmatrix}
$$

$$
\text{patch}_G =
\begin{bmatrix}
0 & 1 & 0 \\
1 & 2 & 1 \\
0 & 1 & 0
\end{bmatrix}
$$

$$
\text{patch}_B =
\begin{bmatrix}
2 & 0 & 2 \\
1 & 1 & 1 \\
2 & 0 & 2
\end{bmatrix}
$$

---

Kernel weights:

$$
\text{kernel}_R =
\begin{bmatrix}
0 & 1 & 0 \\
1 & -1 & 1 \\
0 & 1 & 0
\end{bmatrix}
$$

$$
\text{kernel}_G =
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1
\end{bmatrix}
$$

$$
\text{kernel}_B =
\begin{bmatrix}
-1 & 0 & -1 \\
0 & 1 & 0 \\
-1 & 0 & -1
\end{bmatrix}
$$

---

Dot product results:

$$
\text{R sum: } (1\times 0) + (2\times 1) + (1\times 0) + (0\times 1) + (1\times -1) + (0\times 1) + (2\times 0) + (1\times 1) + (2\times 0) = 2
$$

$$
\text{G sum: } (0\times 1) + (1\times 0) + (0\times 1) + (1\times 0) + (2\times 1) + (1\times 0) + (0\times 1) + (1\times 0) + (0\times 1) = 2
$$

$$
\text{B sum: } (2\times -1) + (0\times 0) + (2\times -1) + (1\times 0) + (1\times 1) + (1\times 0) + (2\times -1) + (0\times 0) + (2\times -1) = -7
$$

---

Final value:

$$
\text{output feature} = 2 + 2 + (-7) = -3
$$
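
The same arithmetic can be checked in NumPy. This sketch stores the channels along the first axis (R, G, B), which is a layout chosen for readability here and not a requirement.

```python
import numpy as np

# Shape (3, 3, 3): channel first, then the 3x3 patch for that channel.
patch = np.array([
    [[1, 2, 1], [0, 1, 0], [2, 1, 2]],        # R
    [[0, 1, 0], [1, 2, 1], [0, 1, 0]],        # G
    [[2, 0, 2], [1, 1, 1], [2, 0, 2]],        # B
])
kernel = np.array([
    [[0, 1, 0], [1, -1, 1], [0, 1, 0]],       # R weights
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],        # G weights
    [[-1, 0, -1], [0, 1, 0], [-1, 0, -1]],    # B weights
])

per_channel = (patch * kernel).sum(axis=(1, 2))  # one weighted sum per channel: 2, 2, -7
output_feature = per_channel.sum()               # channels are added together: -3
print(per_channel, output_feature)
```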

**The idea that a kernel "squashes" all the color channel info at each patch into a single scalar feels like it might lose richness.**

- The kernel’s job is precisely to blend and weigh the input channels in a way that highlights useful patterns.

- A kernel might learn to detect red-green contrasts.

- Another kernel might learn to detect blue intensity edges.

- **The dot product produces 1 value that reflects the combined evidence across all channels for that kernel's target feature.**

---

### The Missing Piece: Achieving Nonlinearity

Recall that we showed how simple weighted sums failed to capture non-linear relationships in basic feed forward networks. **Ultimately, by computing dot products to make feature maps, we are creating the same weighted sums and have to tackle the same problem.**

If we consider the individual computed features in a feature map to be akin to the scalar outputs of neurons in a basic perceptron network, we can postprocess each feature by applying an activation function like ReLU to it.

ReLU is simply defined as:

$$
\operatorname{ReLU}(x) = \max(0, x)
$$

This means:
- If the input value is positive, ReLU leaves it unchanged.
- If the input value is negative, ReLU sets it to zero.

Without ReLU:
- The network would just compute **linear combinations** of the input features at every layer.
- Stacking multiple layers would still result in a model that is no more powerful than a single linear layer.
- The network would be unable to learn or represent the non-linear patterns that are crucial for complex tasks like image classification.

With ReLU:
- We introduce non-linearity at each layer, allowing the network to model complex decision boundaries.
- We create **sparse activations**: many outputs are zero, which can make the network more efficient and may help reduce overfitting.
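
The claim that stacked linear layers collapse into one can be checked numerically. The tiny weight matrices `W1` and `W2` and the input `x` below are hand-picked for illustration and do not come from any trained model.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first "layer"
W2 = np.array([[1.0, 1.0]])    # second "layer"
x = np.array([1.0, 2.0])

stacked = W2 @ (W1 @ x)                  # two linear layers applied in sequence
collapsed = (W2 @ W1) @ x                # a single equivalent linear layer
with_relu = W2 @ np.maximum(0, W1 @ x)   # ReLU between the two layers

print(stacked, collapsed)  # identical results: both are [0.]
print(with_relu)           # [1.] -> no single matrix reproduces this for all inputs
```

Without the activation, `W2 @ W1` is just another matrix, so depth adds nothing. Inserting ReLU between the layers breaks that equivalence.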

---

### Example of ReLU

Suppose a convolution produces a small feature map:
$$
\begin{bmatrix}
-3 & 5 \\
2 & -1
\end{bmatrix}
$$

Applying ReLU:

$$
\begin{bmatrix}
0 & 5 \\
2 & 0
\end{bmatrix}
$$
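
In NumPy, ReLU is a single element-wise call. A minimal sketch that reproduces the example above:

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

feature_map = np.array([[-3, 5],
                        [2, -1]])
activated = relu(feature_map)
print(activated)  # negatives become 0; 5 and 2 pass through unchanged
```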

- This is ideal because our goal is to specialize each feature or "neuron" in the feature map to detect the likelihood of the kernel's feature being present at a given location.

- **Understand the nuance:** having negative values in the kernel itself is important because these allow the kernel to capture contrasts and patterns (for example, differences between light and dark regions) that define the feature it is looking for. 

However, having negative values in the **feature map** is unnecessary, because the role of the feature map is to indicate the *presence* of a feature at a location, not the absence of one. ReLU ensures that only positive evidence for a feature is propagated forward.

---

## Putting it Together

Recall one of the questions mentioned in the teaser in the prior lesson:

*"Even if we can detect low-level features, how exactly are these combined to represent more abstract patterns or objects?"*

What we've discussed so far is only how to extract the first level of features from an image.

The process of getting higher-level features involves adding two simple steps (steps 4 and 5 below) to our current recipe:

1) **Input raw image**: e.g., a 3-channel RGB image of size 32×32. Its shape is $(32, 32, 3)$.

2) **Run the kernels over the image**: Each kernel (with depth matching the input channels) slides over the image, producing a 2D feature map. Each map highlights regions where its learned pattern (e.g., edge, texture, color contrast) is detected.

3) **Apply an activation function to each value in each feature map**: Preferably ReLU or a variant of it.

4) **Stack the 2D feature maps generated by each kernel**: If we used 6 kernels this forms a stack of 6 feature maps. Together, these can be thought of as a new multi-channel "image" of shape $(H, W, 6)$ (note that the input image had 3 channels).

5) **Treat this new stacked entity as the new "image" and repeat steps 2-5**:
   - The next layer applies new kernels that operate across *all* of these input channels.
   - Each kernel now learns to detect patterns *of patterns* — e.g., combinations of edges forming corners, simple textures forming shapes.
   - The output of these kernels is again a stack of 2D feature maps.
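
The recipe above can be sketched end to end by tracking shapes through two layers. This assumes stride 1, no padding, and random weights standing in for learned ones. The function name `conv_layer` and the choice of 16 kernels in the second layer are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer(x, kernels):
    """x: (H, W, C_in); kernels: (C_out, k, k, C_in). Stride 1, no padding, then ReLU."""
    k = kernels.shape[1]
    # patches[h, w, c, a, b] == x[h + a, w + b, c]
    patches = sliding_window_view(x, (k, k), axis=(0, 1))
    # Each kernel spans all input channels and yields one 2D feature map;
    # the maps are stacked along the last axis.
    feature_maps = np.einsum("hwcab,oabc->hwo", patches, kernels)
    return np.maximum(0, feature_maps)  # ReLU

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # step 1: raw RGB image
layer1 = rng.standard_normal((6, 3, 3, 3))    # 6 kernels, depth 3 to match RGB
layer2 = rng.standard_normal((16, 3, 3, 6))   # 16 kernels, depth 6 to match layer 1's output

h1 = conv_layer(image, layer1)   # steps 2-4: shape (30, 30, 6)
h2 = conv_layer(h1, layer2)      # step 5: repeat on the new "image", shape (28, 28, 16)
print(h1.shape, h2.shape)
```

Notice that the depth of each second-layer kernel (6) must match the number of feature maps produced by the first layer.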

**By stacking simple patterns in place of the original image, deeper layers of the CNN tune themselves to increasingly complex patterns:**

- **First layer:** Detects low-level features like edges, lines, simple color contrasts.
- **Second layer:** Detects combinations of these — corners, curves, small shapes.
- **Third layer (and beyond):** Detects parts of objects, textures, or entire patterns relevant to the task (e.g., eyes, wheels, fur texture).

In this way, the network builds a hierarchy: simple features are combined into complex ones, layer by layer, until we have high-level abstractions that can be used for tasks like classification or detection.

**An important caveat is that these features are learned, not preprogrammed. This means that developers do not create the kernels themselves.**

- Instead, the kernels start out with random values (weights).
- Through training, the network adjusts these values using optimization techniques (like gradient descent) to minimize a loss function, thereby "learning" which patterns are useful for the task (e.g., classification).
- This process allows the network to discover the most relevant features directly from the data, rather than relying on human-designed rules or filters.
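
As a toy illustration of that learning process, the sketch below starts from a random 3x3 kernel and uses plain gradient descent on a mean squared error loss to recover a hand-written "target" kernel. Everything here is an assumption made for the demo: the random image, the vertical-edge target kernel, the learning rate, and the step count. Real networks learn from labeled data through backpropagation across many layers, not from a known kernel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.standard_normal((20, 20))
patches = sliding_window_view(image, (3, 3))   # shape (18, 18, 3, 3)

# Hypothetical kernel we pretend is "useful for the task".
true_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])
target = np.einsum("ijab,ab->ij", patches, true_kernel)

kernel = rng.standard_normal((3, 3)) * 0.1     # random starting weights
learning_rate = 0.1
for step in range(500):
    prediction = np.einsum("ijab,ab->ij", patches, kernel)
    error = prediction - target
    # Gradient of the mean squared error with respect to each kernel weight.
    gradient = 2 * np.einsum("ij,ijab->ab", error, patches) / error.size
    kernel -= learning_rate * gradient          # gradient descent update

print(np.round(kernel, 3))  # should end up very close to true_kernel
```

Nobody typed the final weights in: they emerged from repeatedly nudging random values in the direction that reduces the loss.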