# Convolutional Neural Networks  

We've seen what fully connected networks do. Finding features is their specialty. But not all data is created equal. Depending on the task at hand, the input itself may take different shapes. Decades ago, researchers started looking into how a network could exploit the structure of images. As we know, (grayscale) images are 2D arrays. Flattening them into 1D vectors means that pixels that used to be vertically adjacent end up far apart. Of course, considering that *Fully-Connected Networks* (aka *dense* networks, because fully-connected layers are called *dense layers*; aka *multilayer perceptrons*; aka the *vanilla* network, which is what I call it) are just a series of linear transformations, each followed by an activation, pixels that are far apart would still contribute to the next layer of neurons. This is not a bad thing, but consider this:  
I have an image. Let's say a 20x20 patch of that image depicts an object (object A). In a totally unrelated part of the image I might have an unrelated, different object (object B). FCNs would use both the pixels from object A and object B to compute the next layer. I present to you 2 problems:  
1) the network might wrongly think these objects are related, since they might share colors or shapes.  
2) the network has to do **a lot** of computations, since all pixels get connected to all neurons of the next layer.  

The 1st problem has to do with locality (or spatiality). In order to be efficient in finding patterns, we have to respect the fact that an image's pixels are spatially related. This assumption saves us the trouble of wrong associations.  
The 2nd problem has to do with the number of parameters. By only considering the pixels that are close to each other as part of a feature, we greatly reduce the number of parameters. Remember in our old example how we had around 5 neurons total (not counting the bias terms)? With them, we got to 20+ weights/parameters to learn. With images, even a pixelated 20x20 image would get us to 400+ parameters, since every neuron in the first layer needs one weight per pixel (and that's the absolute lowest possible number; there's a higher chance you would see thousands or tens of thousands of weights). By contrast, a convolutional layer deals with parameters on the order of tens, because the same small kernel is reused across the whole image. We will discuss the mechanics of this shortly.  
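
To make the comparison concrete, here is a small counting sketch. The layer sizes (10 dense neurons, four 3x3 kernels) are assumptions picked purely for illustration; they don't come from our earlier example.

```python
# Hypothetical sizes, chosen only for illustration.
height, width = 20, 20      # a tiny grayscale image
dense_neurons = 10          # neurons in the first dense layer (assumed)
kernel_size = 3             # 3x3 kernels (assumed)
num_kernels = 4             # number of kernels in the conv layer (assumed)

# Dense layer: every pixel connects to every neuron, plus one bias per neuron.
dense_params = height * width * dense_neurons + dense_neurons

# Conv layer: each kernel is reused at every position, plus one bias per kernel.
conv_params = kernel_size * kernel_size * num_kernels + num_kernels

print(dense_params)  # 4010
print(conv_params)   # 40
```

Notice that the convolutional count doesn't depend on the image size at all, while the dense count grows with every extra pixel.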

**BONUS**: what if the image were shifted to the right by 1 pixel? Would the network still be able to recognize the object? Absolute positioning (which is what we analyze with FCNs) is not an appropriate solution for image-related tasks.

## A bit of history

CNNs were inspired by a concept in biology called *receptive fields*. This is the idea that neurons are stimulated by a small area of the visual field, meaning that our eyes see the world neither as a whole nor as a bunch of pixels. We see small patches of the world. There are various cells that recognize simple shapes in the world (think vertical lines, horizontal lines and so on). These are stimulated and transmit their impulses over to more complex cells that combine the information they receive (think how a linear combination works). These complex cells then transmit their outputs over to other complex cells, and higher-level features are extracted in our brain.

This is the idea that CNNs are based on. Small patches of the image are used to compute the next layer (the most basic features, i.e. simple shapes, contribute to the formation of more complex features). The technological translation of this is called *convolution*. Convolutions are used to detect basic shapes, like lines and curves. The resulting responses are then summarized through *pooling*, which reduces dimensionality and turns the image into a more abstract representation that we can then use to detect more complex shapes. CNNs use convolutions and pooling to build complex features of the image step by step.  

The first idea of convolution is attributed to Kunihiko Fukushima in a [paper](https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf) from 1980. The network he proposed is considered the precursor of CNNs. Called the *neocognitron*, it was a network that was able to recognize handwritten digits and letters. The structure was inspired by the concepts discussed above. He called them S-cells (convolution) and C-cells (pooling). One important mention is that he did not use backpropagation, as it was not yet that popular. Instead, he "trained" the network in an unsupervised manner, by repeatedly presenting it with "stimuli" (the images) and letting the most active S-cells increase their value, based on some formulas. This borrows from the idea of Hebbian learning, where "neurons that fire together, wire together" (it means that when one neuron activates another, the connection between them gets stronger). In our case, the S-cells that were activated most strongly by the stimuli were reinforced. This way, the network had no idea what it was presented with, but it knew to associate them (think back to our unsupervised algorithms that also associated through similarity). In subsequent iterations (and published papers), Fukushima did move on to supervised learning (what he called "learning-with-a-teacher").  
![](https://www.researchgate.net/publication/336163445/figure/fig1/AS:809198191398912@1569939291177/The-architecture-of-the-neocognitron.png)  
Neocognitron architecture, via ResearchGate

The first introduction of CNNs as a type of architecture came, however, in the 1990s from Yann LeCun (you can find him on Twitter, he's pretty active), who is one of the most influential people in the field of deep learning. Called [LeNet](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf) (this '98 one with LeNet5 was the earliest I could find), this network was used to recognize handwritten digits by training it on the MNIST dataset. Although the building blocks LeCun and Fukushima used were essentially the same, LeNet was trained through backpropagation. Inspired by both the idea of *local receptive fields* and Hubel and Wiesel's experiments with cats' orientation-selective cells, he recognized the connection with mathematical convolutions (which function similarly to the vertebrates' visual system, in this case) and created the first named CNN. It had 7 layers (not counting the input), going in this order: C(onvolution), P(ooling/subsampling), C, P, F(ully-connected), F, O(utput). The input was a 32x32 image, and the output was a 10-dimensional vector (one value for each digit). The network had around 60,000 trainable parameters.  
![](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1lvvWF48t7cyRWqct13eU0w.jpeg)  
LeNet architecture, via DataScienceCentral

CNNs' potential was unfortunately limited by computational power availability. In the 2010s, GPUs became more powerful and more affordable (and implementations made use of them). A team from the University of Toronto created AlexNet (which is famous) to win the ImageNet competition by leveraging GPUs. They managed to classify images into 1000 categories, using around 60 million parameters. The network had 8 layers with learnable parameters, 5 convolutional and 3 fully connected. Before the fully-connected part (which is mixed with dropout layers), we also note the 3 pooling layers, placed after C1, C2, and C5 (so not after each C layer).
![](https://www.researchgate.net/publication/312515254/figure/fig2/AS:489373281067011@1493687090882/Framework-of-AlexNet-This-figure-is-from-the-original-paper-8.png)  
AlexNet architecture, via ResearchGate  

Nowadays, research is much more advanced, and we've discovered many more techniques to handle image or video data. Next, let's get into the details of how CNNs work.

# The mechanics of CNNs  

In the previous section, we talked about the history of CNNs. Inevitably, we touched upon some aspects that differentiate them from other techniques. Now it's time to explain each of those aspects in detail.  
**The inner workings**: CNNs work by *searching for patterns* across an image, *without* keeping track of the absolute positioning. We look for locality and describe features in terms of their relative arrangement. The tools that enable us to do this are *convolutions*.  

### Convolutions

As we've seen in our history lesson, convolutions came to be as a result of the combined work of biology, neuroscience and mathematics. In mathematical terms, a convolution is the process by which we *convolve* (read *combine*) two functions to produce a third one. The exact mechanism has to do with integrals, but we won't go into that. For a cool explanation, check out [3B1B](https://www.youtube.com/watch?v=KuXjwB4LzSA) (he is one of the authorities on math explained right). We can apply convolutions to either continuous or discrete functions. In our case, the functions are discrete, since we work with image pixels. We apply a convolution by defining a *kernel* (also called *filter* or *mask*). This is a small matrix that we slide over the image: at each position, we multiply the values of the image with the corresponding values of the kernel, then add them together. The result is a single value, which we place in the corresponding position in the output image. The kernel is then shifted by one pixel (or more, depending on the **stride**, which we'll discuss in a bit), and the process is repeated. Our output is a matrix that represents the convolution of the image with the kernel. We call this the *feature map*. We might use multiple kernels to produce multiple feature maps, and the set of all created feature maps is the output of the convolutional layer. Let's look at a visual example:  

![](../assets/convEx1.png)  

Here we have an image of the digit "0". Let's suppose it looks great for a moment. Our image is grayscale, and the values of the pixels are shown in the image. If we were to apply a kernel of size 3x3 (we generally use odd-sized kernels), meaning we would convolve the image with a 3x3 matrix, we could get something like this:  

![](../assets/convEx2.png)  

We chose a kernel that checks for oblique lines. We slide the kernel over the image, multiply the corresponding numbers and add them together; the result is the value we output to the feature map (pixel by pixel).  
**Note**: Our convolution has decreased the size of the image. This is effectively a reduction operation. In a long stream of convolutional layers, any image would have its dimensions reduced by a lot. To avoid this incremental loss of information, we can use *padding*, which means adding a border of 0's around the image. In practice, whether to pad depends on the task at hand. Keep in mind that deep CNN models benefit the most from this, since they have many convolutional layers.  
We slide the kernel over the image by a number of cells/pixels called the **stride**. The stride, along with the size of the kernel and the padding, determines the dimensions of the output feature map. Having a bigger stride means we do fewer calculations but lose some information along the way. Let's look at the next steps:  

![](../assets/convEx3.png)

![](../assets/convEx4.png)  

And skipping the rest of the steps, we get this:  

![](../assets/convEx5.png)  

That is our feature map that encodes the existence of oblique lines tilted to the right in our image. Of course, we could use any kind of shape in the kernel.
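
The sliding procedure can be sketched in a few lines of NumPy. This is a minimal, unoptimized version written for readability; the function name `conv2d`, the 5x5 image and the diagonal kernel are made up for this sketch and are not the ones from the figures. For an input of size n, a kernel of size k, padding p and stride s, the output size along each dimension is floor((n + 2p - k) / s) + 1.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (both 2D arrays) and return the feature map.

    Like most deep learning code, this does not flip the kernel
    (strictly speaking, it is a cross-correlation).
    """
    if padding > 0:
        # Add a border of zeros around the image.
        image = np.pad(image, padding, mode="constant", constant_values=0)
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Multiply element-wise, then add everything together.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A made-up 5x5 binary image with a diagonal line, and a diagonal kernel.
image = np.eye(5)
kernel = np.eye(3)

print(conv2d(image, kernel).shape)                        # (3, 3)
print(conv2d(image, kernel, padding=1).shape)             # (5, 5)
print(conv2d(image, kernel, stride=2, padding=1).shape)   # (3, 3)
print(conv2d(image, kernel)[0, 0])                        # 3.0
```

With a padding of 1 and a stride of 1, the 3x3 kernel leaves the 5x5 size untouched, which is exactly why padding is popular in deep networks.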

The feature map is an *encoded* representation of the image. It allows us to represent low-level features (as described by the corresponding kernel) in a compact way. By applying several different kernels to the same input, we can extract different features. The feature maps, stacked together, form the output of the convolutional layer. As you've seen above, the Neocognitron, LeNet and AlexNet all used multiple kernels to extract different features. The way they portray them is different, but the concept remains the same. Let's say we would like to use 2 kernels. We imagine our result as 2 stacked windows like this:

![](../assets/convEx6.png)  

An alternative way to think about it is a 3D matrix, or a cube. It's easier to think of it as blocks like in the pictures of LeNet and AlexNet. We call them **tensors**.  
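
As a rough sketch of that stacking idea, the snippet below applies two made-up 3x3 kernels to the same made-up 6x6 image and collects the results into one 3D array. It assumes a NumPy version that provides `sliding_window_view` (1.20 or newer); the image and kernel values are illustrative only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.eye(6)                       # made-up 6x6 image with one diagonal line
kernels = np.stack([
    np.eye(3),                          # responds to top-left to bottom-right diagonals
    np.fliplr(np.eye(3)),               # responds to top-right to bottom-left diagonals
])                                      # shape: (2, 3, 3)

# All 3x3 patches of the image: shape (4, 4, 3, 3).
patches = sliding_window_view(image, (3, 3))

# Multiply every patch with every kernel and sum: one feature map per kernel.
feature_maps = np.einsum("ijkl,nkl->nij", patches, kernels)

print(feature_maps.shape)      # (2, 4, 4)
print(feature_maps[0, 0, 0])   # 3.0
print(feature_maps[1, 0, 0])   # 1.0
```

The first axis of `feature_maps` indexes the kernel, so the whole layer output is a single block of numbers rather than a loose collection of matrices.
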
Let's go back to the original 3x3 result. What do we do with feature maps? We generally have 2 options: pool them, or flatten them and feed them to a fully connected layer (as you've seen in the LeNet example).

### Pooling layer 

These are the C-cells of Fukushima. They process the information given by feature maps. As a kernel might differ, so might a pooling window's function. We might use:
- Max pooling: we take the maximum value in the window and that's what passes onward
- Average pooling: we take the average value in the window 
- Sum pooling: we take the sum of the values in the window
- Min pooling: we take the minimum value in the window
- L2 pooling: we take the L2 norm of the values in the window  

In practice, max pooling is the most common. It essentially tells us where the strongest features are. Average pooling might blur the results, so it could be useful in cases where we expect noise to be present. The rest are rarely used.  
We've mentioned the name *pooling window*. Let's clarify: just like we slide the kernel along the image, we also slide the pooling window along the feature map.  
As for the size, we usually use 2x2 or 3x3. The size should be at most that of the corresponding kernel. Since we slide, we also *stride*. The stride of the pooling layer can benefit from being larger than the stride of the convolution. If we use a 2x2 pooling window with a stride of 2, we never overlap information in the cells. We also reduce our size by half, so use your best judgement. Experimenting is always important.  

As a general rule, pooling is used to reduce the size of the feature maps. We convert low-level information into higher-order, more complex and more abstract information. It's no longer about viewing images, but about where certain features are present. The combined and repeated use of convolutions and pooling is what makes CNNs recognize complex patterns in images.  
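
Here is a minimal sketch of a pooling window in NumPy. The function name `pool2d` and the 4x4 feature map are made up for illustration, and L2 pooling is left out to keep it short.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D feature map and aggregate each window."""
    reducers = {"max": np.max, "average": np.mean, "sum": np.sum, "min": np.min}
    reduce = reducers[mode]
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            pooled[i, j] = reduce(window)
    return pooled

# Made-up 4x4 feature map.
feature_map = np.array([
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
])

print(pool2d(feature_map, mode="max"))
# [[6. 8.]
#  [3. 4.]]
print(pool2d(feature_map, mode="average"))
# [[2.75 4.75]
#  [1.75 1.75]]
print(pool2d(feature_map, size=2, stride=1).shape)  # (3, 3)
```

With a 2x2 window and a stride of 2 the windows never overlap and each dimension is halved; with a stride of 1 they overlap and the map only shrinks by one cell per dimension.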

Let's see an example of what pooling would result in; we use a 2x2 window with a stride of 1 (since our sizes are already small numbers): 

![](../assets/convEx7.png)

There we go! Our 0 is encoded as (3,2,2,3). Flatten that, and a fully-connected layer should be able to map it to the correct class. And we only used 9 weights!  
Weights? Where?  
Right, we haven't talked about weights. But you might've guessed it already.

### Weights

In CNNs, **the values of the kernels are the weights!** It's easy to see it, once you think about it. In traditional, dense NNs, the weights are the numbers that allow us to find patterns hidden in the data. In CNNs, kernels do that. And they do that by convolving the input with their values. Ergo, the values of the kernels are the weights.  

We have two options for them: define the kernels ourselves, or let the model learn them through gradient descent and backpropagation (they are a package).  
Consider trying to define your kernels manually. Sure, for the first layer a few edges might work well, but what about the next layers? Edges are not the most complex features out there, and the network must surely extract something more complex if we want to classify cats on the internet!  

So a kernel's values are trainable. In a pooling layer, we use a predetermined function to aggregate the results. They are not trainable. That is why, when talking about LeNet and AlexNet, we specified layers of "learnable" parameters. In CNNs, not everything is trainable.

### Activation functions

We know about convolutions, we know about pooling, we know who our weights are. In the original NNs, we used activation functions to introduce non-linearity. We can do the same in CNNs. The same activation functions can be used. The question is where to place them, and the answer is: after the convolution and before the pooling.  
This is especially true since we might often see CNNs that have multiple convolutions before introducing a pooling layer. It depends on many factors, but those networks work as well. Remember, a pooling layer has the sole purpose of aggregating some extracted information. If we only want to detect higher-order features, we can skip the pooling layer for a while.  
One other key factor is that nowadays, deep networks are pretty much the only thing we see used. To get good results, we cannot reduce our dimensions too much, and especially not too early. To avoid that, we introduce the *padding* we mentioned earlier, and let convolutional layers extract information, while not decreasing the dimensions. Pooling layers are the ones actually used to reduce dimensions, primarily.  
Then, with the dense layers at the end, we've essentially reduced our images to a series of numbers that are combined in a traditional, fully-connected network to produce an output. I would call that the ultimate abstraction.
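
The ordering described above (convolution, then activation, then pooling, then flattening for the dense part) can be traced on a toy input. Everything in this sketch is an assumption made for illustration: the 6x6 image, the kernel values and the choice of ReLU with 2x2 max pooling.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(x):
    return np.maximum(x, 0)

image = np.eye(6)                          # made-up 6x6 input
kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]])          # made-up 3x3 kernel

# CONV: 6x6 input, 3x3 kernel, stride 1, no padding -> 4x4 feature map.
patches = sliding_window_view(image, (3, 3))
conv_out = np.einsum("ijkl,kl->ij", patches, kernel)

# ACTIVATION: negative responses are set to zero.
activated = relu(conv_out)

# POOL: 2x2 max pooling with stride 2 -> 2x2.
pooled = activated.reshape(2, 2, 2, 2).max(axis=(1, 3))

# Flatten so a dense layer could take over from here.
flat = pooled.flatten()

print(conv_out.shape, pooled.shape, flat.shape)  # (4, 4) (2, 2) (4,)
print(flat)                                      # [3. 0. 0. 3.]
```

The 36 input pixels end up as 4 numbers, and only the 9 kernel values would need to be learned along the way.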

### Regularization

We talked about regularization way back in our regression chapters. Neural networks also have a few ways to avoid overfitting. One of the most common is the *dropout layer*.  

As the name suggests, we randomly drop some neurons from the network. For each layer in the usual architecture, we can apply dropout. If a given neuron is dropped, its output is set to zero. The implementation is quite straightforward:  
For each layer, we choose whether we want dropped neurons. We choose the dropout rate (usually 0.5 for hidden layers; the input layer would be given a dropout rate of 0 or close to 0), sample a Bernoulli random variable for each neuron that equals 1 with probability 1 - dropout rate, and multiply the layer's outputs element-wise with this mask before passing the values to the next layer. Do this for any layer you want dropout in. Dropout is only applied during training; at prediction time, all neurons are kept.  
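
A minimal sketch of that recipe in NumPy follows. One detail goes beyond the description above and should be read as an assumption of this sketch: the surviving values are divided by the keep probability (often called *inverted dropout*), so that the expected value of each activation stays the same and nothing needs rescaling at prediction time. The function name and the toy layer output are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded only to make the sketch reproducible

def dropout(activations, rate, training=True):
    """Zero each value with probability `rate` (inverted-dropout variant)."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    # Bernoulli mask: 1 with probability keep_prob, 0 otherwise.
    mask = rng.binomial(n=1, p=keep_prob, size=activations.shape)
    # Dividing by keep_prob keeps the expected value of each activation unchanged.
    return activations * mask / keep_prob

layer_output = np.ones((2, 5))
dropped = dropout(layer_output, rate=0.5)
print(dropped)   # each entry is either 0.0 or 2.0
```

Calling `dropout(layer_output, rate=0.5, training=False)` returns the input untouched, which is the behavior we want when making predictions.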

The effect is that the model is now forced not to cater to noise in the data. It has to learn the patterns that stand out from general noise, which leads to better generalization.

### Normalization

We've mentioned normalization before, so we know what it means. In the context of CNNs, we usually want to normalize our data after applying the activation functions. As we've seen in the examples above, we can get higher values (2,3 in our case), depending on our pixel values, our kernel size and our activation function. This means that those numbers can eventually spiral out of control. By normalizing, we ensure that we return to a reasonable range of values, which generally helps the model converge faster.  

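We use [batch normalization](https://arxiv.org/abs/1502.03167) to normalize our data. The idea is simple: for each mini-batch, we subtract the batch mean and divide by the batch standard deviation (separately for each feature or channel), then apply a learnable scale and shift. Note that the original paper places the normalization before the activation function; both placements are used in practice.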
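
As a rough sketch, the training-time computation for a batch of flat feature vectors could look like this. The function name, the `eps` value and the toy batch are assumptions for illustration; the running averages that frameworks keep for prediction time are left out, and for convolutional feature maps the statistics would be computed per channel instead.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift.

    x has shape (batch_size, num_features); gamma and beta are the learnable
    scale and shift, one value per feature.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * x_hat + beta

# Made-up batch of 4 samples with 3 features on very different scales.
batch = np.array([[1.0, 200.0, 0.1],
                  [2.0, 400.0, 0.2],
                  [3.0, 600.0, 0.3],
                  [4.0, 800.0, 0.4]])
out = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))

print(out.mean(axis=0).round(6))  # approximately 0 for every feature
print(out.std(axis=0).round(3))   # approximately 1 for every feature
```

With `gamma` at 1 and `beta` at 0 this is plain standardization; during training the network is free to learn other values if a different range works better.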

### Small recap

Since we've discussed so many aspects of layers that are used in CNNs, let's make sure some things are clear.  
I. So far, we've discussed the fully-connected neural network and the convolutional neural network. We know that there are tons of other architectures, which are arbitrarily chosen, named and used. Going forward, keep in mind that a certain architecture is characterized by the techniques used to approach a task, rather than by a fixed order, number, and type of layers.  
II. The convolutional layer is the defining feature of CNNs. It allows us to extract local features from the input, giving us robustness against distortions and translations.  
III. The pooling layer is used to reduce the size of the feature maps. It's used to aggregate information, reducing dimensions in the process.  
IV. The weights of a CNN are the values of the kernels. They are trainable.  
V. The activation function is used to introduce non-linearity. It's usually placed after the convolution and before the pooling.  
VI. Dropout is used to avoid overfitting. It's used to randomly drop neurons from the network. It can be used on any kind of architecture or layer.  
VII. Normalization is used to ensure that the values of the neurons don't spiral out of control. It's usually used after the activation function.  
VIII. The fully-connected layer is used to combine the information extracted by the convolutional layers. It's used to produce the output. There are networks that don't use dense layers at the end. These are called Fully Convolutional Networks (also abbreviated FCN, not to be confused with the fully-connected networks we abbreviated the same way earlier). Loosely speaking, their convolutional layers can be thought of as an encoder, since they encode the input into a more compact set of numbers.  

With a recap of the recap, here is an architecture a CNN might use (not considering dimensions, just order of layers):  

INPUT -> CONV -> ACTIVATION -> POOL -> CONV -> ACTIVATION -> POOL -> CONV -> ACTIVATION -> POOL -> DENSE -> ACTIVATION -> DENSE -> ACTIVATION -> DENSE -> OUTPUT  

or  

INPUT -> CONV -> ACTIVATION -> CONV -> ACTIVATION -> CONV -> ACTIVATION -> POOL -> DENSE -> ACTIVATION -> DENSE -> ACTIVATION -> DENSE -> OUTPUT  

You get it by now. In practice, there are infinite ways to combine these building blocks. That's where experimenting comes into play. For example, certain architectures make use of skip connections (certain values are used as input for layers that are not directly adjacent; e.g. sending half of the neurons' values from layer 2 to layer 9 directly; call it time-travel), or send values multiple ways through the network (sending layer 2 neurons' values both to layer 3 and 9).
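
To give a feel for a skip connection, here is a deliberately tiny sketch with two dense layers. It uses the simplest additive variant (the skipped values are added back in further down the network), which is an assumption of this sketch and only one of several ways such connections are built; the weights and sizes are random and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def relu(x):
    return np.maximum(x, 0)

# Hypothetical weights for two small dense layers that keep the size at 4.
w1 = rng.normal(size=(4, 4))
w2 = rng.normal(size=(4, 4))

x = rng.normal(size=4)

# Plain path: x -> layer 1 -> layer 2.
hidden = relu(x @ w1)
plain_out = relu(hidden @ w2)

# Skip connection: the input x "jumps over" both layers and is added back in
# before the final activation.
skip_out = relu(hidden @ w2 + x)

print(plain_out.shape, skip_out.shape)  # (4,) (4,)
```

The only requirement for the additive version is that the skipped values and the values they are added to have the same shape.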

### A few more things

As we've seen, kernels are the weights of a CNN, meaning we want to train them. But we also know we can choose them by hand, and we saw that some results are actually interesting. Before CNNs and training kernels via backpropagation, such hand-picked kernels were called *filters* and were used to extract interesting features from images. The simplest example is edge detection, given certain orientations for lines. Let's look at some code:

```python
# convolution filter types:
# 1. edge detection
# 2. sharpening
# 3. blurring
# 4. embossing

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
```
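
As a sketch of what the four filter types listed above can look like, here are commonly used textbook kernel values applied with plain NumPy. The specific values, the helper `apply_filter` and the test image are assumptions made for this example; other variants of each filter exist.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Commonly used 3x3 kernels (illustrative values, many variants exist).
filters = {
    "edge detection": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]),
    "sharpening":     np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]),
    "blurring":       np.ones((3, 3)) / 9.0,
    "embossing":      np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]]),
}

def apply_filter(image, kernel):
    """Slide a kernel over a 2D grayscale image (stride 1, no padding)."""
    patches = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", patches, kernel)

# Made-up test image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = apply_filter(image, filters["edge detection"])
blurred = apply_filter(image, filters["blurring"])

print(edges[0])     # nonzero only in the two columns around the edge
print(blurred[0])   # the hard 0 -> 1 jump becomes a gradual ramp
```

The edge kernel sums to zero, so flat regions give no response, while the blur kernel simply averages each 3x3 neighborhood.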

One important question might come to mind while reading this chapter: what about color images? What do we do when we have 3 channels of values for each pixel?  
The answer is simple: each kernel gets a third dimension, with one 2D slice per channel (so a 3x3 kernel becomes 3x3x3 for an RGB image). Each slice is convolved with its own channel, and the three results are summed (together with a bias term) into a single feature map. Other ways of combining the per-channel results, such as concatenation, are possible, but summing is what a standard convolutional layer does.
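
A minimal sketch of the summing approach on a 3-channel input is shown below. The random image, the kernel values and the bias are made up; the point is only that one kernel spanning all channels yields a single 2D feature map.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(seed=0)

image = rng.random((8, 8, 3))        # made-up RGB image: height, width, channels
kernel = rng.normal(size=(3, 3, 3))  # one kernel: 3x3 spatially, 3 channels deep
bias = 0.1                           # made-up bias term

# All 3x3x3 patches of the image: shape (6, 6, 1, 3, 3, 3) -> (6, 6, 3, 3, 3).
patches = sliding_window_view(image, (3, 3, 3)).squeeze(axis=2)

# Multiply patch and kernel element-wise and sum over height, width AND channels.
feature_map = np.einsum("ijklm,klm->ij", patches, kernel) + bias

print(feature_map.shape)  # (6, 6): a single 2D feature map from a 3-channel input
```

Using several such kernels gives several feature maps, which then act as the input channels of the next convolutional layer.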

Due to vectorization / parallelization, we don't actually have to worry about dimensions, since matrix multiplications help us, as long as we get our formulas in check. Of course, frameworks do the heavy lifting for us, but we're here to learn the inner workings.