**CNN INTRODUCTION**

These are learning notes based on the material in *CNN Explainer*.

**What is a Convolutional Neural Network?**

In machine learning, a *classifier* assigns a class label to a data point or sample. For example, in computer vision, an image classifier produces a predefined class label (dog, cat, bird, etc.) for the objects it detects in an image.

A Convolutional Neural Network, or CNN for short, is a type of classifier that excels at this kind of problem.

CNNs are built on top of neural networks, which are algorithms used to detect patterns in data. In the most basic terms, a neural network is a collection of neurons organized in layers, each with its own learnable weights and biases.

Neural networks, as the name suggests, are loosely modeled on the human brain. With that said, let's break down the building blocks of a Convolutional Neural Network:

1. **Tensor**: This can be thought of as an n-dimensional matrix. In the CNN from *CNN Explainer*, tensors are 3-dimensional, with the sole exception of the output layer.

2. **Neuron**: This can be thought of as a function that takes multiple inputs and yields a single output.

3. **Layer**: A layer is a group of neurons that all perform the same operation and share the same hyperparameters. Each neuron belongs to exactly one layer.

4. **Kernel Weights**: These are also called filters or feature detectors. Each kernel is a small matrix used to apply the convolution operation. As the kernel moves along the input image, its values are multiplied with the pixel values it covers and the products are summed up. The kernel's values stay fixed while it slides; they only change during training. This allows efficient detection of patterns in the input data.

5. **Biases**: A bias value shifts the input of the activation function to the left or right, which is critical for learning patterns properly. Each kernel has its own bias, which is added to the sum of the weighted inputs before the activation function is applied.
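To make the neuron, weight, and bias definitions above concrete, here is a minimal sketch of a single neuron in plain Python. The function name `neuron` and all the numbers are made up for illustration; they do not come from *CNN Explainer*.

```python
def neuron(inputs, weights, bias):
    """Many inputs in, one value out: a weighted sum plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical values chosen only for illustration.
output = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=0.5)
print(output)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```

An activation function would normally be applied to this value afterwards.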

If you have studied neural networks before, none of this is exactly new. So what's the difference between a neural network and a convolutional neural network? CNNs use a special type of layer, the convolutional layer, that makes them well suited to learning from image and image-like data.

But that's not all. CNNs are also great for image processing, classification, segmentation, and object detection. 

**What Does Each Layer of the Network do?**

Let's go through each layer in the network and talk about them. 

1. **Input Layer**

The input layer represents the input image to the CNN. Since we're dealing with an RGB image, the input layer has three channels, one for each color.

2. **Convolutional Layers**

Convolutional layers are the bread and butter of CNNs. They contain the learned kernel weights, which are the ones that extract the features that distinguish different images from one another. This is exactly what we need for classification.

Let's break it down into parts: 

1. Kernel/Filter: Each kernel is a small matrix of weights (3x3, 5x5, etc.) that is randomly initialized. Kernel weights function similarly to regular weights but only exist in convolutional layers. The kernel slides over the input data, pixel by pixel. 'Kernel' refers to the entire matrix, while 'kernel weights' refers to the individual values in the matrix.

2. Dot Product: For each position of the kernel over the input, a dot product is computed between the kernel and the area of the input that it covers. So a 3x3 kernel is multiplied elementwise with a 3x3 area of the image, and the products are summed.

3. Feature Map: The results of the dot products (one per input channel) are summed together with the bias term. This creates a single value in the feature map. The process is repeated until the entire input is covered, producing a complete feature map.

4. Activation Function: An activation function, such as ReLU, is then applied to the feature map to introduce non-linearity.

![Display](images/convlayer_overview_demo.gif "Application of Convolution")
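The four parts above can be sketched for a single input channel in plain Python. This is a simplified illustration, assuming stride 1 and no padding; the function names, the 4x4 image, and the kernel values are all made up (a real kernel would be learned during training).

```python
def convolve2d(image, kernel, bias=0.0):
    """Slide a square kernel over the image with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch it currently covers.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total + bias)  # the bias is added to every output value
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Activation function: negative values become 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical 4x4 single-channel image and 3x3 kernel.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, kernel, bias=1.0)
activation_map = relu(feature_map)
print(feature_map)     # [[-1.0, -1.0], [3.0, -1.0]]
print(activation_map)  # [[0.0, 0.0], [3.0, 0.0]]
```

Note how the 4x4 input shrinks to a 2x2 feature map because the 3x3 kernel only fits in four positions.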

It is important to know that every neuron in the previous layer is connected to every neuron in the next layer (here a neuron means a whole channel or feature map). In the architecture shown on the website, the previous layer contains only 3 neurons, while the convolutional layer right next to it has 10 neurons. Each connection has its own kernel, meaning that each input channel gets its own 3x3 kernel matrix with different values.

The size of the kernel matrix is a hyperparameter that the designer of the network can adjust. To create a feature map, we first compute an elementwise dot product (see the description above) between the output of the previous layer and the unique kernel learned by the network.

In our model's case, the dot product operation uses a stride of 1, meaning that the kernel moves over by 1 pixel per dot product. Simply put, we calculate the dot product pixel by pixel. The stride is also a hyperparameter that can be adjusted.

![Display](images/convlayer_kernel_matrix.gif "Convolution Kernel Weights")

It’s important to note that in a convolutional layer, each output value is not computed from every value in the previous layer, but only from a local region defined by the size of the kernel. This is different from a fully connected layer, where each neuron is connected to all neurons in the previous layer.

**Going over it again:**

* In the example, we have three channels corresponding to RGB. Each filter applies its own unique kernel to each channel.

* Each kernel is combined with its channel through an elementwise dot product. The results from the different channels are then added together (elementwise sum) along with a bias to create a single output value / pixel. This repeats until the entire input has been operated on. The result is a feature map, the complete output of one filter in a convolutional layer.

* Once a feature map goes through an activation function, it becomes an activation map.
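The recap above can be sketched for one filter acting on three channels. This is only an illustration, assuming square inputs, stride 1, and no padding; the helper names, the tiny 3x3 channels, and the 2x2 kernels are invented to keep the arithmetic easy to follow.

```python
def convolve2d(channel, kernel):
    """Single-channel convolution, stride 1, no padding, square inputs."""
    k = len(kernel)
    size = len(channel) - k + 1
    return [
        [
            sum(channel[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]


def conv_filter(channels, kernels, bias):
    """One filter: one kernel per input channel, results summed, bias added."""
    per_channel = [convolve2d(c, k) for c, k in zip(channels, kernels)]
    size = len(per_channel[0])
    return [
        [sum(m[i][j] for m in per_channel) + bias for j in range(size)]
        for i in range(size)
    ]


# Hypothetical 3x3 RGB channels and one 2x2 kernel per channel.
red = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
green = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
blue = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
kernels = [
    [[1, 0], [0, 1]],  # kernel applied to the red channel
    [[1, 0], [0, 0]],  # kernel applied to the green channel
    [[0, 1], [1, 0]],  # kernel applied to the blue channel
]

feature_map = conv_filter([red, green, blue], kernels, bias=1)
print(feature_map)  # [[7, 3], [3, 7]]
```

A layer with 10 filters would repeat `conv_filter` 10 times, each time with its own three kernels and its own bias.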

**MORE INFO ON KERNEL WEIGHTS**: Kernel weights exist and are adjusted in convolutional layers. All kernel weights are weights, but not all weights are kernel weights. Kernel weights behave just like regular weights: both are randomly initialized and then optimized during training. The difference is that the same kernel weights are reused at every position of the input as the kernel slides.

**ACTIVATION MAP vs FEATURE MAP**: A feature map is the direct output of a convolution operation. An activation map is a feature map that has gone through an activation layer such as ReLU. So: feature map -> activation layer -> activation map.

Tying it all together, we have 30 unique kernels of size 3x3 in the first convolutional layer: 3 (channels) x 10 (filters) = 30. How the previous layer is connected to the convolutional layer determines the number of kernels in that convolutional layer.
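The kernel count above is simple arithmetic, sketched here with the numbers from these notes. The variable names are just for illustration, and the total weight count is derived from the 3x3 kernel size rather than quoted from the source.

```python
# Numbers taken from the notes: RGB input, 10 filters, 3x3 kernels.
input_channels = 3
filters = 10
kernel_height, kernel_width = 3, 3

# One kernel for every (input channel, filter) pair.
kernels = input_channels * filters
# Every kernel holds kernel_height * kernel_width individual weights.
kernel_weights = kernels * kernel_height * kernel_width

print(kernels)         # 30
print(kernel_weights)  # 270
```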

**CONVOLUTION REDUCES SIZE**:
The size of the kernel affects the size of the output. A 3x3 kernel merges a 3x3 area of pixels into one output pixel, and it cannot be centered on the border pixels, so without padding the output is smaller than the input. Luckily, there is a way to preserve the size of an image.

**Understanding Convolutional Hyperparameters**:

Adding convolutions also adds new hyperparameters to tinker with. We will introduce 3 of them: **Padding - Kernel Size - Stride**.

Let's get started.

1. **Padding** - Since a convolutional layer merges pixels, the kernel cannot be centered on the edges of an input because there is not enough data around them. As a result, an image's shape gradually decreases from layer to layer. Padding preserves the input's spatial size and often improves performance.
  
Padding, in its simplest terms, is just adding extra rows and columns around the border of the input so that the edges can be covered. There are many different padding techniques, but one of the most common is *zero-padding* since it is simple and efficient. As the name suggests, zero-padding fills the added rows and columns with zeros.
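Zero-padding as described above can be sketched in a few lines. The function name `zero_pad` and the 2x2 example image are hypothetical; this is an illustration, not how any particular library implements it.

```python
def zero_pad(image, pad=1):
    """Surround a 2D image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]               # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)     # left and right borders
    padded.extend([0] * width for _ in range(pad))           # bottom border
    return padded


image = [
    [1, 2],
    [3, 4],
]
for row in zero_pad(image):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```

With one ring of zeros, a 3x3 kernel can be centered on every original pixel, so the output keeps the input's height and width.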

2. **Kernel Size** - This is also referred to as the *filter size*. It refers to how big the kernel is and therefore how many kernel weights it contains. This is a very important hyperparameter because of how much it affects the image classification task.
  
For example, a smaller kernel size extracts more information containing highly local features from the input. It also leads to a smaller reduction in layer dimensions, which allows for a deeper architecture.

In contrast, a larger kernel size extracts less information and reduces the layer dimensions faster, which often leads to worse performance. Larger kernels are, however, better suited to extracting features that are larger. The faster reduction also leaves less data for later layers to process, although each larger kernel has more weights of its own.

Overall, smaller kernel sizes generally lead to better performance since designers can stack more layers together and, with them, learn more complex features.

3. **Stride** - Stride is the number of pixels the kernel shifts at a time. If it is 1, the kernel moves to the very next pixel. If it is 2, it skips a pixel, and so on. The impact of stride is similar to that of kernel size: as the stride decreases, more features are learned and the output shape is larger. As the stride increases, fewer features are learned and the output shape is smaller. It is important that the kernel slides across the input symmetrically when implementing a CNN.
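How the three hyperparameters interact can be summarized with the commonly used output-size formula, `(input + 2 * padding - kernel) // stride + 1`. The sketch below applies it to one spatial dimension; the function name and the 28-pixel input are assumptions picked for illustration.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical 28-pixel-wide input with a 3x3 kernel.
print(conv_output_size(28, 3))             # 26: no padding shrinks the output
print(conv_output_size(28, 3, padding=1))  # 28: one ring of padding preserves the size
print(conv_output_size(28, 3, stride=2))   # 13: a larger stride shrinks it faster
print(conv_output_size(28, 5))             # 24: a larger kernel shrinks it more
```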

**Activation Functions**

1. **ReLU** - Neural networks are widely adopted in modern technology because they are highly accurate. The best performing CNNs today consist of an absurdly large number of layers, which allows them to learn more and more features. Part of the reason these groundbreaking CNNs achieve such accurate results is their non-linearity.

ReLU applies the required non-linearity to the model. Non-linearity is needed to produce non-linear decision boundaries. Without non-linear activation functions, a deep CNN architecture would collapse into a single equivalent convolutional layer, which would not perform nearly as well.
  
ReLU is used instead of other non-linear functions such as Sigmoid because it has been empirically observed that CNNs using ReLU train faster than their counterparts.
 
ReLU is applied after each convolutional layer.
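The point about stacked layers collapsing without a non-linearity can be seen with a toy example. The sketch below uses scalar linear functions as stand-ins for layers, which is a simplification; the coefficients are arbitrary and the names are hypothetical.

```python
def relu(x):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return max(0.0, x)


def linear(a, b):
    """A toy one-input 'layer': x -> a * x + b."""
    return lambda x: a * x + b


f = linear(2.0, 1.0)    # first layer
g = linear(-3.0, 4.0)   # second layer

# Without an activation, g(f(x)) = -3 * (2x + 1) + 4 = -6x + 1,
# which is just one linear layer.
collapsed = linear(-6.0, 1.0)

xs = [-2.0, -0.5, 0.0, 1.5]
stacked_linear = [g(f(x)) for x in xs]
single_linear = [collapsed(x) for x in xs]
with_relu = [g(relu(f(x))) for x in xs]

print(stacked_linear == single_linear)  # True: the two layers collapsed into one
print(with_relu)                        # [4.0, 4.0, 1.0, -8.0]
```

With ReLU in between, the first output changes from 13.0 to 4.0, so the stack can no longer be replaced by a single linear function.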

2. **Softmax** - Softmax is applied to get the class probabilities used for the prediction. After flattening, we compute a weighted sum for each class (elements x weights, plus a bias) to get the logits. We then apply softmax to these logits to turn them into class probabilities that sum to 1. Afterwards, we use argmax to pick the class with the highest probability, which gives the predicted label.

![Display](images/softmax_animation.gif "Visualization of Softmax")
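The logits -> softmax -> argmax pipeline can be sketched as follows. The flattened vector, weights, biases, and labels are invented for illustration, and subtracting the maximum logit before exponentiating is a common numerical-stability trick rather than something stated in the notes.

```python
import math


def dense(inputs, weights, biases):
    """One logit per class: weighted sum of the flattened inputs plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened features and per-class weights.
flattened = [0.5, 1.0, 2.0]
weights = [
    [1.0, 0.0, 1.0],   # weights for "dog"
    [0.0, 1.0, 0.0],   # weights for "cat"
    [-1.0, 0.0, 0.5],  # weights for "bird"
]
biases = [0.0, 0.5, 0.0]
labels = ["dog", "cat", "bird"]

logits = dense(flattened, weights, biases)       # [2.5, 1.5, 0.5]
probabilities = softmax(logits)
predicted = labels[probabilities.index(max(probabilities))]  # argmax
print(predicted)  # dog
```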

**Pooling Layers**

There are many different pooling layers in CNN architectures, but they all serve the same purpose: gradually decreasing the spatial size of the network. This reduces the number of parameters and the overall computation the network requires, making it a lot cheaper to work with.

Let's take a look at two different pooling layers:

1. **Max Pooling**: This is done by taking an area from the input / group of pixels with a defined region (filter size) and grabbing the maximum value. For example, if you have a 2x2 filter moving through the image then max pooling will take the highest value pixel from the 2x2 group and output that.

2. **Average Pooling**: Average pooling calculates the average of the pixel values within the filter's region. For example, a 2x2 filter area will compute the average of the four pixel values and then output the result.
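Both pooling variants can be sketched with one helper that takes the summarizing function as an argument. The names `pool2d` and `mean` and the 4x4 feature map are hypothetical, chosen only to illustrate the 2x2 examples above.

```python
def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Summarize each size x size window with `reduce` (max or mean)."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(feature_map, reduce=max)
avg_pooled = pool2d(feature_map, reduce=mean)
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
```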

So, what are the benefits of pooling layers again? Aside from saving computation, pooling reduces the number of parameters that later layers need to learn. This helps prevent overfitting.

In addition, pooling makes the network more robust, since it emphasizes the presence of features rather than their exact locations. Because we are summarizing the presence of features, pooling layers can help recognize objects even if they are translated / moving around in a frame.

Summing it all up, pooling enhances a network's ability to learn and generalize from visual data. It also makes the network more efficient, so it doesn't incur as much computational cost.

In TinyVGG, the architecture in the examples provided, we are using max pooling. The pooling layers use a 2x2 kernel with a stride of 2. With these parameters we discard 75% of the activations: an input of size 26x26 goes down to 13x13. Because we discard so many values, we help prevent overfitting. Think of it as the architecture shifting from memorizing every last detail to becoming familiar with what the object looks like.
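The 75% figure follows directly from the sizes quoted above, as this small check shows. The helper name `pooled_size` is made up; the 26, 2, and 2 come from the notes.

```python
def pooled_size(input_size, pool_size=2, stride=2):
    """Output width (or height) of a pooling layer along one dimension."""
    return (input_size - pool_size) // stride + 1


side = pooled_size(26)                  # 13
kept = (side * side) / (26 * 26)        # fraction of activations that survive
discarded = 1 - kept

print(side)       # 13
print(discarded)  # 0.75
```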


**Flatten Layer**

After going through all the convolution, activation, and pooling layers, we need to flatten the tensor into a one-dimensional vector. For example, a 5x5x2 tensor would be converted into a vector of size 50.

The convolutional layers have finished extracting the features from the image, so all that's left is to classify these features.
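Flattening the 5x5x2 example can be sketched with nested lists. The tensor is stored channel-first here (2 matrices of 5x5), which is an assumption made for convenience, and its values are just a running counter.

```python
def flatten(tensor):
    """Unroll a 3D tensor (channels x rows x columns) into a 1D list."""
    return [value for matrix in tensor for row in matrix for value in row]


# Hypothetical tensor with 2 channels of 5x5 values, numbered 0..49.
tensor = [
    [[channel * 25 + row * 5 + col for col in range(5)] for row in range(5)]
    for channel in range(2)
]

vector = flatten(tensor)
print(len(vector))  # 50
```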

*After flattening, we use the softmax function, which requires a 1-dimensional input. This is why the flatten layer is important!*

**NOTE**: In a Multilayer Perceptron (MLP), which consists of the linear layers and non-linear activations that we have worked with previously, the inputs were already transformed into a one-dimensional vector. In a CNN, we retain the original shape of the images and only transform them into a one-dimensional vector when we need to classify them, since softmax requires one-dimensional inputs.

**MLP: Flattening is done BEFORE the data is inputted into the model.**

**CNN: Flattening is done AFTER the convolutional, activation, and pooling layers.**

That's it for CNNs! Congratulations!