# Convolutional Neural Networks (CNNs)

>*Reference:*   
https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8

### **Convolution and Pooling:**  


Convolution is a fundamental operation in image processing and signal processing. It involves combining two functions to produce a third function that represents how one function affects the other. In the context of image processing, convolution is often used to process images by applying a filter or kernel to the image pixels.  
> In short, convolution describes how a filter modifies the input.

In convolutional networks, multiple filters slide across the image, and each one learns to respond to a different pattern in the input. For example, a filter that responds to dark edges produces an output map that highlights where those edges occur. During training, the network learns the filter values that make this mapping useful for identifying dark edges.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN1.png" alt="Dark Edge Mapping using a Convolution" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption style="justify-content:center;">Dark Edge Mapping using a Convolution</figcaption>
    </figure>
</div>
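
As a rough illustration of the edge mapping described above, the sketch below slides a hand-written vertical-edge filter over a tiny synthetic image. The image, the Prewitt-style kernel values, and the helper name `apply_filter` are assumptions chosen for this example; in a real CNN the filter values are learned rather than fixed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, k_h, k_w)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Synthetic 4x6 image: dark (0) on the left half, bright (1) on the right half.
image = np.array([[0, 0, 0, 1, 1, 1]] * 4)

# Hand-written vertical-edge kernel (Prewitt-style); a CNN would learn such values.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

edge_map = apply_filter(image, kernel)
print(edge_map)
# [[0 3 3 0]
#  [0 3 3 0]]
```

The non-zero columns in the output line up with the dark-to-bright transition, which is the sense in which the filter "maps" the edge.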

#### How Convolution Works:

Convolution can be done in 2D or in 3D. A grayscale image has a single channel, so it is a 2D array and we perform 2D convolution. An RGB image has three channels: red, green and blue. Here, we either perform a 3D convolution over all channels at once, or perform 2D convolution three times, once for each colour channel.


Convolution filters are usually small square matrices, most commonly 3 x 3.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN2.png" alt="2D Convolution" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption style="justify-content:center;">2D Convolution</figcaption>
    </figure>
</div>

Take a look at the above image. We have the following:
> *Input Image:* 4x4 2D image without any padding.   
> *Convolution Filter:*  3x3 filter  
> *Output Image:* 2x2 image  

Consider the terminologies used:
> *Padding:*   
Padding is the process of adding extra pixels (usually zeros) around the edges of an image. This keeps the output from shrinking and preserves spatial information at the borders.  
> *Stride:*   
Stride refers to the number of columns/rows by which the filter will move.
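
Padding and stride together determine the size of the output. The helper below is a small sketch of the standard output-size formula along one dimension; the function name `conv_output_size` and its parameter names are made up for this example.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# The example above: 4x4 input, 3x3 filter, no padding, stride 1 -> 2x2 output.
print(conv_output_size(4, 3))             # 2
# One pixel of padding on each side keeps the output at the input size.
print(conv_output_size(4, 3, padding=1))  # 4
# A stride of 2 roughly halves the output.
print(conv_output_size(7, 3, stride=2))   # 3
```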

##### Process of Applying the Filter:

1. The filter is first aligned with the top-left 3 x 3 patch of the input image, spanning [0,0] to [2,2]. Here that is:
<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN3.png" alt="" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption style="justify-content:center;"></figcaption>
    </figure>
</div>

2. Once mapped, the values at the respective positions are multiplied with each other and then all the products are added. That would be:

$$ (2 \times 1) + (0 \times 0) + (1 \times 1) + (0 \times 0) + (1 \times 0) + (0 \times 0) + (0 \times 0) + (0 \times 1) + (1 \times 0) = 3 $$

3. The value obtained is the filtered value for the first 3 x 3 patch of the input image. It is placed in the [0,0] cell of the output image.

4. Then move the filter by the chosen stride, first to the right along the row and then down to the next row. Repeat the above steps until all the columns and rows of the input matrix have been covered and the final output matrix is filled.

The output we obtain here is:
<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN4.png" alt="" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption style="justify-content:center;"></figcaption>
    </figure>
</div>
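
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not an optimized implementation. The top-left 3 x 3 patch and the filter are chosen to reproduce the sum of 3 from step 2, assuming the first factor in each product is the image pixel and the second is the filter weight; the remaining input values are made up, so the full output need not match the figure. The name `conv2d` and its parameters are also assumptions.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2D `image`, multiplying and summing at each position."""
    if padding > 0:
        # Zero padding around the border.
        image = np.pad(image, padding, mode="constant", constant_values=0)
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    output = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k_h, c:c + k_w]    # step 1: align the filter
            output[i, j] = np.sum(patch * kernel)  # steps 2-3: multiply, sum, store
    return output                                  # step 4 is the double loop

image = np.array([[2, 0, 1, 1],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 2]])   # hypothetical values outside the top-left patch
kernel = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [0, 1, 0]])

print(conv2d(image, kernel))
# [[3 2]
#  [0 1]]
```

Note that, like most deep learning libraries, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.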

The above process is for a 2D image, that is, an image in grayscale. The same process is applied to a coloured image, except it is done three times: once for the red channel, once for the green and once for the blue. In a CNN layer, the three per-channel results are then summed element-wise to obtain a single output feature map per filter.
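
A minimal sketch of the multi-channel case, assuming the usual CNN convention in which the per-channel results are combined by summing them into one feature map. The function name `conv2d_multichannel`, the all-ones image and the all-ones filter are placeholders for illustration.

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """image: (H, W, C); kernel: (k_h, k_w, C); returns one 2D feature map."""
    k_h, k_w, channels = kernel.shape
    assert image.shape[2] == channels, "image and kernel need the same channel count"
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for c in range(channels):              # one 2D convolution per colour channel
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + k_h, j:j + k_w, c]
                # Accumulate: the per-channel results are summed.
                output[i, j] += np.sum(patch * kernel[:, :, c])
    return output

rgb = np.ones((4, 4, 3))       # placeholder 4x4 RGB image
kernel = np.ones((3, 3, 3))    # placeholder 3x3 filter with one slice per channel
feature_map = conv2d_multichannel(rgb, kernel)
print(feature_map.shape)       # (2, 2)
print(feature_map[0, 0])       # 27.0 (9 products per channel, 3 channels)
```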

#### ReLU Activation Function:

> *Reference:*  
https://builtin.com/machine-learning/relu-activation-function  
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
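
In code, ReLU is a one-liner. The sketch below uses NumPy; the sample values are arbitrary.

```python
import numpy as np

def relu(x):
    """Return x where it is positive and 0 elsewhere."""
    return np.maximum(0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(values))  # [0.  0.  0.  1.5 3. ]
```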

#### Pooling

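Pooling is used to reduce the spatial dimensions (height and width) of a feature map while leaving the number of channels unchanged. It is similar to convolution in that a window slides over the input, but the window has no learnable weights: it simply summarizes the values it covers.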

There are two types of pooling that are commonly used:
1. Max pooling -  Maximum value is taken across the filter window
2. Average pooling - Average value is taken
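
Both pooling types can be sketched with the same sliding-window loop. The function name `pool2d`, its `mode` strings, and the 4 x 4 sample feature map are assumptions for this example; a 2 x 2 window with stride 2 is a common choice, not a requirement.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    if mode == "max":
        reduce = np.max
    elif mode == "average":
        reduce = np.mean
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            output[i, j] = reduce(feature_map[r:r + size, c:c + size])
    return output

feature_map = np.arange(16).reshape(4, 4)
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[ 5.  7.] [13. 15.]]
print(avg_pooled)  # [[ 2.5  4.5] [10.5 12.5]]
```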

## CNN Architecture:


<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN5.png" alt="" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>

When it comes to the implementation of CNNs, many successful architectures use one or more stacks of convolution + pooling layers with ReLU activation, followed by a flatten layer and then one or more dense layers.
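
To make the layer ordering concrete, here is a rough NumPy sketch of a forward pass through one convolution + ReLU + pooling stack, followed by flatten and dense. Everything specific here is an assumption for illustration: the 28 x 28 grayscale input, 8 filters, 10 output classes, random untrained weights, and the helper names. A real model would be built and trained with a deep learning framework.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_relu(x, kernels):
    """x: (H, W); kernels: (n_filters, k, k) -> (H-k+1, W-k+1, n_filters)."""
    k = kernels.shape[-1]
    windows = sliding_window_view(x, (k, k))            # (H-k+1, W-k+1, k, k)
    feature_maps = np.einsum("ijkl,fkl->ijf", windows, kernels)
    return np.maximum(0, feature_maps)                  # ReLU

def max_pool(x, size=2):
    """x: (H, W, C); non-overlapping max pooling over height and width."""
    h, w, c = x.shape
    h, w = h - h % size, w - w % size                   # drop rows/cols that do not fit
    blocks = x[:h, :w].reshape(h // size, size, w // size, size, c)
    return blocks.max(axis=(1, 3))

def dense(x, weights, bias):
    """Fully connected layer on a flat vector."""
    return x @ weights + bias

rng = np.random.default_rng(0)
image = rng.random((28, 28))                            # placeholder grayscale input
kernels = rng.standard_normal((8, 3, 3))                # 8 untrained 3x3 filters

features = conv_relu(image, kernels)                    # (26, 26, 8)
pooled = max_pool(features)                             # (13, 13, 8)
flat = pooled.reshape(-1)                               # (1352,)
weights = rng.standard_normal((flat.size, 10)) * 0.01   # untrained dense weights
logits = dense(flat, weights, np.zeros(10))             # (10,)
print(features.shape, pooled.shape, flat.shape, logits.shape)
```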

### Factors Determining the Architecture of the CNN Model

How do you decide what architecture to use for the CNN model that you are to use? Keep in mind the following factors:


1. **Analyze your input data**:   
Understand the nature of your input data. Consider the *dimensions*, *format*, and *characteristics* of the images or data you will be working with. This information will help determine the appropriate input shape and preprocessing steps, such as resizing, normalization, or data augmentation.  

2. **Task Requirements:**   
Clearly define the objective of your model. Different tasks, such as image classification, object detection, or semantic segmentation, may have specific architectural requirements. 
  
3. **Complexity of the Problem:**   
Assess the complexity of the problem you are trying to solve. More complex tasks, such as fine-grained image classification or scene understanding, typically require deeper architectures to capture subtle features.
  
4. **Model Capacity:**   
Consider the capacity or complexity of the model. Deeper networks with more parameters have the potential to learn more intricate patterns but may also be more prone to overfitting, especially if the available dataset is limited. Striking the right balance between model capacity and data availability is crucial.
  
5. **Layer Types:**   
CNNs typically consist of convolutional layers, pooling layers, and fully connected layers. The arrangement and combination of these layers can vary. Convolutional layers extract local spatial patterns, pooling layers downsample the features to reduce spatial dimensions, and fully connected layers process the global information for classification or regression. Determine the number and order of these layers based on the complexity of the task and the size of the input data.

6. **Network Depth:**  
The depth of the CNN refers to the number of convolutional and pooling layers. Deeper networks can potentially capture more abstract features but may also require more computational resources and larger datasets. Consider the trade-off between depth and computational efficiency.
  
7. **Filter Sizes and Strides:**  
Decide on the size and stride of the filters used in the convolutional layers. Smaller filters (e.g., 3x3) are commonly used to capture local patterns, while larger filters (e.g., 5x5 or 7x7) can capture more global structures. Strides determine the amount of shifting that occurs between each application of the filter.
  
8. **Pooling Strategies:**   
Determine the type and size of pooling operations (e.g., max pooling, average pooling) and the amount of downsampling to perform. Pooling helps reduce spatial dimensions and extract dominant features.
  
9. **Regularization Techniques:**
Incorporate regularization techniques to prevent overfitting. Common approaches include dropout, batch normalization, weight decay (L2 regularization), or early stopping. The choice and placement of these techniques can impact the model's generalization ability.
  
10. **Existing Architectures:**     
Consider established CNN architectures that have been successful in similar tasks, such as VGGNet, ResNet, Inception, or EfficientNet. These architectures often serve as a good starting point and can be customized or fine-tuned for your specific needs. 
  
11. **Computational Resources:**   
Take into account the available computational resources, such as GPU memory and processing power. 

12. **Hyperparameter Tuning:**   
Experiment with different hyperparameters, such as learning rate, batch size, optimizer, activation functions, and weight initialization methods. Conduct systematic hyperparameter tuning to find the optimal configuration for your model.
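
Several of these factors (model capacity, filter sizes, computational resources) come down to counting parameters. The sketch below counts the weights of a single convolutional layer and compares one 5x5 layer with two stacked 3x3 layers, which cover the same 5x5 receptive field at stride 1. The function name `conv_params` and the 64-channel setting are assumptions for this example.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Learnable parameters in one convolutional layer with square filters."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

one_5x5 = conv_params(5, 64, 64)       # a single 5x5 layer
two_3x3 = 2 * conv_params(3, 64, 64)   # two stacked 3x3 layers
print(one_5x5)  # 102464
print(two_3x3)  # 73856
```

Under these assumptions the stacked small filters use fewer parameters and add an extra non-linearity between the two layers, which is one reason small filters are a common default.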

### Popular CNN Architectures

>*Reference:*  
https://levelup.gitconnected.com/a-practical-guide-to-selecting-cnn-architectures-for-computer-vision-applications-4a07ef90234

#### *LeNet:*
LeNet was one of the first convolutional neural networks, and it has been around since the 1990s. This architecture is relatively simple, with only *7 layers*.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN6.png" alt="LeNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>

**When to use?**
> Small image classification tasks (e.g. recognizing handwritten digits)

#### *AlexNet:*
AlexNet is a deep CNN with *8 layers* (5 convolutional and 3 fully connected).

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN7.png" alt="AlexNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>

**When to use?**
> Large-scale image classification tasks   
> Tasks that require a high degree of accuracy and have a large dataset available

#### *VGGNet:*
VGGNet is a deeper CNN than AlexNet, with up to *19 layers*. It uses small convolutional filters to achieve high accuracy in image classification tasks.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN8.png" alt="VGGNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>

**When to use?**
> Fine-grained classification tasks (e.g. identifying dog breeds or flower species)

#### *GoogLeNet:*
GoogLeNet is a CNN architecture that uses inception modules, which are blocks of convolutional layers that have multiple filter sizes. These modules allow for more efficient use of computational resources and higher accuracy in image classification tasks.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN9.png" alt="GoogLeNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>
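
The core idea of an inception module, parallel convolutions with different filter sizes whose outputs are concatenated along the channel axis, can be sketched as below. This is a simplified illustration and not GoogLeNet's actual module: it omits the pooling branch and the 1x1 dimension-reduction layers, uses random untrained weights, and the helper names `conv_same` and `inception_like` are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_same(x, weights):
    """x: (H, W, C_in); weights: (k, k, C_in, C_out) with odd k; zero 'same' padding."""
    k = weights.shape[0]
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (0, 0)))                # pad height and width only
    windows = sliding_window_view(padded, (k, k), axis=(0, 1))  # (H, W, C_in, k, k)
    return np.einsum("hwcij,ijcf->hwf", windows, weights)

def inception_like(x, branch_weights):
    """Run every branch on the same input and concatenate along channels."""
    branches = [np.maximum(0, conv_same(x, w)) for w in branch_weights]
    return np.concatenate(branches, axis=-1)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 4))                    # placeholder feature map with 4 channels
branch_weights = [
    rng.standard_normal((1, 1, 4, 2)),       # 1x1 branch, 2 filters
    rng.standard_normal((3, 3, 4, 3)),       # 3x3 branch, 3 filters
    rng.standard_normal((5, 5, 4, 5)),       # 5x5 branch, 5 filters
]
out = inception_like(x, branch_weights)
print(out.shape)  # (8, 8, 10)
```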

**When to use?**
> Large-scale image classification tasks, and as a feature extractor for tasks such as object detection and segmentation

#### *ResNet:*
ResNet is a CNN architecture that uses residual connections, which are shortcuts between layers that allow the network to learn the residual mapping. This architecture can go as deep as *152 layers* while maintaining high accuracy in image classification tasks.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN9.png" alt="ResNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>
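
A residual connection simply adds a block's input back to its output, so the layers inside only need to learn the residual F(x). The sketch below shows the idea with two small dense layers instead of convolutions to keep it short; the function names, shapes and weights are assumptions for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    """x: (features,); w1, w2: (features, features). Computes relu(F(x) + x)."""
    residual = relu(x @ w1) @ w2   # F(x): the mapping the block has to learn
    return relu(residual + x)      # shortcut: add the input back

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
# With all-zero weights F(x) is 0, so the block passes its input straight through.
print(residual_block(x, zero, zero))  # [1. 2. 3.]
```

This pass-through behaviour is part of why very deep stacks of such blocks remain trainable: a block can fall back to the identity instead of degrading the signal.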

**When to use?**
> Recognizing fine details in images

#### *DenseNet:*
DenseNet is a CNN architecture that connects each layer to every other layer in a feed-forward fashion. This architecture maximizes feature reuse and allows for better gradient flow, which leads to higher accuracy in image classification tasks.

<div style="display: flex; flex-wrap: wrap; justify-content: center;">
    <figure style="margin: 10px;">
        <img src="./Images/CNN/CNN11.png" alt="DenseNet Architecture" style="width: auto-width; height: auto-height; object-fit: cover;">
        <figcaption></figcaption>
    </figure>
</div>
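
Dense connectivity means each layer receives the concatenation of all earlier feature maps within the block. The following sketch shows this with 1x1 convolutions written as per-pixel matrix multiplications; the growth rate of 3, the input shape, the random weights and the name `dense_block` are assumptions for illustration.

```python
import numpy as np

def dense_block(x, layer_weights):
    """x: (H, W, C0); each weight matrix maps all channels so far to `growth` new ones."""
    features = [x]
    for w in layer_weights:
        stacked = np.concatenate(features, axis=-1)  # input = every earlier feature map
        new_maps = np.maximum(0, stacked @ w)        # 1x1 convolution + ReLU
        features.append(new_maps)
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 4))                            # placeholder input with 4 channels
growth = 3
layer_weights = [rng.standard_normal((4 + i * growth, growth)) for i in range(3)]

out = dense_block(x, layer_weights)
print(out.shape)  # (8, 8, 13): 4 input channels + 3 layers x 3 new channels
```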

**When to use?**
> Tasks that benefit from feature reuse and parameter efficiency (e.g. medical image analysis)

#### Other CNN Architectures:

1. MobileNet
2. EfficientNet

If the **input data is small and simple**, such as images with low resolution, then a smaller CNN architecture such as **LeNet or AlexNet** might be sufficient.  
  
If the **input data is large and complex**, such as high-resolution images or videos, then a larger and more complex CNN architecture such as **VGG, Inception, or ResNet** might be needed to extract relevant features.  
  
If the task involves **object detection or segmentation**, then architectures like **YOLO, R-CNN, or Mask R-CNN** might be suitable.  
  
If the task involves **processing sequential data such as speech or text**, then architectures such as **Convolutional LSTM or Time Distributed CNN** might be used.  
  
If the available computational resources are limited, then smaller architectures with fewer layers and parameters may be preferred to reduce training time and memory usage.  