# Convolutional Neural Networks (CNN)
---------------------------------------

## Introduction to CNNs

Wikipedia describes a CNN as 'a regularized type of feed-forward neural network that learns feature engineering by itself via filters (or kernel) optimization.'

In simpler words, a CNN is a type of neural network that learns to understand and analyse images and other grid-structured data. What makes a CNN special is that it learns features directly from the data instead of relying on humans to handcraft them. The filter weights are learned during training through optimization.

Within ML and AI, CNNs are among the most versatile neural networks. They are useful for tasks like image classification and object detection.

### Importance of CNNs

CNNs are one of the most important families of models in ML. They are used in many applications, such as image classification, object detection, and speech recognition. They have been in use for many years and remain widely used today. They also appear in generative models like GANs, and in areas such as medical image processing and natural language processing.

### Terms used in CNNs
1) Convolution:
    - Convolution is a mathematical operation on two functions (f and g) that produces a third function. This third function expresses how the shape of f is modified by g. In a CNN, convolving the input with a filter produces a feature map, which is the output of a convolutional layer.

   → Mathematical Representation of Convolution:


   
   ![image.png](attachment:image.png)

   In the image,
    - f and g are the functions being convolved.
    - (f*g)(t) is the output of the convolution at point t.
    - t is the real-valued variable at which the convolution result is evaluated. For example, if we're working with images, t corresponds to the location of a specific pixel.
    - τ is the variable of integration. Inside the integral, one function is evaluated at τ and the other, flipped and shifted by t, is evaluated at t - τ; their product is accumulated over all values of τ.
    - dτ is the differential of τ. It indicates that the integration is carried out with respect to τ; it is not a derivative.
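For sampled data the integral becomes a sum, (f*g)[n] = Σ_k f[k] · g[n - k]. The following is a minimal pure-Python sketch of that discrete form; the function name `convolve_1d` and the sample values are illustrative assumptions, not part of these notes.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for k in range(len(f)):
            # g is indexed at n - k, i.e. g is flipped and shifted by n
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out


signal = [1, 2, 3]        # plays the role of f
kernel = [0, 1, 0.5]      # plays the role of g
result = convolve_1d(signal, kernel)
print(result)
```

Note that deep-learning libraries usually skip the flip and compute a cross-correlation, while still calling the operation a convolution. Since the kernel weights are learned, the distinction does not matter in practice.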


2) Filter / Kernel :
    - A filter or kernel is a small matrix of weights that slides over the input data (like an image). At each position it performs element-wise multiplication with the part of the input it currently covers and sums the results into a single output pixel.
    
    - These small matrices are the heart of what makes a CNN work. Because each one looks at only a small part of the input at a time, processing stays efficient, and they are the primary components that help CNN models extract useful features from input data.

    * Okay, but how does a filter even work?

        → A filter can be understood as sliding your cupped hands across a window and looking through them. The size of the cup, how fast or slowly you slide it, and how you process what you see all change your understanding of what is on the other side of the window.

        → A step by step example of how filters work:

            Size → Sliding → Convolution operation → Learning → Feature Extraction

        Let's take a matrix example to understand it better.


        
        ![image-2.png](attachment:image-2.png)

        In the image, the filter size is 2x2. The highlighted region is the Local Receptive Field (LRF) for a 2x2 kernel.

    The concept of the LRF is inspired by the fact that many neurons in our visual cortex respond to stimuli located in a limited region of the visual field. So, the LRF is basically like what you focus on in your visual field. Interesting, isn't it?   

    <i>A neuron in a convolution layer can be understood as a unit that looks only at a local region (its LRF) of the input.</i>


    <i>A neuron in a fully connected layer can be understood as a tiny processor that takes the outputs of the previous layer, multiplies each one by a corresponding weight learned during training, and provides the sum of these products (the weighted sum) as its output.</i>

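The sliding, multiply-and-sum behaviour of a filter described above can be sketched in plain Python. This is a minimal illustration without padding; the function name `apply_filter`, the 3x3 input, and the kernel values are assumptions chosen for the example.

```python
def apply_filter(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) with no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply the local receptive field by the kernel element-wise, then sum
            total = sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = apply_filter(image, kernel)
print(feature_map)
```

Each entry of `feature_map` comes from one 2x2 local receptive field, and the same four kernel values are reused at every position.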


3) Weight sharing :
    - Weight sharing, or parameter sharing, is a key feature that sets a CNN apart from fully connected neural networks.

    - A kernel's weights are shared across all the neurons that detect the same feature: the same kernel is applied at every position of the input. So, in this image:
        
        ![image-2.png](attachment:image-2.png)

    The weights applied to the shaded field are the same weights that are applied at every other position the kernel visits.
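To see why weight sharing matters, it can help to count parameters. The sketch below compares one 2x2 kernel with a fully connected layer that produces the same number of outputs from a 7x4 (width x height) input. The helper names are hypothetical, and a stride of 1 with no padding is assumed.

```python
def conv_param_count(kernel_h, kernel_w, in_channels=1, out_channels=1):
    # One kernel per output channel, reused at every position, plus one bias each
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


def dense_param_count(n_inputs, n_outputs):
    # Every input connects to every output with its own weight, plus one bias per output
    return n_inputs * n_outputs + n_outputs


in_h, in_w = 4, 7
k = 2
out_h, out_w = in_h - k + 1, in_w - k + 1  # stride 1, no padding

conv_params = conv_param_count(k, k)
dense_params = dense_param_count(in_h * in_w, out_h * out_w)
print(conv_params, dense_params)
```

Under these assumptions the convolutional layer needs 5 parameters, while the equivalent dense layer needs 522.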





4) Stride:
    - Stride is the parameter that determines the step size, i.e. the number of positions the kernel moves each time it slides over the input data. The stride is inversely related to the size of the feature map: a larger stride gives a smaller feature map. With a stride of 1, neighbouring receptive fields overlap (for any kernel larger than 1x1), so the same input pixels contribute to several outputs.
    
    - A stride of `n` means that the kernel moves `n` pixels at a time, both horizontally and vertically.

    #### Stride: How does it affect CNN models?
    1) Dimensionality Reduction:
        - Dimensionality reduction, as the term implies, is the process of reducing the number of values under consideration. Here, it means that the output feature map is spatially smaller than the input image: the larger the stride, the smaller the output.
        
        Relationship between output dimensions and stride:
        - output dimension ∝ 1/stride (approximately)

    2) Relationship between stride and computation time:
    - computation time decreases as stride increases
        - Since the kernel needs to be applied fewer times when the stride is bigger, a larger stride reduces computation time and increases computational efficiency.

    3) Model Capacity
        - If our stride is too large, the model loses detailed information, since the filter samples the input more coarsely (and skips pixels entirely when the stride exceeds the filter size). This can reduce the model's accuracy!

    
    * Let's look at an example to understand stride a bit better.
        - Output Size = floor((Input Size - Filter Size) / Stride) + 1 (Size can be either height or width as needed; this is the general formula when no padding is used.)

        ![image-2.png](attachment:image-2.png)

        In this image, a 2x2 kernel is being applied. Let us first consider it being applied only horizontally (along the width) with a stride of 2. The width of the resulting feature map can be calculated as:

        Input size : 7x4 (width x height)

        Filter size : 2x2

        Stride : 2

        Since we're calculating only the width, we use only the width values in the calculation.
        
        Output size :

        = floor((7-2)/2)+1

        = floor(5/2)+1

        = floor(2.5)+1 

        A fractional number of kernel steps is not possible, so 2.5 is rounded down to 2: the kernel cannot take a step that would move it past the edge of the input. This rounding down is done by the `Floor Function`.
                
        = 2+1

        = 3

        Let's take the same example for a stride that moves in both horizontal and vertical directions (stride size = 2)

        [Output height,Output width]: 

        = [floor((4-2)/2)+1,floor((7-2)/2)+1]

        = [floor(2/2)+1,floor(5/2)+1]

        = [1+1,2+1]

        = [2,3]
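The calculation above can be checked with a small helper. This is a minimal sketch that assumes the round-down (floor) convention; the function name `conv_output_size` and its optional `padding` argument are illustrative.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """floor((input + 2 * padding - filter) / stride) + 1 along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# 7x4 (width x height) input, 2x2 filter, stride 2, no padding
out_w = conv_output_size(7, 2, stride=2)
out_h = conv_output_size(4, 2, stride=2)
print([out_h, out_w])
```

Integer division (`//`) performs the floor step, so the fractional 2.5 in the width calculation becomes 2 before 1 is added.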

5) Padding
    - Padding is the process of adding extra rows and columns of pixels (usually zeros) around the border of an image. This lets the kernel cover border pixels as often as interior ones, so information at the border is preserved, and it helps preserve the spatial dimensions of the output.

    - There are two common types of padding in the context of CNNs:
     1) Valid Padding:
        - Valid padding means that no padding is added at all: the kernel is applied only at positions where it fits entirely inside the input. It is also known as `no padding`. With this type of padding, the spatial dimensions of the feature maps are smaller than those of the input (for any kernel larger than 1x1).


     2) Same padding:
        - Unlike valid padding, same padding adds zeros around the border of the image so that, with a stride of 1, the output has the same dimensions as the input. It is useful when preservation of spatial information is a must, as in image segmentation or object detection. It can increase the computation time of a model, but it retains border information and can help improve the overall performance of the network.

    Let's see padding in use through mathematics : 
     1) Valid padding :

        - Since valid padding adds nothing, the output size is given by the same formula as for stride, i.e. Output size = floor((Input Size - Filter Size) / Stride) + 1

     2) Same padding:

        - In same padding, we need to determine the size of the padding before actually applying it. We can find the padding size per side with the formula:

        Padding Size (P) = ((Input Width - 1) * Stride + Filter Width - Input Width) / 2

        - The output size for same padding is given as:

        Output size (Width) = floor((Input Width + 2 * Padding Size - Filter Width) / Stride) + 1 


    <i>In both cases, if the input or the stride differs between the two directions (for example, a stride given as a (height, width) pair), we need to calculate the size twice, once for height and once for width.</i>

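The padding formulas above can be tried out with a short sketch. It assumes a stride of 1 and an odd filter size, so that the padding splits evenly between the two sides; with an even filter size the total padding is odd and has to be split unevenly. The function names are hypothetical.

```python
def same_padding_total(input_size, filter_size, stride=1):
    """Total padding (both sides combined) that keeps output size == input size."""
    return max((input_size - 1) * stride + filter_size - input_size, 0)


def padded_output_size(input_size, filter_size, stride, pad_total):
    # pad_total is the padding summed over both sides (2 * P in the formula above)
    return (input_size + pad_total - filter_size) // stride + 1


total = same_padding_total(7, 3, stride=1)
per_side = total // 2
same_out = padded_output_size(7, 3, 1, total)
valid_out = padded_output_size(7, 3, 1, 0)
print(per_side, same_out, valid_out)
```

With a width of 7 and a 3-wide filter, one column of zeros on each side keeps the output width at 7, while valid padding shrinks it to 5.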
    
6) Pooling/Subsampling:
    - A pooling (or subsampling) layer progressively reduces the spatial size of its input by subsampling it, hence the name. There are 3 main types of pooling: 

    1) Min Pooling:
        - Min pooling takes the smallest value from the selected cells or pixels. It is computationally cheap, but it keeps only the minimum value of each selected region, which causes information / feature loss.
    
    2) Max Pooling:
        - Max pooling, as the name suggests, takes the largest value from the selected cells or pixels. It makes the extracted features more invariant to small translations of the input, but it keeps only the maximum value of each selected region, which causes information / feature loss.

    3) Average Pooling:
        - Average pooling takes the average value of the selected cells or pixels. It reduces computational complexity and gives smoother transitions between feature maps, which can be helpful in tasks like image segmentation. However, it loses information about the individual pixels of the image.

        * Global Average Pooling:
            - Global average pooling takes the average over an entire feature map rather than over a small window. It reduces the spatial dimensions of the feature map to a single value per feature map channel, making it computationally efficient. However, it discards spatial information by averaging over the entire feature map, resulting in a loss of spatial locality.
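The pooling variants above differ only in how each window is reduced to one number. Here is a minimal sketch using non-overlapping windows (stride equal to the window size); the function name `pool_2d` and the sample feature map are assumptions made for illustration.

```python
def pool_2d(feature_map, size, mode="max"):
    """Non-overlapping pooling: window `size` x `size`, stride equal to `size`."""
    reducers = {
        "max": max,
        "min": min,
        "avg": lambda values: sum(values) / len(values),
    }
    reducer = reducers[mode]
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values inside the current window
            window = [
                feature_map[top + a][left + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool_2d(fmap, 2, "max")
min_pooled = pool_2d(fmap, 2, "min")
avg_pooled = pool_2d(fmap, 2, "avg")
# Global average pooling: a single value for the whole feature map
global_avg = sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
print(max_pooled, min_pooled, avg_pooled, global_avg)
```

A 4x4 map becomes 2x2 under each windowed variant, and a single number under global average pooling.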

7) Fully Connected Layers:
    - In a CNN, the fully connected layers (dense layers) come after the convolutional and pooling layers. These layers take the learned representation from the convolutional and pooling layers and transform it into a single output vector, which is the model's prediction.

    - These layers connect every neuron of a layer to every neuron of the next layer.

    - So, a fully connected layer can be defined as 'a layer in which every neuron is connected to every neuron of the previous layer.'
    
    - Uses of Fully Connected Layer:
        - Global Feature Learning:
            - Global feature learning is the process of learning the global features of an image. This matters because convolution and pooling layers mostly learn local features, so relationships between distant parts of the image can be missed. Fully connected layers help a model learn global features by connecting every neuron to every neuron of the previous layer, which then enables the model to learn complex relationships and dependencies between features across the entire input.

        - Parameter Aggregation:
            - Parameter aggregation is the process of combining and summarizing the information learned throughout the network, which is generally done before the classification or regression layer. This step condenses the learned information from the previous layers into a compact representation, which ultimately helps the model's decision-making. Some common techniques for parameter aggregation are: 

                1) Fully Connected Layer(FC)

                2) Global Average Pooling(GAP)

                3) Global Max Pooling(GMP)
                
                4) Concatenation

        - Non Linearities:
            - Fully connected layers are followed by non-linear activation functions like ReLU or Sigmoid, which allow a model to learn complex, non-linear relationships in data.

        - Semantic Understanding:
            - Semantic understanding refers to a model's ability to comprehend the meaning and significance of features in the given input data. Fully connected layers play a huge role in semantic understanding within a CNN model by capturing global patterns, relationships and high-level abstractions from the input data.

        - Classification and Decision Making:
            - CNNs are mostly used for classification and decision-making tasks, and the fully connected layers are what make classification possible for a CNN model. They do this using the global features learned in the FC layers.

        - Adaptation to varying input sizes:
            - A fully connected layer itself expects an input of fixed size. To handle inputs of different dimensions, the feature maps are first reduced to a fixed-length vector (for example with global average pooling) before they reach the fully connected layers.

8) Flattening:
    - Before the output of the pooling layers is connected to the fully connected layers, the data must be transformed from multi-dimensional form (e.g. [[1,2],[3,4]]) to 1-D form ([1,2,3,4]). This process of making the data 1-D is called flattening. Its goal is to convert all the learned features into a format that the FC layer can use. The output of this layer is the input to the FC layer.
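Flattening, and the weighted sum that a fully connected neuron then computes, can be sketched as follows. The function names, weights, and bias are made-up values for illustration; a real layer would learn its weights during training and apply an activation function afterwards.

```python
def flatten(feature_maps):
    """Turn a nested list (any depth) into a flat 1-D list, row by row."""
    flat = []
    for item in feature_maps:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def dense_neuron(inputs, weights, bias):
    # Weighted sum of all inputs plus a bias (activation left out for clarity)
    return sum(x * w for x, w in zip(inputs, weights)) + bias


flat = flatten([[1, 2], [3, 4]])
output = dense_neuron(flat, weights=[0.5, -1.0, 0.25, 2.0], bias=1.0)
print(flat, output)
```

A full dense layer is just many such neurons, each with its own weight vector over the same flattened input.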

9) Activation Functions:
    - Activation functions introduce non-linearity into the model, which allows it to learn and perform complex tasks. Without activation functions, no matter how many layers a model has, the layers collapse into a single linear transformation, so the model behaves much like a single <u>Perceptron</u>. <i>Activation functions are typically used after each convolution and fully connected layer in a CNN</i>.
        <i> A perceptron is the simplest form of a neural network. It serves as a foundational building block for more complex neural network architectures. It is basically a binary classification algorithm that takes multiple inputs, computes their weighted sum, and produces a single binary output.</i>

    - __Commonly Used Activation Functions:__

        - ReLU (Rectified Linear Unit/ Rectifier activation function):

            → f(x) = max(0,x)

            → ReLU is a simple activation function that returns `x` for any positive input (`x` > 0) and returns `0` for any other input (`x` <= 0).


            → As the name `Rectifier` implies, it rectifies negative input to 0 and leaves positive input as it is.

            → ReLU Visualized:
            
            ![image-3.png](attachment:image-3.png)

        - Leaky ReLU (Leaky Rectified Linear Unit):

            → f(x) = max(αx, x), where
                α = small positive constant less than 1 (for example 0.01).

            → The main difference between Leaky ReLU and ReLU is that Leaky ReLU allows a small non-zero gradient for negative inputs, introducing a slight slope (αx) for negative values.

            → The value of α determines how 'leaky' the function is for negative inputs, i.e. how much of a negative value is passed through.

            → Leaky ReLU helps us tackle the `dying ReLU` problem, where neurons whose inputs are always negative output 0, receive zero gradient, and therefore stop learning. Because Leaky ReLU keeps a small gradient for negative inputs, such neurons can continue to learn.

            → Leaky ReLU Visualized:
            
            ![image-4.png](attachment:image-4.png)

        - Sigmoid:

            → The sigmoid function is commonly used in binary classification problems. It maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability.

            → Sigmoid is also known as the logistic function, as it is the function used in logistic regression.

            → σ(x) = 1 / (1 + e^(-x)), where e is the base of the natural logarithm (≈ 2.71828).

            → To turn the sigmoid output into a class label, a `Decision Threshold` is applied. The threshold is usually set to 0.5, meaning that if the output of the sigmoid function is greater than 0.5 the predicted class is 1, and if it is less than 0.5 the predicted class is 0.

            → Sigmoid Visualized:

            ![image-5.png](attachment:image-5.png)

        - Binary Step-Function:

            → The Binary Step-Function is the simplest type of threshold-based activation function.

            → It outputs 1 (the neuron fires and passes a signal to the next layer) if the input reaches the threshold, and 0 otherwise.

            → f(x) = 1 if x >= y else 0, where y is the threshold.

            → Binary Step-Function Visualized:

            ![image-6.png](attachment:image-6.png)

        - tanh:

            → tanh maps any real number to a value between -1 and 1. It is most often used in hidden layers, and in output layers when the target lies in that range.

            → It is similar to the sigmoid function, but its output is centred on 0.

            → f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), where e ≈ 2.718.

            → f(x) = 2σ(2x) - 1, where σ(x) = 1 / (1 + e^(-x)) (tanh expressed in terms of the sigmoid function).

            → tanh Visualized:

            ![image-7.png](attachment:image-7.png)
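The activation functions above translate almost directly into code. This is a minimal sketch using only Python's `math` module; the default `alpha=0.01` and `threshold=0.0` are illustrative choices, not values fixed by these notes.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # For 0 < alpha < 1 this equals x when x > 0 and alpha * x otherwise
    return max(alpha * x, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_step(x, threshold=0.0):
    return 1 if x >= threshold else 0


def tanh_via_sigmoid(x):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), round(sigmoid(x), 4),
          binary_step(x), round(math.tanh(x), 4))
```

`tanh_via_sigmoid` is included only to check the identity between tanh and sigmoid; in practice `math.tanh` would be used directly.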

    

<i> Some other activation functions are : 
1. **Softmax**:
    - **Definition**: The softmax function is used to convert a vector of values into a probability distribution.
    - **Formula**:
      
      softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
      
      where `x_i` is the `i`-th element of the input vector `x`.

2. **Swish**:
    - **Definition**: A smooth, non-monotonic activation function that has been reported to match or outperform ReLU in some deep networks.
    - **Formula**:
      
      swish(x) = x * σ(x) = x / (1 + e^(-x))
      

3. **ELU (Exponential Linear Unit)**:
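    - **Definition**: An activation function that is the identity for positive inputs and saturates smoothly towards -α for negative inputs, which keeps mean activations closer to zero and avoids the zero gradient that ReLU has for negative inputs.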
    - **Formula**:
      
      ELU(x) = {
        x, if x > 0
        α * (e^x - 1), if x ≤ 0
      }
      
      where `α` is a hyperparameter (typically 1.0).

4. **SELU (Scaled Exponential Linear Unit)**:
    - **Definition**: A scaled version of ELU that maintains the mean and variance of inputs close to 0 and 1, respectively.
    - **Formula**:
      
      SELU(x) = λ * {
        x, if x > 0
        α * (e^x - 1), if x ≤ 0
      }
      
      where `λ` and `α` are predefined constants (`λ ≈ 1.0507` and `α ≈ 1.67326`).

5. **GELU (Gaussian Error Linear Unit)**:
    - **Definition**: An activation function that smooths the ReLU function using the Gaussian cumulative distribution function.
    - **Formula**:
      
      GELU(x) = x * Φ(x)
      
      where `Φ(x)` is the cumulative distribution function of the standard normal distribution.

6. **Mish**:
    - **Definition**: A smooth, non-monotonic activation function that has been reported to perform better than ReLU and Swish on some tasks.
    - **Formula**:
      
      Mish(x) = x * tanh(ln(1 + e^x))

You can view more activation functions in the [Keras documentation](https://www.tensorflow.org/api_docs/python/tf/keras/activations).

</i>

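The formulas for these additional activation functions can be sketched in plain Python as well. The SELU constants are the ones quoted above; the max-subtraction inside `softmax` is a common numerical-stability trick and an addition of this sketch, not part of the formula itself.

```python
import math


def softmax(values):
    # Subtracting the max does not change the result but avoids overflow in exp
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def swish(x):
    return x / (1.0 + math.exp(-x))


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x, lam=1.0507, alpha=1.67326):
    return lam * elu(x, alpha)


def gelu(x):
    # Phi(x), the standard normal CDF, written with the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


def mish(x):
    # ln(1 + e^x) is the softplus function
    return x * math.tanh(math.log1p(math.exp(x)))


probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs], round(sum(probs), 4))
```

Unlike the other functions here, softmax acts on a whole vector at once, which is why it is typically used on the final layer of a multi-class classifier.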

10) Epoch:
    - In most cases, it is not possible to feed all the training data to the model in a single pass, for reasons like the size of the dataset and memory limitations. To handle this issue, we split the training dataset into smaller subsets called `Batches`. Each batch contains `Batch Size` samples and is processed separately.

    - One Epoch is completed when the model has processed the entire training set once, computing the loss and updating its parameters to reduce that loss along the way. However, a higher number of epochs does not guarantee higher accuracy, and in some cases it can lead to overfitting. 
    
    <i>Overfitting is when a model memorizes the training data rather than learning patterns that generalize. An overfitted model performs very well on the training data but poorly on the test data.</i>

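The relationship between samples, batches, and epochs can be made concrete with a skeleton loop. This sketch performs no real training: the dataset is a stand-in list of 10 items, and `iterate_batches` is a hypothetical helper.

```python
import math


def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`; the last batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for 10 training examples
batch_size = 4
num_epochs = 3

updates = 0
for epoch in range(num_epochs):
    for batch in iterate_batches(samples, batch_size):
        # A real training loop would compute the loss on `batch`
        # and update the model's parameters here.
        updates += 1

batches_per_epoch = math.ceil(len(samples) / batch_size)
print(batches_per_epoch, updates)
```

With 10 samples and a batch size of 4, each epoch consists of 3 batches (4, 4, and 2 samples), so 3 epochs give 9 parameter updates.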

11) Loss Function:
    - The loss function is a key component of the training process. It measures the difference between the predicted output and the actual output, and this measurement is used to update the model's parameters. When training a model, our main goal, along with increasing accuracy, is to minimize the loss.

    - Types of Loss Functions:
        - There are a lot of types of loss functions. Some of them are :

            1) **Mean Squared Error (MSE)**
            2) **Mean Absolute Error (MAE)**
            3) **Cross Entropy Loss**
                - Binary Cross Entropy Loss
                - Categorical Cross Entropy Loss
            
            4) **Hinge Loss**
            5) **Focal Loss**

        1) Mean Squared Error (MSE):
            - Mean Squared Error is used mostly in regression tasks. It works by averaging the squared differences between predicted and actual values. It is quite simple to understand and easy to use. It is also called squared-error loss.

            - Formula:

                MSE = (1/n) * Σ (y_i - ŷ_i)^2, where

                n = number of samples
                
                y_i = actual output
                
                ŷ_i = predicted output

        2) Mean Absolute Error (MAE):
            - Mean Absolute Error is also used in regression tasks. It is the average of the absolute differences between predicted and actual values. It is also called absolute-error loss. It is less sensitive to outliers than MSE, which makes it more useful when outliers are present in the data.

            - Formula:

                MAE = (1/n) * Σ |y_i - ŷ_i|, where

                n = number of samples

                y_i = actual output

                ŷ_i = predicted output

        3) Cross Entropy Loss:
            - Cross Entropy Loss is used in classification tasks. It measures the difference between the predicted probability distribution and the actual one. It is also called Log Loss. It is applicable to both binary and multiclass classification tasks. It has two types:

                1) Binary Cross Entropy Loss:

                    - Binary Cross Entropy Loss is used in binary classification tasks. It is defined as: 
                        - For single input,

                            -(y * log(ŷ) + (1 - y) * log(1 - ŷ)), where y is the actual output and ŷ is the predicted output.

                        - For multiple inputs,

                            -(1/n) * Σ [y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i)], where y_i is the actual output, ŷ_i is the predicted output and n is the number of samples.
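The MSE, MAE, and binary cross entropy formulas above can be written as short functions. This is a minimal sketch: the sample values are made up, and the `eps` clipping is an added safeguard (an assumption of this sketch) that keeps `log` finite when a prediction is exactly 0 or 1.

```python
import math


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))

labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
print(binary_cross_entropy(labels, probs))
```

In the regression example the single large error (2 versus 4) dominates MSE more than MAE, which illustrates why MAE is less sensitive to outliers.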





13) Optimizer

14) Feature Map

15) Transfer Learning

16) Data Augmentation

18) Learning Rate


## Parts of a CNN
A typical Convolutional Neural Network consists of various layers, each of which is responsible for a specific task. The most common layers in CNNs are:




