# **Basic Operations of Convolutional Neural Networks (CNNs/ConvNets)**:

ConvNets are quite similar to regular neural networks, but they differ architecturally. ConvNets explicitly assume that the input is an image, so their layers arrange neurons in a 3D volume. In other words, the input layer accepts a volume of `Width`, `Height` & `Depth` units rather than a flat vector.

The ConvNet is a sequence of layers & each layer transforms one volume of activations into another through a differentiable function. Generally, a model architecture has the following layers laid out sequentially:

1. `INPUT[W x H x D]` layer which holds the raw pixel values of an image with _width `W`_, _height `H`_ & _depth `D` = 3 colour channels `R,G,B`_.

2. The `CONV` layer computes the output of neurons that are each connected to a local region of the input, which results in a volume of `[W x H x K]` where `K` is the number of filters (assuming zero-padding that preserves the spatial size).

3. The `RELU` layer applies an elementwise activation, `max(0, x)`, leaving the size of the volume unchanged.

4. The `POOL` layer performs a downsampling of the input along its spatial dimensions (width & height).

5. The `FC` layer computes the class scores, resulting in a volume of size `[1 x 1 x C]` where `C` is the number of classes.
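As a rough illustration of how the volume changes from layer to layer, the sketch below traces the shapes through one `INPUT -> CONV -> RELU -> POOL -> FC` pass. The function name `trace_shapes` and the example numbers (a `32 x 32 x 3` image, 12 filters, 10 classes) are assumptions made for illustration. It also assumes a `CONV` layer that is zero-padded to keep the spatial size, a `2x2` pooling with stride 2, and an `FC` output of one score per class.

```python
def trace_shapes(width, height, depth, num_filters, num_classes):
    """Return (layer name, volume shape) pairs for INPUT -> CONV -> RELU -> POOL -> FC."""
    shapes = [("INPUT", (width, height, depth))]
    # CONV: assumed zero-padded with stride 1, so only the depth changes.
    shapes.append(("CONV", (width, height, num_filters)))
    # RELU: elementwise max(0, x), the shape is unchanged.
    shapes.append(("RELU", (width, height, num_filters)))
    # POOL: assumed 2x2 window with stride 2, halving width and height.
    shapes.append(("POOL", (width // 2, height // 2, num_filters)))
    # FC: one score per class.
    shapes.append(("FC", (1, 1, num_classes)))
    return shapes


for name, shape in trace_shapes(32, 32, 3, num_filters=12, num_classes=10):
    print(f"{name:5s} {shape}")
```

Only `CONV` and `FC` change the depth, and only `POOL` (and `FC`) change the spatial size in this sketch.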

### **The Convolutional Layer**

 - The core building block of the Neural Net doing most of the computational heavy lifting.

 - The learnable parameters of the Convolutional layer consist of a set of filters, each small spatially (along width & height) but extending through the whole depth of the input volume.

 - The process of convolution involves sliding each filter across the width & height of the input & computing a dot product between the entries of the filter & the input at every spatial position, which creates a 2-D activation map of that filter's responses.

 - The network learns filters that activate when they see some kind of visual feature. Each of the filters in a `CONV` layer produces its own 2-D activation map, & these maps are stacked along the depth dimension to form the output volume.

 - Summary of a `CONV` layer:
    
    1. Accepts a volume of size `W1 × H1 × D1`.

    2. Requires 4 hyperparameters.

        i. Number of filters, `K`.

        ii. Their spatial extent `F`.

        iii. The Stride, `S`.

        iv. The Amount of Zero-Padding, `P`.

    3. Produces a volume of size `W2 × H2 × D2` where:

        - `W2 = (W1−F+2P)/S+1`

        - `H2 = (H1−F+2P)/S+1` (i.e. width and height are computed equally by symmetry)

        - `D2 = K`

    4. With parameter sharing, it introduces `F⋅F⋅D1` weights per filter, for a total of `(F⋅F⋅D1)⋅K` weights and `K` biases.

    5. In the output volume, the _d_-th depth slice (of size `W2×H2`) is the result of performing a valid convolution of the _d_-th filter over the input volume with a stride of `S`, and then offset by the _d_-th bias.
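The summary above can be sketched as a deliberately naive NumPy forward pass. This is a minimal illustration under stated assumptions, not an efficient or definitive implementation. The function names `conv_output_size` and `conv_forward` are hypothetical, the arrays are laid out as `(height, width, depth)`, and the filter is slid without flipping (cross-correlation), which is what most deep learning libraries compute under the name "convolution".

```python
import numpy as np


def conv_output_size(in_size, f, s, p):
    """Apply (W1 - F + 2P)/S + 1 and reject settings that do not fit evenly."""
    span = in_size - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("F, S, P do not tile the input evenly")
    return span // s + 1


def conv_forward(x, filters, biases, stride=1, pad=0):
    """Naive CONV layer.

    x:       input volume, shape (H1, W1, D1)   (layout is an assumption)
    filters: K filters, shape (K, F, F, D1)
    biases:  shape (K,)
    Returns an output volume of shape (H2, W2, K).
    """
    h1, w1, _ = x.shape
    k, f = filters.shape[0], filters.shape[1]
    h2 = conv_output_size(h1, f, stride, pad)
    w2 = conv_output_size(w1, f, stride, pad)
    # Zero-pad only the two spatial dimensions, never the depth.
    x_padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h2, w2, k))
    for d in range(k):                      # d-th filter -> d-th depth slice
        for i in range(h2):
            for j in range(w2):
                r, c = i * stride, j * stride
                patch = x_padded[r:r + f, c:c + f, :]
                # Dot product over the full F x F x D1 patch, then add the bias.
                out[i, j, d] = np.sum(patch * filters[d]) + biases[d]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))           # W1 = H1 = 5, D1 = 3
filters = rng.standard_normal((2, 3, 3, 3))  # K = 2, F = 3
biases = np.zeros(2)
out = conv_forward(x, filters, biases, stride=2, pad=1)
print(out.shape)                   # (3, 3, 2), since (5 - 3 + 2*1)/2 + 1 = 3
print(filters.size, biases.size)   # 54 weights (3*3*3 per filter, 2 filters) and 2 biases
```

The check inside `conv_output_size` reflects that `F`, `S` & `P` constrain each other: for example `W1 = 10`, `F = 3`, `S = 2`, `P = 0` gives a non-integer size, so the filter cannot be placed evenly across the input.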

### **The Pooling Layer**:

A pooling layer is periodically inserted between successive `CONV` layers to progressively reduce the spatial size of the representation, which brings down the number of parameters & the amount of computation in the layers that follow. This also helps in controlling overfitting.

In general, a _Pooling layer_:

1. Accepts a volume of size `W × H × D`.

2. Requires two hyperparameters:

    - The spatial extent of the pooling window, `F`.
    
    - The stride, `S`.

3. Produces a volume of size `W2 × H2 × D2`, where:

    - `W2 = (W−F)/S+1`

    - `H2 = (H−F)/S+1`

    - `D2 = D`

4. Introduces zero parameters since it computes a fixed function of the input.

5. For Pooling layers, it is not common to pad the input using zero-padding.
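The pooling summary above can be sketched in a few lines of NumPy. The fixed function is assumed here to be `max` (max pooling), the function name `max_pool` is hypothetical, and the array layout `(height, width, depth)` is an assumption as well.

```python
import numpy as np


def max_pool(x, f=2, stride=2):
    """Max pooling over F x F windows, applied to every depth slice independently.

    x: input volume of shape (H, W, D). Returns a volume of shape (H2, W2, D).
    """
    h, w, d = x.shape
    h2 = (h - f) // stride + 1
    w2 = (w - f) // stride + 1
    out = np.empty((h2, w2, d), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            r, c = i * stride, j * stride
            # Reduce over the two spatial axes only, so the depth D is preserved.
            out[i, j, :] = x[r:r + f, c:c + f, :].max(axis=(0, 1))
    return out


x = np.arange(16).reshape(4, 4, 1)   # a single 4 x 4 depth slice holding 0..15
pooled = max_pool(x)                 # F = 2, S = 2
print(pooled.shape)                  # (2, 2, 1), since (4 - 2)/2 + 1 = 2
print(pooled[:, :, 0])               # [[ 5  7]
                                     #  [13 15]]
```

The function has no weights to learn, which is why the layer adds zero parameters.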

### **Some More Information On Layers**:

- _Normalization layers_

- _Fully-Connected (FC) layers_

- _Converting FC layers to CONV layers_

- The general pattern of stacked layers looks like:

    ```
    INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
    ```
    
    where the `*` indicates repetition, and the `POOL?` indicates an optional pooling layer. Moreover, `N >= 0` (and usually `N <= 3`), `M >= 0`, `K >= 0` (and usually `K < 3`).

- Layer Sizing pattern

    1. The `INPUT` layer size should be divisible by 2 many times over & some common numbers include - `32`, `64`, `96`, `224`, `384`, `512`.

    2. The `CONV` layers should use small filters like `3x3` or `5x5` with stride = 1, & should be zero-padded such that the original spatial dimensions are kept intact (e.g. `P = 1` for `F = 3`, `P = 2` for `F = 5`).

    3. The `POOL` layers are responsible for downsampling the spatial dimensions, most commonly by max-pooling over `2x2` receptive fields with a stride of 2.
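To make the stacking pattern & the sizing rules concrete, here is a small sketch that expands the pattern for given `N`, `M`, `K` & follows the spatial size through the `CONV` & `POOL` layers. The helper names `build_pattern` and `spatial_trace` are hypothetical, & the defaults (`3x3` filters with `P = 1`, `S = 1`, & `2x2` pooling with stride 2) are assumptions taken from the sizing rules above.

```python
def build_pattern(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:                      # the optional POOL? step
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")
    return layers


def spatial_trace(input_size, layers, conv_f=3, conv_p=1, pool_f=2, pool_s=2):
    """Spatial size (square input assumed) after every CONV and POOL layer."""
    size = input_size
    trace = []
    for name in layers:
        if name == "CONV":
            # Stride 1 with P = (F - 1)/2 keeps the size intact.
            size = (size - conv_f + 2 * conv_p) // 1 + 1
        elif name == "POOL":
            if (size - pool_f) % pool_s != 0:
                raise ValueError(f"size {size} cannot be pooled evenly")
            size = (size - pool_f) // pool_s + 1
        else:
            continue                  # RELU, FC and INPUT are not tracked here
        trace.append((name, size))
    return trace


layers = build_pattern(n=2, m=3, k=1)
print(" -> ".join(layers))
print([size for name, size in spatial_trace(32, layers) if name == "POOL"])  # [16, 8, 4]
```

With these defaults only the `POOL` layers shrink the volume, so an input size that divides by 2 several times (such as `32`) passes through every pooling step cleanly, while a size like `28` fails at the third pooling step because `7` cannot be halved evenly.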

**_Resources_**:

1. Perhaps the most comprehensive & informative piece is the official [CS231n Convolutional Neural Network for Visual Recognition](https://cs231n.github.io/convolutional-networks/#overview) course website from Stanford. Highly recommended, as it answers almost everything about the topic.

2. [A Gentle Introduction to Pooling Layers for Convolutional Neural Networks](https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/)

3. [Intuitively Understanding Convolutions for Deep Learning](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1) is visually appealing & brief enough to grasp the concepts quickly.

4. [Understanding Convolutions](https://colah.github.io/posts/2014-07-Understanding-Convolutions/) explains convolutions by analogy & has good visual descriptions along with examples.