# Classic CNN architectures

| Architecture             | Year | Description |
|--------------------------|------|-------------|
| LeNet-5                  | 1998 | 2 conv layers and 3 FC layers |
| AlexNet                  | 2012 | 5 conv layers and 3 FC layers, with ReLU activations and dropout |
| VGG                      | 2014 | Stacks of small (3x3) conv filters, 16 or 19 layers |
| GoogLeNet/Inception      | 2014 | Inception modules (multiple filter sizes and pooling operations in parallel), 22 layers |
| ResNet                   | 2015 | Residual blocks with shortcut connections, very deep (up to 152 layers) |
| U-Net                    | 2015 | Medical image segmentation; autoencoder-like encoder-decoder with skip connections |
| Faster R-CNN             | 2015 | Region-based object detection with a region proposal network |
| DenseNet                 | 2016 | Multiple dense blocks with transition layers in between |
| YOLO                     | 2016 | Real-time, single-stage object detection |
| Mask R-CNN               | 2017 | Extension of Faster R-CNN that adds a branch for predicting object masks |
| EfficientNet             | 2019 | Compound scaling of depth, width, and input resolution |
| Vision Transformer (ViT) | 2020 | Self-attention over image patches in place of the traditional convolutional layers |

# Inception

The Inception network (also known as GoogLeNet) is a deep CNN architecture designed for image classification.

## Inception module

The Inception module is the key building block of the Inception network.


Key Points:

- Multi-scale processing: Inception modules enable the network to process input data in parallel and capture various features at different scales by applying convolutional filters of different kernel sizes and pooling operations simultaneously.

- Dimensionality reduction: Before applying larger convolutional filters, 1x1 convolutions are used to reduce the number of channels in the input data, which reduces computational complexity and memory usage.

- Concatenation of outputs: The outputs from the different convolutional and pooling branches are concatenated channel-wise, allowing the network to learn a richer and more diverse set of features.

- Increased network depth and width: The Inception module's design allows for deeper and wider architectures than traditional CNNs at a manageable computational cost, which can improve the network's expressiveness and its ability to learn complex patterns.

![image.png](attachment:image.png)

A simple Inception module has the following components:

- 1x1 conv branch

- 3x3 conv branch, preceded by 1x1 conv for dim reduction

- 5x5 conv branch, preceded by 1x1 conv for dim reduction

- Max-pooling branch, followed by 1x1 conv for dim reduction

The outputs of all these branches are concatenated to form the final output of the Inception module.
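
To make the effect of the 1x1 reductions concrete, the sketch below counts output channels and convolution weights for one module. The branch widths are taken from the "inception (3a)" row of the GoogLeNet paper and are used only as illustrative values; `conv_weights` is a hypothetical helper, and biases are ignored.

```python
# Illustrative branch widths (assumption: GoogLeNet "inception (3a)" values).
in_ch = 192
b1x1 = 64                 # 1x1 conv branch
red3, b3x3 = 96, 128      # 1x1 reduction, then 3x3 conv
red5, b5x5 = 16, 32       # 1x1 reduction, then 5x5 conv
pool_proj = 32            # max-pooling, then 1x1 conv


def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


# Channel-wise concatenation: output width is the sum of the branch widths.
out_ch = b1x1 + b3x3 + b5x5 + pool_proj

# Module as described above, with 1x1 convs around the expensive branches.
with_reduction = (
    conv_weights(1, in_ch, b1x1)
    + conv_weights(1, in_ch, red3) + conv_weights(3, red3, b3x3)
    + conv_weights(1, in_ch, red5) + conv_weights(5, red5, b5x5)
    + conv_weights(1, in_ch, pool_proj)
)

# Naive variant: 3x3 and 5x5 convs applied directly to all input channels.
# The pooling branch has no weights here, but it would also pass all
# in_ch channels through to the concatenation.
without_reduction = (
    conv_weights(1, in_ch, b1x1)
    + conv_weights(3, in_ch, b3x3)
    + conv_weights(5, in_ch, b5x5)
)

print(out_ch, with_reduction, without_reduction)
```

With these widths the module outputs 256 channels and uses 163,328 weights, compared with 387,072 for the naive variant without 1x1 reductions.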

<img src='https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-22_at_3.22.39_PM.png' />

## 1x1 convolution

<img src= 'https://www.baeldung.com/wp-content/uploads/sites/4/2020/06/3D_1D_cropped.gif'/>

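A 1x1 convolution (pointwise convolution) is a convolution with a 1x1 kernel. At each spatial position it computes a linear combination of the input channels, so it mixes information across channels without looking at neighboring pixels.

Use cases:

- Dimensionality reduction: A 1x1 convolution with fewer output channels than input channels reduces the channel count while preserving the spatial dimensions, which lowers the cost of the layers that follow.

- Increasing non-linearity: A 1x1 convolution followed by an activation function, placed between other convolutional layers, adds more non-linearity to the network, potentially improving its expressiveness.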
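
A minimal NumPy sketch of the idea, assuming a channels-first layout `(C, H, W)`; the function name `conv1x1` and the tensor sizes are made up for illustration. It shows that a 1x1 convolution is the same weight matrix applied to the channel vector of every pixel.

```python
import numpy as np


def conv1x1(x, w, b=None):
    """Pointwise convolution.

    x: input of shape (C_in, H, W)
    w: weights of shape (C_out, C_in)
    b: optional bias of shape (C_out,)
    Returns an array of shape (C_out, H, W).
    """
    y = np.einsum("oc,chw->ohw", w, x)
    if b is not None:
        y = y + b[:, None, None]
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))   # 64 input channels
w = rng.standard_normal((16, 64))     # reduce to 16 channels

y = conv1x1(x, w)                     # spatial size is unchanged
relu_y = np.maximum(y, 0.0)           # activation adds the non-linearity

# Equivalent view: one matrix-vector product per pixel.
pixel = w @ x[:, 3, 5]
print(y.shape, np.allclose(y[:, 3, 5], pixel))
```

The channel count drops from 64 to 16 while the 8x8 spatial grid is preserved, which is the dimensionality-reduction use case.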

# ResNet (Residual Networks)

## Residual block

<img src='https://neurohive.io/wp-content/uploads/2019/01/resnet-e1548261477164.png' style='width: 50%; height: auto;'/>

Skip Connection:
$z = x + F(x)$

Where:
- $z$: Output of the residual block
- $x$: Input to the residual block
- $F(x)$: Residual function learned by the intermediate layers

Derivative of the output w.r.t. $x$:

$$\frac{dz}{dx} = \frac{dx}{dx} + \frac{dF(x)}{dx}=1+\frac{dF(x)}{dx}$$

No matter what the derivative of $F$ is, the identity branch always contributes a constant $1$ to the derivative, so the gradient is never scaled down on that path.
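
The derivative above can be checked numerically with a toy scalar example. The residual branch `F` below is hypothetical (a scaled `tanh` chosen so that its derivative is tiny); it is a sketch of the formula, not a real ResNet block.

```python
import math


def F(x, w=1e-3):
    """Hypothetical residual branch with a very small derivative."""
    return w * math.tanh(x)


def residual_block(x):
    """z = x + F(x)"""
    return x + F(x)


def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x0 = 0.5
d_branch = numerical_derivative(F, x0)             # dF/dx, on the order of 1e-3
d_block = numerical_derivative(residual_block, x0)  # dz/dx

print(d_branch, d_block)
```

Even though the branch derivative is close to zero, the block derivative stays close to 1, matching $1 + \frac{dF(x)}{dx}$.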


## Skip-connection

Skip-connections, also known as shortcut connections or residual connections, are a key component of ResNet.

They enable the direct flow of information from earlier layers to later layers by "skipping" one or more intermediate layers in the network.


- Identity mapping: Skip-connections create an identity mapping between the input and output of a block of layers. They allow the output to be the sum of the input and the residual learned by the block, promoting the learning of **residual functions** instead of the entire transformation.

- Addressing the vanishing gradient through direct gradient flow: Skip-connections create direct paths for the gradient to flow from the output layer back to earlier layers during backpropagation. This mitigates the shrinking of the gradient as it moves through the network, so the weights in earlier layers are updated more effectively, making it easier to train deeper networks.

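- Smoother loss landscape: Networks with skip-connections tend to have a smoother loss surface, which makes optimization easier.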

<img src='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*_Qd_txKxRlsMdfuH2J-k4g.png'/>
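
The vanishing-gradient point can be illustrated with a toy stack of scalar layers. Everything here is an assumption made for the sketch: each layer is $F(x) = w \tanh(x)$ with the same positive weight, and the gradient with respect to the input is accumulated with the chain rule. Real networks use vectors and learned weights, so this only shows the trend.

```python
import math


def layer(x, w):
    """Hypothetical scalar layer F(x) = w * tanh(x); returns (F(x), F'(x))."""
    t = math.tanh(x)
    return w * t, w * (1.0 - t * t)


def input_gradient(x, weights, use_skip):
    """d(output)/d(input) for a stack of scalar layers, via the chain rule."""
    grad = 1.0
    for w in weights:
        f, df = layer(x, w)
        if use_skip:
            x, local = x + f, 1.0 + df   # residual layer: z = x + F(x)
        else:
            x, local = f, df             # plain layer: z = F(x)
        grad *= local
    return grad


weights = [0.5] * 30                     # 30 layers, each with |F'| <= 0.5
g_plain = input_gradient(1.0, weights, use_skip=False)
g_skip = input_gradient(1.0, weights, use_skip=True)

print(f"plain: {g_plain:.3e}  with skip-connections: {g_skip:.3e}")
```

In the plain stack the gradient is a product of 30 factors no larger than 0.5, so it collapses toward zero. With skip-connections every factor is $1 + F'(x)$, which in this toy setup never drops below 1.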

# Convolutional AutoEncoder (CAE)

## Architecture

A convolutional autoencoder has an architecture similar to a standard autoencoder, except that the FC layers are replaced by convolutional layers:

- encoder (conv layers and pooling layers)

- decoder (transpose conv and upsampling)

![image.png](attachment:image.png)
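
To see how the encoder shrinks the feature map and the decoder restores it, the sketch below traces the spatial size through a small hypothetical CAE. The layer choices (3x3 convs with padding 1, 2x2 pooling, 2x2 transposed convs with stride 2) and the 28x28 input are assumptions for illustration; the two formulas are the standard output-size formulas for convolution and transposed convolution with dilation 1.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


size = 28  # assumed input height/width

# Encoder: (3x3 conv, padding 1) -> (2x2 max-pool, stride 2), twice.
e1 = conv_out(size, kernel=3, padding=1)
p1 = conv_out(e1, kernel=2, stride=2)
e2 = conv_out(p1, kernel=3, padding=1)
p2 = conv_out(e2, kernel=2, stride=2)   # bottleneck

# Decoder: 2x2 transposed conv with stride 2, twice.
d1 = conv_transpose_out(p2, kernel=2, stride=2)
d2 = conv_transpose_out(d1, kernel=2, stride=2)

print([size, p1, p2, d1, d2])
```

The spatial size goes 28, 14, 7 in the encoder and 14, 28 in the decoder, so the reconstruction has the same size as the input.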

## Application

The convolutional architecture makes CAEs well suited for processing data with spatial structure, such as images.

- Dimensionality reduction

- Image denoising

- Feature learning

- Anomaly detection: By training a CAE on normal data samples, it can learn to reconstruct typical samples accurately. Anomalies can then be detected by measuring the reconstruction error between the input and the reconstructed output for new samples.

- Image inpainting: CAEs can be used to fill in missing or corrupted parts of an image by learning to reconstruct the complete image from the available data.

- Generative models: CAEs can be adapted into generative models, such as Variational Autoencoders (VAEs) or Adversarial Autoencoders (AAEs), to generate new data samples similar to the training data.

- Image segmentation: CAEs can be extended to other architectures, like U-Net, to learn pixel-level classification for tasks such as semantic segmentation.
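
The anomaly-detection recipe above (train on normal data, then threshold the reconstruction error) can be sketched without a deep learning library. Note the substitution: instead of a trained CAE, this example uses a linear autoencoder obtained from PCA as a stand-in, on synthetic data. The data, the 99th-percentile threshold, and all function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 16-dim samples lying near a 2-dim subspace.
basis = rng.standard_normal((2, 16))
train = rng.standard_normal((500, 2)) @ basis + 0.01 * rng.standard_normal((500, 16))

# Stand-in for a trained CAE: a linear autoencoder from PCA (not a real CAE).
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]


def reconstruct(x):
    code = (x - mean) @ components.T   # "encoder"
    return code @ components + mean    # "decoder"


def reconstruction_error(x):
    """Per-sample mean squared error between input and reconstruction."""
    return np.mean((x - reconstruct(x)) ** 2, axis=1)


# Threshold chosen from the errors on normal training data (assumed rule).
threshold = np.percentile(reconstruction_error(train), 99)

normal_test = rng.standard_normal((5, 2)) @ basis + 0.01 * rng.standard_normal((5, 16))
anomalies = 3.0 * rng.standard_normal((5, 16))   # samples off the subspace

is_anomaly = reconstruction_error(anomalies) > threshold
print(is_anomaly)
```

With a real CAE, `reconstruct` would be replaced by a forward pass through the trained encoder and decoder; the error computation and thresholding stay the same.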