
# Basic Architecture

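The paper: *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* (Vision Transformer, ViT), https://arxiv.org/pdf/2010.11929.pdf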

* A collection of **layers** is called a **block**
* A collection of **blocks** makes up the model architecture


## The main resources in the paper
* The architecture image in **Fig 1**
* The mathematical equations in **Section 3.1**
* The hyperparameters of the model variants in **Table 1**

## Stages 

### 1. Input
* Split the input image into fixed-size patches
* Number the patches so that their position in the image is kept
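
The patch arithmetic can be checked in plain Python. This is a minimal sketch: the helper name `count_patches` is hypothetical, and the 224x224 RGB image with 16x16 patches is an assumed example size, not something stated in these notes. It uses the relation $N = HW / P^{2}$ from Section 3.1.

```python
def count_patches(height, width, channels, patch_size):
    """Return (N, P^2 * C): the number of patches and the length of one flattened patch."""
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)  # N = H * W / P^2
    patch_length = patch_size * patch_size * channels             # P^2 * C
    return num_patches, patch_length


# Assumed example: a 224x224 RGB image cut into 16x16 patches
num_patches, patch_length = count_patches(224, 224, 3, 16)
print(num_patches, patch_length)  # 196 768
```
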
### 2. Embedded Patches
* An embedding converts each *image patch* into a learnable *vector*
### 3. Layer Normalization
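* Called **Norm** in Fig 1 (`LN` in the equations)
* Normalizes the features of each token to zero mean and unit variance, which stabilizes training *(it is sometimes also described as a regularizer that reduces overfitting)*
* `torch.nn.LayerNorm()` gives the **Norm** layer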
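
A small sketch of what `torch.nn.LayerNorm()` does to a batch of tokens. The tensor shape and the embedding size of 8 are assumed values chosen only to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

embedding_dim = 8  # assumed small size, for illustration only
# (batch, tokens, embedding_dim), deliberately shifted and scaled
x = torch.randn(2, 5, embedding_dim) * 3 + 10

layer_norm = nn.LayerNorm(normalized_shape=embedding_dim)
y = layer_norm(x)

# Every token vector is normalized on its own, over the last dimension
print(y.shape)                               # torch.Size([2, 5, 8])
print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).max())   # close to 1
```
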
### 4. Multi-Head Self-Attention (MSA)
### 5. MLP
* A Multi-Layer Perceptron (MLP) block contains
    1. 2 × `nn.Linear()`
    2. 1 × `nn.GELU()` between the two linear layers
    3. `nn.Dropout()` (Appendix B.1 of the paper applies dropout after every dense layer)
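
The list above could be assembled as follows. This is a sketch under assumptions: the class name `MLPBlock` is hypothetical, the sizes 768 and 3072 are the ViT-Base values from Table 1, the dropout rate of 0.1 is an assumed default, and dropout is placed after each linear layer as described in Appendix B.1 of the paper.

```python
import torch
from torch import nn


class MLPBlock(nn.Module):
    """Linear -> GELU -> Dropout -> Linear -> Dropout (hypothetical helper)."""

    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),   # expand to the MLP size
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),   # project back to the embedding size
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.mlp(x)


tokens = torch.randn(1, 197, 768)   # (batch, N + 1 tokens, embedding_dim)
print(MLPBlock()(tokens).shape)     # torch.Size([1, 197, 768])
```
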
### 6. MLP Head
* This is the **output layer**
* This is the **classifier head**


## Mathematical Equations

### Equation 1
\begin{aligned}
\mathbf{z}_{0} &=\left[\mathbf{x}_{\text {class }} ; \mathbf{x}_{p}^{1} \mathbf{E} ; \mathbf{x}_{p}^{2} \mathbf{E} ; \cdots ; \mathbf{x}_{p}^{N} \mathbf{E}\right]+\mathbf{E}_{\text {pos }}, & & \mathbf{E} \in \mathbb{R}^{\left(P^{2} \cdot C\right) \times D}, \mathbf{E}_{\text {pos }} \in \mathbb{R}^{(N+1) \times D}
\end{aligned}

---



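This equation builds the input sequence of the encoder from
* The class token
* The patch embeddings
* The position embeddings
> $\mathbf{E}$ is the learnable patch-embedding matrix and $\mathbf{E}_{\text{pos}}$ is the learnable position embedding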

In vector form it looks like
`x_input = [class_token, image_patch_1, image_patch_2, ..., image_patch_N] + [class_token_pos, image_patch_1_pos, image_patch_2_pos, ..., image_patch_N_pos]`
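
Equation 1 could be sketched in PyTorch as below. The class name `PatchEmbedding` and the default sizes (224x224 images, 16x16 patches, embedding size 768) are assumptions for illustration. The sketch uses a convolution whose kernel size and stride both equal the patch size, which applies one linear projection per patch; note that `nn.Conv2d` also adds a bias term that does not appear in the equation.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Patches -> embeddings, prepend the class token, add position embeddings."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embedding_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # N
        # kernel_size == stride == patch_size: one projection (the role of E) per patch
        self.projection = nn.Conv2d(in_channels, embedding_dim,
                                    kernel_size=patch_size, stride=patch_size)
        self.class_token = nn.Parameter(torch.randn(1, 1, embedding_dim))   # x_class
        self.position_embedding = nn.Parameter(
            torch.randn(1, num_patches + 1, embedding_dim))                 # E_pos

    def forward(self, images):
        x = self.projection(images)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)               # (B, N, D)
        class_token = self.class_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([class_token, x], dim=1)         # (B, N + 1, D)
        return x + self.position_embedding             # z_0


images = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(images).shape)  # torch.Size([2, 197, 768])
```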

### Equation 2
\begin{aligned}
\mathbf{z}_{\ell}^{\prime} &=\operatorname{MSA}\left(\operatorname{LN}\left(\mathbf{z}_{\ell-1}\right)\right)+\mathbf{z}_{\ell-1}, & & \ell=1 \ldots L
\end{aligned}


---




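* For every layer $\ell$ from 1 to $L$, the output of the previous layer is passed through the **Norm** (LN) layer and then the **MSA** layer
* The $+\ \mathbf{z}_{\ell-1}$ term is the **residual connection**: the input of the block is added back to its output
* Pseudo code
    * `output_msa_block = MSA_layer(Norm_layer(x_input)) + x_input`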
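
The pseudo code for Equation 2 could look like this in PyTorch. It is a sketch: the class name `MSABlock` is hypothetical, and the embedding size 768 with 12 heads are the ViT-Base values from Table 1, used here as assumed defaults. `nn.MultiheadAttention` is used as the MSA layer with the same tensor as query, key and value.

```python
import torch
from torch import nn


class MSABlock(nn.Module):
    """z' = MSA(LN(z)) + z (hypothetical helper)."""

    def __init__(self, embedding_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.attention = nn.MultiheadAttention(
            embed_dim=embedding_dim, num_heads=num_heads,
            dropout=attn_dropout, batch_first=True)  # inputs are (B, tokens, D)

    def forward(self, x):
        normed = self.layer_norm(x)
        # Self-attention: query, key and value are all the normalized input
        attn_output, _ = self.attention(query=normed, key=normed, value=normed,
                                        need_weights=False)
        return attn_output + x  # residual connection


x_input = torch.randn(2, 197, 768)
print(MSABlock()(x_input).shape)  # torch.Size([2, 197, 768])
```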

### Equation 3
\begin{aligned}
\mathbf{z}_{\ell} &=\operatorname{MLP}\left(\operatorname{LN}\left(\mathbf{z}_{\ell}^{\prime}\right)\right)+\mathbf{z}_{\ell}^{\prime}, & & \ell=1 \ldots L \\
\end{aligned}

---



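* Same structure as Equation 2, with the **MLP** block in place of the **MSA** layer and the output of Equation 2 as the input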
* Pseudo code
    * `output_mlp_block = MLP_layer(Norm_layer(output_msa_block)) + output_msa_block`
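
A compact sketch of Equation 3 that keeps the Norm layer, the MLP and the residual connection in one module. The class name `MLPResidualBlock` and the default sizes are assumptions (768 and 3072 are the ViT-Base values from Table 1, the dropout rate of 0.1 is an assumed default).

```python
import torch
from torch import nn


class MLPResidualBlock(nn.Module):
    """z = MLP(LN(z')) + z' (hypothetical helper)."""

    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # The residual adds the block input back, so the shape never changes
        return self.mlp(self.layer_norm(x)) + x


output_msa_block = torch.randn(2, 197, 768)
print(MLPResidualBlock()(output_msa_block).shape)  # torch.Size([2, 197, 768])
```

Because of the residual connection, a block whose MLP outputs zeros simply passes its input through unchanged.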

### Equation 4
\begin{aligned}
\mathbf{y} &=\operatorname{LN}\left(\mathbf{z}_{L}^{0}\right) & &
\end{aligned}

---



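* $\mathbf{z}_{L}^{0}$ is the class token (index 0) in the output of the last layer **L**
* It is passed through a **Norm layer** to give the image representation **y**
* The pseudo code (the `linear_layer` is the MLP head applied on top of Equation 4)
    * `y = linear_layer(Norm_layer(output_mlp_block[0]))`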
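
A sketch of this last step in PyTorch. The class name `ClassifierHead` and the 1000 output classes are assumptions. With a batch dimension, the `[0]` in the pseudo code becomes `[:, 0]`, which picks the class token of every image in the batch.

```python
import torch
from torch import nn


class ClassifierHead(nn.Module):
    """y = LN(z_L^0), followed by a linear layer that produces class logits."""

    def __init__(self, embedding_dim=768, num_classes=1000):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, encoder_output):
        class_token = encoder_output[:, 0]   # z_L^0, shape (B, D)
        y = self.layer_norm(class_token)     # Equation 4
        return self.linear(y)                # (B, num_classes)


output_mlp_block = torch.randn(2, 197, 768)   # output of the last encoder layer
print(ClassifierHead()(output_mlp_block).shape)  # torch.Size([2, 1000])
```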