# Vision Transformers (`ViTs`): A Game Changer for `CV` Tasks.

**📑 Timeline:**
1. [What a Vision Transformer is?]().
2. [How Transformers Work?]().
3. [The Transformer with Visual data]().
4. [Vision Transformers]().
5. [Vision Transformers vs CNNs]().
6. [Vision Transformers for Image Classification tasks]().
7. [`PyTorch` code, fine-tuning example with `vit-base-16` for `OxfordFlower102` classification]().
8. [Overall Sum Up and further explanation]().

## [1. What a Vision Transformer is?]()

<img src="https://i0.wp.com/bdtechtalks.com/wp-content/uploads/2022/05/transformer-neural-network.jpg?ssl=1" alt="Example Image" width="800">


#### **A First Early Definition.**

A **`Vision Transformer` (`ViT`)** is a type of neural network architecture that applies the principles of the `Transformer` model, originally designed for `natural language processing` (`NLP`), to `computer vision` tasks. Instead of relying on `convolutional layers` like traditional `CNNs`, **`ViTs` divide images into smaller `patches` and treat them as sequences, similar to how words are treated in `NLP` tasks**.


#### **A Simple Comprehensive Explanation.**

<img src="https://keras.io/img/examples/vision/object_detection_using_vision_transformer/object_detection_using_vision_transformer_11_2.png" alt="Example Image" width="600">

Imagine you're trying to understand a picture by breaking it down into smaller, manageable pieces, like a puzzle. Each **piece**, or **patch**, of the image holds some information, but to really understand the whole picture, you need to look at how these pieces relate to each other.

Traditional `convolutional neural networks` (`CNNs`) are like looking at each piece through a magnifying glass, focusing on small details one at a time. But a `Vision Transformer` (`ViT`) does something different: it looks at all the pieces at once and figures out how they fit together using a method called *"`self-attention`"*. This means that the `ViT` can capture the big picture much more effectively, understanding how different parts of the image relate to each other, even if they are far apart.


#### **A Little Bit of the History.**

`Transformers` were first introduced in 2017 by Vaswani et al. in the paper titled *'[Attention Is All You Need](https://arxiv.org/abs/1706.03762).'* The original `Transformer` model was designed for tasks in `NLP`, such as language translation and text generation. It quickly became the dominant architecture in `NLP` due to its ability to handle long-range dependencies in sequences, leading to groundbreaking results.

Seeing the success of `Transformers` in `NLP`, researchers wondered: could this approach be applied to images too? The answer came in the form of `Vision Transformers`, first proposed by Dosovitskiy et al. in the 2020 paper titled *'[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929).'* This paper showed that `ViTs` could match or even surpass the performance of traditional `CNNs` on various `image classification tasks`, especially when trained on large datasets.


#### **The Motivation Behind `Vision Transformers`.**

The main motivation behind `ViTs` stems from the limitations of `CNNs`. While `CNNs` are excellent at capturing **local features** (like edges or textures), **they struggle with understanding long-range dependencies or global features** (like the relationship between different objects in an image), because each convolution only sees a small neighborhood and global context emerges only after many stacked layers. `ViTs` were introduced to overcome these limitations by leveraging the `self-attention` mechanism, which allows the model to capture both local and global relationships in images.

Another **key motivation** was the success of `Transformers` in `NLP`, where **they had already demonstrated their power in handling sequences of data**. Researchers hypothesized that *images could also be treated as sequences*, specifically sequences of image patches, allowing the powerful `self-attention` mechanism to be applied to `vision` tasks.

## [2. Transformers simply explained.]()

<img src="https://miro.medium.com/v2/resize:fit:1400/1*BHzGVskWGS_3jEcYYi6miQ.png" alt="Example Image" width="600">


#### **A Historical and Chronological Background.**

`Transformers` were introduced in 2017 by Vaswani et al. in the groundbreaking paper titled *Attention Is All You Need*. Before `Transformers`, `NLP` models like `RNNs` and `LSTMs` were widely used, but they had limitations, particularly with long-range dependencies in sequences. `Transformers` revolutionized `NLP` by introducing a new architecture that relies entirely on `self-attention` mechanisms rather than recurrence, enabling models to capture dependencies regardless of their distance in the sequence. This innovation led to superior performance in many `NLP` tasks, such as translation, summarization, and text generation.


#### **What is a `Transformer`?**

A `Transformer` is a deep learning model designed to process sequential data by employing `self-attention` mechanisms **to understand relationships between different parts of the input sequence**. Unlike recurrent models such as `RNNs` and `LSTMs`, which process a sequence one element at a time, `Transformers` **can process entire sequences in parallel**, making them much more efficient to train, especially on large datasets.


#### **The Basic Operation of `Transformers`.**
Simply, `Transformers` operate by **transforming input sequences into output sequences**, understanding the context of each part of the sequence relative to the others. This is achieved through a series of steps:

1. `Input`: The raw sequence data (e.g., words in a sentence).
2. `Embedding`: Each part of the input sequence is converted into a dense vector representation.
3. `Positional Encoding`: Information about the position of each element in the sequence is added.
4. `Self-Attention Mechanism`: The model computes `attention scores` to determine how much focus each element should give to the others in the sequence.
5. `Feed-Forward Networks`: The `attention-processed data` is passed through `fully connected layers` to further process the information.
6. `Output`: The model produces the output sequence, which can be used for tasks like translation or classification.




### **The Architecture.**


<img src="https://aiml.com/wp-content/uploads/2023/09/Annotated-Transformers-Architecture.png" alt="Example Image" width="600">

> `Transformers` consist of two **main parts**: the `Encoder` and the `Decoder`. The `Encoder` processes the input sequence, while the `Decoder` generates the output sequence based on the encoded information.


 Let’s break down each part of this architecture in great detail!

#### **General Architecture.**
`Transformers`, as we mentioned, are composed of an `Encoder`-`Decoder` structure. The `Encoder` processes the input sequence and produces a continuous representation, while the `Decoder` uses this representation to generate the output sequence.
> - `Encoder`: Encodes the input sequence into a set of continuous representations.
> - `Decoder`: Uses these encoded representations to generate the output sequence, such as translating one language to another or predicting the next word in a sequence.

##### **1. The input.**
The input to a `Transformer` is typically *a sequence of `tokens`*. In the context of natural language processing (`NLP`), these `tokens` are usually words or subwords.

Let's take the sentence *`"My cat ate my birthday cake"`* as an example. It can be tokenized word by word:
```python
['My', 'cat', 'ate', 'my', 'birthday', 'cake']
```

##### **2.  Input Embedding.**
Since `Transformers` work with vectors, the first step is **to convert each `token` into a dense vector representation**. This is done through `embedding layers`.

- `Embedding`: A lookup table that maps each token to a vector of fixed size. The vectors capture semantic meaning, where similar words have similar embeddings.

So, considering the previous example, the word *`'cat'`* might be represented as a vector `[0.5, 0.8, 0.1, ...]` in a high-dimensional space.

> **Why Embedding?**
>
> Words that have similar meanings are placed *closer* together in this space. This helps the `Transformer` **understand relationships** between words *based on their **meaning***.
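
As a rough illustration of the lookup-table idea, the sketch below uses a hypothetical three-word table with made-up 4-dimensional vectors (real models learn these values during training and use hundreds of dimensions). The helper names `embed` and `cosine_similarity` are assumptions chosen for this example.

```python
import math

# Hypothetical embedding table; the numbers are invented for illustration.
embedding_table = {
    "cat":  [0.5, 0.8, 0.1, 0.0],
    "dog":  [0.6, 0.7, 0.2, 0.1],
    "cake": [0.0, 0.1, 0.9, 0.7],
}

def embed(tokens):
    """Look up the vector for each token, which is all an embedding layer does."""
    return [embedding_table[t] for t in tokens]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means 'similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, cake = embed(["cat", "dog", "cake"])
# With these made-up vectors, 'cat' sits closer to 'dog' than to 'cake'.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, cake))  # True
```

In a framework such as `PyTorch`, the table would be a trainable weight matrix rather than a hand-written dictionary.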


##### **3.  Positional Encoding.**

<img src="https://machinelearningmastery.com/wp-content/uploads/2022/01/PE3.png" alt="Example Image" width="600">

`Transformers` don’t inherently understand the order of `tokens` because they process the entire sequence in parallel. To give the model a sense of position (i.e., order in the sequence), `positional encoding` is added to the input embeddings.

- `Positional Encoding`: A vector added to each `token`'s embedding, encoding its position in the sequence.

For instance, the first word might get a `positional encoding` `[0.0, 1.0, ...]`, the second `[0.84, 0.54, ...]`, and so on, so that every position receives a different vector.


>  **How it works:**
>
> `Positional encodings` use `sine` and `cosine` `functions` of different frequencies to generate these vectors. This way, the model can distinguish between *`'Cat ate cake'`* and *`'cake ate Cat'`* even though they contain the same words.
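
The sine/cosine scheme can be sketched in a few lines of plain Python. This follows the formula from *Attention Is All You Need* (even dimensions use `sin`, odd dimensions use `cos`); the function name `positional_encoding` and the tiny `d_model=4` are assumptions made for readability.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position.
    Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Every position gets a different vector.
for pos in range(3):
    print(pos, [round(v, 2) for v in positional_encoding(pos, d_model=4)])

# The encoding is simply added to the token's embedding (made-up values here).
embedding = [0.5, 0.8, 0.1, 0.0]
with_position = [e + p for e, p in zip(embedding, positional_encoding(1, 4))]
```

Because the same word at two different positions now yields two different input vectors, the model can tell word orders apart.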


#### **4. Encoder.**
The `Encoder` is responsible for **transforming the input sequence into a set of continuous representations that capture the relationships** between all `tokens`.

The `Encoder` consists of several identical layers (typically 6-12), each containing two main components:

1. `Multi-Head Self-Attention Mechanism`.
2. `Feed-Forward Neural Network`.

After each of these components, an `Add & Norm` operation is applied **to maintain stability** during training.

##### **4.1. Attention Mechanism (`Multi-Head Attention`).**

`Attention Mechanism` is the **core innovation** of the `Transformer`. It allows the model **to weigh the importance of different `tokens` relative to each other** when processing a sequence.

<img src="https://media.licdn.com/dms/image/D5612AQHGqpYzGg5Rgw/article-cover_image-shrink_600_2000/0/1710049461853?e=2147483647&v=beta&t=_pyabqkK_m74LhbD-IiewTcfZ0cJUsUeu7t6htJMgNU" alt="Example Image" width="600">


- `Self-Attention`: In `self-attention`, each `token` in the input sequence considers every other `token` to decide *how much attention to pay to each one*. This is crucial for understanding context.

Consider the example sentence: *`'The cat sat on the mat'`*. The word *`'cat'`* might pay attention to *`'sat'`* and *`'mat'`* more than *`'on'`*.


<img src="https://www.researchgate.net/publication/345482934/figure/fig2/AS:955463785013258@1604811726722/Internal-structure-of-the-Multi-Headed-Self-Attention-Mechanism-in-a-Transformer-block.png" alt="Example Image" width="600">

- `Multi-Head Attention`: Instead of computing a single `attention score`, the `Transformer` computes several `attention scores` in parallel, called `heads`. Each `head` can focus on different aspects of the input sequence.

> `Multiple attention heads` allow the model to capture various relationships in the data. For example, one `head` might focus on grammatical structure, while another might focus on meaning.

> **How it works:**
>
> 1. `Query` - `Key` - `Value`: Each `token` is transformed into three vectors.
>     - `Query`: What the token is looking for in other tokens.
>     - `Key`: What other tokens have to offer.
>     - `Value`: The actual content carried by the token.
> 2. `Attention Calculation`: The `Query` vector of each token is compared with the `Key` vectors of all tokens (via a dot product) to calculate attention scores. After a `softmax`, these scores are used to weigh the `Value` vectors, producing the final attention output.
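
The `Query`/`Key`/`Value` computation can be sketched as scaled dot-product attention. This is a minimal version in plain Python, with made-up vectors for three tokens. In a real model the three sets of vectors come from learned linear projections of the token embeddings, which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this token's Query with every Key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the Value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Made-up Query/Key/Value vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Each output row is a blend of all `Value` vectors, weighted by how well that token's `Query` matches each `Key`.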

##### **4.2. Add & Norm.**
After the `attention mechanism`, the output is passed through an `Add & Norm` layer:
- `Add`: The original input to the attention mechanism is added back to the attention output (this is called a `residual connection`).
- `Normalization` (`Norm`): The result is normalized to ensure stable training. This is usually done using `Layer Normalization`.

> Why❓
>
> ---
> The `residual connection` helps mitigate the `vanishing gradient problem`, and `normalization` ensures the model trains effectively by keeping the outputs on a consistent scale.
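
A minimal sketch of `Add & Norm` for a single token vector is shown below. It assumes the simplest form of layer normalization (zero mean, unit variance) and omits the learned scale and shift that real layers include. The function names and input values are made up for illustration.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one token's vector to zero mean and (almost) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection (Add) followed by layer normalization (Norm)."""
    summed = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    return layer_norm(summed)

token_in = [1.0, 2.0, 3.0, 4.0]        # input to the attention sub-layer
attention_out = [0.5, -0.5, 1.5, 0.5]  # made-up attention output
normed = add_and_norm(token_in, attention_out)
print([round(x, 3) for x in normed])
```

Because the input is added back before normalizing, the sub-layer only has to learn a correction to its input rather than a full replacement.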

##### **4.3. Feed-Forward Neural Network (`FNN`).**
After the `attention mechanism`, the data is processed by a `feed-forward neural network`, which applies additional transformations to the data.
- `Structure`: A simple `two-layer FNN` with a `ReLU` activation in between.

> **How It Works:**
>
> This layer takes the attention-processed data and applies more complex, non-linear transformations to extract higher-level features.


After the `feed-forward network`, another `Add & Norm` operation is applied to combine the input and output of this layer.
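
The two-layer structure with a `ReLU` in between can be sketched as follows. The weights are made up and tiny (model dimension 2, hidden dimension 3). A trained model would learn them and typically use a hidden dimension several times larger than the model dimension.

```python
def linear(vector, weights, bias):
    """Fully connected layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def relu(vector):
    """Keep positive values, replace negative ones with zero."""
    return [max(0.0, x) for x in vector]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand to the hidden size, apply ReLU, project back to the model size."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# Made-up weights for illustration only.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]
b2 = [0.0, 0.1]

print(feed_forward([2.0, -1.0], w1, b1, w2, b2))  # [2.0, 1.1]
```

The same network is applied to every token independently; only the attention step mixes information between tokens.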

-----
### How the `Encoder` Works: A Deep Dive.

The `Encoder` works by processing the input sequence through multiple layers of attention and feed-forward networks, refining the representation of the sequence at each step. Each layer of the `Encoder` produces a new set of representations that capture more complex relationships and features in the data.

The whole process in steps:
1. `Input Tokens`: Start with the tokenized input sequence.
2. `Embedding`: Convert tokens to vectors.
3. `Positional Encoding`: Add position information.
4. `Self-Attention`: Each token attends to every other token, learning relationships.
5. `Feed-Forward`: Transform the attention output for higher-level feature extraction.
6. `Repeat`: Pass through multiple layers of attention and feed-forward networks.

`Output`: The final output of the `Encoder` is a set of vectors representing each token, enriched with contextual information from the entire sequence.


---

#### **5. Decoder.**
The `Decoder` **generates the output sequence by using the encoded information from the `Encoder`**. The `Decoder` is also composed of several layers similar to the `Encoder`, but with some key differences.

- Structure of the `Decoder`:
    1. `Masked Multi-Head Self-Attention`.
    2. `Multi-Head Attention over Encoder’s Output`.
    3. `Feed-Forward Neural Network`.

##### **5.1. Output Embeddings.**
The `Decoder` starts by embedding the output sequence tokens (which might be partially generated so far, or a start token if generating from scratch).

##### **5.2.  Positional Encoding.**
`Positional encodings` are added to the output embeddings to maintain the order of the sequence, just like in the `Encoder`.

##### **5.3. Masked Multi-Head Self-Attention.**
In the `Decoder`, `self-attention` is applied to the output sequence so far, but with a twist: it’s masked.

- `Masked Attention`: The `mask` ensures that the model only attends to the tokens before a given position, preventing it from "cheating" by looking at future tokens in the sequence. For example, if you’re generating the third word, you can only consider the first and second words.

> Why Masking ❓
>
> ---
> `Masking` ensures that each position’s prediction depends only on earlier `tokens`, never on future ones, which wouldn’t be available during actual `prediction` when the sequence is generated one `token` at a time.
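
One common way to implement the mask is to set the scores of forbidden positions to negative infinity before the `softmax`, so their weights become exactly zero. The sketch below assumes the usual convention that position `i` may attend to positions `0..i`. Since the decoder input is shifted right by one token, this still excludes the token being predicted. The function names and the equal raw scores are made up.

```python
import math

def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; blocked ones get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [0.0, 0.0, 0.0, 0.0]  # made-up, equal raw attention scores
for i, row in enumerate(mask):
    print(i, [round(w, 2) for w in masked_softmax(scores, row)])
```

Each printed row spreads its attention only over positions up to its own index; everything to the right receives zero weight.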

##### **5.4. Add & Norm.**
As before, the result of the `masked attention` is combined with the original input via a `residual connection` and then `normalized`.

##### **5.5. Multi-Head Attention over Encoder’s Output.**
The `Decoder` also has a second multi-head attention mechanism, but this one focuses on the `Encoder`’s output.

- `Cross-Attention`: Here, the `Decoder`’s tokens attend to the `Encoder`’s tokens, allowing the model to align the input and output sequences. For instance, this helps the model understand which input tokens correspond to which output tokens during translation.

> The `Decoder` uses the context provided by the `Encoder` to generate each token in the output sequence.

##### **5.6. Add & Norm.**
Again, the output from the attention over the Encoder’s output is combined with the input (via residual connection) and normalized.

##### **5.7. Feed-Forward Neural Network.**
Similar to the `Encoder`, the `Decoder` has a `feed-forward neural network` layer to apply further transformations.

- `Add & Norm` (Again): Another residual connection and normalization after the feed-forward layer.

##### **5.8. Linear Layer and Softmax.**
Finally, the output of the `last Decoder layer` is passed through a `linear layer` and then a `softmax` function:
- `Linear Layer`: Projects the Decoder output into a higher-dimensional space, usually the size of the vocabulary in the case of language models.
- `Softmax`: Converts the linear layer’s output into probabilities, representing the likelihood of each possible next token.


`Output Probabilities`: The Decoder outputs a probability distribution over all possible next tokens. With greedy decoding, the highest-probability token is selected as the next token in the sequence; sampling and beam search are common alternatives.
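
The linear-plus-softmax step can be sketched with a hypothetical four-word vocabulary. The decoder output vector and the projection weights below are made up; a real model would have one weight row per vocabulary entry, often tens of thousands of rows.

```python
import math

vocabulary = ["cat", "ate", "cake", "<end>"]  # hypothetical vocabulary

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(decoder_output, weights, bias):
    """Linear layer (one weight row per vocabulary entry) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, decoder_output)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Made-up decoder output (dimension 2) and projection weights (4 x 2).
decoder_output = [1.0, -0.5]
weights = [[0.2, 0.1], [1.5, -1.0], [0.3, 0.8], [-0.5, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]

probs = next_token_distribution(decoder_output, weights, bias)
# Greedy choice: pick the index with the highest probability.
best = max(range(len(vocabulary)), key=lambda i: probs[i])
print(vocabulary[best])  # ate
```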

-----
### How the `Decoder` Works: A Deep Dive.
The `Decoder` works by taking in the encoded representation from the Encoder and the partially generated output sequence. It uses masked self-attention to process the output sequence so far and cross-attention to align it with the input sequence. Each layer of the `Decoder` refines this process, ensuring that the final output is a well-formed sequence that closely matches the desired target.

The whole process in steps:
1. `Input Sequence`: Start with a partially generated sequence or a start token.
2. `Embedding`: Convert tokens to vectors.
3. `Positional Encoding`: Add position information.
4. `Masked Self-Attention`: Attend to tokens up to the current position.
5. `Cross-Attention`: Attend to the Encoder’s output for context.
6. `Feed-Forward`: Further transform the attention output.
7. `Repeat`: Pass through multiple layers of masked attention, cross-attention, and feed-forward networks.

`Final Output`: A sequence of tokens, each generated one by one, forming the final output like a translated sentence or predicted text.
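
The token-by-token loop can be sketched without any trained model. In the example below, `toy_next_token` is a hypothetical stand-in for the whole decoder stack: it just copies the input sequence and then emits an end token. Only the control flow is meant to be representative.

```python
def toy_next_token(encoder_tokens, generated):
    """Stand-in for a trained decoder (hypothetical): copies the input one
    token at a time, then signals the end of the sequence."""
    position = len(generated) - 1  # tokens produced so far, excluding <start>
    if position < len(encoder_tokens):
        return encoder_tokens[position]
    return "<end>"

def greedy_decode(encoder_tokens, max_length=10):
    """Generate one token per step, feeding each new token back as input."""
    generated = ["<start>"]
    while len(generated) < max_length:
        token = toy_next_token(encoder_tokens, generated)
        if token == "<end>":
            break
        generated.append(token)
    return generated[1:]  # drop the <start> token

print(greedy_decode(["My", "cat", "ate", "my", "birthday", "cake"]))
```

The `max_length` guard matters in practice: a real model is not guaranteed to ever produce the end token.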


----
### Simpler than you think 😉.


Imagine a `Transformer` as a magical translator for sentences. When given the phrase *`'The cat ate my birthday cake'`* it first turns each word into a unique, numerical code through embedding. It then adds a special code to each word to remember its position in the sentence, like adding page numbers to a storybook. As it reads the sentence, the Transformer uses self-attention to let every word look at every other word, understanding that `'cat'` is linked to `'ate'` and `'cake'`. It processes these relationships through multiple layers of feed-forward networks, ensuring each word gets the context it needs. If the `Transformer` is translating or generating text, it uses this understanding to predict the next word or phrase, starting from a special beginning token and carefully choosing each subsequent word based on its learned knowledge. The result is a coherent and contextually accurate sentence or translation, showcasing the Transformer’s ability to weave together the story from its intricate understanding of each word's role.

## [3. Transformers for Visual data.]()

#### **The Idea.**
`Transformers`, originally designed for Natural Language Processing (`NLP`), have shown incredible success in tasks like translation and text generation. The core idea behind `Vision Transformers` (`ViTs`) is to leverage this same powerful architecture for `computer vision` tasks, such as `image classification`, `object detection`, and `image segmentation`. Instead of processing text sequences, `ViTs` treat an image as a sequence of smaller patches, enabling the model to capture complex patterns and relationships within visual data.

#### **A Little Bit of History.**
The concept of applying Transformers to visual data is relatively new. Before `ViTs`, Convolutional Neural Networks (`CNNs`) dominated the field of computer vision. `CNNs` are highly effective at capturing spatial hierarchies in images, but they have limitations, especially when it comes to handling global context and long-range dependencies within an image. In 2020, the `Vision Transformer` model was introduced by researchers at Google, marking a significant shift in how deep learning models can be applied to visual data.

#### **The Big Key Differences.**
The fundamental difference between `CNNs` and `ViTs` lies in **how they process images**:
- **`CNNs`**: Use `convolutional layers` to extract local features from images. They excel at capturing spatial hierarchies but often require deep architectures to capture global context.
- **`ViTs`**: Divide the image into `fixed-size patches` and treat them as input `tokens`, similar to words in a sentence. These patches are then processed using the `Transformer`’s `self-attention` mechanism, which allows the model to capture both local and global features simultaneously.

#### **The Motive.**
The **primary motivation** behind using `Transformers` for vision tasks **is their ability to model relationships across an entire image, no matter the distance between pixels.** This is particularly useful for tasks that require understanding the global context of an image, like `image classification` or `segmentation`. Additionally, `Transformers` can scale better with data and computational resources, often outperforming `CNNs` when trained on large datasets.

#### **The State-of-the-Art Result.**
`Vision Transformers` have quickly become state-of-the-art for various vision tasks. For instance, they have achieved top performance on benchmarks like `ImageNet`, often surpassing traditional `CNN` architectures. Their ability to handle large-scale data and learn complex patterns makes them ideal for modern `computer vision` challenges.

#### **The Biggest Innovations.**
- **`Patch Embeddings`**: Images are split into smaller patches (e.g., `16 x 16` pixels), which are then linearly embedded into vectors. This allows the `Transformer` to process images in much the same way as it processes text.
- **`Self-Attention Mechanism`**: Enables the model to focus on different parts of the image when making predictions, capturing both local and global features.
- **Scalability**: `ViTs` can handle large datasets and can be scaled up efficiently, making them highly effective in real-world applications.

#### **The Terminology for Vision Transformers.**
- **`Patch Embedding`**: The process of splitting an image into fixed-size patches and embedding each patch into a vector that the model can process.
- **`Self-Attention`**: A mechanism that allows the model to weigh the importance of different patches in relation to each other.
- **`Positional Encoding`**: Since Transformers don’t have a built-in sense of order, positional encodings are added to the input patches to give the model information about the spatial relationships within the image.
- **`Layer Normalization`**: A technique used to stabilize and speed up the training of deep neural networks by normalizing the inputs to each layer.


## [4. Vision Transformers (`ViTs`).]()

<img src="https://www.researchgate.net/publication/357885173/figure/fig1/AS:1113907477389319@1642587646516/Vision-Transformer-architecture-main-blocks-First-image-is-split-into-fixed-size.png" alt="Example Image" width="800">


> ***As we previously mentioned...***
>
> *`Vision Transformers` (`ViTs`) are a type of deep learning model that applies the `Transformer` architecture, originally designed for `NLP` tasks, to `computer vision` problems. Instead of processing text sequences, `ViTs` treat images as sequences of smaller patches, allowing the model to capture complex patterns and relationships across the entire image.*

`Vision Transformers` (`ViTs`) typically do not use a `decoder` as part of their architecture. Instead, they focus on a streamlined approach that leverages the `self-attention` mechanism for `image classification` tasks.

In `NLP` (`Natural Language Processing`), the `Transformer` architecture consists of an `encoder-decoder` structure where the `decoder` generates output sequences based on encoded information. This approach is particularly useful for tasks like [*`machine translation`*](https://www.techtarget.com/searchenterpriseai/definition/machine-translation), where the model needs to produce a sequence of words from another sequence.

For `Vision Transformers`, the **primary task** **is `image classification` rather than `sequence generation`**. The goal is to classify an entire image into a category based on its content. Instead of generating new sequences, `Vision Transformers` directly produce class predictions from the processed image features.


#### **The Paper and a Quick Review.**
The concept of `Vision Transformers` was introduced in the paper [*'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by Alexey Dosovitskiy et al., published in 2020*](https://arxiv.org/abs/2010.11929). The paper demonstrated that, with enough data and computational power, `Vision Transformers` could outperform traditional `CNNs` on `Image Classification` tasks. The key innovation was treating images as sequences of patches, enabling the use of self-attention mechanisms to model long-range dependencies within an image.


### The Architecture.
Now, after the previous comprehensive explanation of Transformers for sequence data, understanding `Vision Transformers` (`ViT`s) should be quite straightforward. Let's dive into the architecture of `Vision Transformers` in detail.

> **Encoder: The Core of `Vision Transformers`.**
>
> In Vision Transformers (`ViTs`), the `encoder` is the heart of the architecture, where the magic happens.


#### **1. Embedded Patches as Input from the Image.**
The first step in a `Vision Transformer` is to split the input image into smaller, non-overlapping patches. These patches are typically square-shaped (e.g., `16 x 16` pixels), and they represent small sections of the original image. Each patch is then flattened into a one-dimensional vector and passed through a linear projection layer. This layer converts the flattened patches into a set of dense vectors, known as patch embeddings. These embeddings are the `ViT`'s equivalent of word embeddings in `NLP`, transforming raw visual data into a format that the `Transformer` can process.
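
To make the split, flatten, and project steps concrete, here is a small sketch on a `4 x 4` single-channel "image" with `2 x 2` patches. The helper names and the projection weights are made up for illustration; a real `ViT` works on `16 x 16` RGB patches and learns the projection.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping
    patches in row-major order. Assumes H and W are multiples of patch_size."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

def project(patch, weights):
    """Linear projection of a flattened patch (bias omitted for brevity)."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

image = [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]
patches = split_into_patches(image, patch_size=2)
print(patches)  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]

# Made-up projection from 4 pixel values to a 2-dimensional embedding.
weights = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, -1.0]]
patch_embeddings = [project(p, weights) for p in patches]
print(patch_embeddings[0])  # [2.5, -5.0]
```

The result is a sequence of four vectors, one per patch, which plays the same role as a sequence of word embeddings.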

#### **2. Adding Positional Embeddings.**
Unlike `CNNs`, `Transformers` do not inherently understand the order or position of the input tokens (or patches in this case). To provide this information, `positional embeddings` are added to each patch embedding. These `positional embeddings` are vectors that encode the spatial information of each patch within the original image, ensuring the model can interpret the relative positions of the patches. This step is crucial because the spatial arrangement of patches is what helps the model understand the overall structure and content of the image.
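
A minimal sketch of this step: each patch index gets its own position vector, which is added element-wise to the patch embedding. In the original `ViT` these position vectors are learned parameters; here they are small random numbers, which is purely an assumption for illustration, as are the embedding values.

```python
import random

random.seed(0)  # reproducible made-up values

num_patches, dim = 4, 3
# One position vector per patch index (learned in a real ViT, random here).
position_embeddings = [[random.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(num_patches)]

def add_positions(patch_embeddings, position_embeddings):
    """Element-wise sum of each patch embedding with its position vector."""
    return [[p + q for p, q in zip(patch, pos)]
            for patch, pos in zip(patch_embeddings, position_embeddings)]

# The first two patches have identical content (say, two patches of plain sky).
patch_embeddings = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5],
                    [0.1, 0.9, 0.2], [0.3, 0.3, 0.3]]
tokens = add_positions(patch_embeddings, position_embeddings)
print(tokens[0] != tokens[1])  # True: same content, different positions
```

Without this step, shuffling the patches would leave the set of input vectors unchanged, so the model could not tell where anything is.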

#### **3. Normalization (Layer `Norm`).**
Before the `patch embeddings` enter the `attention mechanism`, they undergo a process called layer `normalization`. This technique adjusts and scales the embeddings so that they have a consistent mean and variance. Normalization is essential for stabilizing the training process, as it helps the model converge more efficiently by ensuring that the inputs to each layer are on a similar scale. This step helps mitigate issues like exploding or vanishing gradients, which can hinder the model’s ability to learn effectively.

#### **4. Multi-Head Attention.**
The core of the `Transformer` architecture is the `Multi-Head Attention` mechanism. In this step, the model attends to different parts of the image simultaneously. Each `head` within the attention mechanism focuses on different aspects of the image patches, learning various relationships and patterns within the data. For example, one head might focus on the texture within a patch, while another might look at the edges or boundaries between patches. This parallel processing allows the model to capture complex interactions between different parts of the image, making it highly effective at understanding visual content.
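
The idea of several heads working in parallel can be sketched by slicing each patch vector into equal parts, running attention on each slice independently, and concatenating the results. This is a simplified sketch: the learned `Query`/`Key`/`Value` and output projections of a real model are omitted (treated as identity), and the patch embeddings are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    result = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                           for k in keys])
        result.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return result

def multi_head_self_attention(patches, num_heads):
    """Attend within each slice of the embedding, then concatenate the heads."""
    head_dim = len(patches[0]) // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [p[h * head_dim:(h + 1) * head_dim] for p in patches]
        head_outputs.append(attention(part, part, part))
    # Join the per-head outputs back into one vector per patch.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(patches))]

# Three made-up patch embeddings of dimension 4, processed with 2 heads.
patch_embeddings = [[1.0, 0.0, 0.5, 0.5],
                    [0.0, 1.0, 0.5, -0.5],
                    [1.0, 1.0, -0.5, 0.5]]
output = multi_head_self_attention(patch_embeddings, num_heads=2)
print(len(output), len(output[0]))  # 3 4
```

Each head computes its own attention weights from its own slice, which is what lets different heads specialize in different relationships between patches.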

#### **5. Normalization Again.**
After the `attention mechanism` processes the patch embeddings, the data is passed through another layer of `normalization`. This second normalization step is similar to the first, ensuring that the output from the attention mechanism is stable and ready for further processing. By normalizing the data again, the model maintains consistency and prevents any imbalance that might have occurred during the attention phase.


#### **6. Feed-Forward Neural Network (`MLP`).**
Once the data has been attended to and normalized, it is passed through a `feed-forward neural network`, often referred to as a `Multi-Layer Perceptron` (`MLP`). The `MLP` introduces non-linearities to the model, enabling it to learn more intricate and complex patterns that linear operations alone cannot capture. Typically, the `MLP` consists of two linear layers with a non-linear activation function in between (the original `ViT` uses `GELU`; `ReLU` is a common alternative), which allows the model to refine the features extracted from the image patches. This process enhances the model’s ability to distinguish between different types of visual information.

#### **7. Repeating the Process.**
The sequence of `Multi-Head Attention` and `MLP` operations is repeated multiple times within the encoder. Each repetition, or "layer," builds upon the previous one, progressively refining the features and representations learned from the image patches. As the data moves through these layers, the model develops a deeper and more nuanced understanding of the image content, gradually improving its ability to make accurate predictions.

#### **8. Final Output.**
After passing through the multiple layers of `attention` and `MLP`, the `final output` of the encoder is a set of rich, high-level features that encapsulate the essential information from the image. These features are then used for `downstream tasks` such as `image classification`. In the original `ViT`, a learnable `[class]` token is prepended to the patch sequence, and its final representation serves as the summary of the whole image. In the classification task, an additional layer, often called the `MLP head`, maps this representation to the final class labels, completing the image recognition process.

> **Takeaways.**
>
> ---
> - `ViTs` process images by breaking them down into patches, each treated as a token similar to words in `NLP`, allowing the model to handle visual data as a sequence.
> - `Positional embeddings` are critical for helping the model understand the spatial relationships between patches, enabling it to reconstruct the overall structure of the image.
> - `Multi-Head Attention` allows the model to focus on multiple aspects of the image simultaneously, capturing complex relationships and patterns across patches.
> - `Normalization` is a crucial step that stabilizes the training process and ensures consistent scaling, facilitating efficient learning.
> - `MLP layers` introduce `non-linearities`, allowing the model to learn intricate patterns that are essential for accurate image recognition.



### A Simple Fully Explained Example.
Let's walk through a simple example of how a `Vision Transformer` (`ViT`) processes an image. Suppose we have an image of a cat, and we want to classify it. Here’s how the process would unfold:

1. `Splitting the Image into Patches`: The `256 x 256` cat image is divided into small `16 x 16` pixel patches. This results in `256` patches, each capturing a small part of the image.
2. `Flattening and Embedding the Patches`: Each `16 x 16` patch is flattened into a vector and converted into a lower-dimensional representation (embedding) with, for example, `512` values.
3. `Adding Positional Information`: Positional embeddings are added to each patch’s vector to retain the spatial information about where each patch is located in the original image.
4. `Multi-Head Attention`: The model uses multi-head attention to compare each patch with all other patches, learning how they relate to each other, such as how an ear patch relates to a whisker patch.
5. `Passing Through Transformer Layers`: The processed patch embeddings are passed through several Transformer layers, refining the model’s understanding of complex features like the cat’s shape and texture.
6. `Making the Classification`: The final output vector from the Transformer is used by a classifier to predict the image’s class. For a cat image, it would output a high probability for `'cat'`.
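
Steps 1 to 3 above can be sketched in a few lines of `NumPy`. This is only an illustration under stated assumptions: the image is random data standing in for the cat photo, and the projection and positional embeddings are random stand-ins for weights that a real `ViT` would learn. The helper name `image_to_patches` is hypothetical.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, one column per pixel value
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))        # stand-in for the 256 x 256 RGB cat image
patches = image_to_patches(image, 16)    # step 1: 256 patches of 768 values each

embed_dim = 512
# Random stand-ins for the learned projection and positional embeddings
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
position_embeddings = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))

# Steps 2 and 3: embed each patch, then add its positional embedding
tokens = patches @ projection + position_embeddings
print(patches.shape, tokens.shape)
```

The resulting `tokens` array is the sequence that the attention layers in steps 4 and 5 would operate on.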



#### **The Best Parts.**
- **`Global Context`**: `ViTs` capture global information across the entire image, making them powerful for understanding complex visual scenes.
- **`Scalability`**: They perform exceptionally well with large datasets and can leverage substantial computational resources for improved accuracy.
- **`Flexibility`**: `ViTs` are versatile and can be adapted for various vision tasks beyond classification.

> **`Pros`**:
1. Superior performance on large datasets.
2. Effective at modeling long-range dependencies.
3. Can be applied to various vision tasks.


> **`Cons`**:
1. Requires a large amount of data and computational power.
2. May not perform as well on smaller datasets without large-scale pre-training and careful fine-tuning.


#### The **Key Questions** You Must Remember.

1. **`What is a Vision Transformer?`**: A model that applies the Transformer architecture to images by treating them as sequences of patches.
2. **`How do Vision Transformers work?`**: By processing sequences of image patches with self-attention mechanisms to capture global and local features.
3. **`How do Vision Transformers process image data?`**: Images are divided into patches, embedded, and processed through Transformer layers to extract features and make predictions.
4. **`What does the attention mechanism actually do?`**: It lets every patch weigh its relationship to every other patch, so the model can focus on different parts of the image simultaneously.
5. **`What is patching?`**: Splitting an image into smaller, fixed-size pieces (patches).
6. **`What is positional encoding?`**: Adding information about the position of each patch to its embedding to maintain spatial awareness.
7. **`Why don't Vision Transformers use a Decoder?`**: Classification needs a single label rather than a generated sequence, so the encoder alone is enough; a class token aggregates information from all patches.
8. **`When to use a Vision Transformer?`**: When working with large datasets where capturing global context is crucial.
9. **`Why is it considered state-of-the-art?`**: It leverages self-attention to model complex visual relationships and patterns, which has led to strong results when enough data is available.
10. **`Where to use it?`**: In image classification, object detection, and other vision tasks, particularly when large-scale data is available.
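
To make question 4 more concrete, here is a minimal single-head sketch of scaled dot-product self-attention in `NumPy`. It is an illustration rather than the exact layer used inside a `ViT`: the weight matrices `w_q`, `w_k`, and `w_v` are random stand-ins for learned parameters, and only `4` tokens of size `8` are used to keep the shapes readable.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    queries, keys, values = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Similarity of every token with every other token, scaled by sqrt(dim)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
num_patches, dim = 4, 8
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.shape)   # one attention weight per (patch, patch) pair
print(output.shape)    # same shape as the input sequence
```

Each row of `weights` describes how strongly one patch attends to all the others, which is what lets distant patches influence each other in a single layer.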



## [5. Vision Transformers (`ViTs`) vs Convolutional Neural Networks (`CNNs`).]()

<img src="https://media.licdn.com/dms/image/D4D12AQHRTZBK_PA2lw/article-cover_image-shrink_600_2000/0/1699420769277?e=2147483647&v=beta&t=Cv_NW4GQwMeHN-heBHIdJLmjS0rwX5ccu1tlv8JC8so" alt="Example Image" width="800">


### **The Pros and Cons of Each.**

#### **Vision Transformers (`ViTs`).**  
`Vision Transformers` come with several advantages. One of the biggest strengths is their ability to capture global context across an entire image, allowing them to understand complex and long-range dependencies. Additionally, `ViTs` are highly scalable; they tend to perform better as the dataset size increases, which makes them particularly effective for large-scale data. They are also very flexible, easily adapting to various visual tasks beyond just image classification, such as object detection and segmentation.

However, `Vision Transformers` have their downsides. They are often data-hungry, requiring large amounts of data to perform well. Without sufficient data, they may struggle to generalize effectively. Furthermore, `ViTs` are computationally intensive, demanding significant memory and processing power due to their self-attention mechanism. The training process for ViTs is also more complex, requiring careful tuning of hyperparameters and large training datasets 🔧.
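
The cost of self-attention can be made tangible with a little arithmetic: every token attends to every other token, so the attention matrix grows quadratically with the number of patches. The sketch below assumes square images, a patch size of `16`, and one extra class token; the helper name is hypothetical.

```python
def attention_matrix_entries(image_size, patch_size):
    """Return (number of tokens, entries in one attention matrix)."""
    num_tokens = (image_size // patch_size) ** 2 + 1   # patches + class token
    return num_tokens, num_tokens ** 2

for image_size in (224, 384, 512):
    tokens, entries = attention_matrix_entries(image_size, patch_size=16)
    print(f"{image_size} x {image_size}: {tokens} tokens, "
          f"{entries:,} attention entries per head per layer")
```

Doubling the image side length roughly quadruples the token count and multiplies the attention matrix size by about sixteen, which is why memory use rises so quickly at higher resolutions.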

#### **Convolutional Neural Networks (`CNNs`).**  
On the other hand, `Convolutional Neural Networks` have proven to be very efficient, especially on smaller datasets. They are excellent at capturing local patterns, such as edges and textures, which are crucial for image recognition. `CNNs` also generally have lower computational demands compared to `ViTs`, making them faster and more efficient to train and deploy.

However, `CNNs` are not without limitations. They may struggle with understanding global relationships across an image due to their local receptive fields. Additionally, `CNNs` rely on fixed inductive biases like translation equivariance and locality, which may not be ideal for every task. There’s also a risk of overfitting, particularly on small datasets, if the model is not properly regularized.
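
The point about local receptive fields can be illustrated with a simplified calculation. The sketch assumes a plain stack of stride-1, undilated convolutions with no pooling; real `CNNs` use strides and pooling, so their receptive fields grow considerably faster than this.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels) of stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_cover(image_size, kernel_size=3):
    """Smallest number of stacked layers whose receptive field spans the image."""
    layers = 0
    while receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers

print(receptive_field(5))      # pixels seen by one output unit after 5 layers
print(layers_to_cover(224))    # layers needed to span a 224-pixel side
```

By contrast, a single self-attention layer already connects every patch to every other patch, which is the global context discussed above.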

#### **When to use `ViTs` and when `CNNs`?**

**Use `Vision Transformers`** if your project involves a large dataset with diverse and complex images that require an understanding of global context 🖼️. They are also the right choice if you need a model that can easily adapt to various tasks, such as classification, object detection, or segmentation 🎯. However, ensure that you have the computational resources necessary for training and deploying large models 🚀.

**Use `CNNs`** if you have a smaller dataset or limited computational resources 🧑‍💻. `CNNs` are particularly effective for tasks that benefit from localized feature extraction, like identifying simple patterns or textures 🎨. They are also ideal if you need a model that is quick to train and efficient to deploy, especially on edge devices 🖥️.

#### **The Big Fight: `ViTs` vs. `CNNs` 💣🥊.**

The debate between `Vision Transformers` and `CNNs` often boils down to the specific needs of your task. `Vision Transformers` excel in scenarios that require a deep understanding of global relationships across an image, but they do come with the demands of large datasets and significant computational power🏋️. In contrast, `CNNs` remain the go-to choice for tasks where local feature extraction is key, particularly when resources are constrained. While `Vision Transformers` are rapidly evolving and starting to close the gap in areas traditionally dominated by `CNNs`, the latter still provides a robust and efficient solution for many real-world applications.

#### **Some Thoughts 💭.**

In summary, while `Vision Transformers` are making significant strides in computer vision, `CNNs` remain a strong contender, particularly in scenarios where computational efficiency and localized feature extraction are critical. The choice between `ViTs` and `CNNs` should be guided by the specific requirements of your task, the data you have, and the computational resources available 💡. In some cases, a hybrid approach or even an ensemble of both may yield the best results, leveraging the strengths of each architecture to complement the other.

## [6. `Vision Transformers` for Image Classification Tasks.]()

<img src="https://www.researchgate.net/publication/352093979/figure/fig2/AS:1053302234034182@1628138230759/Samples-of-flowers-species-from-the-Oxford-102-Flower-1.jpg" alt="Example Image" width="800">

#### **Overview / Abstract.**
`Vision Transformers` (`ViTs`) have revolutionized the approach to image classification by leveraging self-attention mechanisms traditionally used in NLP tasks. Unlike traditional `Convolutional Neural Networks` (`CNNs`), `ViTs` treat an image as a sequence of patches, allowing the model to capture global relationships between different parts of an image. This approach has proven to be highly effective for large-scale image classification tasks, achieving state-of-the-art results across various benchmarks.

#### **How a `ViT` Processes an Image.**
In `Vision Transformers`, an image is first divided into a grid of patches, each of which is flattened into a vector. These vectors are linearly embedded, a classification token is prepended to the sequence, and positional encodings are added to retain the spatial information. The resulting sequence is then passed through multiple layers of self-attention and feed-forward networks, where each layer allows the model to focus on different parts of the image, capturing both local and global context. Finally, the model outputs a prediction based on the classification token, which has aggregated information from all patches.

#### **The Goal in One Paragraph.**
The **primary goal** of using `Vision Transformers` for `image classification` is to leverage their ability to capture long-range dependencies and global context within an image, leading to more accurate and robust predictions. By moving away from the localized processing of `CNNs`, `ViTs` aim to improve the model's understanding of the overall structure and relationships in an image, which is particularly beneficial for complex classification tasks involving large and diverse datasets.

#### **The `step-by-step` Process.**
1. **`Image to Patches`:** The input image is split into a grid of fixed-size patches (e.g., `16 x 16` pixels). Each patch is flattened into a vector.
2. **`Patch Embedding`:** These vectors are linearly projected into the model's embedding dimension, creating a sequence of patch embeddings.
3. **`Classification Token`:** A special, learnable classification token is prepended to the sequence. It interacts with the other tokens throughout the Transformer layers and aggregates information from all patches.
4. **`Positional Encoding`:** Positional encodings are added to each embedding in the sequence to retain information about the position of each patch within the original image.
5. **`Transformer Layers`:** The sequence is passed through several layers of the Transformer encoder, where each layer consists of multi-head self-attention and feed-forward networks. These layers enable the model to focus on different parts of the image, capturing both local and global context.
6. **`Output Layer`:** The final output from the classification token is passed through a fully connected layer to produce the final classification output.
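
The shapes involved in this process can be traced with a short `NumPy` sketch. It assumes `196` patches (a `224 x 224` image with `16 x 16` patches), an embedding size of `768`, and `102` output classes; all arrays are random or zero stand-ins for learned parameters, and the Transformer encoder itself is deliberately left out as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_classes = 196, 768, 102

# Output of the patch-embedding step (random stand-in)
patch_embeddings = rng.normal(size=(num_patches, embed_dim))
# Class token and positional embeddings are learned in a real model
class_token = np.zeros((1, embed_dim))
position_embeddings = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim))

# Prepend the class token, then add positional information
sequence = np.concatenate([class_token, patch_embeddings], axis=0) + position_embeddings

# Placeholder: a real model applies the Transformer encoder layers here
encoded = sequence

# The head only reads the class-token position (index 0)
head_weights = rng.normal(scale=0.02, size=(embed_dim, num_classes))
logits = encoded[0] @ head_weights
probabilities = np.exp(logits - logits.max())
probabilities /= probabilities.sum()

print(sequence.shape, logits.shape)
```

Note that the sequence has one more entry than there are patches, and that only the class-token output reaches the classification layer.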

#### **The Different Pretrained Models and Their Key Architecture Differences.**

Several pretrained `ViT` models 🧩 are available, each tailored to different datasets and tasks. For instance:

- **`ViT-B/16`:** A base model with `12 layers`, `12 attention heads`, and a `patch size of 16 x 16`. Suitable for general-purpose image classification on medium to large datasets.
- **`ViT-L/32`:** A larger model with `24 layers`, `16 attention heads`, and a `patch size of 32 x 32`. This model offers more capacity and is better suited for extremely large datasets.
- **`DeiT` (`Data-efficient Image Transformers`):** A variant of `ViT` **that is designed to work efficiently even with smaller datasets**, thanks to the use of strong data augmentation techniques and training strategies.
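
To get a feel for how these configurations differ in size, the rough estimate below counts the encoder parameters of a `ViT`. The hidden sizes (`768`, `1024`) and feed-forward sizes (`3072`, `4096`) are the published Base and Large settings, which are not stated in the list above; the function itself is a hypothetical helper that assumes `224 x 224` RGB inputs and excludes the classification head.

```python
def vit_parameter_count(layers, hidden, mlp, patch, image=224, channels=3):
    """Rough parameter count of a ViT encoder (classification head excluded)."""
    tokens = (image // patch) ** 2 + 1                        # patches + class token
    patch_embedding = patch * patch * channels * hidden + hidden
    embeddings = patch_embedding + hidden + tokens * hidden   # + class token + positions
    attention = 4 * (hidden * hidden + hidden)                # Q, K, V and output projections
    feed_forward = 2 * hidden * mlp + mlp + hidden            # two linear layers with biases
    layer_norms = 2 * 2 * hidden                              # two layer norms per block
    per_layer = attention + feed_forward + layer_norms
    return embeddings + layers * per_layer + 2 * hidden       # + final layer norm

configs = {
    "ViT-B/16": dict(layers=12, hidden=768, mlp=3072, patch=16),
    "ViT-L/32": dict(layers=24, hidden=1024, mlp=4096, patch=32),
}
for name, cfg in configs.items():
    print(f"{name}: ~{vit_parameter_count(**cfg) / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the formula: the heads split the hidden dimension between them, so changing the head count alone leaves the parameter count unchanged.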


## [7. `PyTorch` code, fine-tuning example with `vit-base-16` for `OxfordFlower102` classification.]()

In this example, we dive into the fascinating world of image classification using a Vision Transformer model. While we utilize [Hugging Face]() to conveniently load a pre-trained transformer, the rest of the setup, training, and evaluation is done manually to give you a deeper understanding of the underlying processes. In a [future tutorial](), we'll explore Hugging Face in greater detail, showcasing how it can simplify our workflow even further.

We’re working with a dataset of labeled images, which allows our model to learn how to recognize patterns and make accurate predictions. The Vision Transformer we’re using is designed to process images effectively by leveraging attention mechanisms, making it powerful for classification tasks.

It’s worth noting that we’ve chosen not to implement data augmentation in this tutorial. This decision stems from the need to keep the code straightforward and focused on the core concepts of model training and evaluation. However, data augmentation can be an excellent way to enhance model performance in more complex scenarios.


#### **Understanding our Data.**
<img src="https://user-images.githubusercontent.com/16590868/69524725-064d5180-0f67-11ea-8e35-f4153513f379.png" alt="Example Image" width="600">

The `Oxford Flowers 102` dataset is a comprehensive collection designed for the automated classification of flowers across 102 distinct categories. The images in this dataset were gathered through web searches and direct photography, ensuring a diverse representation of each flower species. Each category contains a minimum of 40 images, providing sufficient samples for training machine learning models.

> 🔗Read more about this dataset [here](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/).

Now that we've had a quick introduction, let’s delve into the details of the code.

1. Install the dependent packages:

In [None]:
!pip install torch torchvision transformers evaluate datasets

2. Import the required libraries:

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import torchvision
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision import datasets
from torch.utils.data import random_split, DataLoader
from transformers import ViTForImageClassification
from PIL import Image

This section brings in all the necessary libraries for building a deep learning model. It includes PyTorch for computations and torchvision for handling images. Plus, it pulls in the Vision Transformer from Hugging Face to make image classification easier.

3. Base Initializations:
    - Initialize some basic Hyperparameters.
    - Define the data transformations.
    - Download the dataset and feed it into the dataloaders.

In [None]:
# Define Hyperparameters
batch_size = 32
num_classes = 102
num_epochs = 15
learning_rate = 0.001

# Data transformations
# Mean/std of 0.5 per channel match the preprocessing of the google/vit-base-patch16-224 checkpoint
data_transforms = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.5, 0.5, 0.5],
        std=[0.5, 0.5, 0.5])
])

# Load the datasets
train_dataset = torchvision.datasets.Flowers102(root='./data', split="train", transform=data_transforms, download=True)
valid_dataset = torchvision.datasets.Flowers102(root='./data', split="val", transform=data_transforms, download=True)
test_dataset = torchvision.datasets.Flowers102(root='./data', split="test", transform=data_transforms, download=True)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

4. Load the Model:

In [None]:
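# Load the pre-trained Vision Transformer (ViT-Base, 16 x 16 patches, 224 x 224 input)
model_name = "google/vit-base-patch16-224"
# num_labels builds a new 102-class head; ignore_mismatched_sizes skips loading
# the checkpoint's 1000-class ImageNet head, whose shape no longer matches.
vitP16 = ViTForImageClassification.from_pretrained(
    model_name, num_labels=num_classes, ignore_mismatched_sizes=True)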

Here, we load a pre-trained Vision Transformer model that’s ready for image classification. We specify the model name and set it up to recognize the number of classes we have. This means the model is all set to start working with our specific dataset!

5. Move the model to the available device:

In [None]:
# Move model to the appropriate device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vitP16.to(device)

Next, we check if we have a `GPU` available for faster computations. If we do, we use it; otherwise, we stick with the `CPU`. This helps ensure our model runs efficiently on the right hardware.

6. Define Loss Criterion and Optimization function:

In [None]:
# Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(vitP16.parameters(), lr=learning_rate)

We define the loss function, which helps us measure how well our model is doing during training. The optimizer will adjust the model’s weights based on the loss. This combo is crucial for guiding the model to learn effectively!

7. Set Up a learning rate Scheduler:

In [None]:
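# Learning rate scheduler with warm-up (OneCycleLR).
# NOTE: total_steps assumes 20 epochs of per-batch steps, but the loop below calls
# scheduler.step() once per epoch, so only the early warm-up part is ever reached.
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.001, total_steps=len(train_loader) * 20)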

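Here, we set up a learning rate scheduler to adjust how quickly our model learns during training. `OneCycleLR` first raises the learning rate (warm-up) and then lowers it again, which can improve performance compared with a constant rate. Just a note: the schedule is configured for many more steps than the training loop actually takes, so only the beginning of the warm-up is used here.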
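
To see how little of the schedule is used, the sketch below approximates the default two-phase cosine behaviour of `OneCycleLR` in plain Python. It is a rough re-implementation for illustration, not the library code, and it assumes the default settings (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`), a training split of `1,020` images, and a batch size of `32`.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate the two-phase cosine schedule of OneCycleLR (defaults assumed)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1
    last_step = total_steps - 1

    def cosine(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1)
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= warmup_end:
        return cosine(initial_lr, max_lr, step / warmup_end)
    return cosine(max_lr, min_lr, (step - warmup_end) / (last_step - warmup_end))

batches_per_epoch = math.ceil(1020 / 32)   # assumed dataset size and batch size
total_steps = batches_per_epoch * 20       # same formula as in the cell above

# Stepping once per epoch for 15 epochs only reaches scheduler steps 0..15
for step in (0, 5, 10, 15):
    print(step, f"{one_cycle_lr(step, total_steps, max_lr=0.001):.2e}")
```

Under these assumptions the learning rate stays far below `max_lr` for the whole run. Calling `scheduler.step()` after every batch, with `total_steps` set to `len(train_loader) * num_epochs`, would traverse the full cycle instead; whether that helps for this dataset would need to be checked experimentally.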

8. Train the model and adapt it to the new task:

In [None]:
# Train and Validate
best_validation_accuracy = 0.0

# Fine-tuning
for epoch in range(num_epochs):
    vitP16.train()
    running_loss = 0.0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = vitP16(inputs).logits
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Validation loop
    vitP16.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for inputs, labels in valid_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = vitP16(inputs).logits
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        validation_accuracy = 100 * correct / total

        # Save the model if validation accuracy improved
        if validation_accuracy > best_validation_accuracy:
            best_validation_accuracy = validation_accuracy
            torch.save(vitP16.state_dict(), "best_model.pth")

    print(f"Epoch [{epoch + 1}/{num_epochs}] "
          f"Loss: {running_loss / len(train_loader):.4f} "
          f"Validation Accuracy: {validation_accuracy:.2f}%")

    # Adjust learning rate
    scheduler.step()

In this fine-tuning process, the pre-trained Vision Transformer model is trained on the specific flower dataset to adapt its learned features for this task. Each epoch involves passing batches of images through the model to calculate the loss, which is then minimized by updating the model's weights. After each epoch, the model is evaluated on a validation set to measure accuracy, and the best-performing model so far is saved for later use. The learning rate is also adjusted once per epoch by the scheduler.
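
The accuracy measured in the validation loop is plain top-1 accuracy: the class with the largest logit is the prediction, and the score is the percentage of predictions that match the labels. A tiny `NumPy` sketch with made-up logits for three classes shows the same computation outside of `PyTorch`; the function name and the numbers are illustrative only.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    predicted = logits.argmax(axis=1)
    return 100.0 * (predicted == labels).mean()

# Made-up logits for 4 samples and 3 classes
logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 1.5, 0.2],
                   [0.0, 0.2, 0.1],
                   [1.2, 0.4, 0.9]])
labels = np.array([0, 1, 2, 0])

print(f"{top1_accuracy(logits, labels):.2f}%")
```

Here the third sample is misclassified (class `1` scores highest while the label is `2`), so three of the four predictions are correct.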

9. Load the best model for testing:

In [None]:
# Load the best model for testing
vitP16.load_state_dict(torch.load("best_model.pth", map_location=device))

We load the best model we saved earlier, which ensures we retain the highest accuracy achieved during training. This way, we can utilize the most effective version of our model without losing the valuable progress we've made. Now, we can test its performance on new, unseen data and evaluate how well our model has truly learned!

10. Test the model's accuracy on unseen data:

In [None]:
# Testing the model
vitP16.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = vitP16(inputs)
        _, predicted = torch.max(outputs.logits, 1)
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()

test_accuracy = 100 * test_correct / test_total
print(f"Test Accuracy: {test_accuracy:.2f}%")

Congratulations on finishing the tutorial! You've now got a solid understanding of Vision Transformers and are ready to apply your skills in image classification!

## [8. Overall Sum Up and Further Explanation: Vision Transformers (ViTs)]().
- ***Changing the Game in CV Tasks 🚀***

In this journey, we’ve explored how Vision Transformers (`ViTs`) are reshaping the landscape of computer vision. Here’s a quick and clear recap of what we covered:

1. **What is a Vision Transformer? 🌐**: `ViTs` bring the power of transformers—originally made for NLP—into image processing. They treat an image as a sequence of patches, rather than using convolutions like `CNNs`.
2. **How Transformers Work ⚙️**: Transformers rely on self-attention to learn relationships between inputs, allowing them to capture global context more directly than the local receptive fields of CNNs.
3. **`ViTs` with Visual Data 🖼️**: In ViTs, images are divided into patches, and these are processed by the self-attention mechanism. This allows ViTs to capture both local and global features.
4. **Vision Transformers (`ViTs`) vs. `CNNs` ⚔️**: ViTs offer better scalability and capture global dependencies more effectively than CNNs, but they need more data to perform at their best.
5. **ViTs for Image Classification 📊**: `ViTs` can be fine-tuned for tasks like image classification, as we saw with the OxfordFlowers102 dataset, yielding impressive results after training.
6. **Fine-Tuning with ViT: PyTorch Code Example**: We fine-tuned a pre-trained ViT (`vit-base-16`) for flower classification. Fine-tuning lets us adapt the general knowledge from the pre-trained model to a specific task, delivering strong performance.

### **Final Thoughts 🎯**
ViTs are a major leap forward in computer vision. While CNNs are still strong contenders, ViTs offer unique advantages—especially with large datasets. They’re flexible, scalable, and ready to push the boundaries of what we can do in CV tasks.

💡 **In short**: Vision Transformers are here to stay, and they’re opening up exciting new possibilities for the future of computer vision!