# Image Classification with Transformers

## Introduction

In this notebook, we use the `transformers` library to fine-tune a pre-trained Vision Transformer (`ViT`) for image classification. We fine-tune the model on `CIFAR-10`, a dataset of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images.

`ViT` applies the transformer architecture to image classification. It divides an image into fixed-size patches, embeds each patch as a token, and processes the resulting sequence of tokens with a transformer encoder. A classification head then predicts the class of the image. The model is pre-trained on `ImageNet`, a large dataset of natural images.

The architecture of the `ViT` model is as follows:

1. Input Embeddings: The input image is divided into non-overlapping patches. Each patch is flattened and linearly projected to the model's hidden dimension. For an image of height $H$ and width $W$ with square patches of side $P$, the number of patches is $N = HW / P^2$. For example, a 224x224 image with 16x16 patches gives $N = 196$.
2. Positional Embeddings: The model uses learnable positional embeddings to encode the position of each patch in the image.
3. Transformer Encoder: The model uses a transformer encoder to process the patches. The encoder is a stack of identical layers, each built from:
    - Multi-Head Self-Attention: The model uses multi-head self-attention to capture the relationships between different patches in the image.
    - Feed-Forward Neural Network: The model uses a feed-forward neural network to process the output of the self-attention layer.
    - Residual Connection: The model wraps each sub-layer in a residual connection. In `ViT`, layer normalization is applied before each sub-layer (pre-norm) rather than after it.
4. Classification Head: The model uses a classification head to predict the class of the image.
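
The first two steps can be sketched in NumPy. This is a minimal illustration, not the `transformers` implementation: the patch size of 4, the hidden dimension of 64, and the random weights are assumptions chosen for a CIFAR-10-sized input, and `patchify` is a hypothetical helper.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for one CIFAR-10 image
patch_size = 4                    # assumed for illustration
hidden_dim = 64                   # assumed for illustration

patches = patchify(image, patch_size)
num_patches, patch_dim = patches.shape   # N = H * W / P^2 = 64, P * P * C = 48

# Randomly initialised stand-ins for the learned parameters.
W_patch = rng.normal(0.0, 0.02, (patch_dim, hidden_dim))
b_patch = np.zeros(hidden_dim)
position_embeddings = rng.normal(0.0, 0.02, (num_patches, hidden_dim))

# Linear patch embedding plus one learned position vector per patch.
x = patches @ W_patch + b_patch + position_embeddings
print(patches.shape, x.shape)
```

Each row of `x` is one token that the encoder then processes.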

The formulas for the different blocks are as follows:

1. Embedding Block: The embedding block consists of a linear layer to embed the patches, followed by a positional embedding layer.
    - `x = patch_embeddings(x) + position_embeddings`
    - `patch_embeddings`: Linear layer to embed the patches
      - $PE_{(patch, i)} = xW_{patch} + b_{patch}$
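    - `position_embeddings`: Positional embeddings to encode the position of each patch. `ViT` learns these as a parameter matrix with one row per token position. The original Transformer instead uses fixed sinusoidal encodings:
      - $PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$
      - $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$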

2. Transformer Encoder Block: The transformer encoder block consists of a multi-head self-attention layer, followed by a feed-forward neural network.
    - `self_attention`: Multi-head self-attention layer
      - $Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
      - $MultiHead(Q, K, V) = concat(head_1, head_2, \dots, head_n)W^O$
    - `feed_forward`: Feed-forward neural network
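      - $FFN(x) = GELU(xW_1 + b_1)W_2 + b_2$ (the original Transformer uses ReLU here, $max(0, xW_1 + b_1)$; `ViT` uses GELU)
      - `W_1`, `b_1`: Weight and bias of the first linear layer
      - `W_2`, `b_2`: Weight and bias of the second linear layer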
    - `LayerNorm(x)`: Layer normalization
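      - $LayerNorm(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension, and $\gamma$, $\beta$ are learned scale and shift parameters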
    - `Residual(x, y)`: Residual connection
      - $Residual(x, y) = x + y$
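
The encoder block can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the `transformers` implementation: it uses the pre-norm ordering and GELU activation found in `ViT`, omits the query/key/value biases and dropout, and all sizes, parameter names, and random weights are made up for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each token over its feature dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(x, p, num_heads):
    n, d = x.shape
    d_k = d // num_heads

    def split_heads(t):
        # (n, d) -> (num_heads, n, d_k)
        return t.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    q = split_heads(x @ p["W_q"])
    k = split_heads(x @ p["W_k"])
    v = split_heads(x @ p["W_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (num_heads, n, n)
    heads = softmax(scores) @ v                        # (num_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # concatenate the heads
    return concat @ p["W_o"]

def feed_forward(x, p):
    return gelu(x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]

def encoder_block(x, p, num_heads):
    # Pre-norm: LayerNorm before each sub-layer, residual add after it.
    x = x + multi_head_attention(layer_norm(x, p["ln1_g"], p["ln1_b"]), p, num_heads)
    x = x + feed_forward(layer_norm(x, p["ln2_g"], p["ln2_b"]), p)
    return x

rng = np.random.default_rng(0)
num_tokens, d_model, d_ff, num_heads = 64, 64, 256, 4   # illustrative sizes

def w(*shape):
    # Small random weights standing in for trained parameters.
    return rng.normal(0.0, 0.02, shape)

params = {
    "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
    "W_v": w(d_model, d_model), "W_o": w(d_model, d_model),
    "W_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
    "W_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
    "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
}

x = rng.normal(size=(num_tokens, d_model))
out = encoder_block(x, params, num_heads)
print(out.shape)
```

The block maps a `(num_tokens, d_model)` array to an array of the same shape, which is what allows several such blocks to be stacked.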

[This paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_FDViT_Improve_the_Hierarchical_Architecture_of_Vision_Transformer_ICCV_2023_paper.pdf) proposes a novel hierarchical architecture for vision transformers called FDViT to address the challenge of high computational costs in vision transformers. The key ideas are:

1. Introducing a flexible downsampling (FD) layer that is not limited to integer stride, allowing for smooth reduction of spatial dimensions to avoid excessive information loss.
2. Using a masked auto-encoder architecture to facilitate the training of the FD layers and generate informative outputs.

The proposed FDViT achieves better classification performance with fewer FLOPs and parameters compared to existing hierarchical vision transformer models. Experiments on ImageNet, COCO, and ADE20K datasets demonstrate the effectiveness of the method.

In addition, [this blog post](https://research.google/blog/improving-vision-transformer-efficiency-and-accuracy-by-learning-to-tokenize/) describes TokenLearner, a module that can improve the efficiency and accuracy of Vision Transformer (ViT) models. TokenLearner is a learnable module that generates a smaller set of adaptive tokens from the input image or video, rather than using a fixed, uniform tokenization. This reduces the number of tokens that the subsequent Transformer layers need to process, leading to significant savings in memory and computation without compromising performance. The post reports experiments showing that inserting TokenLearner at different locations within a ViT model can achieve comparable or better accuracy than the baseline ViT, while reducing the computational cost by up to two-thirds. TokenLearner is particularly effective for video understanding tasks, where it achieves state-of-the-art performance on several benchmarks.
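
The token-reduction idea can be illustrated with a deliberately simplified NumPy sketch. It is not the published TokenLearner module, which computes its attention maps with convolutional or MLP layers. Here a single linear map `W_attn` (an assumption, as are the function name and all sizes) scores every input token for each of `S` output tokens, and each output token is a weighted average of the inputs.

```python
import numpy as np

def token_learner_sketch(tokens, W_attn):
    """Reduce (N, D) tokens to (S, D) using S spatial attention maps."""
    logits = tokens @ W_attn                                  # (N, S)
    logits = logits - logits.max(axis=0, keepdims=True)       # for stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)    # softmax over the N tokens
    return weights.T @ tokens                                 # (S, D)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))       # e.g. 196 patch tokens of width 64
W_attn = rng.normal(0.0, 0.02, (64, 8))   # hypothetical weights for S = 8 maps
learned = token_learner_sketch(tokens, W_attn)
print(learned.shape)
```

Layers placed after such a module attend over 8 tokens instead of 196, which is where the savings in memory and computation come from.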