<a name='0'></a>

# Introduction to Transformers for Visual Recognition

Vision Transformers are visual recognition architectures inspired by Language Transformers. The well-known Vision Transformer (ViT) was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929v2) by Alexey Dosovitskiy, Lucas Beyer, and their colleagues at Google Research. We will discuss this paper and other related works in detail in this notebook.

***Outline***:

- [1. Introduction](#1) 
- [2. Vision Transformer(ViT)](#2)
  - [2.1 Introduction](#2-1)
  - [2.2 Vision Transformer (ViT) Architecture](#2-2)
  - [2.3 Comparison of ViTs and ConvNets(ResNets)](#2-3)
  - [2.4 Scaling Vision Transformers](#2-4)
  - [2.5 Visualizing the Internal Representations of Vision Transformer](#2-5)
- [3. Training and Improving Vision Transformers](#3)
  - [3.1 Improving ViTs with Regularization and Augmentations](#3-1)
  - [3.2 Improving ViTs with Knowledge Distillation](#3-2)

- [4. Vision Transformers Beyond Image Classification](#4)
- [5. Implementations of ViTs](#5)
- [6. Conclusion](#6)
- [7. Further Reading](#7)

<a name='1'></a>

## 1. Introduction

Since the introduction of the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper by Vaswani et al., Transformers have revolutionized natural language processing (NLP). The Transformer not only showed excellent performance on most NLP tasks, but also went on to match or outperform Convolutional Neural Networks (CNNs or ConvNets) on many computer vision benchmarks.

Convolutional Neural Networks (CNNs or ConvNets) have been the primary architectures in various visual recognition tasks since 2012, when [AlexNet](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html) won [ILSVRC](https://image-net.org/challenges/LSVRC/index.php). CNNs work great, so why would anyone care about designing other vision architectures? The effort to replace CNNs is motivated by their strong inductive biases, such as locality and translation equivariance. On one hand, inductive biases are great since they help CNNs learn representations from small data. On the other hand, to get meaningful global representations you have to stack many convolution layers to grow the receptive field to the point where it covers all pixels of the input image, and that usually happens only in deep layers. CNNs are still the go-to architecture in most vision tasks, but Vision Transformers look like a promising alternative since they have much weaker inductive biases and use compute efficiently at scale.

[Image Transformer](https://arxiv.org/abs/1802.05751) is one of the earliest applications of Transformers in computer vision (in image generation, to be precise). Image Transformer is a Transformer that takes a sequence of input image pixels and generates the next pixel auto-regressively. Due to the quadratic complexity of self-attention in the sequence length, treating each pixel as a token doesn't work for high-resolution images. Thus, papers that exploit Transformers in vision tend to either combine attention with convolutions or remove convolutions entirely. Also, most modern Vision Transformers use a sequence of patches rather than a sequence of individual pixels to reduce the computational cost of self-attention.
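
To make the cost argument concrete, here is a small back-of-the-envelope sketch. The 224x224 input and the 16x16 patch size are illustrative assumptions, and the sketch counts only the entries of the attention matrix (for one head in one layer), which grows with the square of the sequence length.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    return num_tokens * num_tokens


def num_patches(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping patches (assumes the sides divide evenly)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Illustrative setting: a 224x224 image and 16x16 patches.
height, width, patch_size = 224, 224, 16

pixel_tokens = height * width                          # one token per pixel
patch_tokens = num_patches(height, width, patch_size)  # one token per patch

print("pixel tokens:", pixel_tokens, "scores:", attention_matrix_entries(pixel_tokens))
print("patch tokens:", patch_tokens, "scores:", attention_matrix_entries(patch_tokens))
```

With these numbers, going from pixels to patches shrinks the sequence by a factor of `16 * 16 = 256`, and therefore shrinks the attention matrix by a factor of `256 ** 2`.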

There are many examples of architectures that combine self-attention and convolutions. An early example is [Visual Transformers: Token-based Image Representation and Processing for Computer Vision](https://arxiv.org/abs/2006.03677). Since this is an introductory notebook, we will not dive deep into every paper, but in later notebooks we will revisit some notable Vision Transformer papers.

[Stand-Alone Self-Attention in Vision Models](https://arxiv.org/abs/1906.05909) is one of the first papers that attempted to use self-attention in vision without any convolution. Self-attention surprisingly works well in image classification, but the authors noted that using convolutions in the first layers and self-attention in later layers works better than using either convolutions or self-attention alone. Yann LeCun also [made a similar point: his favorite architectures are those that use ConvNets in the first layers and then Transformer blocks for object-based reasoning in the top layers](https://twitter.com/ylecun/status/1481198016266739715?s=20&t=5AyyvFxi5h-ju_AJGLALFg). LeCun was commenting on [ConvNeXt](https://arxiv.org/abs/2201.03545), a pure CNN architecture that borrows design traits of Transformers and applies them to CNNs.

The paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929v2) changed the landscape of Transformers in vision by introducing the Vision Transformer (ViT) we know today. Let's look at that paper in detail since it's regarded as one of the most important papers in computer vision.

<a name='2'></a>

## 2. Vision Transformer(ViT)

<a name='2-1'></a>

### 2.1 Introduction

Vision Transformer (ViT) is the first architecture to show that a model built entirely on self-attention can match or outperform CNNs on image classification.

Vision Transformer is a very simple architecture. In fact, it is the standard Transformer encoder applied to image patches. There are two key things behind the performance of Vision Transformer:

* Unlike prior works that applied Transformers to individual pixels or small groups of pixels, ViT divides the image into patches of size `16x16` or `14x14`. Using patches rather than individual pixels reduces the computational overhead of self-attention.

* Transformers are much hungrier for data than their convolutional counterparts. To overcome this, ViT was pretrained on a very large dataset of 300M images (JFT-300M, not available to the public) and then fine-tuned on standard vision datasets like ImageNet. Pretraining ViT on such a large dataset yielded better performance than merely training it on standard-sized datasets.

Let's see the main details of ViT architecture in the following section.

<a name='2-2'></a>

### 2.2 Vision Transformer (ViT) Architecture

Vision Transformer (ViT) is a very straightforward architecture. It takes a sequence of flattened image patches, applies a linear projection to each patch independently, adds positional embeddings, and feeds the output to a standard Transformer encoder followed by an MLP (Multi-Layer Perceptron) head for classification. Below is an illustration of the ViT architecture.


![image](https://drive.google.com/uc?export=view&id=1uW5J4yvcpTeqs00mtZN5oZMwT0dkk_bB)


ViT is really a simple architecture. Here are some important notes about its architecture:

* Splitting the image into fixed-size patches (such as `16x16` or `14x14`, denoted by `PxP`) is the key ingredient of Vision Transformer. Patching can be done with a regular convolution layer whose kernel size equals the patch size (`PxP`) and whose stride is `P`.

* The standard Transformer encoder takes a sequence of one-dimensional (1D) vectors as input. Thus, each patch (a 2D grid of pixels) must be flattened into a single vector, which is then fed to a linear (dense) layer; this is what we refer to as the linear projection. Before feeding the patch embeddings to the Transformer encoder, we also add learnable 1D positional embeddings to retain the order of patches (self-attention by itself cannot reason about the order of patches, which is why we inject positional embeddings). As the authors found, there is a large performance gap between a ViT with and without positional embeddings. We can also use 2D positional embeddings, but the exact type of embedding doesn't make a big difference in accuracy.

* The authors of ViT prepended an extra learnable class (or label) token to the sequence of patch embeddings. In follow-up papers such as [this](https://arxiv.org/pdf/2205.01580v1.pdf), they removed the class token and instead applied standard global average pooling (GAP) over the output tokens of the encoder before the classification head. GAP is straightforward and keeps the architecture simple.

* The rest of the ViT architecture is a Transformer encoder, which is the same encoder used in language models, and an MLP (Multi-Layer Perceptron) head on the encoder output for classification.
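
The input pipeline described in the notes above (patching, flattening, linear projection, class token, positional embeddings) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the sizes are tiny and arbitrary, the weights are random stand-ins for learned parameters, and the helper names (`patchify`, `linear`, `random_matrix`) are made up for this example. A real model would use a tensor library.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened P x P patches.

    Returns N = (H / P) * (W / P) vectors of length P * P * C in raster order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def linear(vector, weights, bias):
    """Dense layer; `weights` holds one row per output feature."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned parameter matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, W, C = 8, 8, 3   # tiny illustrative image
P = 4               # patch size
D = 6               # embedding width

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)                           # N x (P * P * C)

proj_w = random_matrix(D, P * P * C, rng)              # linear projection weights
proj_b = [0.0] * D
tokens = [linear(p, proj_w, proj_b) for p in patches]  # N x D

cls_token = [0.0] * D                                  # stand-in for the learned class token
tokens = [cls_token] + tokens                          # (N + 1) x D

pos_emb = random_matrix(len(tokens), D, rng)           # one vector per sequence position
encoder_input = [[t + p for t, p in zip(tok, pos)]
                 for tok, pos in zip(tokens, pos_emb)]

print(len(patches), len(patches[0]), len(encoder_input), len(encoder_input[0]))
```

Applying the same `linear` layer to every flattened patch is equivalent to a convolution with kernel size `P` and stride `P`, which is why many implementations express this step as a single convolution layer.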

ViT clearly looks like a Language Transformer. The only difference is that we have images at the input, which we have to process a little differently from sentences.

<a name='2-3'></a>

### 2.3 Comparison of Vision Transformer(ViT) and ConvNets(ResNets)

As we alluded to previously, the key ingredient of ViT's success is pretraining it on a very large dataset and fine-tuning it on smaller datasets. Pretraining ViT on standard-sized image datasets yielded worse results than CNNs. The image below shows the comparison of ViT and ConvNets.

![image](https://drive.google.com/uc?export=view&id=1vC0Z0t1mrV8XTzSs6lmGdLzRsqx9o_dm)


The key takeaway here is that Vision Transformers (ViTs) perform better than ConvNets in large-data regimes. In small-data regimes, ConvNets outperform Vision Transformers. The authors explained that ViTs benefit so much from large-scale pretraining because they don't possess the inductive biases of ConvNets, so a large dataset compensates for the weaker inductive biases in ViTs.

<a name='2-4'></a>

### 2.4 Scaling Vision Transformers

Language Transformers are very scalable and so are Vision Transformers (ViTs). ViTs have a great trade-off between performance and compute. ViTs use less compute (roughly 2-4 times less) than ResNets to achieve the same accuracy. Put the other way, for the same compute, ViT outperforms ResNet. Hybrid architectures (self-attention combined with ConvNets) outperform pure ViTs at small compute budgets, but the gap vanishes at larger budgets. The surprising thing about ViT is that its performance does not appear to saturate over the range of computational budgets tried. That's the whole idea of scaling. A scalable model keeps improving as dataset size and compute are increased, and the major successes of Transformers across different modalities are largely due to this scaling property.


![image](https://drive.google.com/uc?export=view&id=1XTIl4Rdlp5tZCg6eVXvfnz_a5bjnnP-q)


>Compute is often measured in FLOPs, which stands for floating point operations. Taking an example: the expression `wx + b` involves two operations, a multiplication (`wx`) and an addition (adding `b` to `wx`). Many vision papers count one such multiply-add as 1 FLOP, although other conventions count it as 2.

Model scaling is a hot topic in deep learning research today. Scaling Vision Transformers was studied in detail [here](https://arxiv.org/abs/2106.04560).
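
To make the FLOP counting from the note above concrete, here is a rough sketch that counts the multiply-adds of a single dense layer. The layer sizes are illustrative assumptions (196 flattened `16x16x3` patches projected to a 768-dimensional embedding), bias additions are ignored, and the function name is made up for this example. Because conventions differ on whether a multiply-add counts as one or two operations, the sketch prints both.

```python
def dense_layer_macs(in_features: int, out_features: int, num_tokens: int = 1) -> int:
    """Multiply-accumulate operations for y = Wx applied to `num_tokens` inputs."""
    return num_tokens * in_features * out_features


# Illustrative numbers for the patch projection of a 224x224 RGB image.
in_features = 16 * 16 * 3   # 768 values per flattened patch
out_features = 768          # assumed embedding width
num_tokens = 196            # (224 / 16) ** 2 patches

macs = dense_layer_macs(in_features, out_features, num_tokens)
print("multiply-adds:", macs)
print("operations if a multiply-add counts as 2:", 2 * macs)
```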

<a name='2-5'></a>

### 2.5 Visualizing the Internal Representations of Vision Transformer

The first layer of ViT is a linear layer that projects the flattened image patches into embedding vectors. Looking at visualizations of the principal components of the learned embedding filters, they actually look like the low-level features (edges, lines, corners) that early ConvNet layers produce.

After the linear projection, we add learnable positional embeddings to the patch embeddings. As you can see in the image below, patches in the same row or column have similar embeddings. ViT effectively learns the row and column position of each patch: for every patch, the similarity with itself is highest (close to 1), it stays high along the patch's own row and column, and this pattern holds across all rows and columns.
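
A similarity map of this kind can be computed as the cosine similarity between the positional embedding of one patch and the embeddings of all other patches, arranged on the patch grid. The sketch below shows only the computation; it uses random vectors as a stand-in for learned embeddings, so it will not reproduce the row and column structure of a trained model. The grid size, embedding width, and function names are assumptions for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_map(pos_emb, grid_size, row, col):
    """Similarity of the embedding at (row, col) with every grid position."""
    query = pos_emb[row * grid_size + col]
    return [[cosine_similarity(query, pos_emb[r * grid_size + c])
             for c in range(grid_size)]
            for r in range(grid_size)]


rng = random.Random(0)
grid_size, dim = 4, 16
# Random stand-in; in a trained ViT these vectors are learned parameters.
pos_emb = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
           for _ in range(grid_size * grid_size)]

sim = similarity_map(pos_emb, grid_size, row=1, col=2)
for sim_row in sim:
    print([round(value, 2) for value in sim_row])
```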

Furthermore, self-attention can attend to most of the image even in early layers. This is illustrated with the attention distance, which is analogous to the receptive field size in ConvNets. You can see that some attention heads already integrate information from most of the image in the lowest layers. In ConvNets, to grow the receptive field, you usually have to stack many layers. But in ViT, some heads attend to most of the image even in the first attention layer.
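
One way to read the attention distance is as an attention-weighted average of the pixel distance between a query patch and every patch it attends to. The following is a simplified sketch under that reading: it handles a single query patch and a single head, ignores the class token, and uses hand-made attention weights rather than weights from a trained model. The function name and the 3x3 grid are illustrative assumptions.

```python
import math


def mean_attention_distance(attn_row, query_index, grid_size, patch_size):
    """Attention-weighted average pixel distance from one query patch.

    attn_row holds the attention weights from the query patch to every
    patch on the grid (in raster order) and is expected to sum to 1.
    """
    q_row, q_col = divmod(query_index, grid_size)
    total = 0.0
    for key_index, weight in enumerate(attn_row):
        k_row, k_col = divmod(key_index, grid_size)
        distance = math.hypot(k_row - q_row, k_col - q_col) * patch_size
        total += weight * distance
    return total


grid_size, patch_size = 3, 16
num_patches = grid_size * grid_size
center = 4  # index of the middle patch on a 3x3 grid

# A "local" head that only looks at the query patch itself.
local_head = [0.0] * num_patches
local_head[center] = 1.0

# A "global" head that spreads attention uniformly over all patches.
global_head = [1.0 / num_patches] * num_patches

print(mean_attention_distance(local_head, center, grid_size, patch_size))
print(mean_attention_distance(global_head, center, grid_size, patch_size))
```

A head with a small value behaves like a convolution with a small receptive field, while a head with a large value integrates information from across the image.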


Finally, on the global level, due to the attention mechanism, ViT can attend to the regions of the image that are relevant for the task while ignoring other regions. This and other internal mechanisms discussed previously are perfectly illustrated in the image below.


![image](https://drive.google.com/uc?export=view&id=1w3NZvfIQNf46Tqx_z3w6__tyjqmHlbi0)

<a name='3'></a>

## 3. Training and Improving Vision Transformers

<a name='3-1'></a>

### 3.1 Improving ViTs with Regularization and Augmentations

Vision Transformers have outperformed ConvNets on various computer vision benchmarks, but their performance usually comes from pretraining on large datasets and fine-tuning on smaller ones. Compared to ConvNets, Vision Transformers have weaker inductive biases, which is why they need large datasets. Unfortunately, most very large datasets such as JFT-300M are not available to the public, and thus it is not possible for most practitioners to train Vision Transformers from scratch at that scale. So, can we train Vision Transformers without extremely large datasets, as we do with ConvNets? Lucas Beyer and his colleagues studied this question and shared their findings in the paper [How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers](https://arxiv.org/abs/2106.10270).

One of their findings is that carefully chosen regularization and data augmentation can compensate for the lack of a large dataset when training ViTs. Examples of the regularization methods they used are weight decay, [stochastic depth](https://arxiv.org/abs/1603.09382?context=cs), and dropout, while the data augmentations are [mixup](https://arxiv.org/abs/1710.09412) and [RandAugment](https://arxiv.org/abs/1909.13719).
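
As an example of one of these augmentations, mixup blends two training examples and their labels with a coefficient drawn from a Beta distribution. The sketch below is a minimal illustration on plain Python lists, not the training code from the paper: the value `alpha=0.2` and the toy inputs are arbitrary assumptions.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples; lam is drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)

# Two toy "images" (flattened) with one-hot labels for a 2-class problem.
x1, y1 = [0.0, 0.0, 0.0, 0.0], [1.0, 0.0]
x2, y2 = [1.0, 1.0, 1.0, 1.0], [0.0, 1.0]

x, y, lam = mixup(x1, y1, x2, y2, alpha=0.2, rng=rng)
print("lambda:", lam)
print("mixed input:", x)
print("mixed label:", y)
```

The mixed label is a soft target, so the model is trained to predict both classes in proportion to how much each image contributes.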

In particular, they found that regularization and data augmentation help on relatively small datasets such as ImageNet-1k (which has about 1.3M images).

![image](https://drive.google.com/uc?export=view&id=1zw8Zpx9WJoJlM7btybg-UYB2o7m8QfBl)


While you can train ViTs on standard datasets with regularization and augmentations, you probably don't need to train them from scratch. For most downstream datasets, transferring a pretrained model is all you need! Most pretrained models are freely available, so you don't have to spend time and compute training models from scratch. Transfer learning is the norm in deep learning today!

>If you had access to JFT-300M, yes, you could train ViT from scratch. [But why would you](http://karpathy.github.io/2022/03/14/lecun1989/)?



<a name='3-2'></a>

### 3.2 Improving ViTs with Knowledge Distillation

In addition to training ViTs on standard vision datasets with strong regularization and augmentation techniques, there is also another technique that surprisingly improves ViTs without large datasets. The technique is called Knowledge Distillation(KD) and it was initially introduced by [Hinton et al](https://arxiv.org/pdf/1503.02531.pdf).

The idea of knowledge distillation is to train a ***teacher*** model on a dataset like ImageNet and then train a separate ***student*** model to match the predictions of the teacher model, typically using a KL divergence loss. The teacher model is typically larger than the student model. It can be either a ConvNet or a ViT, and the student model is a ViT in our case. [The paper that introduced knowledge distillation in ViTs](https://arxiv.org/abs/2012.12877) demonstrated that ConvNet teachers yield better performance than ViT teachers, which the authors attribute to the inductive biases of ConvNets. The upside of knowledge distillation is that the student ViT can inherit some of these ConvNet inductive biases through the teacher's predictions.
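
The KL divergence term can be sketched as follows. This is the soft-label formulation in the style of Hinton et al., written with the standard library only; it is not the exact DeiT objective, which also includes a hard-label variant and a dedicated distillation token. The temperature value and the toy logits are illustrative assumptions, and in practice this term is usually combined with the ordinary cross-entropy loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def soft_distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """KL between softened teacher and student predictions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl_divergence(teacher, student)


teacher_logits = [4.0, 1.0, 0.5]
good_student = [4.0, 1.0, 0.5]   # agrees with the teacher
poor_student = [0.5, 1.0, 4.0]   # disagrees with the teacher

print(soft_distillation_loss(good_student, teacher_logits))
print(soft_distillation_loss(poor_student, teacher_logits))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.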


![image](https://drive.google.com/uc?export=view&id=1NT23A787dYOnjpD--NkJcEDn1PiPvadS)


Knowledge distillation can be compared to ensemble learning. The difference is that, instead of averaging the predictions of different models as in ensemble learning, we distill the predictions made by a teacher model into a smaller student model. You can learn more about knowledge distillation in the [paper that originally introduced it](https://arxiv.org/abs/1503.02531) and [Training data-efficient image Transformers & distillation through attention (DeiT)](https://arxiv.org/abs/2012.12877).

<a name='4'></a>

## 4. Vision Transformers Beyond Image Classification

Vision Transformers initially showed potential for image classification, but the computer vision community then gradually adapted them to other visual recognition tasks such as object detection and image segmentation.

The key challenge in applying Vision Transformers to object detection and dense prediction tasks (such as image segmentation) is that those tasks typically use high-resolution images. You may be wondering what that has to do with Vision Transformers. Well, unlike ConvNets, which have a hierarchical structure (the spatial resolution is downsampled and the number of channels increases at each stage), Vision Transformers keep the same resolution and number of channels in all Transformer blocks. It's thus expensive to run global self-attention on high-resolution images, since self-attention is already a bottleneck at medium resolutions.

Take object detection as an example. Most object detection backbones are ConvNets. The earliest Transformer-based detectors also employed ConvNet backbones. A popular example is [DETR (DEtection TRansformer)](https://arxiv.org/abs/2005.12872), which uses a ResNet backbone and a Transformer encoder-decoder to predict object labels and box coordinates without traditional post-processing such as Non-Max Suppression (NMS). The introduction of DETR was groundbreaking in the vision community, and it performed on par with well-established detectors such as Faster R-CNN, but notice that it still relied on a ConvNet as the backbone network, not a Vision Transformer.

![image](https://drive.google.com/uc?export=view&id=1cN7W9G3-6ExhgKK06K2fsh3u4sWFSgAb)


In fact, Vision Transformers weren't widely used as backbone networks in object detection until the introduction of [Swin Transformer](https://arxiv.org/abs/2103.14030). Swin Transformer is a Vision Transformer architecture that has a hierarchical structure like ConvNets and uses window-based self-attention whose complexity is linear in the image size (rather than standard self-attention, which is quadratic). This hierarchical structure made Swin Transformer a suitable backbone network for object detection and panoptic segmentation (another example of a dense prediction task).
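
The difference between quadratic and linear complexity can be checked numerically. The sketch below follows the complexity estimates given in the Swin Transformer paper for a feature map of `h x w` tokens with `C` channels and windows of `M x M` tokens (the softmax cost is omitted). The concrete values of `C`, `M`, and the feature map sizes are illustrative assumptions, and the function names are made up for this example.

```python
def global_msa_cost(h, w, channels):
    """Global self-attention estimate: 4*h*w*C^2 + 2*(h*w)^2*C."""
    return 4 * h * w * channels ** 2 + 2 * (h * w) ** 2 * channels


def window_msa_cost(h, w, channels, window):
    """Window self-attention estimate: 4*h*w*C^2 + 2*M^2*h*w*C."""
    return 4 * h * w * channels ** 2 + 2 * window ** 2 * h * w * channels


channels, window = 96, 7  # illustrative values

for side in (56, 112):
    global_cost = global_msa_cost(side, side, channels)
    window_cost = window_msa_cost(side, side, channels, window)
    print(f"{side}x{side} tokens: global={global_cost:.3e} "
          f"window={window_cost:.3e} ratio={global_cost / window_cost:.1f}")
```

Doubling each side of the feature map multiplies the number of tokens by 4. The window-based cost then grows by exactly 4x, while the attention term of the global cost grows by 16x.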

![image](https://drive.google.com/uc?export=view&id=1Esf8UhKvROT92Dh1tAHGHnpU2pUxG6XT)


DETR and Swin Transformer are a few examples of how Transformer-based models are increasingly achieving competitive results compared to ConvNets. Since DETR and Swin Transformer are two of the most important papers in the Vision Transformer family, we will revisit them in the next notebooks. Vision Transformers are also used for tasks beyond image classification, object detection, and segmentation. Other tasks that are starting to employ Vision Transformers include depth estimation, image colorization, etc.

With the Language Transformer being the key architecture in NLP today and the Vision Transformer (which looks like a Language Transformer) becoming one in computer vision, it is increasingly possible to solve vision-language tasks, that is, tasks that involve both vision and language modalities such as image captioning and visual question answering, with a single unified model. In other words, a single Transformer can solve tasks that previously required multiple different models. Unifying vision and language tasks is one of the hottest areas of research nowadays. Some examples of the most popular unified models are [UViM](https://arxiv.org/abs/2205.10337), [Unified-IO](https://unified-io.allenai.org), [OFA](https://arxiv.org/abs/2202.03052), [Gato (in reinforcement learning)](https://www.deepmind.com/publications/a-generalist-agent), etc.

For more about different applications of Transformers in a wide array of visual recognition and vision-language tasks, check this [survey paper](https://arxiv.org/abs/2101.01169).

<a name='5'></a>

## 5. Implementations of ViTs

The purpose of this notebook is not to provide an implementation of Vision Transformers from scratch, but it's important to share some clean implementations that are publicly available.

Below are some implementations of the original ViT and other related models:

* [Official implementation of Vision Transformer](https://github.com/google-research/vision_Transformer) contains implementations of the original ViT and other models released by the same team that designed ViT, such as [MLP-Mixer](https://arxiv.org/abs/2105.01601) (MLP-Mixer looks like ViT, but rather than using attention, it uses MLPs). All models are implemented in [JAX](https://jax.readthedocs.io/en/latest/#).

* [Big Vision](https://github.com/google-research/big_vision) also provides implementations of ViTs and other related models. Big Vision seems like an extension of [vision_Transformer](https://github.com/google-research/vision_Transformer), and it's the codebase that powers most of the later work by the designers of ViT.

* [Vision Transformer - Pytorch](https://github.com/lucidrains/vit-pytorch) by [Phil Wang](https://github.com/lucidrains) provides implementations of various Vision Transformers. Similar implementations in TensorFlow are found [here](https://github.com/taki0112/vit-tensorflow).

* [PyTorch Image Models(Timm)](https://github.com/rwightman/pytorch-image-models) offers pretrained versions of popular Vision Transformer models.

* [Hugging Face Transformers repository](https://github.com/huggingface/Transformers), one of the most popular NLP and AI repositories, provides implementations of not only Language Transformers but also Vision Transformers.

* Finally, there are a number of awesome code tutorials of Vision Transformers on [Keras.io](https://keras.io/examples/). Some relevant tutorials are [Image classification with Vision Transformer](https://keras.io/examples/vision/image_classification_with_vision_Transformer/), [Train a Vision Transformer on small datasets](https://keras.io/examples/vision/vit_small_ds/), [A Vision Transformer without Attention](https://keras.io/examples/vision/shiftvit/), [Investigating Vision Transformer representations](https://keras.io/examples/vision/probing_vits/), [Distilling Vision Transformers](https://keras.io/examples/vision/deit/).

<a name='6'></a>

## 6. Conclusion

In this notebook, we have learned about Vision Transformers. Vision Transformers are relatively new computer vision architectures inspired by Language Transformers. Just as Language Transformers operate on sequences of words, most Vision Transformers operate on sequences of patches.

Vision Transformers have received a lot of attention from the research community in the past 2 years. There is a debate about whether they will replace ConvNets entirely, whether the two can be used in conjunction, or whether both will remain primary architectures in computer vision. A nice trait of ViTs is that they have weaker inductive biases than ConvNets.

The popular notion about ViTs is that they require lots of training data. While that is partly true, we have seen that various modern regularization and data augmentation techniques can help train ViTs on standard computer vision datasets such as ImageNet.

Finally, below are what some notable deep learning practitioners think about the debate of Vision Transformers and ConvNets:

> Am I going to argue that "Conv is all you need"? No! My favorite architecture is DETR-like: ConvNet (or ConvNeXt) for the first layers, then something more memory-based and permutation invariant like Transformer blocks for object-based reasoning on top - Said by Yann LeCun via [Twitter](https://twitter.com/ylecun/status/1481198016266739715?s=20&t=HlHYZRgx4A_rVOhoVoxc5w).

>  Vision Transformers are an evolution, not a revolution. We can still fundamentally solve the same problems as with CNNs. Main benefit of ViTs is probably speed: Matrix multiply is more hardware-friendly than convolution, so ViTs with same FLOPs as CNNs can train and run much faster. Said by Justin Johnson in his Deep Learning for Computer Vision course.

> So far, they(ViTs) haven't dethroned pure convnets(unless params is all you care about). Ross Wightman via [Twitter](https://twitter.com/wightmanr/status/1545526140445462528?s=20&t=_kJzn97UEf515efPXMvUzA).

In the next notebooks, we will explore some interesting Vision Transformers architectures.

<a name='7'></a>

## 7. Further Reading

In addition to the papers that we discussed, here are further resources if you want to learn more about Vision Transformers:

* [Transformers In Vision - IAML Distill Blog](https://iaml-it.github.io/posts/2021-04-28-Transformers-in-vision/)

* [Self-Attention for Vision - ICML 2021 Tutorials](https://slideslive.com/icml-2021/tutorial-selfattention-for-computer-vision)

### [BACK TO TOP](#0)