<a name='0'></a>

# Swin Transformer: Hierarchical Vision Transformer using Shifted Windows


[Swin Transformer](https://arxiv.org/abs/2103.14030) is one of the first Vision Transformers to demonstrate that Transformers can serve as general-purpose backbone networks in computer vision. Upon its introduction, Swin Transformer outperformed CNNs on various computer vision benchmarks in tasks such as image classification and object detection. In this notebook, we will take a high-level look at Swin Transformer.

***Outline***:

- [1. Introduction](#1) 
- [2. Swin Transformer Architecture](#2)
- [3. Shifted Window Self Attention](#3)
- [4. Swin Transformer Results](#4)
- [5. Swin Transformer V2 and Other Hierarchical Vision Transformers](#5)
- [6. Implementations of Swin Transformer](#6)
- [7. Conclusion](#7)
- [8. Further Learning](#8)

<a name='1'></a>

## 1. Introduction

For a long time, convolutional neural networks (CNNs) served as the primary architecture in virtually all computer vision tasks. When Vision Transformers emerged, they showed remarkable performance in image classification, but they were difficult to apply to object detection and other dense prediction tasks such as semantic segmentation. There are two main reasons. First, unlike word tokens in language, which have a fixed scale, visual elements in images vary widely in scale. Second, real-world images tend to have high resolution, and applying global self-attention to them is not practical because its cost grows quadratically with the number of patches. So, the challenges of applying Transformers in vision are mainly the varying scales and the high resolution of images.

The reason why CNNs are able to handle varying scales and high resolutions is that they have a hierarchical structure. In CNNs, the feature maps are progressively downsampled while the number of channels increases as the network gets deeper. This is not the case in Vision Transformers. A standard Vision Transformer maintains the same resolution and number of channels over the whole network (i.e., every self-attention layer operates on the same number of tokens with the same embedding dimension).

Although CNNs can handle the challenges that we described above, they need to stack many convolution layers to cover the whole image, while Vision Transformers can attend to the whole image in far fewer self-attention layers. A hierarchical Vision Transformer could combine both strengths, and that is indeed what Swin Transformer addresses.

So, to summarize, Swin Transformer (Shifted *Win*dow Transformer) is a hierarchical Vision Transformer that computes self-attention within shifted local windows. As a result, it can act as a backbone network in visual recognition tasks that use high-resolution images, such as object detection. As a computational advantage over previous Vision Transformers, Swin Transformer has linear time complexity with respect to image size because it applies self-attention to the few patches in a fixed-size local window rather than to all patches. The rest of the notebook covers its architecture and a few other takeaways from the paper.

<a name='2'></a>

## 2. Swin Transformer Architecture

The architecture of Swin Transformer is illustrated below.

![image](https://drive.google.com/uc?export=view&id=1jOQiWD9qrQgiYDyzACFoR1Fu4dh46KZY)

The Swin Transformer architecture is populated by multiple Swin Transformer blocks, but there are other important elements that are worth talking about. Let's review the whole architecture in brief:

* The first element of Swin Transformer is a patch partition layer that splits the input image into non-overlapping patches of fixed size (4x4). Each patch is treated as a token, just as in language modelling. Since each pixel has 3 color channels, the feature dimension of each patch is 4x4x3=48.

* A linear embedding layer is applied to each patch feature independently to project it to an arbitrary dimension `C`.

* After the linear embedding layer, there are several stages. The first stage contains the linear embedding layer followed by Swin Transformer blocks, and every later stage contains a patch merging layer followed by Swin Transformer blocks. Swin Transformer blocks are pretty much like a normal Transformer encoder block, with a modified multi-head self-attention (MSA) layer. A Swin Transformer block is made of a window-based multi-head self-attention module (W-MSA) or its shifted-window variant (SW-MSA), followed by a 2-layer MLP with GELU non-linearity in between. A LayerNorm (LN) layer is applied before each attention module and each MLP, and a residual connection is applied after each of them.

* To achieve a hierarchical network where the resolution decreases and the number of channels (or dimensions) increases after every Swin Transformer stage, we insert a patch merging layer between the Swin Transformer blocks of two successive stages. To merge the patches, we concatenate the features of each group of 2x2 neighboring patches, which gives 4C channels, and then apply a linear projection layer (or 1x1 conv layer) that reduces them to 2C. As a result, the resolution decreases and the number of channels increases by a factor of 2 at each merging step. Look at the description on top of every stage in the image above.
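
The shape bookkeeping described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the official implementation: the weights are random instead of learned, the Swin Transformer blocks inside each stage are skipped because they do not change the shape, and the 224x224 input, `C = 96` (the width of the Swin-T variant), and the function names `patch_partition` and `patch_merging` are choices made for this example.

```python
import numpy as np

def patch_partition(image, patch_size=4):
    """Split an (H, W, 3) image into non-overlapping patches, one flat vector per patch."""
    H, W, ch = image.shape
    p = patch_size
    x = image.reshape(H // p, p, W // p, p, ch)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, ch)
    return x.reshape(H // p, W // p, p * p * ch)   # (H/p, W/p, 48) for p=4, ch=3

def patch_merging(x, weight):
    """Concatenate each 2x2 group of neighboring tokens (C -> 4C), then project 4C -> 2C."""
    x0 = x[0::2, 0::2]
    x1 = x[1::2, 0::2]
    x2 = x[0::2, 1::2]
    x3 = x[1::2, 1::2]
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (H/2, W/2, 4C)
    return merged @ weight                               # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
C = 96                                  # assumed embedding width (Swin-T)
image = rng.random((224, 224, 3))       # assumed input size

patches = patch_partition(image)                 # (56, 56, 48)
tokens = patches @ rng.normal(size=(48, C))      # linear embedding, random weights

shapes = [tokens.shape]
x = tokens
for _ in range(3):                               # patch merging before stages 2, 3, 4
    dim = x.shape[-1]
    x = patch_merging(x, rng.normal(size=(4 * dim, 2 * dim)))
    shapes.append(x.shape)

print(shapes)  # [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

Each merging step halves the height and width and doubles the channels, which is the same pyramid shape that CNN backbones produce.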

That's it about the architecture of Swin Transformer. Since shifted window self-attention is the most important part of Swin Transformer, let's discuss it in detail.

<a name='3'></a>

## 3. Shifted Window Self Attention

The [standard self-attention](https://arxiv.org/abs/1706.03762) (originally introduced in the Transformer architecture) computes the relationship between all tokens (or patches, in image recognition) globally. Since each token has to attend to every other token, its time complexity is quadratic in the number of tokens. As we alluded to in the beginning, some visual recognition tasks such as object detection use high-resolution images, and so it's not feasible to apply global attention to them.

Rather than computing self-attention over all patches, we can compute it within non-overlapping local windows of fixed size. Each window contains M x M patches, where M is 7 by default. That's in fact what **window-based self-attention** refers to. As the size of the window is fixed, the time complexity is linear with respect to the number of patches in the input image.
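
To make the idea concrete, here is a small NumPy sketch of window partitioning. It assumes a 56x56 feature map with 96 channels (example values, not prescribed by the text above), and the helper name `window_partition` is chosen for this example. It also counts how many query-key pairs global attention and window attention would have to score.

```python
import numpy as np

def window_partition(x, window_size=7):
    """Split an (H, W, C) feature map into non-overlapping windows of M*M tokens."""
    H, W, C = x.shape
    M = window_size
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)        # (num_windows, M*M, C)

feature_map = np.zeros((56, 56, 96))      # assumed example size
windows = window_partition(feature_map)
print(windows.shape)                      # (64, 49, 96)

num_tokens = 56 * 56
global_pairs = num_tokens ** 2                               # every token vs. every token
window_pairs = windows.shape[0] * windows.shape[1] ** 2      # only pairs inside a window
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

Doubling the image side quadruples the number of windows but leaves the work per window unchanged, which is where the linear scaling comes from.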

Window-based self-attention is compute efficient, but there is a problem: patches only attend to patches within the same window, so there is no connection across windows. To overcome that, the authors introduced **shifted-window based attention** for maintaining cross-window connections. Window-based self-attention and shifted-window based self-attention alternate with each other in successive Swin Transformer blocks (see W-MSA and SW-MSA in the image of Swin Transformer in the previous section).

![image](https://drive.google.com/uc?export=view&id=1quVWCGkF9XtQMgcU7e1obUIWywWWubNJ)

Shifted-window based attention introduces connections between neighboring non-overlapping windows of the previous layer. The authors showed that the shifted-window approach improves performance in image classification, object detection, and semantic segmentation.
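
The sketch below illustrates how a shift creates cross-window connections. It relies on details from the paper that are not covered above: the windows are displaced by half the window size, and the shift is implemented efficiently as a cyclic roll of the feature map. The tiny 8x8 grid, the window size of 4, and the helper name `partition_token_ids` are assumptions made to keep the example readable.

```python
import numpy as np

def partition_token_ids(ids, window_size):
    """Split an (H, W) grid of token ids into windows of window_size**2 ids each."""
    H, W = ids.shape
    M = window_size
    return ids.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)

M = 4
token_ids = np.arange(8 * 8).reshape(8, 8)
regular = partition_token_ids(token_ids, M)        # 4 regular windows of 16 tokens

# Cyclically shift the map by half a window, then partition with the same grid.
shift = M // 2
shifted_map = np.roll(token_ids, shift=(-shift, -shift), axis=(0, 1))
shifted = partition_token_ids(shifted_map, M)

# Record which regular window each token belonged to.
regular_window_of = np.empty(token_ids.size, dtype=int)
for w, ids in enumerate(regular):
    regular_window_of[ids] = w

# The first shifted window mixes tokens from all four regular windows.
mixed = sorted(set(regular_window_of[shifted[0]].tolist()))
print(mixed)  # [0, 1, 2, 3]
```

Because the roll is cyclic, windows at the border end up holding tokens that wrapped around from the opposite side of the image. The paper handles this by masking attention between such tokens, a step that this sketch leaves out.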

From a mathematical point of view, window attention (both regular and shifted) is like normal self-attention except for an additional relative position bias $B$ that is added to $QK^T$ in the self-attention formula. The relative position bias supplies positional information, which removes the need for the usual positional encodings or positional embedding layers.

$$
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}} + B\right)V
$$

In the formula above, $Q$, $K$, and $V$ are the query, key, and value matrices respectively, and $d$ is the query/key dimension.
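
The formula can be sketched for a single window and a single head as follows. This is an illustration under assumptions, not the official code: the bias values are random here whereas they are learned in a real model, and the indexing scheme (one table entry per possible relative offset, giving a table of $(2M-1)^2$ entries) follows the paper's description. The names `relative_position_index` and `window_attention` are chosen for this example.

```python
import numpy as np

def relative_position_index(M):
    """For every pair of tokens in an M x M window, return an index into a
    bias table with (2M-1)**2 entries, one per possible relative offset."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))  # (2, M, M)
    coords = coords.reshape(2, -1)                    # (2, M*M) row and column of each token
    rel = coords[:, :, None] - coords[:, None, :]     # offsets in [-(M-1), M-1]
    rel = rel + (M - 1)                               # shift offsets to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]              # (M*M, M*M)

def window_attention(q, k, v, bias):
    """Softmax(Q K^T / sqrt(d) + B) V for the tokens of one window."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

M, d = 7, 32                                          # window size and head dimension
rng = np.random.default_rng(0)
bias_table = rng.normal(scale=0.02, size=(2 * M - 1) ** 2)   # learned in a real model
B = bias_table[relative_position_index(M)]            # (49, 49)

q, k, v = (rng.normal(size=(M * M, d)) for _ in range(3))
out = window_attention(q, k, v, B)
print(B.shape, out.shape)
```

Token pairs with the same relative offset share the same bias value. For example, `B[0, 1]` and `B[1, 2]` are equal because both pairs are horizontal neighbors in the same direction.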

To summarize, the main difference between window-based self-attention and normal self-attention is that the former is computed within fixed-size local windows whereas the latter is computed over all image patches. This makes the cost of window-based attention linear in the number of patches.
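
The difference in cost can be made concrete with the complexity estimates given in the Swin Transformer paper. The symbols $h$ and $w$ (height and width of the feature map in patches) are introduced here for illustration, while $C$ is the channel dimension and $M$ is the window size as before:

$$
\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C
$$

$$
\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC
$$

The second term of global MSA is quadratic in the number of patches $hw$, whereas for W-MSA it is linear in $hw$ when $M$ is fixed.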

<a name='4'></a>

## 4. Swin Transformer Results

Swin Transformer demonstrates excellent performance on various computer vision datasets in image classification, object detection, and semantic segmentation. Compared to other state-of-the-art visual recognition networks such as RegNets, EfficientNets, and DeiT, Swin Transformer has a favorable accuracy-speed trade-off.

![image](https://drive.google.com/uc?export=view&id=1svQMpnwGUvCdimF6OIihiyE3IxfkIsiu)



<a name='5'></a>

## 5. Swin Transformer V2 and Other Hierarchical Vision Transformers

The authors of Swin Transformer did follow-up work to improve its performance and scale up its capacity. They mainly modified self-attention, replacing dot-product attention with [cosine attention](https://paperswithcode.com/method/content-based-attention), and changed the position of the layer norm. Rather than having layer norm before attention and MLPs, [Swin Transformer V2](https://arxiv.org/abs/2111.09883) puts layer norm after attention and MLPs. According to the authors, these changes make training more stable as the model is scaled up.

![image](https://drive.google.com/uc?export=view&id=1cls4hFXWybfFWqRwrm8hXBElTP_aBRQY)
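
As a rough illustration of cosine attention, the sketch below computes attention scores from the cosine similarity between queries and keys, divided by a temperature `tau`, plus the relative position bias. In Swin Transformer V2 the temperature is a learnable parameter. The fixed value `tau=0.1`, the zero bias, and the function name `scaled_cosine_scores` are assumptions made for this example.

```python
import numpy as np

def scaled_cosine_scores(q, k, bias, tau=0.1):
    """Attention scores as cosine similarity divided by a temperature, plus a bias."""
    q_unit = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k_unit = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return q_unit @ k_unit.T / tau + bias

rng = np.random.default_rng(0)
q = rng.normal(size=(49, 32))
k = rng.normal(size=(49, 32))
bias = np.zeros((49, 49))               # placeholder for the relative position bias

scores = scaled_cosine_scores(q, k, bias)
scaled_up = scaled_cosine_scores(100 * q, 100 * k, bias)
print(np.allclose(scores, scaled_up))   # True: scores ignore the magnitude of q and k
```

Unlike dot-product scores, these scores stay within a fixed range no matter how large the activations become, which is one intuition for why this choice helps in very large models.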

Hierarchical designs also appear in other Vision Transformers such as [Multiscale Vision Transformers](https://arxiv.org/pdf/2104.11227.pdf) and its improved version [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/pdf/2112.01526.pdf). We won't go deep into these papers, but you can skim through them if you have time.

<a name='6'></a>

## 6. Implementations of Swin Transformer

Because Swin Transformer is widely applicable, especially to dense prediction tasks in computer vision, there are numerous awesome open-source implementations of it. The [official implementation](https://github.com/microsoft/Swin-Transformer) contains model code and pre-trained weights in PyTorch.

The Hugging Face [Transformers](https://github.com/huggingface/transformers) library also contains an [implementation of Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k). It takes fewer than 10 lines of code to use pretty much any model from Hugging Face! The implementation provided by Hugging Face supports both PyTorch and TensorFlow.
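
As a sketch of what that looks like, the function below classifies an image with the checkpoint linked above. It assumes that `torch`, `transformers`, and Pillow are installed and that the weights can be downloaded. The function name `classify_image` is made up for this example, and the class names may differ between versions of the library.

```python
def classify_image(image, checkpoint="microsoft/swin-base-patch4-window7-224-in22k"):
    """Return the predicted label for a PIL image using a pretrained Swin checkpoint."""
    # Imported inside the function so that defining it does not require the libraries.
    import torch
    from transformers import AutoImageProcessor, SwinForImageClassification

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = SwinForImageClassification.from_pretrained(checkpoint)

    inputs = processor(images=image, return_tensors="pt")   # resize and normalize
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(-1).item()]
```

With an image loaded through Pillow, for instance `Image.open("cat.jpg")` where the file name is hypothetical, calling `classify_image(image)` returns the predicted class name as a string.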

Also, [PyTorch Image Models (known as timm)](https://github.com/rwightman/pytorch-image-models) by [Ross Wightman](https://github.com/rwightman) contains PyTorch implementations of various versions of Swin Transformer. Using any timm model as a [feature extractor](https://rwightman.github.io/pytorch-image-models/feature_extraction/) or for fine-tuning is also blazingly fast! Since Ross, the creator of timm, is now part of Hugging Face, we can expect [a tighter integration](https://twitter.com/wightmanr/status/1539649051267780608) between the two frameworks!

There are also [pretrained models of Swin Transformer on TensorFlow Hub](https://tfhub.dev/sayakpaul/collections/swin/1). The host repository of those models is found [here](https://github.com/sayakpaul/swin-transformers-tf#using-the-models). Another [complete unofficial TensorFlow implementation](https://github.com/VcampSoldiers/Swin-Transformer-Tensorflow) is also available. There is also an interactive tutorial on Swin Transformer on [keras.io](https://keras.io/examples/vision/swin_transformers/).

Lastly, Swin Transformer is available in [PyTorch Vision](https://github.com/pytorch/vision). The implementations of Swin Transformers provided here are not exhaustive. You can find other implementations [here](https://paperswithcode.com/paper/swin-transformer-hierarchical-vision).



<a name='7'></a>

## 7. Conclusion

We have been learning about Swin Transformer, a hierarchical Vision Transformer that made it possible to apply Transformers to various computer vision tasks such as object detection and image segmentation.

<a name='8'></a>

## 8. Further Learning

* [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu et al.](https://arxiv.org/abs/2103.14030)

* [Swin Transformer V2: Scaling Up Capacity and Resolution, Liu et al.](https://arxiv.org/abs/2111.09883)

* [Swin Transformer paper animated and explained, AI Coffee Break with Letitia](https://www.youtube.com/watch?v=SndHALawoag)

### [BACK TO TOP](#0)