<a name='0'></a>

# DETR: End-to-End Object Detection with Transformers

DETR (DEtection TRansformer) is one of the earliest object detection systems based on Transformers. In this notebook, we will look at the motivation behind DETR, its architecture, its training recipe, and its extension to panoptic segmentation.

***Outline***:

- [1. Introduction](#1) 
- [2. DETR Architecture](#2)
- [3. Visualizing Attention of DETR Encoder and Decoder](#3)
- [4. Why DETR Does Not Use NMS?](#4)
- [5. DETR for Panoptic Segmentation](#5)
- [6. Implementations of DETR](#6)
- [7. Conclusion](#7)
- [8. References and Further Reading](#8)

<a name='1'></a>

## 1. Introduction

Object detection is a computer vision task that deals with recognizing objects and locating them precisely in an image. Traditional object detectors such as Faster R-CNN rely on hand-engineered components such as non-max suppression (NMS) and anchor generation. Not only are these components complicated in design and hard to understand, they also introduce a computational bottleneck during training and inference.

From an architecture standpoint, most object detectors are based on Convolutional Neural Networks (CNNs). Transformers have shown remarkable performance in NLP thanks to self-attention, which focuses on the most important parts of the input. It is worth trying self-attention in object detection too, since some image regions are more important than others. How can we improve detection systems by combining CNNs and Transformers? That's the main question that DETR seeks to explore.

DETR, which stands for DEtection TRansformer, is an object detection system that uses a CNN backbone (ResNet, to be precise) followed by a transformer encoder and decoder. It's a simple and straightforward architecture that needs no hand-designed detection components beyond standard CNN and transformer layers.

When DETR was introduced, it achieved accuracy on par with a well-tuned Faster R-CNN baseline on the COCO object detection benchmark. In the following sections, we will walk through the architecture of DETR and its training recipe.

<a name='2'></a>

## 2. DETR Architecture

DETR treats object detection as a direct set prediction problem. In a single pass, it predicts a set of object labels and bounding boxes. During training, a bipartite matching loss assigns each ground-truth object to exactly one prediction.
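
To make the matching step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cost below is a simplified stand-in (negative class probability plus an L1 box distance, without the generalized IoU term and loss weights used in the paper), the helper names are hypothetical, and the brute-force search over assignments only works for tiny examples. The official code solves the same assignment problem with the Hungarian algorithm.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target):
    """Simplified cost (an assumption): low when the prediction gives the
    target class a high probability and its box is close to the target box."""
    class_prob = pred["probs"][target["label"]]
    return -class_prob + l1_distance(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Find the one-to-one assignment with the lowest total cost.

    Returns (assignment, cost) where assignment[j] is the index of the
    prediction matched to target j. Brute force, for illustration only.
    """
    best_assignment, best_cost = None, float("inf")
    for assignment in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], t) for i, t in zip(assignment, targets))
        if cost < best_cost:
            best_assignment, best_cost = assignment, cost
    return best_assignment, best_cost


# Boxes are (cx, cy, w, h) normalized to [0, 1]; labels: 0 = cat, 1 = dog.
preds = [
    {"probs": [0.10, 0.80], "box": (0.70, 0.50, 0.20, 0.30)},
    {"probs": [0.90, 0.05], "box": (0.25, 0.40, 0.30, 0.40)},
    {"probs": [0.20, 0.10], "box": (0.50, 0.50, 0.10, 0.10)},
]
targets = [
    {"label": 0, "box": (0.25, 0.45, 0.30, 0.40)},
    {"label": 1, "box": (0.70, 0.50, 0.20, 0.30)},
]

assignment, cost = bipartite_match(preds, targets)
# Predictions left without a target are the ones trained to say "no object".
unmatched = [i for i in range(len(preds)) if i not in set(assignment)]
print(assignment, unmatched)
```

Because every ground-truth object is matched to exactly one prediction, a second prediction of the same object gets no reward during training, which is what pushes the model away from duplicates.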

> In computer science, a set is a data structure that stores unique elements. In the object detection context, modelling detection as a set prediction problem means each object should appear only once in the predictions, with no near-duplicate boxes. This removes the need for non-max suppression (NMS).

![image](https://drive.google.com/uc?export=view&id=1cN7W9G3-6ExhgKK06K2fsh3u4sWFSgAb)

As you can see above, its architecture is very simple. It contains a CNN backbone, a transformer encoder and decoder, and MLPs, all of which are standard components (or layers) provided in modern deep learning frameworks. Let's discuss the architecture of DETR in detail:

* The first component of DETR is a CNN backbone network that extracts features from the input image. The authors use pre-trained ResNet-50 and ResNet-101 from [Torchvision](https://github.com/pytorch/vision).

* Since the channel dimension of the extracted features is high, it is reduced with a regular 2D convolution layer with a kernel size of 1. The spatial dimensions of the resulting feature map are then flattened into a sequence of feature vectors, since that is what the next stage expects.

* A transformer encoder takes the sequence of extracted features. A standard transformer encoder layer contains multi-head self-attention and an MLP. The self-attention layers attend to the most important parts of the image features.

* A transformer encoder cannot reason about the order of features on its own. We thus add positional encodings to the input of each attention layer. In a [standard transformer encoder](https://arxiv.org/abs/1706.03762), by contrast, positional information is added only once, to the input of the encoder. The authors report that injecting positional encodings into every attention layer improves results.

* A transformer decoder takes the output of the encoder and the object queries. Object queries are learnt positional encodings and, as in the encoder, they are added to the input of every attention layer in the decoder. The number of object queries is fixed, equals the number of output embeddings, and is denoted by `N` in the paper. Also, unlike a standard transformer decoder that produces the output embeddings auto-regressively (one token at a time), the DETR decoder transforms all `N` object queries into output embeddings in parallel.

* The last stage of the DETR architecture is the prediction heads (the paper uses FFNs, or Feed-Forward Networks, to refer to MLPs, which is a little bit confusing). The box head is an MLP made of 3 linear layers with ReLU activations, and the class label is predicted by a linear layer. The heads take the output embeddings from the decoder and compute the object labels and bounding boxes, which are the final predictions. Each final prediction is computed independently. The total number of predictions equals the `N` object queries. In addition to the object classes, there is an extra "no object" class that denotes that no object is detected in a slot. In an ablation, the authors showed that removing the feed-forward layers inside the transformer reduces the number of parameters but hurts performance, so these layers matter for good results.
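
To tie the stages above together, the sketch below traces tensor shapes through the pipeline in plain Python. It does not run a model. The default values are assumptions for illustration: 2048 channels and a stride of 32 correspond to the last stage of a ResNet-50, while a hidden size of 256, 100 object queries and 91 class ids are typical settings for COCO. The function name `detr_shape_flow` is hypothetical.

```python
import math


def detr_shape_flow(height, width, hidden_dim=256, num_queries=100,
                    num_classes=91, backbone_channels=2048, stride=32):
    """Return the shape of the data after each DETR stage (batch size omitted)."""
    # The backbone downsamples the image by `stride` in each spatial dimension.
    h, w = math.ceil(height / stride), math.ceil(width / stride)
    return {
        "image": (3, height, width),
        "backbone_features": (backbone_channels, h, w),
        # 1x1 convolution: only the channel dimension changes.
        "projected_features": (hidden_dim, h, w),
        # Flatten H x W into one sequence of feature vectors for the encoder.
        "encoder_sequence": (h * w, hidden_dim),
        # One learnt query per prediction slot; the decoder keeps this shape.
        "object_queries": (num_queries, hidden_dim),
        "decoder_output": (num_queries, hidden_dim),
        # The +1 is the extra "no object" class.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


shapes = detr_shape_flow(640, 480)
for name, shape in shapes.items():
    print(f"{name:>20}: {shape}")
```

Note that the number of predictions depends only on the number of object queries, not on the image size: a larger image lengthens the encoder sequence but still yields `N` labels and `N` boxes.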

DETR training uses auxiliary losses that we won't discuss for now. It also uses other training recipes relevant to transformers. You can read all the technical details about DETR in the paper.


> On Multi-Layer Perceptrons (MLPs) and Feed-Forward Networks (FFNs): most people still use FFN to refer to an MLP, but that is confusing since FFN refers to any model that maps input to output without any loop. MLPs are a good example of FFNs since they process input data in a feed-forward fashion. RNNs are not FFNs since they process input data in a recurrent fashion or, to put it in other words, RNNs have a feedback loop at every timestep.


<a name='3'></a>

## 3. Visualizing Attention of DETR Encoder and Decoder

The main component of both the DETR encoder and decoder is attention. Self-attention in the encoder allows the DETR model to reason globally about the image. Looking at the visualized attention below, the encoder is already able to separate object instances. Since the encoder does the hard job, the decoder only has to fill in the blanks.


![image](https://drive.google.com/uc?export=view&id=1fG03D1xfrz8eIVOag3uYkGGEsTh957tP)

The job of the decoder is to extract the objects and their locations from the output of the encoder and the object queries. Looking at the image below, the decoder attends to the extremities or boundaries of the objects.

![image](https://drive.google.com/uc?export=view&id=1ysxd0EQDTtskdzBXI5uhhF99SDqvZqwF)


From the visualizations of the encoder and decoder attention of the DETR model, the authors concluded that the encoder separates object instances via global attention (every position attends to all positions) while the decoder only needs to attend to the extremities of the objects.
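
What gets visualized in such figures is a row of attention weights: for one reference position, how much it attends to every other position. The toy sketch below computes these weights with scaled dot-product attention in plain Python. It is a simplification under stated assumptions: a single head, no learned query/key projections, and four hand-made feature vectors standing in for image positions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(features):
    """Row i holds how much position i attends to every position j."""
    dim = len(features[0])
    weights = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights.append(softmax(scores))
    return weights


# Positions 0 and 1 belong to one "object", positions 2 and 3 to another.
features = [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0], [0.1, 1.9]]
weights = self_attention_weights(features)
for row in weights:
    print([round(w, 2) for w in row])
```

Every position produces a weight for all positions (global attention), yet most of the weight lands on positions with similar features, which is the kind of instance separation seen in the encoder visualization.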

<a name='4'></a>

## 4. Why DETR Does Not Use NMS?

It's the norm for most object detectors to use non-max suppression to remove overlapping boxes, since the detection outputs are likely to contain many near-duplicate boxes. But DETR doesn't use NMS. Why?

The designers of DETR recorded its performance (AP, or Average Precision, and AP50) with and without NMS after every decoder layer. One of their findings is that NMS improves the performance (both AP and AP50) in the first decoder layer, because a single decoder layer is not enough for DETR "to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object."

In the final layers, NMS decreases the performance of DETR. Thus, the designers of DETR chose not to use it, since a deeper decoder can produce a set of non-duplicate predictions on its own.
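
For reference, the post-processing step that DETR drops can be sketched in a few lines. This is a generic greedy NMS in plain Python, not code from the DETR repository, and the IoU threshold of 0.5 is an assumed value chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps
    an already kept box by more than `iou_threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Boxes 0 and 1 cover the same object; box 2 is a different object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The rule is purely geometric: it cannot tell a duplicate from a genuinely separate object that happens to overlap heavily, which is one way NMS can remove true positives in the later decoder layers.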


![image](https://drive.google.com/uc?export=view&id=1Yngrl9UoA9hDrHtQzzNNtHcrco_x1H9a)

<a name='5'></a>

## 5. DETR for Panoptic Segmentation

DETR is not limited to object detection. It can also be used for panoptic segmentation, a scene understanding task that combines semantic segmentation (assigning a class label to every pixel) and instance segmentation (detecting and delineating each object with a bounding box and/or segmentation mask).

By adding just an extra panoptic head on top of the DETR transformer decoder output (i.e. the predicted objects), we get a unified system that can segment both stuff (amorphous regions such as sky or grass) and things (countable object instances). In brief, the panoptic head takes the detected objects from the DETR transformer decoder, generates a segmentation mask for each of them, and merges all the masks with a pixel-wise argmax. For this to cover stuff as well, DETR is trained to predict boxes around both stuff and things classes. The pixel-wise argmax then assigns every pixel to exactly one mask, so the final masks do not overlap.
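
The merging step is easy to illustrate. The sketch below applies a pixel-wise argmax to a stack of per-segment mask scores using plain Python lists. The scores, the tiny 2x3 image and the function name `merge_masks` are made up for the example; a real model would produce one score map per detected segment at a much higher resolution.

```python
def merge_masks(mask_scores):
    """Assign each pixel to the segment with the highest score.

    mask_scores[k][y][x] is the score of segment k at pixel (y, x).
    Returns a 2D map holding one segment index per pixel.
    """
    num_segments = len(mask_scores)
    height, width = len(mask_scores[0]), len(mask_scores[0][0])
    return [
        [
            max(range(num_segments), key=lambda k: mask_scores[k][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Three detected segments on a 2x3 image: 0 = sky, 1 = grass, 2 = dog.
sky = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]
grass = [[0.1, 0.1, 0.2], [0.7, 0.3, 0.6]]
dog = [[0.0, 0.1, 0.1], [0.1, 0.9, 0.1]]

panoptic_map = merge_masks([sky, grass, dog])
print(panoptic_map)
```

Since each pixel receives exactly one index, the merged result has no overlapping masks, whether a segment is a stuff region or a thing instance.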

![image](https://drive.google.com/uc?export=view&id=1MbJ7dTS0IEGxi0INSxLJJQgg84Rm2RP2)

Panoptic DETR can be trained in two ways. The first way is to combine DETR and the panoptic head and train them jointly from scratch. The second way is to train DETR on its own, freeze its weights, and train only the mask head on top of the pretrained DETR. The authors reported that both ways yield similar results but, as you can imagine, the latter requires less compute (i.e. it trains faster), and it is the one the authors preferred.

That's it about using DETR for panoptic segmentation. Below are some qualitative results.


![image](https://drive.google.com/uc?export=view&id=1IPZ1VKk7Mp8ALCx2-0Y5y4YpAcfZi9zV)


<a name='6'></a>

## 6. Implementations of DETR

DETR is [officially](https://github.com/facebookresearch/detr) implemented in PyTorch. It is also available in [PyTorch Hub](https://pytorch.org/hub/#model-row), a repository of pretrained state-of-the-art architectures. There are also notebooks that provide clear guidance on implementing a [minimal version of DETR](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_demo.ipynb#scrollTo=RSnU5JFxGeDe), [performing inference and visualizing the attention weights](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_attention.ipynb#scrollTo=ztbiY_b4YTEn), and using DETR for [panoptic segmentation](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/DETR_panoptic.ipynb#scrollTo=rYoKWqFyBWE9).

There are also TensorFlow implementations of DETR such as [this one](https://github.com/Visual-Behavior/detr-tensorflow). The [Hugging Face Transformers](https://github.com/huggingface/transformers) library also provides implementations of DETR for [object detection](https://huggingface.co/facebook/detr-resnet-50) and [panoptic segmentation](https://huggingface.co/facebook/detr-resnet-50-panoptic).

Also, the [OpenMMLab object detection library](https://github.com/open-mmlab/mmdetection/) provides [implementations of DETR](https://github.com/open-mmlab/mmdetection/tree/master/configs/detr).

Lastly, [Aman Arora](https://github.com/amaarora) provides a fantastic blog post of [Annotated DETR](https://amaarora.github.io/2021/07/26/annotateddetr.html).


<a name='7'></a>

## 7. Conclusion

In this notebook, we have seen a high-level overview of the DETR object detection architecture. DETR is one of the first papers that demonstrated the remarkable success of Transformers in computer vision and of [combining CNNs with self-attention](https://twitter.com/ylecun/status/1481198016266739715?s=20&t=yFhih0nLn5s5VyZ2mfccsA).

DETR is arguably a simple architecture: just a pretrained CNN, a transformer encoder and decoder, and MLPs, all of which are standard components in modern deep learning frameworks. The authors also elegantly provided an implementation of DETR inference in about 50 lines of code in the paper!

<a name='8'></a>

## 8. References and Further Reading

* [End-to-end object detection with Transformers - webpage](https://alcinos.github.io/detr_page/)
* [End-to-end object detection with Transformers - official implementation](https://github.com/facebookresearch/detr)
* [End-to-end object detection with Transformers - blog post](https://ai.facebook.com/blog/end-to-end-object-detection-with-transformers/)
* [End-to-end object detection with Transformers - video](https://www.youtube.com/watch?v=utxbUlo9CyY)

## [BACK TO TOP](#0)