<a name='0'></a>

# Introduction to Object Detection

Object detection is one of the core computer vision tasks. In this notebook, we will learn what it is, how it differs from other recognition tasks, and survey its applications, algorithms, datasets, and tools.

***Outline***:

* [What is Object Detection?](#1)
* [Classification Vs Localization Vs Detection Vs Segmentation](#2)
  * [Image Classification](#2-1)
  * [Object Localization](#2-2)
  * [Object Detection](#2-3)
  * [Image Segmentation](#2-4)
* [Applications of Object Detection](#3)
* [Modern Object Detectors](#4)
  * [Region Proposals](#4-1)
  * [Two Stage Object Detectors](#4-2)
  * [Single-Stage Object Detectors](#4-3)
  * [How to Choose an Object Detector](#4-4)
* [Object Detection Metrics](#5)
  * [Intersection over Union](#5-1)
  * [Mean Average Precision](#5-2)
* [Object Detection Datasets and Tools Landscape](#6)
  * [Object Detection Datasets](#6-1)
  * [Object Detection Tools](#6-2)
* [The Challenges of Object Detection](#7)
* [References and Further Learning](#8)

<a name='1'></a>
## 1. What is Object Detection?

Object detection is a core computer vision task that involves recognizing objects and their positions in images, and drawing bounding boxes around them.

Object detection answers a simple-sounding question: given an image containing one or more objects, can you recognize their categories (also called labels or classes) and their coordinates in the image? In other words, what objects are in the image, and where are they? The question is easy to pose but hard to answer well 😞

![image](https://drive.google.com/uc?export=view&id=1cjpCULzh0mdnELVL-P7pn2KdUofry8MW)

Object detection is a supervised learning task. Most object detectors are trained on labelled images: each image in the training set comes with the categories/labels and box coordinates of the objects it contains. The detection algorithm outputs a set of detected objects, each with a predicted label and a bounding box.
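
To make the idea of a detector's output concrete, here is a minimal sketch of how a set of detections could be represented in Python. The `Detection` class, the helper names, and the sample values are all hypothetical and only for illustration. The sketch also assumes boxes are stored as `(x_min, y_min, x_max, y_max)` in pixels; some datasets and tools use `(x, y, width, height)` instead, so conversion helpers are included.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, for illustration only)."""
    label: str    # predicted category
    score: float  # confidence of the prediction, between 0 and 1
    box: tuple    # assumed format: (x_min, y_min, x_max, y_max) in pixels


def xyxy_to_xywh(box):
    """Convert corner format to (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def xywh_to_xyxy(box):
    """Convert (x, y, width, height) back to corner format."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


# Made-up detections for a single image
detections = [
    Detection("dog", 0.92, (48, 60, 210, 300)),
    Detection("bicycle", 0.81, (120, 90, 400, 330)),
]

for det in detections:
    print(det.label, det.score, xyxy_to_xywh(det.box))
```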

We talked a lot about the bounding boxes...



<a name='2'></a>

## 2. Classification Vs Localization Vs Detection Vs Segmentation

In the previous section, we got a glimpse of object detection. Image classification, object localization, object detection, and image segmentation are the key computer vision tasks. Most deep learning practitioners understand image classification but confuse the other three. Let's distinguish these four tasks.

<a name='2-1'></a>
### 2.1 Image Classification


Image classification is one of the most popular visual recognition tasks. In image classification, we are interested in determining the category/label of a given image. *In brief, we answer the question: given an image, can you identify the label of the image?*

![image](https://drive.google.com/uc?export=view&id=1UHP-mxYd7HwUaRynnNN27TJK1DjHfle4)

In image classification, we are not interested in the spatial extent of the objects in the image, only in the label/category of the image.




<a name='2-2'></a>
### 2.2 Object Localization



Object localization deals with determining the category of a single object in an image and predicting the coordinates of its bounding box.

In essence, object localization combines image classification and regression. The label/category of the object is determined by a classification algorithm, and the coordinates of the box are determined by a regression method.

![image](https://drive.google.com/uc?export=view&id=1I81G1q_T7q56RjsOEFv1qSSbcmAFZ1Ev)

Object localization assumes that each image contains a single object, but this is rarely the case in the real world. In most visual applications, we want to detect multiple objects in an image as accurately and as fast as possible. We need a way to detect objects in an image that goes beyond localization!



<a name='2-3'></a>
### 2.3 Object Detection

Object detection is the task of recognizing the categories and bounding boxes (the spatial extent) of the objects in an image.

![image](https://drive.google.com/uc?export=view&id=1I_6T54NOuY8pgR5Y5nWlHko2lvTEgchV)

***In the image above, the detection was done by Vision-Explorer (with [Faster-RCNN](https://arxiv.org/abs/1506.01497)). P.S: I use the table for dining as well :-)***





<a name='2-4'></a>
### 2.4 Image Segmentation


Image segmentation is the task of grouping the parts/pixels of an image that belong to the same object category. The most popular type of segmentation is semantic segmentation, which associates each pixel of an image with a class label (such as boat in the image below). Semantic segmentation can also be seen as pixel-level prediction, since each pixel is associated with a class label.

More about segmentation in the next notebooks!!

![image](https://drive.google.com/uc?export=view&id=1ykeaPbar3tPiLYu1KCU65vemXfxP1kGM)
***In the image above, the segmentation map was produced by Vision-Explorer (with [DeepLabV3](https://arxiv.org/abs/1706.05587)).***

*************

>***Fun facts:***
- Object detection and segmentation rely on classification algorithms to identify the objects in an image.
- Object detection, localization, and segmentation typically use pretrained ConvNets (such as ResNet) as backbone networks.


<a name='3'></a>

## 3. Applications of Object Detection

In a visual world, objects are everywhere. Giving computers the ability to detect real-world objects has opened up many applications. Let's look at a few examples.

* Autonomous vehicles such as driverless cars use object detection algorithms to detect pedestrians, traffic signs, and other nearby cars.
* Face detection is one of the most popular applications of object detection. Nearly all smartphones and digital cameras have face detectors for improving camera focus.

* Modern security systems use object detection algorithms to recognize people who are allowed a certain access, detect harmful tools, etc...

* Modern retail shops (such as [this](https://www.youtube.com/watch?v=NrmMk1Myrxc)) use object detection algorithms to make the shopping experience simpler.

Object detection is also widely used in machine inspection (such as detecting faults or defects in products), medicine, etc... This list of applications is not exhaustive.

<a name='4'></a>

## 4. Modern Object Detectors

Modern object detectors are based on neural networks, most notably Convolutional Neural Networks (and, more recently, vision transformers). Traditional object detection algorithms were based on region proposal methods and sliding window approaches. Before we look at some modern detectors, let's review region proposal methods.









<a name='4-1'></a>
### 4.1 Region Proposals

Object detectors have come a long way. Early detection pipelines relied on region proposal methods, which used traditional image processing techniques to generate candidate regions likely to contain objects. Those regions were then fed to a classifier (an SVM in most papers).

[Selective search](http://www.huppelen.nl/publications/selectiveSearchDraft.pdf) is a popular region proposal method that generates candidate object locations. Selective search (and hand-crafted region proposal methods in general) is rarely used today, but it was used in [R-CNN](https://arxiv.org/pdf/1311.2524.pdf) (which we will talk about later), one of the most popular and influential papers in object detection and computer vision as a whole.

![image](https://drive.google.com/uc?export=view&id=1DAG6XXvqj_26JiP67fqK0kxyek4yN0pt)

Another region proposal method used in the 2010s is [Edge Boxes](https://pdollar.github.io/files/papers/ZitnickDollarECCV14edgeBoxes.pdf), which uses edges to generate bounding boxes likely to contain objects.

Now that we have an idea of region proposal methods, let's review the two types of modern object detectors: two-stage and single-stage detectors.

<a name='4-2'></a>
### 4.2 Two Stage Object Detectors

Two-stage object detectors consist of two stages. The first stage proposes regions that are likely to contain objects. The second stage contains a classifier that determines the class of each proposed region from its features and a regressor that predicts the coordinates of its bounding box.

For now, let's review some popular two-stage detectors at a high level. We will practice them in later labs.



#### R-CNN: Region-Based-CNN

[R-CNN or Region Based CNN](https://arxiv.org/pdf/1311.2524.pdf) is one of the first detectors with neural network components. Before R-CNN, most detectors were based on classical and ensemble methods.

R-CNN is a two-stage detector, but broadly it has 3 main parts. The first part is `selective search`, which generates candidate regions that may contain objects. The second part is a convolutional neural network (CNN) that extracts a fixed-length feature vector from each region. The CNN used in R-CNN was a pre-trained AlexNet. Using a pretrained backbone network in object detection is common and recommended, but AlexNet is no longer used as a backbone today. Most backbones are now ResNet, ResNeXt, and other better CNN architectures.

The third part is an SVM classifier that runs on top of the features extracted by the CNN to identify the class of each region, and a linear bounding box regressor that predicts the spatial extent of the object in the image.

![image](https://drive.google.com/uc?export=view&id=11FN116Gg1vxJdWLeIWCX6izpObrNxRKr)


During training, selective search first generates about 2k regions (object locations). Each region is labelled as positive if it contains an actual object (the ground truth, or GT) and negative if it doesn't. We then feed each region to a pretrained CNN (AlexNet back then), and predict its class with a linear SVM and its box coordinates with a linear regressor.

The image below shows the training procedure of R-CNN. It's taken from [Justin Johnson's Deep Learning for Computer Vision course slides](https://www.youtube.com/watch?v=TB-fdISzpHQ&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=15&t=1438s).

![image](https://drive.google.com/uc?export=view&id=1kH08ETZWGBywc7-YoeGHn5gU36uMFG38)

At test time, we apply selective search to the image to get about 2k region proposals, forward pass the regions through the CNN to extract their features, and then compute the class scores and box of each region using the trained linear SVMs and regressor respectively. To get rid of overlapping regions, "we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold." (Quoted from the [paper](https://arxiv.org/pdf/1311.2524.pdf)). The slide below from Justin Johnson's Deep Learning for Computer Vision course summarizes R-CNN at test time well.

![image](https://drive.google.com/uc?export=view&id=17yh_drUecs9SyricRslPpHv8oP2dCEDV)

R-CNN was a very important detector at the time because it swapped traditional ensembles for CNNs (which were taking off back then) and inspired later, more efficient detectors such as Faster R-CNN. But it was also very slow.


#### Fast R-CNN

Fast R-CNN or Fast Region Based CNN is the successor of R-CNN. In the last section, we saw that although R-CNN advanced the state of object detectors by replacing traditional ensembles with CNNs, it was very slow.

As the [Fast R-CNN paper](https://arxiv.org/abs/1504.08083) outlined, R-CNN had the following drawbacks:

- It had multiple training stages: computing the region proposals/object locations with selective search, extracting features from each object proposal using a pre-trained CNN, classifying the features with linear SVMs, and predicting the box coordinates with a linear regressor. If we leave out the region proposal step, R-CNN had 3 training stages (running the CNN on 2k object regions, fitting SVMs to detect objects, and learning bounding-box regressors).

- As a result of using 3 training stages, training R-CNN was expensive in both time and space.

- Finally, it was slow at test time. At test time, we have to repeat almost the same procedure as in training, except that it is for one image and we use the trained SVMs and box regressors. We compute the object locations from the test image, run the CNN on each of them, classify the object class, and predict the box coordinates. That's a lot to do at test time!

As the paper also noted, "R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation." In other words, the CNN is run separately on every region proposal, even though many proposals overlap and cover the same pixels. The image below (taken from Justin's slides) illustrates it better:

![image](https://drive.google.com/uc?export=view&id=13p1zMp_MdpCGyvDnWJae6uJDzkVjs8Ea)

To address the slowness of R-CNN, Fast R-CNN runs the backbone ConvNet (it can be AlexNet, VGG, ResNet, etc...) once on the whole image and extracts the features of each region proposal from the shared feature map. It also replaces the SVMs and linear bounding box regressors with two fully connected output layers on top of that ConvNet, trained jointly with a `multi-task loss`: one layer with a softmax classifier that computes the object class probabilities, and another that predicts the box coordinates of the regions.
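
For reference, the multi-task loss mentioned above can be written as in the Fast R-CNN paper. Here $p$ is the predicted class distribution, $u$ is the ground truth class (with $u = 0$ for background), $t^u$ is the predicted box for class $u$, and $v$ is the ground truth box:

$$
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)
$$

$L_{cls}(p, u) = -\log p_u$ is the log loss for the true class, $L_{loc}$ is a smooth L1 loss over the box coordinates, and the indicator $[u \geq 1]$ switches off the box loss for background regions. The paper sets $\lambda = 1$ in its experiments.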

Fast R-CNN (with a pretrained VGG backbone) is 9x faster than R-CNN during training and 213x faster at test time. That's a huge difference in speed, and it's due to using the same CNN as the feature extractor (backbone network), the object class classifier, and the box coordinates predictor!!

![image](https://drive.google.com/uc?export=view&id=1RlyA34cXnBTCN0O_ymihSHGCSzvwgtUi)

As Justin also mentioned in his [object detection lecture](https://www.youtube.com/watch?v=TB-fdISzpHQ&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=16&t=2829s), "most of the computation happens in backbone network; this saves work for overlapping region proposals".

![image](https://drive.google.com/uc?export=view&id=11H9qjgi8seoRff70gkSelY97TlrH5AYF)

Fast R-CNN was faster than R-CNN, but there was still room for improvement. It was still expensive to compute region proposals using selective search. And that is exactly what [Faster R-CNN](https://arxiv.org/abs/1506.01497v3) addressed!



#### Faster R-CNN

Fast R-CNN was faster than R-CNN, but computing region proposals with selective search was still expensive. It was a computational bottleneck. In fact, most of the runtime of Fast R-CNN was consumed by region proposals.

[Faster R-CNN](https://arxiv.org/abs/1506.01497v3) is an improved and more efficient version of Fast R-CNN (titles in the research community are sometimes fun :-)). Rather than using the selective search algorithm to generate region proposals, Faster R-CNN uses a Region Proposal Network (RPN), a CNN-based network that generates object locations and shares convolutional layers with the backbone network (such as AlexNet, VGG, ResNet, etc..). Essentially, Faster R-CNN is Fast R-CNN with an RPN. Since the object class and box coordinates are already computed by a fine-tuned pretrained ConvNet (the backbone) as in Fast R-CNN, every part of Faster R-CNN is now a neural network. As the authors noted, the entire system is "a single, unified network for object detection."

The image below (from Justin's slides) illustrates Faster R-CNN and its training time compared to Fast R-CNN and R-CNN.

![image](https://drive.google.com/uc?export=view&id=1Mqa34VoZSSfQyE8WH8GtAGT0XrHFA6W8)


Because computation is shared across the whole network, Faster R-CNN is a lot faster (11x) than Fast R-CNN and far faster (245x) than R-CNN.
![image](https://drive.google.com/uc?export=view&id=1jQ7QyrY6BboFbE-suxiFX3jbU9vmHMDQ)

We have been reviewing two-stage detectors. Two-stage detectors employ more than one stage in the pipeline. In Faster R-CNN, for example, the first stage uses the Region Proposal Network (RPN) to compute region proposals, and the second stage computes the object class and box coordinates.

One might wonder if we actually need the second stage. In the next section, we will review single-stage detectors.

For more about the two-stage detectors we have seen so far, below are their papers:

* [R-CNN - Rich feature hierarchies for accurate object detection and semantic segmentation](https://arxiv.org/abs/1311.2524)

* [Fast R-CNN](https://arxiv.org/abs/1504.08083)

* [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497v3)
 



<a name='4-3'></a>
### 4.3 Single-Stage Object Detectors

Single-stage detectors use a single network (typically a ConvNet) to predict the object classes and box coordinates. Most single-stage object detectors don't have a separate region proposal step that must be run on its own (selective search in R-CNN) or a proposal network that must be trained (the RPN in Faster R-CNN). Instead, they have a single network that predicts the object classes and bounding boxes directly from the image. Let's look at some examples of single-stage detectors.

Two of the early single-stage detectors are [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325v5) and [YOLO v1(You Only Look Once: Unified, Real-Time Object Detection)](https://arxiv.org/abs/1506.02640v5). Both SSD and YOLO use one straight convolutional neural network to predict the classes and bounding boxes of the objects in a single forward pass.

Expanding on the latter, YOLO is one of the fastest real-time object detectors, and it has undergone many incremental improvements since its initial release in 2015. Below is the architectural overview of the first version of the YOLO detection system.



![image](https://drive.google.com/uc?export=view&id=15cbLnB4VuUiF55cFevE_VWwzb_5B7dB6)

The next versions of the YOLO detection system ([YOLOv2 2016](https://arxiv.org/abs/1612.08242v1), [YOLOv3 2018](https://arxiv.org/abs/1804.02767v1), [YOLOv4 2020](https://arxiv.org/abs/2004.10934v1), [YOLOX 2021](https://arxiv.org/abs/2107.08430v2) and its [doc](https://yolox.readthedocs.io/en/latest/)) kept getting better and faster, to the point that they also outperformed some two-stage detectors on benchmark datasets. [You can learn more about YOLOs at Joseph Redmon’s website](https://pjreddie.com/darknet/). Check also this funny [demo](https://www.youtube.com/watch?v=MPU2HistivI) of YOLOv3.

SSD and YOLO are not the only single-stage detectors. Another popular single-stage detector, and perhaps one of the more efficient ones, is RetinaNet. [RetinaNet](https://arxiv.org/abs/1708.02002v2) is a single unified object detector that has a backbone network (a [feature pyramid network (FPN)](https://arxiv.org/abs/1612.03144v2) on top of a pre-trained ResNet) and two subnetworks that predict the object class and bounding boxes respectively. RetinaNet also introduced focal loss, a dynamically scaled [cross entropy loss](https://en.wikipedia.org/wiki/Cross_entropy) that is used to address the class imbalance problem during training.
![image](https://drive.google.com/uc?export=view&id=1f6_oGqPd1X8gKX_5mo9-5XNI7YWFM0kZ)
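
To see how focal loss down-weights easy examples, here is a minimal sketch for a single binary prediction, following the form in the RetinaNet paper, $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$. The function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the values the paper reports as working well; treat them as assumptions to tune rather than fixed choices.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for one prediction.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy example (true class predicted with p_t = 0.9) versus a hard one (p_t = 0.1)
for p in (0.9, 0.1):
    print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With `gamma=0` and `alpha=0.5`, the focal loss reduces to half of the cross entropy, which is a quick sanity check that the scaling factor is the only difference between the two.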



There are many more single-stage detectors such as [EfficientDet](https://arxiv.org/abs/1911.09070v7), [CenterNet](https://arxiv.org/abs/1904.08189v3), [RetinaMask](https://arxiv.org/abs/1901.03353v1) (an improved RetinaNet), etc... You can read their papers if you want to learn more about single-stage detectors.

Up to this point, we understand pretty much what we need to know about single-stage and two-stage detectors. In the next section, we will see the difference between them.


<a name='4-4'></a>
### 4.4 How to Choose an Object Detector 

Selecting an object detection algorithm can be complicated. There are many choices that you have to make such as whether to use single stage or two-stage, the specific architecture, target hardware, the feature extractor to use, the input image resolution, etc...

As a 2016 paper ([Jonathan Huang et al.](https://arxiv.org/abs/1611.10012)) noted, "*modern object detectors based on convolutional neural networks are now good enough to be deployed in consumer products (ex: Google Photos, Pinterest Visual Search) and some have been shown to be fast enough to be run on mobile devices. However, it can be difficult for practitioners to decide what architecture is best suited to their application. Standard accuracy metrics, such as mean average precision (mAP), do not tell the entire story, since for real deployments of computer vision systems, running time and memory usage are also critical. For example, mobile devices often require a small memory footprint, and self driving cars require real time performance.*" So, it's really important that we understand the trade-offs between different detectors and the factors that affect their performance.

Single-stage detectors (such as SSD, YOLO, RetinaNet, EfficientDet, etc..) tend to be fast because they consist of a single unified network that does the whole computation (from feature extraction to object class and bounding box prediction), but they have generally had lower detection accuracy (measured by mean average precision, a standard metric for object detectors). On the other hand, two-stage detectors (Fast R-CNN, Faster R-CNN) tend to have higher mean average precision, but they are slower because the second stage has to be run on every region produced by the first stage.

Another important factor that largely affects the performance of an object detector is the backbone network or feature extractor, which is most of the time a convolutional neural network (or, more recently, a vision transformer). As [Girshick et al.](https://arxiv.org/abs/1311.2524) highlighted in R-CNN (one of the earliest detectors that used CNNs), "features matter". Thus, the choice of feature extractor has a tremendous effect on detection accuracy, speed, number of parameters, etc... On a similar but slightly different note, as the [paper](https://arxiv.org/abs/1611.10012) we mentioned earlier said, "*there is indeed an overall correlation between classification and detection performance. However this correlation appears to only be significant for Faster R-CNN and R-FCN while the performance of SSD (single-stage detectors) appears to be less reliant on its feature extractor’s classification accuracy.*" The common feature extractors are ResNet (50/101/152), Inception ResNet, ResNeXt, ConvNeXt, etc... When it comes to selecting a feature extractor, a rule of thumb is that what works well on ImageNet tends to work well for detection too. The image below shows the performance of different detectors with different feature extractors.


![image](https://drive.google.com/uc?export=view&id=1Xmnseq_kTC-PjnmjokYhOv7ujXthxYPl)

Another key finding from the same paper is that the resolution of the input image affects the accuracy and speed of object detectors. High resolution (600px) tends to give higher accuracy, but it is slower. Lower resolution tends to be faster, but it is less accurate (and very likely to miss smaller objects). As the authors articulated, "high resolution inputs allow for small objects to be resolved." They also noticed that high accuracy on small objects implies high accuracy on large objects, but not the other way around: SSD, for example, has high accuracy on large objects and low accuracy on small objects. The image below shows that well!

![image](https://drive.google.com/uc?export=view&id=147k8ZVRl2VXktlEvSK6CNoGFB_fkpWl9)


[Jonathan Huang et al.](https://arxiv.org/abs/1611.10012) presented a really nice framework for selecting object detectors based on different factors. As we saw, feature extractors matter a lot. To get better detection accuracy, a better and perhaps bigger feature extractor is needed. Also, two-stage detectors tend to have better accuracy but they are slower. On the contrary, single-stage detectors tend to be fast, but they have lower accuracy, especially on smaller objects. And finally, the resolution of the input image affects speed and accuracy. High resolution leads to better accuracy (but is slow), and low resolution is fast (but less accurate).

A little caveat here: the paper is old, and given how vibrant the computer vision research community is, many things have obviously changed. There are some single-stage detectors that perform better than old two-stage detectors, feature extractors have improved, GPUs have become faster, and new neural network architectures such as Transformers have emerged ([Vision transformers](https://arxiv.org/abs/2010.11929), most notably [Swin Transformer](https://arxiv.org/abs/2103.14030), have served as backbones for state-of-the-art detectors on the COCO dataset). The image below from [Papers with Code shows the progress of object detectors](https://paperswithcode.com/sota/object-detection-on-coco) from 2016 to 2022. Notice that the state-of-the-art at the time of writing is [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605v3). This is really a vibrant field!!




![image](https://drive.google.com/uc?export=view&id=1kK68k_GmsS1MzdKOIctVixnEzFjBzmIC)


<a name='5'></a>

## 5. Object Detection Evaluation Metrics

In image classification, it is fairly easy to evaluate the performance of a classifier with standard accuracy. But object detection is different, and the metrics used can be confusing.

Recall that in object detection, we are interested in two things: predicting the classes of the objects and their bounding boxes. So, the metrics we use should take both into account. The most commonly used metric for evaluating object detectors is Mean Average Precision (mAP). Before talking about mAP, let's talk about Intersection over Union (IoU).












<a name='5-1'></a>
### 5.1 Intersection over Union(IoU)



Most object detectors output the box coordinates and object class scores. 

Intersection over Union (IoU) is the ratio of the overlapping area (the intersection) to the area of the union of two boxes (the predicted box and the ground truth/actual box).

$$ IoU = \frac{Intersection}{Union} $$

The image below illustrates the IoU between two boxes. The image is borrowed from [Jonathan Huang's lecture on object detection](https://courses.cs.washington.edu/courses/csep576/20sp/lectures/8_object_detection.pdf).

![image](https://drive.google.com/uc?export=view&id=17JE7lCdvzdB37ZImBEKyFDnv8o9M72bf)


Intersection over Union is always a value between 0 and 1, and higher is better. If the predicted box and the ground truth/actual box don't overlap at all, their IoU is 0. If they match exactly, their IoU is 1, which is almost impossible in practice. An IoU of 0.5 is commonly used as a threshold, and anything above the threshold is considered good. If you get 0.9, for example, that's an excellent result!
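
The IoU of two boxes can be computed in a few lines. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function name `iou` and the sample boxes are only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x_min, y_min, x_max, y_max) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Corners of the overlapping rectangle (if the boxes overlap at all)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Width and height are clamped at 0 so disjoint boxes get zero intersection
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predicted = (1, 1, 3, 3)
ground_truth = (0, 0, 2, 2)
print(iou(predicted, ground_truth))  # intersection 1, union 7
```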

Also, if the detector recognizes an object in a predicted box but its IoU with the ground truth is less than the threshold, the predicted box is counted as a `false positive`. Otherwise, it is a `true positive`. The (actual) objects that the detector fails to detect are `false negatives`. Earlier, we said that the metrics we use should take the box coordinates into account. That is the case here because false negatives, false positives, and true positives, which all depend on IoU, are what Mean Average Precision (mAP), a standard metric for object detectors that we will see soon, is computed from!
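
Counting true positives, false positives, and false negatives for one image and one class can be sketched as follows. This is one possible matching rule, not the exact rule of any particular benchmark: detections are visited from highest to lowest score, and each ground truth box can be matched at most once. The function names and sample boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    """Return (true positives, false positives, false negatives)."""
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched = set()  # indices of ground truth boxes already claimed by a detection
    tp = fp = 0
    for i in order:
        # Find the unmatched ground truth box that overlaps this detection the most
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            overlap = iou(pred_boxes[i], gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)  # objects the detector missed
    return tp, fp, fn


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8]
print(count_tp_fp_fn(preds, scores, gt))
```

In this made-up example, the first prediction overlaps the first ground truth box well enough to be a true positive, the second prediction matches nothing, and the second ground truth box is missed.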

Typically, there are many overlapping bounding boxes in the output of a detector. The algorithm used to discard those overlapping boxes is called [Non Max Suppression(NMS)](https://paperswithcode.com/method/non-maximum-suppression). NMS is a classical computer vision technique, but in the object detection context and at a high level, it works this way: we pick the predicted box that has the highest confidence score (or class score), keep it, and eliminate all remaining boxes whose IoU with that box is greater than a threshold. We then repeat the process with the boxes that are left until none remain.

NMS helps to get rid of overlapping predicted boxes, but when there are many similar objects close to each other in an image, it's very challenging to tell whether overlapping predicted boxes are duplicates of the same object or detections of different objects.
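
A minimal sketch of greedy NMS for the boxes of a single class is shown below. Overlap is measured between predicted boxes: each candidate is compared with the highest-scoring box that has just been kept. The function names, the threshold of 0.5, and the sample boxes are assumptions for illustration; real detection libraries ship optimized versions.

```python
def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. Returns the indices of the boxes to keep."""
    # Candidate indices sorted from highest to lowest score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

Here the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box is far away and survives.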

<a name='5-2'></a>
### 5.2 Mean Average Precision(mAP)

Mean Average Precision is a standard metric used in evaluating the performance of the object detectors.

Precision is the accuracy of the positive predictions. It answers the question: *what percentage of positive predictions are actually positive?* Recall, on the other hand, is the percentage of ground truth samples that were predicted correctly. It answers the question: *what percentage of actual positives (ground truth) were predicted correctly?* Both precision and recall are always between 0 and 1, and higher is better, but there is usually a trade-off between them.

$$
Precision = \frac{True \ positives(TP)}{True \ positives(TP) + False \ positives(FP)}
$$


$$
Recall = \frac{True \ positives(TP)}{True \ positives(TP) + False \ negatives(FN)}
$$
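
As a small worked example of the two formulas above, suppose a detector produces 10 boxes of which 8 are true positives and 2 are false positives, and it misses 4 actual objects. The counts and the function name are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts; returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(precision)  # 8 of the 10 predicted boxes are correct
print(recall)     # 8 of the 12 actual objects were found
```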

The average precision (AP) is the mean/average of the precisions measured at different values of recall. Average precision is also known as the area under the precision-recall curve (AUC-PR).


![image](https://drive.google.com/uc?export=view&id=1Lmn04KbkLEwigSEipn_JRqEqk027tQik)

Average precision is computed for each object category. That is, for each category, we compute the average of the precisions measured at different recall levels. Mean Average Precision (mAP), our ultimate detection metric, is then the mean of the average precisions computed over all categories/classes.
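
The sketch below computes AP for one class as the area under an interpolated precision-recall curve, then averages over classes to get mAP. It assumes each detection has already been marked as a true or false positive at a chosen IoU threshold. Benchmarks differ in how they sample the curve, so this is one common convention rather than the exact procedure of any specific dataset. The function names and numbers are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class.

    scores: confidence score of each detection
    is_tp:  True if the detection is a true positive, False if a false positive
    num_gt: number of ground truth objects of this class
    """
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:  # walk down the ranked list of detections
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolation: precision at a recall level is the best precision at any higher recall
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the curve: sum of (recall increase) * precision
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ap_cat = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
print(ap_cat)
print(mean_average_precision({"cat": ap_cat, "dog": 0.5}))
```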

In most papers, mean average precision is reported in a form like `mAP@0.8 = 85`: 0.8 is the IoU threshold and 85 is the mean average precision (mAP). Some papers keep it simple and only say `mAP`. For benchmark datasets like COCO and Pascal VOC, mAP can be reported as, for example, `COCO mAP = 0.8`.


<a name='6'></a>

## 6. Object Detection Datasets and Tools Landscape

The computer vision community is blessed with many open source tools and datasets. Over the past decade, there has been massive development of not only object detection algorithms, but also tools and datasets (2D datasets, although the real world is not 2D).

Object detection is a very complicated task. So, unless you are solving a problem that is completely new (which rarely happens), you should try as much as you can to use existing detectors that work great on benchmark datasets (such as COCO). Getting enough data, annotating it yourself, and then building your own detector is a real hurdle. The nice thing is that there are open-source tools that provide nearly all object detectors, and it's possible to customize them on your own dataset rather than building and training your detector from scratch.

Let's review some common object detection datasets and tools.







<a name='6-1'></a>

### 6.1 Object Detection Datasets

* **PASCAL VOC (PASCAL Visual Object Classes Challenge):** Pascal VOC is one of the earliest object detection datasets. The challenge run between 2005-2012 and it largely contributed to the development of object detection algorithms. The dataset contains 20 object categories and their annotations. More about the dataset can be found on its [homepage](http://host.robots.ox.ac.uk/pascal/VOC/) and the final [report](http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham15.pdf) on retrospect and challenges.

* **Microsoft COCO(Microsoft Common Objects in Context):** Microsoft COCO is one of the most popular object detection benchmarks datasets in computer vision community. COCO contains about 330K images (most of which are labelled) and 80 object categories. COCO is not only just for object detection. It also extends to image segmentation and image captioning. Most state-of-the-art object detectors are evaluated on COCO dataset. More about COCO dataset can be learned from its [website](https://cocodataset.org/#home) and [paper](https://arxiv.org/abs/1405.0312). You can also visualize the dataset using [FiftyOne](https://voxel51.com/docs/fiftyone/).

* **Open Images** is another popular dataset and one of the largest for object detection: it contains approximately 9M images and 15,851,536 boxes across 600 categories. Open Images also goes beyond object detection: it contains data for image classification, visual relationship detection, instance segmentation, and multimodal image descriptions. You can learn more about the data [here](https://storage.googleapis.com/openimages/web/index.html) or explore it [here](https://storage.googleapis.com/openimages/web/visualizer/index.html).
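
One practical detail when moving between these datasets is that they do not all store bounding boxes the same way. COCO annotations store a box as `[x_min, y_min, width, height]`, while PASCAL VOC annotations store the two corners `xmin, ymin, xmax, ymax`. The snippet below is a minimal sketch of converting between the two conventions. The function names and the sample annotation record are made up for illustration and are not part of any dataset's official tooling.

```python
def coco_to_voc(box):
    """Convert [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return [x_min, y_min, x_min + width, y_min + height]


def voc_to_coco(box):
    """Convert [x_min, y_min, x_max, y_max] to [x_min, y_min, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


# A hypothetical COCO-style annotation record (all values are made up).
annotation = {"image_id": 1, "category_id": 18, "bbox": [50, 40, 200, 100]}

corner_box = coco_to_voc(annotation["bbox"])
print(corner_box)               # [50, 40, 250, 140]
print(voc_to_coco(corner_box))  # [50, 40, 200, 100]
```

Mixing up the two conventions is an easy mistake to make, so it is worth checking which one a dataset or tool expects before training or evaluating.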



<a name='6-2'></a>
### 6.2 Object Detection Tools



Object detectors based on deep learning emerged around 2014. At that time, deep learning frameworks like [PyTorch](https://pytorch.org) and [TensorFlow](https://www.tensorflow.org) did not exist yet (TensorFlow was released in 2015, and PyTorch in the year that followed). It's impressive that only a few years later, the community had open-source detection tools built on top of those frameworks. Let's review some popular object detection tools:

* **TensorFlow Object Detection API (TF OD)**: TF OD is probably the earliest object detection API. It contains different ready-to-use object detection models such as SSD with MobileNet, RetinaNet, Faster R-CNN, Mask R-CNN, EfficientDet, and CenterNet. TF OD also provides COCO pre-trained weights, which means you can use those models out of the box for inference. See the [release article](https://blog.tensorflow.org/2020/07/tensorflow-2-meets-object-detection-api.html) for more about the API, the [model zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md), and its [GitHub repository](https://github.com/tensorflow/models/tree/master/research/object_detection).

* **Detectron2** is an object detection tool from Meta AI (formerly FAIR) that runs on top of PyTorch. It's one of the simplest and best detection tools! It contains state-of-the-art object detection models such as Fast R-CNN, Faster R-CNN, and RetinaNet. Detectron2 also contains models for image segmentation. For more, see its [GitHub repository](https://github.com/facebookresearch/detectron2), [documentation](https://detectron2.readthedocs.io/en/latest/index.html), and [model zoo](https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md), as well as [this blog post](https://ai.facebook.com/blog/detectron-everingham-prize/) and [this announcement](https://research.facebook.com/blog/2018/01/facebook-open-sources-detectron/).

* **OpenMMLab** is a more recent object detection library built on top of PyTorch. OpenMMLab goes beyond object detection: it also covers other tasks such as segmentation, text detection, self-supervised learning, video tracking, and pose estimation. Learn more on its [GitHub page](https://github.com/open-mmlab) and [website](https://openmmlab.com).


This list of tools is not meant to be exhaustive. Other object detection tools you can check out include [KerasCV (under development)](https://github.com/keras-team/keras-cv), [SimpleDet](https://github.com/TuSimple/simpledet), and [GluonCV](https://cv.gluon.ai).

Most of those tools are pretty good at what they do, so you are better off using them than building your own object detectors. The choice of tool will probably depend on your main deep learning framework (TF OD or KerasCV for TensorFlow, Detectron2 or OpenMMLab for PyTorch), the task, the deployment options, etc.
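
Whichever tool you pick, a pre-trained detector typically returns boxes, confidence scores, and class ids for each image, and a common first step is to drop low-confidence detections. The snippet below is a minimal, framework-agnostic sketch of that step. It assumes the raw outputs have already been converted to plain Python lists; the `filter_detections` helper, the tiny `LABEL_MAP`, and the sample values are all hypothetical, since each tool has its own output structure and ships its own label map.

```python
# Hypothetical subset of a label map; real tools ship their own full mapping.
LABEL_MAP = {0: "person", 1: "bicycle", 2: "car"}


def filter_detections(boxes, scores, class_ids, score_threshold=0.5):
    """Keep detections whose confidence score is at least score_threshold."""
    kept = []
    for box, score, class_id in zip(boxes, scores, class_ids):
        if score >= score_threshold:
            kept.append({
                "label": LABEL_MAP.get(class_id, "unknown"),
                "score": score,
                "box": box,
            })
    return kept


# Made-up raw outputs for one image; boxes are [x_min, y_min, x_max, y_max].
boxes = [[10, 20, 110, 220], [30, 40, 90, 100], [200, 50, 400, 180]]
scores = [0.92, 0.31, 0.77]
class_ids = [0, 1, 2]

detections = filter_detections(boxes, scores, class_ids)
for det in detections:
    print(det["label"], det["score"], det["box"])
```

The threshold of 0.5 is only an illustrative default; in practice it is tuned per application, trading missed objects against false alarms.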

<a name='7'></a>

## 7. The Challenges of Object Detection

As [Syed et al.](https://arxiv.org/pdf/2104.11892.pdf) said in their object detection survey paper, "computer vision has come a long way in the past decade, however it still has some major challenges to overcome." In addition to the challenges mentioned in the paper, here are a few more:

- **Dealing with multiple outputs:** *In image classification, life is easy (CC: Jonathan Huang)* because you only deal with the class of the image, but in object detection, you also need to locate the objects in the image. Given that real-world images contain many objects, it's hard to handle the outputs for every object in the image.

- **Object detectors require high computational resources** due to a number of factors, such as using high-resolution images (such as 800x600) and very large networks that, as you might guess, take longer to train. With that computational bottleneck, it's also hard to deploy object detectors on devices with low memory, such as mobile phones and edge devices in general.

- **Dealing with zillions of categories:** Real-world images contain a huge number of object categories, and their spatial arrangement can be challenging. I think there is no better way to show how challenging this is than the image below. Hat tip to Peyman Milanfar for the [photo](https://twitter.com/docmilanfar/status/1513335231532597251).

![image](https://drive.google.com/uc?export=view&id=11PAPF0q0DhLxdP_djvdcRYb2mp44meo1)
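
To put the computational cost mentioned in the list above into rough numbers, the sketch below compares the size of a single input image at a detection-style resolution (800x600, as in the text) with a 224x224 input. The 224x224 figure is an assumption chosen here as a typical image classification input size, and the calculation assumes 3 color channels stored as 32-bit floats. It only counts the input tensor, not the network's activations or weights.

```python
def input_tensor_bytes(width, height, channels=3, bytes_per_value=4):
    """Bytes needed to store one image as a float32 tensor."""
    return width * height * channels * bytes_per_value


detection_bytes = input_tensor_bytes(800, 600)       # detection-style input
classification_bytes = input_tensor_bytes(224, 224)  # assumed classification input

print(f"800x600 input: {detection_bytes:,} bytes")       # 5,760,000 bytes
print(f"224x224 input: {classification_bytes:,} bytes")  # 602,112 bytes
print(f"ratio: {detection_bytes / classification_bytes:.2f}x")  # 9.57x
```

Since the feature maps inside a convolutional network grow with the input size, a roughly 10x larger input gives a feel for why detectors are heavier to train and harder to fit on low-memory devices.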

In addition to those challenges, the metric used to evaluate object detectors (mean average precision, mAP) is not intuitive at all.


<a name='8'></a>

## 8. References and Further Learning

Object detection is one of the core tasks in computer vision. It answers the question: given an image containing one or more objects, ***what*** are those objects and ***where*** are they located in the image?

If you would like to learn more about object detection, check the following materials:

* [Lecture 15 - Object Detection Deep Learning for Computer Vision by Justin Johnson](https://www.youtube.com/watch?v=TB-fdISzpHQ&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=16&t=535s)

* [Papers with Code: Object Detection](https://paperswithcode.com/task/object-detection)

* [A Survey of Modern Deep Learning based Object Detection Models](https://arxiv.org/abs/2104.11892)

#### [BACK TO TOP](#0)