# Key architectures of object detection models

## 1. Classic algorithms
### Viola-Jones Detector (2001) (Haar Cascade classifier)
1. Classify image patches under a sliding window.
2. Extract Haar-like features:
<img src="1_images/haar_features.jpg" width="800">

(sum of pixels under black rectangles - sum of pixels under white rectangles)
A 24*24 window yields more than 160,000 candidate features.
3. Use the "integral image" to compute features efficiently.
4. Use AdaBoost over weak classifiers (1-2 level decision trees).

<img src="1_images/haar_classifier.jpeg" width="800">
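The integral image trick can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the toy input, and the choice of which half counts as "black" are assumptions made for the example. Once the summed-area table is built, any rectangle sum costs four lookups regardless of its size.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), using four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """Horizontal two-rectangle feature: left half (black) minus right half (white).
    Which half is 'black' is an assumption made for this sketch."""
    half = w // 2
    black = rect_sum(ii, y, x, h, half)
    white = rect_sum(ii, y, x + half, h, half)
    return black - white

window = np.arange(24 * 24).reshape(24, 24)   # toy 24x24 "image"
ii = integral_image(window)
feature = two_rect_feature(ii, y=4, x=6, h=8, w=10)
```

The same `rect_sum` helper covers three- and four-rectangle features as well, since each is just a signed combination of rectangle sums.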

### HOG Detector (2005)
Extract HOG (histogram of oriented gradients) features. Train an SVM to classify each window in a sliding-window fashion.
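A simplified sketch of the feature extraction step, assuming 8x8 cells and 9 unsigned orientation bins. The helper names are hypothetical, and overlapping block normalisation is left out for brevity, so this is not the full descriptor.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    cell = cell.astype(np.float64)
    gy, gx = np.gradient(cell)                       # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    bin_width = 180.0 / n_bins
    # Clamp guards against an angle that rounds to exactly 180.0.
    bins = np.minimum((angle // bin_width).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def hog_descriptor(window, cell_size=8, n_bins=9):
    """Concatenate L2-normalised cell histograms (simplified: no overlapping blocks)."""
    h, w = window.shape
    hists = []
    for y in range(0, h - cell_size + 1, cell_size):
        for x in range(0, w - cell_size + 1, cell_size):
            hist = cell_histogram(window[y:y + cell_size, x:x + cell_size], n_bins)
            hists.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(hists)

# A horizontal ramp has gradients along x only, so all energy lands in bin 0.
ramp = np.tile(np.arange(8.0), (8, 1))
hist = cell_histogram(ramp)
descriptor = hog_descriptor(np.tile(np.arange(64.0), (128, 1)))   # 128x64 window
```

The resulting vector is what the SVM would receive for each window position.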
### DPM (2008)
Uses HOG features (or other).
1. A coarse root filter defines a detection window that approximately covers an entire object. A filter specifies weights for a region feature vector.
2. Multiple part filters cover smaller parts of the object. Part filters are learned at twice the resolution of the root filter.
3. A spatial model for scoring the locations of part filters relative to the root.

Features:
<img src="1_images/dpm_features.png" width="1000">

Model:
<img src="1_images/dpm_model.png" width="1000">

### Overfeat (2013)
1. Train CNN for classification in sliding window fashion.
2. Replace the last classification layer with a regression layer (4 coordinates) and train it.
3. Use these networks to detect objects and merge highly overlapped regions.

## 2. Two-stage NNs
### RCNN (2014)
1. Run the Selective Search algorithm to generate candidate regions.
Selective Search groups pixels into regions of similar color. Candidates are these regions plus their unions.
2. Classify each candidate using a CNN.
With ~2000 candidates per image, it takes about 47 seconds to process 1 image.
<img src="1_images/rcnn_model.jpg" width="800">

### ZF-Net
Based on RCNN.
1. Selective Search.
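2. Feed the whole image into the CNN once.
3. Map the regions found by Selective Search onto the feature maps of deeper layers.
This avoids running the CNN on every region separately, which speeds up detection considerably.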

### SPPNet (2014)
Spatial Pyramid Pooling. Based on ZF-Net.
The Spatial Pyramid Pooling layer applies pooling with a window size and stride proportional to the input size. This produces a fixed-size output for inputs of any size.
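A minimal NumPy sketch of the idea, assuming max pooling and pyramid levels of 1x1, 2x2 and 4x4 bins (both are choices made for this example). Bin edges scale with the input, so feature maps of different spatial sizes produce vectors of the same length.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into n x n bins per level and concatenate."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges are proportional to the input size; needs H, W >= n.
        y_edges = np.linspace(0, h, n + 1)
        x_edges = np.linspace(0, w, n + 1)
        for i in range(n):
            y0, y1 = int(np.floor(y_edges[i])), int(np.ceil(y_edges[i + 1]))
            for j in range(n):
                x0, x1 = int(np.floor(x_edges[j])), int(np.ceil(x_edges[j + 1]))
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)   # length C * sum(n * n for n in levels)

rng = np.random.default_rng(0)
fm_small = rng.random((256, 13, 13))
fm_large = rng.random((256, 20, 31))
small = spatial_pyramid_pool(fm_small)
large = spatial_pyramid_pool(fm_large)
```

Both `small` and `large` have length 256 * (1 + 4 + 16), so they can feed the same fully connected layer.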

### Fast RCNN (2015)
Based on the SPPNet.
1. Run the image through a pre-trained CNN.
2. Run Selective Search on the image to generate region proposals.
3. Project these proposals onto the feature maps, just as in SPPNet.
4. Apply a sampling strategy to the proposals.
5. For each region proposal in the feature maps, an ROI Pool layer computes a fixed-length vector. That vector goes through a couple of fully connected layers, which then split into two sibling branches.
6. One branch predicts classes.
7. The other branch predicts bounding boxes.
8. Finally, apply NMS to remove duplicate boxes.

<p style="background-color: white;">
<img src="1_images/fastrcnn_model.png" width="1000">
</p>
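The final NMS step can be sketched as a greedy loop over boxes sorted by score. This is an illustrative version, assuming boxes in `[x1, y1, x2, y2]` format and an IoU threshold of 0.5; in a detector it would typically be run per class.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box overlaps the first heavily and is dropped
```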

### Faster RCNN (2015)
1. CNN backbone.
2. Region Proposal Network (RPN):
   1. Put a regular grid on the last feature map.
   2. Define anchors of different scales and aspect ratios for each grid cell.
   3. Feed the extracted regions into a network that predicts classes and bounding boxes relative to each anchor (implemented as a convolutional layer).
   4. Apply Non-Maximum Suppression.
3. Fine-tune the class and bbox predictions for each candidate from the RPN.

Architecture:
<p style="background-color: white;">
<img src="1_images/faster_rcnn_model.png" width="1000">
</p>
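Anchor generation over the grid can be sketched as below. The stride of 16, the three scales and the three aspect ratios are assumed defaults for this example, and `generate_anchors` is a hypothetical helper name.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * K, 4) anchors as [x1, y1, x2, y2]."""
    # K base shapes with area scale**2 and aspect ratio (height / width) = ratio.
    shapes = []
    for scale in scales:
        for ratio in ratios:
            shapes.append((scale / np.sqrt(ratio), scale * np.sqrt(ratio)))
    shapes = np.array(shapes)                                # (K, 2) as (w, h)

    # Centre of every grid cell in image coordinates.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                             # each (feat_h, feat_w)
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)     # (H*W, 2)

    half = shapes / 2.0
    top_left = centres[:, None, :] - half[None, :, :]        # (H*W, K, 2)
    bottom_right = centres[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)   # 38 * 50 cells, 9 anchors each
```

The RPN then predicts, for every anchor, an objectness score and four offsets relative to that anchor's box.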

### Mask R-CNN (2017)
Faster RCNN + an additional output branch for segmentation. Segmentation can be refined with the PointRend algorithm (more precise upscaling of the segmentation mask to the original size).

### Feature Pyramid Networks (FPN) (2017)
Save feature maps at different stages of the backbone CNN, upsample the deeper feature maps and combine them with the earlier ones. This gives semantically rich feature maps at high resolution, which helps to localize objects precisely. FPN itself is just a feature extractor.
For object detection, separate layers for classification and bounding box coordinate prediction are applied to each feature map in a sliding-window fashion.
FPN can be used in conjunction with the Faster RCNN architecture. The Region Proposal Network predicts ROIs. Based on the size of each ROI, we select the feature map at the most appropriate scale to extract the feature patches.
<img src="1_images/fpn_model.png" width="800">
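Selecting the feature map by ROI size is commonly done with a logarithmic rule, k = floor(k0 + log2(sqrt(w * h) / 224)). The sketch below assumes pyramid levels P2 to P5, a canonical size of 224 and k0 = 4; treat these constants as assumptions of the example.

```python
import math

def fpn_level(roi_w, roi_h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pyramid level for an ROI: larger ROIs map to coarser (higher) levels."""
    k = math.floor(k0 + math.log2(math.sqrt(roi_w * roi_h) / canonical_size))
    return max(k_min, min(k_max, k))   # clamp to the levels that exist

levels = [fpn_level(s, s) for s in (32, 112, 224, 448, 1000)]
```

Small ROIs are read from the high-resolution map and large ROIs from the coarse one, so each ROI is pooled at a scale close to the one the backbone was trained on.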

### G-RCNN (2021)
Key feature: application to video streams. The RPN considers spatial (colour) and temporal (from the frame sequence) information, combines them and extracts ROIs from these features using a CNN. Then classification and bounding-box regression are applied.
<img src="1_images/g-rcnn_model.png" width="800">

## 3. One-stage NNs
### YOLOv1 (2016)
The CNN backbone is pre-trained on ImageNet. Its last layers are replaced by extra convolutions and fully connected layers. The output is a 7*7*30 tensor: each of the 49 grid cells predicts 2 bounding boxes (98 in total, 5 numbers each) and one set of 20 class probabilities shared by that pair (49 class predictions). Accuracy is lower than that of 2-stage detectors, but the speed is real-time.
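The arithmetic behind the 7*7*30 tensor can be checked with a short sketch. The random array stands in for a network output, and the channel ordering (boxes first, class probabilities last) is an assumption of this example.

```python
import numpy as np

S, B, C = 7, 2, 20   # grid size, boxes per cell, number of classes
output = np.random.default_rng(0).random((S, S, B * 5 + C))   # stand-in for the network output

boxes = output[..., :B * 5].reshape(S, S, B, 5)   # per box: x, y, w, h, confidence
class_probs = output[..., B * 5:]                 # (S, S, C), shared by both boxes of a cell

# Class-specific score of every box: box confidence times the cell's class probabilities.
scores = boxes[..., 4:5] * class_probs[:, :, None, :]   # (S, S, B, C)

n_boxes = S * S * B            # 98 bounding boxes
n_class_predictions = S * S    # 49 class predictions
```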
### SSD (2016)
Accuracy is on the level of 2-stage detectors, at real-time speed. The architecture is very similar to Faster R-CNN, but it uses several feature maps and, for each feature map cell, outputs (C+4)*K numbers (C - number of classes, 4 - bounding box coordinates, K - number of anchors).
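To get a feel for the (C+4)*K formula, the sketch below counts the outputs over all feature maps. The feature map sizes and anchors per cell are the commonly reported SSD300 configuration, and 21 classes (20 object classes plus background) is assumed; none of these numbers come from the notes above.

```python
def ssd_output_size(feature_map_sizes, anchors_per_cell, n_classes):
    """Total number of anchor boxes and of output numbers, (C + 4) * K per cell."""
    n_boxes = sum(s * s * k for s, k in zip(feature_map_sizes, anchors_per_cell))
    return n_boxes, n_boxes * (n_classes + 4)

# Assumed SSD300-style layout: six square feature maps of decreasing size.
n_boxes, n_outputs = ssd_output_size(
    feature_map_sizes=(38, 19, 10, 5, 3, 1),
    anchors_per_cell=(4, 6, 6, 6, 4, 4),
    n_classes=21,
)
```

Most of the boxes come from the largest feature map, which is the one responsible for small objects.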
### RetinaNet (2017)
### YOLOv2, v3 (2017, 2018)
- A different pre-training strategy (classification at low resolution -> classification at high resolution -> object detection at high resolution).
- Dropout is replaced by Batch Normalization.
- Anchors are used; their shapes are learned with K-Means.
- Fine-grained information is added to the last feature map (features from earlier layers).
- Classes are predicted separately for each anchor.
- Softmax is replaced by logistic classifiers (1 object can now belong to several classes).
- The box parametrization is more stable.
- Boxes are predicted by convolutions instead of linear layers.
- Training uses images of different scales (possible because the network is fully convolutional).
- The backbone network is replaced by the custom DarkNet architecture.
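The more stable box parametrization can be illustrated with a small decoding sketch. It assumes the usual formulation in which the centre offset passes through a sigmoid and the anchor size is scaled by an exponential; the function and argument names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Decode raw predictions into (centre x, centre y, width, height) in grid units."""
    bx = cell_x + sigmoid(tx)       # centre cannot leave the predicting cell
    by = cell_y + sigmoid(ty)
    bw = anchor_w * math.exp(tw)    # anchor size scaled by a positive factor
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh

box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=5, anchor_w=1.5, anchor_h=2.0)
```

Because the sigmoid bounds the offset to (0, 1), an early, poorly trained prediction cannot place the box centre arbitrarily far from its cell, which is where the stability comes from.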
### YOLOv4 (2020)
### YOLOv5 (2020)
### YOLOv6
### YOLOv7
### YOLOR (2021)
### YOLOv8 (2023)
- Modifications in the convolutions
- Anchor-free prediction (or these boxes are optional?)
- Segmentation and classification capabilities
- Performance improvements
