# Deep Learning for Object Detection
1. [Overview](#Overview)
2. [R-CNN Family](#R-CNN-Family)
    - [R-CNN](#R-CNN)
    - [Fast R-CNN](#Fast-R-CNN)
    - [Faster R-CNN](#Faster-R-CNN)
3. [YOLO Family](#YOLO-Family)
    - [YOLO v1](#YOLO-v1)
    - [YOLO v2](#YOLO-v2)
    - [YOLO v3](#YOLO-v3)
4. [Appendix](#Appendix)
5. [Reference](#Reference)

---

## Overview

### Computer vision tasks
![cv_tasks](https://drive.google.com/uc?export=view&id=1ZuXHkVFyfxzgtBOncmhAlNz1uH80q5z-)
- **Image Classification**
    - What is the object in the image?
- **Object Localization**
    - Where is the object in the image?
- **Object Detection**
    - Classification + Localization
- **Segmentation**
    - Pixel-wise predictions
    
    
### Two Stage vs One Stage
- Two Stage: __Region Proposal__ first, then __Classification__
![Two Stage](https://drive.google.com/uc?export=view&id=1qh8ynXBFm_rZozWsG6_57Ta2rLV22dV2)

- One Stage: End-to-End
![One Stage](https://drive.google.com/uc?export=view&id=1A33ZiA3EscSMemPeCPFg-WfAXmbj-IYL)

|Method|Pros|Cons|
|----|----|----|
|Two Stage|High accuracy|Slow|
|One Stage|Fast, real-time|Lower accuracy|


---

## R-CNN Family

### R-CNN
[_"Rich feature hierarchies for accurate object detection and semantic segmentation."_](https://arxiv.org/abs/1311.2524) Girshick, R., Donahue, J., Darrell, T. and Malik, J. (2014)

#### Model
![r-cnn-1](https://drive.google.com/uc?export=view&id=1wDB75843Do-uspPNkl78Yenltd64s9jB)

- Region proposals: [Selective search](http://www.huppelen.nl/publications/selectiveSearchDraft.pdf)

- Feature extraction: CNN (AlexNet)

- Non-maximum suppression (NMS)
![NMS](https://drive.google.com/uc?export=view&id=1Gf_cBFRlrd8pFmhLt5Bu0cESeH8MckIz)
![IOU](https://drive.google.com/uc?export=view&id=14PYqELFvKWkQFzNkF9LXm0KA2d8Z1qHU)
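
The IoU and NMS figures above can be sketched in a few lines of plain Python. This is a minimal illustration, not the R-CNN authors' code: the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at 0 for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In a detector, NMS is normally run per class, so boxes of different classes do not suppress each other.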


### Fast R-CNN
[_"Fast R-CNN"_](https://arxiv.org/abs/1504.08083) Girshick, R. (2015)

#### Cons of R-CNN
- Training is a multi-stage pipeline
- Training is expensive in space and time
- Object detection is slow

#### Model
![r-cnn-1](https://drive.google.com/uc?export=view&id=19vEeoYVROlCi4h3S59s95hnybuy1as9v)

#### RoI pooling layer
![spp_3](https://drive.google.com/uc?export=view&id=1Mg7cmrPz3Lubn9aMtQAb5Z78G6dgV1Xa)

![roi pooling](https://drive.google.com/uc?export=view&id=1d0myxH-6YjKrv1W5uKpjHvQLBUEAUtT9)
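
As a rough sketch of what the RoI pooling figures show, the function below max-pools one region of a 2-D feature map into a fixed-size grid. The helper names, the inclusive `(x1, y1, x2, y2)` RoI format, and the 2 × 2 output size are assumptions for illustration; real implementations work on batched, multi-channel tensors.

```python
def bin_edges(start, length, n_bins):
    """Split [start, start + length) into n_bins integer ranges (lo, hi)."""
    edges = []
    for k in range(n_bins):
        lo = start + (k * length) // n_bins
        hi = start + ((k + 1) * length) // n_bins
        edges.append((lo, max(hi, lo + 1)))  # every bin covers >= 1 cell
    return edges


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    roi = (x1, y1, x2, y2) in feature-map cells, inclusive on both ends.
    """
    x1, y1, x2, y2 = roi
    row_bins = bin_edges(y1, y2 - y1 + 1, out_h)
    col_bins = bin_edges(x1, x2 - x1 + 1, out_w)
    return [
        [
            max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
            for (c0, c1) in col_bins
        ]
        for (r0, r1) in row_bins
    ]


feature_map = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
# A 5-wide RoI does not divide evenly into 2 columns, so bins differ in size.
pooled = roi_max_pool(feature_map, (0, 0, 4, 3))
print(pooled)
```

The integer bin edges are the quantization step that RoI Align later removes.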

#### RoI Align (Mask R-CNN)
![roi align 1](https://drive.google.com/uc?export=view&id=1ybpWdrXlSA40bKD7IcbJdFbipF2QttmW)

![bi-linear](https://drive.google.com/uc?export=view&id=1Yfz8m86rVaS6_gBnAmb3M5YGXWZ9B5bB)
![bi-linear-f1](https://drive.google.com/uc?export=view&id=1HgXGfKC8dOeNqGTAznaGMwpaa0-vYcIT)
![bi-linear-f2](https://drive.google.com/uc?export=view&id=1NJzGrm7Mb7uOBZqrPspcjIes77vCHq5B)
![bi-linear-f3](https://drive.google.com/uc?export=view&id=17aqLveNsWF8KAsjP0-YR9f4JBe5oLFuA)
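
RoI Align avoids rounding RoI coordinates to integer cells and instead samples the feature map at fractional positions with bilinear interpolation. The sketch below shows only that sampling step on a 2-D list; the function name and the `feature_map[y][x]` indexing are assumptions for this example, and coordinates are assumed to lie inside the map.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Interpolate feature_map[y][x] at a fractional (x, y) position.

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    x1 = min(x0 + 1, width - 1)   # clamp at the right border
    y1 = min(y0 + 1, height - 1)  # clamp at the bottom border
    dx = x - x0
    dy = y - y0
    # Weighted sum of the four neighbouring cells.
    return (
        (1 - dx) * (1 - dy) * feature_map[y0][x0]
        + dx * (1 - dy) * feature_map[y0][x1]
        + (1 - dx) * dy * feature_map[y1][x0]
        + dx * dy * feature_map[y1][x1]
    )


feature_map = [
    [0.0, 10.0],
    [20.0, 30.0],
]
center_value = bilinear_sample(feature_map, 0.5, 0.5)
print(center_value)
```

In RoI Align, several such samples are taken inside each output bin and then aggregated (max or average).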


### Faster R-CNN
[_"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks."_](https://arxiv.org/abs/1506.01497) Ren, S., He, K., Girshick, R. and Sun, J. (2015)

#### Model
![faster r-cnn-1](https://drive.google.com/uc?export=view&id=1EZeno8hyt3fMhN_z4WNqFICM7UO1XjFY)

#### Region Proposal Network
![rpn](https://drive.google.com/uc?export=view&id=1EZeno8hyt3fMhN_z4WNqFICM7UO1XjFY)
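
The Region Proposal Network scores a fixed set of reference boxes (anchors) at every feature-map location. The sketch below generates the 9 anchors for one location, using the 3 scales and 3 aspect ratios reported in the Faster R-CNN paper; the function name and the `(x1, y1, x2, y2)` output format are assumptions for this example.

```python
import math


def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred on one location.

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    anchors = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors


anchors = make_anchors(0, 0)
print(len(anchors))
```

For each anchor the RPN outputs an objectness score and 4 box-regression offsets, and the highest-scoring refined anchors become the region proposals.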



---

## YOLO Family

### YOLO v1
[_"You Only Look Once: Unified, Real-Time Object Detection."_](https://arxiv.org/abs/1506.02640) Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2015)

#### Model
![yolo_v1_0](https://drive.google.com/uc?export=view&id=1IZaW3SpLFtuP3JuWbQq56LyRInmYnSNi)

![yolo_v1_2](https://drive.google.com/uc?export=view&id=1BUAUYF3x9cJheqQfSdwT-9w07B0VNX_5)

- Output tensor dimensions: S × S × (B × 5 + C)
- S: grid size (width and height of the output tensor), 7
- B: number of bounding boxes per grid cell, 2
- C: number of classes, 20
- 5: values per bounding box (x, y, w, h, confidence)
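
Plugging the values above into the formula gives the size of the YOLO v1 output. The variable names below are chosen for this example only.

```python
S = 7    # grid size
B = 2    # bounding boxes per grid cell
C = 20   # classes

depth = B * 5 + C              # 5 = (x, y, w, h, confidence) per box
output_shape = (S, S, depth)   # shape of the prediction tensor
num_values = S * S * depth     # total numbers predicted per image
num_boxes = S * S * B          # total candidate boxes per image

print(output_shape, num_values, num_boxes)
```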


![yolo v1 1](https://miro.medium.com/max/700/1*JniWRt-ceWLNlkOULjhdpg.png)


### YOLO v2
[_"YOLO9000: Better, Faster, Stronger."_](https://arxiv.org/abs/1612.08242) Redmon, J. and Farhadi, A. (2016).

#### Better
- Batch Normalization
- High Resolution Classifier (224 × 224 -> 448 × 448)
- Convolutional With Anchor Boxes
- Dimension Clusters (k-means)
![k means](https://drive.google.com/uc?export=view&id=1DnGl95YPBgfr7efPuhDQYZl1o2Le_82S)
![distance](https://miro.medium.com/max/539/1*4UeShDFUuddbOOMAh7KTdg.png)
- Direct Location Prediction
- Fine-Grained Features
- Multi-Scale Training
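
The Dimension Clusters item refers to running k-means on the widths and heights of the training boxes with the distance d(box, centroid) = 1 − IoU(box, centroid), so that large boxes do not dominate the error. Below is a minimal sketch under stated assumptions: the function names, the fixed random seed, and the six toy `(w, h)` pairs are invented for illustration.

```python
import random


def wh_iou(a, b):
    """IoU of two boxes given only as (w, h), aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; return sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Toy data: three small, roughly square boxes and three large, wide boxes.
boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 52)]
priors = kmeans_anchors(boxes, k=2)
print(priors)
```

The resulting centroids serve as the anchor box shapes (priors) of the detector.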

#### Faster
- VGG-16 (30.69 billion floating-point operations per 224 × 224 image) -> YOLO v1's GoogLeNet-based network (8.52 billion) -> Darknet-19 (5.58 billion)
![darknet19](https://miro.medium.com/max/548/1*iPHGuCWfCOTjrEW187fSZQ.png)

#### Stronger
- WordTree
![wordtree_1](https://miro.medium.com/max/700/1*1rpDaEiL-4NuTBlk9p0oAg.png)
![wordtree_2](https://miro.medium.com/max/700/1*Js9qWV9taiuZHzTgA65OxQ.png)
![wordtree_3](https://miro.medium.com/max/700/1*YiX61mdylOzZYlBFXl9HjA.png)
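
In the WordTree, the network predicts conditional probabilities (a node given its parent), and the absolute probability of a label is the product of the conditionals along the path to the root. The sketch below illustrates that product; the four-node tree and all probability values are hypothetical, not taken from the paper.

```python
# Hypothetical fragment of a WordTree: child -> parent.
parent = {
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}

# Hypothetical conditional probabilities P(node | parent) from the network.
cond_prob = {
    "animal": 0.9,
    "dog": 0.8,
    "terrier": 0.5,
    "Norfolk terrier": 0.4,
}


def absolute_prob(node):
    """Multiply conditionals from the node up to the root (probability 1)."""
    prob = 1.0
    while node in parent:
        prob *= cond_prob[node]
        node = parent[node]
    return prob


p_norfolk = absolute_prob("Norfolk terrier")
print(p_norfolk)
```

Because probabilities shrink going down the tree, a detector can stop at a general label such as "dog" when it is not confident about the breed.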

![it's over 9000](https://i.kym-cdn.com/entries/icons/original/000/000/056/itsover1000.jpg)


### YOLO v3
[_"YOLOv3: An Incremental Improvement."_](https://arxiv.org/abs/1804.02767) Redmon, J. and Farhadi, A. (2018).

> I managed to make some improvements to YOLO.
> But, honestly, nothing like super interesting, just a bunch of small changes that make it better.
> I also helped out with other people’s research a little.
>
>
> So here’s the deal with YOLOv3: We mostly took good ideas from other people. 

#### Bounding box prediction 
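- Unchanged from YOLO v2: boxes are predicted as offsets from anchor boxes (dimension clusters)
- An objectness score is predicted for each box with logistic regression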
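
The YOLO v2 and YOLO v3 papers decode the raw outputs (tx, ty, tw, th) as bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw · e^tw, bh = ph · e^th, where (cx, cy) is the grid cell offset and (pw, ph) is the anchor (prior) size. A minimal sketch of that decoding follows; the function names and the sample values are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t, cell, prior):
    """Turn raw outputs t = (tx, ty, tw, th) into a box in grid units.

    cell  = (cx, cy): top-left corner of the responsible grid cell
    prior = (pw, ph): width and height of the anchor box
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx      # centre stays inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # anchor size scaled by a positive factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 1.5))
print(box)
```

With all-zero outputs the box sits at the centre of its cell and has exactly the anchor's size.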

#### Class prediction
- Softmax is not used, because it assumes each box has exactly one class and labels can overlap
- Independent logistic classifiers are used instead, trained with binary cross-entropy loss
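
A small sketch of the difference: with one sigmoid per class, the class scores do not have to sum to 1, so a box can be both "woman" and "person". The function names, the three class labels, and the logit values are assumptions for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(logits)


# Hypothetical logits for the classes ("woman", "person", "car").
logits = [2.0, 1.5, -3.0]
targets = [1, 1, 0]            # two labels are correct at the same time

probs = [sigmoid(z) for z in logits]
loss = multilabel_bce(logits, targets)
print(probs, loss)
```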

#### Prediction Across Scales
- Boxes are predicted at 3 different scales, using a concept similar to feature pyramid networks (FPN)
![fpn](https://miro.medium.com/max/690/1*D_EAjMnlR9v4LqHhEYZJLg.png)
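
According to the YOLOv3 paper, the output at each scale is an N × N × [3 × (4 + 1 + 80)] tensor for COCO (3 boxes per cell, 4 box offsets, 1 objectness score, 80 classes). The sketch below computes the three shapes; the 416 × 416 input size and the strides 32, 16 and 8 are assumptions based on a common configuration, and the function name is hypothetical.

```python
def yolo_v3_output_shapes(input_size=416, strides=(32, 16, 8),
                          boxes_per_scale=3, num_classes=80):
    """Return the (N, N, depth) output shape for each prediction scale."""
    depth = boxes_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_v3_output_shapes()
# Every grid cell at every scale predicts 3 boxes.
total_boxes = sum(n_rows * n_cols * 3 for n_rows, n_cols, _ in shapes)
print(shapes, total_boxes)
```

The coarse grid handles large objects, while the finer grids help with small objects.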

#### Feature Extractor: Darknet-53
![darknet-53](https://miro.medium.com/max/490/1*tF1fK8-D5PVDb4khxvIH_g.png)

![yolov3_1](https://miro.medium.com/max/700/1*RFpjH8D6TStBaYuZYehe_g.png)

---

## Appendix

### [History of Object Detection](https://medium.com/@nikasa1889/the-modern-history-of-object-recognition-infographic-aea18517c318)

---

## Reference
> [Deep Learning: What Are One-Stage and Two-Stage Object Detection? (in Chinese)](https://medium.com/@chih.sheng.huang821/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E4%BB%80%E9%BA%BC%E6%98%AFone-stage-%E4%BB%80%E9%BA%BC%E6%98%AFtwo-stage-%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-fc3ce505390f)
>
> [A Gentle Introduction to Object Recognition With Deep Learning](https://machinelearningmastery.com/object-recognition-with-deep-learning/)
>
> [Review of Deep Learning Algorithms for Object Detection](https://medium.com/zylapp/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852)
>
> [Joseph Chet Redmon's website](https://pjreddie.com/)
>
> [Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3](https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088)
>
> [Machine/Deep Learning: Object Detection, Non-Maximum Suppression (NMS) (in Chinese)](https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8-%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-non-maximum-suppression-nms-aa70c45adffa)
>
> [Bilinear interpolation](https://wiki.mbalib.com/zh-tw/%E5%8F%8C%E7%BA%BF%E6%80%A7%E6%8F%92%E5%80%BC)