# Object Detection

### Problem Statement
You are given an image of dimension $H \times W$ and a set of $C$ class labels (e.g., `cat, tree, sky`). Your goal is to predict the class of each object in the image and to return a set of coordinates which bounds that object. Except this time, there are a variable number of objects in the image. There could be 1, there could be 12, and you do not know ahead of time.

For instance,
```
(image)               # classification, (x, y, w, h)
[[0.23, 0.05], ---->  (cat, [1, 0, 1, 1])
 [0.15, 0.03]]        
```
Or
```
(image)               # classification, (x, y, w, h)
[[0.23, 0.05], ---->  (cat, [1, 1, 1, 1]), (dog, [0, 0, 1, 1])
 [0.15, 0.03]]        
```
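
Because the number of objects varies, the output is most naturally a list rather than a single tuple. Here is a minimal sketch of that shape in Python; the `Detection` name and its fields are hypothetical, chosen only to mirror the two examples above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not from any library)."""
    label: str                               # one of the C class names
    box: Tuple[float, float, float, float]   # (x, y, w, h)


# An image may yield any number of detections, including zero.
one_object: List[Detection] = [Detection("cat", (1, 0, 1, 1))]
two_objects: List[Detection] = [
    Detection("cat", (1, 1, 1, 1)),
    Detection("dog", (0, 0, 1, 1)),
]
```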

### Naive Approach - Sliding Window
Fix some sliding window size (e.g., 2x2) and slide it across the image, passing each crop to the CNN, which classifies it as `dog`, `cat`, or `background`.

The biggest, most obvious problem is: how do we choose the crop? Objects can appear at any position, size, and aspect ratio, so a brute force approach would need to try 10,000+ different window sizes, each at every position in the image.
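
To get a feel for the scale of the brute force search, here is a rough sketch that counts every crop a sliding window would produce. The function names and the stride of 1 are assumptions for illustration.

```python
def count_windows(height, width, win_h, win_w, stride=1):
    """Number of positions a win_h x win_w window can take in the image."""
    if win_h > height or win_w > width:
        return 0
    rows = (height - win_h) // stride + 1
    cols = (width - win_w) // stride + 1
    return rows * cols


def count_all_windows(height, width, stride=1):
    """Total crops when every window size up to the full image is tried."""
    return sum(
        count_windows(height, width, win_h, win_w, stride)
        for win_h in range(1, height + 1)
        for win_w in range(1, width + 1)
    )


# Even a tiny 100 x 100 image has 100 * 100 = 10,000 window sizes.
total = count_all_windows(100, 100)  # → 25502500
```

Each of those crops would need its own forward pass through the CNN, which is why this approach is not practical.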



### Good Approach - Region Proposals
Within traditional computer vision there are algorithms which find regions likely to contain an object (note that this is all traditional CV, using things like image gradients, edge detection, etc.). These methods will spit out 1000-2000 potential regions in 1-2 seconds, with lots of noise but very high recall, meaning that if an object exists within the image, it is likely to be covered by one of the proposed regions.

We have now reduced the number of things to run inference on from several million to several thousand, making the problem much more tractable. This idea all came together in a paper called **R-CNN**.

But how do we handle variable sizes? Due to the fully connected layers in the network, all inputs are expected to have the same shape, so we first have to warp each image region to a fixed pixel size $H \times W$. We then run the same 2 step process we are used to seeing in **Classification + Localization**: classifying the region, and predicting the bounding box (sometimes trimming the proposal slightly).
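
The warping step can be sketched as a crop followed by a resize. This version uses nearest-neighbour sampling in NumPy purely for illustration; the function name is hypothetical, the box is assumed to be integer `(x, y, w, h)` lying fully inside the image, and a real pipeline would likely use a proper image-resizing routine.

```python
import numpy as np


def crop_and_warp(image, box, out_h, out_w):
    """Crop box = (x, y, w, h) from image and resize it to out_h x out_w.

    Nearest-neighbour sampling; the aspect ratio is deliberately not preserved.
    """
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    # Map every output row/column back to a source row/column in the crop.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return crop[rows][:, cols]


image = np.arange(48).reshape(6, 8)               # toy 6 x 8 "image"
warped = crop_and_warp(image, (2, 1, 4, 2), 4, 4)  # 4 x 2 region -> 4 x 4
```

Every proposed region, whatever its original shape, comes out the same size and can be fed to the same network.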

So what are the problems?
- It is still slow. We still have thousands of regions which we need to iterate through (inference takes 30 seconds to 1 minute).
- We aren't learning the region proposals, so if they are a bottleneck in performance we aren't going to improve very much.
- Training takes a long time.

### Better Approach - Fast R-CNN

To address some of the slowness of R-CNN, we run the entire image through the convolutional layers once to create a high resolution feature map, which we can then reuse for every region. Thus, rather than taking crops directly from the image, we take crops from the feature map.

Before (R-CNN)

```
region-proposals -> crop image regions -> pass regions through network
```

After (Fast R-CNN)

```
region-proposals -> crop feature map regions
```

Note that the steps from here are all the same:
- region features must be rescaled to a fixed square size
- rescaled features are then passed through 2 separate fully connected layers (classification and bounding box)
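
The rescaling of region features can be sketched as max pooling over a fixed grid of bins (the Fast R-CNN paper calls this RoI pooling). The version below is a simplified single-channel illustration; the function name, the `(x, y, w, h)` convention in feature-map cells, and the bin-splitting rule are assumptions.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool a region of a 2D feature map into an out_size x out_size grid.

    roi = (x, y, w, h) in feature-map cells; assumes w >= out_size and h >= out_size.
    """
    x, y, w, h = roi
    region = feature_map[y:y + h, x:x + w]
    # Integer bin edges split the region into out_size parts along each axis.
    row_edges = np.arange(out_size + 1) * h // out_size
    col_edges = np.arange(out_size + 1) * w // out_size
    pooled = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled


feature_map = np.arange(36).reshape(6, 6)        # computed once per image
pooled = roi_max_pool(feature_map, (1, 1, 4, 3))  # 4 x 3 region -> 2 x 2
```

The feature map is computed once, and each proposal only costs a cheap pooling step plus the two fully connected heads.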

This is now much faster (inference takes .25 - .5 seconds, rather than 30 - 60 seconds!).

Problems
- Region proposal is now the bottleneck (it still takes around 2 seconds)
- We still aren't learning the region proposals

### Even Better Approach - Faster R-CNN

Similar to Fast R-CNN, except now we have 2 additional small heads which precede the Classification and Bounding Box layers (in the original paper these are convolutional rather than fully connected). We call these 2 heads the Region Proposal Network (RPN) because they
1. Identify whether or not a region contains an object
2. Identify the bounding box within the region.

This is the primary distinction between Fast and Faster R-CNN, but it does create complications, as we are now using 1 feature map for 4 different tasks:
1. Identify object vs. not object.
2. Identify the bounding box.
3. Classify the object within the region.
4. Trim the bounding box as necessary.

And as we already alluded to, increasing the number of tasks in Multi-Task learning increases the difficulty in training the network.
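
One way to picture the four tasks is as a single training objective with one loss term per task. The equal weighting and the symbol names below are assumptions for illustration, not taken from the paper:

$$
L = L_{\text{rpn-obj}} + L_{\text{rpn-box}} + L_{\text{cls}} + L_{\text{box}}
$$

Here $L_{\text{rpn-obj}}$ and $L_{\text{rpn-box}}$ correspond to tasks 1 and 2 (the RPN), while $L_{\text{cls}}$ and $L_{\text{box}}$ correspond to tasks 3 and 4. All four gradients flow into the same shared feature map, which is where the training difficulty comes from.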

But it's very fast now (inference takes .25 - .5 seconds in total, since we no longer pay the extra 2 seconds or so for region proposals).

### Alternatives - YOLO / SSD
What if we eliminated the "region proposal" component altogether, and instead tried
- breaking the image up into a $7 \times 7$ grid
- iterating through all 49 grid cells
- iterating through 5 potential "base boxes" (rectangles of different shapes and orientations within the grid cell)
    - predicting an adjustment ($dx, dy, dh, dw$) which shifts and resizes each of those base boxes
- finally giving a score which reflects the confidence in that box.

The output is then a tensor of shape 49 x 5 x 5
- 49 for the 49 different grid cells
- 5 for the 5 different potential base boxes we start with
- 5 for the 5 values tied to each base box (confidence, ($dx, dy, dh, dw$))
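
To make the 49 x 5 x 5 layout concrete, here is a rough sketch of turning such an output into boxes. Several details are assumptions made for illustration: base boxes are centred on their grid cell, offsets are simply added, sizes are in grid-cell units, and the `decode` function, `base_boxes` array, and threshold are hypothetical.

```python
import numpy as np

GRID = 7       # 7 x 7 grid -> 49 cells
NUM_BASE = 5   # base boxes per cell


def decode(output, base_boxes, threshold=0.5):
    """Turn a (49, 5, 5) output into a list of (confidence, x, y, h, w).

    output[c, b] = (confidence, dx, dy, dh, dw) for cell c and base box b.
    base_boxes has shape (5, 2): an assumed (h, w) per base box, in cell units.
    """
    detections = []
    for cell in range(GRID * GRID):
        row, col = divmod(cell, GRID)
        cx, cy = col + 0.5, row + 0.5   # assumed: base box centred on the cell
        for b in range(NUM_BASE):
            conf, dx, dy, dh, dw = output[cell, b]
            if conf < threshold:
                continue                # skip low-confidence boxes
            base_h, base_w = base_boxes[b]
            detections.append(
                (float(conf), cx + dx, cy + dy, base_h + dh, base_w + dw)
            )
    return detections


output = np.zeros((49, 5, 5))
output[8, 2] = [0.9, 0.1, -0.2, 0.5, 0.0]   # one confident box in cell (1, 1)
base_boxes = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3]], dtype=float)
detections = decode(output, base_boxes)
```

Everything comes out of a single forward pass, with no separate proposal stage, which is where the speed advantage comes from.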

### Comparisons
- YOLO and SSD are much faster but not as accurate
- Faster R-CNN is slower but more accurate

### Aside
- The R in R-CNN stands for region
- YOLO stands for "you only look once"
- SSD stands for "single shot detector" (the paper's full title is "SSD: Single Shot MultiBox Detector")