## 7.1 Intro to Two-Stage Object Detection
- Two-stage detectors typically consist of a <font color="orange">region proposal</font> stage followed by a <font color="orange">refinement stage</font>.
- Historical timeline of prominent two-stage object detection models: <br><br>
<img src="resource/two-stage-obj-det-timeline.png" width="100%"> <br><br>
- Key <font color="orange">innovations</font> in two-stage object detection models: <br><br>
<img src="resource/two-stage-obj-det-innovation.png" width="100%"> <br><br>


________
<br><br><br><br>
### 7.1.1 <font color="orange">R-CNN</font> (Regions with Convolutional Neural Networks)
- Author : Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
- Release Date : 2014
- Innovation : 
    - Introduced the use of <font color="orange">region proposals</font> followed by CNN-based feature extraction for each region. 
    - R-CNN extracts around 2,000 region proposals using <font color="orange">selective search</font> and classifies each region with a CNN.
- Architecture : 
    - Consists of three stages: 
        1. <font color="orange">Region proposal</font> using <font color="orange">selective search</font>: 
            - over-segment the image into <font color="orange">small segments</font> based on <font color="orange">color</font> and <font color="orange">texture</font> similarity,
            - iteratively <font color="orange">merge</font> the <font color="orange">most similar</font> neighboring segments, judged by color, texture, size, and shape,
            - produce a list of <font color="orange">proposed regions</font> (boxes) where objects might be located.<br>
            <img src="resource/selective_search.png" width="600px"><br>
        2. <font color="orange">Feature extraction</font> for each <font color="orange">warped region</font> using a <font color="orange">pre-trained CNN</font> (AlexNet or similar): 
            - warping is necessary because the network expects a fixed input size, so every proposed region is resized to the same dimensions.
        3. An <font color="orange">SVM classifier</font> to <font color="orange">predict the object label</font> and a <font color="orange">linear regressor</font> to <font color="orange">refine the object box</font> of each proposal.<br>
        <img src="resource/r-cnn-arch2.png" width="900px"><br>
- Benchmark : 58.5% mAP on PASCAL VOC 2007.
- Paper : ['Rich feature hierarchies for accurate object detection and semantic segmentation' - arxiv.org](https://arxiv.org/pdf/1311.2524)
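
The warping step in stage 2 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper name `warp_region`, the nearest-neighbor resampling, and the toy 3×3 output size are assumptions chosen for readability (the paper warps each proposal to the 227×227 input that AlexNet expects).

```python
def warp_region(image, box, out_h, out_w):
    """Crop `box` from `image` and resize it to (out_h, out_w).

    image: 2-D list of pixel values (rows of equal length).
    box:   (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    Nearest-neighbor sampling is used here purely for simplicity.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    if crop_h <= 0 or crop_w <= 0:
        raise ValueError("box must have positive width and height")

    warped = []
    for i in range(out_h):
        # Map output row i back to a source row inside the crop.
        src_y = y1 + (i * crop_h) // out_h
        row = []
        for j in range(out_w):
            # Map output column j back to a source column inside the crop.
            src_x = x1 + (j * crop_w) // out_w
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


# Toy 4x6 "image" and three proposals with different shapes.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
proposals = [(0, 0, 2, 2), (1, 0, 6, 4), (2, 1, 4, 4)]

# Every proposal ends up with the same fixed size, ready for the CNN.
warped = [warp_region(image, box, 3, 3) for box in proposals]
```

Note that the aspect ratio of each proposal is not preserved. Because this crop-and-resize step, and the CNN forward pass that follows it, runs once per proposal, the cost grows linearly with the roughly 2,000 proposals per image.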

<br><br><br><br>
### 7.1.2 <font color="orange">Fast R-CNN</font>
- Author: Ross Girshick
- Release Date: 2015
- Innovation: 
    - Improved efficiency by <font color="orange">sharing computation</font> over all region proposals. 
    - It introduces the <font color="orange">ROI pooling</font> layer to crop features directly from a <font color="orange">shared feature map</font> rather than processing each region independently.
- Architecture: 
    - Similar to R-CNN but uses a <font color="orange">single forward pass</font> through the CNN to generate feature maps,
        - Find region proposals using <font color="orange">selective search</font>,
        - Pass the whole image through the CNN once to produce a <font color="orange">feature map</font>,
        - <font color="orange">Project each region</font> onto the generated feature map. <br>
        <img src="resource/fast r-cnn-arch2.png" width="95%"><br>
    - Then applies <font color="orange">ROI pooling</font> to extract <font color="orange">fixed-size</font> features for each region:
        - <font color="orange">Divide</font> the projected region into a grid of sub-regions,
        - Apply <font color="orange">max-pooling</font> within each grid cell,
        - Produce a smaller, <font color="orange">fixed-size</font> output that preserves the strongest features.
            - We no longer need to warp each region to standardize its dimensions.<br>
        <img src="resource/roi-pooling.png" width="900px"><br>
    - Uses a <font color="orange">softmax</font> classifier instead of SVMs to predict the label.
- Benchmark: 70.0% mAP on PASCAL VOC 2007 and 68.4% mAP on PASCAL VOC 2012 (VGG-16, trained with additional VOC data).
- Paper : ['Fast R-CNN' - arxiv.org](https://arxiv.org/pdf/1504.08083)
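
The projection and ROI pooling steps above can be sketched for a single-channel feature map as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `roi_max_pool`, the rounding rule used for the projection, and the toy 4×4 feature map with a stride of 4 (`spatial_scale = 0.25`) are all hypothetical choices.

```python
import math


def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w, spatial_scale):
    """Max-pool one region of a 2-D feature map into an out_h x out_w grid.

    feature_map:   2-D list of floats (a single channel).
    roi:           (x1, y1, x2, y2) in input-image pixels.
    spatial_scale: feature-map size divided by image size (1 / stride).
    """
    fm_h, fm_w = len(feature_map), len(feature_map[0])

    # Project the region from image coordinates onto the feature map.
    x1, y1, x2, y2 = (math.floor(v * spatial_scale + 0.5) for v in roi)
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)

    pooled = []
    for i in range(out_h):
        # Grid-cell boundaries (floor for the start, ceil for the end),
        # clamped to the feature map.
        y_start = min(max(y1 + (i * roi_h) // out_h, 0), fm_h)
        y_end = min(max(y1 + ceil_div((i + 1) * roi_h, out_h), 0), fm_h)
        row = []
        for j in range(out_w):
            x_start = min(max(x1 + (j * roi_w) // out_w, 0), fm_w)
            x_end = min(max(x1 + ceil_div((j + 1) * roi_w, out_w), 0), fm_w)
            cells = [feature_map[y][x]
                     for y in range(y_start, y_end)
                     for x in range(x_start, x_end)]
            # An empty cell (region clipped away) contributes 0.
            row.append(max(cells) if cells else 0.0)
        pooled.append(row)
    return pooled


feature_map = [
    [0.1, 0.3, 0.2, 0.0],
    [0.5, 0.9, 0.1, 0.4],
    [0.2, 0.6, 0.8, 0.3],
    [0.7, 0.1, 0.2, 0.6],
]

# Two regions of different sizes both come out as 2x2 grids.
pooled_a = roi_max_pool(feature_map, (0, 4, 12, 12), 2, 2, spatial_scale=0.25)
pooled_b = roi_max_pool(feature_map, (4, 0, 15, 15), 2, 2, spatial_scale=0.25)
```

The feature map is computed once per image, and only this cheap pooling step runs per region, which is where the speed-up over R-CNN comes from.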

<br><br><br><br>
### 7.1.3 <font color="orange">Faster R-CNN</font>
- Author: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
- Release Date: 2015
- Innovation: 
    - Introduced a <font color="orange">Region Proposal Network (RPN)</font> to replace selective search, making the model end-to-end trainable. 
    - The <font color="orange">RPN</font> shares features with the detection network, improving speed and accuracy.
- Architecture: 
    - Combines <font color="orange">RPN</font> and <font color="orange">Fast R-CNN</font> into a unified network. <br>
    <img src="resource/faster r-cnn-arch2.png" width="95%"><br><br><br>
    - The <font color="orange">RPN</font> generates <font color="orange">region proposals</font>, which are then refined by the <font color="orange">Fast R-CNN</font> detection head using <font color="orange">ROI pooling</font> and a <font color="orange">fully connected layer</font>.<br>
        <img src="resource/RPN2.png" width="95%"><br>
        - The RPN generates multiple <font color="orange">anchor boxes</font> centered at each sliding-window position,
            - an <font color="orange">anchor box</font> is a <font color="orange">predefined bounding box</font>; anchors of different <font color="orange">scales</font> and aspect <font color="orange">ratios</font> cover various object shapes and sizes.
        - The <font color="orange">RPN head</font> applies a <font color="orange">convolution</font> with a $3 \times 3$ kernel to the shared feature map.
            - The result is then passed to two separate branches ($1 \times 1$ <font color="orange">convolutional layers</font>):
                - <font color="orange">Objectness Score Prediction</font> : A <font color="orange">binary classifier</font> outputs an objectness score for each anchor (<font color="orange">object</font> or <font color="orange">background</font>),
                - <font color="orange">Bounding Box Regression</font> : A regressor predicts the adjustments (offsets) needed to <font color="orange">refine</font> the anchor boxes so that they better fit the ground-truth bounding boxes.
        - Non-maximum suppression (<font color="orange">NMS</font>) is applied to <font color="orange">filter</font> out <font color="orange">redundant</font>, heavily <font color="orange">overlapping</font> proposals, leaving only the most confident ones.
        
        
    
- Benchmark: MS COCO test-dev, around 42.7% mAP@0.5 (21.9% mAP@[.5, .95]) with VGG-16.
- Paper : ['Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks' - arxiv.org](https://arxiv.org/pdf/1506.01497)
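
The RPN post-processing described above (anchor generation, applying the predicted offsets, and NMS) can be sketched in plain Python. The function names, the toy boxes, and the scores are illustrative assumptions. The three scales, three aspect ratios, and the 0.7 IoU threshold follow the defaults reported in the paper, and a real RPN would obtain scores and offsets from its two $1 \times 1$ convolution branches rather than from hand-written lists.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and height/width equal to `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def apply_deltas(anchor, deltas):
    """Refine an anchor with offsets (tx, ty, tw, th): the center shifts
    relative to the anchor size, width and height scale exponentially."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + wa / 2, anchor[1] + ha / 2
    tx, ty, tw, th = deltas
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# 3 scales x 3 ratios = 9 anchors at one sliding-window position.
anchors = make_anchors(100, 100, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))

# Zero offsets leave the anchor unchanged; non-zero ones shift and resize it.
unchanged = apply_deltas(anchors[1], (0.0, 0.0, 0.0, 0.0))
shifted = apply_deltas(anchors[1], (0.1, 0.0, 0.0, 0.0))

# The second box overlaps the first heavily and is suppressed.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.7)
```

The surviving proposals are what the Fast R-CNN head then processes with ROI pooling.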