In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## VGG-16

Before we describe the SSD model, let's have a quick look at the VGG-16 architecture, which is used as the backbone (feature extractor) for SSD.

VGG (named after the Visual Geometry Group) was first presented at the 2014 ImageNet competition. It got high results (it was not the classification winner, GoogLeNet was) with a simple, uniform architecture.
VGG-16 is still used as:
- A backbone for many models
- A perceptual loss network for style transfer and auto-encoder (super-resolution, variational auto-encoder) models
- A general feature extractor: from my experience, VGG-16 features are fairly invariant to position, color and size because of the pooling layers and the convolutional stack.

VGG-16 architecture:
<img src="images/od/vgg_16_1.png" height="800" width="800">

One of the best feature extractors for background (offline) jobs:
- High-dimensional features
- Spark jobs for image clustering

## SSD (Single Shot MultiBox Detector)

SSD is a member of the single-shot detector family.
<br>
For object detection, as we remember from localization models, we need a fixed number of bounding boxes.
<br>
SSD achieves this by assigning $n$ bounding boxes to every pixel of a feature map.
<br>
Why on a feature map? Because feature maps have fewer pixels, i.e. a smaller spatial (height and width) size.
<br>
The number of bounding boxes per pixel may differ from layer to layer.


General architecture with non-maximum suppression:
<img src="images/od/ssd_2.png" height="1000" width="1000">

Detector and classifier block:
<img src="images/od/ssd_3.png" height="1000" width="1000">

For each block we have a feature map with a small spatial size, and we generate three bounding boxes per pixel:
<img src="images/od/ssd_4.png" height="1000" width="1000">
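The idea of attaching a few boxes to every feature-map pixel can be sketched in plain Python. This is a minimal illustration, not the SSD reference implementation: the function name `default_boxes`, the scale of `0.5` and the three aspect ratios are assumptions chosen for the example.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the feature-map cell in normalized image coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ratio in aspect_ratios:
                # Every ratio keeps the same area (scale ** 2); only the shape changes.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes(fmap_size=3, scale=0.5)
print(len(boxes))  # 3 * 3 cells * 3 boxes per cell = 27
print(boxes[:3])   # the three boxes that share the top-left cell center
```

The number of boxes is fixed by the feature-map size and the number of aspect ratios, which is what lets the network predict all of them in one pass.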

With corner correction:
<img src="images/od/ssd_7.png" height="1000" width="1000">

Train with an appropriate loss:
<img src="images/od/ssd_10.png" height="1000" width="1000">

A different number of bounding boxes per layer, with correction at the end:
<img src="images/od/ssd_12.png" height="1000" width="1000">
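To get a feeling for how many boxes this produces, here is a small count. The layer configuration below (feature-map sizes and boxes per pixel) is the SSD300 setup from the original SSD paper, not something defined in this notebook, and the helper name `count_default_boxes` is made up for the example.

```python
# (feature map size, boxes per pixel) for each detection layer of SSD300.
ssd300_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def count_default_boxes(layers):
    """Total number of default boxes over all square feature maps."""
    return sum(size * size * boxes_per_pixel for size, boxes_per_pixel in layers)

total = count_default_boxes(ssd300_layers)
print(total)  # 8732
```

Most of the boxes come from the largest (earliest) feature map, which is also the one responsible for small objects.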

## ResNet and the notion of skip connections

ResNet, by Microsoft Research, won the 2015 ImageNet competition and was the first winning model to outperform the reported human-level error on this benchmark.

The main idea behind the ResNet architecture is the so-called skip connection, or residual block:


Residual block:
<img src="images/od/resnet_1.png" height="800" width="800">

With different interpretations:
<img src="images/od/resnet_2.png" height="800" width="800">

With skip connections, the practical limit on model depth increased, and the deeper models brought a performance gain.
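A residual block computes $y = \text{ReLU}(x + F(x))$, so the layers only have to learn the correction $F(x)$. The toy sketch below uses two small dense layers for $F$; real ResNet blocks use convolutions and batch normalization, and all names here are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    # weights is a list of rows; each row produces one output value.
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    # F(x): two layers with a ReLU in between.
    fx = linear(relu(linear(x, w1, b1)), w2, b2)
    # Skip connection: add the input back before the final activation.
    return relu([xi + fi for xi, fi in zip(x, fx)])

x = [1.0, 2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
out = residual_block(x, zero_w, zero_b, zero_w, zero_b)
print(out)  # [1.0, 2.0]
```

With all weights at zero the block is the identity on non-negative inputs, which hints at why stacking many such blocks does not make optimization harder: a block can always fall back to passing its input through.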

ResNet vs other models:
<img src="images/od/resnet_3.png" height="800" width="800">

Almost all modern classification architectures (which are used as backbones or encoders) use some modification of residual connections:
- Inception-ResNet
- ResNeXt
- MobileNet
- EfficientNet

Residual connections in other architectures:
<img src="images/od/resnet_4.jpg" height="800" width="800">

## Feature pyramid models

Let's combine down-sampling and up-sampling

U-Net:
<img src="images/od/unet.png" height="800" width="800">

Let's make it slightly faster with a feature pyramid network (FPN):

<img src="images/od/fpn_1.jpeg" height="800" width="800">

<img src="images/od/fpn_2.jpeg" height="800" width="800">
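The top-down step of a feature pyramid can be sketched as "upsample the coarse map and add the finer one". This is a single-channel toy version using nested lists; real FPNs apply a $1 \times 1$ convolution on the lateral path and a $3 \times 3$ convolution after the sum, which are left out here. The function names are made up for the example.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour upsampling of a 2D list by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat every column twice
        out.append(wide)
        out.append(list(wide))                     # repeat every row twice
    return out

def merge_top_down(coarse, lateral):
    """Add the upsampled coarse map to the finer lateral map, element-wise."""
    up = upsample_nearest_2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]

coarse = [[1.0, 2.0],
          [3.0, 4.0]]                        # deep, low-resolution features
lateral = [[0.5] * 4 for _ in range(4)]      # shallow, high-resolution features
merged = merge_top_down(coarse, lateral)
print(merged[0])  # [1.5, 1.5, 2.5, 2.5]
print(merged[3])  # [3.5, 3.5, 4.5, 4.5]
```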

## RetinaNet

Feature pyramids with ResNet50 as backbone:

<img src="images/od/retina_1.jpg" height="1000" width="1000">

The feature pyramid extracts features and preserves both semantic and spatial information. After each feature pyramid block, two predictors are applied: a bounding box regression network whose output has $H \times W$ spatial size and $4A$ channels, and a classification network whose output has $H \times W$ spatial size and $KA$ channels, i.e. $K$ class scores for each of the $A$ anchor boxes:

<img src="images/od/retina_2.jpg" height="1000" width="1000">

And classification:
<img src="images/od/retina_3.png" height="1000" width="1000">

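Regression loss. The targets are computed from the ground-truth box $G^i$ and its anchor $A^i$, both given as center $(x, y)$, width $w$ and height $h$:
$$
\begin{align}
T^i_x &= (G^i_x - A^i_x) / A^i_w  \\
T^i_y &= (G^i_y - A^i_y) / A^i_h \\
T^i_w &= \log(G^i_w / A^i_w) \\
T^i_h &= \log(G^i_h / A^i_h)
\end{align}
$$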
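The target computation above, together with its inverse (used at inference time to turn predictions back into boxes), can be written directly. The function names `encode` and `decode` and the example numbers are illustrative.

```python
import math

def encode(gt, anchor):
    """Regression targets T for a ground-truth box and an anchor, both (x, y, w, h)."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return ((gx - ax) / aw,
            (gy - ay) / ah,
            math.log(gw / aw),
            math.log(gh / ah))

def decode(t, anchor):
    """Inverse of encode: recover a box from offsets and the anchor."""
    tx, ty, tw, th = t
    ax, ay, aw, ah = anchor
    return (ax + tx * aw,
            ay + ty * ah,
            aw * math.exp(tw),
            ah * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
gt = (55.0, 40.0, 20.0, 80.0)
t = encode(gt, anchor)
print(t)                  # x and y offsets are 0.25 and -0.25, the width offset is 0.0
print(decode(t, anchor))  # approximately the ground-truth box again
```

Dividing by the anchor size and taking the logarithm of the size ratio keeps the targets on a similar scale for small and large anchors.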

The loss is then applied as:
$$
L_{loc} = \sum_{j \in \{x, y, w, h\}} \text{smooth}_{L1}(P^i_j - T^i_j)
$$
<br>
where:
$$
\begin{equation}
\text{smooth}_{L1}(x) = 
\begin{cases}
0.5x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{if } |x| \geq 1
\end{cases}
\end{equation}
$$
<br>
and
$$
P^i = (P^i_x, P^i_y, P^i_w, P^i_h)
$$
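A direct transcription of the smooth L1 loss for a single anchor, as a sketch; the prediction and target values are made up for illustration.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def localization_loss(pred, target):
    """Sum of smooth L1 over the four offsets (x, y, w, h) of one anchor."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

pred = (0.5, -0.25, 2.0, 0.0)
target = (0.0, -0.25, 0.0, 0.0)
loss = localization_loss(pred, target)
print(loss)  # 0.125 + 0.0 + 1.5 + 0.0 = 1.625
```

The linear branch means one badly predicted offset (here the width error of `2.0`) contributes `1.5` rather than the `2.0` a plain squared error term $0.5x^2$ would give, so outliers dominate the gradient less.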

Focal loss:
$$
L_{cls} = -\sum_{i=1}^{K}\left(y_i \log(p_i)(1-p_i)^\gamma \alpha_i + (1 - y_i)\log(1 - p_i)p_i^\gamma (1 - \alpha_i)\right)
$$

Here we have two interdependent hyper-parameters:
- Weighting parameter $\alpha$, which compensates for class imbalance, $\alpha_i \in [0, 1]$
- Focusing parameter $\gamma$, which down-weights easy examples so that background and foreground can be told apart, $\gamma \in (0, +\infty)$

In addition, during weight initialization the biases of the last classification layer are set by a special rule rather than to zero.

These parameters rebalance the loss so that hard negatives stay meaningful.
<br>
Because the vast majority of candidate boxes are background, a model can otherwise classify everything as background too easily.
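The effect of the focusing term is easy to see numerically. The sketch below follows the formula above but, as a simplification, uses one scalar `alpha` for all classes; the defaults `alpha=0.25` and `gamma=2.0` are the values reported in the RetinaNet paper, and the function name is illustrative.

```python
import math

def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Focal loss summed over classes; probs are sigmoid outputs, labels are 0 or 1."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        pos = y * alpha * (1.0 - p) ** gamma * math.log(p)
        neg = (1 - y) * (1.0 - alpha) * p ** gamma * math.log(1.0 - p)
        total -= pos + neg
    return total

easy_negative = focal_loss([0.01], [0])  # background, confidently predicted as background
hard_negative = focal_loss([0.9], [0])   # background, wrongly predicted as an object
print(easy_negative, hard_negative)
print(hard_negative / easy_negative)     # the hard example dominates by orders of magnitude
```

With `gamma=0.0` the modulating factor disappears and the expression reduces to an $\alpha$-weighted cross-entropy.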

Here's an example of training with and without focal loss:
<img src="images/od/focal_1.png" height="800" width="800">

Finally, each detector outputs $1$K boxes, which are then reduced by the non-maximum suppression (NMS) algorithm.
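A minimal sketch of greedy non-maximum suppression, assuming boxes in `(x1, y1, x2, y2)` format and an IoU threshold of `0.5` (both are choices made for this example):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about `0.68`, so it is suppressed; the third box does not overlap anything and survives. In practice NMS is usually run per class.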

## Multi-shot object detection

- Problems with detecting small objects
- Different size of sliding windows
- Run classifier network per patch
- Computationally expensive
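A rough count shows why running a classifier per patch is expensive. The image size, window sizes and strides below are assumptions picked for illustration.

```python
def count_windows(image_size, window_size, stride):
    """Number of square windows that fit into a square image."""
    if window_size > image_size:
        return 0
    per_axis = (image_size - window_size) // stride + 1
    return per_axis * per_axis

image_size = 512
window_sizes = (64, 128, 256)
# Stride of a quarter window, so neighbouring patches overlap heavily.
counts = {w: count_windows(image_size, w, w // 4) for w in window_sizes}
total_windows = sum(counts.values())
print(counts)         # {64: 841, 128: 169, 256: 25}
print(total_windows)  # 1035 classifier runs for a single image
```

Smaller windows, which are needed for small objects, account for most of the cost.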

#### Region proposals

Candidate regions of the image are computed with a classical (greedy) computer vision algorithm, Selective Search:
<img src="images/od/selective_search.jpg" height="800" width="800">

#### R-CNN

We compute 2000 region proposals and run each of them through a pre-trained CNN classifier (VGG-16 or ResNet-50):
<img src="images/od/r_cnn_1.png" height="800" width="800">

We train SVM classifiers on the extracted features, plus a bounding-box regressor to refine the boxes:
<img src="images/od/r_cnn_2.png" height="800" width="800">

- Improves accuracy
- Computationally expensive for training
- Computationally expensive for testing / inference
- Slow

#### Fast-RCNN

Instead of running the CNN on every region proposal cut out of the input image, we first run the whole image through the CNN once and then map the region proposals onto the smaller feature map

Region proposals have high recall, and many of them are classified as background

Region proposals are stored as an $N \times 5$ array: image index and box coordinates
<img src="images/od/region_proposal_cat.png" height="800" width="800">

Then we use ROI pooling to bring the features to a common size, and flatten the pooled parts for classification.
For each region, ROI pooling takes the corresponding section of the input feature map and scales it to a pre-defined size:
<img src="images/od/roi_pooling_1.gif" height="800" width="800">
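A single-channel sketch of ROI max pooling on nested lists. It assumes the region is given in feature-map cells with exclusive end coordinates and is at least `out_size` cells wide and high; the function name and the example region are illustrative.

```python
def roi_max_pool(fmap, roi, out_size=2):
    """Max-pool the region roi = (x1, y1, x2, y2) of fmap into out_size x out_size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Row range of bin i; bins may differ in size by one cell.
        r0 = y1 + (i * h) // out_size
        r1 = y1 + ((i + 1) * h) // out_size
        row = []
        for j in range(out_size):
            c0 = x1 + (j * w) // out_size
            c1 = x1 + ((j + 1) * w) // out_size
            row.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# 4 x 4 feature map with values 0..15 in row-major order.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(fmap, roi=(1, 0, 4, 3))  # a 3 x 3 region
print(pooled)  # [[1, 3], [9, 11]]
```

Whatever the size of the region, the output is always `out_size` by `out_size`, so it can be flattened and fed to fully connected layers.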

<img src="images/od/fast_r_cnn.png" height="800" width="800">

- Less computational resources
- Faster training time
- Faster testing / inference time
- The same performance

#### Faster-RCNN

Instead of using selective search, let's use a separate convolutional neural network, the region proposal network (RPN), which computes the region proposals

<img src="images/od/faster_r_cnn_1.png" height="800" width="800">

<img src="images/od/faster_r_cnn_2.png" height="800" width="800">

- Less computationally expensive
- Region proposals calculation is learnable
- Faster for training
- Faster for test / inference

## Questions
<img src="images/od/questions_2.jpg" height="800" width="800">

## Thank you