![](./images/FFD06809.png)

https://arxiv.org/pdf/1504.08083.pdf
https://github.com/rbgirshick/fast-rcnn

[R-CNN](./Rich%20feature%20hierarchies%20for%20accurate%20object%20detection%20and%20semantic%20segmentation.ipynb) drawbacks:
* Training is a multi-stage pipeline
* Training is expensive in space and time
* Object detection is slow

SPPnet (Spatial pyramid pooling networks) [11]:
* Sharing computation to accelerate
* Training is a multi-stage pipeline

Contributions:
1. Higher detection quality (mAP)
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching

## 2.1. The RoI pooling layer

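RoI: $(r, c, h, w)$, where $(r, c)$ is the top-left corner and $(h, w)$ the height and width. The $h \times w$ window is divided into an $H \times W$ grid of sub-windows of approximate size $h / H \times w / W$, and each sub-window is max-pooled into one output cell.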
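A minimal NumPy sketch of this pooling step for a single-channel feature map. The function name `roi_max_pool` and the floor/ceil rule for bin edges are assumptions made for illustration, not the reference implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid.

    roi is (r, c, h, w): top-left row/column plus height/width, measured in
    feature-map cells. Bin edges use floor/ceil (an assumed convention), so
    every bin covers at least one cell.
    """
    r, c, h, w = roi
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # Rows covered by output row i
        r0 = r + int(np.floor(i * h / out_h))
        r1 = r + int(np.ceil((i + 1) * h / out_h))
        for j in range(out_w):
            # Columns covered by output column j
            c0 = c + int(np.floor(j * w / out_w))
            c1 = c + int(np.ceil((j + 1) * w / out_w))
            out[i, j] = feature_map[r0:r1, c0:c1].max()
    return out

feature_map = np.arange(36, dtype=np.float64).reshape(6, 6)
pooled = roi_max_pool(feature_map, roi=(1, 1, 4, 4), out_h=2, out_w=2)
```

Whatever the RoI size, the output is always $H \times W$, which is what lets the following fully connected layers accept arbitrary proposals.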

## 2.2. Initializing from pre-trained networks

https://github.com/BVLC/caffe/wiki/Model-Zoo

* CaffeNet [14]
* VGG_CNN_M_1024 [3]
* VGG16 [20]

## 2.3. Fine-tuning for detection

Sampling $N$ images and then sampling $R / N$ RoIs from each image

### Multi-task loss

Outputs:
* Discrete probability distribution: $p = (p_0, \dots, p_K)$ (softmax over fc layer)
* Bounding-box regression offsets: $t^k = (t_x^k, t_y^k, t_w^k, t_h^k)$ for each of the $K$ object classes, indexed by $k$

Ground truth:
* $u$: class, background is 0
* $v$: bounding-box regression target

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [ u \ge 1 ] L_{\text{loc}}(t^u, v)$$

$$L_{\text{cls}}(p, u) = - \log p_u$$

$$L_{\text{loc}}(t^u, v) = \sum_{i \in \{ x, y, w, h \}} \text{smooth}_{L_1}(t_i^u - v_i)$$

$$\text{smooth}_{L_1}(x) = \left \{ \begin{array}{ll}
0.5 x^2 & \text{if}~|x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{array} \right .$$

Smooth $L_1$ is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet.

Normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use $\lambda=1$.
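The loss for a single RoI can be sketched in NumPy as follows. The function names and the choice to pass only the offsets of the true class (`t_u`) are assumptions made to keep the example short.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Multi-task loss for one RoI.

    p   : (K+1,) softmax probabilities, index 0 is background
    u   : ground-truth class index
    t_u : (4,) predicted offsets for class u
    v   : (4,) regression target
    """
    l_cls = -np.log(p[u])
    # The Iverson bracket [u >= 1] switches off localization for background
    l_loc = smooth_l1(t_u - v).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

p = np.array([0.2, 0.5, 0.3])
t_u = np.array([0.5, 0.0, 2.0, 0.0])
v = np.zeros(4)
loss_object = multi_task_loss(p, 1, t_u, v)      # classification + localization
loss_background = multi_task_loss(p, 0, t_u, v)  # classification only
```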

```python
import numpy as np
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import Scatter

# Evaluate smooth L1 on [-2, 2): quadratic for |x| < 1, linear otherwise
x = np.arange(-2, 2, 0.01)
y = 0.5 * x ** 2 * (np.abs(x) < 1) + (np.abs(x) - 0.5) * (np.abs(x) >= 1)

# Render the curve inline in the notebook
init_notebook_mode(connected=True)
iplot({'data': [Scatter(x=x, y=y)]})
```

### Mini-batch sampling

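$N = 2$, $R = 128$, so 64 RoIs are sampled from each image.

* 25% of the RoIs come from proposals with IoU $\ge 0.5$ with a ground-truth box (foreground, $u \ge 1$)
* The remaining 75% come from proposals whose maximum IoU with ground truth lies in $[0.1, 0.5)$ (background, $u = 0$)
* Images are horizontally flipped with probability 0.5; no other data augmentation is used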
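A rough sketch of the per-image sampling rule, assuming the maximum IoU of every proposal with the ground-truth boxes has already been computed. The name `sample_rois` and its parameters are hypothetical.

```python
import numpy as np

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25, rng=None):
    """Return (foreground, background) proposal indices for one image.

    max_ious[i] is proposal i's maximum IoU with any ground-truth box.
    Fewer RoIs are returned if an image lacks enough candidates.
    """
    rng = np.random.default_rng() if rng is None else rng
    fg = np.flatnonzero(max_ious >= 0.5)
    bg = np.flatnonzero((max_ious >= 0.1) & (max_ious < 0.5))
    n_fg = min(int(round(fg_fraction * rois_per_image)), fg.size)
    n_bg = min(rois_per_image - n_fg, bg.size)
    # Shuffle, then take a prefix: sampling without replacement
    return rng.permutation(fg)[:n_fg], rng.permutation(bg)[:n_bg]

rng = np.random.default_rng(0)
max_ious = rng.random(2000)        # stand-in IoU values for 2000 proposals
fg_idx, bg_idx = sample_rois(max_ious, rng=rng)
flip = rng.random() < 0.5          # horizontal flip with probability 0.5
```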

### Back-propagation through RoI pooling layers

* $x_i \in \mathbb{R}$: the $i$-th activation input into the RoI pooling layer
* $y_{rj}$: $j$-th output from the $r$-th RoI

* $y_{rj} = x_{i^* (r, j)}$
* $i^*(r, j) = \arg \max_{i' \in R(r, j)} x_{i'}$
* $R(r, j)$: the index set of inputs in the sub-window over which the output unit $y_{rj}$ max-pools

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r, j) ] \frac{\partial L}{\partial y_{rj}}$$
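The double sum is a scatter-add: each output gradient is routed to the input that won the max, and inputs selected by several RoIs accumulate all of their gradients. A small sketch, assuming the argmax indices $i^*(r, j)$ were stored during the forward pass (the function name is illustrative):

```python
import numpy as np

def roi_pool_backward(num_inputs, argmax, grad_y):
    """Compute dL/dx from dL/dy for RoI max pooling.

    argmax[r, j] : flat index i*(r, j) of the input selected for y_rj
    grad_y[r, j] : dL/dy_rj
    """
    grad_x = np.zeros(num_inputs, dtype=grad_y.dtype)
    # np.add.at accumulates correctly when the same index appears repeatedly
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())
    return grad_x

argmax = np.array([[0, 2], [2, 4]])          # input 2 is the max for two outputs
grad_y = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_x = roi_pool_backward(5, argmax, grad_y)
```

Overlapping RoIs share inputs, which is why the gradient must be summed over $r$ rather than assigned.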

### SGD hyper-parameters

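Init: the fc layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.

All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001.

Train for 30k mini-batch iterations, then lower the learning rate to 0.0001 and train for another 10k iterations.

A momentum of 0.9 and a parameter decay of 0.0005 (on weights and biases) are used.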
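These settings can be read as the following update rule. This is a sketch of one common momentum-SGD formulation with the decay term added to the gradient; the function names are hypothetical and the exact solver update is an assumption.

```python
def learning_rate(iteration, base_lr=0.001, step=30_000, gamma=0.1):
    """Global learning rate: 0.001 for the first 30k iterations, then 0.0001."""
    return base_lr * gamma if iteration >= step else base_lr

def sgd_step(param, grad, velocity, iteration,
             lr_mult=1.0, momentum=0.9, decay=0.0005):
    """One momentum-SGD update.

    lr_mult is the per-layer multiplier: 1 for weights, 2 for biases.
    Returns the updated (param, velocity).
    """
    lr = lr_mult * learning_rate(iteration)
    velocity = momentum * velocity - lr * (grad + decay * param)
    return param + velocity, velocity

# A bias (lr_mult=2) moves twice as far as a weight for the same gradient
w, v_w = sgd_step(1.0, 0.5, 0.0, iteration=0, lr_mult=1.0)
b, v_b = sgd_step(1.0, 0.5, 0.0, iteration=0, lr_mult=2.0)
```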

## 2.4. Scale invariance

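* Brute force (single scale): each image is processed at a pre-defined pixel size, and the network must learn scale invariance directly from the training data
* Image pyramids (multi-scale): approximate scale invariance is provided to the network through an image pyramid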

## 3.1. Truncated SVD for faster detection

Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23]. For a $u \times v$ weight matrix $W$:

$$ W \approx U \Sigma_t V^T$$

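The single fc layer is replaced by two fc layers with no non-linearity between them, reducing the parameter count from $uv$ to $t(u + v)$:
* First: $\Sigma_t V^T$ (no biases), $(t \times t)(t \times v) = t \times v$
* Second: $U$ (with the original biases of $W$), $u \times t$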
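The factorization can be checked numerically with `np.linalg.svd`. In this sketch the first layer holds $\Sigma_t V^T$ and the second holds $U$ truncated to its first $t$ columns; the helper name `compress_fc` and the toy sizes are assumptions.

```python
import numpy as np

def compress_fc(W, t):
    """Split a (u, v) weight matrix into two layers using the top t singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t]   # Sigma_t V^T, shape (t, v)
    second = U[:, :t]              # U truncated to t columns, shape (u, t)
    return first, second

rng = np.random.default_rng(0)
# A rank-3 matrix with u=8, v=12, so t=3 reproduces it up to rounding
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 12))
first, second = compress_fc(W, t=3)

x = rng.standard_normal(12)
y_full = W @ x
y_compressed = second @ (first @ x)

params_before = W.size                      # u * v
params_after = first.size + second.size     # t * (u + v)
```

For a real fc layer the weight matrix is not exactly low-rank, so a smaller $t$ trades some accuracy for speed.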