Source: NIPS

Year: 2019

Authors: Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, Qixiang Ye

Institutions: University of Chinese Academy of Sciences, Xiamen University, Peng Cheng Laboratory

Anchor-based detectors leverage spatial alignment, i.e., Intersection over Union (IoU) between objects and anchors, as the criterion for anchor assignment.

On the one hand, for objects with acentric features, the most representative features are not close to the object centers. A spatially aligned anchor might then correspond to fewer representative features, which deteriorates classification and localization capabilities. On the other hand, it is infeasible to match proper anchors/features to objects using IoU when multiple objects are crowded together.

1. To achieve a high recall rate, the detector is required to guarantee that for each object at least one anchor’s prediction is close to the ground-truth.
2. In order to achieve high detection precision, the detector needs to classify anchors with poor localization (large bounding box regression error) into background.
3. The predictions of anchors should be compatible with the non-maximum suppression (NMS) procedure, i.e., the higher the classification score is, the more accurate the localization is. Otherwise, an anchor with accurate localization but low classification score could be suppressed when using the NMS process.

The loss function of a one-stage detector is
$$\mathcal{L}(\theta)=\sum_{a_j\in A_+}\sum_{b_i\in B}C_{ij}\mathcal{L}_{ij}^{cls}(\theta)+\beta\sum_{a_j\in A_+}\sum_{b_i\in B}C_{ij}\mathcal{L}_{ij}^{loc}(\theta)+\sum_{a_j\in A_-}\mathcal{L}_j^{bg}(\theta)$$
where $\theta$ denotes the network parameters to be learned, $C_{ij}\in\{0,1\}$ indicates whether object $b_i$ is matched to anchor $a_j$, $\mathcal{L}_{ij}^{cls}(\theta)=BCE(a_j^{cls},b_i^{cls},\theta)$, $\mathcal{L}_{ij}^{loc}(\theta)=SmoothL1(a_j^{loc},b_i^{loc},\theta)$, and $\mathcal{L}_j^{bg}(\theta)=BCE(a_j^{cls},\vec{0},\theta)$.
It is converted into a likelihood probability:
$$\mathcal{P}(\theta)=e^{-\mathcal{L}(\theta)}=\prod_{a_j\in A_+}(\sum_{b_i\in B}C_{ij}\mathcal{P}_{ij}^{cls}(\theta))\prod_{a_j\in A_+}(\sum_{b_i\in B}C_{ij}\mathcal{P}_{ij}^{loc}(\theta))\prod_{a_j\in A_-}\mathcal{P}_j^{bg}(\theta)$$
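The step from the loss to the product form can be spelled out for the classification term. This is a sketch under two assumptions that the equation above implies but does not state: each positive anchor is matched to exactly one object, so $\sum_{b_i\in B}C_{ij}=1$ for every $a_j\in A_+$, and $\mathcal{P}_{ij}^{cls}(\theta)=e^{-\mathcal{L}_{ij}^{cls}(\theta)}$.

$$e^{-\sum_{a_j\in A_+}\sum_{b_i\in B}C_{ij}\mathcal{L}_{ij}^{cls}(\theta)}=\prod_{a_j\in A_+}e^{-\sum_{b_i\in B}C_{ij}\mathcal{L}_{ij}^{cls}(\theta)}=\prod_{a_j\in A_+}\sum_{b_i\in B}C_{ij}e^{-\mathcal{L}_{ij}^{cls}(\theta)}=\prod_{a_j\in A_+}\sum_{b_i\in B}C_{ij}\mathcal{P}_{ij}^{cls}(\theta)$$

The third equality holds because, for a fixed $a_j$, exactly one $C_{ij}$ equals 1 and the rest are 0. The localization and background factors follow in the same way, with $\mathcal{P}_{ij}^{loc}(\theta)=e^{-\beta\mathcal{L}_{ij}^{loc}(\theta)}$ and $\mathcal{P}_j^{bg}(\theta)=e^{-\mathcal{L}_j^{bg}(\theta)}$.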

To optimize object-anchor matching, the detection framework is extended by introducing a detection-customized likelihood.

To implement the likelihood, first construct a bag of candidate anchors for each object $b_i$ by selecting top-ranked anchors $A_i\subset A$ in terms of their IoU with the object.
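The bag construction can be illustrated with a minimal NumPy sketch. The function names `pairwise_iou` and `build_anchor_bags`, the `(x1, y1, x2, y2)` box layout, and the `bag_size` parameter are assumptions made for this example, not part of the original method description; boxes are assumed to have positive area.

```python
import numpy as np

def pairwise_iou(boxes, anchors):
    """IoU between every object box and every anchor.

    Both inputs are arrays of (x1, y1, x2, y2) rows (an assumed layout).
    Returns an array of shape (num_objects, num_anchors).
    """
    boxes = np.asarray(boxes, dtype=float)[:, None, :]      # (M, 1, 4)
    anchors = np.asarray(anchors, dtype=float)[None, :, :]  # (1, N, 4)
    # Width and height of the intersection rectangle, clipped at zero.
    top_left = np.maximum(boxes[..., :2], anchors[..., :2])
    bottom_right = np.minimum(boxes[..., 2:], anchors[..., 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_b = (boxes[..., 2] - boxes[..., 0]) * (boxes[..., 3] - boxes[..., 1])
    area_a = (anchors[..., 2] - anchors[..., 0]) * (anchors[..., 3] - anchors[..., 1])
    return inter / (area_b + area_a - inter)

def build_anchor_bags(boxes, anchors, bag_size):
    """For each object, indices of the bag_size anchors with the highest IoU."""
    iou = pairwise_iou(boxes, anchors)
    # Sort anchors by descending IoU per object and keep the first bag_size.
    return np.argsort(-iou, axis=1)[:, :bag_size]

# Hypothetical toy data: one object and four anchors.
objects = [[0, 0, 10, 10]]
candidates = [[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30], [0, 0, 10, 5]]
bags = build_anchor_bags(objects, candidates, bag_size=2)
```

In this toy case the bag holds the identical anchor and the half-height anchor, since the other two overlap the object less or not at all.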

To optimize the recall rate, for each object $b_i\in B$ there must exist at least one anchor $a_j\in A_i$ whose prediction is close to the ground-truth.
$$\mathcal{P}_{recall}(\theta)=\prod_i\max_{a_j\in A_i}(\mathcal{P}_{ij}^{cls}(\theta)\mathcal{P}_{ij}^{loc}(\theta))$$
To achieve increased detection precision, detectors need to classify the anchors of poor localization into the background class.
$$\mathcal{P}_{precision}(\theta)=\prod_j(1-P(a_j\in A_-)(1-\mathcal{P}_j^{bg}(\theta)))$$
where $P(a_j\in A_-)=1-\max_iP(a_j\to b_i)$ is the probability that $a_j$ misses all objects and $P(a_j\to b_i)$ denotes the probability that anchor $a_j$ correctly predicts object $b_i$.

To be compatible with the NMS procedure, $P(a_j\to b_i)$ should have the following three properties:
1. $P(a_j\to b_i)$ is a monotonically increasing function of the $IoU$ between $a_j$ and $b_i$, $IoU_{ij}^{loc}$.
2. When $IoU_{ij}^{loc}$ is smaller than a threshold $t$, $P(a_j\to b_i)$ is close to 0.
3. For an object $b_i$, there exists one and only one $a_j$ satisfying $P(a_j\to b_i)=1$.

These properties can be satisfied with a saturated linear function, as:
$$F(x,t_1,t_2)=\left\{\begin{matrix}0,&x\leq t_1\\\frac{x-t_1}{t_2-t_1},&t_1<x<t_2\\1,&x\geq t_2\end{matrix}\right.$$
$$P(a_j\to b_i)=F(IoU_{ij}^{loc},t,\max_j(IoU_{ij}^{loc}))$$
$$\mathcal{P}(\theta)=\mathcal{P}_{recall}(\theta)\times\mathcal{P}_{precision}(\theta)=\prod_i\max_{a_j\in A_i}(\mathcal{P}_{ij}^{cls}(\theta)\mathcal{P}_{ij}^{loc}(\theta))\times\prod_j(1-P(a_j\in A_-)(1-\mathcal{P}_j^{bg}(\theta)))$$
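The saturated linear function and the two probabilities built from it can be sketched in NumPy as follows. The function names, the small guard on the denominator, and the threshold value used in the example are assumptions for illustration; the sketch also assumes that each object's best IoU exceeds $t$.

```python
import numpy as np

def saturated_linear(x, t1, t2):
    """F(x, t1, t2): 0 below t1, 1 above t2, linear in between."""
    # Clipping the linear ramp to [0, 1] covers all three branches at once.
    return np.clip((x - t1) / (t2 - t1), 0.0, 1.0)

def match_probability(iou, t):
    """P(a_j -> b_i) for an IoU matrix of shape (num_objects, num_anchors)."""
    iou = np.asarray(iou, dtype=float)
    # Upper saturation point: the best IoU any anchor reaches for each object.
    best = iou.max(axis=1, keepdims=True)
    # Guard (an assumption of this sketch) against best <= t, which would
    # make the denominator zero or negative.
    best = np.maximum(best, t + 1e-12)
    return saturated_linear(iou, t, best)

def miss_probability(iou, t):
    """P(a_j in A_-) = 1 - max_i P(a_j -> b_i), one value per anchor."""
    return 1.0 - match_probability(iou, t).max(axis=0)

# Hypothetical IoU values for two objects and four anchors.
iou = np.array([[0.9, 0.75, 0.6, 0.3],
                [0.2, 0.5, 0.8, 0.7]])
p_match = match_probability(iou, t=0.6)
p_miss = miss_probability(iou, t=0.6)
```

Here the best anchor of each object gets probability 1, anchors at or below the threshold get 0, and an anchor halfway between the threshold and the best IoU gets 0.5.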

The max function selects the best anchor for each object. In early training epochs, however, the confidence of all anchors is small because the network parameters are randomly initialized, so the anchor with the highest confidence is not necessarily suitable for detector training. The Mean-max function is used instead, defined as:
$$MeanMax(X)=\frac{\sum_{x_j\in X}\frac{x_j}{1-x_j}}{\sum_{x_j\in X}\frac{1}{1-x_j}}$$
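A small NumPy sketch makes the behavior of Mean-max concrete. The function name `mean_max` and the `eps` clipping that keeps inputs strictly below 1 are assumptions of this example.

```python
import numpy as np

def mean_max(x, eps=1e-12):
    """Weighted average of x with weights 1 / (1 - x_j)."""
    # Keep values strictly below 1 so the weights stay finite (assumed guard).
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0 - eps)
    weights = 1.0 / (1.0 - x)
    return float(np.sum(weights * x) / np.sum(weights))

# Early training: all likelihoods are small, so the weights are nearly equal.
early = mean_max([0.01, 0.02, 0.03])
# Later training: one likelihood approaches 1 and dominates the weights.
late = mean_max([0.1, 0.2, 0.999])
```

Because the result is a weighted average, it always lies between the smallest and largest input. When all inputs are small it stays close to their mean, so almost every anchor in the bag contributes; when one input approaches 1 it moves close to the max, so a single anchor is effectively selected.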

Replacing the max function with Mean-max, adding balance factors $w_1$ and $w_2$, and applying focal loss to the second term, the detection-customized loss function of a FreeAnchor detector is:
$$\mathcal{L}(\theta)=-\log\mathcal{P}(\theta)=-w_1\sum_i\log(MeanMax(X_i))+w_2\sum_jFL(P(a_j\in A_-)(1-\mathcal{P}_j^{bg}(\theta)))$$
where $X_i=\{\mathcal{P}_{ij}^{cls}(\theta)\mathcal{P}_{ij}^{loc}(\theta)\mid a_j\in A_i\}$ is the likelihood set corresponding to the anchor bag $A_i$. Using the parameters $\alpha$ and $\gamma$ from focal loss, $w_1=\frac{\alpha}{|B|}$, $w_2=\frac{1-\alpha}{n|B|}$, and $FL(x)=-x^\gamma\log(1-x)$, where $n$ is the number of anchors in each bag.
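Putting the pieces together, the loss can be evaluated on precomputed probabilities as in the sketch below. This is an illustration under stated assumptions rather than a reference implementation: the function names, the array layout, the `eps` guards, and the default values of `alpha` and `gamma` are all choices made for this example, and the probabilities are passed in directly instead of being produced by a network.

```python
import numpy as np

def mean_max(x, eps=1e-12):
    """Weighted average of x with weights 1 / (1 - x_j)."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0 - eps)
    weights = 1.0 / (1.0 - x)
    return float(np.sum(weights * x) / np.sum(weights))

def focal_term(x, gamma):
    """FL(x) = -x^gamma * log(1 - x)."""
    return -(x ** gamma) * np.log(1.0 - x)

def free_anchor_loss(p_cls, p_loc, bags, p_bg, p_miss,
                     alpha=0.5, gamma=2.0, eps=1e-12):
    """Loss from precomputed probabilities (assumed layout).

    p_cls, p_loc: (num_objects, num_anchors) arrays of P_ij^cls and P_ij^loc.
    bags:         (num_objects, n) integer anchor indices, one bag per object.
    p_bg:         (num_anchors,) array of P_j^bg.
    p_miss:       (num_anchors,) array of P(a_j in A_-).
    """
    p_cls, p_loc = np.asarray(p_cls, float), np.asarray(p_loc, float)
    p_bg, p_miss = np.asarray(p_bg, float), np.asarray(p_miss, float)
    bags = np.asarray(bags)
    num_objects, n = bags.shape
    w1 = alpha / num_objects
    w2 = (1.0 - alpha) / (n * num_objects)

    # Recall term: -sum_i log MeanMax(X_i) over each object's anchor bag.
    joint = p_cls * p_loc
    recall_term = -sum(np.log(mean_max(joint[i, bags[i]]))
                       for i in range(num_objects))

    # Precision term: focal loss on P(a_j in A_-) * (1 - P_j^bg) per anchor.
    x = np.clip(p_miss * (1.0 - p_bg), 0.0, 1.0 - eps)
    precision_term = np.sum(focal_term(x, gamma))

    return float(w1 * recall_term + w2 * precision_term)

# Hypothetical toy input: one object, two anchors, both in the bag.
p_cls = np.array([[0.5, 0.5]])
p_loc = np.array([[1.0, 1.0]])
bags = np.array([[0, 1]])
loss_clean = free_anchor_loss(p_cls, p_loc, bags,
                              p_bg=[1.0, 1.0], p_miss=[1.0, 1.0])
loss_noisy = free_anchor_loss(p_cls, p_loc, bags,
                              p_bg=[0.5, 0.5], p_miss=[1.0, 1.0])
```

In the toy input, lowering the background probability of anchors that miss all objects increases the loss, which is the effect the precision term is meant to have.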