# RPN (Region Proposal Network)

## Architecture

To generate region proposals, we slide a small network over the conv feature map output by the last
shared conv layer. This network is fully connected to an n × n spatial window of the input conv feature map. 
Each sliding window is mapped to a lower-dimensional vector (256-d for ZF and 512-d for VGG).
This vector is fed into two sibling fully-connected layers: a box-regression layer (reg)
and a box-classification layer (cls).

This architecture is naturally implemented with an **n × n conv layer followed by two sibling 1 × 1 conv
layers (for reg and cls, respectively)**. ReLUs [15] are applied to the output of the n × n conv layer.


At each sliding-window location (that is, at each anchor point), we simultaneously predict k region proposals, so the reg layer has **4k outputs encoding the coordinates of k boxes**. The cls layer outputs **2k scores that estimate
the probability of object / not-object for each proposal**. The k proposals are parameterized relative to
k reference boxes, called anchors. Each anchor is centered at the sliding window in question, and is
associated with a scale and aspect ratio. We use 3 scales and 3 aspect ratios, yielding k = 9 anchors
at each sliding position. For a conv feature map of size W × H (typically ∼2,400 positions), there are W × H × k
anchors in total.
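The anchor layout described above can be sketched in a few lines of plain Python. This is only an illustration: the stride of 16 pixels, the scale values, the ratio values, and the convention that a ratio means height / width are all assumptions, since these notes only state that there are 3 scales and 3 aspect ratios.

```python
import math

def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors in image coordinates.

    stride, scales and ratios are illustrative assumptions, not values
    taken from these notes.
    """
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Centre of this sliding-window position, in image pixels.
            cx = ix * stride + stride / 2.0
            cy = iy * stride + stride / 2.0
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while setting h / w = r.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(feat_w=60, feat_h=40)
print(len(anchors))  # W * H * k = 60 * 40 * 9 = 21600
```

Each feature-map position contributes exactly k = 9 boxes, so the total count is always W × H × k.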


## Training set labeling

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We
assign a positive label to two kinds of anchors: 

(i) the anchor or anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box,

(ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. 

Note that a single ground-truth box may assign positive labels
to multiple anchors. 

We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
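A minimal sketch of these labeling rules follows. The box format `(x1, y1, x2, y2)`, the function names, and the use of `-1` to mark anchors that are neither positive nor negative are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thresh:
            labels.append(1)    # rule (ii): IoU above 0.7 with some box
        elif best < neg_thresh:
            labels.append(0)    # IoU below 0.3 for all boxes
        else:
            labels.append(-1)   # does not contribute to the objective
    # Rule (i): the best-matching anchor(s) of every ground-truth box
    # become positive, even if their IoU is not above pos_thresh.
    for j in range(len(gt_boxes)):
        col = [row[j] for row in overlaps]
        best = max(col) if col else 0.0
        if best > 0:
            for i, v in enumerate(col):
                if v == best:
                    labels[i] = 1
    return labels
```

Rule (i) is applied last so that every ground-truth box that overlaps at least one anchor ends up with a positive anchor.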

## Loss Function

$$
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*).
$$

Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability of anchor $i$ being
an object. The ground-truth label $p_i^*$ is 1 if the anchor is positive, and is 0 if the anchor is negative. $t_i$
is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that
of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}$ is log loss over
two classes (object vs. not object). For the regression loss, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where
$R$ is the robust loss function (smooth L1) defined in [5]. The term $p_i^* L_{reg}$ means the regression loss
is activated only for positive anchors ($p_i^* = 1$) and is disabled otherwise ($p_i^* = 0$). The outputs of
the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$ respectively. The two terms are normalized by $N_{cls}$
and $N_{reg}$, and weighted by a balancing weight $\lambda$.
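The loss above can be written out directly for a small mini-batch. The sketch below is a plain-Python illustration under assumptions: the smooth L1 form is the usual one (quadratic for |x| < 1, linear otherwise), the regression loss sums it over the 4 coordinates, and `n_cls`, `n_reg` and `lam` are passed in by the caller because these notes do not fix their values.

```python
import math

def smooth_l1(x):
    """Robust loss R: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam):
    """p: predicted object probabilities, p_star: 0/1 anchor labels,
    t / t_star: predicted / target 4-tuples, one per anchor."""
    eps = 1e-12  # guards against log(0)
    l_cls = 0.0
    l_reg = 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        # Two-class log loss (object vs. not object).
        if ps == 1:
            l_cls += -math.log(max(pi, eps))
        else:
            l_cls += -math.log(max(1.0 - pi, eps))
        # Multiplying by p_star switches regression off for negatives.
        l_reg += ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return l_cls / n_cls + lam * l_reg / n_reg
```

For a single negative anchor the second term vanishes whatever the predicted coordinates are, which is exactly the role of the factor $p_i^*$.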



For regression, we adopt the parameterizations of the 4 coordinates following [6]:

$$
\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, & t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\
t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, & t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a),
\end{aligned}
$$

where $x$, $y$, $w$, and $h$ denote the two coordinates of the box center, width, and height. Variables
$x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise
for $y$, $w$, $h$). This can be thought of as bounding-box regression from an anchor box to a nearby
ground-truth box.
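As a small illustration of this parameterization, the functions below encode a box relative to an anchor and invert the mapping. Boxes are assumed to be given as `(cx, cy, w, h)` tuples; the function names are hypothetical.

```python
import math

def encode(box, anchor):
    """Map a (cx, cy, w, h) box to (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Inverse of encode: recover (cx, cy, w, h) from the offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))
```

Encoding the ground-truth box gives the regression target, and decoding the network output gives the predicted box. An anchor encoded against itself yields all zeros, so the network only has to learn offsets from the anchor.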


## Mini batch sampling and hyper parameters

Each mini-batch arises from a single image
that contains many positive and negative anchors. It is possible to optimize for the loss functions of
all anchors, but this will bias towards negative samples because they dominate. Instead, we randomly
sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled
positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples
in an image, we pad the mini-batch with negative ones.
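The sampling rule can be sketched as follows. The label convention (1 positive, 0 negative, anything else ignored) and the function name are assumptions made for this example.

```python
import random

def sample_minibatch(labels, batch_size=256, rng=random):
    """Pick anchor indices for one mini-batch.

    At most half of the batch is positive; the remainder is filled
    with negatives, so a shortage of positives is padded with negatives.
    """
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)
```

With only 10 positive anchors in an image, for example, the batch would hold those 10 positives and 246 negatives.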

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution
with standard deviation 0.01. 

All other layers (i.e., the shared conv layers) are initialized by pretraining a model for ImageNet classification, as is standard practice. For the VGG net, we fine-tune only the layers from conv3_1 and up to conserve memory [5].

We use a **learning rate
of 0.001** for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL dataset.
We also use a momentum of 0.9 and a weight decay of 0.0005 [11].
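The schedule and optimizer settings can be summarized in code. This is a sketch only: the step function follows the 60k / 20k split stated above, while the update rule shown is one common formulation of SGD with momentum and L2 weight decay, assumed here because these notes do not spell out the exact update.

```python
def learning_rate(iteration, base_lr=0.001, drop_at=60000, factor=0.1):
    """Step schedule: base_lr for the first drop_at mini-batches,
    then base_lr * factor for the remaining ones."""
    return base_lr if iteration < drop_at else base_lr * factor

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One scalar update (assumed formulation): weight decay is added
    to the gradient and the velocity accumulates past updates."""
    v_new = momentum * v - lr * (grad + weight_decay * w)
    return w + v_new, v_new
```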





# Detection Network

## Architecture

The detection network takes as input a conv feature map and a set
of object proposals. For each object proposal from the RPN, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
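To make the "fixed-length" property concrete, here is a rough single-channel sketch of RoI max pooling in plain Python. The RoI is assumed to be given in feature-map cells with inclusive corners and to lie inside the map, and the 2 × 2 output grid is chosen only to keep the example small.

```python
import math

def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """feature: 2D list indexed [row][col]; roi: (x1, y1, x2, y2) in
    feature-map cells, inclusive. Returns an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_w = x2 - x1 + 1
    roi_h = y2 - y1 + 1
    pooled = []
    for py in range(out_h):
        # Row range of this bin; floor/ceil keep every bin non-empty.
        ys = y1 + math.floor(py * roi_h / out_h)
        ye = y1 + math.ceil((py + 1) * roi_h / out_h)
        row = []
        for px in range(out_w):
            xs = x1 + math.floor(px * roi_w / out_w)
            xe = x1 + math.ceil((px + 1) * roi_w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Whatever the size of the RoI, the output always has `out_h * out_w` values per channel, which is what lets proposals of different sizes feed the same fully connected layers.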


Each feature vector is fed into a sequence of fully connected
(fc) layers that finally branch into two sibling output layers:

* one that produces softmax probability estimates over K object classes plus a catch-all “background” class, and
* another that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.


## Training set labeling

As in [9], we take 25% of the RoIs from object proposals that
have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise
the examples labeled with a foreground object class, i.e.
u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval , following . These are the background
examples and are labeled with u = 0.

## Loss function


Each training RoI is labeled with a ground-truth class u
and a ground-truth bounding-box regression target v. 

We
use a multi-task loss L on each labeled RoI to jointly train
for classification and bounding-box regression:

$$
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v),
$$


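in which $L_{cls}(p, u) = -\log p_u$ is log loss for true class $u$.
The second task loss, $L_{loc}$, is defined over a tuple of
true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple
$t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$.
The indicator $[u \ge 1]$ is 1 when $u \ge 1$ and 0 otherwise, so the localization term is ignored for background RoIs ($u = 0$).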
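A per-RoI version of this multi-task loss might look like the sketch below. It assumes that $L_{loc}$ is a sum of smooth L1 terms over the four coordinates (the same robust loss the RPN regression uses) and that λ defaults to 1; neither choice is stated in these notes.

```python
import math

def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def detection_loss(p, u, t_u, v, lam=1.0):
    """p: softmax probabilities over K + 1 classes (index 0 is background),
    u: ground-truth class, t_u: predicted 4-tuple for class u,
    v: regression target 4-tuple."""
    l_cls = -math.log(max(p[u], 1e-12))
    # The indicator [u >= 1] drops the box loss for background RoIs.
    l_loc = sum(smooth_l1(a - b) for a, b in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

For a background RoI (u = 0) only the classification term remains, since there is no ground-truth box to regress to.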

## Mini batch sampling and hyper parameters 

During fine-tuning, each SGD
mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches
of size R = 128, sampling 64 RoIs from each image. The sampling strategy is described in the training set labeling section above.
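As a sketch of how the per-image sampling could be coded: with 64 RoIs per image and 25% foreground, up to 16 RoIs are foreground and the rest background. The lower bound of the background IoU interval is left as a caller-supplied parameter (`bg_lo`, a hypothetical name) because it is not given in these notes; using 0.5 as the exclusive upper bound and topping up with background when foreground RoIs are scarce are assumptions.

```python
import random

def sample_rois(max_ious, bg_lo, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, rng=random):
    """max_ious[i] is the best IoU of proposal i with any ground-truth box.
    Returns (foreground indices, background indices)."""
    fg = [i for i, o in enumerate(max_ious) if o >= fg_thresh]
    bg = [i for i, o in enumerate(max_ious) if bg_lo <= o < fg_thresh]
    n_fg = min(len(fg), int(rois_per_image * fg_fraction))
    # Assumption: unused foreground slots are filled with background.
    n_bg = min(len(bg), rois_per_image - n_fg)
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```

Calling this once per image for N = 2 images gives the R = 128 RoIs of a mini-batch.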

The fully connected layers used for softmax classification and bounding-box regression are
initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.

# Proposal flow from RPN to detection network

 During training, we
ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600
image, there will be roughly 20k (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors
ignored, there are about 6k anchors per image for training.
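Filtering out cross-boundary anchors amounts to a simple bounds check. In this sketch, anchors are assumed to be `(x1, y1, x2, y2)` tuples in image pixels, and an anchor that exactly touches the border is treated as inside.

```python
def inside_image(anchors, img_w, img_h):
    """Return the indices of anchors that lie fully inside the image."""
    return [i for i, (x1, y1, x2, y2) in enumerate(anchors)
            if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h]
```

Only the anchors at the returned indices would then be labeled and sampled for the RPN loss.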

During testing, however, we still apply the fully-convolutional RPN to the entire
image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
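Clipping a proposal to the image boundary can be sketched as below, again assuming `(x1, y1, x2, y2)` boxes in image pixels.

```python
def clip_box(box, img_w, img_h):
    """Clamp every corner coordinate into [0, img_w] x [0, img_h]."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0), img_w), min(max(y1, 0), img_h),
            min(max(x2, 0), img_w), min(max(y2, 0), img_h))

print(clip_box((-10, 5, 1200, 700), 1000, 600))  # (0, 5, 1000, 600)
```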

Some RPN proposals highly overlap with each other. To reduce redundancy (during training and testing), we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU
threshold for NMS at 0.7, which leaves us about 2k proposal regions per image.

After NMS, we use the top-N ranked proposal regions for detection (most probably only at test time).

We
train Fast R-CNN using 2k RPN proposals, but evaluate different numbers of proposals at test-time.

# Testing procedure


At test time, the roughly 20k anchors from each image go through a series of post-processing steps that turn them into object proposal bounding boxes.

The predicted regression coefficients are applied to the anchors, which yields more precisely localized bounding boxes.

All the boxes are sorted by their cls scores, and non-maximum suppression (NMS) is then applied with an IoU threshold of 0.7. Going from the highest score down, any bounding box whose IoU with an already retained, higher-scoring box exceeds 0.7 is discarded. Thus only the highest-scoring bounding box is retained from each group of overlapping boxes.
This gives about 2k proposals per image.
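The greedy procedure just described can be sketched in plain Python. The `top_n` argument is an assumption added so that the same function can also return only the N best-ranked survivors; boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.7, top_n=None):
    """Greedy NMS: returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if top_n is not None and len(keep) >= top_n:
            break
        # Keep box i only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

This version is quadratic in the number of boxes, which is fine for illustration; production implementations are vectorized.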

The cross-boundary bounding boxes are retained and clipped to the image boundary. At training time, they are ignored.

While using these object proposals to train the Fast R-CNN detection pipeline, all 2k proposals from the RPN are used. At test time for Fast R-CNN detection, only the Top N proposals from the RPN are chosen.

# Four-step training

a) The RPN is trained independently as described above. The backbone CNN for this task is initialized with weights from a network trained for an ImageNet classification task, and is then fine-tuned for the region proposal task.

b) The Fast R-CNN detector network is also trained independently. The backbone CNN for this task is initialized with weights from a network trained for an ImageNet classification task, and is then fine-tuned for the object detection task. The RPN weights are fixed and the proposals from the RPN are used to train the Fast R-CNN detector. At this point the two networks do not yet share conv layers.

c) The RPN is now initialized with weights from this Fast R-CNN detector, and fine-tuned for the region proposal task. This time, weights in the common layers between the RPN and detector remain fixed, and only the layers unique to the RPN are fine-tuned. This is the final RPN.

d) Once again using the new RPN, the Fast R-CNN detector is fine-tuned. Again, only the layers unique to the detector network are fine-tuned and the common layer weights are fixed.

# Good reading material on Faster R-CNN

* https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46
* https://dongjk.github.io/code/object+detection/keras/2018/06/10/Faster_R-CNN_step_by_step,_Part_II.html
