## Object Detection and Bounding Box

In [4]:
import matplotlib.pyplot as plt
%matplotlib inline
from PIL import Image

import sys
sys.path.append('/home/kesci/input/')
import d2lzh1981 as d2l

In [5]:
# display an image as an example
d2l.set_figsize()
img = Image.open('/home/kesci/input/img2083/img/catdog.jpg')
d2l.plt.imshow(img) # show the image; end the line with a semicolon to suppress the printed AxesImage object

<matplotlib.image.AxesImage at 0x7fc84edff8d0>

## Bounding Box

In [6]:
# bbox stands for bounding box
# each bbox is (upper left x, upper left y, lower right x, lower right y)
dog_bbox, cat_bbox = [60, 45, 378, 516], [400, 112, 655, 493]

In [7]:
def bbox_to_rect(bbox, color):  
    # convert the bbox representation from
    # (upper left x, upper left y, lower right x, lower right y) to the matplotlib format:
    # ((upper left x, upper left y), width, height)
    return d2l.plt.Rectangle(
        xy = (bbox[0], bbox[1]), 
        width=bbox[2]-bbox[0], 
        height = bbox[3]-bbox[1],
        fill = False, edgecolor = color, linewidth = 2)

In [1]:
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));


## Anchor Box

There is a wide array of object detection algorithms. Most of them sample many regions of the input image and then adjust the region edges to approach the ground-truth bounding boxes of the objects.

In this notebook, I implement anchor boxes: boxes with different sizes and aspect ratios generated around each pixel of the image.

> Note: for object detection with PyTorch, see [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)

In [9]:
import numpy as np
import math
import torch
import os
IMAGE_DIR = '/home/kesci/input/img2083/img/'
print(torch.__version__)

1.3.0


### Create multiple anchor boxes

Suppose the input image has height $h$ and width $w$. We generate anchor boxes of different shapes centered on each pixel of the image. If the size is $s\in (0,1]$ and the aspect ratio is $r > 0$, then the width and height of the anchor box are $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. Once the center position is given, an anchor box with known width and height is fully determined.


Below we choose a set of sizes $s_1, \ldots, s_n$ and a set of aspect ratios $r_1, \ldots, r_m$. If we used every combination of size and aspect ratio with each pixel as the center, the input image would get a total of $whnm$ anchor boxes. Although these anchor boxes may cover all ground-truth bounding boxes, the computational cost is far too high. Therefore, we usually keep only the combinations that include $s_1$ or $r_1$, i.e.


$$
(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).
$$


That is, the number of anchor boxes centered on the same pixel is $n+m-1$. For the whole input image, we generate a total of $wh(n+m-1)$ anchor boxes. This method of generating anchor boxes is implemented in the `MultiBoxPrior` function. Given an input, a set of sizes, and a set of aspect ratios, the function returns all anchor boxes of the input.
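As a quick check of the counting argument above, the sketch below lists the $n+m-1$ kept combinations and the pixel width and height each one produces. The helper name `anchor_shapes` is hypothetical and not part of `d2l`; the image size 728 x 561 and the sizes and ratios are the ones used in this notebook.

```python
import math

def anchor_shapes(sizes, ratios, w, h):
    """Return (size, ratio, width_px, height_px) for the n + m - 1 kept combinations."""
    # Keep (s_1, r_j) for every ratio, then (s_i, r_1) for the remaining sizes.
    pairs = [(sizes[0], r) for r in ratios] + [(s, ratios[0]) for s in sizes[1:]]
    # Width is w * s * sqrt(r) and height is h * s / sqrt(r), as in the text.
    return [(s, r, w * s * math.sqrt(r), h * s / math.sqrt(r)) for s, r in pairs]

shapes = anchor_shapes([0.75, 0.5, 0.25], [1, 2, 0.5], w=728, h=561)
for size, ratio, width_px, height_px in shapes:
    print(f"s={size}, r={ratio}: width={width_px:.1f}px, height={height_px:.1f}px")
print(len(shapes), "anchors per pixel")
```

With three sizes and three ratios this gives 3 + 3 - 1 = 5 shapes per pixel instead of 9.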


In [10]:
d2l.set_figsize(figsize=(8,5))
img = Image.open(os.path.join(IMAGE_DIR, 'catdog.jpg'))
w, h = img.size
print("w = %d, h = %d" % (w, h)) # print image's width&height

#d2l.plt.imshow(img)

w = 728, h = 561


In [11]:
# define the function
def MultiBoxPrior(feature_map, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5]):
    """
    anchor uses (xmin, ymin, xmax, ymax)
    Args:
        feature_map: torch tensor, Shape: [N, C, H, W]
        sizes: List of sizes (0~1) of the generated anchor boxes, relative to the image
        ratios: List of aspect ratios (positive) of the generated anchor boxes  # width/height ratio in normalized coordinates
    Returns:
        anchors of shape (1, num_anchors, 4) 
        every batch is same, so the 1st-dim is 1
    """
    pairs = [] # pairs of (size, sqrt(ratio))
    
    # generate (n + m -1) anchor boxes
    for r in ratios:
        pairs.append([sizes[0], math.sqrt(r)])
    for s in sizes[1:]:
        pairs.append([s, math.sqrt(ratios[0])])
    
    pairs = np.array(pairs)
    
    # generate box at the center of coordinates (x,y,x,y)
    ss1 = pairs[:, 0] * pairs[:, 1] # size * sqrt(ratio)
    ss2 = pairs[:, 0] / pairs[:, 1] # size / sqrt(ratio)
    
    base_anchors = np.stack([-ss1, -ss2, ss1, ss2], axis=1) / 2
    
    # combine coordinates and base anchors to get hw(n+m-1) output boxes
    h, w = feature_map.shape[-2:]
    shifts_x = np.arange(0, w) / w # generate coordinates on x/y axis
    shifts_y = np.arange(0, h) / h # divide width/height to standardize
    shift_x, shift_y = np.meshgrid(shifts_x, shifts_y) # combine x,y coordinates together
    
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)
    
    shifts = np.stack((shift_x, shift_y, shift_x, shift_y), axis=1) # have all coordinates
    # reshape to keep base_anchor last dim, expand for shifts and base_anchor
    # 1st-dim is base_anchor, 2nd-dim is shifts
    anchors = shifts.reshape((-1, 1, 4)) + base_anchors.reshape((1, -1, 4)) 
    
    return torch.tensor(anchors, dtype=torch.float32).view(1, -1, 4) # .view() to reshape again

In [12]:
X = torch.Tensor(1, 3, h, w)  # create input data to test
Y = MultiBoxPrior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape

torch.Size([1, 2042040, 4])


We can see that the returned `Y` has shape (1, total_num_anchors, 4). After reshaping `Y` to (h, w, num_anchors_per_pixel, 4), we can get all anchors centered on any given pixel. The example below gets the first anchor centered on pixel (250, 250).

It has 4 elements: the upper left $x$ and $y$ coordinates and the lower right $x$ and $y$ coordinates. The $x$ and $y$ coordinates are divided by the image width and height, respectively, so they are fractions of the image size. An anchor that extends past the image border has coordinates slightly outside $[0, 1]$, like the negative $x$ value in the output below.

In [13]:
# display the anchors of one pixel
boxes = Y.reshape((h, w, 5, 4)) # reshape to (h, w, anchors per pixel, 4)
# 5 because with 3 sizes and 3 ratios there are 3 + 3 - 1 anchors for every pixel

boxes[250, 250, 0, :] # * torch.tensor([w, h, w, h], dtype=torch.float32)
# the first size and ratio are 0.75 and 1
# so the width and height are both 0.75 = 0.7184 + 0.0316 = 0.8206 - 0.0706

tensor([-0.0316,  0.0706,  0.7184,  0.8206])

> To check the output: the size is 0.75 and the ratio is 1, so the normalized width and height are both 0.75, which matches the output (0.75 = 0.7184 + 0.0316 = 0.8206 - 0.0706).
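The same check can be done by hand without building all two million anchors. The sketch below recomputes one anchor from its center pixel, size, and ratio. The helper `anchor_at_pixel` is a hypothetical name introduced here for illustration; it assumes the same conventions as `MultiBoxPrior` above (the center is `col / w, row / h`, and the box extends half of `size * sqrt(ratio)` and `size / sqrt(ratio)` on each side).

```python
def anchor_at_pixel(row, col, img_w, img_h, size, ratio):
    """Normalized (xmin, ymin, xmax, ymax) of one anchor centered on pixel (row, col)."""
    cx, cy = col / img_w, row / img_h    # normalized center
    half_w = size * ratio ** 0.5 / 2     # half of the normalized width
    half_h = size / ratio ** 0.5 / 2     # half of the normalized height
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

check_box = anchor_at_pixel(250, 250, img_w=728, img_h=561, size=0.75, ratio=1)
print([round(v, 4) for v in check_box])
```

Up to rounding, this should reproduce the four values printed for `boxes[250, 250, 0, :]`.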

To display all the anchors centered on one pixel, we define a function `show_bboxes` that draws multiple bounding boxes.

In [14]:
def show_bboxes(axes, bboxes, labels=None, colors=None):
    
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().cpu().numpy(), color)
        axes.add_patch(rect)
        
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=6, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

We can see that the coordinates of the $x$ and $y$ axes in the variable `boxes` have been divided by the width and height of the image, respectively. When plotting, we need to restore the original coordinate value of the anchor box, and therefore define the variable `bbox_scale`. 

Now we can draw all the anchor boxes centered on (250, 250) in the image. The anchor box with a size of 0.75 and an aspect ratio of 1 covers the dog in the image quite well.

In [15]:
# display the anchors centered on pixel (250, 250)
d2l.set_figsize()
fig = d2l.plt.imshow(img)
bbox_scale = torch.tensor([[w, h, w, h]], dtype=torch.float32)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.75, r=2', 's=0.75, r=0.5', 's=0.5, r=1', 's=0.25, r=1'])

## IoU (Intersection over Union)

How do we evaluate our anchor boxes if we know the ground-truth bounding boxes of the objects?

One natural way is to measure the similarity between an anchor and the ground-truth bbox. The Jaccard index measures the similarity of two sets: given sets $\mathcal{A}$ and $\mathcal{B}$, it is the size of their intersection divided by the size of their union:


$$
J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.
$$




In fact, we can think of the pixel area within a bounding box as a set of pixels. In this way, we can use the Jaccard index of the pixel sets of two bounding boxes to measure their similarity. When measuring the similarity of two bounding boxes, we usually refer to the Jaccard index as the Intersection over Union (IoU), that is, the ratio of the intersection area to the union area, as shown below. The IoU ranges between 0 and 1: 0 means that the two bounding boxes have no overlapping pixels, and 1 means that the two bounding boxes are identical.

![Image Name](https://cdn.kesci.com/upload/image/q5vs9jkw9f.png?imageView2/0/w/640/h/640)
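To make the definition concrete, here is a minimal pure-Python sketch of the IoU of two axis-aligned boxes in (xmin, ymin, xmax, ymax) form. The function name `iou` and the example boxes are illustrative assumptions, not part of the notebook's code.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # The overlap along each axis is clamped at 0 when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 4 + 4 - 1 = 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Two 2 x 2 boxes shifted by one unit in each direction overlap in a 1 x 1 square, so their IoU is 1/7.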


In [18]:
def compute_intersection(set_1, set_2):
    """
    compute intersection areas between two sets of boxes
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors as (xmin, ymin, xmax, ymax)
    
    Returns:
        intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # PyTorch auto-broadcasts singleton dimensions
    lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # (n1, n2, 2)
    upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # (n1, n2, 2)
    intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0)  # (n1, n2, 2)
    return intersection_dims[:, :, 0] * intersection_dims[:, :, 1]  # (n1, n2)


def compute_jaccard(set_1, set_2):
    """
    compute the Jaccard index (IoU) between anchors
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors as (xmin, ymin, xmax, ymax)
    Returns:
        Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # Find intersections
    intersection = compute_intersection(set_1, set_2)  # (n1, n2)

    # Find areas of each box in both sets
    areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)

    # Find the union
    # PyTorch auto-broadcasts singleton dimensions
    union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection  # (n1, n2)

    return intersection / union  # (n1, n2)

## Labeling Anchors for the Training Set


In the training set, we treat each anchor box as a training sample. To train the object detection model, we need two labels for each anchor box:

- the category of the target contained in the anchor box, referred to as the category or class;
- the offset of the ground-truth bounding box relative to the anchor box, referred to as the offset.

In object detection, we first generate multiple anchors, then predict the category and offset for each anchor, then adjust the anchor position according to the predicted offset, and finally filter the predicted bounding boxes to produce the output.



In the training set, each image has been labeled with the positions of the ground-truth bounding boxes and the categories of the objects they contain. For the anchors, however, we still need a way to assign each anchor a ground-truth bounding box that is similar to it.


Suppose the anchors in the image are $A_1, A_2, \ldots, A_{n_a}$, the ground truth bboxes are $B_1, B_2, \ldots, B_ {n_b}$, and $n_a \geq n_b$. Define the matrix $\boldsymbol{X} \in \mathbb{R}^{n_a \times n_b}$, where the element $x_ {ij}$ in the $i$th row and $j$th column is the IoU of anchor $A_i$ and bbox $B_j$.


First, we find the largest element in the matrix $\boldsymbol{X}$, and set the row index and column index of the element as $i_1, j_1$, respectively. We assign the bbox $B_ {j_1}$ to the anchor $A_ {i_1}$. Obviously, the anchor $A_ {i_1}$ and the bbox $B_ {j_1}$ have the highest similarity among all pairs. Next, all elements on the $i_1$ row and $j_1$ column in the matrix $\boldsymbol{X}$ are discarded. 

Find the largest remaining element in the matrix $\boldsymbol{X}$, and denote its row index and column index as $i_2, j_2$. Again assign the bbox $B_ {j_2}$ to the anchor $A_ {i_2}$, and then discard all elements in row $i_2$ and column $j_2$ of the matrix $\boldsymbol{X}$. At this point, two rows and two columns of the matrix $\boldsymbol{X}$ have been discarded.

Keep looping until all $n_b$ columns of the matrix $\boldsymbol{X}$ are discarded. At this point, we have assigned a bbox to each of $n_b$ anchors. Next, we traverse only the remaining $n_a-n_b$ anchors: given the anchor $A_i$, find the bbox $B_j$ that has the largest IoU with $A_i$ according to row $i$ of the matrix $\boldsymbol{X}$, and assign $B_j$ to anchor $A_i$ only if that IoU is greater than our pre-defined threshold.


As shown in the figure below, assuming that the maximum value in the matrix $\boldsymbol{X}$ is $x_{23}$, we assign bbox $B_3$ to anchor $A_2$. Then we discard all elements in row 2 and column 3 of the matrix, find the largest remaining element $x_{71}$, and assign bbox $B_1$ to anchor $A_7$. Next, we discard all elements in row 7 and column 1 of the matrix, find the largest remaining element $x_{54}$, and assign bbox $B_4$ to anchor $A_5$. Finally, we discard all elements in row 5 and column 4, find the largest remaining element $x_{92}$, and assign bbox $B_2$ to anchor $A_9$. After that, we only need to traverse the remaining anchors (all except $A_2, A_5, A_7, A_9$) and decide whether to assign them bboxes according to the threshold.



![Image Name](https://cdn.kesci.com/upload/image/q5vsc1hcg8.png?imageView2/0/w/640/h/640)
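The assignment procedure described above can be sketched directly on an IoU matrix. This is a minimal illustration, not the notebook's own implementation: the function name `greedy_assign`, the default threshold of 0.5, and the example IoU values are assumptions made for the sketch. It picks the global maximum at each step, exactly as in the text.

```python
import numpy as np

def greedy_assign(iou_matrix, threshold=0.5):
    """Return, for every anchor (row), the index of its assigned bbox (column) or -1."""
    iou_matrix = np.asarray(iou_matrix, dtype=float)
    n_a, n_b = iou_matrix.shape
    assigned = np.full(n_a, -1, dtype=int)

    # Step 1: repeatedly take the largest remaining element, then drop its row and column.
    work = iou_matrix.copy()
    for _ in range(min(n_a, n_b)):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        assigned[i] = j
        work[i, :] = -np.inf
        work[:, j] = -np.inf

    # Step 2: remaining anchors get their best bbox only if the IoU reaches the threshold.
    for i in range(n_a):
        if assigned[i] == -1:
            j = int(np.argmax(iou_matrix[i]))
            if iou_matrix[i, j] >= threshold:
                assigned[i] = j
    return assigned

example_iou = [[0.05, 0.00],
               [0.14, 0.00],
               [0.00, 0.57],
               [0.00, 0.21],
               [0.00, 0.75]]
print(greedy_assign(example_iou))
```

In this example, anchor 4 takes bbox 1 (the global maximum 0.75), anchor 1 takes bbox 0 (the largest remaining value 0.14), and anchor 2 is then matched to bbox 1 by the threshold rule because 0.57 is at least 0.5.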



Now we can label the category and offset of the anchors. If an anchor $A$ is assigned a bbox $B$, we set the category of $A$ to that of $B$, and we label the offset of $A$ based on the relative position of the center coordinates of $B$ and $A$ and on the relative sizes of the two boxes.

Because the positions and sizes of the boxes are different, we usually require some special transformations to make the distribution of the offset more uniform and easier to fit. 

Let the center coordinates of the anchor box $A$ and its assigned bbox $B$ be $(x_a, y_a)$ and $(x_b, y_b)$, and the widths of $A$ and $B$ are $w_a$ and $w_b$, with heights of $h_a$ and $h_b$, respectively. A common way is to mark the offset of $A$ as

$$
\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x},
\frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y},
\frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w},
\frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right),
$$
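The offset formula above can be evaluated by hand for a single pair of boxes. The sketch below assumes the commonly used constants $\mu = 0$, $\sigma_x = \sigma_y = 0.1$, and $\sigma_w = \sigma_h = 0.2$; the function name `encode_offset` and the example boxes are illustrative choices, not part of the notebook's code.

```python
import math

def encode_offset(anchor, bbox, sigma_xy=0.1, sigma_wh=0.2):
    """Offset of bbox relative to anchor; both are (xmin, ymin, xmax, ymax), with mu = 0."""
    ax1, ay1, ax2, ay2 = anchor
    bx1, by1, bx2, by2 = bbox
    # Convert both boxes to center-size form.
    xa, ya, wa, ha = (ax1 + ax2) / 2, (ay1 + ay2) / 2, ax2 - ax1, ay2 - ay1
    xb, yb, wb, hb = (bx1 + bx2) / 2, (by1 + by2) / 2, bx2 - bx1, by2 - by1
    return ((xb - xa) / wa / sigma_xy,
            (yb - ya) / ha / sigma_xy,
            math.log(wb / wa) / sigma_wh,
            math.log(hb / ha) / sigma_wh)

example_offset = encode_offset(anchor=(0.15, 0.2, 0.4, 0.4), bbox=(0.1, 0.08, 0.52, 0.92))
print([round(v, 4) for v in example_offset])
```

Dividing by $\sigma = 0.1$ and $\sigma = 0.2$ is the same as multiplying by 10 and 5, which spreads the typically small raw offsets over a wider range.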


The default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0$, $\sigma_x=\sigma_y=0.1$, and $\sigma_w=\sigma_h=0.2$.

If an anchor is not assigned a bbox, we only need to set the anchor category to background. Anchors whose category is background are usually called negative anchors, and the rest are called positive anchors.

Next is a specific example. We define ground-truth bboxes for the cat and dog in the image. The first element is the category (0 is dog and 1 is cat). The remaining 4 elements are the $x$ and $y$ coordinates of the upper left corner and the $x$ and $y$ coordinates of the lower right corner (with a range between 0 and 1). We also construct five anchors to be labeled, given by their upper left and lower right corners and denoted $A_0, \ldots, A_4$ (indices in the program start from 0). We first draw these anchors and the ground-truth bboxes on the image.

In [19]:
bbox_scale = torch.tensor((w, h, w, h), dtype=torch.float32)

ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                            [1, 0.55, 0.2, 0.9, 0.88]])
                            
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);

In [20]:
compute_jaccard(anchors, ground_truth[:, 1:]) # test compute_jaccard function

tensor([[0.0536, 0.0000],
        [0.1417, 0.0000],
        [0.0000, 0.5657],
        [0.0000, 0.2059],
        [0.0000, 0.7459]])

Next is the `MultiBoxTarget` function, which labels the class and offset of each anchor box. It uses 0 for the background and shifts the object class indices up by one (1 for dog, 2 for cat).

In [27]:
def assign_anchor(bb, anchor, jaccard_threshold=0.5):
    """
    Assign ground truth bboxes to anchors; boxes are normalized (xmin, ymin, xmax, ymax).
    Args:
        bb: ground truth bounding boxes, shape: (nb, 4)
        anchor: anchors to be assigned, shape: (na, 4)
        jaccard_threshold: pre-defined threshold
    Returns:
        assigned_idx: shape: (na, ), the index of the ground truth bb assigned to every anchor,
        -1 if no bb is assigned
    """
    
    na = anchor.shape[0] 
    nb = bb.shape[0]
    jaccard = compute_jaccard(anchor, bb).detach().cpu().numpy() # shape: (na, nb)
    assigned_idx = np.ones(na) * -1  # initial index all set to -1
    
    # first assign one anchor to each bb (jaccard_threshold not required)
    # note: columns are processed in order, not by picking the global maximum each time
    jaccard_cp = jaccard.copy()
    for j in range(nb):
        i = np.argmax(jaccard_cp[:, j])
        assigned_idx[i] = j
        jaccard_cp[i, :] = float("-inf") # negative inf
     
    # deal with unassigned anchor, jaccard_threshold required
    for i in range(na):
        if assigned_idx[i] == -1:
            j = np.argmax(jaccard[i, :])
            if jaccard[i, j] >= jaccard_threshold:
                assigned_idx[i] = j
                
    return torch.tensor(assigned_idx, dtype=torch.long)


def xy_to_cxcy(xy):
    """
    change (x_min, y_min, x_max, y_max) anchor to (center_x, center_y, w, h) format
    https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py
    
    Args:
        xy: bounding boxes in boundary coordinates, a tensor of size (n_boxes, 4)
    Returns: 
        bounding boxes in center-size coordinates, a tensor of size (n_boxes, 4)
    """
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,  # c_x, c_y
                      xy[:, 2:] - xy[:, :2]], 1)  # w, h


def MultiBoxTarget(anchor, label):
    """
    Args:
        anchor: torch tensor, input anchors generated by MultiBoxPrior, shape: (1, num_anchors, 4)
        label: ground truth labels, shape (bn, max number of true bboxes per image, 5)
               in the 2nd dim, images with fewer bboxes are padded with -1
               5 in the last dim is [class label, 4 coordinates]
    Returns:
        list, [bbox_offset, bbox_mask, cls_labels]
        bbox_offset: offsets for every anchor box, shape (batch_num, num_anchors*4)
        bbox_mask: same shape as bbox_offset, marks every anchor as foreground or background;
        aligned with the offsets, background is 0, foreground is 1
        cls_labels: class for every anchor, 0 for background, shape (batch_num, num_anchors)
    """
    assert len(anchor.shape) == 3 and len(label.shape) == 3
    bn = label.shape[0]
    
    def MultiBoxTarget_one(anc, lab, eps=1e-6):
        """
        MultiBoxTarget helper function, handles one sample in the batch
        Args:
            anc: shape of (num_anchors, 4)
            lab: shape of (num_groundtruth_bbox, 5), 5 is [class_label, 4 coordinates]
            eps: smoothing param, avoids log(0)
        Returns:
            offset: (num_anchors*4, )
            bbox_mask: (num_anchors*4, ), 0 for background, 1 for non-background
            cls_labels: (num_anchors, ), 0 for background
        """
        an = anc.shape[0]

        assigned_idx = assign_anchor(lab[:, 1:], anc) # (num_anchors, )
        print("a: ",  assigned_idx.shape)
        print(assigned_idx)
        bbox_mask = ((assigned_idx >= 0).float().unsqueeze(-1)).repeat(1, 4) # (num_anchors, 4)
        print("b: " , bbox_mask.shape)
        print(bbox_mask)

        cls_labels = torch.zeros(an, dtype=torch.long) # 0 for background
        assigned_bb = torch.zeros((an, 4), dtype=torch.float32) # bb coordinates for matched anchor
        for i in range(an):
            bb_idx = assigned_idx[i]
            if bb_idx >= 0: # non-background
                cls_labels[i] = lab[bb_idx, 0].long().item() + 1 # add 1
                assigned_bb[i, :] = lab[bb_idx, 1:]
        
        # compute offsets (10.0 = 1 / sigma_xy and 5.0 = 1 / sigma_wh, with mu = 0)
        center_anc = xy_to_cxcy(anc) # (center_x, center_y, w, h)
        center_assigned_bb = xy_to_cxcy(assigned_bb)

        offset_xy = 10.0 * (center_assigned_bb[:, :2] - center_anc[:, :2]) / center_anc[:, 2:]
        offset_wh = 5.0 * torch.log(eps + center_assigned_bb[:, 2:] / center_anc[:, 2:])
        offset = torch.cat([offset_xy, offset_wh], dim = 1) * bbox_mask # (num_anchors, 4)

        return offset.view(-1), bbox_mask.view(-1), cls_labels
    
    # output
    batch_offset = []
    batch_mask = []
    batch_cls_labels = []
    for b in range(bn):
        offset, bbox_mask, cls_labels = MultiBoxTarget_one(anchor[0, :, :], label[b, :, :])
        
        batch_offset.append(offset)
        batch_mask.append(bbox_mask)
        batch_cls_labels.append(cls_labels)
    
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    cls_labels = torch.stack(batch_cls_labels)
    
    return [bbox_offset, bbox_mask, cls_labels]

Use `unsqueeze` to add a batch dimension to the anchors and the ground-truth bboxes.

In [28]:
labels = MultiBoxTarget(anchors.unsqueeze(dim=0),
                        ground_truth.unsqueeze(dim=0))

a:  torch.Size([5])
tensor([-1,  0,  1, -1,  1])
b:  torch.Size([5, 4])
tensor([[0., 0., 0., 0.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [0., 0., 0., 0.],
        [1., 1., 1., 1.]])


The function returns three values, all of them tensors. The last one is the class label assigned to each anchor box.

In [30]:
labels[2] # class label, background, dog, cat, background, cat

tensor([[0, 1, 2, 0, 2]])

In [31]:
labels[1] # mask value, shape(batch_num, num_anchors*4)

tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1.,
         1., 1.]])

In [32]:
labels[0] # 4 offsets for every anchor box, 0 for background anchor boxes

tensor([[-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00,  1.4000e+00,
          1.0000e+01,  2.5940e+00,  7.1754e+00, -1.2000e+00,  2.6882e-01,
          1.6824e+00, -1.5655e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00,
         -0.0000e+00, -5.7143e-01, -1.0000e+00,  4.1723e-06,  6.2582e-01]])

## Output Prediction Bounding Box

At prediction time, we first generate multiple anchors for the image and predict the category and offset of each anchor. Then we obtain the predicted bounding box from each anchor and its predicted offset. When the number of anchors is large, many similar predicted bounding boxes may show up on the same object. To make the results more concise, we can remove similar predicted bounding boxes. A commonly used method is non-maximum suppression (NMS).

For a prediction bounding box $B$, the model calculates the prediction probability for each category. Let the largest prediction probability be $p$, and the category corresponding to the probability is the prediction category of $B$. 

We also call $p$ the confidence of the predicted bounding box $B$. On the same image, we sort the prediction bounding boxes of the non-background prediction categories from high to low to get the list $L$. From $L$, select the most confident prediction bounding box $B_1$ as the benchmark, and remove all non-benchmarked bounding boxes whose IoU with $B_1$ is greater than a certain pre-set threshold from $L$. At this point, $L$ retains the most confident prediction bounding box and removes other prediction bounding boxes similar to it.

Next, from $L$, select the prediction bounding box $B_2$ with the second highest confidence as the benchmark, and remove from $L$ all non-benchmark bounding boxes whose IoU with $B_2$ is greater than the threshold. This process is repeated until every predicted bounding box in $L$ has been used as the benchmark. At that point, the IoU of any pair of predicted bounding boxes in $L$ is no greater than the threshold. Finally, all predicted bounding boxes in the list $L$ are output.

Let's look at a specific example below. First we create 4 anchors. For simplicity, we assume that all predicted offsets are 0, so each predicted bounding box is the anchor itself. Finally, we construct the predicted probabilities for each category.
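When the predicted offsets are not zero, a predicted bounding box can be recovered by inverting the offset encoding defined in the labeling section. The following is a minimal sketch under the same assumed constants ($\mu = 0$, $\sigma_x = \sigma_y = 0.1$, $\sigma_w = \sigma_h = 0.2$); `decode_offset` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import math

def decode_offset(anchor, offset, sigma_xy=0.1, sigma_wh=0.2):
    """Invert the offset encoding: anchor (xmin, ymin, xmax, ymax) + offset -> predicted box."""
    ax1, ay1, ax2, ay2 = anchor
    xa, ya, wa, ha = (ax1 + ax2) / 2, (ay1 + ay2) / 2, ax2 - ax1, ay2 - ay1
    ox, oy, ow, oh = offset
    # Undo the normalization of the center shift and of the log size ratio.
    xb = ox * sigma_xy * wa + xa
    yb = oy * sigma_xy * ha + ya
    wb = wa * math.exp(ow * sigma_wh)
    hb = ha * math.exp(oh * sigma_wh)
    return (xb - wb / 2, yb - hb / 2, xb + wb / 2, yb + hb / 2)

# A zero offset returns the anchor itself, which is the case assumed in this example.
print(decode_offset((0.1, 0.08, 0.52, 0.92), (0.0, 0.0, 0.0, 0.0)))
# A positive x offset shifts the box to the right without changing its size.
print(decode_offset((0.1, 0.08, 0.52, 0.92), (1.0, 0.0, 0.0, 0.0)))
```

With all offsets equal to 0, decoding leaves every anchor unchanged, so the simplification used in this example does not affect the result.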

In [36]:
anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0.0] * (4 * len(anchors)))
cls_probs = torch.tensor([[0., 0., 0., 0.,],  # prob for background
                          [0.9, 0.8, 0.7, 0.1],  # prob for dog
                          [0.1, 0.2, 0.3, 0.9]])  # prob for cat

In [37]:
# display the bounding boxes and their probabilities
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

Use `MultiBoxDetection` to perform non-maximum suppression on the predicted bounding boxes.

In [38]:
from collections import namedtuple
Pred_BB_Info = namedtuple("Pred_BB_Info", ["index", "class_id", "confidence", "xyxy"])

def non_max_suppression(bb_info_list, nms_threshold = 0.5):
    """
    Args:
        bb_info_list: Pred_BB_Info list, has pred class id, prob
        nms_threshold: threshold
    Returns:
        output: Pred_BB_Info list, only keep filtered bbox using threshold 
    """
    output = []
    # sorting according to prob from high to low
    sorted_bb_info_list = sorted(bb_info_list, key = lambda x: x.confidence, reverse=True)
    
    # iterate through list and remove unnecessary output
    while len(sorted_bb_info_list) != 0:
        best = sorted_bb_info_list.pop(0)
        output.append(best)
        
        if len(sorted_bb_info_list) == 0:
            break

        bb_xyxy = []
        for bb in sorted_bb_info_list:
            bb_xyxy.append(bb.xyxy)
        
        iou = compute_jaccard(torch.tensor([best.xyxy]), 
                              torch.tensor(bb_xyxy))[0] # shape: (len(sorted_bb_info_list), )
        
        n = len(sorted_bb_info_list)
        sorted_bb_info_list = [sorted_bb_info_list[i] for i in range(n) if iou[i] <= nms_threshold]
    return output


def MultiBoxDetection(cls_prob, loc_pred, anchor, nms_threshold = 0.5):
    """
    Args:
        cls_prob: predicted probability of every class for each anchor box after softmax, shape: (bn, pred_class+1, num_anchors)
        loc_pred: predicted offsets for every anchor, shape: (bn, num_anchors*4)
        anchor: default anchors output by MultiBoxPrior, shape: (1, num_anchors, 4)
        nms_threshold: IoU threshold used by non-maximum suppression
    Returns:
        all anchors, shape: (bn, num_anchors, 6)
        each anchor represented as [class_id, confidence, xmin, ymin, xmax, ymax]
        class_id=-1 means background or removed by NMS
    """
    assert len(cls_prob.shape) == 3 and len(loc_pred.shape) == 2 and len(anchor.shape) == 3
    bn = cls_prob.shape[0]
   
    
    def MultiBoxDetection_one(c_p, l_p, anc, nms_threshold = 0.5):
        """
        MultiBoxDetection helper function, handles one sample in a batch
        Args:
            c_p: (pred_class+1, num_anchors)
            l_p: (num_anchors*4, )
            anc: (num_anchors, 4)
            nms_threshold: threshold
        Return:
            output: (num_anchors, 6)
        """
        pred_bb_num = c_p.shape[1]
        anc = (anc + l_p.view(pred_bb_num, 4)).detach().cpu().numpy() # apply predicted offsets (simplified: added directly to the corner coordinates)
        
        confidence, class_id = torch.max(c_p, 0)
        confidence = confidence.detach().cpu().numpy()
        class_id = class_id.detach().cpu().numpy()
        
        pred_bb_info = [Pred_BB_Info(
                            index = i,
                            class_id = class_id[i] - 1, # positive class label starts from 0
                            confidence = confidence[i],
                            xyxy=[*anc[i]]) # xyxy is a list
                        for i in range(pred_bb_num)]
        
        # indices of the boxes kept by NMS
        obj_bb_idx = [bb.index for bb in non_max_suppression(pred_bb_info, nms_threshold)]
        
        output = []
        for bb in pred_bb_info:
            output.append([
                (bb.class_id if bb.index in obj_bb_idx else -1.0),
                bb.confidence,
                *bb.xyxy
            ])
            
        return torch.tensor(output) # shape: (num_anchors, 6)
    
    batch_output = []
    
    for b in range(bn):
        batch_output.append(MultiBoxDetection_one(cls_prob[b], loc_pred[b], anchor[0], nms_threshold))
    
    return torch.stack(batch_output)

In [39]:
output = MultiBoxDetection(
    cls_probs.unsqueeze(dim=0), offset_preds.unsqueeze(dim=0),
    anchors.unsqueeze(dim=0), nms_threshold=0.5)

output # info for every predicted box, shape (bn, num_anchors, 6)
# 1st element: predicted class (0: dog, 1: cat, -1: background or removed by NMS)
# 2nd element: confidence
# elements 3-6: upper left x, y and lower right x, y coordinates (between 0 and 1)

tensor([[[ 0.0000,  0.9000,  0.1000,  0.0800,  0.5200,  0.9200],
         [-1.0000,  0.8000,  0.0800,  0.2000,  0.5600,  0.9500],
         [-1.0000,  0.7000,  0.1500,  0.3000,  0.6200,  0.9100],
         [ 1.0000,  0.9000,  0.5500,  0.2000,  0.9000,  0.8800]]])

In [40]:
fig = d2l.plt.imshow(img)
for i in output[0].detach().cpu().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)

## Multi-size Object Detection

Another way to improve accuracy while reducing computation is to generate anchor boxes of different sizes for objects of different sizes. For example, we can generate many small anchors for small objects, and fewer, larger anchors for large objects.

In [41]:
w, h = img.size
w, h

(728, 561)

In [42]:
d2l.set_figsize()

def display_anchors(fmap_w, fmap_h, s): # the feature map shape controls how many anchors are generated
    fmap = torch.zeros((1, 10, fmap_h, fmap_w), dtype=torch.float32)
    
    # move all anchors to equally distribute on the image
    offset_x, offset_y = 1.0/fmap_w, 1.0/fmap_h
    anchors = d2l.MultiBoxPrior(fmap, sizes=s, ratios=[1, 2, 0.5]) + \
        torch.tensor([offset_x/2, offset_y/2, offset_x/2, offset_y/2])
    
    bbox_scale = torch.tensor([[w, h, w, h]], dtype=torch.float32)
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)

In [43]:
display_anchors(fmap_w=4, fmap_h=2, s=[0.15])

In [44]:
display_anchors(fmap_w=2, fmap_h=1, s=[0.4])

In [45]:
display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
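To see how much computation this saves, the sketch below counts the anchors generated for the three feature map shapes used above (one size and three ratios each) and compares them with the per-pixel count from earlier in the notebook (three sizes and three ratios on the 728 x 561 image). The helper `num_anchors` is a hypothetical name introduced for this illustration.

```python
def num_anchors(fmap_w, fmap_h, n_sizes, n_ratios):
    """Total anchors when every feature map cell gets n_sizes + n_ratios - 1 anchors."""
    return fmap_w * fmap_h * (n_sizes + n_ratios - 1)

for fmap_w, fmap_h in [(4, 2), (2, 1), (1, 1)]:
    count = num_anchors(fmap_w, fmap_h, n_sizes=1, n_ratios=3)
    print(f"{fmap_w}x{fmap_h} feature map: {count} anchors")

print("per-pixel anchors:", num_anchors(728, 561, n_sizes=3, n_ratios=3))
```

The three coarse grids together produce 33 anchors, compared with more than two million when every pixel is used as a center.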

In practice, we can remove the lower-confidence predicted bounding boxes before performing non-maximum suppression, which reduces the amount of computation it requires. We can also filter the output of non-maximum suppression, for example, by keeping only the results with higher confidence as the final output.
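A confidence pre-filter of this kind can be sketched in a few lines. The function name `filter_by_confidence`, the cutoff value 0.3, and the example boxes are assumptions chosen for illustration; a suitable cutoff depends on the model and the data.

```python
def filter_by_confidence(boxes, scores, min_confidence=0.3):
    """Keep only the boxes whose confidence is at least min_confidence."""
    keep = [i for i, score in enumerate(scores) if score >= min_confidence]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

demo_boxes = [[0.10, 0.08, 0.52, 0.92], [0.08, 0.20, 0.56, 0.95],
              [0.15, 0.30, 0.62, 0.91], [0.55, 0.20, 0.90, 0.88]]
demo_scores = [0.9, 0.8, 0.7, 0.1]

kept_boxes, kept_scores = filter_by_confidence(demo_boxes, demo_scores)
print(len(kept_boxes), kept_scores)
```

Only the surviving boxes then need to be compared pairwise during non-maximum suppression, so fewer IoU values have to be computed.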


### Summary

* Generate multiple anchors with different sizes and aspect ratios around each pixel.
* The IoU is the ratio of the intersection area to the union area of two bounding boxes.
* In the training set, there are two types of labels for each anchor box: one is the category of the target contained in the anchor box; the other is the offset of the true bounding box from the anchor box.
* When predicting, you can use non-maximum suppression to remove similar prediction bounding boxes to make the results concise.