# <font style="color:blue">2. Generating Anchor Boxes</font>

By default, our detector places 9 anchors at every location (cell) of each feature map.

<img src='https://www.learnopencv.com/wp-content/uploads/2020/03/c3-w8-anchors.png' align='middle'>

**What is a feature map here?**

Let's say `a` is an input image of dimension `256x256`, and it has two feature maps: `b` (an `8 x 8` grid) and `c` (a `4 x 4` grid). Each element of a feature map corresponds to a patch of pixels in the original image `a`: a `32 x 32` patch for `b` and a `64 x 64` patch for `c`.
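To make the cell-to-pixel correspondence concrete, here is a minimal sketch. The helper `cell_to_pixels` is hypothetical (it is not part of the `trainer` package), and it only models the nominal tiling of the image by the grid, not the true receptive field of the network.

```python
def cell_to_pixels(row, col, image_side, fm_side):
    """Return (x1, y1, x2, y2) of the image patch nominally covered by one cell."""
    patch = image_side // fm_side  # pixels per cell along each axis
    return (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)


# Feature map `b`: 8 x 8 grid over a 256 x 256 image -> 32 x 32 patches
print(cell_to_pixels(0, 0, 256, 8))  # (0, 0, 32, 32)
print(cell_to_pixels(3, 7, 256, 8))  # (224, 96, 256, 128)

# Feature map `c`: 4 x 4 grid over the same image -> 64 x 64 patches
print(cell_to_pixels(1, 1, 256, 4))  # (64, 64, 128, 128)
```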



**Why 9?**

To answer this question, let's take a look at the `DataEncoder` class.
We have 3 aspect ratios: $1/2$, $1$ and $2$. For each aspect ratio, there are `3` scales: $1$, $2^{1/3}$ and $2^{2/3}$, which gives $3 \times 3 = 9$ anchors per location.
These anchors, with the appropriate sizes, are generated for each of the five feature maps we have.
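As a quick sanity check on the number 9, the sketch below enumerates every (scale, aspect ratio) pair in plain Python. It follows the width and height formulas used by `generate_anchors` later in this section; the variable names and the choice of `8 * 8` as the example area are ours.

```python
import math

aspect_ratios = [1 / 2, 1, 2]
scales = [1, 2 ** (1 / 3), 2 ** (2 / 3)]
anchor_area = 8 * 8  # example: the first anchor area

shapes = []
for scale in scales:
    for ratio in aspect_ratios:
        # width/height of one anchor: area is scale^2 * anchor_area, w / h is ratio
        w = scale * math.sqrt(anchor_area * ratio)
        h = scale * math.sqrt(anchor_area / ratio)
        shapes.append((w, h))

print(len(shapes))  # 9
for w, h in shapes:
    print(f"w={w:.2f}  h={h:.2f}  w/h={w / h:.2f}  area={w * h:.1f}")
```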

In [1]:
from IPython.display import Code
import inspect

from trainer.encoder import (
    DataEncoder,
    decode_boxes,
    encode_boxes,
    generate_anchors,
    generate_anchor_grid
)

In [2]:
Code(data=inspect.getsource(DataEncoder.__init__))

**Why have we chosen the following anchor areas?**
```
anchor_areas = [8 * 8, 16 * 16., 32 * 32., 64 * 64., 128 * 128]  # p3 -> p7
```
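The first anchor area is used to generate anchors for the first output layer of `FPN` (p3), the second for the next layer, and so on. For a `256x256` input, dividing the image side by the anchor side gives the side of the corresponding feature map grid: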

```
256/8 = 32

256/16 = 16
   .
   .
256/128 = 2
```
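The same arithmetic can be written as a small helper. This is a sketch under two assumptions: the pyramid levels p3 to p7 have strides 8 to 128, and sizes are rounded up when the input side is not divisible by the stride. The rounding rule is not shown in this section, although it is consistent with the anchor counts reported further down. The name `feature_map_sizes` is hypothetical.

```python
import math

strides = [8, 16, 32, 64, 128]  # assumed strides for p3 -> p7


def feature_map_sizes(input_side):
    """Side length of each feature map grid (assumes rounding up)."""
    return [math.ceil(input_side / s) for s in strides]


print(feature_map_sizes(256))  # [32, 16, 8, 4, 2]
print(feature_map_sizes(300))  # [38, 19, 10, 5, 3]
```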

**So, how do we generate them?**

We first generate our 9 base anchors from the predefined ratios and scales, knowing which area each of them should cover.

In [3]:
print(Code(data=inspect.getsource(generate_anchors)))

def generate_anchors(anchor_area, aspect_ratios, scales):
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            h = math.sqrt(anchor_area/ratio)
            w = math.sqrt(anchor_area*ratio)
            x1 = (math.sqrt(anchor_area) - scale * w) * 0.5
            y1 = (math.sqrt(anchor_area) - scale * h) * 0.5
            x2 = (math.sqrt(anchor_area) + scale * w) * 0.5
            y2 = (math.sqrt(anchor_area) + scale * h) * 0.5
            anchors.append([x1, y1, x2, y2])
    return torch.Tensor(anchors)



For each feature map, we create a grid that lets us densely place every anchor at every grid location.

In [4]:
print(Code(data=inspect.getsource(generate_anchor_grid)))

def generate_anchor_grid(input_size, fm_size, anchors):
    grid_size = input_size[0] / fm_size
    x, y = torch.meshgrid(torch.arange(0, fm_size) * grid_size, torch.arange(0, fm_size) * grid_size)
    anchors = anchors.view(-1, 1, 1, 4)
    xyxy = torch.stack([x, y, x, y], 2).float()
    boxes = (xyxy + anchors).permute(2, 1, 0, 3).contiguous().view(-1, 4)
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, input_size[0])
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, input_size[1])
    return boxes
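To see what the tensor operations above amount to, here is a pure-Python sketch of the same idea: shift every base anchor to every grid cell, then clamp to the image. `tile_anchors` and `clamp` are hypothetical helpers written for illustration, not the course implementation. The loop order (rows, then columns, then anchors) is our reading of the `permute(2, 1, 0, 3)` call, assuming the default `ij` indexing of `torch.meshgrid`.

```python
def clamp(value, low, high):
    return min(max(value, low), high)


def tile_anchors(input_size, fm_size, anchors):
    """Place each base anchor (x1, y1, x2, y2) at every cell of an fm_size x fm_size grid."""
    grid_size = input_size[0] / fm_size  # pixels between neighbouring grid cells
    boxes = []
    for row in range(fm_size):        # y offset
        for col in range(fm_size):    # x offset
            dx, dy = col * grid_size, row * grid_size
            for x1, y1, x2, y2 in anchors:
                boxes.append([
                    clamp(x1 + dx, 0.0, float(input_size[0])),
                    clamp(y1 + dy, 0.0, float(input_size[1])),
                    clamp(x2 + dx, 0.0, float(input_size[0])),
                    clamp(y2 + dy, 0.0, float(input_size[1])),
                ])
    return boxes


# Two illustrative base anchors on a 32 x 32 grid over a 256 x 256 image
base_anchors = [[0.0, 0.0, 8.0, 8.0], [-4.0, -4.0, 12.0, 12.0]]
boxes = tile_anchors((256, 256), 32, base_anchors)
print(len(boxes))  # 32 * 32 * 2 = 2048
print(boxes[0])    # first anchor at cell (0, 0)
print(boxes[1])    # second anchor at cell (0, 0), clamped at the image border
print(boxes[2])    # first anchor at the next cell along x
```

The number of boxes per feature map is therefore `fm_size * fm_size * len(anchors)`.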



**Let's check the size of the anchor boxes for input image sizes `3x256x256` and `3x300x300`.**

In [5]:
height_width = (256, 256)

data_encoder = DataEncoder(height_width)

print('anchor_boxes size: {}'.format(data_encoder.anchor_boxes.size()))

anchor_boxes size: torch.Size([12276, 4])


In [6]:
height_width = (300, 300)

data_encoder = DataEncoder(height_width)

print('anchor_boxes size: {}'.format(data_encoder.anchor_boxes.size()))

anchor_boxes size: torch.Size([17451, 4])
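These totals can be reproduced by hand: 9 anchors per cell, summed over the five feature maps. The sketch below assumes strides of 8 to 128 for p3 to p7 and rounding up of non-integer grid sizes; `expected_anchor_count` is a hypothetical helper, not part of `DataEncoder`.

```python
import math


def expected_anchor_count(input_side, strides=(8, 16, 32, 64, 128), anchors_per_cell=9):
    """Total anchors for a square input: anchors_per_cell * (sum of grid areas)."""
    return sum(anchors_per_cell * math.ceil(input_side / s) ** 2 for s in strides)


print(expected_anchor_count(256))  # 9 * (1024 + 256 + 64 + 16 + 4) = 12276
print(expected_anchor_count(300))  # 9 * (1444 + 361 + 100 + 25 + 9) = 17451
```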


**Let's compare the anchor boxes size to the detector network output size:**

<div>
    <table>
        <tr><td><h3>Image input size</h3></td> <td><h3>Anchor boxes size</h3></td> <td><h3>Detector Network output size</h3></td> </tr>
        <tr><td><h3>(256, 256)</h3></td> <td><h3>[12276, 4]</h3></td> <td><h3>[batch_size, 12276, 4]</h3></td> </tr>
        <tr><td><h3>(300, 300)</h3></td> <td><h3>[17451, 4]</h3></td> <td><h3>[batch_size, 17451, 4]</h3></td> </tr>
    </table>
</div>

Basically, we want to encode our location target so that its size matches the size of the anchor boxes.
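The sketch below illustrates why the encoded target ends up with one 4-vector per anchor. It uses the common center-offset and log-size parameterization, which is an assumption on our part: the actual `encode_boxes` from `trainer.encoder` is not shown in this section and may differ. `encode_one` and the sample boxes are hypothetical.

```python
import math


def encode_one(anchor, gt):
    """Offsets of a ground-truth box relative to its anchor (both as x1, y1, x2, y2)."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return [(gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)]


# Each anchor is paired with the ground-truth box it was matched to (illustrative values)
anchors = [[0, 0, 8, 8], [8, 0, 16, 8], [0, 8, 8, 16]]
matched_gt = [[1, 1, 9, 9], [8, 0, 16, 8], [0, 8, 16, 24]]

targets = [encode_one(a, g) for a, g in zip(anchors, matched_gt)]
print(len(targets), len(targets[0]))  # one 4-vector per anchor
print(targets[1])                     # a perfect match encodes to all zeros
```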
