# <font style="color:blue">3. Matching Predictions with Ground Truth</font>

Now that we have a wide range of anchors, we need to decide which of them are suitable for training.

We measure this by computing the intersection over union (IoU) between each anchor and the target boxes.
The anchors with the highest IoU against a target box are the ones used further in training.
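As a minimal sketch of the metric itself, the function below computes the IoU of two boxes in $[x_{min}, y_{min}, x_{max}, y_{max}]$ format using plain Python. The name `iou` is hypothetical and the code treats coordinates as continuous values (no `+ 1` pixel convention), so it may differ in detail from the helper used inside `trainer.encoder`.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the overlap rectangle, clamped at zero
    # so that disjoint boxes produce an empty intersection.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that overlap on half their width:
# intersection = 50, union = 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

An IoU of 1 means the boxes coincide, and 0 means they do not overlap at all.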

If an anchor box's best IoU is between 0.4 and 0.5, we treat it as a poor match with the target and ignore it during training.

An anchor box is considered background if its IoU with every ground-truth box is below 0.4.
After that, all of the matched boxes are encoded.
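The thresholding rules above can be sketched as follows. This is an illustration, not the code of `DataEncoder.encode`: the names `match_anchors` and `iou` are hypothetical, and two details are assumptions. An anchor with best IoU of 0.5 or more is treated as a positive match, and a best IoU of exactly 0.4 falls into the ignored band.

```python
def iou(a, b):
    """IoU of two boxes in (xmin, ymin, xmax, ymax) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, ignore_low=0.4, positive=0.5):
    """Return one (status, gt_index) pair per anchor.

    status is "positive", "ignore" or "background"; gt_index is the
    index of the matched ground-truth box, or None if there is none.
    """
    if not gt_boxes:
        # No targets in the image: every anchor is background.
        return [("background", None) for _ in anchors]

    results = []
    for anchor in anchors:
        ious = [iou(anchor, gt) for gt in gt_boxes]
        best_gt = max(range(len(gt_boxes)), key=lambda j: ious[j])
        best_iou = ious[best_gt]

        if best_iou >= positive:        # assumption: >= 0.5 is a match
            results.append(("positive", best_gt))
        elif best_iou >= ignore_low:    # 0.4 to 0.5: excluded from the loss
            results.append(("ignore", None))
        else:                           # below 0.4 for every target
            results.append(("background", None))
    return results


anchors = [(0, 0, 10, 10), (0, 0, 10, 22), (50, 50, 60, 60)]
targets = [(0, 0, 10, 10)]
print(match_anchors(anchors, targets))
```

Here the first anchor coincides with the target, the second has an IoU of about 0.45 and is ignored, and the third does not overlap the target at all.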

```python
from IPython.display import Code
import inspect

from trainer.encoder import (
    DataEncoder,
    decode_boxes,
    encode_boxes,
    generate_anchors,
    generate_anchor_grid
)
```

```python
Code(data=inspect.getsource(DataEncoder.encode))
```

## <font style="color:green">3.1. Encoding Boxes</font>

**What is encoding and why do we need it?**

Instead of predicting the bounding box location on the image directly, our bounding box regressor predicts the offset of the bounding box relative to an anchor box. Representing a bounding box with respect to an anchor box requires encoding.

Generally, we represent a bounding box in $[x_{min}, y_{min}, x_{max}, y_{max}]$ format. However, the model does not learn the boxes in this format. Instead, it learns each bounding box with respect to a nearby anchor.

- Anchors and bounding boxes are both in $[x_{min}, y_{min}, x_{max}, y_{max}]$ format, so `anchors_wh = anchors[:, 2:] - anchors[:, :2] + 1` gives $[x_{max}-x_{min}+1, y_{max}-y_{min}+1]$, that is, the width $w$ and height $h$. The `+ 1` counts both end pixels as part of the box.

- `anchors_ctr = anchors[:, :2] + 0.5 * anchors_wh` gives $[x_{min} + w/2, y_{min} + h/2]$, i.e. the center of the box.


```python
Code(data=inspect.getsource(encode_boxes))
```

**Precisely, the encoding is computed as follows:**

- The difference between the center coordinates of the ground-truth bounding box and those of the nearby anchor, divided by the anchor's width and height: the $x$ difference is divided by the width and the $y$ difference by the height. `(boxes_ctr - anchors_ctr) / anchors_wh` performs this operation.

- The point above gives two values, but representing a bounding box takes four. The remaining two come from `torch.log(boxes_wh / anchors_wh)`, that is, the logarithms of the width and height ratios.

The network is trying to learn how to move an anchor box so that it looks the same as the target. For that purpose, we encode the target bounding boxes as offsets relative to their matched anchors.

The offset is calculated with respect to the center of the anchor and also describes how the width and the height should be regressed.
The two parts are concatenated along axis 1 with `torch.cat([(boxes_ctr - anchors_ctr) / anchors_wh, torch.log(boxes_wh / anchors_wh)], 1)`. This is the encoded format of the bounding boxes.
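To make the arithmetic concrete, here is a small worked example for a single anchor and a single target box in plain Python. It mirrors the tensor expressions quoted above, including the `+ 1` width convention, but the function name `encode_box` is hypothetical and this is only a sketch of the idea rather than the source of `encode_boxes`.

```python
import math


def encode_box(box, anchor):
    """Encode one target box relative to one anchor.

    Both are (xmin, ymin, xmax, ymax). Returns [dx, dy, dw, dh].
    """
    # Width and height, using the same "+ 1" convention as anchors_wh.
    box_w, box_h = box[2] - box[0] + 1, box[3] - box[1] + 1
    anc_w, anc_h = anchor[2] - anchor[0] + 1, anchor[3] - anchor[1] + 1

    # Centers: top-left corner plus half the size.
    box_cx, box_cy = box[0] + 0.5 * box_w, box[1] + 0.5 * box_h
    anc_cx, anc_cy = anchor[0] + 0.5 * anc_w, anchor[1] + 0.5 * anc_h

    return [
        (box_cx - anc_cx) / anc_w,   # center shift in units of anchor width
        (box_cy - anc_cy) / anc_h,   # center shift in units of anchor height
        math.log(box_w / anc_w),     # log of the width ratio
        math.log(box_h / anc_h),     # log of the height ratio
    ]


anchor = (0, 0, 9, 9)    # a 10x10 anchor centered at (5, 5)
target = (5, 5, 24, 14)  # a 20x10 box centered at (15, 10)
print(encode_box(target, anchor))
```

The target's center is one anchor width to the right and half an anchor height below the anchor's center, the target is twice as wide, and the heights are equal, so the encoding is `[1.0, 0.5, log(2), 0.0]`.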

## <font style="color:green">3.2. Decoding Boxes</font>

After training, we want to convert the predicted offsets back into the $[x_{min}, y_{min}, x_{max}, y_{max}]$ format we started with.
For this we use the decoding procedure, which is the inverse of encoding.
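As a sketch of what the inverse looks like for a single box, the function below undoes the encoding formulas from section 3.1 in plain Python. The name `decode_box` is hypothetical, and the `- 1` on the maximum corner is an assumption chosen to mirror the `+ 1` in the width computation; the actual `decode_boxes` may handle this detail differently.

```python
import math


def decode_box(offsets, anchor):
    """Turn [dx, dy, dw, dh] offsets back into (xmin, ymin, xmax, ymax)."""
    dx, dy, dw, dh = offsets
    anc_w, anc_h = anchor[2] - anchor[0] + 1, anchor[3] - anchor[1] + 1
    anc_cx, anc_cy = anchor[0] + 0.5 * anc_w, anchor[1] + 0.5 * anc_h

    # Invert the log of the size ratio.
    box_w = math.exp(dw) * anc_w
    box_h = math.exp(dh) * anc_h

    # Invert the normalized center shift.
    box_cx = dx * anc_w + anc_cx
    box_cy = dy * anc_h + anc_cy

    # Center and size back to corners (assumed inclusive max corner).
    return [
        box_cx - 0.5 * box_w,
        box_cy - 0.5 * box_h,
        box_cx + 0.5 * box_w - 1,
        box_cy + 0.5 * box_h - 1,
    ]


anchor = (0, 0, 9, 9)
offsets = [1.0, 0.5, math.log(2.0), 0.0]
print(decode_box(offsets, anchor))  # approximately [5, 5, 24, 14]
```

Up to floating-point rounding, this recovers a 20x10 box with corners (5, 5) and (24, 14) from a 10x10 anchor.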

```python
Code(data=inspect.getsource(decode_boxes))
```

Once we have obtained all of the predictions, we want to keep only those in which the network is confident enough.

That is why we apply a classification threshold of 0.5 and discard all predictions with lower probabilities.
Even after thresholding, we may still have several bounding boxes with similar coordinates, which we want to suppress.
To remove this redundancy, we apply vanilla non-maximum suppression (NMS).

It picks the predicted box with the highest confidence, removes every other box whose IoU with it is greater than 0.5, and repeats the process on the boxes that remain.
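The two filtering steps can be sketched in plain Python as below. This is a simplified, single-class illustration under stated assumptions: the names `nms` and `iou` are hypothetical, scores equal to the threshold are kept, and `DataEncoder.decode` may organize the same steps differently.

```python
def iou(a, b):
    """IoU of two boxes in (xmin, ymin, xmax, ymax) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Step 1: drop low-confidence predictions.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Step 2: process the remaining boxes from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        # Remove everything that overlaps the chosen box too strongly.
        candidates = [
            i for i in candidates
            if iou(boxes[best], boxes[i]) <= iou_thresh
        ]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores))
```

In this example the last box is dropped by the score threshold, the second is suppressed because its IoU with the first is about 0.68, and the first and third boxes are kept.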

```python
Code(data=inspect.getsource(DataEncoder.decode))
```