# <font style="color:blue">1. Detector NN-Architecture</font>

A single-stage object detection pipeline looks as follows:

---

<img src='https://www.learnopencv.com/wp-content/uploads/2020/03/c3-w8-pipeline.png' align="middle">

---

We extract features with a backbone and then use two branches for prediction: a regression branch that predicts the box coordinates of an object, and a classification branch that predicts which class the detected object belongs to.

In this unit, we will go into the details of the detector network architecture.

Here, we will use the [Feature Pyramid Network](https://arxiv.org/pdf/1612.03144.pdf) for feature extraction. On top of it, we will use a class subnet and a box subnet to predict the classes and the bounding boxes. Let's have a look at the following architecture.

---

<img src='https://www.learnopencv.com/wp-content/uploads/2020/03/c3-w8-retinanet.png' align='middle'>

---

The Feature Pyramid Network is built on top of ResNet (we will use `ResNet-18`) in a fully convolutional fashion.
It includes two pathways: **bottom-up (forward)** and **top-down**, which goes in the opposite direction. The two pathways are connected by lateral connections.


The bottom-up pathway is the usual feedforward pass that extracts the features. Nothing new here.

**What about the top-down pathway?**

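Features closer to the input image carry rich spatial (bounding box) information, while features from deeper layers are coarser but semantically stronger. The top-down pathway therefore merges the feature maps from different levels of the pyramid, so that the resulting maps are both spatially precise and semantically rich.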

**Let's have a close look at the top-down and lateral connection.**

---

<img src='https://www.learnopencv.com/wp-content/uploads/2020/03/c3-w8-lateral_con.png' align='middle'>

---

The higher-level (coarser) feature map is upsampled by a factor of 2 using nearest-neighbor upsampling. The corresponding larger feature map from the bottom-up pathway passes through a 1x1 convolutional layer that reduces its channel dimension. The two feature maps are then added element-wise. The process repeats until the finest merged feature map is created.
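One merge step can be sketched in a few lines of PyTorch. This is a minimal illustration and not the code of the `FPN` class: the tensor shapes and the name `lateral_conv` are assumptions chosen to resemble the `layer3`/`layer4` outputs of ResNet-18 for a `256x256` input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes for illustration only.
coarse = torch.rand(2, 64, 8, 8)    # higher-level map, already reduced to 64 channels
fine = torch.rand(2, 256, 16, 16)   # larger bottom-up map with more channels

# Hypothetical lateral layer: 1x1 convolution that reduces 256 channels to 64.
lateral_conv = nn.Conv2d(256, 64, kernel_size=1)

# Nearest-neighbor upsampling of the coarse map to the size of the fine map.
upsampled = F.interpolate(coarse, size=fine.shape[-2:], mode='nearest')

# Element-wise addition of the two maps.
merged = lateral_conv(fine) + upsampled
print(merged.shape)  # torch.Size([2, 64, 16, 16])
```

Upsampling to the exact size of the finer map, rather than by a fixed factor of 2, also works when the finer map has an odd width or height.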

**How are these merged feature maps used for prediction?**

The merged feature maps are fed into two separate convolutional subnetworks: one predicts the classes and the other predicts the bounding boxes.

**So we have the following components of the object detector architecture:**

1. ResNet18 (`a`)
2. Feature Pyramid Network (`b`)
3. Prediction Network: (i) class subnet (`c`) and (ii) box subnet (`d`).


Let's have a look into these components with an example input. Let's assume we have a batch size of `2`, and the input image dimension is `3x256x256` (channel first).


In [1]:
import torch
from torchvision import models
from IPython.display import Code
import inspect

from fpn import FPN
from detector import Detector

## <font style="color:green">1.1. ResNet</font>

We will be using `ResNet18`. We will use a pre-trained model because it has already learned rich features for classification. Let's have a look at it.

In [2]:
resnet = models.resnet18(pretrained=True)

In [3]:
print(resnet)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

**We can see that ResNet18 has the following blocks:**
1. `conv1`
1. `bn1`
1. `relu`
1. `maxpool`
1. `layer1`
1. `layer2`
1. `layer3`
1. `layer4`
1. `avgpool`
1. `fc`

`FPN` uses blocks `1` to `8` of this list. Before going into FPN, let's have a look at the output dimensions of these blocks.

We are just focusing on the output dimension, so instead of using a real image, we will use random input.


In [4]:
# batch_size = 2, image dimension = 3 x 256 x 256

image_inputs = torch.rand((2, 3, 256, 256))

x = resnet.conv1(image_inputs)
x = resnet.bn1(x)
x = resnet.relu(x)
x = resnet.maxpool(x)
layer1_output = resnet.layer1(x)
layer2_output = resnet.layer2(layer1_output)
layer3_output = resnet.layer3(layer2_output)
layer4_output = resnet.layer4(layer3_output)

FPN uses `layer2_output`, `layer3_output`, and `layer4_output` to get features from different convolution layers.

Let's have a look at their dimensions.

In [5]:
print('layer2_output size: {}'.format(layer2_output.size()))

print('layer3_output size: {}'.format(layer3_output.size()))

print('layer4_output size: {}'.format(layer4_output.size()))

layer2_output size: torch.Size([2, 128, 32, 32])
layer3_output size: torch.Size([2, 256, 16, 16])
layer4_output size: torch.Size([2, 512, 8, 8])
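These spatial sizes can be reproduced with the standard convolution output-size formula. The sketch below is only a sanity check: `conv_out` and `resnet18_stage_sizes` are hypothetical helper names, the `conv1` and `maxpool` parameters are read from the printed model above, and it assumes that `layer2`, `layer3`, and `layer4` each halve the resolution with a 3x3, stride-2, padding-1 convolution, as in the standard ResNet-18.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_stage_sizes(size):
    """Trace the width (or height) of the feature map through ResNet-18."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    sizes = {'layer1': size}                              # layer1 keeps the resolution
    for name in ('layer2', 'layer3', 'layer4'):
        # Assumed: the first block of each stage downsamples by 2.
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes[name] = size
    return sizes


print(resnet18_stage_sizes(256))
# {'layer1': 64, 'layer2': 32, 'layer3': 16, 'layer4': 8}
```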


## <font style="color:green">1.2. Feature Pyramid Network</font>

Let us have a look at the `FPN` class, which implements the `Feature Pyramid Network`.

In [6]:
Code(data=inspect.getsource(FPN))

We can see that `FPN` adds two more convolution layers, `conv6` and `conv7`, on top of `layer4`.

In [7]:
fpn = FPN()

output = fpn(image_inputs)

for layer in output:
    print(layer.size())

torch.Size([2, 64, 32, 32])
torch.Size([2, 64, 16, 16])
torch.Size([2, 64, 8, 8])
torch.Size([2, 64, 4, 4])
torch.Size([2, 64, 2, 2])


We can see that all layers have the same number of channels (`64`).

We can also see that the width and height of each layer are half of those of the previous layer.

Let's take another example of an input.

In [8]:
image_inputs = torch.rand((2, 3, 300, 300))

output = fpn(image_inputs)

for layer in output:
    print(layer.size())

torch.Size([2, 64, 38, 38])
torch.Size([2, 64, 19, 19])
torch.Size([2, 64, 10, 10])
torch.Size([2, 64, 5, 5])
torch.Size([2, 64, 3, 3])


Here, the number of channels is the same as above (`64`).

However, the width and height are not exactly half of the previous layer's width and height in all cases.

We can use the following expressions to find the width and height of the next layer:

$$
next\_layer\_width = \big \lceil {\frac{current\_layer\_width}{2}} \big\rceil
$$

$$
next\_layer\_height = \big \lceil {\frac{current\_layer\_height}{2}} \big\rceil
$$

Verify these expressions with different examples. You can also verify them by inspecting the `kernel_size`, `stride`, and `padding` of the `Conv2d` layers in the `FPN` class.
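A quick numeric check is sketched below. It assumes that every downsampling step uses `kernel_size=3`, `stride=2`, and `padding=1` (please confirm this against the `FPN` source). `conv_out` is a hypothetical helper that applies the usual convolution output-size formula.

```python
import math


def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Under the assumed parameters, the output size equals ceil(size / 2).
for size in range(1, 513):
    assert conv_out(size) == math.ceil(size / 2)

# Reproduce the chain observed for the 300x300 input, starting at 38.
size = 38
chain = [size]
for _ in range(4):
    size = conv_out(size)
    chain.append(size)
print(chain)  # [38, 19, 10, 5, 3]
```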

## <font style="color:green">1.3. Prediction Network</font>

Let us have a look at the `Detector` class, which implements our detector network.

In [9]:
Code(data=inspect.getsource(Detector))

We can see that the detector has two heads, one for class prediction and another for location prediction. Let's have a look at their output sizes.

In [10]:
image_inputs = torch.rand((2, 3, 256, 256))

detector = Detector()

location_pred, class_pred = detector(image_inputs)

print('location_pred size: {}'.format(location_pred.size()))

print('class_pred size: {}'.format(class_pred.size()))

location_pred size: torch.Size([2, 12276, 4])
class_pred size: torch.Size([2, 12276, 2])


**Where does the number `12276` come from?**

Output of `FPN` for the `256x256` input:

```
torch.Size([2, 64, 32, 32])
torch.Size([2, 64, 16, 16])
torch.Size([2, 64, 8, 8])
torch.Size([2, 64, 4, 4])
torch.Size([2, 64, 2, 2])
```

The location predictor (`loc_pred`) in the detector uses multiple convolutions to transform these outputs into the following:

```
torch.Size([2, 9*4, 32, 32])  # (batch_size, num_anchor*4 , H, W)
torch.Size([2, 9*4, 16, 16])
torch.Size([2, 9*4, 8, 8])
torch.Size([2, 9*4, 4, 4])
torch.Size([2, 9*4, 2, 2])
```

Each `(batch_size, num_anchor*4, H, W)` tensor is then rearranged as follows:

```
(batch_size, num_anchor*4 , H, W)-->(batch_size, H, W, num_anchor*4)-->(batch_size, H*W*num_anchor, 4)
```

`num_anchor = 9`

So, `32*32*9 + 16*16*9 + 8*8*9 + 4*4*9 + 2*2*9 = 12276`.

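From the rearrangement above, it is clear that every spatial location of each `FPN` feature map (from the first, `(32, 32)`, to the last, `(2, 2)`) is mapped to a `9x4` block of values: four box coordinates for each of the `9` anchors. The class predictions are arranged the same way, with `2` values per anchor instead of `4`, as the `class_pred` size shows.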
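The rearrangement can be sketched with `permute` and `reshape`. This is not the `Detector` code: `loc_head` is a hypothetical single-layer head standing in for the real multi-convolution subnet, and random tensors stand in for the `FPN` outputs.

```python
import torch
import torch.nn as nn

num_anchor = 9

# Random stand-ins for the five FPN outputs of a 256x256 input.
fpn_outputs = [torch.rand(2, 64, s, s) for s in (32, 16, 8, 4, 2)]

# Hypothetical head: one 3x3 convolution producing num_anchor*4 channels.
loc_head = nn.Conv2d(64, num_anchor * 4, kernel_size=3, padding=1)

loc_preds = []
for feature_map in fpn_outputs:
    out = loc_head(feature_map)            # (batch_size, num_anchor*4, H, W)
    out = out.permute(0, 2, 3, 1)          # (batch_size, H, W, num_anchor*4)
    out = out.reshape(out.size(0), -1, 4)  # (batch_size, H*W*num_anchor, 4)
    loc_preds.append(out)

# Concatenate the predictions of all pyramid levels along the box axis.
location_pred = torch.cat(loc_preds, dim=1)
print(location_pred.size())  # torch.Size([2, 12276, 4])
```

`reshape` is used instead of `view` because the permuted tensor is no longer contiguous in memory.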

In [11]:
image_inputs = torch.rand((2, 3, 300, 300))

location_pred, class_pred = detector(image_inputs)

print('location_pred size: {}'.format(location_pred.size()))

print('class_pred size: {}'.format(class_pred.size()))

location_pred size: torch.Size([2, 17451, 4])
class_pred size: torch.Size([2, 17451, 2])
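The same counting explains `17451`. Here is a small check in plain Python, using the `FPN` feature-map sizes printed earlier for both inputs (`num_predictions` is a hypothetical helper name):

```python
def num_predictions(feature_sizes, num_anchor=9):
    """Total number of predicted boxes over all pyramid levels."""
    return sum(h * w * num_anchor for h, w in feature_sizes)


sizes_256 = [(32, 32), (16, 16), (8, 8), (4, 4), (2, 2)]
sizes_300 = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3)]

print(num_predictions(sizes_256))  # 12276
print(num_predictions(sizes_300))  # 17451
```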


Why are we interested in the output dimensions of the detector network? Because to compute the loss during training, we need targets whose dimensions match these predictions.
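As a rough illustration of what matching dimensions means, the sketch below pairs random predictions with random targets of compatible shapes. The choice of smooth L1 and cross-entropy losses is an assumption made only for this example, and all tensors here are dummy stand-ins, not real detector outputs or encoded ground truth.

```python
import torch
import torch.nn.functional as F

batch_size, num_boxes, num_classes = 2, 12276, 2

# Dummy stand-ins for the detector outputs.
location_pred = torch.rand(batch_size, num_boxes, 4)
class_pred = torch.rand(batch_size, num_boxes, num_classes)

# Hypothetical targets: one box and one class index per anchor.
location_target = torch.rand(batch_size, num_boxes, 4)
class_target = torch.randint(0, num_classes, (batch_size, num_boxes))

# Assumed losses for illustration; both need shape-compatible targets.
loc_loss = F.smooth_l1_loss(location_pred, location_target)
cls_loss = F.cross_entropy(
    class_pred.reshape(-1, num_classes),  # (batch_size*num_boxes, num_classes)
    class_target.reshape(-1),             # (batch_size*num_boxes,)
)
print(loc_loss.item(), cls_loss.item())
```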