In [1]:
import torch
import numpy as np
import time
import os
from typing import Dict, Optional, Sequence, Tuple

import matplotlib.pyplot as plt
import imageio.v2 as imageio

import torchvision
from torchvision.models.detection import (
    ssdlite320_mobilenet_v3_large,
    SSDLite320_MobileNet_V3_Large_Weights,
    ssd300_vgg16,
    SSD300_VGG16_Weights
)
from torchvision.transforms import v2
from torchvision.ops import box_iou, box_convert, complete_box_iou
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import StepLR
from torchmetrics.detection.mean_ap import MeanAveragePrecision

from pathlib import Path

import CarImageClass

from SSD_from_scratch import mySSD
from SSD_trainer import SSD_train, plot_losses, collate_detection, ConditionalIoUCrop, load_checkpoint, build_targets

# device = "cuda" if torch.cuda.is_available() else "cpu"
# just use cpu for this file
device = 'cpu'

# desktop or laptop
machine = 'laptop'

# Setup path to data folder
if machine == 'laptop':
    folder_path = Path(r"C:\self-driving-car\data")
else:
    folder_path = Path(r"C:\Udacity_car_data\data")

train_path = folder_path / "train"
test_path = folder_path / "test"
train_path_simple = folder_path / "train_simple"
test_path_simple = folder_path / "test_simple"
train_path_oo = folder_path / "train_one_obj"
test_path_oo = folder_path / "test_one_obj"

# How does a single shot detector (SSD) model work?

First, we need to understand the concept of *prior* bounding boxes, or just *priors*.  The model comes with a fixed set of priors determined by its architecture; this SSD model has 8732.  Priors are static: they do not change during training.  Some of the priors are displayed below in red, with the ground truth (GT) bounding box in green.

![36 priors and a car](figures/priors_3.gif)
![19*19*4 priors and a car](figures/priors_19.gif)
![38*38*4 priors and a car](figures/priors_38.gif)

As we can see, the priors come in various sizes, since the objective is to detect objects of various sizes.  For this image, the priors that best match the GT box (those whose intersection over union (IoU) with it exceeds a threshold) are shown below.

![Priors above IoU threshold](figures/priors_above_threshold.gif)
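The selection shown above can be sketched in a few lines of plain Python.  This is only an illustration: the box coordinates and the threshold of 0.5 are hypothetical values chosen for the example, not the ones used to produce the figure.  In batched form the same score is what `box_iou` from `torchvision.ops` computes.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes get zero intersection area
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical GT box and priors in normalized image coordinates
gt_box = (0.30, 0.40, 0.70, 0.80)
priors = [
    (0.25, 0.35, 0.75, 0.85),  # slightly larger than the GT box
    (0.00, 0.00, 0.20, 0.20),  # no overlap with the GT box
    (0.50, 0.40, 0.90, 0.80),  # same size, shifted to the right
]
iou_threshold = 0.5  # assumed value for this example

scores = [iou_xyxy(p, gt_box) for p in priors]
matched = [i for i, s in enumerate(scores) if s > iou_threshold]
print(scores)
print(matched)
```

Only the first prior passes the threshold here; the shifted prior overlaps the GT box but not by enough.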

### Model architecture and outputs
The key features of the model are six convolution layers, which feed into a classification head and a localization head.  The name "single shot detector" reflects that the model predicts class and location in a single forward pass.  The image below is from the paper by W. Liu et al. (see [here](https://link.springer.com/chapter/10.1007/978-3-319-46448-0_2)), which first introduced the SSD model.

![SSD model architecture by W. Liu, et al.](figures/SSD_architecture.png)


The output of the localization head corresponding to 'conv_4_3' in the image above is a tensor of size $(B, 4*4, 38, 38)$.  Let's break down the meaning of each dimension of this tensor (from left to right):

* B: batch size
* 4: number of dimensions in a prior (the form here is $(x_{\mathrm{min}}, y_{\mathrm{min}},x_{\mathrm{max}}, y_{\mathrm{max}})$)
* 4: number of priors per center
* 38: feature map height
* 38: feature map width

Repeating this process for all six of the convolution layers, the localization head produces tensors of size:
* $(B, 4*4, 38, 38)$ (from 'conv_4_3')
* $(B, 4*6, 19, 19)$ (from 'conv_7')
* $(B, 4*6, 10, 10)$ (from 'conv_8_2')
* $(B, 4*6, 5, 5)$ (from 'conv_9_2')
* $(B, 4*4, 3, 3)$ (from 'conv_10_2')
* $(B, 4*4, 1, 1)$ (from 'conv_11_2')

Returning to the images of the priors above, the left/middle/right image shows the priors corresponding to the 'conv_10_2'/'conv_7'/'conv_4_3' layers, respectively.

Now it is clear where the number 8732 comes from, since 

\begin{equation*}
    4*38*38 + 6*19*19 + 6*10*10 + 6*5*5 + 4*3*3 + 4*1*1 = 8732.
\end{equation*}
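The arithmetic above can be checked with a short script.  The feature map sizes and the number of priors per location are taken from the list of tensor sizes above; the variable names are just for illustration.

```python
# (layer name, feature map side length, priors per location)
feature_maps = [
    ("conv_4_3", 38, 4),
    ("conv_7", 19, 6),
    ("conv_8_2", 10, 6),
    ("conv_9_2", 5, 6),
    ("conv_10_2", 3, 4),
    ("conv_11_2", 1, 4),
]

# Each layer contributes (priors per location) * height * width priors
priors_per_layer = {name: k * side * side for name, side, k in feature_maps}
total_priors = sum(priors_per_layer.values())

for name, count in priors_per_layer.items():
    print(f"{name:>10}: {count}")
print("total:", total_priors)
```

Note that 'conv_4_3' alone accounts for 5776 of the priors, roughly two thirds of the total.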

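Each of the six tensors above is first rearranged to size $(B, N, 4)$, where $N$ is the number of priors for that layer.  The output of the localization head is the concatenation of these tensors along the prior dimension, which results in a tensor of size $(B, 8732, 4)$.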

Similarly, the output of the classification head corresponding to 'conv_4_3' is a tensor of size $(B, C*4, 38, 38)$, where $C$ is the number of classes, and all other numbers are the same as before.  Repeating this process for all six of the convolution layers, the classification head produces tensors of size:
* $(B, C*4, 38, 38)$ (from 'conv_4_3')
* $(B, C*6, 19, 19)$ (from 'conv_7')
* $(B, C*6, 10, 10)$ (from 'conv_8_2')
* $(B, C*6, 5, 5)$ (from 'conv_9_2')
* $(B, C*4, 3, 3)$ (from 'conv_10_2')
* $(B, C*4, 1, 1)$ (from 'conv_11_2')

The output of the classification head is the concatenation of the six tensors above (each rearranged so that the priors lie along one dimension), which results in a tensor of size $(B, 8732, C)$.
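The shapes of both head outputs can be reproduced with dummy arrays.  The sketch below uses NumPy rather than the actual model, and it assumes that the values belonging to one prior are stored contiguously along the channel axis, so that moving the channels last and reshaping yields one row per prior; the layout in `mySSD` may differ.  The batch size, the number of classes, and the helper name `flatten_head_output` are hypothetical.

```python
import numpy as np


def flatten_head_output(x, values_per_prior):
    """Rearrange (B, k * values_per_prior, H, W) into (B, H * W * k, values_per_prior)."""
    batch = x.shape[0]
    x = np.transpose(x, (0, 2, 3, 1))  # channels last: (B, H, W, k * values_per_prior)
    return x.reshape(batch, -1, values_per_prior)


# (feature map side length, priors per location) for the six layers
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
B, C = 2, 3  # hypothetical batch size and number of classes

# Dummy per-layer head outputs with the sizes listed above
loc_maps = [np.zeros((B, 4 * k, s, s), dtype=np.float32) for s, k in layers]
cls_maps = [np.zeros((B, C * k, s, s), dtype=np.float32) for s, k in layers]

# Concatenate along the prior dimension
loc = np.concatenate([flatten_head_output(m, 4) for m in loc_maps], axis=1)
cls = np.concatenate([flatten_head_output(m, C) for m in cls_maps], axis=1)

print(loc.shape)
print(cls.shape)
```

Since the priors are fixed, row $i$ of both outputs always refers to the same prior, which is what allows a class score to be paired with a box.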

### How to interpret the model outputs?

The output of the classification head is easier to understand.  Given a single image (i.e. batch size $B=1$), the classification head outputs class logits for each prior.  For a particular prior, the classification head output is $(\ell_1, \ell_2, \ldots, \ell_{C})$, where $C$ is the number of classes and $\ell_j$, $j=1,\ldots, C$, is the logit for class $j$.  The class probabilities $(p_1, p_2, \ldots, p_{C})$ are computed via the softmax function, i.e.
\begin{equation*}
    p_{j} = \frac{e^{\ell_{j}}}{\sum_{i=1}^{C}e^{\ell_{i}}}, \quad j=1,2,\ldots,C.
\end{equation*}
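As a small worked example of the formula above, the following sketch computes the softmax of the logits for a single prior.  The logit values are made up, and subtracting the maximum logit before exponentiating is a standard numerical safeguard that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Convert a list of class logits into class probabilities."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one prior with C = 3 classes
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

print(probs)
print(sum(probs))
```

The class with the largest logit always receives the largest probability, and the probabilities sum to one.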

If the classification score corresponding to a particular prior $p = (c_{x}^{p}, c_{y}^{p}, w^{p}, h^{p})$ is high, a GT box $g = (c_{x}^{g}, c_{y}^{g}, w^{g}, h^{g})$ should be 'close' to $p$.  The localization head does not predict the locations of objects directly; it predicts offsets relative to the priors, of the form $(t_x, t_y, t_w, t_h)$, where
\begin{equation*}
t_{x} = \frac{c_{x}^{\hat{g}} - c_{x}^{p}}{w^{p}v_{c}}, \quad t_{y} = \frac{c_{y}^{\hat{g}} - c_{y}^{p}}{h^{p}v_{c}}, \quad t_{w} = \frac{\log(w^{\hat{g}}/w^{p})}{v_{s}}, \quad t_{h} = \frac{\log(h^{\hat{g}}/h^{p})}{v_{s}},
\end{equation*}
where $\hat{g}=(c_{x}^{\hat{g}}, c_{y}^{\hat{g}}, w^{\hat{g}}, h^{\hat{g}})$ is the predicted bounding box, and $v_{c}$ and $v_{s}$ are fixed scaling constants for the center and size offsets, respectively.  Let us reiterate that the GT box $g$ is unknown and $(t_x, t_y, t_w, t_h)$ are the predicted values, so our predicted bounding box has coordinates
\begin{equation*}
c_{x}^{\hat{g}} = c_{x}^{p} + t_{x}w^{p}v_{c}, \quad c_{y}^{\hat{g}} = c_{y}^{p} + t_{y}h^{p}v_{c}, \quad w^{\hat{g}} = w^{p}\mathrm{e}^{t_{w}v_{s}}, \quad h^{\hat{g}} = h^{p}\mathrm{e}^{t_{h}v_{s}}.
\end{equation*}
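The two sets of formulas above are inverses of each other, which a short round-trip check makes concrete.  This is a sketch for a single box in plain Python: the values $v_{c}=0.1$ and $v_{s}=0.2$ are assumed here because they are common in SSD implementations, and the function names `encode` and `decode` as well as the box coordinates are hypothetical.

```python
import math

V_C, V_S = 0.1, 0.2  # assumed scaling constants, not taken from this notebook


def encode(box, prior, v_c=V_C, v_s=V_S):
    """Offsets (t_x, t_y, t_w, t_h) of a (c_x, c_y, w, h) box relative to a prior."""
    cx, cy, w, h = box
    pcx, pcy, pw, ph = prior
    return (
        (cx - pcx) / (pw * v_c),
        (cy - pcy) / (ph * v_c),
        math.log(w / pw) / v_s,
        math.log(h / ph) / v_s,
    )


def decode(offsets, prior, v_c=V_C, v_s=V_S):
    """Recover the (c_x, c_y, w, h) box from offsets and the prior."""
    tx, ty, tw, th = offsets
    pcx, pcy, pw, ph = prior
    return (
        pcx + tx * pw * v_c,
        pcy + ty * ph * v_c,
        pw * math.exp(tw * v_s),
        ph * math.exp(th * v_s),
    )


# Hypothetical prior and box in (c_x, c_y, w, h) form
prior = (0.50, 0.50, 0.20, 0.30)
box = (0.55, 0.48, 0.25, 0.27)

offsets = encode(box, prior)
recovered = decode(offsets, prior)
print(offsets)
print(recovered)
```

Zero offsets decode to the prior itself, so the localization head only has to learn how far, and in which direction, each prior should be nudged.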