# ECE 364 Lecture 26
## Semantic Segmentation and Object Detection
### Learning Objectives:
After this lecture, students will be able to:
* State how labels are expressed for a semantic segmentation problem and how semantic segmentation is different from ordinary image classification.
* State how labels are expressed for an object detection problem.
* Understand the concepts of positive and negative default/anchor boxes, non-maximum suppression, and inference vs. training in object detection.
* Explain how single-stage object detectors are trained and used for inference on testing images.

## Semantic Segmentation Review
<div>
    <center><img src="semantic-segmentation-example.png" width="500"></center>
</div>

Consider the above figure depicting a semantic segmentation example. Each pixel in this image is assigned its own label, just as we assign a single image-level label in image classification. To accomplish such a task, we often use **encoder-decoder** CNN model architectures, also known as **autoencoders**. The encoder stage proceeds like a normal CNN, where successive convolutional and pooling layers reduce the spatial resolution of the feature maps. The encoder and decoder stages meet at a bottleneck, after which the decoder stage begins upsampling the feature maps back towards the desired output resolution. Where the encoder stage applies pooling or strided convolution, the decoder stage performs **upsampling** or **transposed convolution** to increase the spatial resolution of feature maps. Finally, once the desired resolution is reached, we may use a $1\times 1$ 2D convolution layer to combine the feature maps in the last layer and produce final class scores at each pixel, much as a fully connected layer does for regular image classification.

Such CNNs have no fully-connected layers (and in fact may avoid all pooling layers as well) and thus we refer to them as **fully convolutional networks** (FCN). An example autoencoder model is shown below.

<div>
    <center><img src="autoencoder-example.png" width="600"></center>
</div>
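To make the role of the final $1\times 1$ convolution concrete, the following minimal sketch applies one to a tiny feature tensor and then picks the highest-scoring class at every pixel. The feature values, weights, and function names are hypothetical and chosen only for illustration; a real model would use a learned convolution layer.

```python
def conv1x1(features, weights, biases):
    """Apply a 1x1 convolution: features[c][y][x] -> scores[k][y][x].

    Each output pixel is a linear combination of the input channels at the
    same pixel, i.e. a small fully connected layer shared across all pixels.
    """
    n_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                biases[k] + sum(weights[k][c] * features[c][y][x] for c in range(n_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for k in range(len(weights))
    ]


def predict_labels(scores):
    """Pick the highest-scoring class independently at every pixel."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda k: scores[k][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical input: 2 feature channels on a 2x3 grid.
features = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
    [[0.0, 2.0, 1.0], [1.0, 0.0, 4.0]],
]
# Hypothetical weights for 3 classes, each combining the 2 channels.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.1]

scores = conv1x1(features, weights, biases)   # shape: 3 x 2 x 3
label_map = predict_labels(scores)            # shape: 2 x 3, one label per pixel
print(label_map)
```

The output `label_map` has the same spatial size as the input, which is exactly the form that semantic segmentation labels take.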

### U-Net

One highly popular semantic segmentation architecture is known as [U-Net](https://arxiv.org/pdf/1505.04597) (Ronneberger, et al. 2015). The authors introduce a novel autoencoder-based model that incorporates connections between encoder feature maps and decoder feature maps. The intuition behind this choice is to allow feature learning that combines earlier primitive features from the encoder with more complex representations from the decoder. For each connection, the feature maps of the encoder and decoder (at the same spatial resolution) are simply concatenated along the feature channel dimension. The below figure demonstrates a U-Net model with two encoder/decoder stages for a binary segmentation problem; thus, the output of the last convolution is passed through a sigmoid layer. The number of channels is set by a base width $L$, and the label underneath each convolutional layer is its number of output channels. In the below figure, yellow prisms are convolutional layers, orange prisms perform downsampling by max pooling, blue prisms perform upsampling, and green plus symbols concatenate feature maps.

<div>
    <center><img src="u-net.png" width="800"></center>
</div>
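The skip connections described above amount to a concatenation along the channel axis. The sketch below shows this with nested lists; the channel counts and the helper names `concat_channels` and `zeros` are hypothetical, and a real implementation would typically concatenate tensors in a deep learning framework instead.

```python
def concat_channels(encoder_maps, decoder_maps):
    """Concatenate two C x H x W feature tensors (nested lists) along the channel axis."""
    enc_hw = (len(encoder_maps[0]), len(encoder_maps[0][0]))
    dec_hw = (len(decoder_maps[0]), len(decoder_maps[0][0]))
    if enc_hw != dec_hw:
        # Skip connections only make sense at matching spatial resolution.
        raise ValueError(f"spatial sizes differ: {enc_hw} vs {dec_hw}")
    return encoder_maps + decoder_maps


def zeros(channels, height, width):
    """Build a channels x height x width tensor of zeros as nested lists."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


encoder_maps = zeros(4, 8, 8)   # hypothetical encoder output: 4 channels
decoder_maps = zeros(4, 8, 8)   # hypothetical upsampled decoder output: 4 channels
merged = concat_channels(encoder_maps, decoder_maps)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 8 8 8
```

The convolution that follows the concatenation therefore sees twice as many input channels as either branch alone provides.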

## Object Detection
### Problem Statement
The objective of any object detection problem is to place bounding boxes over every object of interest in an image and also classify the object inside of that box. Every instance of a given class must be separately identified. Thus, we may have many different objects that must be localized and separated by class as seen below.
<div>
    <center><img src="object-detection-example.png" width="500"></center>
</div>
More formally, for an (RGB) image $\mathbf{X}\in\mathbb{R}^{3\times H\times W}$ with $K$ annotations across $C$ possible classes, $\mathbf{Y}\in\mathbb{A}^{K}$ represents the annotations where $Y_i\in\mathbb{A}=\{1, 2, \ldots, C\}\times\mathbb{R}^{4}$. Each annotation must identify one of the $C$ classes as well as four coordinates specifying the position and size of the bounding box. Multiple conventions exist for the bounding box; we will specify the coordinates in the form $(x_{\textrm{min}}, y_{\textrm{min}}, x_{\textrm{max}}, y_{\textrm{max}})\in[0, 1]^{4}$. In this format, the coordinates are normalized from 0 to 1 relative to the width and height of the image, while $(x_{\textrm{min}}, y_{\textrm{min}})$ and $(x_{\textrm{max}}, y_{\textrm{max}})$ give the top-left and bottom-right corners of the bounding box. In summary, the annotations for an image in the context of object detection may contain $K>1$ annotations, where each annotation specifies the class, top-left corner, and bottom-right corner of the bounding box that tightly holds the object.
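As a small illustration of this label format, the sketch below converts pixel-coordinate boxes into the normalized $(x_{\textrm{min}}, y_{\textrm{min}}, x_{\textrm{max}}, y_{\textrm{max}})$ form. The image size, class indices, and box values are hypothetical.

```python
def normalize_box(box_px, image_width, image_height):
    """Convert a pixel-coordinate (x_min, y_min, x_max, y_max) box to [0, 1] coordinates."""
    x_min, y_min, x_max, y_max = box_px
    return (
        x_min / image_width,
        y_min / image_height,
        x_max / image_width,
        y_max / image_height,
    )


# Hypothetical image with K = 2 annotations: (class index, pixel box).
image_width, image_height = 640, 480
annotations_px = [
    (1, (80, 60, 240, 300)),
    (3, (320, 120, 480, 360)),
]

annotations = [
    (cls, normalize_box(box, image_width, image_height))
    for cls, box in annotations_px
]
for cls, box in annotations:
    print(cls, box)
```

Note that the number of annotations $K$ varies from image to image, so labels are usually stored as a list per image.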

### Single-stage Object Detection
Object detectors are separated into two main groups: single-stage and two-stage detectors. In this lecture, we will focus on single-stage detectors and in particular the Single Shot Multibox Detector, also known as SSD. Implementations of [SSD](https://pytorch.org/vision/main/models/ssd.html) and a lightweight [SSDLite](https://pytorch.org/vision/main/models/ssdlite.html) version are available in PyTorch. At a high level, the SSD model creates many convolutional feature maps that have hard-coded **anchors** or **default boxes** in the image. The SSD model, and similar single-stage models, use multiple scales or resolutions for these pre-defined boxes. For the feature maps at a given scale, each pixel location at that scale defines a center point relative to the image dimensions, e.g., for a $16\times 16$ feature map, the pixel at location $(2, 1)$ would be centered at $(2.5/16, 1.5/16)$ of the width and height of the image (the extra 0.5 shifts the point to the center of each square in the image grid). In addition to the center point and a given scale, we may also pre-determine certain aspect ratios, e.g., 1:1, 1:2, 3:1 for width:height. The below figure depicts example default boxes at two scales in SSD.

<div>
    <center><img src="default-boxes.png" width="500"></center>
</div>

For the SSD model, we have $m$ default box scales. For each $k\in\{1, \ldots, m\}$, the scale $s_k$ is given by
$$
s_k= s_{\textrm{min}}+\frac{s_{\textrm{max}}-s_{\textrm{min}}}{m-1}(k-1),
$$
where $s_{\textrm{min}}$ and $s_{\textrm{max}}$ are chosen as 0.2 and 0.9 in the SSD paper but may be adjusted in practice. For a feature map at scale $s_k$ of size $H_k\times W_k$, the center point at pixel location $(i, j)$ will correspond to location
$$
\left(\frac{i+0.5}{W_k}, \frac{j+0.5}{H_k}\right)
$$
in the original image. Thus, we have a center point and scale for each default box. Lastly, we pre-define aspect ratios $a_r\in\{1, 2, 3, 1/2, 1/3\}$ to set multiple available heights and widths for default boxes at the given center location and scale. The resulting height and width at scale $k$ and aspect ratio $a_r$ are given by
$$
\begin{align}
h_k^a &= \frac{s_k}{\sqrt{a_r}}\\
w_k^a &= s_k\sqrt{a_r}.
\end{align}
$$
Finally, the SSD authors also define a sixth default box whose aspect ratio is also 1:1 but whose scale is the intermediate value $s_k'=\sqrt{s_ks_{k+1}}$. In total, the original SSD model proposed for $300\times 300$ images has $8,\!732$ default boxes across 6 feature scales.
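The default box construction above can be sketched in a few lines of Python. The function names are hypothetical. The feature map sizes and the number of boxes per location used for the final count are taken from the SSD paper's $300\times 300$ configuration rather than from this lecture; in that configuration, some layers use only four boxes per location by omitting the aspect ratios 3 and 1/3.

```python
import math


def ssd_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_1, ..., s_m between s_min and s_max."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_shapes(s_k, s_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) pairs at one scale, plus the extra 1:1 box at sqrt(s_k * s_next)."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    s_prime = math.sqrt(s_k * s_next)
    shapes.append((s_prime, s_prime))
    return shapes


def centers(h_k, w_k):
    """Normalized (x, y) centers for every cell of an h_k x w_k feature map, row by row."""
    return [((i + 0.5) / w_k, (j + 0.5) / h_k) for j in range(h_k) for i in range(w_k)]


scales = ssd_scales(m=6)
shapes = box_shapes(scales[0], scales[1])
print([round(s, 2) for s in scales])
print(centers(16, 16)[1 * 16 + 2])  # cell (i=2, j=1) of a 16x16 map

# SSD300 layout as reported in the SSD paper (an assumption relative to this lecture).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]
total_boxes = sum(n * n * b for n, b in zip(feature_map_sizes, boxes_per_location))
print(total_boxes)  # 8732
```

Every box at a given scale and aspect ratio has area $w_k^a h_k^a = s_k^2$, so changing the aspect ratio reshapes the box without changing its size.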

### Positive and Negative Default Boxes

The default boxes in an SSD model are separated into positive and negative groups. A default box is positive if it has an intersection over union (IoU) of at least 0.5 with a ground-truth object box. Thus, multiple default boxes may be positive for a given object box. All other default boxes are negative; as a result, there is a large imbalance, with many more negative default boxes than positive ones.
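A minimal sketch of IoU and of the positive/negative assignment described above is given below, using boxes in $(x_{\textrm{min}}, y_{\textrm{min}}, x_{\textrm{max}}, y_{\textrm{max}})$ format. The function names and example boxes are hypothetical, and only the thresholding rule from this section is implemented.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def label_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its best ground-truth match
    if the IoU is at least `threshold` (positive), otherwise None (negative)."""
    matches = []
    for d in default_boxes:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gt_boxes):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        matches.append(best_j if best_iou >= threshold else None)
    return matches


# Hypothetical ground-truth box and four default boxes.
gt_boxes = [(0.2, 0.2, 0.6, 0.6)]
default_boxes = [
    (0.2, 0.2, 0.6, 0.6),  # identical to the object box
    (0.3, 0.2, 0.7, 0.6),  # shifted right, still overlaps well
    (0.6, 0.6, 0.9, 0.9),  # touches only at a corner
    (0.4, 0.4, 0.8, 0.8),  # small overlap
]
matches = label_default_boxes(default_boxes, gt_boxes)
print(matches)  # [0, 0, None, None]
```

Even in this toy case, two default boxes are positive for the same object, which is why de-duplication is needed at inference time.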

### Predictions at Default Boxes

To create one scale of default box predictions in SSD, we use a convolutional layer with $3\times 3$ kernels, padding of one, and stride one. Suppose the input to this layer at scale $k$ is $Z_k\in\mathbb{R}^{C_k\times H_k \times W_k}$. The resulting convolutional layer will have $6\times (C+1+4)$ output channels. The number of output channels is broken down as follows:
$$
\textrm{\# of Output Channels} = \underbrace{6}_{\textrm{number of aspect ratios}}\times\left(\underbrace{C+1}_{C~\textrm{foreground classes}+1~\textrm{background class}} + \underbrace{4}_{\textrm{bounding box offsets}}\right)
$$

The four bounding box offsets are defined as $(\Delta c_x, \Delta c_y, \Delta w, \Delta h)\in\mathbb{R}^4$ to indicate proportional changes in the center point, width, and height of the bounding box going from the pre-defined default box to the ground-truth object box; each offset may be positive or negative. Thus, these offsets perform a regression task to make minor adjustments to better fit default boxes with significant object overlap. The below figure depicts the architecture of the SSD model.

<div>
    <center><img src="ssd.png" width="800"></center>
</div>
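The channel count of each prediction layer follows directly from the breakdown above. The sketch below computes it and splits the output vector at one pixel into per-box class scores and offsets. The helper names and the channel ordering (all values for box 0, then box 1, and so on) are assumptions made for illustration; actual implementations may order channels differently or use separate classification and regression layers.

```python
def ssd_head_channels(num_foreground_classes, boxes_per_location=6):
    """Output channels: boxes x ((C foreground + 1 background) + 4 offsets)."""
    return boxes_per_location * ((num_foreground_classes + 1) + 4)


def split_prediction(vector, num_foreground_classes, boxes_per_location=6):
    """Split one pixel's channel vector into (class_scores, offsets) per default box.

    Assumes a hypothetical box-major channel ordering.
    """
    per_box = num_foreground_classes + 1 + 4
    predictions = []
    for b in range(boxes_per_location):
        chunk = vector[b * per_box:(b + 1) * per_box]
        class_scores = chunk[:num_foreground_classes + 1]
        offsets = chunk[num_foreground_classes + 1:]
        predictions.append((class_scores, offsets))
    return predictions


num_classes = 20  # hypothetical number of foreground classes
channels = ssd_head_channels(num_classes)
print(channels)  # 150

pixel_vector = list(range(channels))  # stand-in for the outputs at one pixel
per_box_predictions = split_prediction(pixel_vector, num_classes)
print(len(per_box_predictions), len(per_box_predictions[0][0]), len(per_box_predictions[0][1]))
```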

### Training Object Detectors

Using these default boxes, both positive and negative, we may finally define the necessary loss function to train the fully convolutional SSD model (and similar single-stage detectors). Let $x_{ij}^p\in\{0, 1\}$ be an indicator of matching default box $i$ to ground-truth box $j$ from class $p$, $c$ be the class of the bounding box, $l$ be the predicted bounding box, $g$ be the ground-truth bounding box, and $d$ be the matched default box. The loss function $\mathcal{L}$ is given as
$$
\mathcal{L}(x, c, l, g) = \frac{1}{N}\left(\mathcal{L}_{\textrm{cls}}(x, c) +\alpha\mathcal{L}_{\textrm{loc}}(x, l, g)\right),
$$
where $N$ is the number of matched default boxes for the given image and
$$
\begin{align}
    \mathcal{L}_{\textrm{loc}}(x, l, g) &= \sum_{i\in\textrm{Positive Boxes}}^{N}\sum_{a\in\{cx, cy, w, h\}}x_{ij}^{p}\,\textrm{Smooth}_{L1}\left(l_i^a-\hat{g}_j^a\right)\\
    \hat{g}_j^{cx} &= \frac{g_j^{cx}-d_i^{cx}}{d_i^{w}}\\
    \hat{g}_j^{cy} &= \frac{g_j^{cy}-d_i^{cy}}{d_i^{h}}\\
    \hat{g}_j^{w} &= \log\left(\frac{g_j^w}{d_i^w}\right)\\
    \hat{g}_j^{h} &= \log\left(\frac{g_j^h}{d_i^h}\right).
\end{align}
$$
Thus, the localization loss seeks to regress the relative offsets of the well-matched default boxes, i.e., slightly shift the center point or increase/decrease the height or width, using the [Smooth L1 loss function](https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html). The classification loss is simply the cross-entropy loss, where we apply softmax to the class scores and maximize the probability of the ground-truth class. For negative default boxes, the target is the background class (class 0). Finally, the value of $\alpha> 0$ is a hyperparameter to balance the classification and localization losses. Empirically, the authors choose this value as one.
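The regression targets $\hat{g}$ and the localization loss can be sketched as follows, with boxes given in $(c_x, c_y, w, h)$ format. The function names and box values are hypothetical, and the Smooth L1 function uses a transition point of 1, which is an assumption that matches the default of the PyTorch loss linked above.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_offsets(gt, default):
    """Regression targets (g_hat) for a ground-truth box relative to its default box."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def localization_loss(predicted_offsets, gt_boxes, default_boxes):
    """Sum of Smooth L1 errors over matched (positive) default boxes.

    The three lists are aligned: entry i holds the prediction, the matched
    ground-truth box, and the default box for the i-th positive match.
    """
    total = 0.0
    for l, g, d in zip(predicted_offsets, gt_boxes, default_boxes):
        target = encode_offsets(g, d)
        total += sum(smooth_l1(l_a - t_a) for l_a, t_a in zip(l, target))
    return total


# Hypothetical positive match: the object is shifted right of the default box.
default_box = (0.5, 0.5, 0.5, 0.5)
gt_box = (0.75, 0.5, 0.5, 0.5)
target = encode_offsets(gt_box, default_box)
print(target)

loss_zero_pred = localization_loss([(0.0, 0.0, 0.0, 0.0)], [gt_box], [default_box])
loss_perfect = localization_loss([target], [gt_box], [default_box])
print(loss_zero_pred, loss_perfect)
```

A ground-truth box that is to the left of, or smaller than, its default box produces negative targets, so the predicted offsets must be free to take either sign.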

### Inference in Object Detectors
In order to evaluate an object detector, we must have a procedure to perform **inference** and produce final predictions of bounding boxes and object classes. Our model will have thousands of default boxes each with class scores and box offsets; thus, we will first filter out boxes that are likely mostly overlapping with the background and not with objects. A class probability threshold is typically set for the highest-probability foreground class, e.g. 0.02, to identify possible object boxes. Even with such a low threshold, most of the default boxes will be filtered out. For a single object, there may be several remaining default boxes that overlap with the object of interest. Thus, we must devise a way to de-duplicate boxes. 
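The thresholding step can be sketched as below: softmax converts each box's class scores to probabilities, and a box is kept only if its best foreground probability exceeds the threshold. The scores are hypothetical, class 0 is taken to be background, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_by_confidence(class_scores, threshold=0.02):
    """Keep (box_index, class_id, probability) for boxes whose highest
    foreground probability exceeds `threshold`. Index 0 is background."""
    kept = []
    for i, scores in enumerate(class_scores):
        probs = softmax(scores)
        best = max(range(1, len(probs)), key=lambda c: probs[c])
        if probs[best] > threshold:
            kept.append((i, best, probs[best]))
    return kept


# Hypothetical scores for three boxes over (background, class 1, class 2).
class_scores = [
    [8.0, 0.0, 0.0],  # confidently background
    [0.0, 3.0, 0.0],  # confidently class 1
    [2.0, 0.0, 1.0],  # mostly background, some evidence for class 2
]
kept = filter_by_confidence(class_scores)
print([(i, c, round(p, 3)) for i, c, p in kept])
```

The third box survives the low threshold even though background is its most likely class, which illustrates why a further de-duplication step is still required.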

### Non-maximum Suppression (NMS)
The method of non-maximum suppression (NMS) is commonly used to produce the final predictions for object bounding boxes. After class score thresholding, NMS sorts the remaining boxes in descending order of the highest non-background class probability. Working in order through the boxes, we remove any lower-probability box whose overlap (IoU) with a retained higher-probability box is greater than some threshold, e.g., 0.4. After NMS, we will have a collection of bounding boxes with relatively little overlap and well-defined class scores that predict some object class.

<div>
    <center><img src="nms.png" width="800"></center>
</div>
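A minimal greedy NMS sketch is shown below. It treats all boxes together regardless of class, which is a simplifying assumption, and the boxes and scores are hypothetical. Library implementations such as `torchvision.ops.nms` provide the same operation on tensors.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4):
    """Return indices of kept boxes, in descending score order.

    A box is kept only if its IoU with every already-kept
    (higher-scoring) box is at most `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: three near-duplicates of one object and one separate object.
boxes = [
    (0.10, 0.10, 0.50, 0.50),
    (0.12, 0.10, 0.52, 0.50),
    (0.60, 0.60, 0.90, 0.90),
    (0.10, 0.12, 0.50, 0.52),
]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```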

### Average Precision (AP) and Mean Average Precision (mAP)

The most popular metric for evaluating object detection models is known as Average Precision (AP). Each predicted bounding box falls into one of two categories:
* True Positive (TP): Sufficiently overlaps with an object and predicts the correct class.
* False Positive (FP): Does not sufficiently overlap with an object and/or predicts the incorrect class.

Furthermore, a ground-truth bounding box is counted as a **False Negative** (FN) if no predicted bounding box matches it. For a given class probability threshold (i.e., after removing boxes below this threshold), every predicted bounding box will be either a TP or an FP, while any unmatched ground-truth boxes will be FNs. By counting the number of TP, FP, and FN boxes, we may compute **precision** and **recall**:
$$
\begin{align}
    \textrm{Precision} &= \frac{TP}{TP+FP}\in [0, 1]\\
    \textrm{Recall} &= \frac{TP}{TP+FN}\in [0, 1].
\end{align}
$$
Intuitively, precision evaluates how often our model is correct when it identifies an object, while recall evaluates what proportion of objects our model is able to find. A model that only predicts one object box but is correct will have high precision but low recall. Conversely, a model that predicts many bounding boxes will likely have high recall and find all the objects but may have low precision since many predicted boxes will be false positives. As we lower the class probability threshold from 1 down to 0, the precision will typically drop while the recall rises. The resulting series of points forms the **precision-recall curve** (PRC). The metric of Average Precision is the area under the precision-recall curve (also referred to as AUPRC). [This tutorial](https://kili-technology.com/data-labeling/machine-learning/mean-average-precision-map-a-complete-guide) is helpful for visualizing the PRC and how we compute AP.
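The threshold sweep and the area computation can be sketched as follows. Each prediction is assumed to be already marked as a TP or FP, and the scores and flags are hypothetical. The area is computed with a simple rectangle rule over recall increments; benchmark toolkits typically interpolate the precision values first, which this sketch does not do.

```python
def precision_recall_curve(scores, is_true_positive, num_ground_truth):
    """Return (recall, precision) points obtained by lowering the score
    threshold so that one more prediction is admitted at each step."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / num_ground_truth      # TP / (TP + FN)
        precision = tp / (tp + fp)          # TP / (TP + FP)
        points.append((recall, precision))
    return points


def average_precision(points):
    """Area under the precision-recall curve via a rectangle rule (no interpolation)."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical detections for one class: 4 predictions, 4 ground-truth objects.
scores = [0.9, 0.8, 0.7, 0.6]
is_true_positive = [True, False, True, True]
points = precision_recall_curve(scores, is_true_positive, num_ground_truth=4)
ap = average_precision(points)
print(points)
print(round(ap, 4))
```

In this example one ground-truth object is never found, so recall tops out at 0.75 and the AP cannot reach 1.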

We have not yet defined, however, what it means for a predicted bounding box to overlap sufficiently with a ground-truth object. Average Precision also requires that we specify an IoU threshold to determine whether a predicted box overlaps a ground-truth box enough to count as a true positive rather than a false positive. Thus, we will often use a subscript, e.g., $\textrm{AP}_{0.5}$, to indicate the IoU threshold used for computing Average Precision. AP is computed separately for each object class. Finally, the metric of **mean Average Precision** (mAP) averages AP over all classes; in COCO-style evaluation, it is additionally averaged over multiple IoU thresholds, typically from 0.5 to 0.95 in increments of 0.05.
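To show how the IoU threshold changes what counts as a true positive, the sketch below matches predictions to ground-truth boxes at a given threshold and counts TP, FP, and FN. The greedy one-to-one matching in descending score order is a common convention but is an assumption here, as are the function names and example boxes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truth, iou_threshold):
    """Count (TP, FP, FN) at one IoU threshold.

    predictions: list of (score, class_id, box); ground_truth: list of (class_id, box).
    Each ground-truth box can be matched by at most one prediction.
    """
    matched = set()
    tp = fp = 0
    for _, cls, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_j, best_iou = None, 0.0
        for j, (gt_cls, gt_box) in enumerate(ground_truth):
            if j in matched or gt_cls != cls:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(ground_truth) - len(matched)


# Hypothetical example: one object and one prediction with IoU of about 0.6.
ground_truth = [(1, (0.2, 0.2, 0.6, 0.6))]
predictions = [(0.9, 1, (0.3, 0.2, 0.7, 0.6))]

counts_at_050 = count_matches(predictions, ground_truth, iou_threshold=0.5)
counts_at_075 = count_matches(predictions, ground_truth, iou_threshold=0.75)
print(counts_at_050)  # (1, 0, 0)
print(counts_at_075)  # (0, 1, 1)
```

The same prediction is a true positive at an IoU threshold of 0.5 but becomes a false positive (and leaves a false negative) at 0.75, which is why averaging over several thresholds rewards tighter localization.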