[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Machine Learning - Deep Learning - Object Detection

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 16/06/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0099DeepLearningObjectDetection.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Deep Learning
import torch
import torch.nn            as nn
import torch.nn.functional as F
import torchinfo

import torchvision
from torchvision.transforms import v2 as TorchVisionTrns

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Callable, Dict, Generator, List, Optional, Self, Set, Tuple, Union
from numpy.typing import NDArray
from torch import Tensor

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

# Improve performance by benchmarking
torch.backends.cudnn.benchmark = True

# Reproducibility (Per PyTorch Version on the same device)
# torch.manual_seed(seedNum)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark     = False #<! Makes things slower

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

PROJECT_NAME     = 'FixelCourses'
DATA_FOLDER_NAME = 'DataSets'
BASE_FOLDER_PATH = os.getcwd()[:(len(os.getcwd()) - (os.getcwd()[::-1].lower().find(PROJECT_NAME.lower()[::-1])))]
DATA_FOLDER_PATH = os.path.join(BASE_FOLDER_PATH, DATA_FOLDER_NAME)

TENSOR_BOARD_BASE = 'TB'

D_CLASSES  = {0: 'Red', 1: 'Green', 2: 'Blue'}
L_CLASSES  = ['R', 'G', 'B']
T_IMG_SIZE = (100, 100, 3)

In [None]:
# Download Auxiliary Modules for Google Colab
if runInGoogleColab:
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataManipulation.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataVisualization.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DeepLearningPyTorch.py

In [None]:
# Courses Packages

from DataManipulation import BBoxFormat
from DataManipulation import GenLabeldEllipseImg
from DataVisualization import PlotBox, PlotBBox, PlotLabelsHistogram
from DeepLearningPyTorch import ObjectDetectionDataset, ToTensor, YoloGrid
from DeepLearningPyTorch import CollateObjectDetection, GetBatch, TrainModel

* <font color='blue'>(**!**)</font> Go through `GenLabeldEllipseImg()`.
* <font color='blue'>(**!**)</font> Go through `ObjectDetectionDataset`.
* <font color='blue'>(**!**)</font> Go through `YoloGrid`.

In [None]:
# General Auxiliary Functions

def GenData( numSamples: int, tuImgSize: Tuple[int, int, int], maxObj: int, boxFormat: BBoxFormat = BBoxFormat.YOLO ) -> Tuple[np.ndarray, List[np.ndarray], List[np.ndarray]]:

    mX = np.empty(shape = (numSamples, *tuImgSize[::-1]))
    lY = [None] * numSamples
    lB = [None] * numSamples

    for ii in range(numSamples):
        numObj = np.random.randint(maxObj) + 1
        mI, vLbl, mBB = GenLabeldEllipseImg(tuImgSize[:2], numObj, boxFormat = boxFormat)
        mX[ii]  = np.transpose(mI, (2, 0, 1))
        lY[ii]  = vLbl
        lB[ii]  = mBB
    
    return mX, lY, lB

## Object Detection

The _Object Detection_ task generalizes the _Object Localization_ task in 2 ways:

1. Support for multiple objects in the same image.
2. Detection of whether there is any object at all.

This notebook demonstrates:
 - Generating a synthetic data set.
 - Generating the _target_ data in the YOLO form.
 - Building a model for _Object Detection_.
 - Training a model with a composed objective.

</br>

* <font color='brown'>(**#**)</font> The _Object Detection_ in this notebook is applied in _YOLO_ style: Single Pass, Grid and Anchors.
* <font color='brown'>(**#**)</font> The motivation for a synthetic dataset is being able to implement the whole training process (Existing datasets are huge).  
  Yet the ability to create synthetic dataset is a useful skill.
* <font color='brown'>(**#**)</font> There are known datasets for object detection: [COCO Dataset](https://cocodataset.org), [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).   
  They also define standards for the labeling system.  
  Training on them is on the scale of days.
* <font color='brown'>(**#**)</font> [Object Detection Annotation Formats](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation).
* <font color='brown'>(**#**)</font> Review of Object Detection approaches is given by Lilian Weng: [Part 1: Gradient Vector, HOG, and SS](https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1), [Part 2: CNN, DPM and Overfeat](https://lilianweng.github.io/posts/2017-12-15-object-recognition-part-2), [Part 3: R-CNN Family](https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3), [Part 4: Fast Detection Models](https://lilianweng.github.io/posts/2018-12-27-object-recognition-part-4).
* <font color='brown'>(**#**)</font> A different approach by the SSD Architecture: [SSD object detection: Single Shot MultiBox Detector for real-time processing](https://scribe.rip/9bd8deac0e06), [Review: SSD — Single Shot Detector (Object Detection)](https://scribe.rip/851a94607d11).

In [None]:
# Parameters

# Data
numSamplesTrain = 30_000
numSamplesVal   = 10_000
boxFormat       = BBoxFormat.YOLO
numCls          = len(L_CLASSES) #<! Number of classes
maxObj          = 3

# Model
gridSize = 5 #<! The grid is (gridSize x gridSize) 

# Training
λObj = 1.0 #<! Objectness Loss weight
λReg = 1.0 #<! Localization Loss weight
λCls = 1.0 #<! Classification Loss weight
probThr = 0.5 #<! Probability Threshold (Score)
iouThr  = 0.5 #<! IoU Threshold (Score)

λ = 20.0 #<! Localization Loss
ϵ = 0.1 #<! Label Smoothing

batchSize   = 256
numWorkers  = 2 #<! Number of workers
numEpochs   = 35

# Visualization
numImg = 3

## Generate / Load Data

The data is synthetic data.  
Each image contains ellipses; the color of each ellipse sets its class (`R`, `G`, `B`) and each ellipse is labeled with its bounding rectangle.

* <font color='brown'>(**#**)</font> The label, per object, is a vector of `5`: `[Class, xCenter, yCenter, boxWidth, boxHeight]`.  
* <font color='brown'>(**#**)</font> The label is in `YOLO` format, hence it is normalized to `[0, 1]`.
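
The normalized label can be mapped back to pixel coordinates. The following is a minimal sketch, assuming the `[Class, xCenter, yCenter, boxWidth, boxHeight]` layout above; the function `YoloToCorners()` is a hypothetical helper for illustration and is not part of the course modules.

```python
import numpy as np

def YoloToCorners( vBox: np.ndarray, imgHeight: int, imgWidth: int ) -> np.ndarray:
    # `vBox`: [xCenter, yCenter, boxWidth, boxHeight], all normalized to [0, 1]
    xCenter, yCenter, boxWidth, boxHeight = vBox
    xMin = (xCenter - boxWidth / 2) * imgWidth   #<! Left edge in pixels
    yMin = (yCenter - boxHeight / 2) * imgHeight #<! Top edge in pixels
    xMax = (xCenter + boxWidth / 2) * imgWidth   #<! Right edge in pixels
    yMax = (yCenter + boxHeight / 2) * imgHeight #<! Bottom edge in pixels

    return np.array([xMin, yMin, xMax, yMax])

vLabel   = np.array([1, 0.5, 0.4, 0.2, 0.3]) #<! [Class, xCenter, yCenter, boxWidth, boxHeight]
vCorners = YoloToCorners(vLabel[1:], imgHeight = 100, imgWidth = 100)
print(vCorners) #<! Approximately [40, 25, 60, 55]
```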


In [None]:
# Image Sample

mI, vY, mBB = GenLabeldEllipseImg(T_IMG_SIZE[:2], maxObj, boxFormat = boxFormat)
vYY = np.array([L_CLASSES[clsIdx] for clsIdx in vY])
hA = PlotBox(mI, vYY, mBB)

* <font color='brown'>(**#**)</font> One could use negative values for the bounding box. The model will extrapolate the object dimensions.

In [None]:
# Generate Data

tXTrain, lYTrain, lBBTrain = GenData(numSamplesTrain, T_IMG_SIZE, maxObj, boxFormat = boxFormat)
tXVal,   lYVal,   lBBVal   = GenData(numSamplesVal, T_IMG_SIZE, maxObj, boxFormat = boxFormat)

print(f'The training data set data shape: {tXTrain.shape}')
print(f'The training data set labels length: {len(lYTrain)}')
print(f'The training data set box length: {len(lBBTrain)}')
print(f'The validation data set data shape: {tXVal.shape}')
print(f'The validation data set labels length: {len(lYVal)}')
print(f'The validation data set box length: {len(lBBVal)}')

* <font color='red'>(**?**)</font> Why are lists used instead of arrays for the labels and the bounding boxes?

In [None]:
# Generate Data

dsTrain = ObjectDetectionDataset(tXTrain, lYTrain, lBBTrain)
dsVal   = ObjectDetectionDataset(tXVal, lYVal, lBBVal)
lClass  = dsTrain.GetLabels(uniqueCls = False)

print(f'The training data set data shape: {dsTrain._tX.shape}')
print(f'The validation data set data shape: {dsVal._tX.shape}')
print(f'The unique values of the labels: {np.unique(lClass)}')

* <font color='brown'>(**#**)</font> PyTorch with the `v2` transforms deals with bounding boxes using special type: `BoundingBoxes`.
* <font color='brown'>(**#**)</font> For _data augmentation_ see:
    - [Transforming and Augmenting Images](https://pytorch.org/vision/stable/transforms.html).
    - [Getting Started with Transforms v2](https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_getting_started.html).
    - [Transforms v2: End to End Object Detection / Segmentation Example](https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_e2e.html).
    - [How to Write Your Own v2 Transforms](https://pytorch.org/vision/stable/auto_examples/transforms/plot_custom_transforms.html).

In [None]:
# Element of the Data Set

tX, (vY, mB) = dsTrain[2]

print(f'The features shape: {tX.shape}')
print(f'The label shape: {vY.shape}')
print(f'The bounding box shape: {mB.shape}')

* <font color='brown'>(**#**)</font> Since the labels are in the same contiguous container as the bounding box parameters, their type is `Float`.
* <font color='brown'>(**#**)</font> The bounding box is in the `YOLO` format, namely normalized to the image dimensions.

### Plot the Data

In [None]:
# Plot the Data

vYY = np.array([L_CLASSES[clsIdx] for clsIdx in vY])
hA = PlotBox(np.transpose(tX, (1, 2, 0)), vYY, mB)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(lClass, lClass = L_CLASSES);

* <font color='red'>(**?**)</font> Explain the number of samples in the histogram per class and in total.

### YOLO Style Transformer

The target data should be converted into a YOLO grid style.  
Namely, the bounding box should be redefined relative to the grid cell in which the center of the box resides.

![](https://i.imgur.com/CE1Ef7g.png)

This section implements the transformation as a PyTorch transformation.
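
To make the grid encoding concrete, here is a minimal pure Python sketch of the mapping for the box center only. The names `EncodeYoloCell()` and `DecodeYoloCell()` are hypothetical and are not the `YoloGrid` implementation; the sketch assumes the center is normalized to `[0, 1]` and that the cell index is clamped to the grid.

```python
import math

def EncodeYoloCell( xCenter: float, yCenter: float, gridSize: int ):
    # Index of the cell which contains the center (Clamped so a center of 1.0 stays in the grid)
    cellX = max(0, min(int(math.floor(xCenter * gridSize)), gridSize - 1))
    cellY = max(0, min(int(math.floor(yCenter * gridSize)), gridSize - 1))
    # Offset of the center within the cell, in [0, 1]
    xRel = xCenter * gridSize - cellX
    yRel = yCenter * gridSize - cellY

    return cellX, cellY, xRel, yRel

def DecodeYoloCell( cellX: int, cellY: int, xRel: float, yRel: float, gridSize: int ):
    # Inverse mapping: back to image normalized coordinates
    return (cellX + xRel) / gridSize, (cellY + yRel) / gridSize

cellX, cellY, xRel, yRel = EncodeYoloCell(0.62, 0.27, gridSize = 5)
print(cellX, cellY) #<! 3 1
xRec, yRec = DecodeYoloCell(cellX, cellY, xRel, yRel, gridSize = 5)
```

The width and height are not touched by this mapping: in this notebook they stay normalized to the image size.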

In [None]:
# Extract Data

tX, (vY, mB) = dsTrain[10]

In [None]:
# Plot the Data
lYY = [L_CLASSES[clsIdx] for clsIdx in vY]
hA = PlotBox(np.transpose(tX, (1, 2, 0)), lYY, mB)
hA.grid(True)

### Data Loaders

This section defines the data loaders.  
It uses a special collate function which collates the different detections (Per image) into a list of dictionaries.  
A dictionary per image and the list for the whole batch.

In [None]:
# Data Loader

dlTrain = torch.utils.data.DataLoader(dsTrain, shuffle = True, batch_size = 1 * batchSize, num_workers = 0, collate_fn = CollateObjectDetection, drop_last = True, persistent_workers = False)
dlVal   = torch.utils.data.DataLoader(dsVal, shuffle = False, batch_size = 2 * batchSize, num_workers = 0, collate_fn = CollateObjectDetection, persistent_workers = False)

In [None]:
# Iterate on the Loader
# The first batch.

tX, lY = GetBatch(dlTrain)

print(f'The batch features dimensions: {tX.shape}')
print(f'The targets length: {len(lY)}')
print(f'The dictionary keys per target: {lY[0].keys()}')

* <font color='brown'>(**#**)</font> The number of boxes and labels per sample might be different.  
Hence they cannot be collated into a single tensor.  
The common solution is to pack them as a dictionary per sample.
* <font color='blue'>(**!**)</font> Go through `CollateObjectDetection()`.
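
The idea of such a collate function can be sketched as follows. This is an illustrative sketch only: `CollateDetectionSketch()` is a hypothetical name and the actual `CollateObjectDetection()` may differ in details. It assumes each data set item is `(tX, (vY, mB))` and uses the dictionary keys `'Labels'` and `'Boxes'`.

```python
import torch

def CollateDetectionSketch( lBatch ):
    # Each item: (tX, (vY, mB)) with tX: (C, H, W), vY: (D, ), mB: (D, 4) where D varies per image
    tX = torch.stack([torch.as_tensor(tuItem[0]) for tuItem in lBatch], dim = 0) #<! Images share a shape
    # The targets have a different number of rows per image, hence kept as a list of dictionaries
    lY = [{'Labels': torch.as_tensor(tuItem[1][0]), 'Boxes': torch.as_tensor(tuItem[1][1])} for tuItem in lBatch]

    return tX, lY

lBatch = [
    (torch.zeros(3, 8, 8), (torch.tensor([0]), torch.rand(1, 4))),       #<! Image with a single object
    (torch.zeros(3, 8, 8), (torch.tensor([2, 1, 1]), torch.rand(3, 4))), #<! Image with 3 objects
]

tX, lY = CollateDetectionSketch(lBatch)
print(tX.shape)                            #<! torch.Size([2, 3, 8, 8])
print([dY['Boxes'].shape[0] for dY in lY]) #<! [1, 3]
```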

## The Model

This section defines the model.  

![](https://i.imgur.com/8oqg3Du.png)  
**Credit**: Optimized Convolutional Neural Network Architectures for Efficient On Device Vision based Object Detection

![](https://i.imgur.com/AGQaauN.png)  
**Credit**: [Farid at @ai_fast_track](https://x.com/ai_fast_track/status/1453368771285032971)

* <font color='brown'>(**#**)</font> The following implementation has a model with a single output, both for the regression and the classification.
* <font color='brown'>(**#**)</font> One could create 2 different outputs (_Heads_) for each task.
* <font color='brown'>(**#**)</font> [Finally Understand Anchor Boxes in Object Detection (2D and 3D)](https://www.thinkautonomous.ai/blog/anchor-boxes/) - How the prior is set by clustering the data.
* <font color='brown'>(**#**)</font> Anchor Free Model: [Forget the Hassles of Anchor Boxes with FCOS: Fully Convolutional One Stage Object Detection](https://scribe.rip/fc0e25622e1c).

### Depth Wise Separable Convolution

The Depth Wise Separable Convolution merges 2 concepts:
 - Spatial Convolutions in Groups.
 - Projection by `1x1` Convolution.

* <font color='brown'>(**#**)</font> See [Animated AI - Groups, Depthwise and Depthwise Separable Convolution (Neural Networks)](https://www.youtube.com/watch?v=vVaRhZXovbw).
* <font color='brown'>(**#**)</font> The concept was made popular by models such as EfficientNet ([EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946)) and [MobileNet](https://en.wikipedia.org/wiki/MobileNet).
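
The saving in parameters can be verified by a short calculation. The sketch below counts the weights (bias excluded) of a standard convolution against a depth wise convolution followed by a `1x1` projection; the function names are hypothetical helpers for illustration.

```python
def CountConvParams( inChannels: int, outChannels: int, kernelSize: int ) -> int:
    # Standard convolution: each output channel has a (k x k) filter per input channel
    return inChannels * outChannels * kernelSize * kernelSize

def CountDepthWiseSeparableParams( inChannels: int, outChannels: int, kernelSize: int ) -> int:
    numDepthWise = inChannels * kernelSize * kernelSize #<! A single (k x k) filter per input channel
    numPointWise = inChannels * outChannels             #<! The (1 x 1) projection
    return numDepthWise + numPointWise

numStd = CountConvParams(64, 128, 3)
numSep = CountDepthWiseSeparableParams(64, 128, 3)
print(numStd) #<! 73728
print(numSep) #<! 8768
```

For a `3x3` kernel with 64 input channels and 128 output channels the separable variant uses roughly 8 times fewer weights.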

In [None]:
# Depth Wise Separable Convolutional Layer

class DepthWiseSeparableConv2D( nn.Module ):
    def __init__( self, inChannels: int, outChannels: int, kernelSize: int, stride: int = 1, padding: Union[int, str] = 0, bias: bool = True ) -> None:
        super(DepthWiseSeparableConv2D, self).__init__()

        self.oDepthWiseSeparableConv2D = nn.Sequential(
            nn.Conv2d(inChannels, inChannels, kernel_size = kernelSize, stride = stride, padding = padding, groups = inChannels, bias = bias),
            nn.Conv2d(inChannels, outChannels, kernel_size = 1, stride = 1, padding = 0, bias = bias)
        )
    
    def forward( self, tX: Tensor ) -> Tensor:
        
        tX = self.oDepthWiseSeparableConv2D(tX)
        
        return tX

In [None]:
# Model Class

class YoloLikeModel( nn.Module ):
    def __init__( self, numClasses: int, gridSize: int ) -> None:
        super(YoloLikeModel, self).__init__()

        self.numClasses = numClasses
        self.gridSize   = gridSize

        # self.oFeatureExtractor = nn.Sequential(
        #     DepthWiseSeparableConv2D(3, 16, kernelSize = 3, stride = 1, padding = 1),
        #     nn.ReLU(),
        #     nn.MaxPool2d(kernel_size = 2, stride = 2),

        #     DepthWiseSeparableConv2D(16, 32, kernelSize = 3, stride = 1, padding = 1),
        #     nn.ReLU(),
        #     nn.MaxPool2d(kernel_size = 2, stride = 2),

        #     DepthWiseSeparableConv2D(32, 64, kernelSize = 3, stride = 1, padding = 1),
        #     nn.ReLU(),
        #     nn.MaxPool2d(kernel_size = 2, stride = 2),

        #     DepthWiseSeparableConv2D(64, 128, kernelSize = 3, stride = 1, padding = 1),
        #     nn.ReLU(),
        #     nn.MaxPool2d(kernel_size = 2, stride = 2)
        # )

        self.oFeatureExtractor = nn.Sequential(
            DepthWiseSeparableConv2D(3 ,   32, 3, stride = 1, padding = 0, bias = False), nn.BatchNorm2d(32 ), nn.ReLU(), #<! (98, 98)
            DepthWiseSeparableConv2D(32,   32, 3, stride = 1, padding = 0, bias = False), nn.BatchNorm2d(32 ), nn.ReLU(), #<! (96, 96)
            DepthWiseSeparableConv2D(32,   32, 3, stride = 2, padding = 1, bias = False), nn.BatchNorm2d(32 ), nn.ReLU(), #<! (48, 48)
            DepthWiseSeparableConv2D(32,   32, 3, stride = 2, padding = 1, bias = False), nn.BatchNorm2d(32 ), nn.ReLU(), #<! (24, 24)
            DepthWiseSeparableConv2D(32,   64, 3, stride = 2, padding = 1, bias = False), nn.BatchNorm2d(64 ), nn.ReLU(), #<! (12, 12)
            DepthWiseSeparableConv2D(64,  128, 3, stride = 2, padding = 1, bias = False), nn.BatchNorm2d(128), nn.ReLU(), #<! (6 , 6)
            
            nn.Conv2d(128,  (5 + numClasses), 1, stride = 1, padding = 0, bias = False), nn.BatchNorm2d(5 + numClasses), nn.ReLU(), #<! (6 , 6)
        )

        self.oClassifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear((5 + numClasses) * ((T_IMG_SIZE[0] - 4) // 16) * ((T_IMG_SIZE[1] - 4) // 16), 384),
            nn.ReLU(),
            nn.Linear(384, gridSize * gridSize * (5 + numClasses))
        )
    
    def forward( self, tX: Tensor ) -> Tensor:
        
        tX = self.oFeatureExtractor(tX)
        tX = self.oClassifier(tX)
        tX = tX.view(-1, 5 + self.numClasses, self.gridSize, self.gridSize) #<! Reshape to (batchSize, 5 + numClasses, gridSize, gridSize)
        tX[:, 1:5, :, :] = torch.sigmoid(tX[:, 1:5, :, :]) #<! (cx, cy, w, h), Where (cx, cy) are relative to the grid cell, and (w, h) are relative to the image size
        
        return tX

* <font color='red'>(**?**)</font> Explain the classifier head. Can it be implemented without the `Linear` layer?
* <font color='brown'>(**#**)</font> Both _Objectness_ probability and the class probabilities are not normalized and kept as _logits_.
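
The spatial sizes noted in the comments of the feature extractor follow the standard convolution output size formula, `floor((n + 2p - k) / s) + 1`. A minimal sketch which traces the sizes, assuming a `100 x 100` input and 3 classes as in this notebook (`ConvOutSize()` is a hypothetical helper):

```python
def ConvOutSize( inSize: int, kernelSize: int, stride: int, padding: int ) -> int:
    # Output size of a convolution along a single spatial dimension
    return (inSize + 2 * padding - kernelSize) // stride + 1

# (kernelSize, stride, padding) of the depth wise convolution in each block
lLayers = [(3, 1, 0), (3, 1, 0), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

lSizes = [100]
for kernelSize, stride, padding in lLayers:
    lSizes.append(ConvOutSize(lSizes[-1], kernelSize, stride, padding))

print(lSizes) #<! [100, 98, 96, 48, 24, 12, 6]

numCls  = 3
numFlat = (5 + numCls) * lSizes[-1] * lSizes[-1] #<! Number of inputs of the first `Linear` layer
print(numFlat) #<! 288
```

The `1x1` convolutions use stride 1 and no padding, hence they do not change the spatial size.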

In [None]:
# Build the Model

oModel = YoloLikeModel(len(L_CLASSES), gridSize)

In [None]:
# Model Information
# Pay attention to the layers name.
torchinfo.summary(oModel, (batchSize, *(T_IMG_SIZE[::-1])), col_names = ['kernel_size', 'input_size', 'output_size', 'num_params'], device = 'cpu', row_settings = ['depth', 'var_names'])

* <font color='red'>(**?**)</font> Explain the dimensions of the last layer.
* <font color='red'>(**?**)</font> Will the model work with smaller images?

## Train the Model

This section trains the model.  

### Object Detection Loss

The loss is a composite of 3 loss functions:

$$\ell \left( \hat{\boldsymbol{y}}, \boldsymbol{y} \right) = \lambda_{\text{Object}} \cdot \ell_{\text{Object}}\left(\hat{\boldsymbol{y}}_{\text{object}},\boldsymbol{y}_{\text{object}}\right) + \lambda_{\text{Regression}} \cdot \ell_{\text{Regression}}\left(\hat{\boldsymbol{y}}_{\text{bbox}},\boldsymbol{y}_{\text{bbox}}\right) + \lambda_{\text{Classification}}\cdot\ell_{\text{CE}}\left(\hat{\boldsymbol{y}}_{\text{label}},\boldsymbol{y}_{\text{label}}\right)$$

Where
 - $\ell_{\text{Object}}$ - Binary classification loss for detection of an object within the grid cell. Calculated in each grid cell.
 - $\ell_{\text{Regression}}$ - Regression loss for the normalized coordinates of the box: `[cx, cy, w, h]`. Calculated only in cells where object exists.
 - $\ell_{\text{CE}}$ - Multiclass classification loss for the class of the object. Calculated only in cells where object exists.
 - $\lambda_{\text{Object}}, \lambda_{\text{Regression}}$ and $\lambda_{\text{Classification}}$ are the weights of each loss.

* <font color='brown'>(**#**)</font> There are various normalization methods for the coordinates of the box.  
Usually `(cx, cy)` are normalized within the cell and `(w, h)` are normalized to the image size. 

In [None]:
# Object Detection Loss
class ObjDetLoss( nn.Module ):
    def __init__( self, numCls: int, λObj: float, λReg: float, λCls: float, ϵ: float = 0.0 ) -> None:
        super(ObjDetLoss, self).__init__()

        self.numCls     = numCls
        self.λObj       = λObj #<! Objectness Loss Weight
        self.λReg       = λReg #<! Regression Loss Weight
        self.λCls       = λCls #<! Classification Loss Weight
        self.ϵ          = ϵ
        self.oObjLoss   = nn.BCEWithLogitsLoss()
        self.oRegLoss   = nn.MSELoss()
        self.oClsLoss   = nn.CrossEntropyLoss(label_smoothing = ϵ)
    
    def forward( self: Self, tYHat: torch.Tensor, lY: List[Dict[str, Tensor]] ) -> torch.Tensor:
        # `tYHat`: (batchSize, 5 + numCls, gridSize, gridSize)
        #          Each grid cell contains: [Objectness, cx, cy, w, h, Class1, Class2, ..., ClassN]
        #          Where `(cx, cy)` are relative to the grid cell, and `(w, h)` are relative to the image size.
        # `lY`: List of length batchSize, each element is a dictionary with keys: 'Boxes' and 'Labels'
        #       Boxes are normalized to [0, 1] with respect to the image size.

        batchSize = tYHat.shape[0]
        gridSize  = tYHat.shape[2]
        runDevice = tYHat.device

        # Build targets in YOLO grid format (Per cell)
        tObjTgt = torch.zeros((batchSize, gridSize, gridSize), device = runDevice, dtype = tYHat.dtype)
        tRegTgt = torch.zeros((batchSize, gridSize, gridSize, 4), device = runDevice, dtype = tYHat.dtype)
        tClsTgt = torch.zeros((batchSize, gridSize, gridSize), device = runDevice, dtype = torch.long)

        for ii in range(batchSize):
            tB = lY[ii]['Boxes']  #<! (D, 4) in YOLO format: [cx, cy, w, h] normalized
            tC = lY[ii]['Labels'] #<! (D, ) class indices
            
            if tB.numel() == 0:
                continue
            
            # Convert to grid coordinates (Same logic as `YoloGrid`)
            vCxNorm, vCyNorm, vW, vH = tB.T  #<! All in [0, 1] normalized
            # Cell indices
            vCellX = (vCxNorm * gridSize).floor().to(torch.long).clamp_(0, gridSize - 1)
            vCellY = (vCyNorm * gridSize).floor().to(torch.long).clamp_(0, gridSize - 1)
            
            # cx, cy relative to cell (in [0, 1] within the cell)
            vCxRel = vCxNorm * gridSize - vCellX.to(vCxNorm.dtype)
            vCyRel = vCyNorm * gridSize - vCellY.to(vCyNorm.dtype)
            
            tObjTgt[ii, vCellY, vCellX]    = 1.0
            # w, h stay normalized to [0, 1] (relative to image size)
            tRegTgt[ii, vCellY, vCellX, :] = torch.stack((vCxRel, vCyRel, vW, vH), dim = 1)
            tClsTgt[ii, vCellY, vCellX]    = tC.to(torch.long)
        
        # Objectness loss: over all grid cells
        objLoss = self.oObjLoss(tYHat[:, 0, :, :], tObjTgt)
        
        # Regression / Classification: only on cells with objects (GT)
        mObjMask = tObjTgt.to(torch.bool)
        if mObjMask.any():
            tYHatReg = tYHat[:, 1:5, :, :].permute(0, 2, 3, 1) #<! (B, G, G, 4)
            regLoss  = self.oRegLoss(tYHatReg[mObjMask], tRegTgt[mObjMask])

            tYHatCls = tYHat[:, 5:, :, :].permute(0, 2, 3, 1) #<! (B, G, G, numCls)
            clsLoss  = self.oClsLoss(tYHatCls[mObjMask], tClsTgt[mObjMask])
        else:
            regLoss = 0.0
            clsLoss = 0.0

        lossVal = (self.λObj * objLoss) + (self.λReg * regLoss) + (self.λCls * clsLoss)
        
        return lossVal

* <font color='green'>(**@**)</font> Read about the differentiable variants of the IoU (e.g. GIoU, DIoU, CIoU) and use one of them as the Regression loss.

### Object Detection Score

The score per image is given by:

 - For any GT Object, a point is given if there is at least 1 matching prediction.
 - For any Prediction, a point is given if there is at least one matching GT object.

A match occurs when a prediction and a GT object have the same class, their IoU is above a threshold and the objectness score of the prediction is above a threshold.
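
The counting rule can be illustrated for a single image in pure Python. This is a minimal sketch: `IoUCxCyWh()` and `ImageScore()` are hypothetical names (not part of the course modules), boxes are assumed to be `(cx, cy, w, h)` normalized to the image, and the predictions are assumed to be already filtered by the objectness threshold.

```python
def IoUCxCyWh( vA, vB ) -> float:
    # Convert (cx, cy, w, h) to corners
    aX1, aY1, aX2, aY2 = vA[0] - vA[2] / 2, vA[1] - vA[3] / 2, vA[0] + vA[2] / 2, vA[1] + vA[3] / 2
    bX1, bY1, bX2, bY2 = vB[0] - vB[2] / 2, vB[1] - vB[3] / 2, vB[0] + vB[2] / 2, vB[1] + vB[3] / 2
    # Intersection (Zero if the boxes do not overlap)
    interW = max(0.0, min(aX2, bX2) - max(aX1, bX1))
    interH = max(0.0, min(aY2, bY2) - max(aY1, bY1))
    interArea = interW * interH
    unionArea = (vA[2] * vA[3]) + (vB[2] * vB[3]) - interArea

    return interArea / unionArea if unionArea > 0 else 0.0

def ImageScore( lPred, lGt, iouThr: float ) -> float:
    # `lPred`, `lGt`: Lists of (classIdx, (cx, cy, w, h))
    mMatch = [[(tuP[0] == tuG[0]) and (IoUCxCyWh(tuP[1], tuG[1]) > iouThr) for tuG in lGt] for tuP in lPred]
    numPredMatch = sum(any(lRow) for lRow in mMatch) #<! Predictions with at least one matching GT
    numGtMatch   = sum(any(mMatch[ii][jj] for ii in range(len(lPred))) for jj in range(len(lGt))) #<! GT with at least one matching prediction
    numTotal     = len(lPred) + len(lGt)

    return (numPredMatch + numGtMatch) / numTotal if numTotal > 0 else 0.0

lGt   = [(0, (0.3, 0.3, 0.2, 0.2)), (1, (0.7, 0.7, 0.2, 0.2))]
lPred = [(0, (0.3, 0.3, 0.2, 0.2)), (2, (0.7, 0.7, 0.2, 0.2))] #<! The 2nd prediction has the wrong class

valScore = ImageScore(lPred, lGt, iouThr = 0.5)
print(valScore) #<! 0.5
```

In the example one prediction and one GT object are matched out of 2 predictions and 2 GT objects.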

* <font color='red'>(**?**)</font> What are the bounds of the values of the score function?
* <font color='red'>(**?**)</font> Is a higher or lower value better for the score?
* <font color='brown'>(**#**)</font> A common score for the detection is the Mean Average Precision (mAP):  
  - [Naoki Shibuya - Object Detection: mean Average Precision (mAP)](https://naokishibuya.github.io/blog/2022-05-27-object-detection-mean-average-precision).  
  - [Cartucho - mAP](https://github.com/Cartucho/mAP).  
  - [Lei Mao - Mean Average Precision mAP for Object Detection](https://leimao.github.io/blog/Object-Detection-Mean-Average-Precision-mAP).

In [None]:
# Object Detection Score
class ObjDetScore( nn.Module ):
    def __init__( self, numCls: int, gridSize: int, probThr: float, iouThr: float, runDevice: torch.device ) -> None:
        super(ObjDetScore, self).__init__()

        self.numCls    = numCls
        self.gridSize  = gridSize
        self.probThr   = probThr
        self.iouThr    = iouThr
        self.runDevice = runDevice
    
    def forward( self: Self, tYHat: torch.Tensor, lY: List[Dict[str, Tensor]] ) -> float:
        # `tYHat`: (batchSize, 5 + numCls, gridSize, gridSize)
        #          Each grid cell contains: [Objectness, cx, cy, w, h, Class1, Class2, ..., ClassN]
        #          Where `(cx, cy)` are relative to the grid cell, and `(w, h)` are relative to the image size.
        # `lY`: List of length batchSize, each element is a dictionary with keys: 'Boxes' and 'Labels'
        #       Boxes are normalized to [0, 1] with respect to the image size.

        batchSize = tYHat.shape[0]
        gridSize  = tYHat.shape[2]

        numObjDet = 0
        numMatch  = 0

        for ii in range(batchSize):
            mB = lY[ii]['Boxes']  #<! (D, 4) in YOLO format: [cx, cy, w, h] normalized
            vC = lY[ii]['Labels'] #<! (D, ) class indices

            mPredMask = torch.sigmoid(tYHat[ii, 0, :, :]).gt(self.probThr) #<! (G, G)

            numObj  = int(mB.shape[0])
            numPred = int(mPredMask.sum().item())

            numObjDetBatch = numObj + numPred
            numObjDet     += numObjDetBatch

            # If either side is empty, no matches are possible
            if (numObj == 0) or (numPred == 0):
                continue

            # Gather predicted boxes / classes for all predictions with objectness above threshold
            mIdx      = torch.nonzero(mPredMask, as_tuple = False) #<! (P, 2) -> (y, x)
            vPy, vPx  = mIdx[:, 0], mIdx[:, 1]

            # Predicted class per prediction
            tClsP = tYHat[ii, 5:, vPy, vPx]                     #<! (numCls, P)
            vClsP = torch.argmax(tClsP, dim = 0).to(torch.long) #<! (P, )

            # Predicted boxes per prediction in image-normalized cxcywh
            mBP = tYHat[ii, 1:5, vPy, vPx].T #<! (P, 4) -> (cx_cell, cy_cell, w, h)
            mBP = mBP.clone()
            mBP[:, 0] = (vPx.to(mBP.dtype) + mBP[:, 0]) / gridSize
            mBP[:, 1] = (vPy.to(mBP.dtype) + mBP[:, 1]) / gridSize

            # IoU matrix (P, D)
            mIoU  = torchvision.ops.box_iou(mBP, mB, 'cxcywh') #<! (P, D)

            vC = vC.to(torch.long) #<! Cast the GT labels to integer class indices for the comparison

            # Boolean match matrix: (P, D)
            mMatchMat = (mIoU > self.iouThr) & (vClsP[:, None] == vC[None, :])

            # Count Valid Matches from Predictions to GT
            # Valid Match: prediction has at least one GT with correct class and IoU > thr
            numMatchBatchPredGt = int(mMatchMat.any(dim = 1).sum().item())

            # Count Valid Matches from GT to Predictions
            # Valid Match: GT has at least one prediction with correct class and IoU > thr
            numMatchBatchGtPred = int(mMatchMat.any(dim = 0).sum().item())

            # Sum the Total Matches per Batch
            numMatchBatch = numMatchBatchPredGt + numMatchBatchGtPred
            numMatch += numMatchBatch

        if numObjDet > 0:
            valScore = float(numMatch / numObjDet)
        else:
            valScore = 0.0

        return valScore

In [None]:
# Run Device

runDevice = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') #<! The 1st CUDA device

In [None]:
# Loss and Score Function

hL = ObjDetLoss(numCls = numCls, λObj = λObj, λReg = λReg, λCls = λCls, ϵ = ϵ)
hS = ObjDetScore(numCls = numCls, gridSize = gridSize, probThr = probThr, iouThr = iouThr, runDevice = runDevice)

hL = hL.to(runDevice)
hS = hS.to(runDevice)

In [None]:
# Training Loop

oModel = oModel.to(runDevice)
oOpt = torch.optim.AdamW(oModel.parameters(), lr = 1e-5, betas = (0.9, 0.99), weight_decay = 1e-5) #<! Define optimizer
oSch = torch.optim.lr_scheduler.OneCycleLR(oOpt, max_lr = 7.5e-4, total_steps = numEpochs)
_, lTrainLoss, lTrainScore, lValLoss, lValScore, lLearnRate = TrainModel(oModel, dlTrain, dlVal, oOpt, numEpochs, hL, hS, oSch = oSch)

In [None]:
# Plot Training Phase

hF, vHa = plt.subplots(nrows = 1, ncols = 3, figsize = (12, 5))
vHa = np.ravel(vHa)

hA = vHa[0]
hA.plot(lTrainLoss, lw = 2, label = 'Train')
hA.plot(lValLoss, lw = 2, label = 'Validation')
hA.set_title('Object Detection Loss')
hA.set_xlabel('Epoch')
hA.set_ylabel('Loss')
hA.legend()

hA = vHa[1]
hA.plot(lTrainScore, lw = 2, label = 'Train')
hA.plot(lValScore, lw = 2, label = 'Validation')
hA.set_title('Object Detection Score')
hA.set_xlabel('Epoch')
hA.set_ylabel('Score')
hA.legend()

hA = vHa[2]
hA.plot(lLearnRate, lw = 2)
hA.set_title('Learn Rate Scheduler')
hA.set_xlabel('Epoch')
hA.set_ylabel('Learn Rate');

In [None]:
# Plot Prediction

rndIdx = np.random.randint(numSamplesVal)

tX, tuY = dsVal[rndIdx]
vYGt    = tuY[0] #<! GT labels (D,)
mBGt    = tuY[1] #<! GT boxes (D, 4) in YOLO format [cx, cy, w, h] normalized

with torch.no_grad():
    tX = torch.tensor(tX)
    tX = torch.unsqueeze(tX, 0)
    tX = tX.to(runDevice)
    tYHat = oModel(tX).detach().cpu()

# Extract predictions with objectness above threshold
tYHat = tYHat[0] #<! (5 + numCls, gridSize, gridSize)

# Apply sigmoid to objectness (RAW logits)
mObjProb  = torch.sigmoid(tYHat[0, :, :]) #<! (G, G)
# Apply threshold
mPredMask = mObjProb.gt(probThr)          #<! (G, G)

# Get indices of cells with predictions
mIdx     = torch.nonzero(mPredMask, as_tuple = False) #<! (P, 2) -> (y, x)
numPred  = mIdx.shape[0]

if numPred > 0:
    vPy, vPx = mIdx[:, 0], mIdx[:, 1]
    
    # Extract predicted boxes: (cx_rel, cy_rel, w, h) -> convert to image normalized (cx, cy, w, h)
    mBPred = tYHat[1:5, vPy, vPx].T #<! (P, 4) -> (cx_cell, cy_cell, w, h)
    mBPred = mBPred.clone()
    mBPred[:, 0] = (vPx.to(mBPred.dtype) + mBPred[:, 0]) / gridSize #<! cx normalized
    mBPred[:, 1] = (vPy.to(mBPred.dtype) + mBPred[:, 1]) / gridSize #<! cy normalized
    # w, h are already normalized to [0, 1]
    
    # Extract predicted classes
    tClsPred = tYHat[5:, vPy, vPx]                     #<! (numCls, P)
    vClsPred = torch.argmax(tClsPred, dim = 0).numpy() #<! (P,)
    
    # Extract objectness scores for the predictions
    vObjScore = mObjProb[vPy, vPx].numpy() #<! (P,)
    
    mBPred = mBPred.numpy() #<! (P, 4)
else:
    mBPred    = np.empty((0, 4))
    vClsPred  = np.array([])
    vObjScore = np.array([])

print(f'Number of GT objects: {len(vYGt)}')
print(f'Number of predictions above threshold: {numPred}')
print(f'Predicted boxes shape: {mBPred.shape}')
print(f'GT boxes shape: {mBGt.shape}')

# Plot GT and Predictions
mImg = tX[0].cpu().numpy()
mImg = np.transpose(mImg, (1, 2, 0))

lYGtStr   = [L_CLASSES[int(c)] for c in vYGt]
lYPredStr = [L_CLASSES[int(c)] for c in vClsPred]

hA = PlotBox(mImg, lYGtStr, mBGt) #<! Plot GT
for ii, vB in enumerate(mBPred):
    hA = PlotBBox(hA, lYPredStr[ii], vB) #<! Plot Predictions

* <font color='green'>(**@**)</font> Use the _mAP_ score in the training evaluation. You should use [TorchMetrics' `MeanAveragePrecision`](https://lightning.ai/docs/torchmetrics/stable/detection/mean_average_precision.html).