### 🧠 YOLOv1 – You Only Look Once (Version 1) – Detailed Explanation


YOLOv1, introduced by **Joseph Redmon et al. in 2015**, marked a major shift in object detection by framing it as a **single regression problem**, rather than a series of region proposals and classifications like R-CNNs.

It offered **real-time object detection** with a unique approach: **one neural network processes the entire image in one forward pass**, hence the name **"You Only Look Once."**



### 📌 Key Innovations in YOLOv1

1. **Unified Architecture:** Single CNN for bounding box and class prediction.
2. **Grid-Based Prediction:** Image divided into an S×S grid (typically 7×7).
3. **End-to-End Training:** Entire model trained using a single loss function.
4. **Fast and Simple:** Real-time inference (about 45 FPS in the paper) on a single GPU.



### 🔧 Architecture Overview

1. **Input Size:** 448×448 RGB image.
2. **Convolutional Layers:** 24 convolutional layers (feature extraction).
3. **Fully Connected Layers:** 2 FC layers (for predicting bounding boxes and class scores).
4. **Output:** A tensor of size **S × S × (B × 5 + C)**, where:

   * **S = 7** (grid size)
   * **B = 2** (number of boxes per grid cell)
   * **5** = (x, y, w, h, confidence)
   * **C = number of classes**

Thus, for Pascal VOC (20 classes):

$$
7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30 = 1470 \text{ outputs}
$$
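The same arithmetic can be checked in a few lines of Python. This is only an illustration; the helper name `yolo_output_size` is invented here and is not part of the notebook's code.

```python
def yolo_output_size(S: int = 7, B: int = 2, C: int = 20) -> int:
    """Number of values predicted per image: S x S cells, each with B boxes and C class scores."""
    per_cell = B * 5 + C  # each box contributes (x, y, w, h, confidence)
    return S * S * per_cell


print(yolo_output_size())            # Pascal VOC settings: 7 * 7 * 30
print(yolo_output_size(7, 2, 3))     # a hypothetical 3-class dataset: 7 * 7 * 13
```

Changing the number of classes only changes the last dimension of the output tensor; the grid size and boxes per cell stay the same.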



### 🧮 How YOLOv1 Works

#### 1. **Grid Division**

* The image is divided into a **7x7 grid**.
* Each grid cell is responsible for detecting an object **only if the object’s center falls inside it**.
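As a rough sketch of the assignment rule above, the responsible cell can be found from a normalized object center. The function name `responsible_cell` and the assumption that centers are normalized to `[0, 1]` are choices made for this example.

```python
def responsible_cell(cx: float, cy: float, S: int = 7) -> tuple:
    """Return (row, col) of the grid cell that contains an object center.

    cx and cy are assumed to be normalized to [0, 1] relative to the image.
    """
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays in the last column
    row = min(int(cy * S), S - 1)  # clamp so cy == 1.0 stays in the last row
    return row, col


print(responsible_cell(0.5, 0.5))    # object centered in the image
print(responsible_cell(0.95, 0.10))  # object near the top-right corner
```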

#### 2. **Bounding Box Prediction**

Each grid cell predicts:

* **2 bounding boxes**:

  * Each box has 5 predictions: `(x, y, w, h, confidence)`
  * `(x, y)` are relative to the grid cell, and `w`, `h` are relative to the whole image.
* **1 set of class probabilities**:

  * Shared across both bounding boxes.
  * 20 probabilities for 20 classes in Pascal VOC.
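To make the two coordinate conventions concrete, here is a minimal sketch that converts a cell-relative box center back to image-relative coordinates. The function name and argument order are assumptions made for this example only.

```python
def cell_box_to_image_box(row, col, x_cell, y_cell, w, h, S=7):
    """Convert a cell-relative box center to image-relative coordinates.

    x_cell, y_cell: center offset inside the cell, each in [0, 1].
    w, h: box size relative to the whole image (passed through unchanged).
    """
    cx = (col + x_cell) / S  # columns index the horizontal axis
    cy = (row + y_cell) / S  # rows index the vertical axis
    return cx, cy, w, h


# A box centered in the middle cell of a 7x7 grid
print(cell_box_to_image_box(row=3, col=3, x_cell=0.5, y_cell=0.5, w=0.4, h=0.2))
```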

#### 3. **Confidence Score**

For each bounding box, the confidence score is:

$$
\text{Confidence} = P(\text{object}) \times \text{IoU}_{\text{pred, truth}}
$$

This score indicates:

* Whether an object is present.
* How well the predicted box matches the ground truth box (via IoU).
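A tiny sketch of how the confidence target behaves, assuming `P(object)` is treated as 1 when an object is present and 0 otherwise (the helper name is invented for illustration):

```python
def confidence_target(object_present: bool, iou_with_truth: float) -> float:
    """Target confidence for one predicted box: P(object) * IoU(pred, truth)."""
    p_object = 1.0 if object_present else 0.0
    return p_object * iou_with_truth


print(confidence_target(True, 0.75))   # object present: target equals the IoU
print(confidence_target(False, 0.75))  # no object: target is zero
```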



### 🧠 Loss Function

YOLOv1 uses a **custom loss function** that penalizes:

* Localization error (bounding box coordinates).
* Confidence score error (objectness).
* Classification error (class probabilities).

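All parts use a **sum-squared error** (MSE-style) loss, balanced by two weights:

* `λ_coord` (5 in the paper) increases the penalty on box coordinate errors.
* `λ_noobj` (0.5 in the paper) decreases the confidence penalty for boxes that contain no object.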
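For reference, the loss can be written out as below, following the notation of the paper. The indicator with superscript "obj" and indices i, j is 1 when box j of cell i is responsible for an object, the indicator with index i alone is 1 when an object appears in cell i, and each pair of hatted and unhatted symbols is a prediction and its target. Treat this as a transcription for orientation rather than a derivation.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

The square roots on width and height make the same absolute error count more for small boxes than for large ones.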



### 📉 Limitations of YOLOv1

| Issue                                       | Description                                                           |
| ------------------------------------------- | --------------------------------------------------------------------- |
| 🧠 Struggles with small/overlapping objects | One object per grid cell – can’t detect multiple objects in one cell. |
| 🧠 Poor localization for unusual shapes     | Bounding box prediction isn't as precise as region proposal methods.  |
| 🧠 Uses MSE for classification              | Not optimal for multi-class probabilities.                            |
| 🧠 Fixed number of boxes                    | At most S×S×B = 98 boxes per image, with a single class per cell.     |



### ✅ Strengths of YOLOv1

| Feature                | Benefit                                                              |
| ---------------------- | -------------------------------------------------------------------- |
| ⚡ Real-time Detection  | About 45 FPS on a Titan X GPU, as reported in the paper.             |
| 🔄 Unified Pipeline    | Single model, end-to-end trainable.                                  |
| 🔍 Global Context      | Looks at the whole image at once, unlike R-CNN which looks at parts. |
| 🧩 Simple Architecture | Easy to train and deploy.                                            |



### 📚 Paper Reference:

* **Title**: You Only Look Once: Unified, Real-Time Object Detection
* **Authors**: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
* **Published**: arXiv preprint 2015; CVPR 2016
* [Link to Paper](https://arxiv.org/abs/1506.02640)



## Implementation

### Import Libraries

In [1]:
import os
import xml.etree.ElementTree as ET
from collections import Counter

import cv2
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import skimage
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.transforms.functional as FT
from PIL import Image
from skimage import io
from torch.utils.data import DataLoader
from tqdm import tqdm

# Fix the random seed so that weight initialization is reproducible
seed = 123
torch.manual_seed(seed)

## Model Architecture 

In [21]:
architecture_config = [
    # (kernel_size, filters, stride, padding)
    (7, 64, 2, 3),          # Conv layer: 7x7 kernel, 64 filters, stride=2, padding=3 -> reduces spatial dim from 448x448 to 224x224
    "M",                   # MaxPooling: 2x2 with stride 2 -> reduces spatial dim to 112x112



    (3, 192, 1, 1),        # Conv layer: 3x3 kernel, 192 filters, stride=1, padding=1 -> keeps spatial dim 112x112
    "M",                   # MaxPooling -> reduces spatial dim to 56x56


    (1, 128, 1, 0),        # Conv: 1x1 kernel, 128 filters, stride=1, no padding -> used for dimensionality reduction
    (3, 256, 1, 1),        # Conv: 3x3 kernel, 256 filters, stride=1, padding=1
    (1, 256, 1, 0),        # Conv: 1x1, 256 filters
    (3, 512, 1, 1),        # Conv: 3x3, 512 filters
    "M",                   # MaxPooling -> spatial dim becomes 28x28



    # This block is repeated 4 times:
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    # → Adds: 1x1 conv (256 filters) followed by 3x3 conv (512 filters), repeated 4 times
    


    (1, 512, 1, 0),        # Conv: 1x1, 512 filters
    (3, 1024, 1, 1),       # Conv: 3x3, 1024 filters
    "M",                   # MaxPooling -> spatial dim becomes 14x14


    # This block is repeated 2 times:
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    # → Adds: 1x1 conv (512 filters) followed by 3x3 conv (1024 filters), repeated 2 times

    (3, 1024, 1, 1),       # Conv: 3x3, 1024 filters
    (3, 1024, 2, 1),       # Conv: 3x3, 1024 filters, stride=2 → spatial dim becomes 7x7
    (3, 1024, 1, 1),       # Conv: 3x3, 1024 filters
    (3, 1024, 1, 1),       # Conv: 3x3, 1024 filters

    # Fully connected (FC) layers will be added later separately
]
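The spatial sizes noted in the comments can be double-checked without building the model. The sketch below walks a compact copy of the configuration and applies the usual convolution size formula `floor((size + 2*padding - kernel) / stride) + 1`. The names `config`, `conv_out`, and `trace` are invented for this check.

```python
# Compact copy of the configuration above (same entries, comments removed)
config = [
    (7, 64, 2, 3), "M",
    (3, 192, 1, 1), "M",
    (1, 128, 1, 0), (3, 256, 1, 1), (1, 256, 1, 0), (3, 512, 1, 1), "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), (3, 1024, 1, 1), "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1), (3, 1024, 2, 1), (3, 1024, 1, 1), (3, 1024, 1, 1),
]


def conv_out(size, kernel, stride, padding):
    """Spatial size after a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(config, size=448):
    """Return (final spatial size, number of conv layers) for a square input."""
    n_convs = 0
    for entry in config:
        if isinstance(entry, tuple):      # single conv layer
            size = conv_out(size, entry[0], entry[2], entry[3])
            n_convs += 1
        elif isinstance(entry, str):      # "M": 2x2 max pool with stride 2
            size = size // 2
        elif isinstance(entry, list):     # [conv1, conv2, repeats]
            *convs, repeats = entry
            for _ in range(repeats):
                for kernel, _filters, stride, padding in convs:
                    size = conv_out(size, kernel, stride, padding)
                    n_convs += 1
    return size, n_convs


print(trace(config))  # final grid size and conv layer count
```

For a 448×448 input this gives a 7×7 feature map from 24 convolutional layers, matching the architecture overview.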


### 🔷 1. CNNBlock Class – Basic Convolution Block

In [6]:
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)
        
    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))


✅ Explanation:

A custom block used throughout the architecture. It is composed of:

- `Conv2d` without a bias term (the learnable shift in the following `BatchNorm2d` makes a conv bias redundant),
- `BatchNorm2d` for faster, more stable convergence,
- `LeakyReLU(0.1)` as the activation.



#### 📌 **Why we create it:**

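* The backbone repeats the same pattern many times, so it is wrapped in one reusable block:

  * A Convolutional Layer (to extract spatial features),
  * Batch Normalization (to stabilize and speed up training; the original YOLOv1 paper does not use it, so it is an addition in this implementation),
  * LeakyReLU activation (helps avoid the dying ReLU problem).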

#### 🔧 **What it does:**

* Applies feature transformation via learned filters (Conv2D),
* Normalizes outputs (BatchNorm),
* Adds non-linearity (LeakyReLU).

#### 🟩 **Output:**

* A 4D tensor: `[batch_size, out_channels, H, W]`
* Represents extracted features from the input or previous layer.

### 🔷 2. YoloV1 Class – Main Model Class

In [7]:
class YoloV1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(YoloV1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)


✅ Explanation:

In `__init__`, the model:

- Reads `architecture_config` (the list defined above).

- Creates the convolutional layers using `_create_conv_layers()`.

- Adds the fully connected layers using `_create_fcs()`.

### 🔷 3. forward() Method – Forward Pass Logic

In [8]:
def forward(self, x):
    x = self.darknet(x)
    return self.fcs(torch.flatten(x, start_dim=1))


#### 📌 **Why we create it:**

* Defines how the input image flows through the model to get predictions.

#### 🔧 **What it does:**

* Passes input through:

  * `darknet` → CNN backbone,
  * `fcs` → Fully connected head after flattening the CNN output.

#### 🟩 **Output:**

* A **single flat prediction vector** per image, of length `S * S * (C + B * 5)`:

  * It is later reshaped into S×S grid cells,
  * Each cell holds the class scores and bounding box predictions.

### 🔷 4. _create_conv_layers() – CNN Backbone Construction

In [9]:
def _create_conv_layers(self, architecture):
    layers = []
    in_channels = self.in_channels
    
    for x in architecture:
        if type(x) == tuple:
            layers += [CNNBlock(in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3])]
            in_channels = x[1]
        
        elif type(x) == str:
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        
        elif type(x) == list:
            conv1 = x[0]
            conv2 = x[1]
            repeats = x[2]
            
            for _ in range(repeats):
                layers += [CNNBlock(in_channels, conv1[1], kernel_size=conv1[0], stride=conv1[2], padding=conv1[3])]
                layers += [CNNBlock(conv1[1], conv2[1], kernel_size=conv2[0], stride=conv2[2], padding=conv2[3])]
                in_channels = conv2[1]
    
    return nn.Sequential(*layers)


#### 📌 **Why we create it:**

* YOLOv1 uses a custom CNN (inspired by GoogLeNet) to extract **rich feature maps** from the input image.
* These convolutional layers **encode position, texture, and shape** of objects.

#### 🔧 **What it does:**

* Converts the `architecture_config` into actual PyTorch layers:

  * `Tuple` → Single CNNBlock.
  * `"M"` → MaxPool (downsamples spatial resolution).
  * `List` → Repeated mini CNN-blocks.

#### 🟩 **Output:**

* A tensor of shape `[batch_size, 1024, S, S]` (typically S = 7).
* Each of the `S x S` grid cells contains **deep features** used to:

  * Predict bounding boxes,
  * Predict class probabilities.


### 🔷 5. _create_fcs() – Fully Connected Layers

In [10]:
def _create_fcs(self, split_size, num_boxes, num_classes):
    S, B, C = split_size, num_boxes, num_classes
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(1024 * S * S, 496),  # Optionally change to 4096 for original paper
        nn.Dropout(0.0),
        nn.LeakyReLU(0.1),
        nn.Linear(496, S * S * (C + B * 5))
    )


#### 📌 **Why we create it:**

* After convolution, YOLO flattens the spatial feature map to feed into a prediction head.
* The fully connected layers predict:

  * Class probabilities,
  * Bounding box coordinates,
  * Confidence scores.

#### 🔧 **What it does:**

* `nn.Flatten()` → Flattens `[batch_size, 1024, 7, 7]` → `[batch_size, 1024*7*7]`
* `nn.Linear(...)` → Reduces the dimension to 496 (4096 in the original paper),
* `nn.Dropout` & `LeakyReLU` → Regularization & non-linearity (with `p=0.0`, as in the code above, dropout is effectively disabled),
* Final `Linear` layer outputs `S * S * (C + B×5)`.

#### 🟩 **Output:**

* `[batch_size, S*S*(C + B*5)]` → Can be reshaped to `[batch_size, S, S, C + B*5]`
* For each grid cell:

  * C class scores,
  * B bounding boxes (x, y, w, h, confidence).


### Example Flow (assuming 448×448 input and S=7, B=2, C=20):

| Stage       | Input Size         | Output Size       | Purpose                          |
| ----------- | ------------------ | ----------------- | -------------------------------- |
| Conv Layers | `[1, 3, 448, 448]` | `[1, 1024, 7, 7]` | Extract features                 |
| Flatten     | `[1, 1024, 7, 7]`  | `[1, 50176]`      | Prepare for dense layers         |
| FC Layer 1  | `[1, 50176]`       | `[1, 496]`        | Compress features                |
| FC Layer 2  | `[1, 496]`         | `[1, 1470]`       | Final prediction (7×7×30)        |
| Reshape     | `[1, 1470]`        | `[1, 7, 7, 30]`   | Class & bbox prediction per grid |



### Breakdown of Final Output: `7x7x30`

* 7×7 grid → 49 cells.
* For each cell:

  * 20 class scores (C),
  * 2 bounding boxes (B=2), each with 5 values (x, y, w, h, confidence).

Total: `7 × 7 × (20 + 2×5) = 7 × 7 × 30 = 1470`
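To see how the flat 1470-value vector maps onto grid cells, the following plain-Python sketch slices it the same way a row-major reshape to `(S, S, C + B*5)` would. The helper names are invented, and the layout (class scores first, then B groups of five box values) follows the breakdown above; the order of the five values inside each group depends on how the labels are encoded.

```python
S, B, C = 7, 2, 20
DEPTH = C + 5 * B  # 30 values per cell


def cell_vector(flat, row, col):
    """Slice one cell's values out of a flat, row-major prediction of length S*S*DEPTH."""
    start = (row * S + col) * DEPTH
    return flat[start:start + DEPTH]


def split_cell(cell):
    """Split a cell vector into C class scores and B box vectors of 5 values each."""
    class_scores = cell[:C]
    boxes = [cell[C + 5 * b: C + 5 * (b + 1)] for b in range(B)]
    return class_scores, boxes


flat = list(range(S * S * DEPTH))  # stand-in for one image's 1470 outputs
scores, boxes = split_cell(cell_vector(flat, row=0, col=1))
print(len(flat), len(scores), [len(b) for b in boxes])
```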


### Complete Block

In [2]:
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)
        
    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))
    
class YoloV1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(YoloV1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)
        
    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))
    
    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels
        
        for x in architecture:
            if type(x) == tuple:
                layers += [CNNBlock(in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3])]
                in_channels = x[1]
            elif type(x) == str:
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            elif type(x) == list:
                conv1 = x[0] #Tuple
                conv2 = x[1] #Tuple
                repeats = x[2] #Int
                
                for _ in range(repeats):
                    layers += [CNNBlock(in_channels, conv1[1], kernel_size=conv1[0], stride=conv1[2], padding=conv1[3])]
                    layers += [CNNBlock(conv1[1], conv2[1], kernel_size=conv2[0], stride=conv2[2], padding=conv2[3])]
                    in_channels = conv2[1]
                    
        return nn.Sequential(*layers)
    
    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes
        # The original paper uses nn.Linear(1024 * S * S, 4096) instead of 496.
        # The output is later reshaped to (S, S, C + B * 5):
        # 30 channels for C=20, B=2 (Pascal VOC), or 13 for C=3, B=2.
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 496),
            nn.Dropout(0.0),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),
        )

## **Intersection Over Union**

In [3]:
def intersection_over_union(boxes_preds, boxes_labels, box_format='midpoint'):
    """
    Calculates intersection over union
    
    Parameters:
        boxes_preds (tensor): Predictions of Bounding Boxes (BATCH_SIZE, 4)
        boxes_labels (tensor): Correct labels of Bounding Boxes (BATCH_SIZE, 4)
        box_format (str): midpoint/corners, if boxes are (x,y,w,h) or (x1,y1,x2,y2) respectively.
    
    Returns:
        tensor: Intersection over union for all examples
    """
    # boxes_preds and boxes_labels both have shape (N, 4), where N is the number of boxes
    
    if box_format == 'midpoint':
        box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
        box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
        box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
        box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
        box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
        box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
        box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
        box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2
        
    if box_format == 'corners':
        box1_x1 = boxes_preds[..., 0:1]
        box1_y1 = boxes_preds[..., 1:2]
        box1_x2 = boxes_preds[..., 2:3]
        box1_y2 = boxes_preds[..., 3:4] # Output tensor should be (N, 1). If we only use 3, we go to (N)
        box2_x1 = boxes_labels[..., 0:1]
        box2_y1 = boxes_labels[..., 1:2]
        box2_x2 = boxes_labels[..., 2:3]
        box2_y2 = boxes_labels[..., 3:4]
    
    x1 = torch.max(box1_x1, box2_x1)
    y1 = torch.max(box1_y1, box2_y1)
    x2 = torch.min(box1_x2, box2_x2)
    y2 = torch.min(box1_y2, box2_y2)
    
    #.clamp(0) is for the case when they don't intersect. Since when they don't intersect, one of these will be negative so that should become 0
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    
    box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
    box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
    
    return intersection / (box1_area + box2_area - intersection + 1e-6)

## 🔶 Purpose of the Function:

To compute **IoU = Area of Overlap / Area of Union** between predicted and true bounding boxes.

It supports two formats:

* `"midpoint"`: boxes given as (x\_center, y\_center, width, height)
* `"corners"`: boxes given as (x1, y1, x2, y2)



## 🧱 Step-by-Step Explanation

### ✅ Step 1: Convert box format (if needed)

```python
if box_format == 'midpoint':
    box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
    box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
    box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
    box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
    ...
```

* This converts `(x_center, y_center, width, height)` → `(x1, y1, x2, y2)` for both predictions and labels.
* Required because intersection calculation is easier in corner format.



### ✅ Step 2: Extract corners directly (if already in "corners" format)

```python
if box_format == 'corners':
    box1_x1 = boxes_preds[..., 0:1]
    box1_y1 = boxes_preds[..., 1:2]
    box1_x2 = boxes_preds[..., 2:3]
    box1_y2 = boxes_preds[..., 3:4]
    ...
```



### ✅ Step 3: Calculate the intersection coordinates

```python
x1 = torch.max(box1_x1, box2_x1)
y1 = torch.max(box1_y1, box2_y1)
x2 = torch.min(box1_x2, box2_x2)
y2 = torch.min(box1_y2, box2_y2)
```

* These are the coordinates of the **intersection rectangle**.
* Only the overlapping region between the two boxes.



### ✅ Step 4: Calculate the area of intersection

```python
intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
```

* `(x2 - x1)` and `(y2 - y1)` give width and height of overlap.
* `clamp(0)` ensures **no negative values** (if boxes don’t intersect).
* Area = width × height.



### ✅ Step 5: Calculate individual areas

```python
box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
```

* Just standard area of a rectangle = (x2 - x1) × (y2 - y1)
* `abs` ensures valid area even if coordinates are flipped.



### ✅ Step 6: Compute IoU

```python
return intersection / (box1_area + box2_area - intersection + 1e-6)
```

* **Union** = Area of box1 + Area of box2 - Area of intersection
* Add `1e-6` to denominator to avoid division by zero
* Final result = **IoU**, ranges from 0 to 1.
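The steps above can be followed by hand on a small example. The sketch below restates the corner-format computation in plain Python for a single pair of boxes; it is an illustration with an invented name (`iou_corners`), not a replacement for the batched tensor version.

```python
def iou_corners(box_a, box_b, eps=1e-6):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection
    intersection = max(x2 - x1, 0) * max(y2 - y1, 0)

    area_a = abs((box_a[2] - box_a[0]) * (box_a[3] - box_a[1]))
    area_b = abs((box_b[2] - box_b[0]) * (box_b[3] - box_b[1]))
    return intersection / (area_a + area_b - intersection + eps)


print(iou_corners((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
print(iou_corners((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

In the first case the two 2×2 boxes overlap in a 1×1 square, so the IoU is roughly 1/7.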



## 🔁 Input/Output Shape Summary

| Input          | Shape                            |
| -------------- | -------------------------------- |
| `boxes_preds`  | `[N, 4]`                         |
| `boxes_labels` | `[N, 4]`                         |
| Output         | `[N, 1]` (IoU for each box pair) |



## 🧠 Use Case in YOLO

This function is critical for:

* Calculating **loss** during training (to determine how well boxes match),
* Applying **Non-Max Suppression** (to eliminate duplicate detections).



### **Non-Max Suppression**

In [4]:
def non_max_suppression(bboxes, iou_threshold, threshold, box_format="corners"):
    """
    Does Non Max Suppression given bboxes
    Parameters:
        bboxes (list): list of lists containing all bboxes with each bboxes
        specified as [class_pred, prob_score, x1, y1, x2, y2]
        iou_threshold (float): IoU at or above which a lower-scoring box of the same class is suppressed
        threshold (float): threshold to remove predicted bboxes (independent of IoU) 
        box_format (str): "midpoint" or "corners" used to specify bboxes
    Returns:
        list: bboxes after performing NMS given a specific IoU threshold
    """

    assert type(bboxes) == list

    bboxes = [box for box in bboxes if box[1] > threshold]
    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)
    bboxes_after_nms = []

    while bboxes:
        chosen_box = bboxes.pop(0)

        bboxes = [
            box
            for box in bboxes
            if box[0] != chosen_box[0]
            or intersection_over_union(
                torch.tensor(chosen_box[2:]),
                torch.tensor(box[2:]),
                box_format=box_format,
            )
            < iou_threshold
        ]

        bboxes_after_nms.append(chosen_box)

    return bboxes_after_nms

### ✅ **Function Goal**

Filter out overlapping bounding boxes that likely refer to the same object, keeping only the one with the highest confidence.



### 📥 **Input Parameters**

* `bboxes`: A list of predicted bounding boxes, where each box is of the form:
  `[class_pred, prob_score, x1, y1, x2, y2]`

* `iou_threshold`: If the IoU between two boxes of the same class reaches this value, the box with the lower score is suppressed.

* `threshold`: Minimum probability score for a box to be considered.

* `box_format`: `"corners"` (x1, y1, x2, y2) or `"midpoint"` (cx, cy, w, h)



### 🔢 **Step-by-Step Explanation**

#### **1. Ensure input type is list**

```python
assert type(bboxes) == list
```

Safeguard: NMS expects a list of lists.



#### **2. Remove low-confidence boxes**

```python
bboxes = [box for box in bboxes if box[1] > threshold]
```

Keep only boxes with a confidence score higher than the threshold.



#### **3. Sort by confidence score (descending)**

```python
bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)
```

This ensures that the highest confidence box is always selected first.



#### **4. NMS Loop**

```python
while bboxes:
    chosen_box = bboxes.pop(0)
```

Pick the highest-scoring box (`chosen_box`) and remove it from the list.



#### **5. Filter Remaining Boxes**

```python
bboxes = [
    box
    for box in bboxes
    if box[0] != chosen_box[0]
    or intersection_over_union(
        torch.tensor(chosen_box[2:]),
        torch.tensor(box[2:]),
        box_format=box_format,
    ) < iou_threshold
]
```

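* Loop over the rest of the boxes and keep a box if it belongs to a **different class**, or if its IoU with the chosen box is **below `iou_threshold`**.
* In other words, a box is removed only when it has the **same class** as the chosen box **and** an IoU of at least `iou_threshold` with it.
* This suppresses duplicates of the same object.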



#### **6. Add Chosen Box to Final List**

```python
bboxes_after_nms.append(chosen_box)
```

Keep the chosen box: it is the highest-scoring box among those not yet suppressed.



#### **7. Return Final Filtered Boxes**

```python
return bboxes_after_nms
```

After the loop, this list contains only the most confident and non-overlapping boxes.



### 🧠 Why Important in YOLO?

YOLO predicts multiple boxes per grid cell. **Non-Max Suppression** ensures that only the best box per object is kept, preventing duplicated detections.



### 📦 Example Input:

```python
[
 [0, 0.9, 10, 20, 30, 40],   # Class 0, high confidence
 [0, 0.85, 12, 22, 32, 42],  # Class 0, overlaps highly → likely removed
 [1, 0.8, 50, 50, 70, 70],   # Different class → retained
]
```
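To trace what happens to this input, here is a plain-Python sketch of the same procedure for corner-format boxes. The names `iou_corners` and `nms`, and the thresholds `iou_threshold=0.5` and `threshold=0.4`, are choices made for this illustration.

```python
def iou_corners(a, b, eps=1e-6):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    area_a = abs((a[2] - a[0]) * (a[3] - a[1]))
    area_b = abs((b[2] - b[0]) * (b[3] - b[1]))
    return inter / (area_a + area_b - inter + eps)


def nms(bboxes, iou_threshold, threshold):
    """Greedy per-class NMS over boxes of the form [class, score, x1, y1, x2, y2]."""
    bboxes = sorted((b for b in bboxes if b[1] > threshold),
                    key=lambda b: b[1], reverse=True)
    kept = []
    while bboxes:
        chosen = bboxes.pop(0)
        # Keep boxes of other classes, or same-class boxes with low overlap
        bboxes = [b for b in bboxes
                  if b[0] != chosen[0]
                  or iou_corners(chosen[2:], b[2:]) < iou_threshold]
        kept.append(chosen)
    return kept


boxes = [
    [0, 0.9, 10, 20, 30, 40],
    [0, 0.85, 12, 22, 32, 42],
    [1, 0.8, 50, 50, 70, 70],
]
print(nms(boxes, iou_threshold=0.5, threshold=0.4))
```

The first two boxes overlap with an IoU of about 0.68, so the second is suppressed; the third belongs to another class and is kept.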



### **Mean Average Precision**

In [5]:
def mean_average_precision(
    pred_boxes, true_boxes, iou_threshold=0.5, box_format="midpoint", num_classes=20
):
    """
    Calculates mean average precision 
    Parameters:
        pred_boxes (list): list of lists containing all bboxes with each bboxes
        specified as [train_idx, class_prediction, prob_score, then 4 box coordinates in box_format]
        true_boxes (list): Similar as pred_boxes except all the correct ones 
        iou_threshold (float): minimum IoU with a ground-truth box for a prediction to count as correct
        box_format (str): "midpoint" or "corners" used to specify bboxes
        num_classes (int): number of classes
    Returns:
        float: mAP value across all classes given a specific IoU threshold 
    """

    # list storing all AP for respective classes
    average_precisions = []

    # used for numerical stability later on
    epsilon = 1e-6

    for c in range(num_classes):
        detections = []
        ground_truths = []

        # Go through all predictions and targets,
        # and only add the ones that belong to the
        # current class c
        for detection in pred_boxes:
            if detection[1] == c:
                detections.append(detection)

        for true_box in true_boxes:
            if true_box[1] == c:
                ground_truths.append(true_box)

        # find the amount of bboxes for each training example
        # Counter here finds how many ground truth bboxes we get
        # for each training example, so let's say img 0 has 3,
        # img 1 has 5 then we will obtain a dictionary with:
        # amount_bboxes = {0:3, 1:5}
        amount_bboxes = Counter([gt[0] for gt in ground_truths])

        # We then go through each key, val in this dictionary
        # and convert to the following (w.r.t same example):
        # amount_bboxes = {0: torch.tensor([0, 0, 0]), 1: torch.tensor([0, 0, 0, 0, 0])}
        for key, val in amount_bboxes.items():
            amount_bboxes[key] = torch.zeros(val)

        # sort by box probabilities which is index 2
        detections.sort(key=lambda x: x[2], reverse=True)
        TP = torch.zeros((len(detections)))
        FP = torch.zeros((len(detections)))
        total_true_bboxes = len(ground_truths)
        
        # If none exists for this class then we can safely skip
        if total_true_bboxes == 0:
            continue

        for detection_idx, detection in enumerate(detections):
            # Only take out the ground_truths that have the same
            # training idx as detection
            ground_truth_img = [
                bbox for bbox in ground_truths if bbox[0] == detection[0]
            ]

            num_gts = len(ground_truth_img)
            best_iou = 0

            for idx, gt in enumerate(ground_truth_img):
                iou = intersection_over_union(
                    torch.tensor(detection[3:]),
                    torch.tensor(gt[3:]),
                    box_format=box_format,
                )

                if iou > best_iou:
                    best_iou = iou
                    best_gt_idx = idx

            if best_iou > iou_threshold:
                # only detect ground truth detection once
                if amount_bboxes[detection[0]][best_gt_idx] == 0:
                    # true positive and add this bounding box to seen
                    TP[detection_idx] = 1
                    amount_bboxes[detection[0]][best_gt_idx] = 1
                else:
                    FP[detection_idx] = 1

            # if IOU is lower then the detection is a false positive
            else:
                FP[detection_idx] = 1

        TP_cumsum = torch.cumsum(TP, dim=0)
        FP_cumsum = torch.cumsum(FP, dim=0)
        recalls = TP_cumsum / (total_true_bboxes + epsilon)
        precisions = torch.divide(TP_cumsum, (TP_cumsum + FP_cumsum + epsilon))
        precisions = torch.cat((torch.tensor([1]), precisions))
        recalls = torch.cat((torch.tensor([0]), recalls))
        # torch.trapz for numerical integration
        average_precisions.append(torch.trapz(precisions, recalls))

    return sum(average_precisions) / len(average_precisions)

### 🎯 **Goal**

To compute the **mean Average Precision (mAP)** across all classes.
It measures how well your model predicts bounding boxes for all objects in all classes.



### 📥 **Inputs**

* `pred_boxes`: All predicted boxes in the format
  `[image_idx, class_pred, prob_score, x1, y1, x2, y2]` (the last four values are interpreted according to `box_format`)
* `true_boxes`: All ground-truth boxes in the same format.
* `iou_threshold`: Minimum IoU needed to count a prediction as correct.
* `box_format`: Either `"midpoint"` (cx, cy, w, h) or `"corners"` (x1, y1, x2, y2)
* `num_classes`: Total number of object classes.



## 🔁 Step-by-Step Breakdown



### 1. **Preparation**

```python
average_precisions = []
epsilon = 1e-6
```

* `average_precisions`: To store AP per class
* `epsilon`: Small number to avoid division by zero



### 2. **Loop through each class**

```python
for c in range(num_classes):
    detections = []
    ground_truths = []
```

We compute AP class-wise. For each class `c`, we gather:

* All detections (predicted bboxes of class `c`)
* All ground truths of class `c`



### 3. **Separate detections and GTs for this class**

```python
for detection in pred_boxes:
    if detection[1] == c:
        detections.append(detection)

for true_box in true_boxes:
    if true_box[1] == c:
        ground_truths.append(true_box)
```



### 4. **Count number of GT bboxes per image**

```python
amount_bboxes = Counter([gt[0] for gt in ground_truths])
for key, val in amount_bboxes.items():
    amount_bboxes[key] = torch.zeros(val)
```

* `amount_bboxes[img_id] = [0, 0, 0]` means there are 3 GT boxes in that image and none has been matched yet.
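A small stand-alone example of this bookkeeping, using plain lists in place of `torch.zeros` and invented ground-truth values:

```python
from collections import Counter

# Each entry: [image_idx, class, score, x, y, w, h] (values invented for illustration)
ground_truths = [
    [0, 1, 1.0, 0.2, 0.2, 0.1, 0.1],
    [0, 1, 1.0, 0.6, 0.6, 0.2, 0.2],
    [1, 1, 1.0, 0.5, 0.5, 0.3, 0.3],
]

# Count ground-truth boxes per image
amount_bboxes = Counter(gt[0] for gt in ground_truths)

# One "already matched" flag per ground-truth box, all starting at 0
matched = {img: [0] * n for img, n in amount_bboxes.items()}
print(matched)

# Marking the second box of image 0 as matched
matched[0][1] = 1
print(matched)
```

Each flag is flipped to 1 the first time a detection matches that ground-truth box, so later detections of the same box count as false positives.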



### 5. **Sort detections by confidence**

```python
detections.sort(key=lambda x: x[2], reverse=True)
TP = torch.zeros((len(detections)))
FP = torch.zeros((len(detections)))
total_true_bboxes = len(ground_truths)
```

* This helps us evaluate the most confident predictions first.


### 6. **Skip if no ground truths**

```python
if total_true_bboxes == 0:
    continue
```



### 7. **Evaluate each detection**

```python
for detection_idx, detection in enumerate(detections):
    ground_truth_img = [bbox for bbox in ground_truths if bbox[0] == detection[0]]
```

* For the predicted box, get all GT boxes from the same image.



### 8. **Find best IoU match**

```python
best_iou = 0
for idx, gt in enumerate(ground_truth_img):
    iou = intersection_over_union(...)
    if iou > best_iou:
        best_iou = iou
        best_gt_idx = idx
```

Compare detection to all GTs from same image, keep highest IoU.



### 9. **Mark True Positive or False Positive**

```python
if best_iou > iou_threshold:
    if amount_bboxes[detection[0]][best_gt_idx] == 0:
        TP[detection_idx] = 1
        amount_bboxes[detection[0]][best_gt_idx] = 1
    else:
        FP[detection_idx] = 1
else:
    FP[detection_idx] = 1
```

* If IoU is high and GT not already used → TP
* If IoU is high but GT already matched → FP
* If IoU is low → FP



### 10. **Cumulative Precision and Recall**

```python
TP_cumsum = torch.cumsum(TP, dim=0)
FP_cumsum = torch.cumsum(FP, dim=0)
recalls = TP_cumsum / (total_true_bboxes + epsilon)
precisions = torch.divide(TP_cumsum, (TP_cumsum + FP_cumsum + epsilon))
```

These vectors help us plot the **Precision-Recall (PR) curve**.



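### 11. **Add the starting point of the PR curve**

```python
precisions = torch.cat((torch.tensor([1]), precisions))
recalls = torch.cat((torch.tensor([0]), recalls))
```

* Prepending precision 1 at recall 0 anchors the curve at its starting point, so the integration in the next step covers the full recall range from 0.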



### 12. **Integrate Precision-Recall Curve**

```python
average_precisions.append(torch.trapz(precisions, recalls))
```

* `trapz()` performs numerical integration (area under PR curve = Average Precision)
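Steps 10 to 12 can be worked through on a tiny example. The sketch below uses plain Python instead of tensors and leaves out the `epsilon` terms, so it is an approximation of the function above for illustration; `average_precision` is an invented name.

```python
def average_precision(tp_flags, total_gt):
    """AP from TP/FP flags (1 = TP, 0 = FP) sorted by descending confidence."""
    precisions, recalls = [1.0], [0.0]  # starting point of the PR curve
    tp = fp = 0
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        precisions.append(tp / (tp + fp))
        recalls.append(tp / total_gt)

    # Trapezoidal rule over the (recall, precision) points
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


# Three detections (TP, FP, TP) against two ground-truth boxes
print(average_precision([1, 0, 1], total_gt=2))
```

Here the curve passes through (0, 1), (0.5, 1), (0.5, 0.5), and (1, 2/3), giving an area of 19/24, about 0.79.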



### 13. **Return mean of all APs**

```python
return sum(average_precisions) / len(average_precisions)
```



### ✅ **Final Output**

Returns **mean Average Precision (mAP)** — a value between 0 and 1. Higher = better detection performance.



### Get Boxes

In [6]:
def get_bboxes(
    loader,
    model,
    iou_threshold,
    threshold,
    pred_format="cells",
    box_format="midpoint",
    device="cuda",
):
    all_pred_boxes = []
    all_true_boxes = []

    # make sure model is in eval before get bboxes
    model.eval()
    train_idx = 0

    for batch_idx, (x, labels) in enumerate(loader):
        x = x.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            predictions = model(x)

        batch_size = x.shape[0]
        true_bboxes = cellboxes_to_boxes(labels)
        bboxes = cellboxes_to_boxes(predictions)

        for idx in range(batch_size):
            nms_boxes = non_max_suppression(
                bboxes[idx],
                iou_threshold=iou_threshold,
                threshold=threshold,
                box_format=box_format,
            )


            #if batch_idx == 0 and idx == 0:
            #    plot_image(x[idx].permute(1,2,0).to("cpu"), nms_boxes)
            #    print(nms_boxes)

            for nms_box in nms_boxes:
                all_pred_boxes.append([train_idx] + nms_box)

            for box in true_bboxes[idx]:
                # cells without an object have confidence 0, so they are skipped
                if box[1] > threshold:
                    all_true_boxes.append([train_idx] + box)

            train_idx += 1

    model.train()
    return all_pred_boxes, all_true_boxes


### 🎯 **Goal**

To collect all predicted and ground truth bounding boxes from a dataset loader for computing metrics like mAP.



## 📥 Parameters

```python
loader         # Dataloader with (images, labels)
model          # YOLO model
iou_threshold  # IoU threshold passed to NMS
threshold      # Confidence threshold for predictions (also used to skip empty ground-truth cells)
pred_format    # Format like "cells" (YOLO grid output); not used inside the function body
box_format     # "midpoint" or "corners"
device         # "cuda" or "cpu"
```



## 🔁 Breakdown



### 1. **Initialize**

```python
all_pred_boxes = []
all_true_boxes = []
model.eval()
train_idx = 0
```

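* `all_pred_boxes`: flat list of predicted boxes from all images, each prefixed with its image index
* `all_true_boxes`: flat list of ground-truth boxes, in the same layout
* `train_idx`: running image index across batches (used to pair predictions with ground truth during mAP calculation)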



### 2. **Loop Through Batches**

```python
for batch_idx, (x, labels) in enumerate(loader):
```

* `x`: images
* `labels`: ground-truth boxes (YOLO format)



### 3. **Move to device and predict**

```python
x = x.to(device)
labels = labels.to(device)
with torch.no_grad():
    predictions = model(x)
```

* Perform forward pass without tracking gradients.



### 4. **Convert Cell Format to Bounding Boxes**

```python
true_bboxes = cellboxes_to_boxes(labels)
bboxes = cellboxes_to_boxes(predictions)
```

* Converts YOLO’s grid/cell tensors into one list per image of `[class, confidence, x, y, w, h]` boxes (midpoint format, relative to the whole image) for evaluation.



### 5. **Loop through individual images in batch**

```python
for idx in range(batch_size):
```



### 6. **Apply NMS to predicted boxes**

```python
nms_boxes = non_max_suppression(
    bboxes[idx],
    iou_threshold=iou_threshold,
    threshold=threshold,
    box_format=box_format,
)
```

* Drops boxes whose confidence is below `threshold`, then removes overlapping predicted boxes using the IoU threshold.
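For intuition, a minimal NMS sketch in plain Python follows. It is not the notebook's `non_max_suppression`; `iou_midpoint` and `nms` are hypothetical names, boxes are assumed to be `[class, confidence, x, y, w, h]` in midpoint format, and suppression is assumed to apply only between boxes of the same class.

```python
def iou_midpoint(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax1, ax2 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay1, ay2 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx1, bx2 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by1, by2 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_threshold, threshold):
    """Keep the most confident box of each overlapping same-class group."""
    # 1. Discard low-confidence boxes, 2. sort the rest by confidence
    candidates = [b for b in boxes if b[1] > threshold]
    candidates.sort(key=lambda b: b[1], reverse=True)
    kept = []
    while candidates:
        chosen = candidates.pop(0)
        # 3. Drop same-class boxes that overlap the chosen box too much
        candidates = [
            b for b in candidates
            if b[0] != chosen[0]
            or iou_midpoint(chosen[2:], b[2:]) < iou_threshold
        ]
        kept.append(chosen)
    return kept


boxes = [
    [0, 0.9, 0.50, 0.5, 0.4, 0.4],
    [0, 0.8, 0.52, 0.5, 0.4, 0.4],  # near-duplicate of the first box
    [1, 0.7, 0.50, 0.5, 0.4, 0.4],  # same place, different class
    [0, 0.2, 0.10, 0.1, 0.1, 0.1],  # below the confidence threshold
]
kept = nms(boxes, iou_threshold=0.5, threshold=0.4)
print(kept)
```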



### 7. **Store predictions**

```python
for nms_box in nms_boxes:
    all_pred_boxes.append([train_idx] + nms_box)
```

* Add predictions for this image to `all_pred_boxes`
  (with `train_idx` as image identifier)



### 8. **Store ground truth boxes**

```python
for box in true_bboxes[idx]:
    if box[1] > threshold:  # keep only cells that contain an object (GT confidence is 1, empty cells are 0)
        all_true_boxes.append([train_idx] + box)
```



### 9. **Increment image index**

```python
train_idx += 1
```

Each image gets a unique index for comparison during mAP calculation.



### 10. **Return model to training mode**

```python
model.train()
return all_pred_boxes, all_true_boxes
```



### ✅ Output

* `all_pred_boxes`: List of `[image_id, class, confidence, x, y, w, h]` (boxes that survived NMS)
* `all_true_boxes`: Same layout, taken from the labels without NMS
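The flat output layout can be illustrated without a model. In this sketch, `flatten_with_image_index` is a hypothetical helper and the per-image boxes are made-up values in `[class, confidence, x, y, w, h]` form.

```python
from collections import defaultdict


def flatten_with_image_index(per_image_boxes, start_idx=0):
    """Prefix every box with the index of the image it belongs to."""
    flat = []
    for offset, boxes in enumerate(per_image_boxes):
        for box in boxes:
            flat.append([start_idx + offset] + list(box))
    return flat


batch = [
    [[0, 0.9, 0.5, 0.5, 0.2, 0.2]],                                # image 0
    [[1, 0.8, 0.3, 0.3, 0.1, 0.1], [2, 0.7, 0.6, 0.6, 0.2, 0.3]],  # image 1
]
flat = flatten_with_image_index(batch)

# The image index lets a metric regroup boxes per image later on
by_image = defaultdict(list)
for box in flat:
    by_image[box[0]].append(box[1:])
print(dict(by_image))
```

Keeping one flat list with an explicit image index is what allows a metric function to sort all detections of a class by confidence across the whole dataset.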



### `convert_cellboxes`

In [7]:
def convert_cellboxes(predictions, S=7, C=3):
    """
    Converts bounding boxes output from Yolo with
    an image split size of S into entire image ratios
    rather than relative to cell ratios. Tried to do this
    vectorized, but this resulted in quite difficult to read
    code... Use as a black box? Or implement a more intuitive,
    using 2 for loops iterating range(S) and convert them one
    by one, resulting in a slower but more readable implementation.
    """

    predictions = predictions.to("cpu")
    batch_size = predictions.shape[0]
    predictions = predictions.reshape(batch_size, S, S, C + 10)
    bboxes1 = predictions[..., C + 1:C + 5]
    bboxes2 = predictions[..., C + 6:C + 10]
    scores = torch.cat(
        (predictions[..., C].unsqueeze(0), predictions[..., C + 5].unsqueeze(0)), dim=0
    )
    best_box = scores.argmax(0).unsqueeze(-1)
    best_boxes = bboxes1 * (1 - best_box) + best_box * bboxes2
    cell_indices = torch.arange(S).repeat(batch_size, S, 1).unsqueeze(-1)
    x = 1 / S * (best_boxes[..., :1] + cell_indices)
    y = 1 / S * (best_boxes[..., 1:2] + cell_indices.permute(0, 2, 1, 3))
    w_y = 1 / S * best_boxes[..., 2:4]
    converted_bboxes = torch.cat((x, y, w_y), dim=-1)
    predicted_class = predictions[..., :C].argmax(-1).unsqueeze(-1)
    best_confidence = torch.max(predictions[..., C], predictions[..., C + 5]).unsqueeze(
        -1
    )
    converted_preds = torch.cat(
        (predicted_class, best_confidence, converted_bboxes), dim=-1
    )

    return converted_preds

### 🎯 **Goal**

Convert YOLO outputs (from grid cells) into bounding boxes normalized with respect to the **whole image**, in `[class, confidence, x, y, w, h]` format.



## 📥 Parameters

```python
predictions: Tensor of shape (N, 7, 7, C + 10)
    # Where:
    #   C = number of classes
    #   10 = 2 boxes per cell (each with [confidence, x, y, w, h])
S: Grid size (usually 7)
C: Number of classes
```



## 🔁 Step-by-Step Explanation

### 1. **Reshape**

```python
predictions = predictions.reshape(batch_size, S, S, C + 10)
```

From flat vector → structured grid: 7×7 cells each with 2 boxes and `C` class scores.



### 2. **Extract Bounding Boxes**

```python
bboxes1 = predictions[..., C + 1:C + 5]  # [x1, y1, w1, h1]
bboxes2 = predictions[..., C + 6:C + 10] # [x2, y2, w2, h2]
```

Two predicted boxes per grid cell.



### 3. **Select Best Box (Highest Confidence)**

```python
scores = torch.cat((predictions[..., C].unsqueeze(0), predictions[..., C + 5].unsqueeze(0)), dim=0)
best_box = scores.argmax(0).unsqueeze(-1)
best_boxes = bboxes1 * (1 - best_box) + best_box * bboxes2
```

* Compare the two boxes' confidence (objectness) scores and take the index of the higher one.
* `best_box` is a 0/1 mask, so the expression selects either `bboxes1` or `bboxes2` per cell.
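The selection trick can be reproduced with plain numbers. `select_best` is a hypothetical scalar version of the tensor expression; in this sketch a tie goes to the first box.

```python
def select_best(box1, box2, conf1, conf2):
    """Pick box1 or box2 using the same 0/1 mask arithmetic as the tensor code."""
    best = 1 if conf2 > conf1 else 0  # index of the more confident box
    return [b1 * (1 - best) + best * b2 for b1, b2 in zip(box1, box2)]


box1 = [0.1, 0.2, 0.3, 0.4]
box2 = [0.5, 0.6, 0.7, 0.8]

first = select_best(box1, box2, conf1=0.9, conf2=0.3)   # mask is 0
second = select_best(box1, box2, conf1=0.2, conf2=0.7)  # mask is 1
print(first, second)
```

Because the mask is exactly 0 or 1, one of the two products vanishes and the result is a copy of the chosen box rather than a blend of both.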



### 4. **Convert Cell-Relative to Image-Relative Coordinates**

```python
cell_indices = torch.arange(S).repeat(batch_size, S, 1).unsqueeze(-1)
x = 1 / S * (best_boxes[..., :1] + cell_indices)
y = 1 / S * (best_boxes[..., 1:2] + cell_indices.permute(0, 2, 1, 3))
w_y = 1 / S * best_boxes[..., 2:4]
```

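* Convert the center from **cell-relative** (x in \[0,1] within one cell) to **image-relative** (x in \[0,1] across the whole image).
* Add the cell's column index to `x` and its row index to `y`, then scale by `1/S`.
* Width and height are stored in cell units, so they are only scaled by `1/S`.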
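A scalar version of this conversion, as a sketch: `cell_to_image` is a hypothetical helper, and the sample cell (row 3, column 2) and box values are made up.

```python
def cell_to_image(x_cell, y_cell, w_cell, h_cell, row, col, S=7):
    """Convert one cell-relative box to image-relative midpoint coordinates."""
    x = (x_cell + col) / S  # the column index shifts the x coordinate
    y = (y_cell + row) / S  # the row index shifts the y coordinate
    w = w_cell / S          # sizes are in cell units, so only rescale
    h = h_cell / S
    return x, y, w, h


x, y, w, h = cell_to_image(0.5, 0.25, 1.4, 2.1, row=3, col=2)
print(x, y, w, h)
```

A width of `1.4` in cell units means the box spans 1.4 cells, which is `1.4 / 7 = 0.2` of the image width.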



### 5. **Concatenate Final Bounding Box**

```python
converted_bboxes = torch.cat((x, y, w_y), dim=-1)
```



### 6. **Get Class & Confidence**

```python
predicted_class = predictions[..., :C].argmax(-1).unsqueeze(-1)
best_confidence = torch.max(predictions[..., C], predictions[..., C + 5]).unsqueeze(-1)
```

* Highest class probability.
* Highest confidence of the two boxes.



### 7. **Final Output**

```python
converted_preds = torch.cat((predicted_class, best_confidence, converted_bboxes), dim=-1)
return converted_preds
```

Shape: `(N, 7, 7, 6)` with values:

```
[class, confidence, x, y, w, h]
```



### `cellboxes_to_boxes`

In [8]:
def cellboxes_to_boxes(out, S=7):
    converted_pred = convert_cellboxes(out, S=S).reshape(out.shape[0], S * S, -1)
    # Truncate class indices to whole numbers; the tensor itself stays floating point
    converted_pred[..., 0] = converted_pred[..., 0].long()
    all_bboxes = []

    for ex_idx in range(out.shape[0]):
        bboxes = []

        for bbox_idx in range(S * S):
            bboxes.append([x.item() for x in converted_pred[ex_idx, bbox_idx, :]])
        all_bboxes.append(bboxes)

    return all_bboxes

def save_checkpoint(state, filename="my_checkpoint.pth"):
    print("=> Saving checkpoint")
    torch.save(state, filename)
    
def load_checkpoint(checkpoint, model, optimizer):
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])

### ✅ `cellboxes_to_boxes(out, S=7)`

**Purpose**:
Converts YOLO output tensor from model into a list of bounding boxes per image, using the format:

```
[class, confidence, x, y, w, h]
```

**Steps**:

1. Calls `convert_cellboxes()` to convert YOLO cell predictions into image-relative boxes.
2. Reshapes it into `(batch_size, 49, 6)`, because there are 49 (7×7) cells per image.
3. Truncates the class index to a whole number (it is still stored as a float).
4. Loops through each image and stores the list of boxes.

**Returns**:
List of bounding boxes for each image in the batch:

```python
[
  [[class, conf, x, y, w, h], [...], ...],  # Image 1
  [[class, conf, x, y, w, h], [...], ...],  # Image 2
  ...
]
```



### 💾 `save_checkpoint(state, filename="my_checkpoint.pth")`

**Purpose**:
Saves the model state and optimizer state to a file.

**Typical Usage**:

```python
save_checkpoint({
    "state_dict": model.state_dict(),
    "optimizer": optimizer.state_dict(),
})
```



### 📦 `load_checkpoint(checkpoint, model, optimizer)`

**Purpose**:
Loads a previously saved checkpoint into the model and optimizer.

**Typical Usage**:

```python
checkpoint = torch.load("my_checkpoint.pth")
load_checkpoint(checkpoint, model, optimizer)
```



## **Dataset Preprocessing**

In [None]:
files_dir = 'Data/train_zip/train'
test_dir = 'Data/test_zip/test'

images = [image for image in sorted(os.listdir(files_dir))
                        if image[-4:]=='.jpg']
annots = []
for image in images:
    annot = image[:-4] + '.xml'
    annots.append(annot)
    
images = pd.Series(images, name='images')
annots = pd.Series(annots, name='annots')
df = pd.concat([images, annots], axis=1)
df = pd.DataFrame(df)

test_images = [image for image in sorted(os.listdir(test_dir))
                        if image[-4:]=='.jpg']

test_annots = []
for image in test_images:
    annot = image[:-4] + '.xml'
    test_annots.append(annot)

test_images = pd.Series(test_images, name='test_images')
test_annots = pd.Series(test_annots, name='test_annots')
test_df = pd.concat([test_images, test_annots], axis=1)
test_df = pd.DataFrame(test_df)

In [13]:
import os
import pandas as pd

def create_dataframe(image_dir, image_col, annot_col):
    images = sorted([img for img in os.listdir(image_dir) if img.endswith('.jpg')])
    annots = [img.replace('.jpg', '.xml') for img in images]
    return pd.DataFrame({image_col: images, annot_col: annots})

# Paths
train_dir = 'Data/train_zip/train'
test_dir = 'Data/test_zip/test'

# Create DataFrames
df = create_dataframe(train_dir, 'images', 'annots')
test_df = create_dataframe(test_dir, 'test_images', 'test_annots')

df.head(11)


|   | images | annots |
|---|--------|--------|
| 0 | apple_1.jpg | apple_1.xml |
| 1 | apple_10.jpg | apple_10.xml |
| 2 | apple_11.jpg | apple_11.xml |
| 3 | apple_12.jpg | apple_12.xml |
| 4 | apple_13.jpg | apple_13.xml |
| 5 | apple_14.jpg | apple_14.xml |
| 6 | apple_15.jpg | apple_15.xml |
| 7 | apple_16.jpg | apple_16.xml |
| 8 | apple_17.jpg | apple_17.xml |
| 9 | apple_18.jpg | apple_18.xml |
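The pairing of each `.jpg` with an `.xml` of the same name assumes that every image has an annotation file. A slightly more defensive variant could check that the file exists; the sketch below uses only the standard library, and `pair_images_with_annotations` is a hypothetical helper demonstrated on a temporary directory rather than the real dataset.

```python
import os
import tempfile


def pair_images_with_annotations(image_dir):
    """Return (image, annotation) pairs for every .jpg with a matching .xml."""
    pairs = []
    for name in sorted(os.listdir(image_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".jpg":
            continue
        annot = stem + ".xml"
        # Skip images whose annotation file is missing
        if os.path.exists(os.path.join(image_dir, annot)):
            pairs.append((name, annot))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    # apple_2.jpg deliberately has no annotation file
    for fname in ["apple_1.jpg", "apple_1.xml", "apple_2.jpg"]:
        open(os.path.join(tmp, fname), "w").close()
    pairs = pair_images_with_annotations(tmp)

print(pairs)
```

`os.path.splitext` splits only at the final extension, so a filename that happens to contain `.jpg` elsewhere is not altered the way `str.replace` would alter it.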


In [15]:
class FruitImagesDataset(torch.utils.data.Dataset):
    def __init__(self, df=df, files_dir=train_dir, S=7, B=2, C=3, transform=None):
        self.annotations = df
        self.files_dir = files_dir
        self.transform = transform
        self.S = S
        self.B = B
        self.C = C

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.files_dir, self.annotations.iloc[index, 1])
        boxes = []
        tree = ET.parse(label_path)
        root = tree.getroot()
        
        class_dictionary = {'apple':0, 'banana':1, 'orange':2}
    
        if(int(root.find('size').find('height').text) == 0):
            filename = root.find('filename').text
            img = Image.open(self.files_dir + '/' + filename)
            img_width, img_height = img.size
            
            for member in root.findall('object'):
            
                klass = member.find('name').text
                klass = class_dictionary[klass]
            
                # bounding box
                xmin = int(member.find('bndbox').find('xmin').text)
                xmax = int(member.find('bndbox').find('xmax').text)
            
                ymin = int(member.find('bndbox').find('ymin').text)
                ymax = int(member.find('bndbox').find('ymax').text)
                
                centerx = ((xmax + xmin) / 2) / img_width
                centery = ((ymax + ymin) / 2) / img_height
                boxwidth = (xmax - xmin) / img_width
                boxheight = (ymax - ymin) / img_height
            
            
                boxes.append([klass, centerx, centery, boxwidth, boxheight])
            
        elif(int(root.find('size').find('height').text) != 0):
            
            for member in root.findall('object'):
            
                klass = member.find('name').text
                klass = class_dictionary[klass]
            
                # bounding box
                xmin = int(member.find('bndbox').find('xmin').text)
                xmax = int(member.find('bndbox').find('xmax').text)
                img_width = int(root.find('size').find('width').text)
            
                ymin = int(member.find('bndbox').find('ymin').text)
                ymax = int(member.find('bndbox').find('ymax').text)
                img_height = int(root.find('size').find('height').text)
                
                centerx = ((xmax + xmin) / 2) / img_width
                centery = ((ymax + ymin) / 2) / img_height
                boxwidth = (xmax - xmin) / img_width
                boxheight = (ymax - ymin) / img_height
            
            
                boxes.append([klass, centerx, centery, boxwidth, boxheight])

                
        boxes = torch.tensor(boxes)
        img_path = os.path.join(self.files_dir, self.annotations.iloc[index, 0])
        image = Image.open(img_path)
        image = image.convert("RGB")

        if self.transform:
            # image = self.transform(image)
            image, boxes = self.transform(image, boxes)

        # Convert To Cells
        label_matrix = torch.zeros((self.S, self.S, self.C + 5 * self.B))
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            # i,j represents the cell row and cell column
            i, j = int(self.S * y), int(self.S * x)
            x_cell, y_cell = self.S * x - j, self.S * y - i

            """
            Calculating the width and height of cell of bounding box,
            relative to the cell is done by the following, with
            width as the example:
            
            width_pixels = (width*self.image_width)
            cell_pixels = (self.image_width / self.S)
            
            Then to find the width relative to the cell is simply:
            width_pixels/cell_pixels, simplification leads to the
            formulas below.
            """
            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # If no object already found for specific cell i,j
            # Note: This means we restrict to ONE object
            # per cell!
#             print(i, j)
            if label_matrix[i, j, self.C] == 0:
                # Set that there exists an object
                label_matrix[i, j, self.C] = 1

                # Box coordinates
                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )

                label_matrix[i, j, self.C + 1:self.C + 5] = box_coordinates

                # Set one hot encoding for class_label
                label_matrix[i, j, class_label] = 1

        return image, label_matrix


The `FruitImagesDataset` class is designed to load fruit images and their associated bounding box annotations from XML files. It processes the images, converts bounding boxes into cell-based coordinates, and returns both the image and its corresponding label matrix.

### Key Features:

1. **Initialization (`__init__`)**:

   * Takes a dataframe (`df`), directory paths (`files_dir`), and other configuration parameters like grid size (`S`), number of bounding boxes (`B`), number of classes (`C`), and optional transformations (`transform`).

2. **Length Method (`__len__`)**:

   * Returns the total number of items (images) in the dataset.

3. **Get Item Method (`__getitem__`)**:

   * Loads an image and its corresponding XML annotation file.
   * Extracts bounding box details (class, center coordinates, width, and height).
   * Converts coordinates to relative values with respect to the image size.
   * Converts bounding box coordinates to a grid-based representation (cell coordinates in the `S x S` grid).

4. **Label Matrix**:

   * The label matrix is a `S x S x (C + 5 * B)` tensor, where:

     * `C` represents the number of classes (fruit types).
     * `5 * B` accounts for the bounding box parameters (center x, center y, width, height, and confidence score) for each bounding box.
   * Each grid cell contains the bounding box information if a box is centered in that cell.

5. **Transformation**:

   * If a `transform` function is provided, it's applied to both the image and bounding box coordinates.
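The conversion to cell coordinates can be traced for a single box. This is a sketch of the arithmetic only; `encode_box` is a hypothetical helper and the input values are made up.

```python
def encode_box(x, y, width, height, S=7):
    """Map an image-relative midpoint box to its grid cell and cell-relative values."""
    # Row i comes from y, column j comes from x
    i, j = int(S * y), int(S * x)
    # Offset of the center inside that cell, each in [0, 1)
    x_cell, y_cell = S * x - j, S * y - i
    # Width and height measured in cell units (may exceed 1)
    width_cell, height_cell = width * S, height * S
    # Note: a center exactly at x == 1.0 or y == 1.0 would give index S,
    # which is outside the grid; this sketch does not handle that case.
    return i, j, x_cell, y_cell, width_cell, height_cell


i, j, x_cell, y_cell, w_cell, h_cell = encode_box(0.5, 0.25, 0.2, 0.4)
print(i, j, x_cell, y_cell, w_cell, h_cell)
```

A box centered at `(0.5, 0.25)` lands in row 1, column 3 of the 7×7 grid, halfway across that cell horizontally and three quarters down vertically.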

### Suggestions for Improvement:

1. **Transform Method**:

   * The commented-out `image = self.transform(image)` line is superseded by the joint call `self.transform(image, boxes)`. If you add geometric augmentation (such as random flipping or cropping), the transform must update the boxes as well as the image.

2. **Bounding Box Handling**:

   * Your code assumes only one object per grid cell (`if label_matrix[i, j, self.C] == 0`). If you want to support multiple objects per cell, you'll need to extend this logic.

3. **Handling Images with No Bounding Boxes**:

   * An image with no bounding boxes already produces an all-zero `label_matrix`, because the loop over `boxes` never runs. Keep that behavior if you refactor the method.

4. **Class Dictionary**:

   * The `class_dictionary` is rebuilt on every `__getitem__` call. Since it is constant, it can be moved to a class attribute or a module-level constant.
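As one possible way to act on these suggestions, the annotation parsing could live in a small standalone function with the class mapping as a module-level constant. The sketch below is an assumption-laden illustration: `CLASS_IDS`, `parse_boxes`, and the sample XML are made up, and it assumes the `<size>` element holds non-zero values (the dataset class above falls back to opening the image when the height is 0).

```python
import xml.etree.ElementTree as ET

CLASS_IDS = {"apple": 0, "banana": 1, "orange": 2}

SAMPLE_XML = """
<annotation>
  <filename>apple_1.jpg</filename>
  <size><width>200</width><height>100</height></size>
  <object>
    <name>apple</name>
    <bndbox><xmin>50</xmin><ymin>20</ymin><xmax>150</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>
"""


def parse_boxes(xml_text, class_ids=CLASS_IDS):
    """Return [class, center_x, center_y, width, height], normalized to [0, 1]."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    img_w = int(size.find("width").text)
    img_h = int(size.find("height").text)

    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        xmin, ymin = int(bndbox.find("xmin").text), int(bndbox.find("ymin").text)
        xmax, ymax = int(bndbox.find("xmax").text), int(bndbox.find("ymax").text)
        boxes.append([
            class_ids[obj.find("name").text],
            (xmin + xmax) / 2 / img_w,  # center x
            (ymin + ymax) / 2 / img_h,  # center y
            (xmax - xmin) / img_w,      # width
            (ymax - ymin) / img_h,      # height
        ])
    return boxes


boxes = parse_boxes(SAMPLE_XML)
print(boxes)
```

An annotation without any `<object>` element yields an empty list, which matches the all-zero label matrix described above.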



### **Model Loss**

In [16]:
class YoloLoss(nn.Module):
    """
    Calculate the loss for yolo (v1) model
    """

    def __init__(self, S=7, B=2, C=3):
        super(YoloLoss, self).__init__()
        self.mse = nn.MSELoss(reduction="sum")

        """
        S is split size of image (in paper 7),
        B is number of boxes (in paper 2),
        C is number of classes (in paper 20, in dataset 3),
        """
        self.S = S
        self.B = B
        self.C = C

        # These are from Yolo paper, signifying how much we should
        # pay loss for no object (noobj) and the box coordinates (coord)
        self.lambda_noobj = 0.5
        self.lambda_coord = 5

    def forward(self, predictions, target):
        # predictions are shaped (BATCH_SIZE, S*S*(C+B*5)) when inputted
        predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)

        # Calculate IoU for the two predicted bounding boxes with target bbox
        iou_b1 = intersection_over_union(predictions[..., self.C + 1:self.C + 5], target[..., self.C + 1:self.C + 5])
        iou_b2 = intersection_over_union(predictions[..., self.C + 6:self.C + 10], target[..., self.C + 1:self.C + 5])
        ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

        # Take the box with highest IoU out of the two prediction
        # Note that bestbox will be indices of 0, 1 for which bbox was best
        iou_maxes, bestbox = torch.max(ious, dim=0)
        exists_box = target[..., self.C].unsqueeze(3)  # in paper this is Iobj_i

        # ======================== #
        #   FOR BOX COORDINATES    #
        # ======================== #

        # Set boxes with no object in them to 0. We only take out one of the two 
        # predictions, which is the one with highest Iou calculated previously.
        box_predictions = exists_box * (
            (
                bestbox * predictions[..., self.C + 6:self.C + 10]
                + (1 - bestbox) * predictions[..., self.C + 1:self.C + 5]
            )
        )

        box_targets = exists_box * target[..., self.C + 1:self.C + 5]

        # Take sqrt of width, height of boxes so that errors in small boxes weigh more than the same errors in large boxes
        box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(
            torch.abs(box_predictions[..., 2:4] + 1e-6)
        )
        box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])

        box_loss = self.mse(
            torch.flatten(box_predictions, end_dim=-2),
            torch.flatten(box_targets, end_dim=-2),
        )

        # ==================== #
        #   FOR OBJECT LOSS    #
        # ==================== #

        # pred_box is the confidence score for the bbox with highest IoU
        pred_box = (
            bestbox * predictions[..., self.C + 5:self.C + 6] + (1 - bestbox) * predictions[..., self.C:self.C + 1]
        )

        object_loss = self.mse(
            torch.flatten(exists_box * pred_box),
            torch.flatten(exists_box * target[..., self.C:self.C + 1]),
        )

        # ======================= #
        #   FOR NO OBJECT LOSS    #
        # ======================= #

        #max_no_obj = torch.max(predictions[..., 20:21], predictions[..., 25:26])
        #no_object_loss = self.mse(
        #    torch.flatten((1 - exists_box) * max_no_obj, start_dim=1),
        #    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
        #)

        no_object_loss = self.mse(
            torch.flatten((1 - exists_box) * predictions[..., self.C:self.C + 1], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., self.C:self.C + 1], start_dim=1),
        )

        no_object_loss += self.mse(
            torch.flatten((1 - exists_box) * predictions[..., self.C + 5:self.C + 6], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., self.C:self.C + 1], start_dim=1)
        )

        # ================== #
        #   FOR CLASS LOSS   #
        # ================== #

        class_loss = self.mse(
            torch.flatten(exists_box * predictions[..., :self.C], end_dim=-2,),
            torch.flatten(exists_box * target[..., :self.C], end_dim=-2,),
        )

        loss = (
            self.lambda_coord * box_loss  # first two rows in paper
            + object_loss  # third row in paper
            + self.lambda_noobj * no_object_loss  # fourth row
            + class_loss  # fifth row
        )

        return loss

The `YoloLoss` class calculates the loss for the YOLO (You Only Look Once) model, specifically for version 1. It implements several components of the YOLO loss function, including box loss, object loss, no object loss, and class loss, each with associated scaling factors.

### Key Features of the `YoloLoss` Class:

1. **Initialization (`__init__`)**:

   * Initializes the loss function with parameters for grid size (`S`), number of bounding boxes (`B`), number of classes (`C`), and weights for no object loss (`lambda_noobj`) and box coordinate loss (`lambda_coord`).
   * Uses `MSELoss` for computing the loss.

2. **Forward Method (`forward`)**:

   * Takes the `predictions` (model output) and `target` (ground truth) as inputs.
   * Reshapes the predictions to match the grid size and class + box information.

3. **IoU Calculation**:

   * Calculates the intersection over union (IoU) between each of the two predicted boxes and the target box (`iou_b1` and `iou_b2`).
   * Selects the bounding box with the highest IoU as the best prediction.

4. **Box Coordinates Loss**:

   * Computes the loss for box coordinates as the sum of squared errors (`MSELoss` with `reduction="sum"`) between predicted and target bounding boxes.
   * The square root of width and height is used so that a given error counts more for small boxes than for large ones.

5. **Object Loss**:

   * Calculates the confidence loss for the responsible box (the one with the highest IoU) in each grid cell that contains an object.

6. **No Object Loss**:

   * Computes the loss for cells that don't contain any objects. The loss is scaled using the `lambda_noobj` factor.

7. **Class Loss**:

   * Computes the classification loss, comparing the predicted class probabilities to the target class labels for each grid cell.

8. **Total Loss**:

   * The final loss is a weighted sum of the box loss, object loss, no object loss, and class loss, with appropriate scaling factors (`lambda_coord`, `lambda_noobj`).
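The effect of the square root in the box loss can be checked numerically. In this sketch the widths are made-up values and `sqrt_sq_error` is a hypothetical helper; both boxes are off by the same absolute amount of 0.05.

```python
import math


def sqrt_sq_error(pred, target):
    """Squared error between the square roots, as used for width and height."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2


# Same absolute error (0.05) on a small box and on a large box
small = sqrt_sq_error(0.15, 0.10)
large = sqrt_sq_error(0.85, 0.80)

# Without the square root both errors would be identical
plain_small = (0.15 - 0.10) ** 2
plain_large = (0.85 - 0.80) ** 2

print(f"sqrt version:  small={small:.6f} large={large:.6f}")
print(f"plain version: small={plain_small:.6f} large={plain_large:.6f}")
```

With the square root, the small box receives a penalty several times larger than the large box for the same deviation, which is the intended bias.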

### Breakdown of Loss Terms:

* **Box Loss** (`lambda_coord * box_loss`): Penalizes incorrect bounding box coordinates in cells that contain an object.
* **Object Loss** (`object_loss`): Penalizes the confidence error of the responsible box in cells that contain an object.
* **No Object Loss** (`lambda_noobj * no_object_loss`): Penalizes confidence assigned by either box in cells that contain no object.
* **Class Loss** (`class_loss`): Penalizes incorrect class probabilities in cells that contain an object.
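How the object mask splits the confidence terms can be shown with four made-up cells and one box per cell. `sum_squared_error` is a hypothetical helper that mimics `MSELoss(reduction="sum")` on masked values.

```python
def sum_squared_error(preds, targets, mask):
    """Sum of squared errors over the entries where mask is 1."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(preds, targets, mask))


# Made-up confidences for four grid cells
pred_conf = [0.8, 0.3, 0.1, 0.6]
target_conf = [1.0, 0.0, 0.0, 1.0]
exists = [1, 0, 0, 1]                # 1 where the cell contains an object
no_exists = [1 - e for e in exists]  # complement mask

object_loss = sum_squared_error(pred_conf, target_conf, exists)
no_object_loss = sum_squared_error(pred_conf, target_conf, no_exists)

lambda_noobj = 0.5
confidence_term = object_loss + lambda_noobj * no_object_loss
print(object_loss, no_object_loss, confidence_term)
```

Most cells in a real image are empty, so down-weighting their contribution with `lambda_noobj` keeps them from dominating the gradient.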

### Suggestions for Improvement:

1. **IoU Calculation**:

   * The `intersection_over_union` function should be implemented or imported. Ensure that it handles both cases correctly (when the bounding boxes overlap and when they do not).

2. **Optimization**:

   * The loss function uses element-wise MSE. You may want to consider if this is the best loss for bounding box predictions, as other loss functions like `SmoothL1Loss` or `GIoU` (Generalized IoU) can sometimes give better results in object detection tasks.

3. **No Object Loss**:

   * The commented-out `max_no_obj` portion could be useful if you want to calculate the loss based on the maximum of the "no object" confidence for each bounding box.
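To give a feel for the GIoU alternative mentioned above, here is a minimal scalar sketch. It is not part of the notebook; `giou_corners` is a hypothetical helper that assumes corner-format boxes `(x1, y1, x2, y2)` with positive area.

```python
def giou_corners(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union, as for plain IoU
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest axis-aligned box that encloses both inputs
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)

    # Subtract the share of the enclosing box not covered by the union
    return inter / union - (enclose - union) / enclose


same = giou_corners((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou_corners((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(same, apart)
```

Unlike plain IoU, which is 0 for every pair of non-overlapping boxes, GIoU becomes more negative as the boxes move apart, so a loss such as `1 - GIoU` still provides a training signal in that case.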



### **Model Training**

In [17]:
LEARNING_RATE = 2e-5
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 16 # 64 in original paper but resource exhausted error otherwise.
WEIGHT_DECAY = 0
EPOCHS = 20
NUM_WORKERS = 2
PIN_MEMORY = True
LOAD_MODEL = False
LOAD_MODEL_FILE = "model.pth"

In [18]:
def train_fn(train_loader, model, optimizer, loss_fn):
    loop = tqdm(train_loader, leave=True)
    mean_loss = []
    
    for batch_idx, (x, y) in enumerate(loop):
        x, y = x.to(DEVICE), y.to(DEVICE)
        out = model(x)
        loss = loss_fn(out, y)
        mean_loss.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        loop.set_postfix(loss = loss.item())
        
    print(f"Mean loss was {sum(mean_loss) / len(mean_loss)}")

In [19]:
class Compose(object):
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, img, bboxes):
        for t in self.transforms:
            img, bboxes = t(img), bboxes

        return img, bboxes


transform = Compose([transforms.Resize((448, 448)), transforms.ToTensor()])

In [23]:
def main():
    model = YoloV1(split_size=7, num_boxes=2, num_classes=3).to(DEVICE)
    optimizer = optim.Adam(
        model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
    )
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, factor=0.1, patience=3, mode='max', verbose=True)
    loss_fn = YoloLoss()

    if LOAD_MODEL:
        load_checkpoint(torch.load(LOAD_MODEL_FILE), model, optimizer)

    train_dataset = FruitImagesDataset(
        transform=transform,
        files_dir=train_dir
    )

    test_dataset = FruitImagesDataset(
        transform=transform,
        df=test_df,  # without this, the default training DataFrame would be paired with test_dir
        files_dir=test_dir
    )

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        drop_last=False,
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        drop_last=False,
    )

    for epoch in range(EPOCHS):
        train_fn(train_loader, model, optimizer, loss_fn)
        
        pred_boxes, target_boxes = get_bboxes(
            train_loader, model, iou_threshold=0.5, threshold=0.4
        )

        mean_avg_prec = mean_average_precision(
            pred_boxes, target_boxes, iou_threshold=0.5, box_format="midpoint"
        )
        print(f"Train mAP: {mean_avg_prec}")
        
        scheduler.step(mean_avg_prec)
    
    checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
    }
    save_checkpoint(checkpoint, filename=LOAD_MODEL_FILE)
    



if __name__ == "__main__":
    main()

100%|██████████| 15/15 [00:11<00:00,  1.29it/s, loss=499]


Mean loss was 624.2450846354167
Train mAP: 0.0


100%|██████████| 15/15 [00:10<00:00,  1.37it/s, loss=197]


Mean loss was 277.68736368815104
Train mAP: 0.000975356379058212


100%|██████████| 15/15 [00:10<00:00,  1.46it/s, loss=175]


Mean loss was 180.51609598795574
Train mAP: 0.017863621935248375


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=118] 


Mean loss was 136.6421910603841
Train mAP: 0.12064716219902039


100%|██████████| 15/15 [00:10<00:00,  1.43it/s, loss=101] 


Mean loss was 107.87092437744141
Train mAP: 0.15986643731594086


100%|██████████| 15/15 [00:11<00:00,  1.36it/s, loss=71.4]


Mean loss was 92.60657552083333
Train mAP: 0.24699532985687256


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=64.3]


Mean loss was 80.96835149129232
Train mAP: 0.34965309500694275


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=80.7]


Mean loss was 74.83770319620768
Train mAP: 0.41942086815834045


100%|██████████| 15/15 [00:10<00:00,  1.46it/s, loss=55]  


Mean loss was 67.00601298014323
Train mAP: 0.559205174446106


100%|██████████| 15/15 [00:10<00:00,  1.46it/s, loss=50.3]


Mean loss was 63.34628168741862
Train mAP: 0.6099393367767334


100%|██████████| 15/15 [00:10<00:00,  1.48it/s, loss=59]  


Mean loss was 59.24650319417318
Train mAP: 0.6723823547363281


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=60]  


Mean loss was 57.036339569091794
Train mAP: 0.693082869052887


100%|██████████| 15/15 [00:10<00:00,  1.49it/s, loss=80.2]


Mean loss was 53.85564333597819
Train mAP: 0.7688582539558411


100%|██████████| 15/15 [00:10<00:00,  1.48it/s, loss=57.9]


Mean loss was 50.629859415690106
Train mAP: 0.7860605120658875


100%|██████████| 15/15 [00:10<00:00,  1.48it/s, loss=41.8]


Mean loss was 46.72044245402018
Train mAP: 0.8171432614326477


100%|██████████| 15/15 [00:10<00:00,  1.46it/s, loss=49.7]


Mean loss was 48.8563850402832
Train mAP: 0.8453867435455322


100%|██████████| 15/15 [00:10<00:00,  1.48it/s, loss=43.1]


Mean loss was 50.4185910542806
Train mAP: 0.8499798774719238


100%|██████████| 15/15 [00:10<00:00,  1.46it/s, loss=32.5]


Mean loss was 40.82767804463705
Train mAP: 0.8640470504760742


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=46.9]


Mean loss was 43.815459696451825
Train mAP: 0.8637433052062988


100%|██████████| 15/15 [00:10<00:00,  1.47it/s, loss=35.2]


Mean loss was 40.661138916015624
Train mAP: 0.883161723613739
=> Saving checkpoint


### **Predictions**

In [24]:
LOAD_MODEL = True
EPOCHS = 1

In [25]:
def predictions():
    model = YoloV1(split_size=7, num_boxes=2, num_classes=3).to(DEVICE)
    optimizer = optim.Adam(
        model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
    )
    loss_fn = YoloLoss()

    if LOAD_MODEL:
        load_checkpoint(torch.load(LOAD_MODEL_FILE), model, optimizer)

    test_dataset = FruitImagesDataset(
        transform=transform, 
        df=test_df,
        files_dir=test_dir
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        drop_last=False,
    )
        
    for epoch in range(EPOCHS):
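        model.eval()
        # NOTE: train_fn runs loss.backward() and optimizer.step(), so this call
        # updates the weights on the test set before mAP is measured below.
        train_fn(test_loader, model, optimizer, loss_fn)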
        
        pred_boxes, target_boxes = get_bboxes(
            test_loader, model, iou_threshold=0.5, threshold=0.4
        )

        mean_avg_prec = mean_average_precision(
            pred_boxes, target_boxes, iou_threshold=0.5, box_format="midpoint"
        )
        print(f"Test mAP: {mean_avg_prec}")


predictions()


=> Loading checkpoint


100%|██████████| 4/4 [00:02<00:00,  1.42it/s, loss=103]


Mean loss was 168.57548713684082
Test mAP: 0.19635097682476044
