<a href="https://colab.research.google.com/github/BedinEduardo/Colab_Repositories/blob/master/Custom_OD_Models_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Original code: [here](https://medium.com/@noel.benji/customizing-object-detection-models-with-lightweight-pytorch-code-ed043e48a460)

# Customizing Object Detection Models with Lightweight PyTorch Code

Object detection (OD) and segmentation are vital tasks in computer vision (CV) that aim to identify and localize objects in images. OD models predict bounding boxes and classify the objects within them, while segmentation models go further by assigning a class label to each pixel in an image. These technologies power applications ranging from autonomous vehicles and facial recognition to medical imaging and industrial automation.

As the need for more precise, efficient, and adaptable models grows, there is increasing interest in lightweight, customizable OD implementations. Many off-the-shelf solutions provide robust capabilities but often come with unnecessary overhead (training scripts, data loaders, augmentation pipelines, and prebuilt metrics) that can limit flexibility. For researchers and engineers who prefer building and controlling these components themselves, starting with barebones models offers a cleaner slate.

PyTorch is a highly versatile deep learning framework preferred by many for its flexibility and developer-friendly design. It enables easy experimentation and customization, making it an ideal choice for developing OD models from scratch.
Its dynamic computation graph and extensive library support ensure minimalist implementations remain powerful and extensible.

As part of this tutorial, we will explore the process of building and customizing OD models using lightweight PyTorch code.
From understanding core concepts to implementing backbones, detection heads, and forward passes, we will dive deep into the fundamentals.

## Core Concepts in OD and Segmentation

OD and Segmentation models are composed of distinct components that work in unison to detect and classify objects within an image.
Understanding these components and their interactions is critical for customizing lightweight implementations.

### **Key Components of OD Models**

**Backbone**: The backbone is a feature extraction network that transforms the input image into a set of high-level feature maps.
Common backbones include pretrained CNNs such as ResNet, MobileNet, and EfficientNet.
These backbones reduce spatial dimensions while capturing semantic information.

**Neck**: The neck processes feature maps from the backbone to enhance feature representation.
Architectures like the Feature Pyramid Network (FPN) and PANet are commonly used to aggregate features from different layers, making the model better at detecting objects at different scales.

**Head**: The detection head generates the final outputs, such as bounding boxes, class labels, and confidence scores.
Depending on the model, this can involve:

* **Anchor-Based Heads**: Use predefined anchor boxes (for example, Faster R-CNN and YOLOv4).
* **Anchor-Free Heads**: Predict object centers directly (for example, CenterNet).

### **Common Architectures**

**Faster R-CNN**: Combines a CNN backbone with a Region Proposal Network (RPN) to generate candidate object regions. The proposals are refined in subsequent stages using bounding box regression and classification.

**YOLO (You Only Look Once)**: A one-stage detector that divides the image into a grid and predicts bounding boxes and class probabilities directly.
Its speed and simplicity make it popular for real-time applications.

**Mask R-CNN**: Extends Faster R-CNN by adding a parallel branch for instance segmentation, predicting masks for each detected object.
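
To make the grid idea behind YOLO concrete, here is a minimal sketch of how a ground-truth box can be mapped to the grid cell that contains its center. It follows the scheme of the original YOLO formulation; the function name `responsible_cell` and the 7 x 7 grid are illustrative assumptions, not part of the original tutorial.

```python
def responsible_cell(box, image_size, grid_size):
    """Return (row, col) of the grid cell that contains the box center.

    box: (x_min, y_min, x_max, y_max) in pixels.
    image_size: (width, height) in pixels.
    grid_size: number of cells per side (S in an S x S grid).
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    # Scale the center to grid units; clamp so a center on the far edge stays in range
    col = min(int(center_x / width * grid_size), grid_size - 1)
    row = min(int(center_y / height * grid_size), grid_size - 1)
    return row, col


# Hypothetical box in a 224 x 224 image with a 7 x 7 grid (32-pixel cells)
cell = responsible_cell((50, 60, 150, 200), image_size=(224, 224), grid_size=7)
print(cell)  # (4, 3): the center (100, 130) falls in row 4, column 3
```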

### **Essential OD Techniques**

**Anchor Generation**: Anchor boxes are predefined bounding boxes of various sizes and aspect ratios, used to match ground-truth objects. They enable the model to predict objects at different scales and locations.

**Region Proposals**: Used in two-stage detectors like Faster R-CNN, the RPN generates regions of interest (RoIs) that are likely to contain objects.

**Bounding Box Regression**: Bounding box regression adjusts the anchor boxes or predictions to fit the object more precisely.
This step ensures accurate localization.
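
As an illustration of bounding box regression, the sketch below decodes predicted offsets into boxes using the center/size parameterization popularized by the R-CNN family: the center shifts by a fraction of the anchor size, and the width and height scale exponentially. The function name `decode_boxes` and the sample values are assumptions made for this example.

```python
import torch


def decode_boxes(anchors, deltas):
    """Apply (dx, dy, dw, dh) offsets to anchors given as [x_min, y_min, x_max, y_max]."""
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas.unbind(dim=1)
    pred_ctr_x = ctr_x + dx * widths      # shift the center by a fraction of the width
    pred_ctr_y = ctr_y + dy * heights     # shift the center by a fraction of the height
    pred_w = widths * torch.exp(dw)       # scale the width (always positive)
    pred_h = heights * torch.exp(dh)      # scale the height (always positive)

    return torch.stack([
        pred_ctr_x - 0.5 * pred_w,
        pred_ctr_y - 0.5 * pred_h,
        pred_ctr_x + 0.5 * pred_w,
        pred_ctr_y + 0.5 * pred_h,
    ], dim=1)


anchors = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
deltas = torch.tensor([[0.1, 0.0, 0.0, 0.0]])  # move right by 10% of the anchor width
decoded = decode_boxes(anchors, deltas)
print(decoded)  # the box shifts from x in [0, 10] to x in [1, 11]
```

With all-zero offsets the function returns the anchors unchanged, which is a handy sanity check when wiring it into a model.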


## Why Start with Barebones Code?

Starting with barebones code in CV projects, particularly in tasks like OD or image segmentation, provides a foundation that fosters deeper learning and flexibility.
While high-level libraries like Detectron2 or MMDetection offer pre-built solutions, minimalist implementations empower developers to understand the inner workings of the model pipeline.

### Benefits of Minimalist Implementations

**Deeper Learning and Conceptual Clarity**: Barebones coding forces you to implement critical components, such as data loaders, augmentation pipelines, and evaluation metrics, from scratch. This hands-on approach clarifies how these components interact, why certain designs are chosen, and the trade-offs involved in their configuration.

**Flexibility and Customization**: High-level libraries abstract away many details, limiting your ability to modify core operations.
With barebones code, you gain complete control over every aspect, making it easier to tailor the model for specific use cases or integrate cutting-edge techniques.

**Debugging and Optimization**: When you understand each element of the pipeline, identifying and resolving bugs becomes more straightforward. You can also optimize individual steps, for example by minimizing data-loading overhead or adjusting augmentation strategies for performance gains.

### **Contrasting Barebones Code With High-Level Libraries**

Libraries such as Detectron2 and MMDetection are production-ready solutions: they ship pre-configured architectures and optimized training pipelines, and they integrate seamlessly with popular frameworks.
Their abstraction, however, can obscure the underlying mechanics, leading to a "black-box" experience.

For example, data loaders in high-level libraries often support complex data pipelines out of the box, but that convenience limits your ability to experiment with alternative loading strategies or debug edge cases.
Similarly, evaluation metrics are pre-configured and may not suit custom tasks or specialized datasets.

### The Educational Value of Rebuilding Components

Rebuilding critical elements yourself is instructive:

**Data Loaders**: Implementing custom DataLoader classes teaches you how data is batched, shuffled, and augmented.

**Metrics**: Coding metrics such as mean Average Precision (mAP) provides insight into how performance is evaluated.

**Augmentations**: Applying augmentations manually shows how they affect model generalization.

Starting with barebones code gives you a clear view of how CV models are structured and trained, which makes it easier to innovate, debug, and improve.
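
As one example of what building an augmentation by hand teaches, a horizontal flip for detection has to mirror the bounding boxes along with the pixels. The sketch below assumes `[C, H, W]` image tensors and boxes in `[x_min, y_min, x_max, y_max]` pixel format; the function name `horizontal_flip` is chosen for this illustration.

```python
import torch


def horizontal_flip(image, boxes):
    """Flip a [C, H, W] image left-right and mirror its [N, 4] boxes."""
    width = image.shape[-1]
    flipped_image = image.flip(-1)            # reverse the width axis
    flipped_boxes = boxes.clone()
    # The old right edge becomes the new left edge, and vice versa
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes


image = torch.arange(12, dtype=torch.float32).reshape(1, 3, 4)  # tiny 1 x 3 x 4 "image"
boxes = torch.tensor([[0.0, 0.0, 1.0, 2.0]])
flipped_image, flipped_boxes = horizontal_flip(image, boxes)
print(flipped_boxes)  # the box moves from x in [0, 1] to x in [3, 4]
```

Flipping twice should give back the original image and boxes, which makes a simple regression test for this kind of transform.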

## Setting Up the Environment

Building a lightweight custom OD model in PyTorch requires setting up the correct environment and organizing your project effectively.
This ensures smooth development, scalability, and efficiency during the implementation of your model.

### Key Dependencies

```bash
pip install torch
pip install torchvision
# CUDA is optional (only needed for GPU acceleration)
pip install matplotlib
pip install numpy
pip install pillow
pip install scikit-learn tqdm
```

**Project Structure**: To keep things organized, structure your project with clear directories for code, dataset and logs

```
object_detection_project/
├── data/
│   ├── train/
│   └── val/
├── models/
│   ├── yolov5.py
│   └── faster_rcnn.py
├── utils/
│   ├── transforms.py
│   └── metrics.py
├── scripts/
│   ├── train.py
│   └── inference.py
└── requirements.txt
```

This structure separates datasets, model definitions, utilities, and scripts, making it easier to scale and manage.

## Minimal Implementation of A Backbone

### What is a Backbone in OD?

The backbone of an OD model is a neural network (NN) responsible for feature extraction.
It processes the input image and generates feature maps that highlight important structures like edges, textures, and shapes.
Common choices include ResNet and MobileNet.

### Role of the Backbone

**Feature Extraction**: Captures spatial hierarchies in the image, forming the basis for further processing by the detection head.

**Transfer Learning**: Pre-trained backbones are often used to leverage existing knowledge, accelerating training on small datasets.

**Efficiency**: Choice of backbone impacts the model's speed and accuracy.

### Implementing A Single Convolutional Backbone in PyTorch

We will use a minimal example to build and modify a ResNet-based backbone for OD.

#### Step 1 - Import Dependencies

In [1]:
# Import Dependencies
import torch
import torch.nn as nn
import torchvision.models as models

In [2]:
# Setting up Accelerators
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"device: {device}")

device: cpu


#### Step 2 - Define the Backbone

In [3]:
class Backbone(nn.Module):
  def __init__(self, pretrained=True, trainable_layers=3):
    super(Backbone, self).__init__()

    # Load ResNet-50; use pre-trained weights only when `pretrained` is True
    self.weights = models.ResNet50_Weights.DEFAULT if pretrained else None  # DEFAULT = best available
    self.resnet = models.resnet50(weights=self.weights)

    # Extract layers up to the final conv block of the ResNet50
    self.feature_extractor = nn.Sequential(
                self.resnet.conv1,
                self.resnet.bn1,
                self.resnet.relu,
                self.resnet.maxpool,
                self.resnet.layer1,
                self.resnet.layer2,
                self.resnet.layer3,
                self.resnet.layer4
                )

    # Optionally freeze the earliest layers for transfer learning:
    # only the last `trainable_layers` children stay trainable
    layers_to_freeze = len(list(self.feature_extractor.children())) - trainable_layers
    for i, layer in enumerate(self.feature_extractor.children()):
      if i < layers_to_freeze:
        for param in layer.parameters():
          param.requires_grad = False

  def forward(self,x): # Forward function
    print(f"x: {x.shape}\n")
    x = self.feature_extractor(x)
    print(f"Backbone x.shape: {x.shape}\n")

    return x
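
To see what the freezing logic does without downloading ResNet weights, here is a minimal sketch that applies the same rule to a toy `nn.Sequential`. The helper name `freeze_early_layers` and the toy network are assumptions made for this illustration.

```python
import torch.nn as nn


def freeze_early_layers(module, trainable_layers):
    """Freeze every child except the last `trainable_layers` ones."""
    children = list(module.children())
    layers_to_freeze = max(len(children) - trainable_layers, 0)
    for layer in children[:layers_to_freeze]:
        for param in layer.parameters():
            param.requires_grad = False
    return module


# Toy network with 4 children; parameter-free layers such as ReLU still count as children
toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4), nn.Linear(4, 2))
freeze_early_layers(toy, trainable_layers=2)

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}")  # trainable: 30, frozen: 20
```

In the `Backbone` class above, `feature_extractor` has eight children, so `trainable_layers=3` freezes the first five (`conv1`, `bn1`, `relu`, `maxpool`, `layer1`) and leaves `layer2` to `layer4` trainable.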

#### Step 3 - Initialize the Backbone

In [4]:
# Build a backbone as instance
backbone = Backbone(pretrained=True, trainable_layers=3)

Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth


100%|██████████| 97.8M/97.8M [00:00<00:00, 151MB/s]


In [5]:
print(backbone)

Backbone(
  (resnet): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          

#### Step 4 - Modifying Pre-Trained Backbones

Sometimes, specific architectures require slight modifications to the backbone, such as adding an extra layer or changing the output size. Here is an example of adding a custom output layer:

In [6]:
class CustomBackbone(Backbone):
  def __init__(self, pretrained=True, trainable_layers=3, num_channels=512):
    super(CustomBackbone, self).__init__(pretrained, trainable_layers)

    # Add a 1x1 conv layer to reduce the output channels
    self.conv1x1 = nn.Conv2d(2048, num_channels, kernel_size=1)

  def forward(self, x):
    x = super(CustomBackbone, self).forward(x)

    x = self.conv1x1(x)
    print(f"CustomBackbone x.shape: {x.shape}\n")
    return x
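
The effect of the 1x1 convolution can be checked in isolation. The sketch below applies it to a random tensor with the same shape as the ResNet-50 feature map for a 224 x 224 input; the random tensor is a stand-in for real features.

```python
import torch
import torch.nn as nn

# Stand-in for backbone output: batch of 2, 2048 channels, 7 x 7 spatial grid
features = torch.randn(2, 2048, 7, 7)

# A 1x1 convolution mixes channels at each spatial position without touching the grid size
reduce_channels = nn.Conv2d(2048, 512, kernel_size=1)
reduced = reduce_channels(features)

print(reduced.shape)  # torch.Size([2, 512, 7, 7])
```

Only the channel dimension changes, so any head attached afterwards needs `in_channels=512` rather than 2048.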

#### Testing the Backbone

In [7]:
# Example input - batch of 2 images, 3 channels, 224 x 224
dummy_imput = torch.randn(2,3,224,224)
print(f"dummy input: {dummy_imput.shape}\n")

# forward pass
features = backbone(dummy_imput)

print(f"Output feature map size: {features.size()}\n")  # .size() and .shape return the same value
print(f"Output feature map.shape: {features.shape}\n")

dummy input: torch.Size([2, 3, 224, 224])

x: torch.Size([2, 3, 224, 224])

Backbone x.shape: torch.Size([2, 2048, 7, 7])

Output feature map size: torch.Size([2, 2048, 7, 7])

Output feature map.shape: torch.Size([2, 2048, 7, 7])



### **Key Takeaways**

**Pre-Trained Backbones Save Time**: Leveraging pre-trained weights accelerates convergence, especially with limited data.

**Customization**: Adding or modifying layers tailors the backbone to specific use cases.

**Freezing Layers**: Freezing early layers focuses training on high-level features.

This minimal implementation provides a foundation for building custom OD pipelines and makes it easy to experiment with different architectures and pretrained models.

## Building A Barebone Detection Head
### Purpose of the Detection Head

The detection head in an OD model processes the features extracted by the backbone and generates:

**Bounding Box Regression**: Predicts the coordinates of the bounding boxes around detected objects.

**Class Prediction**: Assigns a class label (or "background" for no object) to each predicted bounding box.

A detection head is typically a lightweight NN, often a combination of fully connected (dense) layers, that processes spatial feature maps.

### **Implementing A Simple Detection Head in PyTorch**

Here we implement a barebones multi-layer perceptron (MLP) detection head for BB regression and class prediction.

#### **Step 1 - Define The Detection Head**

The detection head comprises two separate MLPs - one for **bounding box regression** and another for **class prediction**.

In [8]:
## A class for detection Head Example

class DetectionHead(nn.Module):
  def __init__(self, in_channels, num_classes):
    super(DetectionHead, self).__init__()

    # Bounding Box Regression Head
    self.bbox_head = nn.Sequential(
        nn.AdaptiveAvgPool2d((1,1)),  # Global average pooling collapses the spatial grid to 1x1 (real detectors keep spatial information)
        nn.Flatten(), # Flatten the pooled [batch, in_channels, 1, 1] tensor to [batch, in_channels]
        nn.Linear(in_channels, 512),
        nn.ReLU(),
        nn.Linear(512,4)  # output: [x_min, y_min, x_max, y_max]
    )

    # Class prediction head
    self.class_head = nn.Sequential(
        nn.AdaptiveAvgPool2d((1,1)),
        nn.Flatten(),
        nn.Linear(in_channels, 512),
        nn.ReLU(),
        nn.Linear(512, num_classes)   # Output: logits for each class
    )

  def forward(self, x):
    # Bounding box regression
    print("BEFORE bbox_predictions Detection Head\n")
    bbox_predictions = self.bbox_head(x)
    print(f"bbox_predictions Detection Head: {bbox_predictions.shape}\n")

    # Class predictions
    print("BEFORE class_logits DETECTIONhead\n")
    class_logits = self.class_head(x)
    print(f"class logits Detection Head: {class_logits}\n")

    return bbox_predictions, class_logits

#### **Step 2 - Integrate the Head With the Backbone**

We now combine the **Backbone** and **DetectionHead** to create a complete object detection model.

In [9]:
class BarebonesObjectDetector(nn.Module):
  def __init__(self, backbone, num_classes):
    super(BarebonesObjectDetector, self).__init__()
    self.backbone = backbone
    self.detection_head = DetectionHead(in_channels=2048, num_classes=num_classes)

  def forward(self, x):
    # extract features using the backbone
    features = self.backbone(x)
    print(f"BarebonesObjectDetector features: {features.shape}")

    # forward pass through the detection head
    bbox_predictions, class_logits = self.detection_head(features)
    print(f"BarebonesObjectDetector bbox_pred: {bbox_predictions.shape}")
    print(f"BarebonesObjectDetector class_logits: {class_logits.shape}")

    return bbox_predictions, class_logits

#### **Step 3 - Initialize the Model**

In [10]:
# define the number of classes - 10 objects + 1 background class
num_classes = 11

# use the backbone implemented earlier
backbone = Backbone(pretrained=True, trainable_layers=3)

# Build the full object detection model
model = BarebonesObjectDetector(backbone=backbone, num_classes=num_classes)

print(model)

BarebonesObjectDetector(
  (backbone): Backbone(
    (resnet): ResNet(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (rel

#### **Step 4 - Testing the Model**

We will pass a dummy batch of images through the model to verify its functionality.

In [None]:
# Example input - batch of 2 images, 3 channels, 224 x 224
dummy_input = torch.randn(2,3,224,224)

# Forward pass
bbox_predictions, class_logits = model(dummy_input)

# output sizes
print(f"Bounding Box predictions: {bbox_predictions.size()}\n")  # Should be [batch_size, 4]
print(f"Class Predictions: {class_logits.size()}\n")  # Should be [batch_size, num_classes]

x: torch.Size([2, 3, 224, 224])

Backbone x.shape: torch.Size([2, 2048, 7, 7])

BarebonesObjectDetector features: torch.Size([2, 2048, 7, 7])
BEFORE bbox_predictions Detection Head

bbox_predictions Detection Head: torch.Size([2, 4])

BEFORE class_logits DETECTIONhead

class logits Detection Head: tensor([[-0.0100, -0.0340,  0.0309, -0.0218,  0.0057, -0.0339,  0.0176, -0.0401,
          0.0726, -0.0189,  0.0658],
        [ 0.0125, -0.0264, -0.0091, -0.0122,  0.0281, -0.0176, -0.0293, -0.0477,
          0.0604, -0.0393, -0.0045]], grad_fn=<AddmmBackward0>)

BarebonesObjectDetector bbox_pred: torch.Size([2, 4])
BarebonesObjectDetector class_logits: torch.Size([2, 11])
Bounding Box predictions: torch.Size([2, 4])

Class Predictions: torch.Size([2, 11])



#### **Key Features of This Implementation**

**Separation of Concerns**: The detection head is modular, allowing easy replacement or enhancement.

**Scalability**: The architecture supports multiple classes and custom backbone integrations.

**Simplicity**: Using MLPs for bounding box regression and classification provides a clear foundation for more complex heads like those in Faster R-CNN or YOLO.

This basic detection head is a stepping stone to understanding and implementing custom OD pipelines.
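
The class head returns raw logits, so turning them into a predicted label and a confidence score takes one more step. A minimal sketch with hand-written logits (the values are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 2 images and 3 classes
class_logits = torch.tensor([[2.0, 0.5, -1.0],
                             [0.1, 0.2, 3.0]])

probs = F.softmax(class_logits, dim=1)   # each row now sums to 1
scores, labels = probs.max(dim=1)        # confidence and class index per image

print(labels)  # tensor([0, 2])
```

The `scores` tensor is what a confidence threshold or NMS step would consume later in a detection pipeline.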

## **Implementing Forward Passes & Loss Functions**
### **Overview**

A forward pass in an OD model involves:

1. Passing the input image through the backbone to extract features.
2. Feeding the features into the detection head to obtain bounding boxes and class logits.
3. Calculating the losses for BB regression and classification.

We will implement this step by step.

### **Loss function**
#### **BB Regression Loss**

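We use **Smooth L1 Loss** (closely related to the Huber loss), which is quadratic for small errors and linear for large ones. This keeps gradients smooth near zero while making the loss less sensitive to outliers than a plain L2 loss.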

#### **Classification Loss**

We use **Cross-Entropy Loss** for class logits. It measures the difference between predicted class probabilities and true labels.
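
Both losses are easy to verify by hand. The sketch below recomputes them from their definitions and compares the results with PyTorch's built-ins, assuming the default `beta=1.0` and mean reduction of `nn.SmoothL1Loss`; the sample values are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Smooth L1: 0.5 * d^2 / beta if |d| < beta, else |d| - 0.5 * beta ---
pred = torch.tensor([0.5, 3.0])
target = torch.tensor([0.0, 0.0])
beta = 1.0

diff = (pred - target).abs()
manual_l1 = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()
builtin_l1 = nn.SmoothL1Loss()(pred, target)
print(manual_l1.item(), builtin_l1.item())  # both 1.3125 = (0.125 + 2.5) / 2

# --- Cross-entropy: negative log-probability of the true class ---
logits = torch.tensor([[2.0, 0.0]])
label = torch.tensor([0])

manual_ce = -F.log_softmax(logits, dim=1)[0, label.item()]
builtin_ce = F.cross_entropy(logits, label)
```

The small error (0.5) lands in the quadratic region and the large one (3.0) in the linear region, which is the behavior that keeps outliers from dominating the gradient.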

### **Implementing the Loss Function**

### **Forward Pass - From input to Predictions**

The forward pass uses the combined backbone and detection head implemented earlier.
For training, it also needs ground truth annotations: bounding boxes and class labels.


In [None]:
# A function to calculate the loss
def compute_losses(bbox_predictions, class_logits, targets):
  """
    Compute the total loss for BB regression and classification.

    Args:
      bbox_predictions (torch.Tensor): Predicted BB, shape [batch_size, 4].
      class_logits (torch.Tensor): Predicted class logits, shape [batch_size, num_classes].
      targets (dict): Ground truth with keys:
        - 'boxes': Ground truth BB, shape [batch_size, 4]
        - 'labels': Ground truth class labels, shape [batch_size]

    Returns:
      torch.Tensor: Total loss (sum of regression and classification losses)
  """

  # Smooth L1 Loss for BB regression
  bbox_loss_fn = nn.SmoothL1Loss()
  bbox_loss = bbox_loss_fn(bbox_predictions, targets['boxes'])

  # Cross Entropy Loss for classification
  class_loss_fn = nn.CrossEntropyLoss()
  class_loss = class_loss_fn(class_logits, targets['labels'])

  # Total loss
  total_loss = bbox_loss + class_loss

  return total_loss

In [None]:
# A forward pass function for OD

def forward_pass(model, images, targets=None):
  """
  Perform a forward pass through the model.

  Args:
    model (nn.Module): the OD model
    images (torch.Tensor): input images of shape [batch_size, channels, height, width]
    targets (dict, optional): Ground truth with keys 'boxes' and 'labels'.

  Returns:
    Tuple: Predicted BB and class logits - during inference,
    or total loss - during training.
  """

  # perform the forward pass through the model
  bbox_predictions, class_logits = model(images)
  if targets is not None:
    # if training, compute losses
    loss = compute_losses(bbox_predictions, class_logits, targets)

    return loss
  else:
    # if inference, return predictions
    return bbox_predictions, class_logits


### **Training Example - Calculating Total Loss**

In [None]:
# A training example with dummy data.
num_classes = 11  # 10 objects + 1 background
model = BarebonesObjectDetector(backbone, num_classes)

# Example input images (batch_size, channels, height, width)
images = torch.randn(2,3,224,224)

# ground truth - dummy data
targets = {
    'boxes': torch.tensor([[50,60,150,200],[30,40,120,160]], dtype=torch.float32), # boxes shape: [batch_size, 4]
    'labels': torch.tensor([3,5], dtype=torch.long)  # labels shape: [batch_size]
}

# Forward pass and compute loss
loss = forward_pass(model, images, targets)
print(f"Total Loss: {loss.item()}\n")

#### **Explanation of The Process**

**Input Images**: A batch of images is fed into the model.

**Feature Extraction**: The backbone extracts feature maps from the images.

**Bounding Box & Class Prediction**: The detection head predicts bounding boxes and class logits.

**Loss Computation**: BB regression loss is computed between predicted BB and ground truth boxes. Classification loss compares predicted class probabilities with ground truth labels.

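**Total Loss**: The regression and classification losses are summed into a single scalar.

**Modularity**: Easy to replace or fine-tune specific components, such as the loss function or the detection head.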

**Customization**: Ground truth processing, such as custom label encoding, can be easily integrated.

**Simplicity**: Provides a clear understanding of how BB regression and classification are optimized.
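
To show how the summed loss drives learning, here is a minimal sketch of a single optimization step. It uses a toy two-head model on flat features instead of the ResNet-based detector so that it runs quickly; `ToyDetector`, `bbox_weight`, and `class_weight` are hypothetical names introduced for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class ToyDetector(nn.Module):
    """Toy stand-in for the detector: two linear heads on flat features."""

    def __init__(self, in_features=8, num_classes=3):
        super().__init__()
        self.bbox_head = nn.Linear(in_features, 4)
        self.class_head = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.bbox_head(x), self.class_head(x)


model = ToyDetector()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(2, 8)  # dummy features for a batch of 2
targets = {"boxes": torch.rand(2, 4), "labels": torch.tensor([0, 2])}

# Hypothetical weights for balancing the two loss terms
bbox_weight, class_weight = 1.0, 1.0

bbox_pred, class_logits = model(features)
loss = (bbox_weight * F.smooth_l1_loss(bbox_pred, targets["boxes"])
        + class_weight * F.cross_entropy(class_logits, targets["labels"]))

optimizer.zero_grad()  # clear gradients from any previous step
loss.backward()        # backpropagate through both heads
optimizer.step()       # update the parameters
print(f"loss: {loss.item():.4f}")
```

Because the two terms can have very different magnitudes (pixel-scale box errors versus log-probabilities), weighting them or normalizing box coordinates is a common adjustment.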

## **Customizing Anchor Generation and NMS**
### Introduction

In OD, **anchor generation** and **non-maximum suppression (NMS)** are essential components for predicting and refining BB.

**Anchor Generation**: Builds predefined BB (anchors) of various sizes and aspect ratios at different positions in the image.

**Non-Maximum Suppression (NMS)**: Removes overlapping boxes and retains the most confident prediction for each object. Customizing these steps allows developers to optimize model performance for specific datasets and use cases.

### **Anchor Generation - The Foundation of Detection**

**What is Anchor Generation?**

Anchors are grid-based BB that help the model predict object locations.
During training, each anchor is assigned a label based on its IoU with ground truth boxes.

**Code for Anchor Generation**:

In [None]:
def generate_anchors(base_size=16, scales=[0.5,1.0,2.0], aspect_ratios=[0.5,1.0,2.0]):
  """
    Generate anchor boxes based on scales and aspect ratios.

    Args:
      base_size (int): The size of the base anchor.
      scales (list): Scaling factors for anchors.
      aspect_ratios (list): Aspect ratios for anchors.

    Returns:
      torch.Tensor: Generated anchors of shape [num_anchors, 4] (x_min, y_min, x_max, y_max)
  """

  anchors = []
  for scale in scales:
    for ratio in aspect_ratios:
      w = base_size * scale * (ratio**0.5)
      h = base_size * scale / (ratio**0.5)  # multiply by scale so the area is (base_size * scale)^2 and w / h = ratio
      x_min, y_min = -w / 2, -h / 2
      x_max, y_max = w / 2, h / 2
      anchors.append([x_min, y_min, x_max, y_max])

  return torch.tensor(anchors, dtype=torch.float32)

In [None]:
# Example usage:
anchors = generate_anchors()
print(anchors)

### **Key Considerations**:

**Base Size**: Determines the scale of the grid.

**Scales & Aspect Ratios**: Should match the size and shapes of objects in the dataset.
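
Base anchors are centered at the origin, so a detector still has to place a copy at every feature-map location. A minimal sketch of that tiling step, assuming a 7 x 7 feature map with stride 32 (matching a 224 x 224 input to ResNet-50); the function name `tile_anchors` is an assumption.

```python
import torch


def tile_anchors(base_anchors, feature_size, stride):
    """Shift [A, 4] origin-centered anchors to the center of every feature-map cell."""
    feat_h, feat_w = feature_size
    # Cell centers in image coordinates
    shift_x = (torch.arange(feat_w, dtype=torch.float32) + 0.5) * stride
    shift_y = (torch.arange(feat_h, dtype=torch.float32) + 0.5) * stride
    grid_y, grid_x = torch.meshgrid(shift_y, shift_x, indexing="ij")
    # Each shift moves x_min/x_max by grid_x and y_min/y_max by grid_y
    shifts = torch.stack([grid_x, grid_y, grid_x, grid_y], dim=-1).reshape(-1, 1, 4)
    return (shifts + base_anchors.reshape(1, -1, 4)).reshape(-1, 4)


base_anchors = torch.tensor([[-8.0, -8.0, 8.0, 8.0]])  # one 16 x 16 anchor at the origin
all_anchors = tile_anchors(base_anchors, feature_size=(7, 7), stride=32)

print(all_anchors.shape)  # torch.Size([49, 4])
print(all_anchors[0])     # tensor([ 8.,  8., 24., 24.])
```

With `A` base anchors and an `H x W` feature map, the result has `H * W * A` rows, which is the set of candidates the detection head scores.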

### **Non-Maximum Suppression - Filtering Overlaps**

**What is NMS?**

NMS ensures that only the most confident bounding box is retained among overlapping detections.
It uses a confidence score threshold and an IoU threshold to filter predictions.

**Code For NMS in PyTorch**

In [None]:
def calculate_iou(box1, box2):
  """
    Compute IoU between two sets of boxes.

    Args:
      box1 (torch.Tensor): Single box, shape [1,4]
      box2 (torch.Tensor): Multiple boxes, shape [N,4]

    Returns:
      torch.Tensor: IoU scores for each box in box2.
  """

  # Intersection area: overlap width times overlap height.
  # .clamp(0) sets negative overlaps (boxes that do not intersect) to zero.
  inter = (
      torch.min(box1[:,2], box2[:,2]) - torch.max(box1[:,0], box2[:,0])
  ).clamp(0) * (
      torch.min(box1[:,3], box2[:,3]) - torch.max(box1[:,1],box2[:,1])
  ).clamp(0)

  box1_area = (box1[:, 2] - box1[:,0]) * (box1[:, 3]- box1[:,1])
  box2_area = (box2[:, 2] - box2[:,0]) * (box2[:, 3]- box2[:,1])

  union = box1_area + box2_area - inter
  IoU = inter / union

  return IoU

In [None]:
def non_maximum_suppresion(boxes, scores, iou_threshold=0.5):
  """
    Perform non-maximum suppression (NMS) on BB.

    Args:
      boxes (torch.Tensor): Predicted boxes, shape [num_boxes, 4]
      scores (torch.Tensor): Confidence scores, shape [num_boxes].
      iou_threshold (float): IoU threshold for NMS

    Returns:
      torch.Tensor: Indices of the retained boxes.
  """

  indices = torch.argsort(scores, descending=True)
  keep = []

  while indices.numel() > 0:
    current = indices[0]
    keep.append(current.item())  # store a plain int so torch.tensor(keep) builds a clean index tensor

    if indices.numel() == 1:
      break

    remaining_boxes = boxes[indices[1:]]
    iou = calculate_iou(boxes[current].unsqueeze(0), remaining_boxes)
    indices = indices[1:][iou < iou_threshold]

  return torch.tensor(keep, dtype=torch.long)

#### **Impact of Anchor Sizes & IoU Threshold**

**Anchor Sizes**: Larger anchors suit larger objects, while smaller anchors better capture small objects.

**IoU Threshold**: A **higher IoU threshold** suppresses fewer boxes, so more overlapping detections are retained - increasing recall but reducing precision.
A **lower IoU threshold** favors precision by keeping only the most confident box in each overlapping group.

**Example**:

* Small Objects: smaller base sizes and lower IoU thresholds.
* Large Objects: Larger base sizes and higher IoU thresholds.

Customizing anchor generation and NMS lets you fine-tune an OD model for specific datasets.

## **Training the Model with Custom Pipelines**

Training an OD model involves integrating data preparation, model architecture, loss functions, and optimization into a cohesive workflow.
With barebones PyTorch, you keep full control over these components.

### **Custom DataLoaders & Preprocessing Pipelines**
#### **Data Preparation**

In a barebones implementation, building custom Dataset and DataLoader classes is essential for loading and preprocessing data.

**Code Example - Custom Dataset Class**

In [None]:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class CustomObjectDetectionDataset(Dataset): # use Dataset from PyTorch
  def __init__(self, annotation, image_dir, transform=None):
    """
      Args:
        annotation (list): List of dictionaries, one per image, with keys
          'image_name', 'boxes', and 'label'.
        image_dir (str): Directory with all the images.
        transform (callable, optional): Transformation to be applied to each image.
    """
    self.annotation = annotation
    self.image_dir = image_dir
    self.transform = transform

  def __len__(self):
    return len(self.annotation)

  def __getitem__(self, idx):
    annotation = self.annotation[idx]
    image_path = f"{self.image_dir}/{annotation['image_name']}"
    image = Image.open(image_path).convert("RGB")
    boxes = torch.tensor(annotation['boxes'], dtype=torch.float32)
    labels = torch.tensor(annotation['label'], dtype=torch.long)

    sample = {"image": image, "boxes": boxes, "labels": labels}

    if self.transform:
      sample["image"] = self.transform(sample["image"])

    return sample
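
Images usually contain different numbers of objects, so the default batching of `DataLoader` cannot stack the `boxes` tensors. A common workaround is a custom `collate_fn` that stacks the images and keeps boxes and labels as per-image lists. The sketch below uses a toy dataset as a stand-in; `ToyDetectionDataset` and `detection_collate` are hypothetical names.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDetectionDataset(Dataset):
    """Hypothetical stand-in: image i has i + 1 boxes."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        num_boxes = idx + 1
        return {
            "image": torch.zeros(3, 32, 32),
            "boxes": torch.zeros(num_boxes, 4),
            "labels": torch.zeros(num_boxes, dtype=torch.long),
        }


def detection_collate(batch):
    """Stack same-sized images; keep variable-length boxes and labels as lists."""
    images = torch.stack([sample["image"] for sample in batch])
    boxes = [sample["boxes"] for sample in batch]
    labels = [sample["labels"] for sample in batch]
    return {"image": images, "boxes": boxes, "labels": labels}


loader = DataLoader(ToyDetectionDataset(), batch_size=2, shuffle=False,
                    collate_fn=detection_collate)
batch = next(iter(loader))

print(batch["image"].shape)               # torch.Size([2, 3, 32, 32])
print([b.shape[0] for b in batch["boxes"]])  # [1, 2]
```

This assumes all images have already been resized to a common shape, as the `Resize` transform in this tutorial does.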

In [None]:
# Download the COCO dataset annotations for example usage
import requests
from pathlib import Path
import json
import zipfile

# setup path to a datafolder
data_path = Path("data/")
image_path = data_path / "images"


# if the image folder does not exist, create it
if image_path.is_dir():
  print(f"{image_path} directory already exists")
else:
  print("Image path does not exist, building it")
  image_path.mkdir(parents=True, exist_ok=True)

url = f"http://images.cocodataset.org/annotations/annotations_trainval2014.zip"
#f"http://images.cocodataset.org/annotations/image_info_test2017.zip"
local_file = data_path / f"coco_annotations.zip"
print(f"Downloading {url}")
request = requests.get(url)

with open(local_file, "wb") as f:
  f.write(request.content)
  print(f"Saved {local_file}")


# Now extract the final zip
zip_path = data_path / "coco_annotations.zip"
with zipfile.ZipFile(zip_path, "r") as zip_ref:
  print(f"Extracting {zip_path}")
  zip_ref.extractall(image_path)
  print(f"{zip_path} extracted into {image_path}")

In [None]:
!wget -O data/coco_images.zip http://images.cocodataset.org/zips/train2014.zip
#http://images.cocodataset.org/zips/test2017.zip

In [None]:
# This part of the code may change in a final version - it only fetches the dataset for testing the algorithms
!unzip -q data/coco_images.zip -d data/images

In [None]:
# # downloading the COCO DATASET for example usage
# url = f"http://images.cocodataset.org/zips/test2017.zip"
# local_file = data_path / f"coco_images.zip"
# print(f"Downloading {url} to {local_file}")

# #request = requests.get(url)

# with requests.get(url, stream=True) as r:
#   r.raise_for_status()

#   with open(local_file, "wb") as f:
#     for chunk in r.iter_content(chunk_size=8192):
#       if chunk:
#         f.write(chunk)

# # with open(local_file, "wb") as f:
# #   f.write(request.content)
# #   print(f"Saved {local_file}")
# print(f"✅ Saved{local_file}")



In [None]:
# # Now extract the final zip
# import os
# #zip_path = image_path / "coco_images.zip"
# print(f"zip_file: {local_file }")
# print(f"File size: {os.path.getsize(local_file) / (1024*1024):.2f}MB")
# with zipfile.ZipFile(local_file , "r") as zip_ref:
#   print(f"Extracting {local_file }")
#   zip_ref.extractall(image_path )
#   print(f"{local_file} extracted into {image_path}")

In [None]:
# Now getting the annotations variable
with open("./data/images/annotations/captions_train2014.json", "r") as g:
  annotations = json.load(g)

In [None]:
# Testing the CustomObjectDetectionDataset code
transform = transforms.Compose([
    transforms.Resize((300,300)),
    transforms.ToTensor()
])
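# NOTE: the dataset indexes a list of dicts with 'image_name', 'boxes' and 'label' keys;
# the raw COCO JSON is a dict and must be converted first (see the later sections)
dataset = CustomObjectDetectionDataset(annotations, "./data/images/train2014/", transform)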
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)

In [None]:
print(f"type(dataset.annotation): {type(dataset.annotation)}")
print("len(dataset.annotation): ", len(dataset.annotation) if hasattr(dataset.annotation, '__len__') else "no length")
#print(dataset.annotation) # DO NOT UNCOMMENT!

### **Training Loop with Barebones PyTorch Code**

#### **Steps in the Training Loop**:

1. Load data batches using DataLoader.
2. Perform a forward pass through the model.
3. Compute losses.
4. Backpropagate and update model parameters.

### **Code Example - Training Loop**

In [None]:
# Example training loop for detection (assumes `device` and `compute_losses` are defined earlier in the notebook)
import torch.nn as nn
import torch.optim as optim

def train_model_detection(model: nn.Module, dataloader: torch.utils.data.DataLoader, num_epochs: int, learning_rate: float):
  optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # define the optimizer; the learning rate can be set manually or via args
  for epoch in range(num_epochs):  # loop over epochs; can be adapted to a train/validation/test loop structure such as Daniel Bourke's
    model.train()
    total_loss = 0

    for batch in dataloader:  # iterate over the batches
      images = batch['image'].to(device)  # batch of images
      boxes = batch['boxes'].to(device)  # ground-truth bounding boxes
      labels = batch['labels'].to(device)  # ground-truth labels

      # reset gradients
      optimizer.zero_grad()

      # forward_pass   # bbox_predictions, class_logits, targets
      outputs = model(images)
      loss = compute_losses(boxes, outputs, labels)

      # backward pass and optimization
      loss.backward()
      optimizer.step()

      total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader)}")

In [None]:
# num_epochs = 2
# learning_rate = 0.001
# train_model_detection(model, dataloader, num_epochs, learning_rate)

### **Debugging Techniques for Model Convergence**

1. **Validate Data and Labels**:
Ensure images, bounding boxes, and labels are loaded correctly.

Visualize samples using matplotlib to confirm proper data augmentation and label alignment.

2. **Monitor Loss Trends**:
Check whether the total loss decreases over epochs. If it does not:

Verify that gradients are updating, for example with torch.autograd hooks.

Try a smaller learning rate.

3. **Overfit on a Small Batch**:
Test the model's ability to fit a single batch.
Rapid convergence indicates a functional architecture.

4. **Inspect Outputs**:
Compare predictions (BBs and labels) against the ground truth.
Use debugging prints.

5. **Adjust Anchors & Hyperparameters**:
Poor detection accuracy may indicate a mismatch between anchor sizes and the dataset's object dimensions.

Training an OD model with a custom pipeline gives maximum flexibility.
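
The gradient check and the single-batch overfit test from the list can be sketched together. This is a minimal illustration, not the notebook's detector: the tiny `nn.Sequential` model, the random batch, and the MSE loss are hypothetical stand-ins, and gradients are inspected directly through `param.grad` rather than through autograd hooks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny regressor and one fixed batch of 4 samples
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
inputs = torch.randn(4, 8)
targets = torch.rand(4, 4)  # e.g. normalized box coordinates

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
  optimizer.zero_grad()
  loss = loss_fn(model(inputs), targets)
  loss.backward()

  if step == 0:
    # Gradient check: every parameter should have received a gradient
    for name, param in model.named_parameters():
      assert param.grad is not None, f"{name} received no gradient"
      print(f"{name}: grad norm = {param.grad.norm().item():.4f}")

  optimizer.step()
  losses.append(loss.item())

# On a single repeated batch the loss should drop sharply
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

If the loss does not fall on a single repeated batch, the problem is more likely in the model, the loss, or the data pipeline than in the hyperparameters.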

## **Evaluating Object Detection Models**

Evaluation is a critical step in OD workflows: it ensures your model performs well on unseen data.
Key metrics: mAP, IoU, Precision, Recall, F-Score.

#### **Key metrics for OD**
* **IoU**: Measures the overlap between predicted BBs and ground-truth boxes. IoU thresholds (e.g. 0.5, 0.75) define whether a detection counts as a TP.
* **mAP**: Primary metric for evaluating OD models.
  * Steps:
    * Compute Precision and Recall at different confidence thresholds.
    * Calculate AP for each class.
    * Take the mean of the APs across all classes.
    * Evaluate at specific IoU thresholds, e.g. mAP50 (IoU 0.5) and mAP50-95 (averaged over IoU 0.5 to 0.95).
* **Precision-Recall Curve**:
  * **Precision**: Fraction of true positives among all predicted positives.
  * **Recall**: Fraction of true positives among all actual positives.
  * Plotting the PR curve helps visualize the trade-off between the two.
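
As a rough sketch of how AP follows from the precision-recall curve for a single class: detections are ranked by confidence, precision and recall are accumulated down the ranking, and the area under the interpolated curve is summed. The function name `average_precision` and the sample detections are hypothetical, and this uses all-point interpolation, which is only one of several AP conventions.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
  """AP as the area under the precision-recall curve (all-point interpolation)."""
  # Rank detections from most to least confident
  order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

  precisions, recalls = [], []
  tp = fp = 0
  for i in order:
    if is_true_positive[i]:
      tp += 1
    else:
      fp += 1
    precisions.append(tp / (tp + fp))
    recalls.append(tp / num_ground_truth if num_ground_truth > 0 else 0.0)

  # Interpolate: precision at each point becomes the max precision to its right
  for k in range(len(precisions) - 2, -1, -1):
    precisions[k] = max(precisions[k], precisions[k + 1])

  # Sum rectangle areas wherever recall increases
  ap, prev_recall = 0.0, 0.0
  for p, r in zip(precisions, recalls):
    ap += (r - prev_recall) * p
    prev_recall = r
  return ap

# Hypothetical detections for one class: confidence, and whether each matched a ground-truth box
scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [True, False, True, True]
ap = average_precision(scores, is_tp, num_ground_truth=4)
print(f"AP: {ap:.4f}")  # AP: 0.6250
```

mAP is then the mean of this value over all classes, computed at a chosen IoU threshold.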

### **Minimalist Evaluation Script**

#### **Code Example - Evaluation mAP & IoU**



In [None]:
import numpy as np

def compute_iou(box1, box2):
  """
    Computes IoU between two boxes
  """
  x1 = max(box1[0], box2[0])
  y1 = max(box1[1], box2[1])
  x2 = min(box1[2], box2[2])
  y2 = min(box1[3], box2[3])

  inter_area = max(0, x2 - x1) * max(0, y2 - y1)  # get the intersection area
  box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1]) # get the area of box1
  box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1]) # get the area of box2
  union_area = box1_area + box2_area - inter_area # get the union of the area of the 2 boxes

  return inter_area / union_area if union_area > 0 else 0

In [None]:
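def evaluate_model(model, dataloader, iou_threshold=0.5):
  """
  Simplified evaluation: mean per-image precision and recall at one IoU threshold.
  This is a proxy, not true mAP (no per-class AP over confidence thresholds).
  """
  model.eval()
  all_precisions = []
  all_recalls = []

  for batch in dataloader:
    images = batch["image"]  # batch of images
    true_boxes = batch['boxes']  # ground-truth boxes

    with torch.no_grad():
      predictions = model(images)

    for i, pred in enumerate(predictions):
      # process the most confident predictions first
      order = pred['scores'].argsort(descending=True)
      pred_boxes = pred['boxes'][order].tolist()
      gt_boxes = true_boxes[i].tolist()

      # match each prediction to its best still-unmatched ground-truth box
      matched_gt = set()
      tp = 0
      for pred_box in pred_boxes:
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gt_boxes):
          if j in matched_gt:
            continue
          iou = compute_iou(pred_box, gt_box)
          if iou > best_iou:
            best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold:
          matched_gt.add(best_j)
          tp += 1

      fp = len(pred_boxes) - tp
      fn = len(gt_boxes) - tp

      all_precisions.append(tp / (tp + fp) if tp + fp > 0 else 0)
      all_recalls.append(tp / (tp + fn) if tp + fn > 0 else 0)

  mean_precision = float(np.mean(all_precisions)) if all_precisions else 0.0
  mean_recall = float(np.mean(all_recalls)) if all_recalls else 0.0
  print(f"Mean precision: {mean_precision:.4f}, mean recall: {mean_recall:.4f} (IoU >= {iou_threshold})")
  return mean_precision, mean_recall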


In [None]:
#evaluate_model(model, test_dataloader)

## Scaling Up - Modularity & Optimization
Scaling an OD model from a barebones implementation to a production-grade system requires a focus on modularity and optimization.
Modular code enhances reusability and experimentation, while optimization techniques improve performance and reduce training time.

### **Modularizing your code**
Modularity is essential for swapping different components: backbones, detection heads, and loss functions.

#### **Component-Based Architecture**

Divide the model into independent modules:

* **Backbone**: Feature extraction (e.g. ResNet, MobileNet).
* **Head**: Classification and Regression layers.
* **Anchor Generator**: Generate anchor boxes.
* **Loss Functions**: Compute classification and bounding box losses.

**Examples of Modularization**:
```python
class CustomObjectDetector(torch.nn.Module):
  def __init__(self, backbone, detection_head, anchor_generator):
    super(CustomObjectDetector, self).__init__()
    self.backbone = backbone  # the network backbone
    self.detection_head = detection_head  # the NN used as detector
    self.anchor_generator = anchor_generator # the code that generate anchors

  def forward(self, images):
    features = self.backbone(images)  # get the features from backbone NN
    anchors = self.anchor_generator(features)
    predictions = self.detection_head(features, anchors)

    return predictions
```
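
To see the wiring end to end, here is a sketch that plugs toy components into the same class. `ToyAnchorGenerator` and `ToyHead` are hypothetical stand-ins invented for this illustration (one square anchor per feature-map cell, 1x1 convolutions for offsets and logits); a real detector would use proper anchor scales and box decoding.

```python
import torch
import torch.nn as nn

class CustomObjectDetector(nn.Module):
  """Same wiring as the class above, repeated so this sketch runs on its own."""
  def __init__(self, backbone, detection_head, anchor_generator):
    super().__init__()
    self.backbone = backbone
    self.detection_head = detection_head
    self.anchor_generator = anchor_generator

  def forward(self, images):
    features = self.backbone(images)
    anchors = self.anchor_generator(features)
    return self.detection_head(features, anchors)

class ToyAnchorGenerator(nn.Module):
  """Hypothetical: one square anchor (cx, cy, w, h) per feature-map cell, in feature-map units."""
  def __init__(self, size=2.0):
    super().__init__()
    self.size = size

  def forward(self, features):
    _, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    centers = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() + 0.5
    sizes = torch.full_like(centers, self.size)
    return torch.cat([centers, sizes], dim=1)  # shape (h*w, 4)

class ToyHead(nn.Module):
  """Hypothetical: predicts 4 box offsets and class logits per feature-map cell."""
  def __init__(self, in_channels, num_classes):
    super().__init__()
    self.num_classes = num_classes
    self.box = nn.Conv2d(in_channels, 4, kernel_size=1)
    self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=1)

  def forward(self, features, anchors):
    b = features.shape[0]
    offsets = self.box(features).permute(0, 2, 3, 1).reshape(b, -1, 4)
    logits = self.cls(features).permute(0, 2, 3, 1).reshape(b, -1, self.num_classes)
    # Naive decoding for illustration: add raw offsets to the anchors
    return {"boxes": anchors.unsqueeze(0) + offsets, "logits": logits}

backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
model = CustomObjectDetector(backbone, ToyHead(8, num_classes=3), ToyAnchorGenerator())
out = model(torch.randn(2, 3, 32, 32))
print(out["boxes"].shape, out["logits"].shape)
```

Because each component is passed in through the constructor, swapping the backbone or the head only changes the line that builds `model`.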

### **Configuration-Driven Design**

Use configuration files (YAML or JSON) to define model parameters, datasets, and training options.

This approach simplifies switching between experiments.
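
A minimal sketch of the idea using JSON and a dataclass. The field names (`backbone`, `num_classes`, and so on) and their values are illustrative assumptions, not a schema defined by this notebook; in practice the text would come from a file on disk.

```python
import json
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
  backbone: str
  num_classes: int
  batch_size: int
  learning_rate: float
  num_epochs: int

# Inline here for a self-contained example; normally read from e.g. a config file
config_text = """
{
  "backbone": "resnet50",
  "num_classes": 3,
  "batch_size": 4,
  "learning_rate": 0.001,
  "num_epochs": 2
}
"""

# Unknown or missing keys raise a TypeError, which catches typos in the config early
config = ExperimentConfig(**json.loads(config_text))
print(config)
```

Switching experiments then means pointing the script at a different config file instead of editing code.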

### **Optimization Techniques**:
#### **Mixed Precision Training**
Leverage lower precision (float16) for faster computation while maintaining accuracy.

**Code Snippet for Mixed Precision**
```python
# Untested sketch: requires a CUDA device; model, optimizer, dataloader and compute_loss must already be defined
scaler = torch.cuda.amp.GradScaler()

for images, targets in dataloader:
  optimizer.zero_grad()
  with torch.cuda.amp.autocast():
    predictions = model(images)
    loss = compute_loss(predictions, targets)
  scaler.scale(loss).backward()
  scaler.step(optimizer)
  scaler.update()
```

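#### **Distributed Data Parallelism**
Use multiple GPUs to parallelize training. The process group must be initialized (torch.distributed.init_process_group) before wrapping the model.

```python
model = torch.nn.parallel.DistributedDataParallel(model)
```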

#### **Profiling & Bottleneck Detection**

Use tools like torch.profiler to identify performance bottlenecks.

Optimize data loading, matrix operations, and redundant computations.

```python
with torch.profiler.profile(
  schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
  on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),  # writes the traces to ./log
  record_shapes=True,
  with_stack=True
) as prof:
  for images, targets in dataloader:
    predictions = model(images)
    prof.step()  # advance the profiler schedule after each iteration
```

## **Real-World Use Case - Adapting to Custom Datasets**

Adapting an OD model to a custom dataset involves several key steps: dataset preparation, annotation formatting, and custom data loading. By understanding these processes, you can seamlessly integrate your barebones OD model into specialized workflows.

### **Dataset Preparation**

#### **Define the Dataset**:
Identify the specific classes and tasks for detection (animals, vehicles, objects, ...).
Split the data into training, validation, and test sets, or use cross-validation.

#### **Annotation**
Annotations include BBs and class labels, and optionally segmentation masks.

* Popular formats:
  * **COCO**: JSON-based format; supports BBs, segmentation masks, and keypoints.
  * **Pascal VOC**: XML-based format for BB and labels
  * **YOLO**: text files with class and BB coordinates normalized to image size.
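
The three formats also store box coordinates differently: COCO uses `[x, y, width, height]` in pixels, Pascal VOC uses `[xmin, ymin, xmax, ymax]` in pixels, and YOLO uses `[x_center, y_center, width, height]` normalized by the image size. A small sketch of the conversions follows; the helper names are hypothetical and not part of any library.

```python
def voc_to_coco(box):
  """[xmin, ymin, xmax, ymax] (pixels) -> [x, y, width, height] (pixels)."""
  xmin, ymin, xmax, ymax = box
  return [xmin, ymin, xmax - xmin, ymax - ymin]

def coco_to_yolo(box, img_width, img_height):
  """[x, y, width, height] (pixels) -> [x_center, y_center, width, height], normalized to [0, 1]."""
  x, y, w, h = box
  return [(x + w / 2) / img_width, (y + h / 2) / img_height, w / img_width, h / img_height]

def yolo_to_voc(box, img_width, img_height):
  """Normalized [x_center, y_center, width, height] -> [xmin, ymin, xmax, ymax] (pixels)."""
  xc, yc, w, h = box
  return [
    (xc - w / 2) * img_width,
    (yc - h / 2) * img_height,
    (xc + w / 2) * img_width,
    (yc + h / 2) * img_height,
  ]

# Example box in a 1000x800 image
voc_box = [100, 200, 300, 400]
coco_box = voc_to_coco(voc_box)
yolo_box = coco_to_yolo(coco_box, img_width=1000, img_height=800)
print(coco_box)  # [100, 200, 200, 200]
print(yolo_box)
print(yolo_to_voc(yolo_box, img_width=1000, img_height=800))
```

Mixing up these conventions is a common source of silently wrong boxes, so it is worth checking which one each tool in the pipeline expects.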

### **Converting Annotations to COCO Format**

If your annotations are not in a standard format, you may need to convert them.

**Example - Converting a CSV to COCO format**:


In [None]:
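import csv
import json

def csv_to_coco(csv_file, images_dir, output_file):
  """
  Convert a header-less CSV with rows (filename, class_name, xmin, ymin, xmax, ymax) to COCO JSON.
  """
  coco_format = {
      "images": [],
      "annotations": [],
      "categories": []
  }

  categories = {"person": 1, "car": 2, "dog": 3}   # example categories
  image_ids = {}  # filename -> image id, so each image is registered only once
  annotation_id = 1

  with open(csv_file, newline="") as f:
    for filename, class_name, xmin, ymin, xmax, ymax in csv.reader(f):
      xmin, ymin, xmax, ymax = (float(v) for v in (xmin, ymin, xmax, ymax))

      if filename not in image_ids:
        image_ids[filename] = len(image_ids) + 1
        coco_format["images"].append({
            "id": image_ids[filename],
            "file_name": filename,
            "width": 1280,  # example width; replace with the real size of the image in images_dir
            "height": 720,  # example height
        })

      coco_format["annotations"].append({
          "id": annotation_id,
          "image_id": image_ids[filename],
          "category_id": categories[class_name],
          "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],  # COCO boxes are [x, y, width, height]
          "area": (xmax - xmin) * (ymax - ymin),
          "iscrowd": 0
      })
      annotation_id += 1

  for name, category_id in categories.items():
    coco_format["categories"].append({"id": category_id, "name": name})

  with open(output_file, 'w') as f:
    json.dump(coco_format, f, indent=4)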

In [None]:
# usage # uncomment below
#csv_to_coco(csv_file="annotations.csv", images_dir="images/", output_file="dataset.json")


### **Loading Custom Data in PyTorch**

**Custom Dataset Class**: Subclass torch.utils.data.Dataset to build a dataset class tailored to your annotations and preprocessing needs.

**Example**

In [None]:
import json
import os
from collections import defaultdict

import torch
from PIL import Image

class CustomDataset(torch.utils.data.Dataset):
  def __init__(self, annotations, images_dir, transforms=None):
    # load the COCO-style JSON file
    with open(annotations, "r") as f:
      self.annotations = json.load(f)
    self.images_dir = images_dir
    self.transforms = transforms

    # index annotations by image id once, so __getitem__ does not scan the full list
    self.anns_by_image = defaultdict(list)
    for ann in self.annotations["annotations"]:
      self.anns_by_image[ann["image_id"]].append(ann)

  def __len__(self):
    return len(self.annotations["images"])

  def __getitem__(self, idx):
    img_info = self.annotations["images"][idx]
    img_id = img_info["id"]
    image_path = os.path.join(self.images_dir, img_info["file_name"])
    image = Image.open(image_path).convert("RGB")

    # extract BBs (COCO format: [x, y, width, height]) and labels for this image
    anns = self.anns_by_image.get(img_id, [])
    boxes = [ann["bbox"] for ann in anns]
    labels = [ann["category_id"] for ann in anns]

    # convert to tensors; reshape keeps a (0, 4) shape for images without annotations
    boxes = torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4)
    labels = torch.tensor(labels, dtype=torch.int64)

    target = {"boxes": boxes, "labels": labels}

    if self.transforms:
      image, target = self.transforms(image, target)

    return image, target

In [None]:
# Images in a detection dataset usually have different sizes (width/height), so PyTorch cannot stack them into one batch tensor without resizing or padding.
# The default collation function (default_collate) tries to stack the items in a batch, but it cannot stack PIL images directly and raises:
# TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.Image.Image'>
# A custom collate_fn avoids the error by returning the images and targets as tuples instead.

def collate_fn(batch):
  return tuple(zip(*batch))
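
To make the effect of `zip(*batch)` concrete, here is a small sketch with plain Python values standing in for images and targets. The sample strings and dicts are hypothetical placeholders for the `(image, target)` pairs a dataset would return.

```python
def collate_fn(batch):
  # Transpose a list of (image, target) pairs into (images, targets)
  return tuple(zip(*batch))

# Hypothetical stand-ins for (image, target) samples with varying numbers of objects
batch = [
  ("image_0", {"labels": [1]}),
  ("image_1", {"labels": [2, 3]}),
  ("image_2", {"labels": []}),
]

images, targets = collate_fn(batch)
print(images)   # ('image_0', 'image_1', 'image_2')
print(targets)  # ({'labels': [1]}, {'labels': [2, 3]}, {'labels': []})
```

Nothing is stacked, so each image and each target keeps its own size; the training loop is then responsible for iterating over the tuples or padding them.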

In [None]:
dataset = CustomDataset(annotations="./data/images/annotations/instances_train2014.json", images_dir="./data/images/train2014/")

In [None]:
dataloader = torch.utils.data.DataLoader(dataset,
                                         batch_size=4,
                                         shuffle=True,
                                         collate_fn=collate_fn)

In [None]:
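# NOTE: train_model_detection indexes batches as dicts (batch['image'], ...), while this dataloader
# yields (images, targets) tuples of PIL images; adapt the loop or the dataset/collate_fn before running.
num_epochs = 2
learning_rate = 0.001
train_model_detection(model, dataloader, num_epochs, learning_rate)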