---
title: "Implementing a Positive Control"
date: 06-11-2024
date-format: short
execute:
    error: true
---

## Introduction 

Establishing a positive control is necessary to validate that my deep learning model is performing as expected. This validation requires comparing evaluation metrics generated by my model against established benchmarks from the literature using an identical task and dataset. A significant discrepancy would indicate potential model dysfunction. The challenge is finding a paper to which I can compare my model. 

I did find this paper: ['Pedestrian Segmentation from Complex Background Based on Predefined Pose Fields and Probabilistic Relaxation'](https://www.scielo.br/j/bcg/a/s4LPJYBbNVDQ4ZcWprP4rKw/?lang=en). The paper compares an image segmentation method to CNN-based methods. They use the [Penn Fudan dataset](https://www.cis.upenn.edu/~jshi/ped_html/) - an image database of pedestrians around a campus/urban environment. Each of the 170 samples contains at least one labelled pedestrian (one mask, one bounding box). 

In section 4.2.1 of the paper's 'Quantitative Evaluation', they compare their method notably to a Mask R-CNN - this model architecture is almost identical to that of my model. For the sake of completeness, I will give an overview here: 

### Mask R-CNN Structure: 

Firstly, the network uses a CNN module to extract feature maps using numerous kernels. 

These feature maps are passed to the Region Proposal Network (RPN) module, which considers k 'windows' of differing size and aspect ratio at evenly spaced 'anchors' over each feature map. This network learns to select the windows most likely to contain an object of interest and 'proposes' them to downstream fully connected layers following ROI pooling (which normalises the dimensionality of windows that differ in size and aspect ratio). In this way the network only pays attention to promising regions within the samples, without the need for a selective-search algorithm, which is computationally less efficient. Up to here, the Mask R-CNN is identical to the Faster R-CNN architecture. 

The discrepancy lies in the output heads. A Faster R-CNN network has two output heads: one for classification (kx2 outputs for each ROI), and one for bounding-box regression (kx4 outputs for each ROI). The Mask R-CNN has an additional output head that outputs the object mask for a given input sample. 

As this is the only discrepancy, I believe its performance in the paper is suitable as a reference for the positive control.
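
To make the 'k windows per anchor' idea concrete, here is a minimal sketch in plain Python. The function name, stride, scales and aspect ratios are illustrative assumptions, not values taken from the paper or from my model.

```python
from itertools import product
from math import sqrt


def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) windows in image coordinates.

    One anchor centre per feature-map cell, and
    k = len(scales) * len(ratios) windows per centre.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # centre of this feature-map cell projected back onto the image
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio = height / width; the window area stays scale ** 2
            w = scale / sqrt(ratio)
            h = scale * sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# hypothetical 2 x 3 feature map with k = 2 scales * 3 ratios = 6 windows per anchor
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=(32, 64), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 6 cells * 6 windows = 36
```

The RPN then scores each of these windows as object or background and regresses a box refinement for it; only the highest-scoring windows are passed on as proposals.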

### Goal: 

Validate my model's performance (Average Precision and Recall (AP and AR)) relative to the findings in a sufficiently similar implementation example from the literature.

### Hypothesis: 

If my Faster R-CNN implementation is functioning correctly, it should achieve detection metrics (AP and AR) comparable to the [published Mask R-CNN benchmarks](https://www.scielo.br/j/bcg/a/s4LPJYBbNVDQ4ZcWprP4rKw/?lang=en) on the Penn-Fudan dataset.

### Rationale:

1. The core detection architecture is identical 
2. The dataset and task (pedestrian detection) are standardized 
3. The evaluation protocols for AP and AR are consistent 
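4. The segmentation head in Mask R-CNN is not expected to change detection metrics substantially (its possible influence through the shared layers is discussed under Key Differences)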

### Experimental Plan: 

1. Train Faster R-CNN on Penn-Fudan dataset using: 
   - ResNet-50 backbone 
   - Standard detection heads 
   - Default training parameters
   
2. Evaluate using COCO metrics: 
   - Average Precision (AP) 
   - Average Recall (AR) 

3. Compare against published benchmarks:
   - Mask R-CNN: AP = 79.25%, AR = 92.63%
   - Other architectures (for context):
     - Yolact++: AP = 92.20%, AR = 94.02%
     - DeepLabv3: AP = 78.06%, AR = 92.83%
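
The COCO metrics in step 2 are built on the Intersection over Union (IoU) between predicted and ground-truth boxes, averaged over ten IoU thresholds from 0.50 to 0.95. The sketch below shows only that matching criterion on a single pair of hypothetical boxes; it is not the full AP/AR computation, which also ranks detections by score and integrates a precision-recall curve.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO's primary AP/AR average over IoU thresholds 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

ground_truth = (10, 10, 110, 210)  # hypothetical pedestrian box
prediction = (22, 10, 110, 210)    # hypothetical prediction, 12 px too narrow

iou = box_iou(ground_truth, prediction)
matched_at = [t for t in COCO_IOU_THRESHOLDS if iou >= t]
print(f"IoU = {iou:.2f}; true positive at {len(matched_at)} of 10 thresholds")
```

A prediction that is only slightly misaligned therefore still loses credit at the strictest thresholds, which is why AP @ IoU 0.50:0.95 is a demanding metric.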

## Reference Selection 

The paper "Pedestrian Segmentation from Complex Background Based on Predefined Pose Fields and Probabilistic Relaxation" (Caisse Amisse, Jijón-Palma and António, 2021) provides suitable benchmark metrics for comparison. They evaluate multiple CNN-based architectures on the Penn-Fudan dataset, which contains 170 images of pedestrians in urban environments with pixel-level annotations (masks and bounding boxes).

## Architectural Comparison

### Base Architecture Similarity 

The paper benchmarks Mask R-CNN, which shares the same fundamental detection architecture as my Faster R-CNN implementation:

1. Backbone: ResNet-50 feature extractor 
2. Region Proposal Network (RPN) 
3. ROI Pooling layer 
4. Classification and bounding box regression heads 

### Key Differences 

The main architectural difference is that Mask R-CNN includes an additional segmentation head for mask prediction, while my Faster R-CNN implementation focuses solely on detection. This difference could potentially impact detection performance through:

1. Multi-task Learning Effects:
   - The additional mask supervision might help the shared layers learn better feature representations 
   - The model must balance detection and segmentation objectives, which could affect optimization

2. Parameter Updates:
   - Gradients from the mask head flow back through the shared layers 
   - This could influence how the detection-related parameters are updated during training
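
For reference, the original Mask R-CNN paper (He et al., 2017) defines the loss on each sampled ROI as a sum of three terms, whereas the Faster R-CNN heads optimise only the first two. Since $L_{mask}$ is backpropagated through the same shared layers, it can alter the features used for detection even though the detection heads themselves are unchanged:

$$
L_{\text{Mask R-CNN}} = L_{cls} + L_{box} + L_{mask}, \qquad L_{\text{Faster R-CNN}} = L_{cls} + L_{box}
$$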

However, these implementations still serve as valid reference points because: 

1. The core detection architecture remains identical 
2. The published metrics provide a reasonable expected performance range 

## Multi-Study Validation 

Two independent studies support comparing Faster R-CNN and Mask R-CNN metrics in the positive control: 

1. Pedestrian Detection Reference Study (Caisse Amisse, Jijón-Palma and António, 2021) 
   - Mask R-CNN: AP = 79.25%, AR = 92.63% 
   - Similar task (except for mask generation), making it a good reference study for the positive control 
   - The frozen COCO ResNet50 backbone is identical to the one in my model (they also use transfer learning)

2. Vehicle Detection Study (Tahir, Shahbaz Khan and Owais Tariq, 2021) 
   - Faster R-CNN: AP = 76.3%, AR = 76% 
   - Mask R-CNN: AP = 74.3%, AR = 74.35% 
   - Shows consistent relative performance between the two architectures

These studies demonstrate that: 

1. Faster R-CNN and Mask R-CNN may achieve comparable metrics 
2. My implementation's performance (AP = 87%, AR = 92%) is similar to that of the Mask R-CNN with a ResNet50 backbone

## Implementation Details

### Establishing a Positive Control:
When starting this analysis, I began by following [this](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) tutorial. Here, they use the PennFudan dataset - images of pedestrians around a campus and urban streets (more info [here](https://airctic.github.io/icedata/pennfudan/)). I will be using this as my positive control.

Download here:

```{bash}
#| eval: false
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip -P data
cd data && unzip PennFudanPed.zip
```


The tutorial predicts both a bounding box and a mask for each pedestrian. For this reason, they write their code for a Mask R-CNN. They organise their dataset as follows: 

PennFudanPed

  PedMasks

    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...

  PNGImages

    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
    ...

Included in the data are masks that segment out each pedestrian. I will not be using these as training targets, given my model does not produce mask predictions in the output; they are only used to derive the ground-truth bounding boxes. 

This file structure can be found in raw/pos_control. 
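
The dataset class relies on the sorted image and mask listings lining up index by index. A minimal sketch of a sanity check for that assumption is below; `paired_files` is a hypothetical helper (not part of the tutorial or of my code), and the demo runs on a throwaway directory that mimics the layout above rather than on the real data.

```python
import os
import tempfile


def paired_files(root):
    """Return sorted (image, mask) filename pairs, checking that they line up."""
    imgs = sorted(os.listdir(os.path.join(root, "PNGImages")))
    masks = sorted(os.listdir(os.path.join(root, "PedMasks")))
    if len(imgs) != len(masks):
        raise ValueError(f"{len(imgs)} images but {len(masks)} masks")
    for img, mask in zip(imgs, masks):
        stem = os.path.splitext(img)[0]
        if mask != f"{stem}_mask.png":
            raise ValueError(f"{img} is paired with {mask}")
    return list(zip(imgs, masks))


# Demo on a temporary directory with the same two sub-folders
with tempfile.TemporaryDirectory() as root:
    for sub in ("PNGImages", "PedMasks"):
        os.makedirs(os.path.join(root, sub))
    for i in (2, 1, 3):  # created out of order on purpose
        open(os.path.join(root, "PNGImages", f"FudanPed{i:05d}.png"), "w").close()
        open(os.path.join(root, "PedMasks", f"FudanPed{i:05d}_mask.png"), "w").close()
    pairs = paired_files(root)

print(pairs[0])  # ('FudanPed00001.png', 'FudanPed00001_mask.png')
```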

Firstly, I will define the dataset class, taking the code from the tutorial and adapting it for use with my Faster R-CNN model instead:

```{python}
from plate_detect import helper_training_functions
import torchvision_deps
from torchvision.ops.boxes import masks_to_boxes
import os
import numpy as np
import pandas as pd
from torchvision.io import read_image
import torch
from torchvision.transforms.v2 import functional as F
from torchvision import tv_tensors
from typing import Dict
import torchvision_deps.T_and_utils
```

Define the `PennFudanDataset` class as in the PyTorch tutorial:

```{python}
class PennFudanDataset(torch.utils.data.Dataset):
    # I like their use of root, this was something I should have done!
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files
        # also, note here they sort the otherwise arbitrary os.listdir return -
        # this was a huge flaw I overlooked in my code!
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = read_image(img_path)
        mask = read_image(mask_path)
        # instances are encoded as different colors
        obj_ids = torch.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        num_objs = len(obj_ids)

        # split the color-encoded mask into a set
        # of binary masks
        masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)

        # get bounding box coordinates for each mask
        boxes = masks_to_boxes(masks)

        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)

        image_id = idx
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        # Wrap sample and targets into torchvision tv_tensors:
        img = tv_tensors.Image(img)

        target = {}
        target["boxes"] = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=F.get_size(img))
        # target["masks"] = tv_tensors.Mask(masks)  <--- commented out since my model doesn't care about masks
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
```
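
The mask handling in `__getitem__` may be easier to follow on a toy example. The sketch below reimplements the same idea in plain Python on a hypothetical 4 x 6 instance mask: every non-zero id is one pedestrian, and its box is taken from the extreme pixel indices of that id. It is an illustration only; the real code uses `torch.unique` and `masks_to_boxes`.

```python
def boxes_from_instance_mask(mask):
    """Map each non-zero instance id to its (x_min, y_min, x_max, y_max) box."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:  # 0 encodes the background
                continue
            x1, y1, x2, y2 = boxes.get(obj_id, (x, y, x, y))
            boxes[obj_id] = (min(x1, x), min(y1, y), max(x2, x), max(y2, y))
    return boxes


# hypothetical mask containing two pedestrians (ids 1 and 2)
toy_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

boxes = boxes_from_instance_mask(toy_mask)
print(boxes)  # {1: (1, 0, 2, 2), 2: (4, 1, 5, 3)}
```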

Next, instantiate the dataset, split it into training and validation subsets (the last 50 shuffled indices are held out), and train for 10 epochs, given the dataset is much smaller (the tutorial trains for just 2 epochs):

```{python}
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model, optimizer, transforms = helper_training_functions.get_model_instance_object_detection(2)

root_dir = '/Users/cla24mas/Documents/SC_TSL_15092024_plate_detect/'
data_root = os.path.join(root_dir, 'raw/pos_control/PennFudanPed')

# two views of the same files; the index split below keeps them disjoint
dataset = PennFudanDataset(data_root, transforms)
dataset_test = PennFudanDataset(data_root, transforms)

# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=torchvision_deps.T_and_utils.utils.collate_fn
)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test,
    batch_size=1,
    shuffle=False,
    collate_fn=torchvision_deps.T_and_utils.utils.collate_fn
)

precedent_epoch = 0
num_epochs = 9

epoch = helper_training_functions.train(model, data_loader, data_loader_test, device, num_epochs, precedent_epoch, root_dir, optimizer)
```
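
Because `torch.randperm` is called without a seed here, the 50 held-out images differ from run to run. A minimal sketch of a reproducible split is below, using Python's `random` module as a stand-in; with PyTorch, passing a seeded `torch.Generator` to `torch.randperm` should achieve the same. The function name and the seed value are assumptions for illustration.

```python
import random


def split_indices(n_samples, n_test, seed):
    """Shuffle range(n_samples) deterministically and hold out the last n_test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:-n_test], indices[-n_test:]


# 170 Penn-Fudan images, 50 held out for validation, as in the split above
train_idx, test_idx = split_indices(n_samples=170, n_test=50, seed=0)
print(len(train_idx), len(test_idx))  # 120 50
```

Fixing the split matters for the positive control because the reported AP and AR are computed on only 50 images, so a different held-out subset can shift them noticeably.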

Now checking a prediction:

```{python}
save_dir = '/Users/cla24mas/Documents/SC_TSL_15092024_plate_detect/results/0003_pos_control/predictions'
load_dir = '/Users/cla24mas/Documents/SC_TSL_15092024_plate_detect/checkpoints/pos_control/'
model, _, _ = helper_training_functions.load_model(load_dir, 2, 'checkpoint_epoch_2')
model.eval()  # Set model to evaluation mode
# plot a prediction for every held-out sample (validation_size was never defined)
for i in range(len(dataset_test)):
    helper_training_functions.plot_prediction(model, dataset_test, device, i, save_dir, 'checkpoint_epoch_2_pos_control')
```

This indicates the model is behaving as expected. 

An example:

![example prediction](../results/0003_pos_control/predictions/prediction_normalized_0.png)

Based on the evaluation metrics from epochs 0-9, I chose to make predictions using the model saved at the third epoch of training (saved as `checkpoint_epoch_2`, since epochs are currently 0-indexed in my code). This point lies between the (potential) over-fitting plateau and the poorer-performing model states at epochs 0 and 1. It is worth noting that the PyTorch tutorial trains for only two epochs on this dataset, which these results suggest is appropriate.

## Results 

My implementation achieved: 

- AP @ IoU 0.50:0.95 = 87% (about 8 percentage points higher than the Mask R-CNN in Caisse Amisse, Jijón-Palma and António, 2021) 
- AR @ IoU 0.50:0.95 = 92% (approximately equal to the Mask R-CNN AR of 92.63% in Caisse Amisse, Jijón-Palma and António, 2021) 

![mAP and mAR](../results/0003_pos_control/evaluation_metrics_epochs_0-9.png) 

My AP falls within the range spanned by the literature benchmarks (78.06% to 92.20%), and my AR is within one percentage point of the Mask R-CNN reference, which supports the conclusion that my implementation is functioning correctly. The larger discrepancy in AP could be attributed to the additional mask-predictor head in the reference study.
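
The gaps quoted above can be reproduced with a few lines of arithmetic. This sketch uses only the percentages already stated in this document; the variable names are my own.

```python
# Values quoted in this document, in percent
reference = {"AP": 79.25, "AR": 92.63}  # Mask R-CNN, Caisse Amisse et al. (2021)
mine = {"AP": 87.0, "AR": 92.0}         # my Faster R-CNN implementation

# difference in percentage points, rounded to two decimals
gaps = {name: round(mine[name] - reference[name], 2) for name in reference}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points relative to the reference")
```

This gives +7.75 percentage points for AP and -0.63 for AR.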

## Conclusion 

The positive control demonstrates that my Faster R-CNN implementation: 

1. Achieves performance consistent with published benchmarks 
2. Shows no evidence of implementation errors or dysfunction 
3. Can be confidently applied to new detection tasks  

- Caisse Amisse, Jijón-Palma, M.E. and António, J. (2021). PEDESTRIAN SEGMENTATION FROM COMPLEX BACKGROUND BASED ON PREDEFINED POSE FIELDS AND PROBABILISTIC RELAXATION. Boletim de Ciências Geodésicas, [online] 27(3). doi:https://doi.org/10.1590/s1982-21702021000300017.  
- Tahir, H., Shahbaz Khan, M. and Owais Tariq, M. (2021). Performance Analysis and Comparison of Faster R-CNN, Mask R-CNN and ResNet50 for the Detection and Counting of Vehicles. 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). doi:https://doi.org/10.1109/icccis51004.2021.9397079.