<a href="https://colab.research.google.com/github/FarazHeydar/Pedestrian-Detection-Label-Noise/blob/main/Pedestrian_Detection_Label_Noise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YOLOv7 Pedestrian Detection in presence of Label Noise

## **Connect to Google Drive**

* To ensure that files (datasets and saved models) are not deleted when the notebook is closed, we will connect it to our Google Drive.

* When you run this cell, you will be asked to grant access to your Google account.

**Mount Google Drive**

This cell imports the `drive` module from `google.colab` and mounts your Google Drive to the Colab runtime at the `/content/drive` directory. This allows the notebook to access and save files directly in your Drive, ensuring that datasets and trained models persist even after the runtime is disconnected. You will be prompted to authenticate and grant permissions.

In [None]:
from google.colab import drive

# Mount Google Drive
print("Mounting Google Drive...")
drive.mount('/content/drive')

## **Set Up Project Directory and Data Acquisition**

We will aggregate data from three different sources to create a diverse pedestrian detection dataset:

1.  Custom Data (Zip): A small custom dataset (256 images) to test the model's ability to generalize to new environments.
2.  Cityscapes (Kaggle): High-quality urban street scenes.
3.  CityPersons Annotations (Kaggle): Specialized bounding box annotations for pedestrians in Cityscapes.

The script below checks if the data already exists in the Drive.
* If Found: It skips the download to save time.
* If Missing: It automatically downloads the datasets from Kaggle and unzips the custom data file.
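
The skip-if-present check can be sketched as a small helper. This is a minimal illustration rather than the notebook's actual code: `ensure_dataset`, `fake_fetch`, and the temporary folder are hypothetical names, and the `fetch` callback stands in for the real Kaggle download or unzip step.

```python
import os
import tempfile


def ensure_dataset(target_dir, fetch):
    """Run `fetch(target_dir)` only when `target_dir` does not exist yet.

    Returns True if the fetch step ran, False if it was skipped.
    `fetch` is a hypothetical callback standing in for the real download.
    """
    if os.path.exists(target_dir):
        return False
    fetch(target_dir)
    return True


def fake_fetch(path):
    # Stand-in for a real download: just create the folder.
    os.makedirs(path)


root = tempfile.mkdtemp()
target = os.path.join(root, "Cityscapes")
first = ensure_dataset(target, fake_fetch)   # folder is missing, so fetch runs
second = ensure_dataset(target, fake_fetch)  # folder now exists, so fetch is skipped
print(first, second)
```

Note that a plain existence check cannot tell a complete dataset from one left behind by an interrupted copy, so if a download fails midway it is safer to delete the target folder before re-running.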

In [None]:
import os
import shutil
import sys
import kagglehub

# Define Directories in DRIVE
# Everything will be saved here so it stays after closing Colab
BASE_PROJECT_DIR = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise'
DATASET_ROOT = os.path.join(BASE_PROJECT_DIR, 'datasets')
MODEL_DIR = os.path.join(BASE_PROJECT_DIR, 'model')

# Create main folders if they don't exist
os.makedirs(DATASET_ROOT, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

print(f"Project Base: {BASE_PROJECT_DIR}")
print(f"Data Folder:  {DATASET_ROOT}")

# --- CUSTOM DATA SETUP (Zip File) ---
print("\n--- Checking Custom Data ---")

# Paths
zip_filename = "CustomData.zip"
zip_path = os.path.join(DATASET_ROOT, zip_filename)
extract_path = os.path.join(DATASET_ROOT, 'CustomImages')

# REPLACE THIS with your actual Google Drive File ID for the zip
# If the zip is already manually uploaded to the folder, you can ignore this ID.
CUSTOM_ZIP_ID = "1PUtxnGw6LzSpo6MCrkJUw8ZRPPqZ7D3H"

# Check if UNZIPPED folder exists
if os.path.exists(extract_path):
    print("'CustomImages' folder found in Drive. Skipping unzip.")

else:
    # Check if ZIP file exists
    if not os.path.exists(zip_path):
        print(f"Zip file not found at {zip_path}")
        if CUSTOM_ZIP_ID != "PASTE_YOUR_ID_HERE":
            print("Downloading Zip from Google Drive ID...")
            !gdown --id {CUSTOM_ZIP_ID} -O "{zip_path}"
        else:
            print("Please manually upload 'CustomData.zip' to 'Pedestrian-Detection-Label-Noise/datasets' in your Drive.")

    # Unzip if zip exists but folder doesn't
    if os.path.exists(zip_path):
        print(f"Unzipping {zip_filename} to Drive (this takes time)...")
        !unzip -q "{zip_path}" -d "{DATASET_ROOT}"
        print("Unzip Complete.")

# --- KAGGLE DATA SETUP (Cityscapes & Annotations) ---
print("\n--- Checking Kaggle Datasets ---")

def setup_kaggle_dataset(dataset_handle, target_folder_name):
    # The final destination in your Drive
    drive_target_path = os.path.join(DATASET_ROOT, target_folder_name)

    # CHECK: Does it already exist in Drive?
    if os.path.exists(drive_target_path):
        print(f"{target_folder_name} already exists in Drive. Skipping download.")
        return

    print(f"Downloading {target_folder_name} (this may take time)...")

    # Download to Colab's local cache first
    cached_path = kagglehub.dataset_download(dataset_handle)

    print(f"Moving files to Google Drive: {drive_target_path}...")

    # Copy from Cache to Drive (Persistent)
    shutil.copytree(cached_path, drive_target_path)
    print(f"{target_folder_name} saved to Drive.")

# Cityscapes
setup_kaggle_dataset("kavithak1388/cityscapes", "Cityscapes")

# Annotations
setup_kaggle_dataset("wildred/city-persons-annotations", "CityPersonsAnnotations")

print("\n SETUP COMPLETE: All data is ready in Google Drive.")

## **Clone YOLOv7**

This cell performs the initial project setup:

1. Change Directory (`%cd`): Moves the runtime's working directory to the project's `model` folder (`MODEL_DIR`).

2. Clone YOLOv7: It checks if a `yolov7` folder already exists. If not, it downloads (clones) the official YOLOv7 repository from GitHub.

3. Enter Repository: It changes the directory again to move inside the newly cloned `yolov7` folder.

4. Verification: Prints the current working directory to confirm we are in the correct location (`.../yolov7`).

In [None]:
%cd {MODEL_DIR}

# Download (Clone) YOLOv7 code if it doesn't already exist.
# This ensures the YOLOv7 repository is available for use.
if not os.path.exists('yolov7'):
    print("Downloading YOLOv7 repository...")
    !git clone https://github.com/WongKinYiu/yolov7.git
else:
    print("YOLOv7 folder already exists.")

# Navigate into the cloned yolov7 directory.
%cd yolov7

# Explicitly checkout the main branch to extract the files
!git checkout main

# Print the current working directory to confirm the location.
print(f"Current directory: {os.getcwd()}")

## **Patch `requirements.txt`**

This cell applies a patch to the YOLOv7 requirements file to avoid installation errors.

* It first ensures it's in the correct `yolov7` directory.

* It then uses the `sed` command to modify `requirements.txt` in place (`-i`).

* The command finds the line specifying `numpy<1.24.0,>=1.18.5` and replaces it with `numpy>=1.18.5`. This loosens the version constraint, which is a common fix for older projects that conflict with the newer libraries (like NumPy) pre-installed in Google Colab.
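
For readers less familiar with `sed`, the same substitution can be sketched in plain Python. This is only an illustration under stated assumptions: `loosen_numpy_pin` is a hypothetical helper, and the sample text is made up for the example rather than copied from the real `requirements.txt`. Unlike `sed`, `str.replace` matches the text literally, and the helper also reports whether anything changed, since a substitution that finds no match otherwise fails silently.

```python
OLD_SPEC = "numpy<1.24.0,>=1.18.5"
NEW_SPEC = "numpy>=1.18.5"


def loosen_numpy_pin(requirements_text):
    """Replace the pinned NumPy range with the looser lower bound.

    Returns the patched text and a flag telling whether anything changed.
    """
    patched = requirements_text.replace(OLD_SPEC, NEW_SPEC)
    return patched, patched != requirements_text


# Illustrative input only; the real file may list its dependencies differently.
sample = "some-package>=1.0\nnumpy<1.24.0,>=1.18.5\n"
patched, changed = loosen_numpy_pin(sample)
print(changed)
print(patched)
```

If `changed` comes back `False` on the real file, the version constraint is written differently from what the pattern expects and the pattern needs adjusting.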

In [None]:
# Make sure we are in the correct folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the shell command (sed) to modify the file
# This command replaces the text 'numpy<1.24.0,>=1.18.5' with 'numpy>=1.18.5'
# This is done to resolve potential dependency conflicts with newer numpy versions.
!sed -i 's/numpy<1.24.0,>=1.18.5/numpy>=1.18.5/' requirements.txt

print("The requirements.txt file was successfully modified!")

## **Install Dependencies**

This cell handles the installation of all required Python libraries.

1. **Navigate:** It changes the directory back to the `yolov7` project folder.

2. **Patch Dependencies (Pythonic way):** It reads `requirements.txt`, finds any lines that start with `numpy` or `protobuf` (common conflict points), and comments them out. This is a more flexible patch than the previous `sed` command.

3. **Install:** It runs `pip install -r requirements.txt` to install all libraries listed in the now-patched file.

4. **Install Extras:** It explicitly installs `pandas` and `scipy` afterward to ensure they are present.

In [None]:
import os
import sys

# Make sure we are in the correct folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Print the current working directory to confirm the location.
print(f"Current directory: {os.getcwd()}")

requirements_file = 'requirements.txt'
problem_lines_to_comment = ['numpy', 'protobuf']
patched_count = 0

# --- Find and disable both lines ---
# This section attempts to comment out specific problematic dependencies (numpy and protobuf)
# in requirements.txt if they cause installation issues with the current Python environment.
# This is a common workaround for dependency conflicts in older projects.
try:
    with open(requirements_file, 'r') as f:
        lines = f.readlines()

    new_lines = []
    for line in lines:
        # Check if the line starts with one of the problematic words
        found_problem = False
        for problem in problem_lines_to_comment:
            if line.strip().startswith(problem):
                new_line = f"# {line}" # Comment out the line
                new_lines.append(new_line)
                print(f"PATCHED: Commented out line: '{line.strip()}'")
                patched_count += 1
                found_problem = True
                break

        if not found_problem:
            new_lines.append(line)

    if patched_count > 0:

        # Overwrite the modified file with the commented-out lines
        with open(requirements_file, 'w') as f:
            f.writelines(new_lines)
        print(f"requirements.txt successfully patched. {patched_count} lines commented out.")
    else:
        print("Could not find any problem lines to comment out. This is strange.")

except Exception as e:
    print(f"Error reading/writing file: {e}")

# --- The installation ---
# Install the dependencies listed in the modified requirements.txt.
# Also install pandas and scipy explicitly as they might be needed and could conflict if not handled.
print("\n--- Now attempting to install with patched file... ---")
!pip install -r requirements.txt
!pip install pandas scipy

## **Setup Detection Images**

This cell prepares our test environment. It takes the unseen images from the `DetectionTest` folder of our custom dataset and copies them into the `detection_test` directory inside `yolov7/data`.

We will use these specific images later with `detect.py` to visually see how well our different models (Baseline vs. Noisy vs. Robust) perform on real-world data.

In [None]:
import os
import shutil

# --- Configuration ---
# Source: The 'DetectionTest' folder extracted from your CustomData.zip
SOURCE_TEST_DIR = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/datasets/DetectionTest'

# Destination: The specific folder YOLOv7 looks at for detection tests
DEST_DETECT_DIR = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/detection_test'

def copy_test_images():
    print(f"--- Setting up Test Images ---")

    # Validation: Check if source exists
    if not os.path.exists(SOURCE_TEST_DIR):
        print(f"Error: Source folder not found at: {SOURCE_TEST_DIR}")
        print("Please verify that your 'CustomData.zip' contained a folder named 'DetectionTest'.")
        return

    # Preparation: Create destination directory if needed
    if not os.path.exists(DEST_DETECT_DIR):
        os.makedirs(DEST_DETECT_DIR)
        print(f"Created destination directory: {DEST_DETECT_DIR}")
    else:
        print(f"Destination directory exists: {DEST_DETECT_DIR}")

    # Execution: Copy images
    print(f"Copying images...")
    copy_count = 0
    for filename in os.listdir(SOURCE_TEST_DIR):

        # Filter for image files only
        if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp')):
            src_path = os.path.join(SOURCE_TEST_DIR, filename)
            dst_path = os.path.join(DEST_DETECT_DIR, filename)

            shutil.copy(src_path, dst_path)
            copy_count += 1

    print(f"Success! {copy_count} images copied to {DEST_DETECT_DIR}")

if __name__ == "__main__":
    copy_test_images()

## **Create the Data Processing Script**

This cell uses the `%%writefile` command to create a new Python script named `data_processor.py` inside the `yolov7` directory. This script is responsible for preparing and combining our datasets.

The script contains several key functions:

* `create_final_dirs()`: Creates the final directory structure (`Combined_Dataset/images/train`, `.../labels/train`, etc.) where the processed data will be stored. It deletes any old data first.

* `process_citypersons(split)`: This is the main processing function for the CityPersons dataset.

  * It loads the `.mat` annotation files (`anno_train.mat` or `anno_val.mat`).

  * It iterates through each image record and copies the image file to the final directory.

  * Crucially, it extracts the visible bounding box annotations (indices 6 to 9 of each instance, i.e. `instance[6:10]`) instead of the full bounding box (indices 1 to 4). This is the main "fix" for this model, intended to help it learn to detect partially occluded pedestrians better.

  * It converts these box coordinates into the YOLO format (`class_id x_center y_center width height`, all normalized by the image size) and saves them as `.txt` files.

* `process_custom_data()`: This function finds all custom images and labels, copies them to the final `train` folder, and "sanitizes" the labels by ensuring the class ID is set to `0` (for the single 'pedestrian' class).

* The `if __name__ == "__main__":` block at the end ensures that when this script is run, it executes all the steps: creates directories, processes the CityPersons train split, processes the val split, and adds the custom data.
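
The coordinate conversion used for the CityPersons boxes can be checked on a single box. The sketch below assumes the same fixed 2048x1024 image size as the script; `to_yolo` is a hypothetical helper name, and the box values are made up for illustration.

```python
def to_yolo(x1, y1, w, h, img_w=2048, img_h=1024, class_id=0):
    """Convert a top-left (x1, y1, w, h) pixel box to a normalized YOLO label line."""
    x_center = (x1 + w / 2) / img_w   # box centre as a fraction of image width
    y_center = (y1 + h / 2) / img_h   # box centre as a fraction of image height
    norm_w = w / img_w
    norm_h = h / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"


# A 64x128 pixel box whose top-left corner is at (1000, 400).
line = to_yolo(1000, 400, 64, 128)
print(line)  # prints: 0 0.503906 0.453125 0.031250 0.125000
```

All four values fall in the range 0 to 1, which is a quick sanity check to apply to any generated label file.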

In [None]:
%%writefile /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data_processor.py

import os
import shutil
import scipy.io as sio
from tqdm import tqdm
import numpy as np

# Define base paths and specific directory paths for the dataset and model.
BASE_PATH = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise'
DATASET_PATH = os.path.join(BASE_PATH, 'datasets')
CITY_IMAGES_DIR = os.path.join(DATASET_PATH, 'Cityscapes', 'Cityscape', 'leftImg8bit')
CITY_ANN_DIR = os.path.join(DATASET_PATH, 'CityPersonsAnnotations')
CUSTOM_IMAGES_DIR = os.path.join(DATASET_PATH, 'CustomImages')
CUSTOM_LABELS_DIR = os.path.join(DATASET_PATH, 'CustomAnnotations')
FINAL_DATASET_DIR = os.path.join(BASE_PATH, 'model', 'yolov7', 'data', 'Combined_Dataset')
FINAL_IMG_TRAIN = os.path.join(FINAL_DATASET_DIR, 'images', 'train')
FINAL_LBL_TRAIN = os.path.join(FINAL_DATASET_DIR, 'labels', 'train')
FINAL_IMG_VAL = os.path.join(FINAL_DATASET_DIR, 'images', 'val')
FINAL_LBL_VAL = os.path.join(FINAL_DATASET_DIR, 'labels', 'val')

def create_final_dirs():
    # Creates final folders (cleans if present)
    print("Creating final directory structure (wiping old data)...")
    if os.path.exists(FINAL_DATASET_DIR):
        shutil.rmtree(FINAL_DATASET_DIR) # <-- Clear previous dataset

    # Create necessary directories for training and validation images and labels.
    os.makedirs(FINAL_IMG_TRAIN, exist_ok=True)
    os.makedirs(FINAL_LBL_TRAIN, exist_ok=True)
    os.makedirs(FINAL_IMG_VAL, exist_ok=True)
    os.makedirs(FINAL_LBL_VAL, exist_ok=True)

# --- Processing the CityPersons dataset (from .mat to YOLO) ---
# This function processes the CityPersons dataset, converting its .mat annotation format
# into YOLO format and copying the corresponding images to the final dataset directory.
# It specifically uses the 'visible' bounding box annotations for pedestrians.
def process_citypersons(split='train'):
    print(f"\nProcessing CityPersons '{split}' split...")
    if split == 'train':
        mat_file = os.path.join(CITY_ANN_DIR, 'anno_train.mat')
        img_base_dir = os.path.join(CITY_IMAGES_DIR, 'train')
        final_img_dir = FINAL_IMG_TRAIN
        final_lbl_dir = FINAL_LBL_TRAIN
        mat_key = 'anno_train_aligned'
    else:
        mat_file = os.path.join(CITY_ANN_DIR, 'anno_val.mat')
        img_base_dir = os.path.join(CITY_IMAGES_DIR, 'val')
        final_img_dir = FINAL_IMG_VAL
        final_lbl_dir = FINAL_LBL_VAL
        mat_key = 'anno_val_aligned'
    if not os.path.exists(mat_file):
        print(f"!!! Error: MAT file not found at {mat_file}")
        return
    mat = sio.loadmat(mat_file)
    print(f"Found {len(mat[mat_key][0])} image records in {split}.mat")
    copied_image_count = 0

    for img_data in tqdm(mat[mat_key][0]):
        city = img_data[0][0][0][0]
        image_id = img_data[0][0][1][0]
        image_path = os.path.join(img_base_dir, city, image_id)
        if not os.path.exists(image_path):
            continue

        shutil.copy(image_path, os.path.join(final_img_dir, image_id))
        copied_image_count += 1

        yolo_labels = []
        for instance in img_data[0][0][2]:
            class_label = int(instance[0])
            if class_label != 1: # Pedestrian only (class ID 1 in CityPersons, but mapped to 0 for YOLOv7 single class)
                continue

            # We use indices 6, 7, 8, 9 for the *visible* box
            # instead of 1, 2, 3, 4 (full box) to improve detection of occluded pedestrians.
            x1, y1, w, h = float(instance[6]), float(instance[7]), float(instance[8]), float(instance[9])

            # If the visible width or height is zero, skip it as it's not a valid bounding box.
            if w <= 0 or h <= 0:
                continue

            img_width, img_height = 2048, 1024 # Standard Cityscapes image dimensions.
            x_center = (x1 + w / 2) / img_width
            y_center = (y1 + h / 2) / img_height
            norm_w = w / img_width
            norm_h = h / img_height
            yolo_labels.append(f"0 {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}") # '0' is the pedestrian class ID for YOLOv7

        if yolo_labels:
            label_filename = os.path.splitext(image_id)[0] + '.txt'
            with open(os.path.join(final_lbl_dir, label_filename), 'w') as f:
                f.write("\n".join(yolo_labels))

    print(f"Successfully copied {copied_image_count} CityPersons '{split}' images.")

# --- Custom Data Processing ---
# This function processes custom image and label data.
# It copies custom images and sanitizes custom labels to ensure they conform to YOLO format
# (e.g., setting the class ID to 0 for pedestrians).
def process_custom_data():
    print("\nProcessing and SANITIZING Custom Labels (searching all subfolders)...")
    img_count = 0
    label_count = 0

    # Copy custom images from all subfolders to the training image directory.
    for root, dirs, files in os.walk(CUSTOM_IMAGES_DIR):
        for filename in files:
            if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                shutil.copy(os.path.join(root, filename), FINAL_IMG_TRAIN)
                img_count += 1
    print(f"Copied {img_count} custom images.")

    # Sanitize and copy the custom labels.
    # Iterate through all custom label files, force the class ID to '0', and save them.
    for root, dirs, files in os.walk(CUSTOM_LABELS_DIR):
        for filename in files:
            if filename.lower().endswith('.txt'):
                label_path = os.path.join(root, filename)
                sanitized_lines = []
                try:
                    with open(label_path, 'r') as f: lines = f.readlines()
                    for line in lines:
                        parts = line.strip().split()
                        if len(parts) == 5:
                            parts[0] = '0' # Make sure the class ID is 0 for pedestrian.
                            sanitized_lines.append(" ".join(parts))
                    if sanitized_lines:
                        final_label_path = os.path.join(FINAL_LBL_TRAIN, filename)
                        with open(final_label_path, 'w') as f_out: f_out.write("\n".join(sanitized_lines))
                        label_count += 1
                except Exception as e:
                    print(f"Warning: Could not process custom label file {filename}. Error: {e}")
    print(f"Copied and SANITIZED {label_count} custom labels.")

# --- Run everything ---
# Main execution block to create directories, process CityPersons, and custom data.
if __name__ == "__main__":
    create_final_dirs()
    process_citypersons(split='train')
    process_citypersons(split='val')
    process_custom_data()
    print("\n--- All data successfully processed and merged (Using VISIBLE Bounding Boxes)! ---")

## **Run the Data Processing Script**

This cell executes the `data_processor.py` script that was just written. It first ensures the notebook is in the `yolov7` directory and then runs the script using `!python`.

The output shows the script in action:

1. Creating the final directories (and wiping old data).

2. Processing the 2975 records from the CityPersons `anno_train.mat` file.

3. Processing the 500 records from the `anno_val.mat` file.

4. Processing and sanitizing the 256 custom images and labels.

5. Confirming that all data was processed using the "VISIBLE Bounding Boxes."

In [None]:
# Make sure we are in the yolov7 folder for script execution.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the data_processor.py script to prepare the dataset.
# This script combines CityPersons and custom data, converts annotations, and organizes files.
!python data_processor.py

## **Create the Dataset Configuration File (YAML)**

This cell writes the `combined_dataset.yaml` file. This is a critical configuration file that tells the YOLOv7 training script:

* `train`: The path to the training images directory.

* `val`: The path to the validation images directory.

* `nc`: The total number of classes (which is `1` in this project).

* `names`: A list of the class names (just `'pedestrian'`).
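
An easy mistake in this file is letting `nc` drift out of sync with `names`. A small consistency check can be sketched as follows. It works on a plain dictionary so it runs without extra libraries; in practice the dictionary would come from parsing the YAML file (for example with PyYAML's `yaml.safe_load`). `check_dataset_config` is a hypothetical helper, not part of YOLOv7, and the relative paths are placeholders.

```python
def check_dataset_config(cfg):
    """Return a list of problems found in a YOLO-style dataset config dict."""
    problems = []
    for key in ("train", "val", "nc", "names"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    if "nc" in cfg and "names" in cfg and cfg["nc"] != len(cfg["names"]):
        problems.append(f"nc={cfg['nc']} but {len(cfg['names'])} names listed")
    return problems


# Placeholder paths; the real file uses absolute Drive paths.
good = {
    "train": "data/Combined_Dataset/images/train",
    "val": "data/Combined_Dataset/images/val",
    "nc": 1,
    "names": ["pedestrian"],
}
bad = dict(good, nc=2)  # class count no longer matches the names list

print(check_dataset_config(good))
print(check_dataset_config(bad))
```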

In [None]:
%%writefile /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/combined_dataset.yaml

# This YAML file configures the dataset for YOLOv7 training.

# Training and validation data paths (with full address)
# These paths point to the image directories for training and validation.
train: /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/images/train
val: /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/images/val

# Number of classes
nc: 1

# Class names
# 'pedestrian' is the single class being detected in this project.
names: ['pedestrian']

## **Download Pre-trained Weights**

This cell downloads the official YOLOv7 pre-trained weights (`yolov7.pt`) from the YOLOv7 GitHub releases. These weights were trained on the large-scale COCO dataset.

We use these weights as a starting point for our training (a process called transfer learning). This is much faster and more effective than training a model from scratch, as the model already understands basic features like edges, shapes, and textures.

In [None]:
# Make sure we are in the yolov7 folder before downloading weights.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Download official YOLOv7 pre-trained weights from the GitHub repository.
# These weights serve as a starting point for training, often referred to as the 'brain' of the model.
!wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt

print("\n--- YOLOv7 official weights downloaded! ---")

## **Patch PyTorch Loading Functions**

This cell applies another compatibility patch, this time to the YOLOv7 source code itself. It uses `sed` to modify `train.py` and `utils/general.py`.

It finds instances of `torch.load(...)` and adds the `weights_only=False` argument. This is necessary to resolve loading errors that can occur with newer versions of PyTorch when loading checkpoints saved by older versions.
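
Because the `sed` commands target fixed line numbers, they silently do nothing if the upstream files shift by even one line. A line-number-independent alternative can be sketched in Python. This is an assumption-laden illustration, not the notebook's method: `add_weights_only` is a hypothetical helper, and it operates on source text passed in as a string so the logic can be tried without touching the repository.

```python
# Exact call texts to look for, paired with their patched replacements.
PATCHES = [
    ("torch.load(weights, map_location=device)",
     "torch.load(weights, map_location=device, weights_only=False)"),
    ("torch.load(f, map_location=torch.device('cpu'))",
     "torch.load(f, map_location=torch.device('cpu'), weights_only=False)"),
]


def add_weights_only(source):
    """Patch every matching torch.load call; return (new_source, number_patched)."""
    count = 0
    for old, new in PATCHES:
        count += source.count(old)
        source = source.replace(old, new)
    return source, count


sample = "ckpt = torch.load(weights, map_location=device)  # load checkpoint\n"
patched, n = add_weights_only(sample)
print(n)
print(patched)
```

Running the helper a second time on already-patched text reports zero changes, so the returned count is a simple way to confirm whether the patch was applied.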

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Use sed to patch the torch.load call on line 71 of train.py.
# Newer PyTorch releases default to weights_only=True, which refuses checkpoints that
# contain pickled Python objects (as the YOLOv7 checkpoints do). Passing weights_only=False
# restores full unpickling, so only use it with checkpoints from a trusted source.
!sed -i "71s/torch.load(weights, map_location=device)/torch.load(weights, map_location=device, weights_only=False)/" train.py

# This command is similar to the above, applying the same fix to line 87 of train.py.
!sed -i "87s/torch.load(weights, map_location=device)/torch.load(weights, map_location=device, weights_only=False)/" train.py

# This command applies the weights_only=False fix to line 802 of utils/general.py,
# addressing potential loading issues in a different utility script.
!sed -i "802s/torch.load(f, map_location=torch.device('cpu'))/torch.load(f, map_location=torch.device('cpu'), weights_only=False)/" utils/general.py

print("\n--- Patch commands for train.py and utils/general.py executed. ---")

## **Start Training: Baseline Model (Clean Data)**

This is the main training command for our baseline model. It executes `train.py` using the clean, combined dataset we just prepared.

Key arguments:

* `--weights yolov7.pt`: Starts training using the pre-trained COCO weights.

* `--data data/combined_dataset.yaml`: Points to our dataset configuration file.

* `--img 640`: Resizes all images to 640x640.

* `--batch-size 16`: Uses a batch size of 16.

* `--epochs 100`: Trains for 100 complete passes over the dataset.

* `--name baseline_clean_run`: Saves the training logs and resulting weights (like `best.pt` and `last.pt`) to the `runs/train/baseline_clean_run` directory.

* `--device 0`: Specifies the use of the first available GPU.

The output logs the training progress for each epoch, showing the loss (box, obj, cls) and validation metrics (Precision, Recall, mAP@.5, mAP@.5:.95). This run took approximately 10.5 hours.
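
Since the later experiments reuse the same command with a different run name, it can help to build the argument list in one place. The sketch below is a hypothetical convenience wrapper (`build_train_command` is not part of YOLOv7); it only assembles the flags listed above and prints the resulting command, without launching training.

```python
import shlex


def build_train_command(run_name, epochs=100, batch_size=16, img_size=640, device="0"):
    """Assemble the train.py argument list for one experiment run."""
    return [
        "python", "train.py",
        "--weights", "yolov7.pt",
        "--data", "data/combined_dataset.yaml",
        "--img", str(img_size),
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--name", run_name,
        "--device", device,
    ]


cmd = build_train_command("baseline_clean_run")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)` from inside the `yolov7` directory, which avoids shell quoting issues with unusual run names.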

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Execute the training command!
# This command starts the YOLOv7 training process with specified parameters:
# --weights: uses the pre-trained yolov7.pt weights as a starting point.
# --data: specifies the dataset configuration file (combined_dataset.yaml).
# --img: sets the image size to 640x640 pixels for training.
# --batch-size: uses a batch size of 16 images per iteration.
# --epochs: trains the model for 100 epochs.
# --name: assigns a name to this training run for organization.
# --device: specifies GPU device 0 for training (if available).
!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 100 \
    --name baseline_clean_run \
    --device 0

## **Create the Noise Injection Script**

This cell writes the `noise_injector.py` script, which is used to conduct our experiments with noisy labels.

The script defines several functions:

* `create_backup()`: Saves a clean copy of the training labels to a `train_CLEAN_BACKUP` directory.

* `restore_backup()`: Wipes the active `train` label directory and replaces it with the clean backup. This is crucial for ensuring experiments are repeatable and start from the same clean state.

* `inject_noise(noise_type, percentage)`:

  1. First, it restores the clean backup.

  2. It then gets a list of all training label files and randomly selects n% of them.

  3. It iterates through this n% sample and applies noise:

      * `--task inject_fn --p 0.n` (False Negative): Simulates missing labels by randomly deleting one bounding box (one line) from the file.

      * `--task inject_fp --p 0.n` (False Positive): Simulates incorrect labels by adding a new, randomly generated fake bounding box to the file.

* The script uses `argparse` to be runnable from the command line with a `--task` argument and an optional `--p` argument for the noise fraction (default `0.2`).
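
The two noise types can be illustrated on in-memory label lines before touching any files. This is a minimal sketch with hypothetical helper names (`pick_files`, `drop_random_box`, `add_fake_box`); the value ranges for the fake box match the script, while the seeded `random.Random` instance is an addition for this example so that results can be reproduced.

```python
import random


def pick_files(files, fraction, rng):
    """Randomly choose int(len(files) * fraction) files to corrupt."""
    return rng.sample(files, int(len(files) * fraction))


def drop_random_box(lines, rng):
    """False negative: remove one randomly chosen label line, if there is one."""
    lines = list(lines)
    if lines:
        lines.pop(rng.randrange(len(lines)))
    return lines


def add_fake_box(lines, rng):
    """False positive: append one small random box labelled as class 0."""
    x, y = rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)
    w, h = rng.uniform(0.01, 0.05), rng.uniform(0.02, 0.08)
    return list(lines) + [f"0 {x:.6f} {y:.6f} {w:.6f} {h:.6f}\n"]


rng = random.Random(0)  # seeded only for this example
labels = ["0 0.5 0.5 0.1 0.2\n", "0 0.3 0.4 0.05 0.1\n"]
file_names = [f"img_{i}.txt" for i in range(10)]

print(len(pick_files(file_names, 0.2, rng)))   # 2 of 10 files selected
print(len(drop_random_box(labels, rng)))       # one box removed
print(len(add_fake_box(labels, rng)))          # one box added
```

Each selected file loses or gains exactly one box, so the fraction passed in controls how many files are affected, not how many boxes.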

In [None]:
%%writefile /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/noise_injector.py

import os
import shutil
import random
import glob
from tqdm import tqdm
import argparse

# Define the base directory for the dataset.
BASE_PATH = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset'

# Path to the directory containing clean (unmodified) training labels, used as a backup.
CLEAN_LABEL_DIR_SOURCE = os.path.join(BASE_PATH, 'labels', 'train_CLEAN_BACKUP') # <-- Backup folder

# Path to the target directory where noisy labels will be placed (this is the active training label folder).
NOISY_LABEL_DIR_TARGET = os.path.join(BASE_PATH, 'labels', 'train') # <-- Write directly to the train folder

def create_backup():
    # Creates a backup copy of clean training labels.
    print("Creating clean label backup...")
    # Check if a backup already exists to avoid overwriting unless intended.
    if os.path.exists(CLEAN_LABEL_DIR_SOURCE):
        print("Backup already exists.")
        return

    # Define the current training label directory to be backed up.
    clean_label_dir = os.path.join(BASE_PATH, 'labels', 'train')
    # Ensure the source directory exists before attempting to copy.
    if not os.path.exists(clean_label_dir):
        print("ERROR: Clean train directory not found to create backup!")
        return

    # Copy the entire directory containing clean labels to the backup path.
    shutil.copytree(clean_label_dir, CLEAN_LABEL_DIR_SOURCE)
    print(f"Backup created at: {CLEAN_LABEL_DIR_SOURCE}")

def restore_backup():
    # Restores clean labels to the active training directory from backup.
    print("Restoring clean labels from backup...")
    # Check if a backup exists to perform a restore operation.
    if not os.path.exists(CLEAN_LABEL_DIR_SOURCE):
        print("ERROR: No backup found!")
        return

    # Delete the current (potentially noisy) training label folder.
    if os.path.exists(NOISY_LABEL_DIR_TARGET):
        shutil.rmtree(NOISY_LABEL_DIR_TARGET)

    # Copy the clean labels from the backup to the active training label folder.
    shutil.copytree(CLEAN_LABEL_DIR_SOURCE, NOISY_LABEL_DIR_TARGET)
    print("Clean labels restored.")


def inject_noise(noise_type, percentage):

    # Injects specified type of noise directly into the 'train' labels folder.
    # Ensure a clean backup exists before injecting noise to guarantee reversibility.
    if not os.path.exists(CLEAN_LABEL_DIR_SOURCE):
        print("ERROR: Clean backup not found. Please create backup first.")
        return

    print(f"Injecting '{noise_type}' noise at {percentage*100}%...")

    # First, restore clean labels to ensure noise is injected into a consistent state.
    restore_backup()

    # Get all label files in the (now clean) training directory.
    clean_label_files = glob.glob(os.path.join(NOISY_LABEL_DIR_TARGET, '*.txt'))
    print(f"Found {len(clean_label_files)} clean label files to process.")

    # Calculate how many files to apply noise to, based on the percentage.
    num_files_to_noise = int(len(clean_label_files) * percentage)

    # Randomly select a subset of files for noise injection.
    files_to_noise = random.sample(clean_label_files, num_files_to_noise)
    files_to_noise_set = set(files_to_noise) # Convert to set for efficient lookup if needed (though not critical here).

    print(f"Injecting {percentage*100}% noise into {len(files_to_noise_set)} files...")

    # Iterate through the selected files and apply the specified noise type.
    for clean_file_path in tqdm(files_to_noise_set):
        try:
            with open(clean_file_path, 'r') as f:
                lines = f.readlines()
        except Exception as e:
            print(f"Warning: Could not read {clean_file_path}, skipping. Error: {e}")
            continue

        if noise_type == 'fn':

            # --- Experiment 1: "Missing Labels" Noise (False Negative) ---
            if lines: # Only modify if the file is not already empty.

                # Remove a random existing bounding box annotation line.
                lines.pop(random.randint(0, len(lines) - 1))

            # Overwrite the original file with the modified (noisy) content.
            with open(clean_file_path, 'w') as f_fn:
                f_fn.writelines(lines)

        elif noise_type == 'fp':

            # --- Experiment 2: "False Positive" Noise ---
            # Generate random coordinates and dimensions for a new, fake bounding box.
            fake_x = random.uniform(0.1, 0.9)
            fake_y = random.uniform(0.1, 0.9)
            fake_w = random.uniform(0.01, 0.05)
            fake_h = random.uniform(0.02, 0.08)

            # Add the new fake bounding box as a pedestrian (class 0).
            lines.append(f"0 {fake_x:.6f} {fake_y:.6f} {fake_w:.6f} {fake_h:.6f}\n")

            # Overwrite the original file with the modified (noisy) content.
            with open(clean_file_path, 'w') as f_fp:
                f_fp.writelines(lines)

    print(f"--- Noise injection for '{noise_type}' complete! ---")

# --- Run everything (via command line) ---
# This block allows the script to be run with command-line arguments to specify the task.
if __name__ == "__main__":
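    parser = argparse.ArgumentParser()

    # Define the '--task' argument, which can be 'backup', 'restore', 'inject_fn', or 'inject_fp'.
    parser.add_argument('--task', type=str, required=True,
                        help='One of: backup, restore, inject_fn, inject_fp')

    # Define the '--p' argument: the fraction of label files to corrupt.
    # A literal percent sign must be written as '%%' in argparse help strings.
    parser.add_argument('--p', type=float, default=0.2, help='Noise percentage (e.g., 0.1 for 10%%)')
    opt = parser.parse_args()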

    # Call the appropriate function based on the provided task argument.
    if opt.task == 'backup':
        create_backup()
    elif opt.task == 'restore':
        restore_backup()
    elif opt.task == 'inject_fn':
        inject_noise('fn', opt.p)
    elif opt.task == 'inject_fp':
        inject_noise('fp', opt.p)
    else:
        print("Invalid task. Choose: backup, restore, inject_fn, or inject_fp.")

## **Clear Dataset Cache**

YOLOv7 stores the parsed labels in cache files (`train.cache` and `val.cache`) to speed up dataset loading. Whenever the contents of the `labels/train` directory change, for example after noise injection, these cache files must be deleted. This forces YOLOv7 to re-read the label folders and build a new cache that reflects the current labels for the next training run.

The cell below deletes the existing dataset cache files (`train.cache` and `val.cache`) before the model is trained.
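The same cleanup can also be done in plain Python instead of `rm`. The sketch below is only an illustration: the helper name `clear_label_caches` is hypothetical, and it assumes the cache files sit directly inside the `labels` folder, as the paths in this notebook suggest.

```python
from pathlib import Path


def clear_label_caches(labels_dir):
    """Delete every *.cache file directly inside labels_dir; return the removed paths."""
    removed = []
    for cache_file in sorted(Path(labels_dir).glob("*.cache")):
        cache_file.unlink()
        removed.append(cache_file)
    return removed
```

Calling `clear_label_caches('/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels')` would return the list of files it actually removed, so an empty list signals that there was nothing to clear.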

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

## **Create Clean Label Backup**

This cell executes the `noise_injector.py` script with the `--task backup` argument. This creates the essential backup of the clean training labels, storing them in the `.../labels/train_CLEAN_BACKUP` directory. This backup is what allows us to inject noise and then restore the clean state later.
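The script's own `create_backup` function is defined outside this section. As a rough idea of what such a step involves, here is a minimal sketch under stated assumptions: the name `backup_labels` is hypothetical, and the refusal to overwrite an existing backup is a design choice of this sketch, not necessarily the script's behavior.

```python
import shutil
from pathlib import Path


def backup_labels(label_dir, backup_dir):
    """Copy label_dir to backup_dir once; never overwrite an existing backup."""
    label_dir, backup_dir = Path(label_dir), Path(backup_dir)
    if backup_dir.exists():
        # An existing backup is assumed to hold the clean labels, so leave it alone.
        return False
    shutil.copytree(label_dir, backup_dir)
    return True
```

Refusing to overwrite matters here because a second backup taken after noise injection would replace the clean labels with noisy ones.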

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to create a backup of the clean labels.
# This step is crucial before injecting any noise, allowing for easy restoration.
!python noise_injector.py --task backup

## **Sensitivity Analysis: Varying False Negative Rates**

To understand the relationship between noise intensity and model degradation, we will now conduct a **Sensitivity Analysis**.

We will train the model with different levels of False Negative (FN) noise:
1.  10% FN Noise: To see if the model can survive low-level corruption.
2.  30% FN Noise: To see how severely the model breaks under heavy corruption.
3.  20% FN Noise: To see if the model can survive mid-level corruption.

**Start Training: "False Negative" (FN) Model**

The cells below train a new model (starting from the pretrained `yolov7.pt` weights) on the FN-noisy dataset we created.

The command matches the baseline training apart from the `--name` argument and the shorter 10-epoch schedule. We can now see how the model's performance (mAP, etc.) is affected by training on data with different percentages of missing labels. Each run took approximately one hour.
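Each noise level repeats the same three steps: inject noise, clear the cache, train. The sketch below only builds the command strings for such a sweep and does not execute anything. The helper name `build_sweep_commands` is hypothetical, and the relative cache paths assume the working directory is the `yolov7` folder.

```python
def build_sweep_commands(noise_task, levels, epochs=10):
    """Return, for each noise level, the shell commands of one sensitivity run."""
    prefix = noise_task.split("_")[-1]  # 'inject_fn' -> 'fn'
    sweep = []
    for p in levels:
        run_name = f"sensitivity_{prefix}_{round(p * 100)}p"
        sweep.append([
            # Step 1: inject noise into a fraction p of the label files.
            f"python noise_injector.py --task {noise_task} --p {p}",
            # Step 2: remove stale caches so the noisy labels are re-read.
            "rm -f data/Combined_Dataset/labels/train.cache "
            "data/Combined_Dataset/labels/val.cache",
            # Step 3: train with the same settings used in this notebook.
            f"python train.py --weights yolov7.pt --data data/combined_dataset.yaml "
            f"--img 640 --batch-size 16 --epochs {epochs} --name {run_name} --device 0",
        ])
    return sweep
```

For example, `build_sweep_commands("inject_fn", [0.1, 0.3, 0.2])` yields three command groups whose run names are `sensitivity_fn_10p`, `sensitivity_fn_30p`, and `sensitivity_fn_20p`.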

**10% False Negative Noise**

Injecting 10% missing labels (`p=0.1`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false negative' noise into the dataset.
# This simulates scenarios where some true objects are not labeled.
!python noise_injector.py --task inject_fn --p 0.1

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fn_10p \
    --device 0

**30% False Negative Noise**

Injecting 30% missing labels (`p=0.3`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false negative' noise into the dataset.
# This simulates scenarios where some true objects are not labeled.
!python noise_injector.py --task inject_fn --p 0.3

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fn_30p \
    --device 0

**20% False Negative Noise**

Injecting 20% missing labels (`p=0.2`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false negative' noise into the dataset.
# This simulates scenarios where some true objects are not labeled.
!python noise_injector.py --task inject_fn --p 0.2

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fn_20p \
    --device 0

## **Experiment 1: Inject "False Negative" (FN) Noise**

This cell runs the first experiment by executing the noise injector with `--task inject_fn`.

The script performs the following:

1. Restores the clean labels from the backup to the `labels/train` directory.

2. Selects 20% of the label files (492 files in this case).

3. Injects Noise: It loops through those 492 files and randomly deletes one line (one bounding box) from each, simulating missing labels (False Negatives). The `labels/train` directory now contains our 20% FN noisy dataset, ready for training.
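The selection and deletion steps can be sketched in memory, without touching any files. This is an illustration under assumptions: the helper names `select_files` and `drop_one_box` are hypothetical, and a seeded `random.Random` instance is used here for reproducibility, whereas the script relies on the module-level `random` functions.

```python
import random


def select_files(label_files, percentage, rng):
    """Pick int(len * percentage) files at random, mirroring the injector's count."""
    count = int(len(label_files) * percentage)
    return rng.sample(label_files, count)


def drop_one_box(lines, rng):
    """Return a copy of lines with one randomly chosen annotation removed."""
    if not lines:
        return []  # an empty label file has nothing to remove
    noisy = list(lines)
    noisy.pop(rng.randrange(len(noisy)))
    return noisy
```

Because the count is truncated with `int(...)`, a 20% sample of 492 files is consistent with a training set of roughly 2,460 label files.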

**Start Training: "False Negative" (FN) Model**

This cell trains a new model (starting from the pretrained `yolov7.pt` weights) on the FN-noisy dataset we just created.

The command is identical to the baseline training, except for the `--name` argument, which is set to `noisy_fn_run`. This ensures the results are saved to a new directory, `runs/train/noisy_fn_run`, for comparison. We can now see how the model's performance (mAP, etc.) is affected by training on data with 20% missing labels. This run took approximately 9.6 hours.

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run training on the False Negative (FN) dataset.
# This trains a new model using the labels that have 'false negative' noise.
# The training parameters are identical to the baseline run for fair comparison.
!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 100 \
    --name noisy_fn_run \
    --device 0

## **Sensitivity Analysis: Varying False Positive Rates**

To understand the relationship between noise intensity and model degradation, we will now conduct a **Sensitivity Analysis**.

We will train the model with different levels of False Positive (FP) noise:
1.  10% FP Noise: To see if the model can survive low-level corruption.
2.  30% FP Noise: To see how severely the model breaks under heavy corruption.
3.  20% FP Noise: To see if the model can survive mid-level corruption.

**Start Training: "False Positive" (FP) Model**

The cells below train a new model (starting from the pretrained `yolov7.pt` weights) on the FP-noisy dataset we created.

The command matches the baseline training apart from the `--name` argument and the shorter 10-epoch schedule. We can now see how the model's performance (mAP, etc.) is affected by training on data with different percentages of fake labels. Each run took approximately one hour.

**10% False Positive Noise**

Injecting 10% fake labels (`p=0.1`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false positive' noise into the dataset.
# This simulates scenarios where some labeled boxes do not correspond to real objects.
!python noise_injector.py --task inject_fp --p 0.1

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fp_10p \
    --device 0

**30% False Positive Noise**

Injecting 30% fake labels (`p=0.3`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false positive' noise into the dataset.
# This simulates scenarios where some labeled boxes do not correspond to real objects.
!python noise_injector.py --task inject_fp --p 0.3

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fp_30p \
    --device 0

**20% False Positive Noise**

Injecting 20% fake labels (`p=0.2`) and training for 10 epochs.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false positive' noise into the dataset.
# This simulates scenarios where some labeled boxes do not correspond to real objects.
!python noise_injector.py --task inject_fp --p 0.2

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Corrupted cache file '{cache_file_train}' removed.")
print(f"Corrupted cache file '{cache_file_val}' deleted.")
print("Ready to try again!")

!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 10 \
    --name sensitivity_fp_20p \
    --device 0

## **Experiment 2: Inject "False Positive" (FP) Noise**

This cell runs the second experiment by executing `--task inject_fp`.

The script again:

1. Restores the clean labels from the backup (this wipes the FN noise from the previous experiment).

2. Selects a new random 20% sample of label files (492 files).

3. Injects Noise: It loops through these files and adds one new, randomly generated fake bounding box to each, simulating incorrect labels (False Positives). The `labels/train` directory now contains our 20% FP noisy dataset.
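The fake-box step can be sketched as a pair of small functions. The sampling ranges are the ones used in `noise_injector.py`. The function names and the seeded `random.Random` instance are assumptions made for this illustration.

```python
import random


def make_fake_box(rng):
    """Build one YOLO-format line for a fake pedestrian (class 0)."""
    x = rng.uniform(0.1, 0.9)    # normalized box center, x
    y = rng.uniform(0.1, 0.9)    # normalized box center, y
    w = rng.uniform(0.01, 0.05)  # normalized width
    h = rng.uniform(0.02, 0.08)  # normalized height
    return f"0 {x:.6f} {y:.6f} {w:.6f} {h:.6f}\n"


def add_fake_box(lines, rng):
    """Return a copy of lines with one fake annotation appended."""
    return list(lines) + [make_fake_box(rng)]
```

With these ranges a fake box always stays inside the image, since the center is at least 0.1 from every edge and the half-width and half-height are at most 0.025 and 0.04. On a 2048-pixel-wide Cityscapes frame, the sampled widths correspond to roughly 20 to 102 pixels.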

**Start Training: "False Positive" (FP) Model**

This cell trains the third and final model, this time on the FP-noisy dataset.

The command is identical to the others, but the `--name` is `noisy_fp_run`. This saves the results to `runs/train/noisy_fp_run`. This will allow us to compare the baseline model against the FN-noisy model and the FP-noisy model to see how different types of label noise impact training performance. This run took approximately 15.7 hours.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false positive' noise into the dataset.
# This simulates scenarios where some labeled boxes do not correspond to real objects.
!python noise_injector.py --task inject_fp --p 0.2

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This forces YOLOv7 to regenerate the cache files, incorporating the newly injected noise.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")
print("Ready for the second test!")

# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run training on the False Positive dataset.
# This trains a new model using labels that have 'false positive' noise (extra, incorrect bounding boxes).
# The training parameters are consistent with previous runs for comparison.
!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 100 \
    --name noisy_fp_run \
    --device 0

## **Patch Model Loading for Testing**

This cell applies the same `weights_only=False` patch to `models/experimental.py`. This file is used by the `test.py` and `detect.py` scripts when loading trained models. This patch is necessary to ensure we can successfully load our saved `best.pt` files for evaluation.
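A Python version of the same substitution can report whether anything was actually changed, which `sed -i` does not do on its own. The sketch below is an illustration only; the helper name `patch_torch_load` is hypothetical, and the two strings are the exact search and replacement texts from the `sed` command in this notebook.

```python
from pathlib import Path

OLD_CALL = "torch.load(w, map_location=map_location)"
NEW_CALL = "torch.load(w, map_location=map_location, weights_only=False)"


def patch_torch_load(file_path):
    """Rewrite OLD_CALL to NEW_CALL in file_path; return the number of replacements."""
    path = Path(file_path)
    source = path.read_text()
    count = source.count(OLD_CALL)
    if count:
        path.write_text(source.replace(OLD_CALL, NEW_CALL))
    return count
```

Running it a second time returns 0, because the patched call no longer matches `OLD_CALL`. A return value of 0 on the first run would suggest the file was already patched or the call is written differently in that version of the repository.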

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Use sed to rewrite the torch.load call in models/experimental.py in place.
# This modification is crucial for correctly loading model weights during testing or inference,
# especially when the weights might contain more than just the model's state dictionary.
# By setting `weights_only=False`, PyTorch allows loading of a broader range of checkpoint structures.
!sed -i "s/torch.load(w, map_location=map_location)/torch.load(w, map_location=map_location, weights_only=False)/" models/experimental.py

print("\n--- The file models/experimental.py was modified successfully! ---")
print("Now you can run the test command again.")

## **Evaluate Baseline Model (Clean Data)**

This cell evaluates the performance of the baseline (clean) model.

* It first clears the dataset cache files.

* It then runs `test.py` (which evaluates on the validation set).

* `--weights`: It specifies the path to the `best.pt` file from our `baseline_clean_run`.

* `--name test_baseline_clean`: Saves the test results to this folder.

The output shows the final metrics for the baseline model: mAP@.5 of 0.34.

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the test script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the test.py script to evaluate the performance of the baseline model.
# --data: Specifies the dataset configuration (combined_dataset.yaml).
# --img: Sets the image size for testing to 640x640 pixels.
# --batch: Sets the batch size for inference to 32.
# --weights: Points to the trained weights of the baseline model.
# --name: Assigns a name to this test run for organized output.
!python test.py \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch 32 \
    --weights runs/train/baseline_clean_run/weights/best.pt \
    --name test_baseline_clean

## **Evaluate "False Negative" (FN) Model**

This cell evaluates the performance of the model trained on False Negative (FN) noise.

* It clears the cache.

* It runs `test.py` using the `best.pt` file from the `noisy_fn_run`.

The output shows a massive drop in performance. The model trained on missing labels achieved an mAP@.5 of only 0.0292.

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the test script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the test.py script to evaluate the performance of the model trained with 'false negative' noise.
# --data: Specifies the dataset configuration (combined_dataset.yaml).
# --img: Sets the image size for testing to 640x640 pixels.
# --batch: Sets the batch size for inference to 32.
# --weights: Points to the trained weights of the model with false negative noise.
# --name: Assigns a name to this test run for organized output.
!python test.py \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch 32 \
    --weights runs/train/noisy_fn_run/weights/best.pt \
    --name test_noisy_fn

## **Evaluate "False Positive" (FP) Model**

This cell evaluates the performance of the model trained on False Positive (FP) noise.

* It clears the cache.

* It runs `test.py` using the `best.pt` file from the `noisy_fp_run`.

This model performed much better than the FN model and even outperformed the baseline. It achieved an mAP@.5 of 0.442.
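To put the three reported mAP@.5 values on a common scale, the relative change against the baseline can be computed directly. This is a small arithmetic sketch; the constant and function names are illustrative, and the numbers are the ones quoted in the evaluation sections of this notebook.

```python
def relative_change(value, reference):
    """Return (value - reference) / reference."""
    return (value - reference) / reference


# mAP@.5 values reported in the evaluation sections of this notebook
BASELINE_MAP50 = 0.34
FN_MAP50 = 0.0292
FP_MAP50 = 0.442

fn_change = relative_change(FN_MAP50, BASELINE_MAP50)
fp_change = relative_change(FP_MAP50, BASELINE_MAP50)
print(f"FN vs baseline: {fn_change:+.1%}")
print(f"FP vs baseline: {fp_change:+.1%}")
```

In relative terms this works out to a drop of about 91% for the FN model and a gain of about 30% for the FP model. These are single runs, so the comparison does not account for run-to-run variance.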

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the test script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the test.py script to evaluate the performance of the model trained with 'false positive' noise.
# --data: Specifies the dataset configuration (combined_dataset.yaml).
# --img: Sets the image size for testing to 640x640 pixels.
# --batch: Sets the batch size for inference to 32.
# --weights: Points to the trained weights of the model with false positive noise.
# --name: Assigns a name to this test run for organized output.
!python test.py \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch 32 \
    --weights runs/train/noisy_fp_run/weights/best.pt \
    --name test_noisy_fp

## **Analyze Bounding Box Size Distribution**

This cell performs data analysis on our clean dataset (using the `train_CLEAN_BACKUP` folder).

1. It iterates through all the label files (`.txt`).

2. For each bounding box, it extracts the normalized width (`parts[3]`).

3. It converts this normalized width back into pixel values, assuming a standard Cityscapes image width of 2048px.

4. Finally, it uses `matplotlib` to plot a histogram of these widths.

The resulting chart shows the distribution of pedestrian bounding box sizes. The vertical red line at 10 pixels highlights that a significant number of objects in the dataset are "Extremely Small," which is a key challenge for the detector.
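The share of "Extremely Small" boxes can also be expressed as a single number rather than read off the histogram. The sketch below assumes the same 2048-pixel image width as the analysis cell; the function names are hypothetical.

```python
def box_widths_px(label_lines, img_width_px=2048):
    """Convert normalized YOLO widths (5-field lines) to pixel widths."""
    widths = []
    for line in label_lines:
        parts = line.split()
        if len(parts) == 5:  # class, x_center, y_center, width, height
            widths.append(float(parts[3]) * img_width_px)
    return widths


def fraction_below(widths, threshold_px=10):
    """Share of boxes narrower than threshold_px (0.0 for an empty list)."""
    if not widths:
        return 0.0
    return sum(w < threshold_px for w in widths) / len(widths)
```

Note that the fixed width is only exact for Cityscapes frames. Images from the custom dataset may have a different resolution, in which case their pixel widths would be over- or under-estimated.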

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import glob
from tqdm import tqdm

# --- Clean labels folder path ---
# This variable defines the path to the directory containing clean (unmodified) label files.
# This is typically a backup of the original labels before any noise injection.
clean_label_dir = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train_CLEAN_BACKUP'
# Alternative: If no backup was made, use the validation set labels for analysis.
# clean_label_dir = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val'

print(f"Scanning {clean_label_dir}...")

widths = []
heights = [] # Reserved for box heights; not populated or used in the plot below.
img_width_px = 2048 # Cityscapes image width in pixels, used for denormalizing bounding box widths.

# Retrieve all text files in the specified clean label directory.
label_files = glob.glob(os.path.join(clean_label_dir, '*.txt'))

# Iterate through each label file to extract bounding box dimensions.
for file_path in tqdm(label_files):
    try:
        with open(file_path, 'r') as f:
            lines = f.readlines()

        for line in lines:
            parts = line.strip().split()
            if len(parts) == 5:
                # The normalized width (w) is typically the 4th element (index 3) in YOLO format.
                w_normalized = float(parts[3])
                h_normalized = float(parts[4]) # Parsed for completeness; not used in this specific plot.

                # Convert the normalized width to pixel values for analysis.
                widths.append(w_normalized * img_width_px)

    except Exception as e:
        print(f"Error reading {file_path}: {e}")

# --- Drawing a chart ---
# Create a new figure for the plot with a specified size.
plt.figure(figsize=(10, 6))

# Generate a histogram of the pedestrian bounding box widths.
# bins: number of bars in the histogram.
# range: limits of the x-axis for the histogram.
plt.hist(widths, bins=100, range=(0, 400), color='blue', alpha=0.7)
plt.title('Distribution of Pedestrian Bounding Box Widths (in Pixels)')
plt.xlabel('Width (pixels)')
plt.ylabel('Number of Instances')
plt.grid(axis='y', linestyle='--') # Add a grid for better readability on the y-axis.

# Add a vertical dashed red line to highlight very small objects (width < 10 pixels).
plt.axvline(x=10, color='red', linestyle='--', label='Extremely Small (<10px)')
plt.legend()

print("\nChart is ready")
plt.show() # Display the generated plot.

## **Training Curves: Stability over 100 epochs.**

The following cell visualizes the training progress of our three distinct experimental runs over 100 epochs. By plotting the Mean Average Precision (mAP), we can directly compare how different types of label noise affect the model's learning trajectory.

The plot compares:

1.  <span style="color:blue">Baseline (Blue)</span>: The model trained on clean, unmodified data.
2.  <span style="color:red">False Negative (Red)</span>: The model trained with 20% missing labels.
3.  <span style="color:green">False Positive (Green)</span>: The model trained with 20% extra (fake) labels.

**Technical Details:**
* X-Axis: Training Epochs (0-100).
* Y-Axis: mAP score (Scale 0.0 to 0.5).
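Per-epoch mAP curves tend to be noisy, so two simple summaries can help when comparing runs: the best epoch and a trailing moving average. The helpers below are a sketch with hypothetical names; they work on any list of per-epoch values, such as the three lists defined in the next cell.

```python
def best_epoch(values):
    """Return (epoch_index, value) of the highest entry (first one on ties)."""
    index = max(range(len(values)), key=values.__getitem__)
    return index, values[index]


def moving_average(values, window=5):
    """Trailing mean over up to `window` epochs, same length as the input."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

`best_epoch` raises a `ValueError` on an empty list, which is left unhandled here on purpose so that a missing log is not silently ignored.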

In [None]:
import matplotlib.pyplot as plt

# Data extracted from the logs
# The baseline list holds 100 values (epochs 0-99); the FN and FP lists below hold 99 values each.
# 1. Baseline (Blue)
baseline_map = [
    0.267, 0.00003, 0.0074, 0.0158, 0.00003, 0.0086, 0.0087, 0.0157, 0.0487, 0.0244,
    0.0449, 0.0548, 0.103, 0.039, 0.0827, 0.0902, 0.119, 0.0836, 0.0604, 0.0885,
    0.124, 0.147, 0.162, 0.154, 0.157, 0.167, 0.0428, 0.134, 0.138, 0.155,
    0.182, 0.0553, 0.115, 0.161, 0.155, 0.155, 0.182, 0.181, 0.200, 0.206,
    0.206, 0.203, 0.215, 0.219, 0.204, 0.218, 0.217, 0.231, 0.226, 0.244,
    0.240, 0.257, 0.244, 0.260, 0.257, 0.255, 0.267, 0.259, 0.274, 0.273,
    0.274, 0.283, 0.294, 0.279, 0.279, 0.286, 0.288, 0.296, 0.290, 0.299,
    0.301, 0.300, 0.306, 0.309, 0.304, 0.311, 0.312, 0.316, 0.322, 0.321,
    0.318, 0.323, 0.326, 0.326, 0.330, 0.327, 0.331, 0.331, 0.330, 0.333,
    0.332, 0.333, 0.337, 0.336, 0.337, 0.341, 0.344, 0.340, 0.341, 0.342
]

# 2. FN (Red)
fn_map = [
    0.000001, 0.0, 0.000001, 0.0, 0.000002, 0.00006, 0.00003, 0.00008, 0.00002, 0.0002,
    0.0001, 0.0002, 0.00001, 0.00002, 0.00009, 0.0005, 0.0005, 0.002, 0.0003, 0.0003,
    0.0002, 0.001, 0.0014, 0.0015, 0.0023, 0.0018, 0.0022, 0.001, 0.00002, 0.0013,
    0.0005, 0.002, 0.0029, 0.001, 0.0006, 0.0035, 0.0044, 0.0029, 0.0058, 0.003,
    0.000007, 0.0051, 0.0002, 0.00008, 0.00004, 0.0018, 0.0019, 0.0005, 0.0001, 0.0005,
    0.0001, 0.0015, 0.0019, 0.003, 0.0013, 0.0063, 0.0084, 0.0096, 0.0098, 0.0107,
    0.0103, 0.0035, 0.0043, 0.0108, 0.010, 0.0069, 0.009, 0.0092, 0.0145, 0.0126,
    0.0128, 0.0172, 0.0157, 0.0161, 0.0153, 0.0149, 0.014, 0.0156, 0.0178, 0.0163,
    0.0189, 0.0158, 0.019, 0.022, 0.0217, 0.0205, 0.0229, 0.0243, 0.0278, 0.0226,
    0.0246, 0.0286, 0.0298, 0.0297, 0.026, 0.0293, 0.0275, 0.0269, 0.0288
]

# 3. FP (Green)
fp_map = [
    0.166, 0.306, 0.385, 0.319, 0.0373, 0.0059, 0.0000005, 0.0012, 0.0046, 0.015,
    0.0005, 0.0114, 0.0236, 0.0243, 0.0321, 0.044, 0.0505, 0.0562, 0.0868, 0.0795,
    0.0969, 0.137, 0.166, 0.156, 0.159, 0.172, 0.174, 0.0024, 0.0661, 0.140,
    0.152, 0.179, 0.201, 0.233, 0.229, 0.243, 0.262, 0.242, 0.276, 0.269,
    0.279, 0.250, 0.270, 0.278, 0.295, 0.274, 0.288, 0.315, 0.312, 0.326,
    0.335, 0.332, 0.338, 0.344, 0.343, 0.365, 0.345, 0.364, 0.369, 0.362,
    0.370, 0.377, 0.381, 0.387, 0.392, 0.383, 0.389, 0.385, 0.396, 0.393,
    0.392, 0.404, 0.409, 0.407, 0.404, 0.400, 0.419, 0.416, 0.419, 0.419,
    0.421, 0.426, 0.427, 0.429, 0.425, 0.429, 0.437, 0.441, 0.436, 0.437,
    0.438, 0.436, 0.439, 0.444, 0.441, 0.444, 0.444, 0.426, 0.435
]

# Create the plot
plt.figure(figsize=(12, 7))

# Plot lines with specific colors
plt.plot(baseline_map, label='Baseline (Blue)', color='blue', linewidth=2)
plt.plot(fn_map, label='FN (Red)', color='red', linewidth=2)
plt.plot(fp_map, label='FP (Green)', color='green', linewidth=2)

# Formatting the plot
plt.xlabel('Epochs', fontsize=12)
plt.ylabel('mAP@0.5', fontsize=12)
plt.title('Training Curve Plot (mAP vs Epochs)', fontsize=14)
plt.legend(loc='upper left', fontsize=10)
plt.grid(True, linestyle='--', alpha=0.6)
plt.xlim(0, 100)
plt.ylim(0, 0.5)

print("\nChart is ready")
plt.show() # Display the generated plot.

## **Run Detection on Test Images (Baseline Model)**

This cell runs `detect.py` to generate visual predictions on a set of unseen test images (located in `data/Combined_Dataset/images/test/`).

* It uses the `best.pt` weights from the `baseline_clean_run`.

* The output images (with bounding boxes drawn on them) are saved to `runs/detect/detect_baseline_clean`.

In [None]:
# Define the paths of the corrupted cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the detection script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the detect.py script to perform inference using the baseline model on a test set.
# --weights: Specifies the path to the trained weights of the baseline model.
# --source: Defines the directory containing the test images for detection.
# --name: Assigns a name to this detection run for organized output.
!python detect.py \
    --weights runs/train/baseline_clean_run/weights/best.pt \
    --source data/detection_test/ \
    --name detect_baseline_clean
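
Because `rm -f` stays silent when a file is missing, the cell above prints "deleted" for both cache files whether or not they existed. A minimal sketch of the same cleanup in pure Python, reporting only what was actually removed, could look like this. `remove_cache_files` is a hypothetical helper, and the temporary directory stands in for the real `labels/` folder.

```python
from pathlib import Path
import tempfile


def remove_cache_files(paths):
    """Delete each path if it exists; return the paths that were actually removed."""
    removed = []
    for p in map(Path, paths):
        if p.is_file():
            p.unlink()
            removed.append(p)
    return removed


# Demonstration in a temporary directory (a stand-in for the labels/ folder).
with tempfile.TemporaryDirectory() as tmp:
    train_cache = Path(tmp) / "train.cache"
    val_cache = Path(tmp) / "val.cache"
    train_cache.write_bytes(b"stale")  # only train.cache exists here
    removed = remove_cache_files([train_cache, val_cache])
    removed_names = [p.name for p in removed]
    print("Removed:", removed_names)
```

In the notebook, the two `cache_file_*` variables could be passed to the helper in place of the temporary paths.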

## **Run Detection on Test Images (FN Noisy Model)**

This cell runs detection on the same test images, but this time using the `best.pt` weights from the `noisy_fn_run`.

* The output images are saved to `runs/detect/detect_noisy_fn`.

* As expected from the poor mAP score, the output log shows that this model detects very few (or no) pedestrians in most images.

In [None]:
# Define the paths of the stale dataset cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the detection script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the detect.py script to perform inference using the model trained with 'false negative' noise.
# --weights: Specifies the path to the trained weights of the FN noisy model.
# --source: Defines the directory containing the test images for detection.
# --name: Assigns a name to this detection run for organized output.
!python detect.py \
    --weights runs/train/noisy_fn_run/weights/best.pt \
    --source data/detection_test/ \
    --name detect_noisy_fn

## **Run Detection on Test Images (FP Noisy Model)**

This cell runs detection on the test images using the `best.pt` weights from the `noisy_fp_run`.

* The output images are saved to `runs/detect/detect_noisy_fp`.

* These visual results, along with the high mAP score, demonstrate the surprising improvement of the model when trained with false positive noise.

In [None]:
# Define the paths of the stale dataset cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the detection script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the detect.py script to perform inference using the model trained with 'false positive' noise.
# --weights: Specifies the path to the trained weights of the FP noisy model.
# --source: Defines the directory containing the test images for detection.
# --name: Assigns a name to this detection run for organized output.
!python detect.py \
    --weights runs/train/noisy_fp_run/weights/best.pt \
    --source data/detection_test/ \
    --name detect_noisy_fp

## **Mitigation Strategy: Training with Label Smoothing**

Our experiments showed that **False Negative** noise (missing labels) is highly destructive to the model's performance. When the model sees a pedestrian but the label says "background" (missing box), it is penalized heavily, forcing it to unlearn the features of a pedestrian.
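
To make the "missing box" case concrete, here is a minimal sketch of what False Negative noise does to a YOLO-format label file: each line (one bounding box) is dropped with some probability. This is an illustration only, not the notebook's `noise_injector.py`; the function name `drop_boxes`, the `drop_prob` parameter, and the sample label lines are assumptions.

```python
import random


def drop_boxes(label_lines, drop_prob, seed=0):
    """Keep each YOLO label line with probability 1 - drop_prob."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [line for line in label_lines if rng.random() >= drop_prob]


# Hypothetical label file: "class x_center y_center width height" per pedestrian.
labels = [
    "0 0.50 0.50 0.10 0.30",
    "0 0.20 0.55 0.08 0.25",
    "0 0.80 0.52 0.09 0.28",
]

noisy = drop_boxes(labels, drop_prob=0.5)
print(f"kept {len(noisy)} of {len(labels)} boxes")
```

Every dropped line leaves a pedestrian in the image that the training loss now treats as background.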

To mitigate this, we employ **Label Smoothing**.

* **Standard Training:** Uses "Hard Labels" (0 or 1). If a label is missing, the model is forced to predict 0 with high confidence.
* **Label Smoothing:** Softens the target (e.g., 0.9 instead of 1.0). This prevents the model from becoming over-confident in its predictions. In the presence of noisy labels, this prevents the model from overfitting to the incorrect (missing) annotations.
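
The effect of softened targets can be sketched numerically. The snippet below uses one common formulation for binary cross-entropy targets, in which a smoothing factor `eps` moves the positive target to `1 - eps/2` and the negative target to `eps/2`; other variants yield different values. Whether `--label-smoothing` in YOLOv7 uses exactly these values, and which loss terms it affects, is an assumption here rather than something this notebook verifies. The helper names are hypothetical.

```python
import math


def smooth_targets(eps=0.1):
    """Return (positive_target, negative_target) for smoothed binary labels."""
    return 1.0 - 0.5 * eps, 0.5 * eps


def bce(p, target):
    """Binary cross-entropy of one prediction p in (0, 1) against a target in [0, 1]."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


pos, neg = smooth_targets(0.1)

# A confident "pedestrian" prediction (p = 0.99) whose label is missing,
# so the target is the negative one.
hard_loss = bce(0.99, 0.0)  # hard label: target is exactly 0
soft_loss = bce(0.99, neg)  # smoothed label: target is eps / 2

print(f"targets: pos={pos:.2f}, neg={neg:.2f}")
print(f"hard-label loss={hard_loss:.3f}, smoothed loss={soft_loss:.3f}")
```

With the smoothed target, the penalty for confidently predicting a pedestrian where the box is missing is somewhat lower than with the hard label. This is the mechanism by which label smoothing is expected to soften the impact of missing annotations.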


**Experiment:**
We will retrain the model on the **False Negative** dataset, but this time adding the argument `--label-smoothing 0.1`.

In [None]:
# Make sure we are in the yolov7 folder
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the noise_injector.py script to inject 'false negative' noise into the dataset.
# This simulates scenarios where some true objects are not labeled.
!python noise_injector.py --task inject_fn

# Define the paths of the stale dataset cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the training script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run training on the FN dataset with label-smoothing strategy
!python train.py \
    --weights yolov7.pt \
    --data data/combined_dataset.yaml \
    --img 640 \
    --batch-size 16 \
    --epochs 100 \
    --name robust_fn_run \
    --device 0 \
    --label-smoothing 0.1

## **Run Detection on Test Images (Robust FN Noisy Model)**

This final cell runs detection on the test images using the `best.pt` weights from the `robust_fn_run`.

* The output images are saved to `runs/detect/detect_robust_fn`.

* These visual results, along with the high mAP score, demonstrate the robustness, and even improvement, of the model trained with label smoothing.

In [None]:
# Define the paths of the stale dataset cache files
cache_file_train = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/train.cache'
cache_file_val = '/content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7/data/Combined_Dataset/labels/val.cache'

# Execute the delete command to remove old cache files.
# This ensures that YOLOv7 regenerates fresh cache files for the dataset,
# reflecting any changes made to the label files or dataset structure.
!rm -f {cache_file_train}
!rm -f {cache_file_val}

print(f"Cache file '{cache_file_train}' deleted.")
print(f"Cache file '{cache_file_val}' was deleted.")

# Change to the YOLOv7 directory to execute the detection script.
%cd /content/drive/MyDrive/Pedestrian-Detection-Label-Noise/model/yolov7

# Run the detect.py script to perform inference using the model trained with 'false negative' noise and label smoothing.
# --weights: Specifies the path to the trained weights of the robust FN model.
# --source: Defines the directory containing the test images for detection.
# --name: Assigns a name to this detection run for organized output.
!python detect.py \
    --weights runs/train/robust_fn_run/weights/best.pt \
    --source data/detection_test/ \
    --name detect_robust_fn