# Competition Overview

**Overview: The Synthetic-to-Real Object Detection Challenge 2**

This challenge, a follow-up to the first competition, tasks participants with training an object detection model using exclusively synthetic data and applying it to real-world images. The goal is to detect soup cans in previously unseen photos, an exercise that simulates real-world AI applications where models trained in controlled, virtual environments must perform in unpredictable, physical ones.

**Challenge, Objective, and Process**

The primary objective is to achieve the highest possible Mean Average Precision at IoU 0.5 (mAP@50) score. Participants will train a model using synthetic images of soup cans generated by Falcon, a digital twin simulation software. The trained model will then be used to predict the bounding boxes of soup cans in a set of real-world test images.

Participants are provided with a starter dataset on Kaggle but can significantly boost their performance by leveraging additional datasets available on FalconCloud or by generating their own custom synthetic data using the Falcon Editor tool. The competition encourages participants to manipulate and combine these datasets, exploring the benefits of synthetic data to close the "Sim2Real gap"—the discrepancy in performance between models trained on synthetic data and those deployed in reality.

**Submission and Evaluation**

Submissions must be a single .csv file with predictions in YOLO format. Each row should contain the image_id and a prediction_string that includes the class, confidence, and normalized coordinates of the detected bounding box.

The final score is determined by mAP@50, which measures the accuracy of the predicted bounding boxes against the ground truth labels. A detection is considered correct if its Intersection over Union (IoU) with the true bounding box is at least 0.5. The competition features a public leaderboard for real-time ranking and a private leaderboard for final scoring. To maintain competition integrity, top performers are subject to a verification check.

**Getting Started and Improving Your Score**

Participants can choose from three paths to improve their performance:

- **Use Provided Data:** Train and tune a model using only the data available on Kaggle.
- **Use Supplemental Datasets:** Download additional, more comprehensive datasets from the Falcon website to fine-tune the model.
- **Generate Your Own Data:** For the most significant performance boost, participants are highly encouraged to use the Falcon Editor to create custom synthetic datasets that target specific edge cases, such as varied lighting, occlusions, and camera viewpoints.

Participants can join the Discord community for support and can explore tutorials and documentation provided by Duality AI to learn how to use the Falcon tools effectively.

**Participants Benefits**

- **Prizes:** Cash awards, digital certificates, and public recognition on platforms like LinkedIn and Hugging Face.
- **Skill Development:** Hands-on experience with synthetic data, a skill highly valued in AI fields like robotics, industrial automation, and computer vision.
- **Portfolio Enhancement:** A chance to showcase real-world AI and machine learning skills to recruiters and peers.
- **Free Access to Falcon:** Gain hands-on experience with a professional synthetic data generation tool.

### [Competition Link](https://www.kaggle.com/competitions/synthetic-2-real-object-detection-challenge-2/overview)

## Data Description

The Synthetic-to-Real Object Detection Challenge 2 provides a dataset for training and validating an object detection model exclusively on synthetic images and then testing its performance on real-world images. The dataset is organized into three main directories: train, val, and test, each containing an images and a labels subdirectory.

<pre>
/data
    /train
        /images/ # Synthetic images
        /labels/ # YOLO format annotations
    /val
        /images/ # Synthetic validation images
        /labels/ # YOLO format annotations
    /test
        /images/ # Real-world images (ones you'll be judged on)
</pre>

The training and validation data consist of synthetic images, all with a resolution of 1920x1080 pixels. The corresponding labels are in YOLO format, stored in separate .txt files. These files contain normalized coordinates (x_center, y_center, width, height) and a class_id, which is always 0 for the single object class, "soup can."

The test data consists of real-world images, also 1920x1080 pixels, for which participants must predict the bounding boxes. A crucial note is that external data is allowed only if it is generated using the Falcon tool, which is provided for free to participants. The competition encourages participants to create their own custom datasets with varied environments, lighting, and occlusions to improve model generalization and performance on the real-world test set.

## Data Dictionary

This data dictionary describes the file structure and format of the datasets provided for the competition.

**Main Datasets**

<pre>
/data
    /train
        /images/ # Synthetic images
        /labels/ # YOLO format annotations
    /val
        /images/ # Synthetic validation images
        /labels/ # YOLO format annotations
    /test
        /images/ # Real-world images (ones you'll be judged on)
</pre>

**Label File (.txt) Data Dictionary**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-k9u1{border-color:inherit;color:#1B1C1D;font-size:100%;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Field Name</th>
    <th class="tg-7zrl">Data Type</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">class_id</td>
    <td class="tg-7zrl">int</td>
    <td class="tg-0lax">The object class ID. In this competition, it is always 0, representing a soup can.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">x_center</td>
    <td class="tg-7zrl">float</td>
    <td class="tg-0lax">The normalized x-coordinate of the center of the bounding box (from 0 to 1).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">y_center</td>
    <td class="tg-7zrl">float</td>
    <td class="tg-0lax">The normalized y-coordinate of the center of the bounding box (from 0 to 1).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">width</td>
    <td class="tg-7zrl">float</td>
    <td class="tg-0lax">The normalized width of the bounding box (from 0 to 1).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">height</td>
    <td class="tg-7zrl">float</td>
    <td class="tg-0lax">The normalized height of the bounding box (from 0 to 1).</td>
  </tr>
</tbody></table>

**Submission File (.csv) Data Dictionary**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-k9u1{border-color:inherit;color:#1B1C1D;font-size:100%;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column Name</th>
    <th class="tg-7zrl">Data Type</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">image_id</td>
    <td class="tg-7zrl">object</td>
    <td class="tg-0lax">The filename of the test image without the extension (e.g., 0002).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">prediction_string</td>
    <td class="tg-7zrl">object</td>
    <td class="tg-0lax">A string containing all the predicted bounding boxes for an image. <br>Each prediction is a space-separated sequence: class confidence x_center y_center width height.</td>
  </tr>
</tbody>
</table>

## Addional Integration

**What is Falcon?**

Falcon is a digital twin simulation software developed by Duality AI. Its purpose is to generate synthetic (computer-generated) images for training object detection models. Falcon is the tool used to create the synthetic data of soup cans. Participants can use the provided datasets generated by Falcon or create their own custom datasets using the Falcon Editor to improve their model's ability to perform in real-world scenarios.

Falcon's purpose is to help users generate high-quality training data that can bridge the "Sim2Real gap", which is the difference in performance between models trained on simulated data and their performance in real-world scenarios. The tools are a key resource for participants to improve their model's ability to generalize.

There are two key parts of the Falcon ecosystem:

- **Falcon (the core software):** This is described as a digital twin simulation software used to generate the synthetic soup can images for the competition's training and validation datasets. It creates a controlled virtual environment where data can be generated easily and at scale.
- **Falcon Editor:** This is a more advanced tool that allows users to create their own unique supplemental datasets. It enables the user to customize scene parameters such as lighting, occlusions, camera angles, and backgrounds to simulate complex real-world scenarios and create targeted data to improve model performance on "edge cases" or difficult situations.

The competition encourages participants to use these tools to bridge the "Sim2Real gap," which refers to the challenge of a model generalizing from a virtual, simulated environment to the unpredictable complexity of the real world.

**Albumentations**

Albumentations is a Python library for fast and flexible image augmentations that can be used for deep learning models, particularly in computer vision tasks. It's designed to be highly performant and offers a wide range of transformation techniques to artificially increase the size and diversity of a training dataset.

**Key Features and Purpose**

- **Data Augmentation:** Its primary function is to apply various transformations to images, such as resizing, flipping, rotating, and adjusting brightness or contrast. This helps a model generalize better to unseen data and become more robust to real-world variations.
- **Performance:** The library is optimized for speed, which is crucial for training large-scale deep learning models efficiently.
- **Flexibility:** It can handle various data types beyond just images, including bounding boxes, segmentation masks, and keypoints, ensuring that all annotations are correctly transformed along with the image. This is particularly useful in object detection and semantic segmentation tasks.

**Why It's Used in Deep Learning**

In deep learning, models can easily overfit to the training data. Data augmentation with Albumentations helps prevent this by creating new, slightly modified versions of existing images. This process makes the model less sensitive to specific features of the training set and more capable of recognizing objects under different conditions, a concept known as regularization.

**Albumentations Integration**

Albumentations is integrated into the code to perform data augmentation for the Faster R-CNN model. It's used to apply various image transformations to the training and validation datasets, which helps the model generalize better.

### [Insapiration Notebook](https://www.kaggle.com/code/mohanapavanbezawada/synthetic-2-real-object-detection)

# Pipeline Overview

**Pipeline Overview and Theoretical Foundation**

This pipeline implements a robust object detection pipeline for the Synthetic-to-Real Object Detection Challenge. The core strategy is to train two distinct deep learning models—YOLOv8 and Faster R-CNN—on synthetic data and then combine their predictions on real-world test images using an ensembling technique called Weighted Boxes Fusion (WBF). The final, fused predictions are then formatted for submission.

This approach is designed to improve overall accuracy and generalization, a common strategy in machine learning to mitigate the weaknesses of a single model by leveraging the strengths of multiple models.

**How the Pipeline Works?**

The pipeline is structured in four main phases: data preparation, model training, prediction, and ensembling.

**1. Data Preparation and Configuration**

- **Data Paths and data.yaml:** The code first defines the file paths for the synthetic training, validation, and real-world test data. It creates a data.yaml file, a configuration standard for YOLOv8, which specifies the data locations and the single class name ('object').
- **Custom Dataset:** A SoupCanDataset class is created to handle the data for the Faster R-CNN model. It reads images and their corresponding YOLO-formatted labels, converting the normalized coordinates into a format (Pascal VOC) compatible with Faster R-CNN.
- **Data Augmentation:** The code uses the Albumentations library to apply data augmentation techniques like horizontal flipping, resizing, and color jitter to the training data. This process helps the models generalize better to unseen variations in the real-world test images.

**2. Model Training**

- **YOLOv8 Training:** The pipeline trains a YOLOv8x model, which is pre-trained on a large dataset like COCO. It's fine-tuned on the provided synthetic training data for a specified number of epochs. The training process uses a cosine learning rate scheduler to smoothly decrease the learning rate over time and an SGD optimizer.
- **Faster R-CNN Training:** Simultaneously, a Faster R-CNN model with a ResNet50-FPN backbone is loaded. This model is also pre-trained, and its final layer is replaced to adapt to the single-object "soup can" class. It's trained on the same data with a similar setup to the YOLOv8 model, including early stopping to prevent overfitting.

**3. Prediction and Ensembling**

- **Inference:** After training, both the YOLOv8 and Faster R-CNN models are loaded. They are then used to predict bounding boxes and confidence scores on the real-world test images.
- **Weighted Boxes Fusion (WBF):** This is the crucial step where the predictions from the two models are combined. The code collects all bounding boxes from both models for each image. It assigns weights to each model's predictions (1.0 for YOLOv8 and 0.8 for Faster R-CNN, indicating a slightly higher trust in YOLOv8's predictions). WBF then merges overlapping boxes from different models into a single, refined bounding box, using a weighted average of their coordinates and confidence scores. This ensemble method is more effective than traditional methods like Non-Maximum Suppression (NMS) at preserving all true positives.
- **Final Output:** The fused bounding boxes are then converted back into the YOLO format (class confidence x_center y_center width height) and saved to a .txt file for each image.

**4. Submission**

- **CSV Conversion:** Finally, the script iterates through all the prediction .txt files and compiles them into the required submission.csv format, including a "no boxes" entry for any images without detections.

**Theoretical Background**

**Object Detection Models: YOLO and Faster R-CNN**

Both YOLO (You Only Look Once) and Faster R-CNN are seminal architectures in object detection, but they approach the problem differently.

YOLOv8 is a one-stage detector. It treats object detection as a single regression problem, simultaneously predicting bounding boxes and class probabilities directly from an image. This makes it incredibly fast and efficient, ideal for real-time applications.

Faster R-CNN is a two-stage detector. The first stage, the Region Proposal Network (RPN), proposes potential object regions. The second stage then classifies these proposals and refines their bounding boxes. This multi-step process makes Faster R-CNN more accurate, though generally slower, than single-stage models.

**The Ensemble Strategy:** 

By training both a fast, efficient model (YOLOv8) and a slower, more accurate model (Faster R-CNN), the pipeline aims to combine their strengths. YOLOv8 might capture objects missed by Faster R-CNN, and Faster R-CNN might provide more precise bounding boxes. The ensemble of these models often leads to better performance than either model could achieve alone.

**Weighted Boxes Fusion (WBF)**

WBF is an advanced ensembling technique that addresses a key limitation of standard methods like NMS.

- **The Problem with NMS:** Non-Maximum Suppression (NMS) is a post-processing algorithm that suppresses duplicate bounding boxes. It selects the box with the highest confidence and discards other boxes that significantly overlap with it. However, if multiple high-confidence models predict slightly different, but correct, boxes, NMS might discard valuable predictions.

- **The Solution of WBF:** WBF is an alternative to NMS. Instead of eliminating boxes, it clusters and fuses them. It calculates a weighted average of the coordinates and confidence scores of all overlapping boxes to create a single, more refined bounding box. This approach is more robust because it leverages consensus among models, and it can even improve the accuracy of the final bounding box. The weights assigned to each model's predictions allow you to specify which model's output you trust more. This is particularly useful in this pipeline, where you might have a hypothesis about one model's performance over the other.

# Install necessary packages and import necessary libraries

In [1]:
# Example: Download YOLOv8 weights (adjust for YOLOv11 if available)
!wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt -O /kaggle/working/yolo8x.pt
!wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov11x.pt -O /kaggle/working/yolo11x.pt

# Install necessary packages
!pip install ultralytics
!pip install ensemble_boxes
!pip install -U ultralytics
!pip install optuna

# Cache clearing
!pip install --no-cache-dir torch torchvision
!pip install --no-cache-dir ultralytics
!pip install --no-cache-dir albumentations
!pip install --no-cache-dir ensemble-boxes
!pip install --no-cache-dir pycocotools

--2025-09-16 11:21:39--  https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8x.pt
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://release-assets.githubusercontent.com/github-production-release-asset/521807533/c13b916e-8b1f-47ab-a613-7022adfa73c6?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-09-16T12%3A19%3A47Z&rscd=attachment%3B+filename%3Dyolov8x.pt&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-09-16T11%3A19%3A00Z&ske=2025-09-16T12%3A19%3A47Z&sks=b&skv=2018-11-09&sig=bqvZ9pWDIKXvbp8RJiSD2Ci50iwKZ0HEJ0MFExh4y2M%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1ODAyMjAwMCwibmJmIjoxNzU4MDIxNzAwLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ultralytics import YOLO
from pathlib import Path
import csv
import os
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torch.utils.data import Dataset, DataLoader
import csv
from PIL import Image
import albumentations as A
from albumentations.pytorch import ToTensorV2
from ultralytics import YOLO
from ensemble_boxes import weighted_boxes_fusion
from tqdm import tqdm
import warnings
import torch
import numpy as np
from PIL import Image
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from ultralytics import YOLO
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.ops import box_iou, nms
from tqdm import tqdm
import torchvision.transforms.v2 as T
import pandas as pd
import csv
from ensemble_boxes import weighted_boxes_fusion
from torchvision.ops import box_iou, nms
import optuna

warnings.filterwarnings('ignore')

# Verify installations
try:
    from ultralytics import YOLO
    print("[notice] All dependencies installed successfully")
    print(f"[notice] PyTorch version: {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
except ImportError as e:
    print(f"[error] Failed to import dependencies: {e}")
    print("[error] Please restart the kernel and re-run the script.")
    raise

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

[notice] All dependencies installed successfully
[notice] PyTorch version: 2.6.0+cu124, CUDA available: True


# Data Preparation

**Theoretical Concepts**

**Data Augmentation**

Data augmentation is a powerful regularization technique used to increase the diversity of a dataset without collecting new data. By applying various transformations to the training images, the model is exposed to a wider range of scenarios (different lighting, rotations, sizes). This helps the model become more robust and less likely to overfit to the specific examples in the original dataset. For a Sim2Real challenge, data augmentation is critical for bridging the gap between clean synthetic data and messy real-world images. It simulates the noise and variability of the real world, forcing the model to learn more generalizable features.

**Normalization**

Normalization is a preprocessing step where pixel values are scaled to a standard range, typically a mean of 0 and a standard deviation of 1.  This is done to prevent certain features (e.g., bright pixels) from dominating the learning process. It helps to stabilize and speed up the training of neural networks by ensuring that all input features are on a similar scale. The mean and std values used are common to models pre-trained on the ImageNet dataset, a practice that ensures consistency and better performance when using a pre-trained backbone.

**Data Loaders**

A data loader is an iterator in PyTorch that provides an efficient way to load and prepare data in batches. It handles key tasks like shuffling the data, parallelizing data loading using multiple worker processes (num_workers), and batching the data. This is essential for modern deep learning, as models are trained in batches rather than on one image at a time, which is much more computationally efficient. The collate_fn is a crucial part of the data loader for object detection, as it handles the "ragged" data structure where each image can have a different number of objects (and thus a different number of bounding boxes).

## Dataset configuration and Hyperparameters Configuration

**Dataset Configuration and Hyperparameters Configuration:** 

The data.yaml file is created, which defines the paths for the training, validation, and testing images and labels. It also specifies the single class, 'object', to be detected.

**Hyperparameter Configuration**

The following parameters were configured for the training and evaluation of the object detection model:

**Dataset Configuration**

- **data_yaml:** A dictionary-like configuration that defines the dataset's structure.
- **path:** The root directory of the dataset.
- **train:** The relative path to the training images directory.
- **val:** The relative path to the validation images directory, used for tuning hyperparameters and evaluating the model during training.
- **test:** The relative path to the test images directory, used for the final, independent evaluation of the trained model's performance.
- **nc:** The number of classes in the dataset. In this case, 1 indicates a single class.
- **names:** A list of class names, with ['object'] specifying the name for the single class.

**Training Hyperparameters**

- **TRAIN_EPOCHS:** The total number of times the model will iterate through the entire training dataset. An epoch represents one complete pass of the training data through the algorithm.
- **IMG_SIZE:** The size to which all images will be resized before being fed into the model. A size of 640×640 pixels is a standard input dimension for many modern object detection models.
- **PATIENCE:** The number of epochs to wait for the validation loss to improve before stopping the training. This technique, known as early stopping, prevents overfitting and saves computational resources. A patience of 20 means the training will halt if the model's performance on the validation set does not improve for 20 consecutive epochs.
- **BATCH_SIZE:** The number of training examples utilized in one iteration. A batch size of 8 means the model processes 8 images at a time before updating its internal parameters.
- **CONF_THRESHOLD:** The confidence score threshold used to filter object detections. Only bounding boxes with a confidence score greater than 0.25 will be considered as valid detections.
- **IOU_THRESHOLD:** The Intersection over Union (IoU) threshold used for Non-Maximum Suppression (NMS). This value determines how much overlap is allowed between predicted bounding boxes for the same object. An IoU threshold of 0.5 means any predicted bounding box with an IoU of 0.5 or more with a higher-confidence box will be suppressed.
- **device:** Specifies the computing device to be used for training. The code automatically selects a GPU (cuda) if available, otherwise, it defaults to the CPU (cpu) for training.
- **base_path:** The base directory for the dataset, providing the full file path for the model to access the data.

In [3]:
# Dataset configuration
data_yaml = """
path: /kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2
train: train/images
val: val/images
test: testImages/images
nc: 1
names: ['object']
"""

# Hyperparameters Configuration
TRAIN_EPOCHS = 100
IMG_SIZE = 640
PATIENCE = 10
BATCH_SIZE = 8
CONF_THRESHOLD = 0.25
IOU_THRESHOLD = 0.5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
base_path = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2"

**File Saving**

The provided code snippet is responsible for saving the data.yaml string to a file named data.yaml in the /kaggle/working directory. This is a crucial step to make the dataset configuration accessible to the training script, which will typically read its data paths and class information from this file. The os.makedirs('/kaggle/working', exist_ok=True) command ensures that the directory exists before the file is written, preventing a FileNotFoundError.

In [4]:
# Save data.yaml
os.makedirs('/kaggle/working', exist_ok=True)
with open('/kaggle/working/data.yaml', 'w') as file:
    file.write(data_yaml)

## Data Augmentation

**Data Augmentation (Albumentations):**

This step sets up the data processing and loading for an object detection project, using data augmentation to prepare the images for model training. The goal is to get the data from its raw format into a state that is both usable by deep learning models and robust enough to handle real-world variations.

The albumentations library is used to apply various data augmentation techniques. train_transforms includes a series of random changes like horizontal flips, brightness/contrast adjustments, and color jitter to help the model generalize better. val_transforms only includes resizing and normalization to ensure consistent evaluation.

**How this step Works?**

This section of the code focuses on transforming and loading the dataset. It defines a set of image transformations and then creates a dataset and a data loader for both the training and validation sets.

**Image Transformations:** The albumentations library is used to define image transformations.
- **train_transforms:** This is a composition of several transformations applied during training.
  - **A.Resize(IMG_SIZE, IMG_SIZE):** Resizes all images to a consistent 640x640 resolution, which is a requirement for many neural networks.
  - **A.HorizontalFlip(p=0.5):** Randomly flips the image horizontally with a 50% probability, a common technique for data augmentation.
  - **A.RandomBrightnessContrast(p=0.2):** Randomly adjusts the brightness and contrast of the image, simulating different lighting conditions.
  - **A.ColorJitter(p=0.2):** Randomly changes the color properties (e.g., hue, saturation), making the model less sensitive to color variations.
  - **A.Normalize(...):** Normalizes the pixel values using the mean and standard deviation of the ImageNet dataset. This is a standard practice that helps the model converge faster and more effectively.
  - **ToTensorV2():** Converts the image from a NumPy array to a PyTorch tensor.
- **val_transforms:** This is a simpler set of transformations for validation. It only includes Resize and Normalize because the goal of validation is to evaluate the model's performance on unseen data without introducing random variations that could obscure its true performance.

In [5]:
# Data transforms
train_transforms = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.ColorJitter(p=0.2),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2()
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

val_transforms = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2()
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

## Prepared Dataset

**Custom Dataset and Data Loaders**

This section details the implementation of a custom dataset class and data loaders, which are crucial for preparing the data in a format suitable for the object detection model.

**The SoupCanDataset Class**

The SoupCanDataset is a custom class that inherits from PyTorch's torch.utils.data.Dataset. This is a standard practice for handling datasets that are not directly supported by pre-built PyTorch utilities. The class is a crucial bridge between the raw image and label files and the data format required by the training pipeline.

- **Initialization (__init__):** The constructor takes the directories for images and labels as input. It then finds and stores the paths to all image files (.png, .jpg, .jpeg).

- **Length (__len__):** This method returns the total number of images in the dataset, allowing DataLoader to know how many samples to iterate over.

- **Item Retrieval (__getitem__):** This is the most important method. When called with an index idx, it performs the following steps:

  - Loads the image file at the specified index using the Pillow library (Image.open).
  - Reads the corresponding label file. The labels are expected to be in the YOLO format, which uses normalized coordinates (x_center, y_center, width, height).
  - Coordinate Conversion Theory: Object detection models often require bounding box coordinates in a specific format. The SoupCanDataset converts the normalized YOLO coordinates to the absolute pixel coordinates (x1, y1, x2, y2), where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box. This conversion is necessary for compatibility with many object detection frameworks, such as Faster R-CNN, which expect this format.
  - It populates a target dictionary with the bounding boxes, class labels, and other metadata required by the model, such as image_id, area, and iscrowd.
  - It applies any specified data augmentations or transformations (self.transforms) to the image. It also scales the bounding box coordinates to match the new image size if a transformation is applied.

**Data Loaders**

The DataLoader utility from PyTorch is used to create iterable data loaders from the custom SoupCanDataset. Data loaders provide an efficient way to feed data to the training loop by handling batching, shuffling, and multi-processing.

- **collate_fn:** This custom function is a critical part of the data loading process for object detection. Because each image can have a variable number of bounding boxes, a standard DataLoader cannot create a single tensor from a batch of images and their targets. The collate_fn function takes a list of samples and combines them into a single batch, ensuring that the bounding boxes and labels are handled correctly, preventing errors during training.
- **train_loader:** This loader is configured with shuffle=True to randomize the order of the training data in each epoch. Shuffling is a best practice to prevent the model from learning the order of the data, which could lead to overfitting.
- **val_loader:** The validation loader is set with shuffle=False because the order of evaluation data does not affect the model's performance metrics.
- **num_workers:** This parameter enables multi-process data loading, which helps to speed up the training process by loading data in the background while the GPU is busy with computations.

In [6]:
# --- Custom Dataset ---
class SoupCanDataset(Dataset):
    def __init__(self, image_dir, label_dir, transforms=None):
        self.image_dir = Path(image_dir)
        self.label_dir = Path(label_dir)
        self.transforms = transforms
        self.images = [p for p in self.image_dir.glob("*") if p.suffix.lower() in ['.png', '.jpg', '.jpeg']]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        label_path = self.label_dir / f"{img_path.stem}.txt"

        img = Image.open(img_path).convert("RGB")
        img_width, img_height = img.size

        boxes = []
        labels = []
        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    try:
                        cls_id, x_center, y_center, width, height = map(float, line.strip().split())
                        x1 = (x_center - width / 2) * img_width
                        y1 = (y_center - height / 2) * img_height
                        x2 = (x_center + width / 2) * img_width
                        y2 = (y_center + height / 2) * img_height
                        boxes.append([x1, y1, x2, y2])
                        labels.append(int(cls_id))
                    except ValueError:
                        print(f"[warning] Invalid label in {label_path}: {line.strip()}")

        boxes = torch.tensor(boxes, dtype=torch.float32) if boxes else torch.empty((0, 4), dtype=torch.float32)
        labels = torch.tensor(labels, dtype=torch.int64) if labels else torch.empty((0,), dtype=torch.int64)

        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]) if len(boxes) > 0 else torch.empty((0,)),
            "iscrowd": torch.zeros((len(boxes),), dtype=torch.int64)
        }

        if self.transforms:
            img = self.transforms(img)
            if len(boxes) > 0:
                scale = IMG_SIZE / max(img_width, img_height)
                boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale
                boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale
                target["boxes"] = boxes.clamp(min=0, max=IMG_SIZE-1)

        return img, target

def collate_fn(batch):
    return tuple(zip(*batch))

# Create datasets and dataloaders
train_dataset = SoupCanDataset(f"{base_path}/train/images", f"{base_path}/train/labels", transforms=train_transforms)
val_dataset = SoupCanDataset(f"{base_path}/val/images", f"{base_path}/val/labels", transforms=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, collate_fn=collate_fn)

# Model Training

This step trains and combines the predictions of two powerful models, YOLOv8 and Faster R-CNN, to achieve more robust results. This process is particularly effective for bridging the gap between synthetic and real-world data, as each model can learn different features from the training set.

**Theoretical Concepts**

- **YOLO (You Only Look Once):** A single-stage object detector that frames object detection as a single regression problem. It predicts bounding boxes and class probabilities simultaneously, making it exceptionally fast.
- **Faster R-CNN (ResNet50):** A two-stage object detector. The first stage, a Region Proposal Network (RPN), proposes regions of interest, and the second stage refines these proposals into final detections. This two-step process generally yields higher localization accuracy than single-stage models.
- **Model Ensemble:** The practice of combining multiple machine learning models to improve overall performance. Since YOLOv8 and Faster R-CNN have different architectural strengths (speed vs. accuracy), an ensemble can leverage the best of both worlds.
- **Transfer Learning:** The use of a pre-trained model as a starting point. This is crucial when the target dataset is small or a significant domain gap exists (as in Sim2Real challenges).
- **Weighted Boxes Fusion (WBF):** A sophisticated ensemble technique for bounding box predictions. Unlike NMS, which simply discards redundant boxes, WBF merges overlapping boxes, preserving valuable information from multiple models to generate a more accurate final prediction.

**Model Training Overview**

This phase focuses on fine-tuning two distinct object detection models.

**YOLOv8 Training:**

- **Initialization and Model Loading:** The pipeline begins by loading pre-trained versions of the YOLOv8 models using the file paths "yolov8x.pt" and "yolov8n.pt". The 'x' variant is the largest and most computationally intensive, while 'n' is a smaller, more compact variant. This approach leverages transfer learning - a fundamental concept in deep learning where a model pre-trained on a massive dataset (like COCO) is used as a starting point. This significantly accelerates training and enhances performance on the new custom dataset.
- **Taining Execution:** The core of the code is the .train() function call, which manages the entire training process. It takes numerous arguments, known as hyperparameters, that control the training procedure, from the number of training cycles to various data augmentation techniques. The models are trained on the data specified by the data parameter.

**Faster R-CNN Training:**

- **Model Loading:** The code loads a pre-trained Faster R-CNN with a ResNet50-FPN backbone. This is a classic two-stage architecture known for its high accuracy.
- **Head Replacement:** The final classification head of the model is replaced with a new one that is specifically configured for the single object class in the dataset (plus a background class). This is a standard practice in transfer learning to adapt a pre-trained model for a new task.
- **Custom Training Loop:** Unlike YOLOv8, Faster R-CNN is trained using a manual, explicit training loop. This loop iterates through epochs, calculates the combined loss (from box regression and classification), and updates the model's weights using an optimizer and a learning rate scheduler.
- **Early Stopping:** The loop incorporates a crucial early stopping mechanism. It monitors the model's performance on a validation set and saves the best-performing weights. If the model's performance doesn't improve for a certain number of epochs (PATIENCE), training is halted to prevent overfitting.

**Ensemble Inference and Validation**

The final stage involves combining the trained models to create a more robust ensemble system.

- **Ensemble Inference:** The run_ensemble_inference function takes an image, runs it through the trained YOLOv8x, YOLOv8n, and Faster R-CNN models, and combines their predicted bounding boxes.
- **Non-Maximum Suppression (NMS):** After combining the predictions from all three models, NMS is applied to filter out redundant and overlapping bounding boxes. This process ensures that for each object, only the most confident and representative bounding box remains.
- **Validation:** The validate_ensemble function evaluates the combined model's performance on the validation dataset. It calculates two critical metrics:
  - **Precision:** Measures the accuracy of the positive predictions. It is the ratio of true positives to the total number of predicted positives (TP / (TP + FP)).
  - **Recall:** Measures the model's ability to find all relevant instances. It is the ratio of true positives to the total number of actual positives (TP / (TP + FN)).

## Fine-Tune YOLOv8 (YOLO8n and YOLO8x)

**YOLOv8n (Nano)**

YOLOv8n is the smallest and fastest model in the YOLOv8 series. The 'n' stands for 'nano,' reflecting its compact size. It has the fewest parameters and the lowest computational requirements.

- **Size:** It has a significantly smaller number of parameters, making its model file size very small. This is ideal for deployment on devices with limited memory, such as mobile phones, embedded systems, and drones.
- **Speed:** Due to its smaller size, it performs inference much faster. This makes it a great choice for real-time applications where a high frame rate is critical, even if it means sacrificing some accuracy.
- **Accuracy:** It is the least accurate of the YOLOv8 models. While still a powerful object detector, it may not perform as well on complex scenes or with small, hard-to-detect objects compared to its larger counterparts.

**YOLOv8x (XLarge)**

YOLOv8x is the largest and most accurate model in the YOLOv8 family. The 'x' stands for 'extra-large.' It has the highest number of parameters, making it computationally intensive.

- **Size:** It has a large number of parameters, resulting in a much larger model file. This requires more storage and memory on the device where it's deployed.
- **Speed:** It is the slowest of the YOLOv8 models. Its size and complexity mean that it takes more time to process each image, which can be a limiting factor for real-time applications.
- **Accuracy:** It offers the highest accuracy among the YOLOv8 models. Its deeper and wider network architecture allows it to learn more complex features, leading to better performance in detecting objects, especially in challenging conditions.

**Model Fine-Tuning Methodology**

- **Initialization and Error Handling:** The pipeline begins with a notice that YOLOv8 training is starting. A try...except block is used to ensure the program can gracefully handle potential failures during the training process, such as issues with file paths or hardware.
- **Model Loading:** The lines yolo8x_model = YOLO("yolov8x.pt") and yolo8n_model = YOLO("yolov8n.pt") initialize the models. The .pt files refer to pre-trained versions of the YOLOv8 models. The 'x' variant is the largest and most computationally intensive, while 'n' is a smaller, more compact variant. By loading these files, the pipeline is leveraging transfer learning, which means it is starting with a model that has already learned to detect a wide range of objects from a massive dataset (typically COCO). This significantly speeds up training and improves performance on the new dataset.
- **Training Execution:** The core of the code is the .train(...) function call. This method executes the entire training process. It takes numerous arguments, known as hyperparameters, which control the training procedure, from the number of training cycles to various data augmentation techniques. The model is trained on the data specified by the data parameter.

**Hyperparameters:**

The training process is controlled by a set of hyperparameters that define how the model learns.

- **epochs:** The number of times the entire dataset is passed forward and backward through the neural network during training.
- **imgsz:** The size of the input image for the model.
- **patience:** The number of epochs with no improvement in the validation loss after which training will be stopped. This is a form of early stopping.
- **cos_lr:** A boolean that enables a cosine annealing learning rate scheduler.
- **dropout:** The probability of randomly dropping neurons during training to prevent overfitting.
- **mosaic:** The probability of using mosaic data augmentation, which stitches four images together to create a single training image.
- **lr0:** The initial learning rate for the optimizer.
- **optimizer:** The optimization algorithm used to update the model's weights (e.g., SGD, Adam).
- **momentum:** A parameter for SGD that helps accelerate the descent in the right direction.
- **weight_decay:** A regularization term that penalizes large weights, preventing the model from becoming too complex and overfitting.
- **single_cls:** A boolean to indicate if the dataset contains only a single object class.
- **plots:** A boolean to enable the generation of plots of training and validation metrics.
- **cache:** A boolean to cache the dataset in memory or on disk for faster loading during training.
- **flipud:** The probability of performing a random vertical flip on the images for data augmentation.
- **scale:** The probability of performing a random scaling augmentation on the images.
- **name:** The name for the training run, which determines the directory where results are saved.
- **verbose:** A boolean to display detailed training output.

**Theories and Concepts**

**YOLO (You Only Look Once):** YOLO is a prominent single-stage object detection model. Unlike older, two-stage detectors, YOLO processes an entire image in one pass to simultaneously predict all bounding boxes and their class probabilities. This makes it exceptionally fast, suitable for real-time applications. The YOLOv8 architecture improves upon previous versions by incorporating new techniques to enhance accuracy while maintaining high speed.
**Stochastic Gradient Descent (SGD) and Adam:** The optimizer parameter specifies the algorithm used to update the model's weights during training.

- **SGD** works by calculating the gradient of the loss function with respect to the model's weights for a small batch of data. It then updates the weights in the opposite direction of the gradient to minimize the loss.
- **Adam** is a more advanced optimizer that adapts the learning rate for each weight, often leading to faster convergence. The use of different optimizers for the YOLOv8x and YOLOv8n models allows for experimentation to find the best configuration for each model.

**Learning Rate Schedulers:** The cos_lr=True parameter activates a cosine annealing learning rate scheduler. A learning rate is the step size the optimizer takes. This scheduler is a technique that adjusts the learning rate throughout training, starting high to make fast progress and gradually decreasing it in a cosine-like curve. This helps the model to settle into an optimal configuration without getting stuck in local minima.

**Data Augmentation:** Techniques like mosaic, flipud, and scale are forms of data augmentation. These methods artificially expand the diversity of the training dataset by applying random transformations to the images. This prevents the model from overfitting to the specific examples it has seen, making it more robust and better at generalizing to new, unseen data.

**Regularization:** Hyperparameters such as dropout and weight_decay are forms of regularization. They are designed to prevent the model from overfitting.

- **Dropout** randomly "drops" a percentage of neuron connections during training, forcing the network to learn more robust features.
- **Weight Decay** adds a penalty to the loss function for large weights, discouraging the model from relying too heavily on any single feature.

### Fine-Tune YOLO8n

In [7]:
print("[notice] Training YOLO8n...")
try:
    yolo8n_model = YOLO("yolov8n.pt")
    yolo8n_results = yolo8n_model.train(
        data="/kaggle/working/data.yaml",
        epochs=TRAIN_EPOCHS,
        imgsz=IMG_SIZE,
        patience=PATIENCE,
        cos_lr=True,
        dropout=0.2,
        mosaic=0.5,
        lr0=0.001,
        optimizer="Adam",
        momentum=0.9,
        weight_decay=0.0005,
        single_cls=True,
        plots=True,
        cache=True,
        flipud=0.5,
        scale=0.8,
        name="yolo8n_trained",
        verbose=True
    )
except Exception as e:
    print(f"[error] YOLO8n training failed: {e}")
    raise

[notice] Training YOLO8n...
Ultralytics 8.3.200 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla T4, 15095MiB)
[34m[1mengine/trainer: [0magnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=True, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=True, cutmix=0.0, data=/kaggle/working/data.yaml, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.2, dynamic=False, embed=None, epochs=100, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.5, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.001, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolov8n.pt, momentum=0.9, mosaic=0.5, multi_scale=False, name=yolo8n_trained2, nbs=64, nms=False, opset=None, optimize=False, optimizer=Adam, overlap_mask=True, pat

### Fine-Tune YOLO8x

In [8]:
# --- Train YOLO Models ---
print("[notice] Training YOLO8x...")
try:
    yolo8x_model = YOLO("yolov8x.pt")
    yolo8x_results = yolo8x_model.train(
        data="/kaggle/working/data.yaml",
        epochs=TRAIN_EPOCHS,
        imgsz=IMG_SIZE,
        patience=PATIENCE,
        cos_lr=True,
        dropout=0.4,
        mosaic=0.2,
        lr0=0.0001,
        optimizer="SGD",
        momentum=0.975,
        weight_decay=0.0001,
        single_cls=True,
        plots=True,
        cache=True,
        flipud=0.25,
        scale=1.0,
        name="yolo8x_trained",
        verbose=True
    )
except Exception as e:
    print(f"[error] YOLO8x training failed: {e}")
    raise

[notice] Training YOLO8x...
[KDownloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolov8x.pt to 'yolov8x.pt': 100% ━━━━━━━━━━━━ 130.5MB 254.4MB/s 0.5s 0.5s<0.0s
Ultralytics 8.3.200 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla T4, 15095MiB)
[34m[1mengine/trainer: [0magnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=True, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=True, cutmix=0.0, data=/kaggle/working/data.yaml, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.4, dynamic=False, embed=None, epochs=100, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.25, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.0001, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolov

In [None]:
Use

In [9]:
import torch
import numpy as np
from PIL import Image
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from ultralytics import YOLO
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.ops import box_iou, nms
from tqdm import tqdm
import torchvision.transforms.v2 as T
import pandas as pd
import csv
from ensemble_boxes import weighted_boxes_fusion
import os

# Constants
TRAIN_EPOCHS = 1
IMG_SIZE = 640
PATIENCE = 20
BATCH_SIZE = 8
CONF_THRESHOLD = 0.25
IOU_THRESHOLD = 0.5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
base_path = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2"

# --- Data Transformations ---
train_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.RandomHorizontalFlip(p=0.5),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

val_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

# --- Custom Dataset ---
class SoupCanDataset(Dataset):
    def __init__(self, image_dir, label_dir, transforms=None):
        self.image_dir = Path(image_dir)
        self.label_dir = Path(label_dir)
        self.transforms = transforms
        self.images = [p for p in self.image_dir.glob("*") if p.suffix.lower() in ['.png', '.jpg', '.jpeg']]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        label_path = self.label_dir / f"{img_path.stem}.txt"

        img = Image.open(img_path).convert("RGB")
        img_width, img_height = img.size

        boxes = []
        labels = []
        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    try:
                        cls_id, x_center, y_center, width, height = map(float, line.strip().split())
                        x1 = (x_center - width / 2) * img_width
                        y1 = (y_center - height / 2) * img_height
                        x2 = (x_center + width / 2) * img_width
                        y2 = (y_center + height / 2) * img_height
                        boxes.append([x1, y1, x2, y2])
                        labels.append(int(cls_id))
                    except ValueError:
                        print(f"[warning] Invalid label in {label_path}: {line.strip()}")

        boxes = torch.tensor(boxes, dtype=torch.float32) if boxes else torch.empty((0, 4), dtype=torch.float32)
        labels = torch.tensor(labels, dtype=torch.int64) if labels else torch.empty((0,), dtype=torch.int64)

        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]) if len(boxes) > 0 else torch.empty((0,)),
            "iscrowd": torch.zeros((len(boxes),), dtype=torch.int64)
        }

        if self.transforms:
            img = self.transforms(img)
            if len(boxes) > 0:
                scale = IMG_SIZE / max(img_width, img_height)
                boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale
                boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale
                target["boxes"] = boxes.clamp(min=0, max=IMG_SIZE-1)

        return img, target

def collate_fn(batch):
    return tuple(zip(*batch))

# Create datasets and dataloaders
train_dataset = SoupCanDataset(f"{base_path}/train/images", f"{base_path}/train/labels", transforms=train_transforms)
val_dataset = SoupCanDataset(f"{base_path}/val/images", f"{base_path}/val/labels", transforms=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, collate_fn=collate_fn)

# --- Train YOLO Models ---
print("[notice] Training YOLO8x...")
try:
    yolo8x_model = YOLO("yolov8x.pt")
    yolo8x_results = yolo8x_model.train(
        data="/kaggle/working/data.yaml",
        epochs=TRAIN_EPOCHS,
        imgsz=IMG_SIZE,
        patience=PATIENCE,
        cos_lr=True,
        dropout=0.4,
        mosaic=0.2,
        lr0=0.0001,
        optimizer="SGD",
        momentum=0.975,
        weight_decay=0.0001,
        single_cls=True,
        plots=True,
        cache=True,
        flipud=0.25,
        scale=1.0,
        name="yolo8x_trained",
        verbose=True
    )
except Exception as e:
    print(f"[error] YOLO8x training failed: {e}")
    raise

print("[notice] Training YOLO8n...")
try:
    yolo8n_model = YOLO("yolov8n.pt")
    yolo8n_results = yolo8n_model.train(
        data="/kaggle/working/data.yaml",
        epochs=TRAIN_EPOCHS,
        imgsz=IMG_SIZE,
        patience=PATIENCE,
        cos_lr=True,
        dropout=0.2,
        mosaic=0.5,
        lr0=0.001,
        optimizer="Adam",
        momentum=0.9,
        weight_decay=0.0005,
        single_cls=True,
        plots=True,
        cache=True,
        flipud=0.5,
        scale=0.8,
        name="yolo8n_trained",
        verbose=True
    )
except Exception as e:
    print(f"[error] YOLO8n training failed: {e}")
    raise

# --- Train Faster R-CNN with ResNet50-FPN ---
print("[notice] Training Faster R-CNN with ResNet50-FPN...")
def get_faster_rcnn_model(num_classes):
    model = fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes + 1)  # +1 for background
    return model

faster_rcnn_model = get_faster_rcnn_model(num_classes=1)  # 1 class (object) + background
faster_rcnn_model.to(device)

optimizer = torch.optim.SGD(faster_rcnn_model.parameters(), lr=0.0001, momentum=0.975, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TRAIN_EPOCHS)

def train_faster_rcnn(model, train_loader, val_loader, epochs, patience):
    best_loss = float('inf')
    patience_counter = 0
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for images, targets in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
            train_loss += losses.item()
        
        train_loss /= len(train_loader)
        
        # Validation with loss computation in training mode
        model.train()  # Temporarily set to train mode for loss
        val_loss = 0
        with torch.no_grad():
            for images, targets in val_loader:
                images = list(image.to(device) for image in images)
                targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
                loss_dict = model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                val_loss += losses.item()
        
        val_loss /= len(val_loader)
        print(f"[notice] Epoch {epoch+1}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")
        
        if val_loss < best_loss:
            best_loss = val_loss
            torch.save(model.state_dict(), "/kaggle/working/faster_rcnn_best.pt")
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("[notice] Early stopping triggered")
                break
        
        scheduler.step()

try:
    train_faster_rcnn(faster_rcnn_model, train_loader, val_loader, TRAIN_EPOCHS, PATIENCE)
except Exception as e:
    print(f"[error] Faster R-CNN training failed: {e}")
    raise

# --- Load and Prepare Models for Ensemble ---
print("[notice] Loading trained models for ensemble...")
yolo8x_model = YOLO("/kaggle/working/runs/detect/yolo8x_trained2/weights/best.pt")  # Fixed path
yolo8n_model = YOLO("/kaggle/working/runs/detect/yolo8n_trained/weights/best.pt")
faster_rcnn_model.load_state_dict(torch.load("/kaggle/working/faster_rcnn_best.pt"))
faster_rcnn_model.to(device)
faster_rcnn_model.eval()

# --- Ensemble Inference ---
test_images_path = f"{base_path}/testImages/images"
output_dir = "/kaggle/working/predictions/labels"
os.makedirs(output_dir, exist_ok=True)

test_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

for img_path in tqdm(list(Path(test_images_path).glob("*")), desc="Predicting"):
    if img_path.suffix.lower() not in ['.png', '.jpg', '.jpeg']:
        continue
    
    img_name = img_path.stem
    img = Image.open(img_path).convert("RGB")
    img_width, img_height = img.size
    
    # Preprocess image
    img_tensor = test_transforms(img).unsqueeze(0).to(device)  # Add batch dimension
    
    # YOLOv8x predictions
    yolo8x_boxes = []
    yolo8x_scores = []
    yolo8x_labels = []
    try:
        yolo8x_results = yolo8x_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8x_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        # Normalize to [0, 1] relative to original image size
                        yolo8x_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8x_scores.append(conf)
                        yolo8x_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8x prediction failed for {img_name}: {e}")
    
    # YOLOv8n predictions
    yolo8n_boxes = []
    yolo8n_scores = []
    yolo8n_labels = []
    try:
        yolo8n_results = yolo8n_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8n_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        yolo8n_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8n_scores.append(conf)
                        yolo8n_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8n prediction failed for {img_name}: {e}")
    
    # Faster R-CNN predictions
    frcnn_boxes = []
    frcnn_scores = []
    frcnn_labels = []
    try:
        with torch.no_grad():
            predictions = faster_rcnn_model([img_tensor[0]])[0]
            for box, score, label in zip(predictions['boxes'], predictions['scores'], predictions['labels']):
                if score >= CONF_THRESHOLD and label == 1:  # Class 1 is object
                    x1, y1, x2, y2 = box.tolist()
                    # Scale boxes to original image size and normalize
                    scale_x = img_width / IMG_SIZE
                    scale_y = img_height / IMG_SIZE
                    x1, x2 = x1 * scale_x / img_width, x2 * scale_x / img_width
                    y1, y2 = y1 * scale_y / img_height, y2 * scale_y / img_height
                    frcnn_boxes.append([x1, y1, x2, y2])
                    frcnn_scores.append(score.item())
                    frcnn_labels.append(0)  # Map to class 0 for consistency
    except Exception as e:
        print(f"[warning] Faster R-CNN prediction failed for {img_name}: {e}")
    
    # Ensemble using Weighted Boxes Fusion
    boxes_list = [yolo8x_boxes, yolo8n_boxes, frcnn_boxes]
    scores_list = [yolo8x_scores, yolo8n_scores, frcnn_scores]
    labels_list = [yolo8x_labels, yolo8n_labels, frcnn_labels]
    weights = [1.0, 0.9, 0.8]  # YOLOv8x > YOLOv8n > Faster R-CNN
    
    try:
        fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
            boxes_list,
            scores_list,
            labels_list,
            weights=weights,
            iou_thr=IOU_THRESHOLD,
            skip_box_thr=CONF_THRESHOLD
        )
    except Exception as e:
        print(f"[warning] WBF failed for {img_name}: {e}")
        fused_boxes, fused_scores, fused_labels = [], [], []
    
    # Save ensemble predictions in YOLO format
    output_txt = Path(output_dir) / f"{img_name}.txt"
    with open(output_txt, "w") as f:
        for box, score, label in zip(fused_boxes, fused_scores, fused_labels):
            x1, y1, x2, y2 = box
            x_center = (x1 + x2) / 2
            y_center = (y1 + y2) / 2
            width = x2 - x1
            height = y2 - y1
            f.write(f"{int(label)} {score:.6f} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n")

print(f"[notice] All ensemble detections saved to: {output_dir}")

# --- Convert Predictions to Submission CSV ---
def predictions_to_csv(
    preds_folder: str = "/kaggle/working/predictions/labels",
    output_csv: str = "/kaggle/working/submission.csv",
    test_images_folder: str = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2/testImages/images",
    allowed_extensions: tuple = (".jpg", ".png", ".jpeg")
):
    preds_path = Path(preds_folder)
    test_images_path = Path(test_images_folder)
    
    test_images = {p.stem for p in test_images_path.glob("*") if p.suffix.lower() in allowed_extensions}
    predictions = []
    predicted_images = set()
    
    for txt_file in preds_path.glob("*.txt"):
        image_id = txt_file.stem
        predicted_images.add(image_id)
        with open(txt_file, "r") as f:
            valid_lines = [line.strip() for line in f if len(line.strip().split()) == 6]
        pred_str = " ".join(valid_lines) if valid_lines else "no boxes"
        predictions.append({"image_id": image_id, "prediction_string": pred_str})
    
    missing_images = test_images - predicted_images
    for image_id in missing_images:
        predictions.append({"image_id": image_id, "prediction_string": "no boxes"})
    
    submission_df = pd.DataFrame(predictions)
    submission_df.to_csv(output_csv, index=False, quoting=csv.QUOTE_MINIMAL)
    print(f"[notice] Submission saved to {output_csv}")

predictions_to_csv()

# --- Updated Ensemble Inference and Validation ---
@torch.no_grad()
def run_ensemble_inference(image_tensor, conf_thres=0.25, iou_thres=0.5):
    """
    Runs inference with YOLO and Faster R-CNN models and combines their results.
    Accepts a PyTorch tensor with a batch dimension.
    """
    # YOLOv8x inference
    yolo8x_results = yolo8x_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8x_results[0].boxes.data) == 0:
        yolo8x_preds = torch.empty((0, 6), device=device)
    else:
        yolo8x_preds = yolo8x_results[0].boxes.data.to(device)

    # YOLOv8n inference
    yolo8n_results = yolo8n_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8n_results[0].boxes.data) == 0:
        yolo8n_preds = torch.empty((0, 6), device=device)
    else:
        yolo8n_preds = yolo8n_results[0].boxes.data.to(device)

    # Faster R-CNN inference
    faster_rcnn_image = image_tensor.to(device)  # Move to GPU directly
    faster_rcnn_results = faster_rcnn_model([faster_rcnn_image[0]])  # Pass as list, single image per call
    if len(faster_rcnn_results[0]['boxes']) == 0:
        faster_rcnn_preds = torch.empty((0, 6), device=device)
    else:
        boxes = faster_rcnn_results[0]['boxes'].to(device)
        scores = faster_rcnn_results[0]['scores'].to(device)
        labels = faster_rcnn_results[0]['labels'].to(device)
        mask = (labels == 1)  # Filter for class 1 (object)
        faster_rcnn_preds = torch.cat((boxes[mask], scores[mask].unsqueeze(1), torch.zeros_like(scores[mask].unsqueeze(1))), dim=1)

    # Combine detections
    combined_detections = torch.cat((yolo8x_preds, yolo8n_preds, faster_rcnn_preds), dim=0)

    if combined_detections.shape[0] == 0:
        return torch.empty((0, 4), device=device), torch.empty((0,), device=device), torch.empty((0,), device=device)

    # Apply NMS
    combined_boxes = combined_detections[:, :4]
    combined_scores = combined_detections[:, 4]
    combined_classes = combined_detections[:, 5]

    keep_indices = nms(combined_boxes, combined_scores, iou_thres)
    
    final_boxes = combined_boxes[keep_indices]
    final_scores = combined_scores[keep_indices]
    final_classes = combined_classes[keep_indices]

    # Scale boxes to match IMG_SIZE
    orig_shape = yolo8x_results[0].orig_shape
    scale_x = IMG_SIZE / orig_shape[1]
    scale_y = IMG_SIZE / orig_shape[0]
    final_boxes[:, [0, 2]] *= scale_x
    final_boxes[:, [1, 3]] *= scale_y
    final_boxes = final_boxes.clamp(min=0, max=IMG_SIZE-1)

    return final_boxes, final_scores, final_classes

@torch.no_grad()
def validate_ensemble(val_loader):
    """
    Validates the ensemble model on the validation dataset.
    """
    print("[notice] Validating ensemble model...")
    all_metrics = []

    for images, targets in tqdm(val_loader, desc="Validating Ensemble"):
        images = torch.stack(images).to(device)
        batch_size = len(images)

        final_boxes_batch = []
        final_scores_batch = []
        final_classes_batch = []
        for i in range(batch_size):
            final_boxes, final_scores, final_classes = run_ensemble_inference(images[i].unsqueeze(0))
            final_boxes_batch.append(final_boxes)
            final_scores_batch.append(final_scores)
            final_classes_batch.append(final_classes)

        for i in range(batch_size):
            gt_boxes = targets[i]['boxes'].to(device)
            if len(gt_boxes) == 0 or len(final_boxes_batch[i]) == 0:
                all_metrics.append({'precision': 0, 'recall': 0})
                continue

            iou_matrix = box_iou(gt_boxes, final_boxes_batch[i])
            detected_count = 0

            for gt_idx in range(len(gt_boxes)):
                if torch.max(iou_matrix[gt_idx]) >= 0.5:
                    detected_count += 1

            true_positives = detected_count
            predicted_positives = len(final_boxes_batch[i])
            actual_positives = len(gt_boxes)

            precision = true_positives / predicted_positives if predicted_positives > 0 else 0
            recall = true_positives / actual_positives if actual_positives > 0 else 0
            all_metrics.append({'precision': precision, 'recall': recall})

    if all_metrics:
        avg_precision = np.mean([m['precision'] for m in all_metrics])
        avg_recall = np.mean([m['recall'] for m in all_metrics])
        print(f"Ensemble Validation Results:")
        print(f"Average Precision: {avg_precision:.4f}")
        print(f"Average Recall: {avg_recall:.4f}")
    else:
        print("No detections found. Validation metrics not calculated.")

# Run the validation on the ensembled models
validate_ensemble(val_loader)

[notice] Training YOLO8x...
Ultralytics 8.3.199 🚀 Python-3.11.13 torch-2.6.0+cu124 CUDA:0 (Tesla T4, 15095MiB)
[34m[1mengine/trainer: [0magnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=True, cfg=None, classes=None, close_mosaic=10, cls=0.5, compile=False, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=True, cutmix=0.0, data=/kaggle/working/data.yaml, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.4, dynamic=False, embed=None, epochs=1, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.25, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.0001, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolov8x.pt, momentum=0.975, mosaic=0.2, multi_scale=False, name=yolo8x_trained3, nbs=64, nms=False, opset=None, optimize=False, optimizer=SGD, overlap_mask=True, pa

Epoch 1/1: 100%|██████████| 26/26 [00:47<00:00,  1.82s/it]


[notice] Epoch 1: Train Loss = 0.3144, Val Loss = 0.1110
[notice] Loading trained models for ensemble...


Predicting: 100%|██████████| 159/159 [01:48<00:00,  1.46it/s]


[notice] All ensemble detections saved to: /kaggle/working/predictions/labels
[notice] Submission saved to /kaggle/working/submission.csv
[notice] Validating ensemble model...


Validating Ensemble: 100%|██████████| 21/21 [00:31<00:00,  1.52s/it]

Ensemble Validation Results:
Average Precision: 0.0462
Average Recall: 0.1043





In [14]:
import torch
import numpy as np
from PIL import Image
from pathlib import Path
from ultralytics import YOLO
import torchvision.transforms.v2 as T
from tqdm import tqdm
import os
import pandas as pd
import csv
from ensemble_boxes import weighted_boxes_fusion

# Constants
IMG_SIZE = 640  # Same as training
CONF_THRESHOLD = 0.25  # Adjust as needed
IOU_THRESHOLD = 0.5  # For WBF
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
base_path = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2"

# Load trained models
yolo8x_model = YOLO("/kaggle/working/runs/detect/yolo8x_trained3/weights/best.pt")
yolo8n_model = YOLO("/kaggle/working/runs/detect/yolo8n_trained2/weights/best.pt")
faster_rcnn_model = fasterrcnn_resnet50_fpn(pretrained=False)
in_features = faster_rcnn_model.roi_heads.box_predictor.cls_score.in_features
faster_rcnn_model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)  # 1 class + background
faster_rcnn_model.load_state_dict(torch.load("/kaggle/working/faster_rcnn_best.pt"))
faster_rcnn_model.eval()
faster_rcnn_model.to(device)

# Test transformations (aligned with training)
test_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

# Prediction and ensemble
test_images_path = f"{base_path}/testImages/images"
output_dir = "/kaggle/working/predictions/labels"
os.makedirs(output_dir, exist_ok=True)

for img_path in tqdm(list(Path(test_images_path).glob("*")), desc="Predicting"):
    if img_path.suffix.lower() not in ['.png', '.jpg', '.jpeg']:
        continue
    
    img_name = img_path.stem
    img = Image.open(img_path).convert("RGB")
    img_width, img_height = img.size
    
    # Preprocess image
    img_tensor = test_transforms(img).unsqueeze(0).to(device)  # Add batch dimension
    
    # YOLOv8x predictions
    yolo8x_boxes = []
    yolo8x_scores = []
    yolo8x_labels = []
    try:
        yolo8x_results = yolo8x_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8x_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        # Normalize to [0, 1] relative to original image size
                        yolo8x_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8x_scores.append(conf)
                        yolo8x_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8x prediction failed for {img_name}: {e}")
    
    # YOLOv8n predictions
    yolo8n_boxes = []
    yolo8n_scores = []
    yolo8n_labels = []
    try:
        yolo8n_results = yolo8n_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8n_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        yolo8n_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8n_scores.append(conf)
                        yolo8n_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8n prediction failed for {img_name}: {e}")
    
    # Faster R-CNN predictions
    frcnn_boxes = []
    frcnn_scores = []
    frcnn_labels = []
    try:
        with torch.no_grad():
            predictions = faster_rcnn_model([img_tensor[0]])[0]
            for box, score, label in zip(predictions['boxes'], predictions['scores'], predictions['labels']):
                if score >= CONF_THRESHOLD and label == 1:  # Class 1 is object
                    x1, y1, x2, y2 = box.tolist()
                    # Scale boxes to original image size and normalize
                    scale_x = img_width / IMG_SIZE
                    scale_y = img_height / IMG_SIZE
                    x1, x2 = x1 * scale_x / img_width, x2 * scale_x / img_width
                    y1, y2 = y1 * scale_y / img_height, y2 * scale_y / img_height
                    frcnn_boxes.append([x1, y1, x2, y2])
                    frcnn_scores.append(score.item())
                    frcnn_labels.append(0)  # Map to class 0 for consistency
    except Exception as e:
        print(f"[warning] Faster R-CNN prediction failed for {img_name}: {e}")
    
    # Ensemble using Weighted Boxes Fusion
    boxes_list = [yolo8x_boxes, yolo8n_boxes, frcnn_boxes]
    scores_list = [yolo8x_scores, yolo8n_scores, frcnn_scores]
    labels_list = [yolo8x_labels, yolo8n_labels, frcnn_labels]
    weights = [1.0, 0.9, 0.8]  # YOLOv8x > YOLOv8n > Faster R-CNN
    
    try:
        fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
            boxes_list,
            scores_list,
            labels_list,
            weights=weights,
            iou_thr=IOU_THRESHOLD,
            skip_box_thr=CONF_THRESHOLD
        )
    except Exception as e:
        print(f"[warning] WBF failed for {img_name}: {e}")
        fused_boxes, fused_scores, fused_labels = [], [], []
    
    # Save ensemble predictions in YOLO format
    output_txt = Path(output_dir) / f"{img_name}.txt"
    with open(output_txt, "w") as f:
        for box, score, label in zip(fused_boxes, fused_scores, fused_labels):
            x1, y1, x2, y2 = box
            x_center = (x1 + x2) / 2
            y_center = (y1 + y2) / 2
            width = x2 - x1
            height = y2 - y1
            f.write(f"{int(label)} {score:.6f} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n")

print(f"[notice] All ensemble detections saved to: {output_dir}")

# Convert predictions to submission.csv (unchanged from original)
def predictions_to_csv(
    preds_folder: str = "/kaggle/working/predictions/labels",
    output_csv: str = "/kaggle/working/submission.csv",
    test_images_folder: str = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2/testImages/images",
    allowed_extensions: tuple = (".jpg", ".png", ".jpeg")
):
    preds_path = Path(preds_folder)
    test_images_path = Path(test_images_folder)
    
    test_images = {p.stem for p in test_images_path.glob("*") if p.suffix.lower() in allowed_extensions}
    predictions = []
    predicted_images = set()
    
    for txt_file in preds_path.glob("*.txt"):
        image_id = txt_file.stem
        predicted_images.add(image_id)
        with open(txt_file, "r") as f:
            valid_lines = [line.strip() for line in f if len(line.strip().split()) == 6]
        pred_str = " ".join(valid_lines) if valid_lines else "no boxes"
        predictions.append({"image_id": image_id, "prediction_string": pred_str})
    
    missing_images = test_images - predicted_images
    for image_id in missing_images:
        predictions.append({"image_id": image_id, "prediction_string": "no boxes"})
    
    submission_df = pd.DataFrame(predictions)
    submission_df.to_csv(output_csv, index=False, quoting=csv.QUOTE_MINIMAL)
    print(f"[notice] Submission saved to {output_csv}")

predictions_to_csv()

Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 205MB/s]
Predicting: 100%|██████████| 159/159 [01:43<00:00,  1.53it/s]

[notice] All ensemble detections saved to: /kaggle/working/predictions/labels
[notice] Submission saved to /kaggle/working/submission.csv





## Fine-Tune ResNet50-FPN

This process involves fine-tuning a pre-trained Faster R-CNN model on a custom dataset using a manual training loop.

**Model Fine-Tuning Methodology**

- **Initialization and Model Loading:** The code loads a pre-trained Faster R-CNN with a ResNet50-FPN backbone. This is a classic two-stage architecture known for its high accuracy. It also leverages transfer learning by using a model that has already learned general features from a large dataset.
- **Head Replacement:** The final classification head (model.roi_heads.box_predictor) of the pre-trained model is replaced with a new one specifically configured for the single object class in the dataset, plus a background class. This is a crucial step to adapt the model for the new task.
- **Custom Training Loop:** Unlike the built-in .train() method used for YOLO, Faster R-CNN is trained using a manual, explicit training loop. This loop provides more granular control over the training process.
  - **Training Loop:** For each epoch, the model is set to training mode (model.train()). It then iterates through batches of images and their corresponding labels from the train_loader.
  - **Forward Pass:** model(images, targets) performs a forward pass, calculating the predicted bounding boxes and class probabilities. It also returns a dictionary of loss values.
  - **Backward Pass:** losses.backward() computes the gradients of the total loss with respect to the model's weights.
  - **Weight Update:** optimizer.step() uses these gradients to update the model's weights, minimizing the loss.
  - **Validation Loop:** After each training epoch, a validation step is performed. The model is set to evaluation mode (model.eval()) and processes the validation data without updating weights. This provides an unbiased measure of the model's performance on unseen data. The loss is calculated and used to monitor for overfitting.

- **Early Stopping:** The loop incorporates a crucial early stopping mechanism. It monitors the model's performance on a validation set and saves the best-performing weights. If the model's performance doesn't improve for a certain number of epochs (PATIENCE), training is halted to prevent overfitting.

**Hyperparameters:**

The training process is controlled by a set of hyperparameters that define how the model learns.

- **lr:** The initial learning rate for the optimizer, set to 0.0001.
- **momentum:** A parameter for SGD that helps accelerate the descent in the right direction, set to 0.975.
- **weight_decay:** A regularization term that penalizes large weights, set to 0.0001.
- **epochs:** The number of times the entire dataset is passed forward and backward through the neural network during training, controlled by TRAIN_EPOCHS.
- **patience:** The number of epochs with no improvement in the validation loss after which training will be stopped, controlled by PATIENCE.
- **T_max:** A parameter for the cosine annealing learning rate scheduler, set to TRAIN_EPOCHS.

**Theories and Concepts**

- **Faster R-CNN:** This is a two-stage object detection model. The first stage, a Region Proposal Network (RPN), scans the image and proposes regions that likely contain an object. The second stage then takes these proposals, classifies the objects within them, and refines the bounding boxes.
- **Transfer Learning:** The code uses a model pre-trained on a large dataset like COCO. By using these pre-trained weights, the model already has a strong foundation for understanding visual features. Fine-tuning these weights on a new, smaller dataset allows it to quickly adapt and perform well without needing to train from scratch.
- **Loss Functions:** Object detection models use a combination of different losses. Faster R-CNN's total loss is a sum of several components: a classification loss to determine the correct object class, and a localization loss to fine-tune the bounding box coordinates.
- **SGD with Momentum:** SGD is an optimization algorithm that iteratively updates model weights. Momentum adds a fraction of the previous weight update to the current one, helping to accelerate training and overcome local minima in the loss landscape.
- **Learning Rate Scheduler:** The learning rate controls how much the model's weights are adjusted in each step. A scheduler like cosine annealing gradually reduces the learning rate over time. This helps the model make large updates at the beginning of training and smaller, more precise adjustments later, which can lead to better convergence.
- **Early Stopping:** This is a powerful regularization technique used to prevent overfitting. By stopping training when validation performance plateaus, it ensures the model doesn't become too specialized to the training data and maintains its ability to generalize to new, unseen data.

**Preprocess dataset for RestNet50**

In [None]:
# --- Data Transformations ---
train_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.RandomHorizontalFlip(p=0.5),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

val_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

# --- Custom Dataset ---
class SoupCanDataset(Dataset):
    def __init__(self, image_dir, label_dir, transforms=None):
        self.image_dir = Path(image_dir)
        self.label_dir = Path(label_dir)
        self.transforms = transforms
        self.images = [p for p in self.image_dir.glob("*") if p.suffix.lower() in ['.png', '.jpg', '.jpeg']]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        label_path = self.label_dir / f"{img_path.stem}.txt"

        img = Image.open(img_path).convert("RGB")
        img_width, img_height = img.size

        boxes = []
        labels = []
        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    try:
                        cls_id, x_center, y_center, width, height = map(float, line.strip().split())
                        x1 = (x_center - width / 2) * img_width
                        y1 = (y_center - height / 2) * img_height
                        x2 = (x_center + width / 2) * img_width
                        y2 = (y_center + height / 2) * img_height
                        boxes.append([x1, y1, x2, y2])
                        labels.append(int(cls_id))
                    except ValueError:
                        print(f"[warning] Invalid label in {label_path}: {line.strip()}")

        boxes = torch.tensor(boxes, dtype=torch.float32) if boxes else torch.empty((0, 4), dtype=torch.float32)
        labels = torch.tensor(labels, dtype=torch.int64) if labels else torch.empty((0,), dtype=torch.int64)

        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]) if len(boxes) > 0 else torch.empty((0,)),
            "iscrowd": torch.zeros((len(boxes),), dtype=torch.int64)
        }

        if self.transforms:
            img = self.transforms(img)
            if len(boxes) > 0:
                scale = IMG_SIZE / max(img_width, img_height)
                boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale
                boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale
                target["boxes"] = boxes.clamp(min=0, max=IMG_SIZE-1)

        return img, target

def collate_fn(batch):
    return tuple(zip(*batch))

# Create datasets and dataloaders
train_dataset = SoupCanDataset(f"{base_path}/train/images", f"{base_path}/train/labels", transforms=train_transforms)
val_dataset = SoupCanDataset(f"{base_path}/val/images", f"{base_path}/val/labels", transforms=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, collate_fn=collate_fn)

**Fine-Tune RestNet50**

In [12]:
# --- Train Faster R-CNN with ResNet50-FPN ---
print("[notice] Training Faster R-CNN with ResNet50-FPN...")
def get_faster_rcnn_model(num_classes):
    model = fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes + 1)  # +1 for background
    return model

faster_rcnn_model = get_faster_rcnn_model(num_classes=1)  # 1 class (object) + background
faster_rcnn_model.to(device)

optimizer = torch.optim.SGD(faster_rcnn_model.parameters(), lr=0.0001, momentum=0.975, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TRAIN_EPOCHS)

def train_faster_rcnn(model, train_loader, val_loader, epochs, patience):
    best_loss = float('inf')
    patience_counter = 0
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for images, targets in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
            train_loss += losses.item()
        
        train_loss /= len(train_loader)
        
        # Validation with loss computation in training mode
        model.train()  # Temporarily set to train mode for loss
        val_loss = 0
        with torch.no_grad():
            for images, targets in val_loader:
                images = list(image.to(device) for image in images)
                targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
                loss_dict = model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                val_loss += losses.item()
        
        val_loss /= len(val_loader)
        print(f"[notice] Epoch {epoch+1}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")
        
        if val_loss < best_loss:
            best_loss = val_loss
            torch.save(model.state_dict(), "/kaggle/working/faster_rcnn_best.pt")
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("[notice] Early stopping triggered")
                break
        
        scheduler.step()

try:
    train_faster_rcnn(faster_rcnn_model, train_loader, val_loader, TRAIN_EPOCHS, PATIENCE)
except Exception as e:
    print(f"[error] Faster R-CNN training failed: {e}")
    raise

[notice] Training Faster R-CNN with ResNet50-FPN...


Epoch 1/100: 100%|██████████| 26/26 [00:38<00:00,  1.47s/it]


[notice] Epoch 1: Train Loss = 0.2796, Val Loss = 0.1207


Epoch 2/100: 100%|██████████| 26/26 [00:38<00:00,  1.49s/it]


[notice] Epoch 2: Train Loss = 0.0994, Val Loss = 0.0787


Epoch 3/100: 100%|██████████| 26/26 [00:40<00:00,  1.54s/it]


[notice] Epoch 3: Train Loss = 0.0645, Val Loss = 0.0497


Epoch 4/100: 100%|██████████| 26/26 [00:41<00:00,  1.58s/it]


[notice] Epoch 4: Train Loss = 0.0415, Val Loss = 0.0403


Epoch 5/100: 100%|██████████| 26/26 [00:41<00:00,  1.59s/it]


[notice] Epoch 5: Train Loss = 0.0374, Val Loss = 0.0366


Epoch 6/100: 100%|██████████| 26/26 [00:41<00:00,  1.60s/it]


[notice] Epoch 6: Train Loss = 0.0311, Val Loss = 0.0332


Epoch 7/100: 100%|██████████| 26/26 [00:41<00:00,  1.61s/it]


[notice] Epoch 7: Train Loss = 0.0301, Val Loss = 0.0297


Epoch 8/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 8: Train Loss = 0.0288, Val Loss = 0.0280


Epoch 9/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 9: Train Loss = 0.0258, Val Loss = 0.0247


Epoch 10/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 10: Train Loss = 0.0247, Val Loss = 0.0253


Epoch 11/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 11: Train Loss = 0.0238, Val Loss = 0.0236


Epoch 12/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 12: Train Loss = 0.0234, Val Loss = 0.0231


Epoch 13/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 13: Train Loss = 0.0223, Val Loss = 0.0227


Epoch 14/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 14: Train Loss = 0.0228, Val Loss = 0.0218


Epoch 15/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 15: Train Loss = 0.0221, Val Loss = 0.0220


Epoch 16/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 16: Train Loss = 0.0199, Val Loss = 0.0209


Epoch 17/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 17: Train Loss = 0.0188, Val Loss = 0.0208


Epoch 18/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 18: Train Loss = 0.0213, Val Loss = 0.0210


Epoch 19/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 19: Train Loss = 0.0175, Val Loss = 0.0207


Epoch 20/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 20: Train Loss = 0.0174, Val Loss = 0.0199


Epoch 21/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 21: Train Loss = 0.0162, Val Loss = 0.0201


Epoch 22/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 22: Train Loss = 0.0165, Val Loss = 0.0205


Epoch 23/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 23: Train Loss = 0.0177, Val Loss = 0.0190


Epoch 24/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 24: Train Loss = 0.0166, Val Loss = 0.0191


Epoch 25/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 25: Train Loss = 0.0149, Val Loss = 0.0195


Epoch 26/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 26: Train Loss = 0.0161, Val Loss = 0.0199


Epoch 27/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 27: Train Loss = 0.0151, Val Loss = 0.0187


Epoch 28/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 28: Train Loss = 0.0161, Val Loss = 0.0188


Epoch 29/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 29: Train Loss = 0.0165, Val Loss = 0.0189


Epoch 30/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 30: Train Loss = 0.0135, Val Loss = 0.0188


Epoch 31/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 31: Train Loss = 0.0132, Val Loss = 0.0186


Epoch 32/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 32: Train Loss = 0.0136, Val Loss = 0.0186


Epoch 33/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 33: Train Loss = 0.0151, Val Loss = 0.0184


Epoch 34/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 34: Train Loss = 0.0136, Val Loss = 0.0183


Epoch 35/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 35: Train Loss = 0.0136, Val Loss = 0.0188


Epoch 36/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 36: Train Loss = 0.0127, Val Loss = 0.0182


Epoch 37/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 37: Train Loss = 0.0139, Val Loss = 0.0177


Epoch 38/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 38: Train Loss = 0.0165, Val Loss = 0.0177


Epoch 39/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 39: Train Loss = 0.0122, Val Loss = 0.0180


Epoch 40/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 40: Train Loss = 0.0117, Val Loss = 0.0179


Epoch 41/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 41: Train Loss = 0.0127, Val Loss = 0.0181


Epoch 42/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 42: Train Loss = 0.0122, Val Loss = 0.0182


Epoch 43/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 43: Train Loss = 0.0113, Val Loss = 0.0175


Epoch 44/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 44: Train Loss = 0.0120, Val Loss = 0.0174


Epoch 45/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 45: Train Loss = 0.0112, Val Loss = 0.0179


Epoch 46/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 46: Train Loss = 0.0126, Val Loss = 0.0177


Epoch 47/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 47: Train Loss = 0.0115, Val Loss = 0.0172


Epoch 48/100: 100%|██████████| 26/26 [00:41<00:00,  1.61s/it]


[notice] Epoch 48: Train Loss = 0.0121, Val Loss = 0.0173


Epoch 49/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 49: Train Loss = 0.0117, Val Loss = 0.0172


Epoch 50/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 50: Train Loss = 0.0114, Val Loss = 0.0177


Epoch 51/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 51: Train Loss = 0.0124, Val Loss = 0.0178


Epoch 52/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 52: Train Loss = 0.0106, Val Loss = 0.0178


Epoch 53/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 53: Train Loss = 0.0116, Val Loss = 0.0173


Epoch 54/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 54: Train Loss = 0.0125, Val Loss = 0.0177


Epoch 55/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 55: Train Loss = 0.0102, Val Loss = 0.0179


Epoch 56/100: 100%|██████████| 26/26 [00:42<00:00,  1.63s/it]


[notice] Epoch 56: Train Loss = 0.0117, Val Loss = 0.0182


Epoch 57/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 57: Train Loss = 0.0114, Val Loss = 0.0178


Epoch 58/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 58: Train Loss = 0.0110, Val Loss = 0.0175


Epoch 59/100: 100%|██████████| 26/26 [00:42<00:00,  1.62s/it]


[notice] Epoch 59: Train Loss = 0.0115, Val Loss = 0.0174
[notice] Early stopping triggered


## Ensemble Model Validation

This phase describes the final steps of model training, where the trained models are combined for robust object detection and their performance is evaluated.

**Ensemble Methodology:**

- **Loading Trained Models:** The pipeline begins by loading the best-performing models saved during the training phase. The yolo8x and yolo8n models are loaded from their respective best.pt files, while the faster_rcnn model's state dictionary is loaded into its architecture. The faster_rcnn model is also set to evaluation mode (.eval()), which is crucial for inference as it disables training-specific layers like dropout.
- **Ensemble Inference (run_ensemble_inference):** This function takes a single image and runs inference on all three models (YOLOv8x, YOLOv8n, and Faster R-CNN) to get their individual bounding box predictions. It then combines all these predictions into a single list.
- **Non-Maximum Suppression (NMS):** After combining the predictions, NMS is applied to filter out redundant and overlapping bounding boxes. This process ensures that for each object, only the most confident and representative bounding box remains, preventing the same object from being detected multiple times by different models.
- **Validation (validate_ensemble):** This function evaluates the combined model's performance on the validation dataset. It compares the final, ensembled predictions with the ground truth targets to calculate key performance metrics.

**Hyperparameters:**

The inference process is controlled by a few key hyperparameters.

- **conf_thres:** The confidence threshold, which filters out any detections with a confidence score below this value.
- **iou_thres:** The Intersection over Union (IoU) threshold used by the NMS algorithm to determine which overlapping bounding boxes should be suppressed.

**Theories and Concepts:**

- **Non-Maximum Suppression (NMS):** A fundamental post-processing algorithm in object detection. It is used to select the best bounding box among many overlapping candidates for a single object.
- **Precision and Recall:** These are standard metrics for evaluating object detection models.
  - Precision measures the accuracy of the model's positive predictions. It is the ratio of true positives to the total number of predicted positives (TP/(TP+FP)).
  - Recall measures the model's ability to find all relevant instances. It is the ratio of true positives to the total number of actual positives (TP/(TP+FN)).

- **Torch.no_grad():** This context manager disables gradient calculation, which is essential during inference and validation. It reduces memory usage and speeds up computation since the model's weights will not be updated.

### Load and Prepared Model for Ensemble Validation

In [13]:
# --- Load and Prepare Models for Ensemble ---
print("[notice] Loading trained models for ensemble...")
yolo8x_model = YOLO("/kaggle/working/runs/detect/yolo8n_trained2/weights/best.pt")  # Fixed path
yolo8n_model = YOLO("/kaggle/working/runs/detect/yolo8x_trained/weights/best.pt")
faster_rcnn_model.load_state_dict(torch.load("/kaggle/working/faster_rcnn_best.pt"))
faster_rcnn_model.to(device)
faster_rcnn_model.eval()

[notice] Loading trained models for ensemble...


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

### Ensemble Model Validation

In [14]:
# --- Ensemble Validation ---
@torch.no_grad()
def run_ensemble_inference(image_tensor, conf_thres=0.25, iou_thres=0.5):
    """
    Runs inference with YOLO and Faster R-CNN models and combines their results.
    Accepts a PyTorch tensor with a batch dimension.
    """
    # YOLOv8x inference
    yolo8x_results = yolo8x_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8x_results[0].boxes.data) == 0:
        yolo8x_preds = torch.empty((0, 6), device=device)
    else:
        yolo8x_preds = yolo8x_results[0].boxes.data.to(device)

    # YOLOv8n inference
    yolo8n_results = yolo8n_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8n_results[0].boxes.data) == 0:
        yolo8n_preds = torch.empty((0, 6), device=device)
    else:
        yolo8n_preds = yolo8n_results[0].boxes.data.to(device)

    # Faster R-CNN inference
    faster_rcnn_image = image_tensor.to(device)  # Move to GPU directly
    faster_rcnn_results = faster_rcnn_model([faster_rcnn_image[0]])  # Pass as list, single image per call
    if len(faster_rcnn_results[0]['boxes']) == 0:
        faster_rcnn_preds = torch.empty((0, 6), device=device)
    else:
        boxes = faster_rcnn_results[0]['boxes'].to(device)
        scores = faster_rcnn_results[0]['scores'].to(device)
        labels = faster_rcnn_results[0]['labels'].to(device)
        mask = (labels == 1)  # Filter for class 1 (object)
        faster_rcnn_preds = torch.cat((boxes[mask], scores[mask].unsqueeze(1), torch.zeros_like(scores[mask].unsqueeze(1))), dim=1)

    # Combine detections
    combined_detections = torch.cat((yolo8x_preds, yolo8n_preds, faster_rcnn_preds), dim=0)

    if combined_detections.shape[0] == 0:
        return torch.empty((0, 4), device=device), torch.empty((0,), device=device), torch.empty((0,), device=device)

    # Apply NMS
    combined_boxes = combined_detections[:, :4]
    combined_scores = combined_detections[:, 4]
    combined_classes = combined_detections[:, 5]

    keep_indices = nms(combined_boxes, combined_scores, iou_thres)
    
    final_boxes = combined_boxes[keep_indices]
    final_scores = combined_scores[keep_indices]
    final_classes = combined_classes[keep_indices]

    # Scale boxes to match IMG_SIZE
    orig_shape = yolo8x_results[0].orig_shape
    scale_x = IMG_SIZE / orig_shape[1]
    scale_y = IMG_SIZE / orig_shape[0]
    final_boxes[:, [0, 2]] *= scale_x
    final_boxes[:, [1, 3]] *= scale_y
    final_boxes = final_boxes.clamp(min=0, max=IMG_SIZE-1)

    return final_boxes, final_scores, final_classes

@torch.no_grad()
def validate_ensemble(val_loader):
    """
    Validates the ensemble model on the validation dataset.
    """
    print("[notice] Validating ensemble model...")
    all_metrics = []

    for images, targets in tqdm(val_loader, desc="Validating Ensemble"):
        images = torch.stack(images).to(device)
        batch_size = len(images)

        final_boxes_batch = []
        final_scores_batch = []
        final_classes_batch = []
        for i in range(batch_size):
            final_boxes, final_scores, final_classes = run_ensemble_inference(images[i].unsqueeze(0))
            final_boxes_batch.append(final_boxes)
            final_scores_batch.append(final_scores)
            final_classes_batch.append(final_classes)

        for i in range(batch_size):
            gt_boxes = targets[i]['boxes'].to(device)
            if len(gt_boxes) == 0 or len(final_boxes_batch[i]) == 0:
                all_metrics.append({'precision': 0, 'recall': 0})
                continue

            iou_matrix = box_iou(gt_boxes, final_boxes_batch[i])
            detected_count = 0

            for gt_idx in range(len(gt_boxes)):
                if torch.max(iou_matrix[gt_idx]) >= 0.5:
                    detected_count += 1

            true_positives = detected_count
            predicted_positives = len(final_boxes_batch[i])
            actual_positives = len(gt_boxes)

            precision = true_positives / predicted_positives if predicted_positives > 0 else 0
            recall = true_positives / actual_positives if actual_positives > 0 else 0
            all_metrics.append({'precision': precision, 'recall': recall})

    if all_metrics:
        avg_precision = np.mean([m['precision'] for m in all_metrics])
        avg_recall = np.mean([m['recall'] for m in all_metrics])
        print(f"Ensemble Validation Results:")
        print(f"Average Precision: {avg_precision:.4f}")
        print(f"Average Recall: {avg_recall:.4f}")
    else:
        print("No detections found. Validation metrics not calculated.")

# Run the validation on the ensembled models
validate_ensemble(val_loader)

[notice] Validating ensemble model...


Validating Ensemble: 100%|██████████| 21/21 [00:30<00:00,  1.46s/it]

Ensemble Validation Results:
Average Precision: 0.2331
Average Recall: 0.2331





### Ensemble Model Hyperparameters Tuning

The section consists of three main components: ensemble inference, validation, and automated optimization.

**run_ensemble_inference:** This function is the heart of the system.
 - It takes a single image tensor and a set of hyperparameters (conf_thres, iou_thres) as input.
 - It runs each of the pre-loaded models (yolo8x, yolo8n, and faster_rcnn) to generate object detections. The results from each model are processed to extract the bounding boxes, confidence scores, and class labels.
 - A key part of this function is handling the different output formats of the models. The Faster R-CNN output, which is a dictionary of tensors, is converted into a single tensor with the same structure as the YOLO output ([x1, y1, x2, y2, score, class]).
 - The predictions from all three models are concatenated into a single master tensor.
 - Non-Maximum Suppression (NMS) is then applied to this combined set of predictions. NMS is a post-processing algorithm that eliminates redundant or overlapping bounding boxes that likely belong to the same object.
 - Finally, the function scales the remaining bounding boxes back to their original size and returns the final, de-duplicated set of predictions.

**validate_ensemble:** This function serves as the evaluation loop.
 - It iterates through a validation dataset, fetching images and their corresponding ground truth labels.
 - For each image, it calls run_ensemble_inference to get the ensemble's predictions.
 - It then compares the predicted bounding boxes against the ground truth boxes using the Intersection over Union (IoU) metric. A prediction is considered a "true positive" if its IoU with a ground truth box exceeds a threshold (typically 0.5).
 - Based on these comparisons, the function calculates the Precision, Recall, and F1 Score for the batch. These metrics are then averaged across the entire validation set to provide an overall performance score.

**Automated Hyperparameter Tuning with Optuna:** This section automates the optimization process.
 - An **objective** function is defined, which tells Optuna what to optimize. In this case, the objective is to maximize the F1 score.
 - The **trial** object within the function suggests new values for the hyperparameters - **conf_thres (confidence threshold)** and **iou_thres (IoU threshold)** - within specified ranges.
 - The study.optimize call runs the objective function multiple times with different hyperparameter combinations. Optuna uses intelligent search algorithms (like Bayesian optimization) to efficiently explore the parameter space, saving significant time compared to manual grid search.

**Understanding Automated Hyperparameter Tuning with Optuna**

Optuna is an open-source hyperparameter optimization framework designed to automate the process of finding the best set of hyperparameters for a machine learning model. Unlike traditional methods like grid search or random search, Optuna uses intelligent, state-of-the-art algorithms to efficiently explore the hyperparameter space, often leading to better results in less time.

**The Core Concepts of Optuna**

Optuna operates on a "define-by-run" principle, which means you define the hyperparameter search space dynamically within your code. The entire optimization process is structured around a few key components:

- **Study:** A study is the main object in Optuna. It represents a single optimization run. You can think of it as a container that holds all the information about the trials, including the objective function, the search direction (maximize or minimize), and the results.
- **Trial:** A trial is a single execution of your machine learning training and evaluation process with a specific set of hyperparameters. Optuna runs many trials within a study, each with a different set of hyperparameters proposed by the Sampler.
- **Objective Function:** This is the Python function that you want to optimize. It takes a trial object as its only argument and must return a single numerical value that represents the performance of the model (e.g., accuracy, F1 score, or loss). Inside this function, you define the hyperparameters to be tuned by calling methods on the trial object, such as trial.suggest_float() or trial.suggest_int().

**How the Optimization Process Works**

The optimization process in Optuna follows a straightforward loop:

- **Define the Objective:** You write your objective function that takes a trial object. Inside this function, you:
  - Suggest Hyperparameters: Use trial.suggest_float("learning_rate", 1e-5, 1e-2) to let Optuna know that you want to tune the learning rate within a specific range. Optuna will then dynamically choose a value for this parameter for the current trial.
  - Train and Evaluate: Your code uses the suggested hyperparameters to train a model and evaluate its performance on a validation set.
  - Return the Metric: The function returns the performance metric (e.g., accuracy).
- **Create a Study:** You create a study object and specify the direction of optimization. For example, optuna.create_study(direction="maximize") tells Optuna that a higher value of the objective function is better.
- **Optimize:** You call study.optimize(objective, n_trials=100). Optuna will then execute the objective function for a specified number of trials.

**Intelligent Search Algorithms (Samplers and Pruners)**

What makes Optuna so powerful are the algorithms it uses to select hyperparameters and manage trials.

- **Samplers:** These are the algorithms that propose new sets of hyperparameters for each trial. Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimization algorithm.
  - Unlike random search, which samples hyperparameter values independently, TPE builds a probabilistic model of the objective function's performance based on the results of past trials.
  - It maintains two groups of hyperparameters: one for trials with good performance and one for trials with bad performance.
  - It then uses these models to propose new hyperparameters that are more likely to yield a good result, intelligently focusing the search on promising regions of the parameter space.
- **Pruners:** Pruners are a key feature for early stopping. For long-running training processes, a pruner can look at the intermediate results of a trial (e.g., accuracy after each epoch) and decide to stop it early if it's not performing well. This saves a significant amount of computational time and resources. For example, the MedianPruner stops a trial if its performance falls below the median performance of all previous trials at the same epoch.

In [21]:
# --- Ensemble Validation with Hyperparameter Tuning ---
@torch.no_grad()
def run_ensemble_inference(image_tensor, conf_thres, iou_thres):
    """
    Runs inference with YOLO and Faster R-CNN models and combines their results.
    Accepts a PyTorch tensor with a batch dimension.
    """
    # YOLOv8x inference
    yolo8x_results = yolo8x_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8x_results[0].boxes.data) == 0:
        yolo8x_preds = torch.empty((0, 6), device=device)
    else:
        yolo8x_preds = yolo8x_results[0].boxes.data.to(device)

    # YOLOv8n inference
    yolo8n_results = yolo8n_model.predict(image_tensor, conf=conf_thres, iou=iou_thres, verbose=False)
    if len(yolo8n_results[0].boxes.data) == 0:
        yolo8n_preds = torch.empty((0, 6), device=device)
    else:
        yolo8n_preds = yolo8n_results[0].boxes.data.to(device)

    # Faster R-CNN inference
    faster_rcnn_image = image_tensor.to(device)
    faster_rcnn_results = faster_rcnn_model([faster_rcnn_image[0]])
    if len(faster_rcnn_results[0]['boxes']) == 0:
        faster_rcnn_preds = torch.empty((0, 6), device=device)
    else:
        boxes = faster_rcnn_results[0]['boxes'].to(device)
        scores = faster_rcnn_results[0]['scores'].to(device)
        labels = faster_rcnn_results[0]['labels'].to(device)
        mask = (labels == 1)
        faster_rcnn_preds = torch.cat((boxes[mask], scores[mask].unsqueeze(1), torch.zeros_like(scores[mask].unsqueeze(1))), dim=1)

    # Combine detections
    combined_detections = torch.cat((yolo8x_preds, yolo8n_preds, faster_rcnn_preds), dim=0)

    if combined_detections.shape[0] == 0:
        return torch.empty((0, 4), device=device), torch.empty((0,), device=device), torch.empty((0,), device=device)

    # Apply NMS
    combined_boxes = combined_detections[:, :4]
    combined_scores = combined_detections[:, 4]
    combined_classes = combined_detections[:, 5]

    keep_indices = nms(combined_boxes, combined_scores, iou_thres)
    
    final_boxes = combined_boxes[keep_indices]
    final_scores = combined_scores[keep_indices]
    final_classes = combined_classes[keep_indices]

    # Scale boxes to match IMG_SIZE
    orig_shape = yolo8x_results[0].orig_shape
    scale_x = IMG_SIZE / orig_shape[1]
    scale_y = IMG_SIZE / orig_shape[0]
    final_boxes[:, [0, 2]] *= scale_x
    final_boxes[:, [1, 3]] *= scale_y
    final_boxes = final_boxes.clamp(min=0, max=IMG_SIZE-1)

    return final_boxes, final_scores, final_classes

@torch.no_grad()
def validate_ensemble(val_loader, conf_thres, iou_thres):
    """
    Validates the ensemble model on the validation dataset.
    """
    all_metrics = []

    for images, targets in tqdm(val_loader, desc="Validating Ensemble"):
        images = torch.stack(images).to(device)
        batch_size = len(images)

        final_boxes_batch = []
        final_scores_batch = []
        final_classes_batch = []
        for i in range(batch_size):
            final_boxes, final_scores, final_classes = run_ensemble_inference(images[i].unsqueeze(0), conf_thres, iou_thres)
            final_boxes_batch.append(final_boxes)
            final_scores_batch.append(final_scores)
            final_classes_batch.append(final_classes)

        for i in range(batch_size):
            gt_boxes = targets[i]['boxes'].to(device)
            if len(gt_boxes) == 0 or len(final_boxes_batch[i]) == 0:
                all_metrics.append({'precision': 0, 'recall': 0})
                continue

            iou_matrix = box_iou(gt_boxes, final_boxes_batch[i])
            detected_count = 0

            for gt_idx in range(len(gt_boxes)):
                if torch.max(iou_matrix[gt_idx]) >= 0.5:
                    detected_count += 1

            true_positives = detected_count
            predicted_positives = len(final_boxes_batch[i])
            actual_positives = len(gt_boxes)

            precision = true_positives / predicted_positives if predicted_positives > 0 else 0
            recall = true_positives / actual_positives if actual_positives > 0 else 0
            all_metrics.append({'precision': precision, 'recall': recall})

    if all_metrics:
        avg_precision = np.mean([m['precision'] for m in all_metrics])
        avg_recall = np.mean([m['recall'] for m in all_metrics])
        f1_score = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
        print(f"Validation Results - conf_thres={conf_thres:.2f}, iou_thres={iou_thres:.2f}:")
        print(f"Average Precision: {avg_precision:.4f}, Average Recall: {avg_recall:.4f}, F1 Score: {f1_score:.4f}")
    else:
        f1_score = 0
        print(f"Validation Results - conf_thres={conf_thres:.2f}, iou_thres={iou_thres:.2f}: No detections found, F1 Score: {f1_score:.4f}")

    return f1_score

def objective(trial):
    """
    Optuna objective function to tune hyperparameters.
    """
    conf_thres = trial.suggest_float("conf_thres", 0.1, 0.9)
    iou_thres = trial.suggest_float("iou_thres", 0.3, 0.7)
    
    f1_score = validate_ensemble(val_loader, conf_thres, iou_thres)
    return f1_score

# Create and run Optuna study
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Print best parameters and F1 score
best_params = study.best_params
best_f1 = study.best_value
print(f"Best hyperparameters: {best_params}")
print(f"Best F1 Score: {best_f1:.4f}")

[I 2025-09-16 12:51:27,969] A new study created in memory with name: no-name-b4543424-49b8-4d34-a23f-5836bc841818
Validating Ensemble: 100%|██████████| 21/21 [00:27<00:00,  1.30s/it]
[I 2025-09-16 12:51:55,188] Trial 0 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7755666168334344, 'iou_thres': 0.6295414261459875}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.78, iou_thres=0.63:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:27<00:00,  1.31s/it]
[I 2025-09-16 12:52:22,609] Trial 1 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.2969795417892887, 'iou_thres': 0.4991268810744195}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.30, iou_thres=0.50:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:27<00:00,  1.32s/it]
[I 2025-09-16 12:52:50,382] Trial 2 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.3339626316555889, 'iou_thres': 0.6369047024342465}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.33, iou_thres=0.64:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.34s/it]
[I 2025-09-16 12:53:18,516] Trial 3 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.2668385314395314, 'iou_thres': 0.5010290118856199}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.27, iou_thres=0.50:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.35s/it]
[I 2025-09-16 12:53:46,972] Trial 4 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7702485898984038, 'iou_thres': 0.3057966139151117}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.77, iou_thres=0.31:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.36s/it]
[I 2025-09-16 12:54:15,509] Trial 5 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.6063265558233624, 'iou_thres': 0.5425224392241373}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.61, iou_thres=0.54:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.37s/it]
[I 2025-09-16 12:54:44,265] Trial 6 finished with value: 0.23158493479055783 and parameters: {'conf_thres': 0.17918164945406634, 'iou_thres': 0.5535546733160253}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.18, iou_thres=0.55:
Average Precision: 0.2301, Average Recall: 0.2331, F1 Score: 0.2316


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.37s/it]
[I 2025-09-16 12:55:13,110] Trial 7 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.48992242897882876, 'iou_thres': 0.3120360903792454}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.49, iou_thres=0.31:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.37s/it]
[I 2025-09-16 12:55:41,909] Trial 8 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.2511971975770768, 'iou_thres': 0.44675856392378643}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.25, iou_thres=0.45:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.38s/it]
[I 2025-09-16 12:56:10,904] Trial 9 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.6622694639056872, 'iou_thres': 0.39966391353057484}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.66, iou_thres=0.40:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:28<00:00,  1.38s/it]
[I 2025-09-16 12:56:39,922] Trial 10 finished with value: 0.22699386503067484 and parameters: {'conf_thres': 0.883403580598634, 'iou_thres': 0.699439295203502}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.88, iou_thres=0.70:
Average Precision: 0.2270, Average Recall: 0.2270, F1 Score: 0.2270


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 12:57:09,262] Trial 11 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.42944724715536026, 'iou_thres': 0.619423458262578}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.43, iou_thres=0.62:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 12:57:38,639] Trial 12 finished with value: 0.22085889570552147 and parameters: {'conf_thres': 0.8995531334755222, 'iou_thres': 0.41658509468124155}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.90, iou_thres=0.42:
Average Precision: 0.2209, Average Recall: 0.2209, F1 Score: 0.2209


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 12:58:08,118] Trial 13 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.6139655582036667, 'iou_thres': 0.5857209215264755}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.61, iou_thres=0.59:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 12:58:37,528] Trial 14 finished with value: 0.2300204498977505 and parameters: {'conf_thres': 0.13096889978941184, 'iou_thres': 0.4812512757155751}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.13, iou_thres=0.48:
Average Precision: 0.2270, Average Recall: 0.2331, F1 Score: 0.2300


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 12:59:06,809] Trial 15 finished with value: 0.23158493479055783 and parameters: {'conf_thres': 0.3975329971889463, 'iou_thres': 0.6979188299490119}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.40, iou_thres=0.70:
Average Precision: 0.2301, Average Recall: 0.2331, F1 Score: 0.2316


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 12:59:36,109] Trial 16 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7515758916675244, 'iou_thres': 0.5148175013507363}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.75, iou_thres=0.51:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:00:05,551] Trial 17 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.5068518890803902, 'iou_thres': 0.626327741378838}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.51, iou_thres=0.63:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.41s/it]
[I 2025-09-16 13:00:35,183] Trial 18 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7728912581470555, 'iou_thres': 0.37907723114253034}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.77, iou_thres=0.38:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:01:04,663] Trial 19 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.31657514316442115, 'iou_thres': 0.46181115487199215}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.32, iou_thres=0.46:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:01:34,092] Trial 20 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.5168145646694251, 'iou_thres': 0.580639355317434}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.52, iou_thres=0.58:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:02:03,551] Trial 21 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.35557440580160304, 'iou_thres': 0.6535949517940864}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.36, iou_thres=0.65:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:02:33,049] Trial 22 finished with value: 0.2300204498977505 and parameters: {'conf_thres': 0.22111503364002832, 'iou_thres': 0.6615607562511341}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.22, iou_thres=0.66:
Average Precision: 0.2270, Average Recall: 0.2331, F1 Score: 0.2300


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:03:02,403] Trial 23 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.3168653485744751, 'iou_thres': 0.5977962401996536}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.32, iou_thres=0.60:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:03:31,584] Trial 24 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.42980596772033175, 'iou_thres': 0.5531801786703071}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.43, iou_thres=0.55:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:04:00,736] Trial 25 finished with value: 0.21903267227849907 and parameters: {'conf_thres': 0.10016648034316336, 'iou_thres': 0.6597416828942637}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.10, iou_thres=0.66:
Average Precision: 0.2065, Average Recall: 0.2331, F1 Score: 0.2190


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:04:30,005] Trial 26 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.3404378885259428, 'iou_thres': 0.5286456838064951}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.34, iou_thres=0.53:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:04:59,329] Trial 27 finished with value: 0.22843496520772427 and parameters: {'conf_thres': 0.1765484079180215, 'iou_thres': 0.6106260122622325}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.18, iou_thres=0.61:
Average Precision: 0.2239, Average Recall: 0.2331, F1 Score: 0.2284


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:05:28,661] Trial 28 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.5627488324898824, 'iou_thres': 0.6451295277059972}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.56, iou_thres=0.65:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:05:58,074] Trial 29 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.2784897954041582, 'iou_thres': 0.49454145883267636}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.28, iou_thres=0.49:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:06:27,346] Trial 30 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7001854882004066, 'iou_thres': 0.5688750533078639}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.70, iou_thres=0.57:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:06:56,588] Trial 31 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.25294230351119557, 'iou_thres': 0.5098970716373962}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.25, iou_thres=0.51:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:07:25,938] Trial 32 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.39130549869294073, 'iou_thres': 0.4683792220171365}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.39, iou_thres=0.47:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:07:55,148] Trial 33 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.8211847726135847, 'iou_thres': 0.3446953602456799}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.82, iou_thres=0.34:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:08:24,371] Trial 34 finished with value: 0.23158493479055783 and parameters: {'conf_thres': 0.20083817184353564, 'iou_thres': 0.42901668972480933}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.20, iou_thres=0.43:
Average Precision: 0.2301, Average Recall: 0.2331, F1 Score: 0.2316


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:08:53,629] Trial 35 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.28471997883137923, 'iou_thres': 0.5450981557296883}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.28, iou_thres=0.55:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:09:22,998] Trial 36 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.445097579136029, 'iou_thres': 0.6816222709054575}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.45, iou_thres=0.68:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:09:52,492] Trial 37 finished with value: 0.23158493479055783 and parameters: {'conf_thres': 0.16124202385874709, 'iou_thres': 0.5307068741181957}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.16, iou_thres=0.53:
Average Precision: 0.2301, Average Recall: 0.2331, F1 Score: 0.2316


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:10:21,922] Trial 38 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.22938971776994954, 'iou_thres': 0.635673631664164}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.23, iou_thres=0.64:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:10:51,371] Trial 39 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.35860625921993244, 'iou_thres': 0.4492990460999965}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.36, iou_thres=0.45:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:11:20,728] Trial 40 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.6270787821608328, 'iou_thres': 0.3764743250194936}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.63, iou_thres=0.38:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:11:50,068] Trial 41 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.8045341495684214, 'iou_thres': 0.3082433715955938}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.80, iou_thres=0.31:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:12:19,302] Trial 42 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7071452876713807, 'iou_thres': 0.35318523780640965}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.71, iou_thres=0.35:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]
[I 2025-09-16 13:12:48,597] Trial 43 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.6730750251164458, 'iou_thres': 0.40637625518583}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.67, iou_thres=0.41:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:13:17,939] Trial 44 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.8573480271869985, 'iou_thres': 0.6087096601791255}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.86, iou_thres=0.61:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:13:47,332] Trial 45 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.5486360021884888, 'iou_thres': 0.3319033846390224}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.55, iou_thres=0.33:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:14:16,737] Trial 46 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.7615684648157964, 'iou_thres': 0.5676014821215317}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.76, iou_thres=0.57:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:14:46,144] Trial 47 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.46892370892737023, 'iou_thres': 0.6762661611771166}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.47, iou_thres=0.68:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:15:15,581] Trial 48 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.8659936605146046, 'iou_thres': 0.427177050942485}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.87, iou_thres=0.43:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331


Validating Ensemble: 100%|██████████| 21/21 [00:29<00:00,  1.40s/it]
[I 2025-09-16 13:15:45,069] Trial 49 finished with value: 0.2331288343558282 and parameters: {'conf_thres': 0.386831927849308, 'iou_thres': 0.48909870442432707}. Best is trial 0 with value: 0.2331288343558282.


Validation Results - conf_thres=0.39, iou_thres=0.49:
Average Precision: 0.2331, Average Recall: 0.2331, F1 Score: 0.2331
Best hyperparameters: {'conf_thres': 0.7755666168334344, 'iou_thres': 0.6295414261459875}
Best F1 Score: 0.2331


# Ensemble Inference and Submission Generation

This final phase of the pipeline takes the trained models and combines their predictions to create a single, robust set of detections. This output is then formatted for submission.

**Ensemble Methodology:**

- **Loading Trained Models:** The pipeline begins by loading the best-performing models saved during the training phase. The yolo8x and yolo8n models are loaded from their respective best.pt files, while the faster_rcnn model's state dictionary is loaded into its architecture. The faster_rcnn model is also set to evaluation mode (.eval()), which is crucial for inference as it disables training-specific layers like dropout.
- **Batch Prediction Loop:** The code iterates through each image in the test set. For each image, it runs a separate inference pass on all three models (YOLOv8x, YOLOv8n, and Faster R-CNN) to get individual sets of bounding box predictions. The image is preprocessed and scaled to the correct size (IMG_SIZE) before being passed to the models.
- **Prediction Normalization:** A crucial step is the normalization of the bounding box coordinates. The raw coordinates from the models are converted to a [0, 1] range relative to the original image dimensions. This ensures that the fusion algorithm works with a consistent coordinate system.
- **Weighted Boxes Fusion (WBF):** This is the core of the ensemble process. The weighted_boxes_fusion function is called with the lists of predictions from all three models. It intelligently merges overlapping boxes, giving a final set of detections with higher confidence and accuracy. A list of weights is provided to give more importance to the models that are expected to perform better (e.g., YOLOv8x with a weight of 1.0).
- **Saving Predictions:** The fused predictions are then formatted and saved to individual .txt files in a format compatible with the YOLO specification. Each line in the file contains the class ID, confidence score, and normalized bounding box coordinates in the format of x_center, y_center, width, and height.
- **CSV Submission Generation:** The predictions_to_csv function reads all the generated .txt files and compiles them into a single submission.csv file. This is the required format for competition submissions. The function ensures that every image in the test set is accounted for, adding "no boxes" for any images without detections.

**Hyperparameters:**

The inference process is controlled by a few key hyperparameters.

- **CONF_THRESHOLD:** The confidence threshold, which filters out any detections with a confidence score below this value before they are passed to the WBF algorithm.
- **IOU_THRESHOLD:** The Intersection over Union (IoU) threshold used by the WBF algorithm. It determines how much overlap is required for boxes to be considered for fusion.
- **weights:** A list of floats that assigns a weight to each model's predictions. These weights influence how much each model contributes to the final fused bounding box.

**Theories and Concepts:**

- **Weighted Boxes Fusion (WBF):** A fundamental post-processing algorithm for model ensembling. Unlike simpler methods like Non-Maximum Suppression (NMS), which can discard good predictions, WBF combines the coordinates and confidence scores of overlapping boxes from different models, producing a single, more reliable bounding box.
- **Ensemble Learning:** This code demonstrates the power of ensemble learning, where combining the strengths of different models (e.g., YOLO's speed and Faster R-CNN's accuracy) leads to a more robust and higher-performing final system.
- **Inference:** The process of using a trained model to make predictions on new, unseen data. In this context, the entire pipeline from loading the models to generating the final CSV is an end-to-end inference workflow.
- **YOLO Format:** The output format used for the prediction .txt files is a common standard in object detection, making it easy to share and use the results with other tools. The format represents a bounding box by its center coordinates, width, and height, all normalized to the range of 0 to 1.

## Load and Prepared Models for Ensemble

In [16]:
# --- Load and Prepare Models for Ensemble ---
print("[notice] Loading trained models for ensemble...")
yolo8x_model = YOLO("/kaggle/working/runs/detect/yolo8n_trained2/weights/best.pt")  # Fixed path
yolo8n_model = YOLO("/kaggle/working/runs/detect/yolo8x_trained/weights/best.pt")
faster_rcnn_model.load_state_dict(torch.load("/kaggle/working/faster_rcnn_best.pt"))
faster_rcnn_model.to(device)
faster_rcnn_model.eval()

[notice] Loading trained models for ensemble...


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

## Ensemble Inference

In [22]:
CONF_THRESHOLD = 0.7755666168334344
IOU_THRESHOLD = 0.6295414261459875

# --- Ensemble Inference ---
test_images_path = f"{base_path}/testImages/images"
output_dir = "/kaggle/working/predictions/labels"
os.makedirs(output_dir, exist_ok=True)

test_transforms = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(size=(IMG_SIZE, IMG_SIZE)),
])

for img_path in tqdm(list(Path(test_images_path).glob("*")), desc="Predicting"):
    if img_path.suffix.lower() not in ['.png', '.jpg', '.jpeg']:
        continue
    
    img_name = img_path.stem
    img = Image.open(img_path).convert("RGB")
    img_width, img_height = img.size
    
    # Preprocess image
    img_tensor = test_transforms(img).unsqueeze(0).to(device)  # Add batch dimension
    
    # YOLOv8x predictions
    yolo8x_boxes = []
    yolo8x_scores = []
    yolo8x_labels = []
    try:
        yolo8x_results = yolo8x_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8x_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        # Normalize to [0, 1] relative to original image size
                        yolo8x_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8x_scores.append(conf)
                        yolo8x_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8x prediction failed for {img_name}: {e}")
    
    # YOLOv8n predictions
    yolo8n_boxes = []
    yolo8n_scores = []
    yolo8n_labels = []
    try:
        yolo8n_results = yolo8n_model.predict(img_path, conf=CONF_THRESHOLD, verbose=False)
        for result in yolo8n_results:
            boxes = result.boxes.data
            if boxes is not None:
                for box in boxes:
                    x1, y1, x2, y2, conf, cls_id = box.tolist()
                    if conf >= CONF_THRESHOLD:
                        yolo8n_boxes.append([x1/img_width, y1/img_height, x2/img_width, y2/img_height])
                        yolo8n_scores.append(conf)
                        yolo8n_labels.append(int(cls_id))
    except Exception as e:
        print(f"[warning] YOLOv8n prediction failed for {img_name}: {e}")
    
    # Faster R-CNN predictions
    frcnn_boxes = []
    frcnn_scores = []
    frcnn_labels = []
    try:
        with torch.no_grad():
            predictions = faster_rcnn_model([img_tensor[0]])[0]
            for box, score, label in zip(predictions['boxes'], predictions['scores'], predictions['labels']):
                if score >= CONF_THRESHOLD and label == 1:  # Class 1 is object
                    x1, y1, x2, y2 = box.tolist()
                    # Scale boxes to original image size and normalize
                    scale_x = img_width / IMG_SIZE
                    scale_y = img_height / IMG_SIZE
                    x1, x2 = x1 * scale_x / img_width, x2 * scale_x / img_width
                    y1, y2 = y1 * scale_y / img_height, y2 * scale_y / img_height
                    frcnn_boxes.append([x1, y1, x2, y2])
                    frcnn_scores.append(score.item())
                    frcnn_labels.append(0)  # Map to class 0 for consistency
    except Exception as e:
        print(f"[warning] Faster R-CNN prediction failed for {img_name}: {e}")
    
    # Ensemble using Weighted Boxes Fusion
    boxes_list = [yolo8x_boxes, yolo8n_boxes, frcnn_boxes]
    scores_list = [yolo8x_scores, yolo8n_scores, frcnn_scores]
    labels_list = [yolo8x_labels, yolo8n_labels, frcnn_labels]
    weights = [1.0, 0.9, 0.8]  # YOLOv8x > YOLOv8n > Faster R-CNN
    
    try:
        fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
            boxes_list,
            scores_list,
            labels_list,
            weights=weights,
            iou_thr=IOU_THRESHOLD,
            skip_box_thr=CONF_THRESHOLD
        )
    except Exception as e:
        print(f"[warning] WBF failed for {img_name}: {e}")
        fused_boxes, fused_scores, fused_labels = [], [], []
    
    # Save ensemble predictions in YOLO format
    output_txt = Path(output_dir) / f"{img_name}.txt"
    with open(output_txt, "w") as f:
        for box, score, label in zip(fused_boxes, fused_scores, fused_labels):
            x1, y1, x2, y2 = box
            x_center = (x1 + x2) / 2
            y_center = (y1 + y2) / 2
            width = x2 - x1
            height = y2 - y1
            f.write(f"{int(label)} {score:.6f} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n")

print(f"[notice] All ensemble detections saved to: {output_dir}")

Predicting: 100%|██████████| 159/159 [01:48<00:00,  1.47it/s]

[notice] All ensemble detections saved to: /kaggle/working/predictions/labels





## Submission File Generation

In [23]:
# --- Convert Predictions to Submission CSV ---
def predictions_to_csv(
    preds_folder: str = "/kaggle/working/predictions/labels",
    output_csv: str = "/kaggle/working/submission.csv",
    test_images_folder: str = "/kaggle/input/synthetic-2-real-object-detection-challenge-2/Synthetic to Real Object Detection Challenge 2/testImages/images",
    allowed_extensions: tuple = (".jpg", ".png", ".jpeg")
):
    preds_path = Path(preds_folder)
    test_images_path = Path(test_images_folder)
    
    test_images = {p.stem for p in test_images_path.glob("*") if p.suffix.lower() in allowed_extensions}
    predictions = []
    predicted_images = set()
    
    for txt_file in preds_path.glob("*.txt"):
        image_id = txt_file.stem
        predicted_images.add(image_id)
        with open(txt_file, "r") as f:
            valid_lines = [line.strip() for line in f if len(line.strip().split()) == 6]
        pred_str = " ".join(valid_lines) if valid_lines else "no boxes"
        predictions.append({"image_id": image_id, "prediction_string": pred_str})
    
    missing_images = test_images - predicted_images
    for image_id in missing_images:
        predictions.append({"image_id": image_id, "prediction_string": "no boxes"})
    
    submission_df = pd.DataFrame(predictions)
    submission_df.to_csv(output_csv, index=False, quoting=csv.QUOTE_MINIMAL)
    print(f"[notice] Submission saved to {output_csv}")

predictions_to_csv()

[notice] Submission saved to /kaggle/working/submission.csv
