# Submission

First, install the following libraries in your environment:

In [1]:
# Needed installations
!pip3 install pandas
!pip3 install tqdm
!pip3 install pillow
!pip3 install ultralytics # automatically installs the latest version of PyTorch and Torchvision



In [2]:
import os
import shutil
import time
from collections import defaultdict

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
from torchvision.ops import nms

from ultralytics import YOLO
from ultralytics.data.augment import LetterBox
from ultralytics.utils.ops import xywh2xyxy


## Inference class for our yolo model

### Yolo model

For this submission, we opted for the latest YOLO model from Ultralytics, [YOLO11](https://docs.ultralytics.com/models/yolo11/), alongside its predecessor [YOLOv10](https://docs.ultralytics.com/models/yolov10/). This decision was driven by the models' fast inference capabilities, which meet the challenge's requirement for efficient image processing. By leveraging different models, we aim to take advantage of ensembling diverse architectures to improve overall performance.

We specifically chose relatively small versions of YOLO11 and YOLOv10 to maintain a balance between inference speed and accuracy: YOLO11n (size 384), YOLO11m (size 384), and YOLOv10n (size 384).

All models were fine-tuned using an automated dataset curation process. Details of this process are provided in the training section of the notebook, where we outline our strategy for optimizing YOLO models to achieve the best possible performance.

### Non-Maximum-Suppression

The YOLO model from Ultralytics includes its own Non-Maximum Suppression (NMS) algorithm, designed to eliminate overlapping bounding boxes of the same class. We chose a relatively low NMS IoU threshold of 0.4, since cells should not overlap regardless of their class (at least, they do not in the training data).

**Why?** In the context of cytology, one of the goals is to minimize the overlap between cells. After analyzing the dataset, we concluded that overlapping cells are either non-existent or extremely rare.   


<img src="images/bbox.png" alt="Bounding boxes overlap" height="350"/>  <img src="images/iou.png" alt="IoU train data" height="350"/>  
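The overlap analysis mentioned above can be approximated with a short check. This is a minimal sketch, not the code we ran: it assumes the ground-truth boxes of one image are available as `(x1, y1, x2, y2)` rows, and the helper names `pairwise_iou` and `max_overlap` are illustrative only.

```python
import numpy as np

def pairwise_iou(boxes):
    """Return the (n, n) IoU matrix for boxes given as rows of (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def max_overlap(boxes):
    """Largest IoU between two distinct boxes of one image."""
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)  # ignore each box's IoU with itself
    return float(iou.max()) if iou.size else 0.0

# Toy boxes (made up for illustration): the first two overlap, the third is isolated.
toy_boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 120, 120)]
print(max_overlap(toy_boxes))
```

If the largest value stays well below the NMS threshold for nearly every training image, a threshold of 0.4 should rarely suppress a true cell.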

The standard Ultralytics functions do not return class probabilities for each detection, which is crucial for ensembling. To address this, we modified the inference process. First, we used the underlying Torch model to obtain the predictions in the correct format. Then, we modified the non_max_suppression function from Ultralytics to return the probability vector along with the usual outputs.

Additionally, we handled the image preprocessing ourselves.

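Since YOLOv10 is end-to-end (it is designed to run without an NMS step), we did not use the class probabilities when ensembling with YOLOv10 models; for these models, only the predicted label and its confidence score are used.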
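For models that only give a label and a confidence score, the ensemble needs a vector with the same shape as a class-probability vector. One simple way to build it, sketched here with NumPy, is a one-hot vector scaled by the score. The helper name `score_one_hot` is hypothetical; 23 is the number of classes in this challenge.

```python
import numpy as np

NUM_CLASSES = 23  # number of cell classes in the challenge

def score_one_hot(label, score, num_classes=NUM_CLASSES):
    """Build a vector that is zero everywhere except `score` at index `label`."""
    vec = np.zeros(num_classes, dtype=float)
    vec[int(label)] = score
    return vec

# A detection with label 20 and confidence 0.87 (values chosen for illustration).
v = score_one_hot(label=20, score=0.87)
print(v.argmax(), v.max())
```

Such a vector can be summed with real probability vectors from other models, although it carries no information about the runner-up classes.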

In [3]:
# Modified version of the non_max_suppression function from the ultralytics library
def non_max_suppression_modified(
    prediction,
    conf_thres=0.25,
    iou_thres=0.45,
    classes=None,
    agnostic=False,
    max_det=300,
    nc=0,  # number of classes (optional)
    max_nms=30000,
    max_wh=7680,
    in_place=True,
    rotated=False,
):
    # Checks
    assert 0 <= conf_thres <= 1, f"Invalid Confidence threshold {conf_thres}, valid values are between 0.0 and 1.0"
    assert 0 <= iou_thres <= 1, f"Invalid IoU {iou_thres}, valid values are between 0.0 and 1.0"
    if isinstance(prediction, (list, tuple)):  # YOLOv8 model in validation mode, output = (inference_out, loss_out)
        prediction = prediction[0]  # select only inference output
    if classes is not None:
        classes = torch.tensor(classes, device=prediction.device)

    bs = prediction.shape[0]  # batch size (BCN, i.e. 1,84,6300)
    nc = nc or (prediction.shape[1] - 4)  # number of classes
    nm = prediction.shape[1] - nc - 4  # number of masks
    mi = 4 + nc  # mask start index
    xc = prediction[:, 4:mi].amax(1) > conf_thres  # candidates

    # Settings
    # min_wh = 2  # (pixels) minimum box width and height

    prediction = prediction.transpose(-1, -2)  # shape(1,84,6300) to shape(1,6300,84)
    if not rotated:
        if in_place:
            prediction[..., :4] = xywh2xyxy(prediction[..., :4])  # xywh to xyxy
        else:
            prediction = torch.cat((xywh2xyxy(prediction[..., :4]), prediction[..., 4:]), dim=-1)  # xywh to xyxy

    t = time.time()

    output = [torch.zeros((0, 6 + nm), device=prediction.device)] * bs
    output_oh= [torch.zeros((0, 4 + nc), device=prediction.device)] * bs

    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[:, 2:4] < min_wh) | (x[:, 2:4] > max_wh)).any(1), 4] = 0  # width-height
        
        # get the elements of the image that have a confidence above the threshold
        x = x[xc[xi]]  # confidence
        

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Detections matrix nx6 (xyxy, conf, cls)
        # cls is probs here
        box, cls, mask = x.split((4, nc, nm), 1)


        # ! Main modification here
        conf, j = cls.max(1, keepdim=True)
        x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]

        x_oh=torch.cat((box,cls),1)#[cls.view(-1)>conf_thres]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        
        if n > max_nms:  # excess boxes
            indices = x[:, 4].argsort(descending=True)[:max_nms]
            x = x[indices]
            x_oh = x_oh[indices]  #
            #x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence and remove excess boxes
            

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        scores = x[:, 4]  # scores
        
        boxes = x[:, :4] + c  # boxes (offset by class)
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        i = i[:max_det]  # limit detections

        output[xi] = x[i]
        output_oh[xi] = x_oh[i]

    return output,output_oh

In [4]:
class YoloInference():
    def __init__(self,weights,device="cuda",verbose=False,load_model=True):
        self.device=device
        self.load_model=load_model
        self.weights=weights
        self.model=None
        if load_model:
            self.yolo=YOLO(weights,verbose=verbose).to(self.device)
            self.model=self.yolo.model

        self.model_type=None
        if self.weights.find("yolov10")!=-1:
            self.model_type="yolov10"
        elif self.weights.find("yolo11")!=-1:
            self.model_type="yolo11"


        self.letter_box=LetterBox((384,384))
        self.transform=T.Compose([T.ToTensor()])  

        self.conf_mat=None

        # load the conf matrix for eventual ensembling if it exists
        conf_path=os.path.dirname(os.path.dirname(os.path.dirname(weights)))+"/conf_matrix.npy"
        if os.path.exists(conf_path):
           self.conf_mat=np.load(conf_path)
    
    def predict(self,img,verbose=False,conf=0.05,mc_nms=True,return_probs=False):
        if self.model_type=="yolov10" or not return_probs:
            return self.predict_base(img,verbose=verbose,conf=conf,mc_nms=mc_nms)
        elif self.model_type=="yolo11":
            return self.predict_yolo11_probs(img,verbose=verbose,conf=conf,mc_nms=mc_nms)
        else:
            print("Model type not recognized")
            return None

    def predict_yolo11_probs(self,img,verbose=False,conf=0.05,mc_nms=True):
        # img is a PIL or a path
        if not self.load_model:
            print("LOADING MODEL")
            self.yolo=YOLO(self.weights,verbose=verbose).to(self.device)
            self.model=self.yolo.model

        # Image loading
        if isinstance(img,str):
            img=Image.open(img)
        img_np=self.letter_box(image=np.array(img))
        # Reuse the transform built in __init__ and respect the configured device
        img_tensor=self.transform(img_np).unsqueeze(0).to(self.device)

        # inference
        prediction,_ = self.model(img_tensor)  
        
        outputs,outputs_oh=non_max_suppression_modified(prediction, conf_thres=conf, iou_thres=0.4,agnostic=mc_nms)
        outputs=outputs[0]
        outputs_oh=outputs_oh[0]
        

        width,height=img.size
        boxes,scores,labels,probs=[],[],[],[]
        for out,out_oh in zip(outputs,outputs_oh):
            out=out.cpu().numpy()
            out_oh=out_oh.cpu().numpy()
            box=out[:4]
            x1,y1,x2,y2=box
            # Rescale to fit the original image size
            x1,y1,x2,y2=max(0,x1*width/384),max(0,y1*height/384),min(width,x2*width/384),min(height,y2*height/384)
            score=out[4]
            label=out[5]
            prob=out_oh[4:]
            boxes.append((np.round(x1).astype(int),np.round(y1).astype(int),np.round(x2).astype(int),np.round(y2).astype(int)))
            scores.append(score)
            labels.append(label)
            probs.append(prob)
        
        boxes = np.asarray(boxes)
        scores = np.asarray(scores)
        labels = np.asarray(labels)
        probs = np.asarray(probs)

        
        if not self.load_model:
            self.unload_model()

        return boxes,scores,labels,probs
    

    def predict_base(self,img,verbose=False,conf=0.05,mc_nms=True):
        if not self.load_model:
            print("LOADING MODEL")
            self.yolo=YOLO(self.weights,verbose=verbose).to(self.device)

        result=self.yolo.predict(img,verbose=verbose,iou=0.4,conf=conf) # iou threshold for non-max suppression

        # may deactivate mc_nms when using wbf in ensemble/tta
        if mc_nms:
            kept=nms(result[0].boxes.xyxy,result[0].boxes.conf,0.4) 
            boxes=result[0].boxes.xyxy[kept]
            scores=result[0].boxes.conf[kept]
            labels=result[0].boxes.cls[kept]
        else:
            boxes = result[0].boxes.xyxy
            scores = result[0].boxes.conf
            labels = result[0].boxes.cls
        

        boxes = boxes.cpu().numpy()
        for i,box in enumerate(boxes):
            x1,y1,x2,y2=box 
            boxes[i]=np.round(x1).astype(int),np.round(y1).astype(int),np.round(x2).astype(int),np.round(y2).astype(int)


        scores=scores.cpu().numpy()
        labels=labels.cpu().numpy()
            
        if not self.load_model:
            self.unload_model()

        return boxes,scores,labels

    def unload_model(self):
        """Unload the model to free up memory."""
        self.yolo = None
        self.model = None
        torch.cuda.empty_cache() 

To ensure compatibility with a wide range of computers, we have implemented an option to load models individually during inference, rather than loading all models at initialization. While this approach makes the model more accessible, it will significantly increase inference time, as model loading itself is a time-consuming task.

Instructions on how to enable the "one by one loader" are provided in the section for filling the predictions CSV.
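The trade-off between memory and speed can be illustrated with a generic wrapper. This is only a sketch of the idea, not the `YoloInference` implementation: `LazyModel` and its `loader` argument are hypothetical names, and the "model" here is a plain Python function.

```python
import gc

class LazyModel:
    """Hypothetical wrapper: keep the model in memory, or rebuild it on every call."""

    def __init__(self, loader, keep_loaded=True):
        self._loader = loader            # zero-argument callable that returns a model
        self._keep_loaded = keep_loaded
        self._model = loader() if keep_loaded else None

    def __call__(self, x):
        # Load on demand when the model is not kept in memory.
        model = self._model if self._model is not None else self._loader()
        try:
            return model(x)
        finally:
            if not self._keep_loaded:
                del model
                gc.collect()  # on a GPU, torch.cuda.empty_cache() would follow

# Dummy loader that counts how many times the "model" is built.
load_count = []

def dummy_loader():
    load_count.append(1)
    return lambda x: x * 2

lazy = LazyModel(dummy_loader, keep_loaded=False)
print(lazy(2), lazy(3), len(load_count))
```

With `keep_loaded=False` the loader runs once per call, which is why this mode is much slower when thousands of images are processed.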

## Detection models ensembling

To enhance the performance of our method, we decided to apply model ensembling to combine the predictions of multiple models. While ensembling is commonly used in classification tasks and is relatively straightforward, there is limited literature on its application in object detection.

Inspired by the paper [Weighted boxes fusion: Ensembling boxes from different object detection models](https://arxiv.org/pdf/1910.13302), we developed a method to find the optimal bounding box by considering a list of boxes and their corresponding confidence scores. In the context of hematology, overlapping boxes are not permissible (as explained in the Non-Maximum Suppression section). Therefore, we included boxes with different predicted labels in our weighted box fusion algorithm and selected the best class based on the weighted score of each class.

<img src="images/wbf.png" alt="Weighted Box Fusion" height="200"/>

Illustration of the weighted box fusion algorithm
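To make the class selection concrete, here is a simplified sketch of how one cluster of overlapping boxes could be fused. It is not the exact implementation used in this notebook: the function name `fuse_cluster` is illustrative, the per-class scores are summed over the whole cluster, and the coordinates are averaged without weighting.

```python
import numpy as np

def fuse_cluster(boxes, score_vectors):
    """Fuse overlapping boxes: vote for the class, average the coordinates."""
    boxes = np.asarray(boxes, dtype=float)
    score_vectors = np.asarray(score_vectors, dtype=float)
    class_totals = score_vectors.sum(axis=0)        # summed score per class
    best_label = int(class_totals.argmax())         # class with the largest total
    fused_box = boxes.mean(axis=0).round().astype(int)
    fused_score = class_totals[best_label] / len(boxes)
    return fused_box, best_label, fused_score

# Three models see the same cell; two vote for class 1, one for class 0.
# All numbers are made up for illustration.
boxes = [(10, 10, 50, 50), (12, 8, 52, 48), (8, 12, 48, 52)]
vectors = [(0.0, 0.9, 0.0), (0.0, 0.6, 0.0), (0.8, 0.0, 0.0)]
box, label, score = fuse_cluster(boxes, vectors)
print(box, label, score)
```

Dividing by the number of boxes in the cluster lowers the score when the models disagree, so contested detections rank lower than unanimous ones.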

### Weighted Box Fusion Algorithm

In [5]:
def compute_iou(box1, box2):
        '''Function to compute the Intersection over Union (IoU) of two bounding boxes'''
        x1_1, y1_1, x2_1, y2_1 = box1
        x1_2, y1_2, x2_2, y2_2 = box2

        inter_x1 = max(x1_1, x1_2)
        inter_y1 = max(y1_1, y1_2)
        inter_x2 = min(x2_1, x2_2)
        inter_y2 = min(y2_1, y2_2)

        inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)

        area1 = (x2_1 - x1_1) * (y2_1 - y1_1)
        area2 = (x2_2 - x1_2) * (y2_2 - y1_2)

        union_area = area1 + area2 - inter_area

        return inter_area / union_area if union_area > 0 else 0

def fuse_box(cluster):
    boxes, sohs= cluster["boxes"], cluster["soh"]

    label_score_sum = defaultdict(float) 
    for soh in sohs:
        for label, score in enumerate(soh[:]):
             label_score_sum[label] += score  
        
    best_label = max(label_score_sum, key=label_score_sum.get)

    filtered = [(box, soh) for box, soh in zip(boxes, sohs) if np.argmax(soh) == best_label]

    not_filtered = [(box, soh) for box, soh in zip(boxes, sohs)]    
    x1_mean = sum(box[0] for box, _ in not_filtered) / len(not_filtered)
    y1_mean = sum(box[1] for box, _ in not_filtered) / len(not_filtered)
    x2_mean = sum(box[2] for box, _ in not_filtered) / len(not_filtered)
    y2_mean = sum(box[3] for box, _ in not_filtered) / len(not_filtered)

    soh_mean = sum(soh for _, soh in filtered) / len(boxes)

    return {"box": [round(x1_mean), round(y1_mean), round(x2_mean), round(y2_mean)], "soh": soh_mean}
        
def wbf(boxes,sohs):
    # Sort given boxes by their scores
    max_scores = np.max(sohs,axis=1)
    sorted_indices = np.argsort(max_scores)[::-1]

    boxes=np.array(boxes)
    sohs=np.array(sohs)

    boxes = boxes[sorted_indices] # B in paper
    sohs = sohs[sorted_indices] 

    # Elements of clusters are Dict of boxes and sohs
    clusters=[] #L in paper
    # Elements of fused_boxes are Dict of box and soh
    fused_boxes=[] #F in paper

    for box, soh in zip(boxes, sohs):
        associated=False
        for i,fused_box in enumerate(fused_boxes):
            f_box=fused_box["box"] 
            iou=compute_iou(box, f_box)
            if iou>0.4: # was 0.55 in the paper
                clusters[i]["boxes"].append(box)
                clusters[i]["soh"].append(soh)
                associated=True
                # Recompute the fused box (coordinates and scores) for this cluster
                fused_boxes[i]=fuse_box(clusters[i])
                break

        if not associated:
            clusters.append({"boxes":[box],"soh":[soh]})
            fused_boxes.append({"box":box,"soh":soh})

    # Return the fused boxes, scores and labels as NumPy arrays
    boxes = np.array([fused_box["box"] for fused_box in fused_boxes])
    scores = np.array([np.max(fused_box["soh"]) for fused_box in fused_boxes])
    labels = np.array([np.argmax(fused_box["soh"]) for fused_box in fused_boxes])

    return boxes, scores, labels


### Yolo Ensembling

In [6]:
class EnsembleYolo:
    def __init__(self, models:list[YoloInference],use_probs=False):
        self.models = models
        self.use_probs=use_probs

        self.meta_model=None
            

    def predict(self,img,conf=0.00001,verbose=False):
        """Run every model on one image and fuse all detections with WBF; returns (boxes, scores, labels)."""
        all_boxes=[]
        all_soh=[]

        for model in self.models:
            if model.model_type=="yolov10" or not self.use_probs:
                boxes,scores,preds=model.predict(img,verbose=verbose,conf=conf,mc_nms=False) 
            else:
                boxes,scores,preds,probs=model.predict(img,verbose=verbose,conf=conf,mc_nms=True,return_probs=True)
            all_boxes.append(boxes)

            sohs=[]

            if not self.use_probs or model.model_type=="yolov10":
                for score,pred in zip(scores,preds):
                    soh=F.one_hot(torch.tensor(pred).long(), num_classes=23)*score
                    sohs.append(soh)
            else:
                for prob in probs:
                    #convert prob to tensor
                    prob=torch.tensor(prob)
                    sohs.append(prob)

            all_soh.append(sohs)

        list_all_boxes = [box for boxes in all_boxes for box in boxes]
        list_all_soh = [soh for sohs in all_soh for soh in sohs]

        return wbf(list_all_boxes,list_all_soh)

## Filtering of boxes 

The test.csv file gives the exact number of detections required for each image. Our strategy was to set a very low confidence threshold for the YOLO models so that we predict more boxes than the image actually contains. We then retained only the boxes with the highest scores.

In [7]:
def filter_boxes(occurences,boxes,scores,labels):
    data = list(zip(scores, boxes, labels))
    data.sort(reverse=True, key=lambda x: x[0])
    data = data[:len(occurences)]
    scores, boxes, labels = zip(*data)
    return list(boxes), list(scores), list(labels)

## Inference pipeline to fill the csv

This is the main pipeline for processing all the test images and filling the CSV.

You can download the best pretrained weights from this link: [Google Drive](https://drive.google.com/drive/folders/1gDwqRtLoKqwLIaGFd2SwPffzibOtIEmx?usp=sharing)

In [None]:
# Mapping of the classes
itos={0:'B', 1:'BA', 2:'EO', 3:'Er', 4:'LAM3', 5:'LF', 6:'LGL', 7:'LH_lyAct', 8:'LLC', 9:'LM', 10:'LY', 11:'LZMG', 12:'LyB', 13:'Lysee', 14:'M', 15:'MBL', 16:'MM', 17:'MO', 18:'MoB', 19:'PM', 20:'PNN', 21:'SS', 22:'Thromb'}

# PATH TO YOUR IMAGES
data_path="data/Cytologia/images"

# PATH TO THE TEST CSV
test_csv_path="test.csv"

# PATH WHERE THE PREDICTIONS WILL BE SAVED
csv_path="predictions.csv"

if test_csv_path != csv_path:
    shutil.copy(test_csv_path,csv_path)

# Load the csv
df = pd.read_csv(csv_path)

# Create the YOLO models and load them if possible, else set the parameter load_model to False
yolo_engine1=YoloInference("models/best/yolo11n384_blk.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine2=YoloInference("models/best/yolo11m384_iou.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine3=YoloInference("models/best/yolov10n384_blk.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine4=YoloInference("models/best/yolov10m384_iou.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine5=YoloInference("models/best/yolo11x384_iou.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine6=YoloInference("models/best/yolov10s384_iou.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory
yolo_engine7=YoloInference("models/best/yolo11n384_nc.pt",device="cuda",load_model=True) # load_model=False => add this parameter to avoid loading the model if your computer has limited memory

ens_engine=EnsembleYolo([yolo_engine1,yolo_engine2,yolo_engine3,yolo_engine4,yolo_engine5,yolo_engine6,yolo_engine7],use_probs=False)

# Add the columns if they don't exist
if not {'x1', 'y1', 'x2', 'y2', 'class'}.issubset(df.columns):
    for col in ['x1', 'y1', 'x2', 'y2', 'class']:
        df[col] = np.nan

# Get unique NAME (aka the image names)
names = df["NAME"].unique()
tqdm_names= tqdm(names)

# Loop over all the images
for name in tqdm_names:
    img_path=f"{data_path}/{name}"

    # Count the number of occurences of the image in the dataframe to get the number of cells to predict
    occurences = df[df["NAME"]==name]
    # Get the corresponding trustii_ids
    trustii_ids = occurences["trustii_id"].tolist()

    # Inference
    boxes,scores,labels=ens_engine.predict(img_path,conf=0.00001,verbose=False) 

    boxes=list(boxes)
    scores=list(scores)
    labels=list(labels)
    
    # Filter the boxes if there are more boxes than cells
    if len(occurences) < len(boxes):
        boxes,scores,labels=filter_boxes(occurences,boxes,scores,labels)

    # Update the dataframe with the predictions
    for idx,(box,label) in enumerate(zip(boxes,labels)):
        x1,y1,x2,y2 = box
        trustii_id = trustii_ids[idx]
        cls=itos[label]
        df.loc[df["trustii_id"] == trustii_id, ["x1", "y1", "x2", "y2", "class"]] = [x1, y1, x2, y2, cls]

    # If there are more cells in the image than predicted (not supposed to happen since we set the confidence threshold to 0.00001, but still necessary just in case)
    for i in range(len(boxes),len(occurences)):
        trustii_id = trustii_ids[i]
        df.loc[df["trustii_id"] == trustii_id, ["x1", "y1", "x2", "y2", "class"]] = [0, 0, 0, 0, 'PNN']

df.to_csv(csv_path, index=False)
print("CSV prediction file saved in ",csv_path)


## Training method explained

A significant part of our results is due to the training method we employed. As mentioned earlier, we chose YOLO as our backbone model to ensure low inference time while maintaining decent performance.

After visualizing parts of the dataset, we concluded that cleaning the data would improve the convergence and stability of the training process. This decision was driven by several factors:

- White blood cells (WBCs) located at the edges of images are sometimes annotated and sometimes not, which can lead to confusion during training.
- A portion of the dataset contains annotation errors, including misplaced bounding boxes, bounding boxes outside the image, or missing annotations.

The quality of the classification for each cell was not evaluated, as our team lacks the biological expertise to accurately classify cells with certainty. However, such noise in the dataset, if present, could significantly affect performance.

### Dataset cleaning

To clean the training dataset, we trained a YOLO model using cross-validation and cleaned the validation data from each fold. This approach ensures that our model does not overfit the training data and can be effectively used for data curation.

The annotations were formatted in the standard YOLO format, and the 5 models (k=5 for our cross-validation) were trained for 250 epochs using the [Ultralytics Library](https://docs.ultralytics.com/models/yolo11/).
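The cross-validation split can be sketched as follows. This is an illustration under assumptions, not our exact script: it assumes the split is done at the image level, and `kfold_indices` is a hypothetical helper.

```python
import numpy as np

def kfold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs so that every item is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_items), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# With 23 images and k=5, the folds hold 5, 5, 5, 4 and 4 validation images.
for train_idx, val_idx in kfold_indices(23, k=5):
    print(len(train_idx), len(val_idx))
```

Because each image lands in exactly one validation fold, every annotation is reviewed by a model that never saw it during training.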


After training, we cleaned the dataset as follows:

- For each YOLO model, we predicted bounding boxes and classes on the validation data.

- We matched the ground truth boxes provided in the train.csv with the YOLO predictions, retaining the YOLO bounding boxes (which were empirically more accurate than the manual annotations) while preserving the ground truth class.
- For YOLO detections with a sufficient score that did not match any ground truth boxes, we masked the corresponding part of the image with a black mask to ensure no unannotated cells remained in the training dataset (whether they were WBCs at the border or unannotated WBCs) (see left figure).
- For ground truth boxes with no matching YOLO boxes (IoU < 0.4), we removed these bounding boxes from the dataset, assuming they were incorrect annotations (see right figure).

The figures below illustrate the cleaning procedure described above.

<img src="images/curation1.png" alt="Curation process 1" height="400"/>    <img src="images/curation2.png" alt="Curation process 2" height="400"/>  
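The matching rules above can be sketched in a few lines. This is a simplified illustration rather than our curation script: the function names are hypothetical, each ground-truth box is matched greedily to its best prediction, and the score threshold of 0.5 for masking is an assumed value (only the IoU threshold of 0.4 comes from the description above).

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def curate(gt_boxes, gt_labels, pred_boxes, pred_scores, iou_thr=0.4, score_thr=0.5):
    """Return (kept annotations, regions to mask, indices of dropped ground truths)."""
    kept, matched, dropped = [], set(), []
    for g, (gt, label) in enumerate(zip(gt_boxes, gt_labels)):
        ious = [iou(gt, p) for p in pred_boxes]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr:
            kept.append((tuple(pred_boxes[best]), label))  # predicted box, ground-truth class
            matched.add(best)
        else:
            dropped.append(g)  # no prediction overlaps enough: treat as a bad annotation
    # Confident predictions without any ground truth are masked in the image.
    to_mask = [tuple(p) for i, (p, s) in enumerate(zip(pred_boxes, pred_scores))
               if i not in matched and s >= score_thr]
    return kept, to_mask, dropped

# Toy example with made-up coordinates and scores.
kept, to_mask, dropped = curate(
    gt_boxes=[(0, 0, 10, 10), (100, 100, 110, 110)],
    gt_labels=["PNN", "LY"],
    pred_boxes=[(1, 1, 11, 11), (50, 50, 60, 60), (200, 200, 210, 210)],
    pred_scores=[0.9, 0.8, 0.1],
)
print(kept, to_mask, dropped)
```

In this toy case the first annotation keeps its class but takes the predicted box, the second annotation is dropped, and the confident unmatched prediction is scheduled for masking.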




**What didn’t work?** We also experimented with a more aggressive curation process, aiming to mask cells where the ground truth label and the YOLO-predicted label differed significantly. We implemented two approaches for this, hard labels and soft labels, intending to remove potential misannotations made by the labelers.

- Hard labels: This approach simply checks if the predicted label matches the ground truth. If not, the corresponding box is masked.
- Soft labels: This method is more lenient, verifying whether the predicted and ground truth labels belong to the same family of cells. The cell families were defined based on input from the AI chatbot of the challenge and other online resources.   

However, both methods led to a decrease in performance. Some cell types were frequently misclassified, resulting in a highly unbalanced dataset that negatively impacted the training process.

### Training hyperparameters and augmentations

To train the model effectively, we tested different settings for the training process. This section will outline our assumptions and the tests we conducted to validate them. Finally, we will describe the final training setup that we used.

**Augmentations**: First, it’s important to note that our baseline for augmentations was the set of default augmentations provided by the Ultralytics library, which are applied when no specific augmentation parameters are set:

- *Mosaic*: This augmentation creates new training images by combining patches from different images into a single mosaic. Mosaic augmentation is disabled during the last 10 epochs of training.

- *HSV* (hue, saturation, and value): These three augmentation techniques affect the hue, saturation, and value of an image. Given that the classification of white blood cells heavily depends on the staining used to color the cells, we initially attempted to disable these transformations. After testing, we kept the default values for saturation and value but set the hue to 0, even though the default hue value was already very small.

- *Geometrical augmentations* (translate, scale, flip): We kept all other default augmentations, as they are primarily geometric and unlikely to hinder the training process in our case.

- *Deactivated transformations*: The transformations that were disabled by default were not modified.

- *MixUp*: Since some classes were very similar to each other, we extensively experimented with the MixUp augmentation to soften the class boundaries. Specifically, we tested different MixUp thresholds and tried stopping it before the end of training. Our experiments showed that MixUp did not improve performance for our specific problem.

**Selection of models**: To enhance model performance, we employed model ensembling as described in the inference section. For ensembling to work well, the models must complement each other. We experimented with various configurations and ultimately chose a mix of different YOLO11 sizes combined with various YOLOv10 sizes. Since YOLO11 and YOLOv10 produce similar results but use different methods, our experiments showed that they complement each other well.

We specifically chose relatively small versions of YOLO11 and YOLOv10 to strike a balance between inference speed and accuracy: YOLO11n (size 384), YOLO11m (size 384), and YOLOv10n (size 384). The only exception is YOLO11x, which is larger than the others. However, being a YOLO model, its inference speed remains relatively fast.


We used the following models for our best solution :


**Models trained with iou curation**: The models `yolo11m384_iou.pt`,`yolo11x384_iou.pt`,`yolov10s384_iou.pt` and `yolov10m384_iou.pt` were trained on a dataset where ground truth boxes with a low IoU to the predicted ones were masked.  
**Models trained with simple curation**: The models `yolo11n384_blk.pt` and `yolov10n384_blk.pt` were trained on a dataset where ground truth boxes were kept, even if they had a low IoU with the predicted ones.  
**Model not curated**: The `yolo11n384_nc.pt` model was trained on the raw data from the CSV, with only minor corrections made for bounding boxes that were out of the image.

You can download the best pretrained weights from this link: [Google Drive](https://drive.google.com/drive/folders/1gDwqRtLoKqwLIaGFd2SwPffzibOtIEmx?usp=sharing)

**Hyperparameters and training**: We trained all our models for 250 epochs with a batch size of 64 (or 32 for larger models, such as YOLO11l). We reduced the image size from 640 to 384 and held out only 5% of the data for validation to maximize the amount of data available for training.
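For reference, the settings described in this section map onto Ultralytics training arguments roughly as follows. This is a sketch under assumptions, not our exact training script: the dataset file `dataset.yaml` is a hypothetical path, the starting checkpoint `yolo11n.pt` is only an example, and the 5% validation split is assumed to be prepared in the dataset itself rather than passed as an argument.

```python
# Training arguments reflecting the choices described above.
train_kwargs = dict(
    data="dataset.yaml",  # hypothetical path to the curated dataset description
    epochs=250,
    batch=64,             # 32 for the larger models
    imgsz=384,            # reduced from the default 640
    hsv_h=0.0,            # hue augmentation disabled; saturation and value keep their defaults
    mixup=0.0,            # MixUp left off, since it did not help
    close_mosaic=10,      # mosaic disabled for the last 10 epochs (library default)
)

def train(weights="yolo11n.pt"):
    """Fine-tune one model; the import is local so the settings can be read without Ultralytics."""
    from ultralytics import YOLO
    model = YOLO(weights)
    return model.train(**train_kwargs)

print(train_kwargs)
```

Calling `train()` once per architecture and size would produce the individual models that are later combined in the ensemble.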