# Whole training pipeline

This notebook outlines the entire training pipeline used to develop the seven models included in our ensemble for the submission. To enhance readability, we have imported custom functions from various Python files within this GitHub repository. Unlike [`inference.ipynb`](inference.ipynb), this notebook cannot function as a standalone file.

We detail our complete training process in five steps:
1. Converting the CSV training data to YOLO format for training
2. Training uncurated YOLO models with 5-fold cross-validation
3. Using the 5 fold models to curate their corresponding validation data in 2 different ways, and saving the curated data in new CSV files
4. Creating the YOLO datasets corresponding to the curated training CSV files
5. Training the 7 models used for the ensemble

This figure summarizes the training process:

<img src="images/training.png" alt="training process" height="500"/>

## Dependencies

We first need to install some Python packages:

In [None]:
# Needed installations
!pip3 install pandas
!pip3 install tqdm
!pip3 install pillow
!pip3 install ultralytics # automatically installs the latest versions of PyTorch and Torchvision

Then we import the required modules:

In [1]:
import os
import shutil

import pandas as pd
from tqdm import tqdm

import torch

from detection import YoloConfig, YoloTrainer, YoloInference
from notebooks.utils_notebook import split_dataset_from_csv,curate_image
from utils import get_device


# Please set the path to the data folder (where images and annotations are stored).
# Use an absolute path: the `dataset` argument of YoloConfig below requires one.
data_folder='path/to/data/folder'


## Converting the CSV training data to YOLO format for training

The first step is to convert the training data from `train.csv` to a format compatible with YOLO. This is straightforward: we copy each image into an `images` folder and create its labels in a `labels` folder. Each `image.jpg` has a corresponding `image.txt` that lists the detections within the image.

In [None]:
data_path=f"{data_folder}/Cytologia/images/"
new_data_path=f"{data_folder}/Cytologia_yolo/"
os.makedirs(new_data_path,exist_ok=True)
csv_path=f"{data_folder}/Cytologia/train.csv"

split_dataset_from_csv(new_data_path,data_path, csv_path)
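
As a rough illustration of what one label line looks like, the sketch below converts a pixel-space corner box (the `x1`, `y1`, `x2`, `y2` columns used later in this notebook) into the normalised `class cx cy w h` line that YOLO label files use. The helper name `to_yolo_line` and the example numbers are hypothetical; the actual conversion is done by `split_dataset_from_csv` and may differ in its details.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to a YOLO label line.

    A YOLO label stores the class index, then the box centre x, centre y,
    width and height, each normalised by the image size.
    """
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical example: a 100x50 px box in a 400x200 px image, class index 3
line = to_yolo_line(3, 100, 50, 200, 100, 400, 200)
print(line)  # 3 0.375000 0.375000 0.250000 0.250000
```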

## Training uncurated YOLO models with 5-fold cross-validation

Our `YoloConfig` and `YoloTrainer` classes handle cross-validation: simply set the `folds` parameter to the desired number of folds. If `folds` is set to 1 (or left undefined), cross-validation is not applied.

In [None]:
device=get_device()

classes=['B', 'BA', 'EO', 'Er', 'LAM3', 'LF', 'LGL', 'LH_lyAct', 'LLC', 'LM', 'LY', 'LZMG', 'LyB', 'Lysee', 'M', 'MBL', 'MM', 'MO', 'MoB', 'PM', 'PNN', 'SS', 'Thromb']

conf=YoloConfig(
    dataset=f"{data_folder}/Cytologia_yolo", # WARNING : path must be an absolute path
    backbone="yolo11n.pt",
    img_size=384, # Reduced image size to 384 for faster training and inference
    identifier="no_curation",
    nc=23,
    classes=classes,
    device=device,
    epochs=100, 
    batch_size=64,
    folds=5, 
)
trainer=YoloTrainer(conf)
trainer.train()

del trainer
torch.cuda.empty_cache()
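
To make the role of the folds concrete, here is a minimal sketch of a k-fold split over a list of image names. It is not the code used by `YoloTrainer`; the helpers `make_folds` and `fold_split`, the seed, and the file names are assumptions for illustration only.

```python
import random


def make_folds(image_paths, k, seed=0):
    """Shuffle the images once, then deal them into k disjoint folds."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    return [paths[i::k] for i in range(k)]


def fold_split(folds, i):
    """Fold i is the validation set; all other folds form the training set."""
    val = folds[i]
    train = [p for j, fold in enumerate(folds) if j != i for p in fold]
    return train, val


# Hypothetical image names
images = [f"img_{n}.jpg" for n in range(10)]
folds = make_folds(images, k=5)
train, val = fold_split(folds, 0)
print(len(train), len(val))  # 8 2
```

Each image lands in exactly one validation fold, so every image can later be predicted by a model that never saw it during training. The curation step below relies on this property.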

## Using the 5 fold models to curate their corresponding validation data in 2 different ways, and saving the curated data in new CSV files

After training, we cleaned the dataset as follows:

- For each YOLO model, we predicted bounding boxes and classes on the validation data.

- We matched the ground truth boxes provided in the train.csv with the YOLO predictions, retaining the YOLO bounding boxes (which were empirically more accurate than the manual annotations) while preserving the ground truth class.

- For YOLO detections with a sufficient score that did not match any ground truth box, we masked the corresponding part of the image in black so that no unannotated cells remained in the training dataset, whether WBCs at the image border or other unannotated WBCs (see left figure).

- *Optional*: For ground truth boxes with no matching YOLO box (IoU < 0.4), we also masked the box, assuming the annotation could be incorrect (see right figure).

The figure below illustrates the cleaning procedure we described.

<img src="images/curation1.png" alt="Curation process 1" height="400"/>    <img src="images/curation2.png" alt="Curation process 2" height="400"/>  

As mentioned in the [`README.md`](README.md), the final curation step is *optional*. We created two versions of the dataset, one with this curation step and one without, to increase variability in the data used to train our models.
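
The matching rules described above can be sketched in plain Python as follows. This is not the repository's `curate_image` function, only an illustration under stated assumptions: the names `iou` and `curate`, the `score_thr` value of 0.5, the `mask_unmatched_gt` flag, and the choice to keep an unmatched annotation unchanged when the optional step is off are all hypothetical. Only the IoU threshold of 0.4 comes from the description above.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def curate(gt_boxes, gt_classes, preds, iou_thr=0.4, score_thr=0.5,
           mask_unmatched_gt=False):
    """preds is a list of (box, score). Returns (kept, to_mask)."""
    kept, to_mask, used = [], [], set()
    for gt, cls in zip(gt_boxes, gt_classes):
        # Find the best still-unused prediction for this ground truth box
        best_j, best_iou = None, 0.0
        for j, (box, _) in enumerate(preds):
            if j in used:
                continue
            value = iou(gt, box)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None and best_iou >= iou_thr:
            used.add(best_j)
            kept.append((preds[best_j][0], cls))  # predicted box, annotated class
        elif mask_unmatched_gt:
            to_mask.append(gt)  # optional step: hide a possibly wrong annotation
        else:
            kept.append((gt, cls))  # assumption: annotation kept unchanged
    # Confident predictions without any ground truth are masked out
    for j, (box, score) in enumerate(preds):
        if j not in used and score >= score_thr:
            to_mask.append(box)
    return kept, to_mask


# Hypothetical boxes and scores
gt_boxes = [(10, 10, 50, 50), (100, 100, 140, 140)]
gt_classes = ["LY", "PNN"]
preds = [((12, 12, 52, 52), 0.9), ((200, 200, 240, 240), 0.8), ((300, 300, 320, 320), 0.1)]
kept, to_mask = curate(gt_boxes, gt_classes, preds, mask_unmatched_gt=True)
print(kept)     # [((12, 12, 52, 52), 'LY')]
print(to_mask)  # [(100, 100, 140, 140), (200, 200, 240, 240)]
```

The regions returned in `to_mask` would then be painted black in the image, while the entries in `kept` become the rows of the curated CSV.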

Here is the code used to create the 2 new CSV training files corresponding to the 2 curated datasets:

**For the first dataset**: 

In [None]:
mode="msk_iou"

origin_csv_path=f"{data_folder}/Cytologia/train.csv"

csv_path=f"{data_folder}/Cytologia/train_{mode}.csv"
if not os.path.exists(csv_path):
    shutil.copy(origin_csv_path,csv_path)
else:
    raise FileExistsError(f"{csv_path} already exists; remove it before re-running this cell")

df = pd.read_csv(csv_path)
images_list = df["NAME"].unique()
path=f"{data_folder}/Cytologia/images/"

def get_jpg_files_from_directory(directory_path):
    jpg_files = [f for f in os.listdir(directory_path) if f.endswith('.jpg')]
    return jpg_files

weights_path="models/detection/Cytologia_yolo/yolo11n/384/no_curation_cv/"

# count number of folders (one per fold) to find k
k=len([name for name in os.listdir(weights_path) if os.path.isdir(os.path.join(weights_path, name))])

In [None]:
for i in range(k):
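    print(f"fold {i+1}/{k}")
    # Build paths with os.path.join to avoid doubled separators (weights_path ends with "/")
    val_txt=os.path.join(weights_path,f"val_fold{i}.txt")
    yolo_engine=YoloInference(os.path.join(weights_path,f"fold_{i}","train","weights","best.pt"),device="cuda")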

    with open(val_txt) as f:
        list_image_paths = f.readlines()
    list_image_paths = [x.strip() for x in list_image_paths]

    new_data = []
    tqdm_fold=tqdm(list_image_paths,desc="Processing images",unit="image")
    
    for img_path in tqdm_fold:
        name=img_path.split("/")[-1]
        df_img=df[df['NAME']==name]
        boxes = df_img[['x1', 'y1', 'x2', 'y2']].apply(tuple, axis=1).tolist()
        classes = df_img['class'].tolist()
        yolo_output=yolo_engine.predict(img_path)
        curate_image(boxes,yolo_output,img_path,classes,df,new_data,mode=mode)

    del yolo_engine
    torch.cuda.empty_cache()

    if new_data:
        df = pd.concat([df, pd.DataFrame(new_data)], ignore_index=True)                          

df.to_csv(csv_path, index=False)

**For the second dataset**: 

In [11]:
mode="msk_blk"

origin_csv_path=f"{data_folder}/Cytologia/train.csv"

csv_path=f"{data_folder}/Cytologia/train_{mode}.csv"
if not os.path.exists(csv_path):
    shutil.copy(origin_csv_path,csv_path)
else:
    raise FileExistsError(f"{csv_path} already exists; remove it before re-running this cell")

df = pd.read_csv(csv_path)
images_list = df["NAME"].unique()
path=f"{data_folder}/Cytologia/images/"


In [None]:
for i in range(k):
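    print(f"fold {i+1}/{k}")
    # Build paths with os.path.join to avoid doubled separators (weights_path ends with "/")
    val_txt=os.path.join(weights_path,f"val_fold{i}.txt")
    yolo_engine=YoloInference(os.path.join(weights_path,f"fold_{i}","train","weights","best.pt"),device="cuda")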

    with open(val_txt) as f:
        list_image_paths = f.readlines()
    list_image_paths = [x.strip() for x in list_image_paths]

    new_data = []
    tqdm_fold=tqdm(list_image_paths,desc="Processing images",unit="image")
    
    for img_path in tqdm_fold:
        name=img_path.split("/")[-1]
        df_img=df[df['NAME']==name]
        boxes = df_img[['x1', 'y1', 'x2', 'y2']].apply(tuple, axis=1).tolist()
        classes = df_img['class'].tolist()
        yolo_output=yolo_engine.predict(img_path)
        curate_image(boxes,yolo_output,img_path,classes,df,new_data,mode=mode)

    del yolo_engine
    torch.cuda.empty_cache()

    if new_data:
        df = pd.concat([df, pd.DataFrame(new_data)], ignore_index=True)                          

df.to_csv(csv_path, index=False)

## Creating the YOLO datasets corresponding to the curated training CSV files

Next, we create the 2 YOLO-format datasets corresponding to the 2 newly created CSV files:

In [None]:
modes=["msk_iou","msk_blk"]
data_path=f"{data_folder}/Cytologia/images/"

for mode in modes: 
    new_data_path=f"{data_folder}/Cytologia_{mode}/"
    os.makedirs(new_data_path,exist_ok=True)
    csv_path=f"{data_folder}/Cytologia/train_{mode}.csv"
    split_dataset_from_csv(new_data_path,data_path, csv_path)

## Training the 7 models used for the ensemble

With all the necessary datasets prepared, we will train the seven models for ensembling:

- **3 models on the curated dataset 1**: Yolo11m, Yolo11x, and Yolov10m
- **3 models on the curated dataset 2**: Yolo11n, Yolov10n, and Yolov10s
- **1 model on the uncurated dataset**: Yolo11n  

We prioritized using smaller YOLO models (except for Yolo11x) to maintain relatively low inference time, even with an ensemble of seven models.

If you need a single model to meet inference speed constraints, we recommend using Yolo11m trained on dataset1 for the best standalone performance, or Yolo11n trained on dataset1 for optimal inference speed with good performance. Details about inference speed are available in the [`README.md`](README.md) and the [`inference.ipynb`](inference.ipynb) files.

### Training models on dataset 1

In [None]:
mode="msk_iou"

device=get_device()

classes=['B', 'BA', 'EO', 'Er', 'LAM3', 'LF', 'LGL', 'LH_lyAct', 'LLC', 'LM', 'LY', 'LZMG', 'LyB', 'Lysee', 'M', 'MBL', 'MM', 'MO', 'MoB', 'PM', 'PNN', 'SS', 'Thromb']

backbones=["yolo11m.pt","yolo11x.pt","yolov10m.pt"]

for backbone in backbones:
    conf=YoloConfig(
        dataset=f"{data_folder}/Cytologia_{mode}", # WARNING : path must be an absolute path
        backbone=backbone,
        img_size=384, 
        identifier="curation250",
        nc=23,
        classes=classes,
        device=device,
        epochs=250, 
        batch_size=64,
        folds=1, 
        val_split=0.05,
    )
    trainer=YoloTrainer(conf)
    trainer.train()

    del trainer
    torch.cuda.empty_cache()

### Training models on dataset 2

In [None]:
mode="msk_blk"

device=get_device()

classes=['B', 'BA', 'EO', 'Er', 'LAM3', 'LF', 'LGL', 'LH_lyAct', 'LLC', 'LM', 'LY', 'LZMG', 'LyB', 'Lysee', 'M', 'MBL', 'MM', 'MO', 'MoB', 'PM', 'PNN', 'SS', 'Thromb']

backbones=["yolo11n.pt","yolov10n.pt","yolov10s.pt"]

for backbone in backbones:
    conf=YoloConfig(
        dataset=f"{data_folder}/Cytologia_{mode}", # WARNING : path must be an absolute path
        backbone=backbone,
        img_size=384, 
        identifier="curation250",
        nc=23,
        classes=classes,
        device=device,
        epochs=250, 
        batch_size=64,
        folds=1, 
        val_split=0.05,
    )
    trainer=YoloTrainer(conf)
    trainer.train()

    del trainer
    torch.cuda.empty_cache()

### Training the model on the uncurated dataset

In [None]:

device=get_device()

classes=['B', 'BA', 'EO', 'Er', 'LAM3', 'LF', 'LGL', 'LH_lyAct', 'LLC', 'LM', 'LY', 'LZMG', 'LyB', 'Lysee', 'M', 'MBL', 'MM', 'MO', 'MoB', 'PM', 'PNN', 'SS', 'Thromb']

conf=YoloConfig(
    dataset=f"{data_folder}/Cytologia_yolo", # WARNING : path must be an absolute path
    backbone="yolo11n.pt", # set explicitly: the uncurated model is Yolo11n, as listed above
    img_size=384, 
    identifier="no_curation250",
    nc=23,
    classes=classes,
    device=device,
    epochs=250, 
    batch_size=64,
    folds=1, 
    val_split=0.05,
)
trainer=YoloTrainer(conf)
trainer.train()

We have now completed training our seven models. For the inference process, please refer to the [`inference.ipynb`](inference.ipynb) file.