In [None]:
%load_ext autoreload
%autoreload 2

# MIDOG 2025 Dataset Preparation and Algorithm Setup

This notebook shows how you could set up the MIDOG++ dataset and train a simple detection method. Before starting, have a look at `MIDOG2025_01_Exploratory_Analysis.ipynb` to get familiar with the data. If you have not yet downloaded the MIDOG++ dataset, check out the previous notebook or download the dataset with the `download_MIDOGpp.py` script.

Here, we will go through the following steps:
1. Prepare the MIDOG++ dataset for algorithm development.
2. Set up a simple object detection pipeline.
3. Evaluate the model on large images. 

**Note: This notebook should just give you an idea of how to approach the challenge. You can be creative and set the dataset up differently. You are also encouraged to use different models. Have a look at the methods from previous challenges to get a better picture of the task. Here is a link to the [MIDOG 2022 Overview Paper](https://www.sciencedirect.com/science/article/pii/S136184152400080X).**

# Prerequisites

Make sure that you have set up your environment correctly and downloaded the MIDOG++ dataset by following the instructions in the `README.md` or in the previous notebook.


In [None]:
import numpy as np
import pandas as pd 
import openslide 
import matplotlib.pyplot as plt 
import json 
import plotly.express as px 
import torch 

from pathlib import Path

# 1. Prepare the MIDOG++ dataset for algorithm development

In the following steps, we will split the data into a training, validation and test split to get started with the development of our detection algorithm. For easier handling and visualization we will convert the `json` database file into a `pandas` dataframe. 
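
To make the conversion concrete, here is a minimal sketch on a hand-written, COCO-style dictionary. The file names, tumor type strings, and coordinates are invented for illustration; only the keys (`images`, `annotations`, `bbox`, `category_id`, `image_id`) mirror those accessed by the function in the next cell. The sketch assumes that boxes are stored as `[x1, y1, x2, y2]`, which is what the center computation in that function implies.

```python
import pandas as pd

# Toy stand-in for the database file (all values are made up).
toy_db = {
    "images": [
        {"id": 1, "file_name": "001.tiff", "tumor_type": "type_a"},
        {"id": 2, "file_name": "002.tiff", "tumor_type": "type_b"},
    ],
    "annotations": [
        {"image_id": 1, "bbox": [100, 200, 150, 250], "category_id": 1},
        {"image_id": 1, "bbox": [400, 400, 450, 450], "category_id": 2},
        {"image_id": 2, "bbox": [10, 20, 60, 70], "category_id": 1},
    ],
}

images = pd.DataFrame(toy_db["images"]).rename(columns={"id": "image_id"})
annos = pd.DataFrame(toy_db["annotations"])

# Box center, assuming [x1, y1, x2, y2] corner format.
annos["x"] = annos["bbox"].apply(lambda b: (b[0] + b[2]) // 2)
annos["y"] = annos["bbox"].apply(lambda b: (b[1] + b[3]) // 2)

# One row per annotation, enriched with the image metadata.
toy_df = images.merge(annos.drop(columns="bbox"), on="image_id", how="right")
print(toy_df)
```

Each row of `toy_df` is a single annotation with its center coordinates and the metadata of the image it belongs to, which is the shape of table the rest of this notebook works with.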

In [None]:
def create_train_val_test_datasets(
        json_file, 
        train_ratio: float = 0.7, 
        val_ratio: float = 0.2, 
        test_ratio: float = 0.1,
        random_seed: int = 42
        ):
    """Converts the json file to pandas dataframe and creates train, val, and test split containing all tumortypes."""

    # Verify ratios sum to 1
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-10, 'Ratios must sum to 1.'

    # Load the database and close the file handle afterwards
    with open(json_file, 'r') as f:
        database = json.load(f)

    # Read image data
    image_df = pd.DataFrame.from_dict(database['images'])
    image_df = image_df.drop(columns=['license', 'width', 'height'])
    image_df = image_df.rename({'id':'image_id', 'tumor_type':'tumortype', 'file_name':'filename'}, axis=1)

    # Group by tumortype and sample training split
    train_ids = image_df.groupby('tumortype').sample(frac=train_ratio, random_state=random_seed)['image_id']

    # Sample validation split from the remaining samples 
    remaining_df = image_df.query('image_id not in @train_ids')
    adjusted_val_ratio = val_ratio / (val_ratio + test_ratio)
    valid_ids = remaining_df.groupby('tumortype').sample(frac=adjusted_val_ratio, random_state=random_seed)['image_id']

    # Assign splits, test samples are neither train nor val samples 
    image_df['split'] = 'test'  
    image_df.loc[image_df['image_id'].isin(train_ids), 'split'] = 'train'
    image_df.loc[image_df['image_id'].isin(valid_ids), 'split'] = 'val'

    # Read annotations and convert to center locations 
    annotations_df = pd.DataFrame.from_dict(database['annotations'])
    annotations_df = annotations_df.assign(x=annotations_df['bbox'].apply(lambda x: int((x[0] + x[2]) / 2)))
    annotations_df = annotations_df.assign(y=annotations_df['bbox'].apply(lambda x: int((x[1] + x[3]) / 2)))
    annotations_df = annotations_df.drop(columns=['bbox', 'labels', 'id'])

    # Merge data, rename and rearrange
    comb_df = image_df.merge(annotations_df, how='right', on='image_id')
    comb_df = comb_df.rename({'category_id': 'label', 'image_id': 'slide'}, axis=1)
    comb_df = comb_df[['x', 'y', 'label', 'filename', 'slide', 'split', 'tumortype']]

    return comb_df

## Split the data into train, val, and test split

For the purpose of this notebook, we will split the data into 70/20/10 train, val, and test splits. This is simply to show you how the pipeline works. For challenge participation you can think of different ways to split your data. You may want to train separate models for individual tumor types, or use all images for training and validation and test your algorithm on the preliminary test set. The choice is up to you. However, it is always good practice to test your method on some unseen cases.
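
The per-tumor-type sampling can be tried in isolation on a toy table. The sketch below follows the same two-stage logic as `create_train_val_test_datasets`; the image ids and tumor type names are made up.

```python
import pandas as pd

# Toy table: 10 images for each of two made-up tumor types.
toy_images = pd.DataFrame({
    "image_id": range(20),
    "tumortype": ["type_a"] * 10 + ["type_b"] * 10,
})

# Stage 1: draw 70% of every tumor type for training.
train_ids = toy_images.groupby("tumortype").sample(frac=0.7, random_state=42)["image_id"]

# Stage 2: split the remainder 2:1 into validation and test, again per tumor type.
rest = toy_images[~toy_images["image_id"].isin(train_ids)]
val_ids = rest.groupby("tumortype").sample(frac=0.2 / (0.2 + 0.1), random_state=42)["image_id"]

# Everything that is neither train nor val ends up in the test split.
toy_images["split"] = "test"
toy_images.loc[toy_images["image_id"].isin(train_ids), "split"] = "train"
toy_images.loc[toy_images["image_id"].isin(val_ids), "split"] = "val"

split_counts = pd.crosstab(toy_images["tumortype"], toy_images["split"])
print(split_counts)
```

Because the sampled fraction is rounded within each group, tumor types with only a few images can deviate from the nominal ratios. It is worth checking the resulting counts before training.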

In [None]:
# Set the path to your dataset file 
dataset_file = '/data/patho/MIDOGpp/images/MIDOGpp.json'

# Set your train, val, and test ratios
train_ratio = 0.7
val_ratio = 0.2
test_ratio = 0.1
random_seed = 42

# Create the dataset
dataset = create_train_val_test_datasets(dataset_file, train_ratio, val_ratio, test_ratio, random_seed)

# Save the dataset 
dataset.to_csv('demo_dataset.csv', index=False)
dataset.head()

Let's have a look at the different splits that we just created. 

In [None]:
# The number of files per train, val, test split
print(dataset.groupby('split')['filename'].nunique())

In [None]:
# The distribution of mitotic figures in each split 
for split in dataset['split'].unique():
    split_annos = dataset.query('split == @split')
    row = []
    for image_id in split_annos["slide"].unique():
        image_annos = split_annos.query('slide == @image_id')
        row.append([image_id, len(image_annos[image_annos['label'] == 1]), "mitotic figure"])
        row.append([image_id, len(image_annos[image_annos['label'] == 2]), "hard negative"])
    tumortype_meta = pd.DataFrame(row, columns=["image_id", "total", "type"])
    fig = px.bar(tumortype_meta, x="image_id", y="total", color="type", title=f"{split}: Annotations per image")
    fig.show()

It looks like we have a similar distribution of mitotic figures in each split. This helps our algorithm handle cases with high and low mitotic figure density equally well. 
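
As a numeric complement to the bar charts, the counts per split can also be summarized in a small table. The sketch below uses a made-up annotation table with the same `label` and `split` columns as `dataset`, and assumes the label coding used in the plotting cell above (1 for mitotic figures, 2 for hard negatives).

```python
import pandas as pd

# Toy annotation table with the same columns as `dataset` (values are made up).
toy = pd.DataFrame({
    "slide": [1, 1, 1, 2, 2, 3],
    "label": [1, 2, 1, 1, 2, 2],
    "split": ["train", "train", "train", "val", "val", "test"],
})

# Count annotations per split and label.
counts = pd.crosstab(toy["split"], toy["label"]).rename(
    columns={1: "mitotic figure", 2: "hard negative"}
)

# Share of mitotic figures among all annotations of a split.
counts["mf_fraction"] = counts["mitotic figure"] / counts.sum(axis=1)
print(counts)
```

Replacing `toy` with `dataset` gives the same summary for the real splits. Strongly differing `mf_fraction` values would hint at an unbalanced split.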

# 2. Set up simple object detection pipeline

In the following steps we will create a relatively simple object detection pipeline. We use the `torchvision` library to create an `FCOS` detection algorithm. We use the `pytorch-lightning` library to train our model. 

You can also use the `train.py` script to train your own model. Here, we will follow the same steps as in the script. 

## Create the datamodule 

The dataloading pipeline in this repository is openslide-based and implements a class-specific online sampling strategy. We sample patches around mitotic figures and hard negatives with a certain probability, which lets us draw a very diverse set of patches in every epoch. For instance, if we set `fg_prob=0.5` and `arb_prob=0.25`, 50% of the patches should contain at least one mitotic figure, 25% should contain at least one hard negative, and the remaining 25% are sampled completely at random. 

**Note: We use the hard negatives only to sample challenging patches for the model to learn decision boundaries more efficiently. We do not train the models to detect this class.**
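
The effect of the two probabilities can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the code in `utils/datamodule.py`; the helper `draw_patch_type` is hypothetical and only decides which kind of patch would be sampled next.

```python
import random
from collections import Counter

def draw_patch_type(fg_prob: float, arb_prob: float, rng: random.Random) -> str:
    """Pick the kind of patch to sample next (illustrative helper, not the repository's code)."""
    u = rng.random()
    if u < fg_prob:
        return "mitotic figure"   # patch centered around a mitotic figure
    if u < fg_prob + arb_prob:
        return "random"           # patch at an arbitrary location
    return "hard negative"        # remaining probability mass

rng = random.Random(0)
num_draws = 10_000
draws = Counter(draw_patch_type(0.5, 0.25, rng) for _ in range(num_draws))
fractions = {kind: count / num_draws for kind, count in draws.items()}
print(fractions)
```

Over many draws the fractions approach 0.5, 0.25, and 0.25. Raising `fg_prob` shifts the training set towards patches with mitotic figures at the expense of hard negatives and random background.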

In [None]:
from utils.datamodule import ObjectDetectionDataModule

# Set the parameters
img_dir = '/data/patho/MIDOGpp/images'
dataset_file = 'demo_dataset.csv'
domain_col = 'tumortype'
box_format = 'cxcy'
num_train_samples = 512
num_val_samples = 256
fg_prob = 0.5       # probability of patches with mitotic figures
arb_prob = 0.25     # probability of random patches 
patch_size = 512
batch_size = 12
num_workers = 6

# Create the datamodule
dm = ObjectDetectionDataModule(
    img_dir=img_dir,
    dataset=dataset_file,
    domain_col=domain_col, 
    box_format=box_format,
    num_train_samples=num_train_samples,
    num_val_samples=num_val_samples,
    fg_prob=fg_prob,
    arb_prob=arb_prob,
    patch_size=patch_size,
    batch_size=batch_size,
    num_workers=num_workers
    )

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_images(images, gt_boxes, gt_labels, pred_boxes=None, pred_labels=None, pred_scores=None, legend=False):
    """Visualized images with annotations and optional with predictions."""
    total_images = len(images)
    cols = (total_images + 1) // 2 
    rows = min(2, total_images)
    fig = plt.figure(figsize=(7*cols, 15))
    
    for i, (img, gt_bbox, gt_label) in enumerate(zip(images, gt_boxes, gt_labels)):
        ax = fig.add_subplot(rows, cols, i+1)
        ax.imshow(img.permute(1,2,0))
            
        # Plot ground truth boxes in green
        for b, l in zip(gt_bbox, gt_label):
            x1, y1, x2, y2 = b
            rectangle = patches.Rectangle(
                (x1, y1), x2-x1, y2-y1,
                linewidth=3,
                edgecolor='green',
                facecolor='none',
                label='Ground Truth'
            )
            ax.add_patch(rectangle)
            
        # Plot predicted boxes in red if available
        if pred_boxes is not None and pred_labels is not None:
            pred_bbox = pred_boxes[i]
            pred_label = pred_labels[i]
            scores = pred_scores[i] if pred_scores is not None else None
            
            for j, (b, l) in enumerate(zip(pred_bbox, pred_label)):
                x1, y1, x2, y2 = b
                rectangle = patches.Rectangle(
                    (x1, y1), x2-x1, y2-y1,
                    linewidth=3,
                    edgecolor='red',
                    facecolor='none',
                    label='Prediction'
                )
                ax.add_patch(rectangle)
                # Add score text above the box if available
                if scores is not None:
                    ax.text(x1, y1-5, f'Score: {scores[j]:.2f}', 
                           color='red', fontsize=8)
        
        ax.axis('off')

        if legend:
            # Add legend 
            handles = [
                patches.Patch(color='green', label='Ground Truth'),
                patches.Patch(color='red', label='Prediction')
            ]
            ax.legend(handles=handles, loc='upper right', prop={'size': 10})
        
    plt.tight_layout()
    plt.show()

Let's verify that our dataloading works as expected by visualizing some batches. During training, we use the `albumentations` and `tiatoolbox` libraries to augment the patches. We use a very simple augmentation strategy with some rotations and flips, stain augmentation, and defocusing. 

If you wish to make changes to this augmentation strategy you need to modify the `train_transform` property in `utils/datamodule.py`.
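
Whatever augmentations you choose, geometric ones have to be applied to the boxes as well as to the pixels. The following NumPy sketch shows this for a horizontal flip. It is a hand-written illustration with a hypothetical helper `hflip_with_boxes` and does not reproduce the `albumentations` pipeline of this repository.

```python
import numpy as np

def hflip_with_boxes(image: np.ndarray, boxes: np.ndarray):
    """Flip an HxWxC image left-right and mirror its [x1, y1, x2, y2] boxes."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    mirrored = boxes.copy()
    # The old right edge becomes the new left edge and vice versa.
    mirrored[:, 0] = width - boxes[:, 2]
    mirrored[:, 2] = width - boxes[:, 0]
    return flipped, mirrored

image = np.zeros((512, 512, 3), dtype=np.uint8)
boxes = np.array([[100, 200, 150, 250]], dtype=np.float32)
flipped, new_boxes = hflip_with_boxes(image, boxes)
print(new_boxes)
```

Augmentation libraries with bounding box support perform this bookkeeping for you, which is one reason to rely on them instead of transforming the images by hand.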

In [None]:
# Initialize the training data loader 
train_loader = dm.train_dataloader()

# Visualize some batches 
for idx, (images, targets) in enumerate(train_loader):
    if idx == 5:
        break 
    
    # Extract annotations
    boxes = [t['boxes'] for t in targets]
    labels = [t['labels'] for t in targets]

    # Visualize the images 
    visualize_images(images, boxes, labels)

## Create the object detection model 

For the purpose of this notebook we will create a simple FCOS model with a ResNet18 backbone that is trained on a patch size of 512x512. Other models are available in this repository; you can find them in `utils/model.py`.

In [None]:
from utils.factory import ModelFactory 

# Set up model configurations 
model = 'FCOS'
lr = 0.0001
num_classes = 2                 # mitotic figure + background 
backbone = 'resnet18'
weights = 'IMAGENET1K_V1'
optimizer = 'AdamW'

# init model settings
model_kwargs = {
    'num_classes': num_classes,
    'backbone': backbone,
    'weights': weights,
    'patch_size': patch_size
}

# init module settings 
module_kwargs = {
    'batch_size': batch_size,
    'lr': lr,
    'optimizer': optimizer,
    'scheduler': None
}

fcos = ModelFactory.create(
    model_name=model,
    model_kwargs=model_kwargs,
    module_kwargs=module_kwargs
)

### Set up lightning trainer and callbacks

The goal of this notebook is to demonstrate some of the functionalities of this repository to get you started. To make use of the full training pipeline, you should use the `train.py` script. 

In [None]:
import lightning.pytorch as pl
from lightning.pytorch.callbacks.progress import TQDMProgressBar

# Init the trainer 
trainer = pl.Trainer(
    max_epochs=20,
    accelerator='gpu',
    logger=False,
    gradient_clip_val=1,
    reload_dataloaders_every_n_epochs=1
)

# Start training 
trainer.fit(fcos, datamodule=dm)

We can see that we get decent patch-based performance on the validation set. Next, we can easily evaluate the model on some patches of the test split. 

**Note: Here we only perform a patch-based evaluation where the probability of mitotic figures is higher than in the real-world setting. Hence, you also need to evaluate your model over the entire images of the test split. This could lead to a higher number of false positives. The evaluation over the entire image is outside the scope of this notebook but is included in this repository in `optimize_threshold.py`.**
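
To give a rough idea of what whole-image inference involves, the sketch below computes the top-left corners of overlapping tiles that cover an image completely. The helper names, the overlap of 64 pixels, and the example image size are assumptions made for illustration; whether `optimize_threshold.py` tiles images in this way is not shown here.

```python
def tile_origins(length: int, patch_size: int, overlap: int) -> list:
    """Offsets along one axis so that overlapping tiles cover [0, length)."""
    if length <= patch_size:
        return [0]
    stride = patch_size - overlap  # requires overlap < patch_size
    origins = list(range(0, length - patch_size, stride))
    # Add a final tile that ends exactly at the image border.
    origins.append(length - patch_size)
    return origins

def tile_grid(width: int, height: int, patch_size: int = 512, overlap: int = 64) -> list:
    """All (x, y) tile origins for an image of the given size, row by row."""
    return [
        (x, y)
        for y in tile_origins(height, patch_size, overlap)
        for x in tile_origins(width, patch_size, overlap)
    ]

grid = tile_grid(width=1200, height=700)
print(len(grid), grid[:4])
```

In a full pipeline, each tile would be read with `openslide`, passed through the model, and the predicted boxes shifted by the tile origin. Detections that appear twice in overlapping regions then need to be merged, for example with non-maximum suppression.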

In [None]:
# Test the model on patch-based evaluation of the test split 
trainer.test(fcos, dataloaders=dm.test_dataloader(), ckpt_path='best')

We can also visualize some of the predictions on cases from the test split.

In [None]:
# Get the test dataloader 
test_loader = dm.test_dataloader()

# Set model to eval mode
fcos.eval()
fcos.to('cuda')

# Perform inference on some test patches 
for idx, (images, targets) in enumerate(test_loader):

    if idx == 10:
        break 

    with torch.no_grad():
        images = [img.to('cuda') for img in images]
        preds = fcos(images)

    # Extract annotations
    gt_boxes = [t['boxes'] for t in targets]
    gt_labels = [t['labels'] for t in targets]

    # Extract predictions
    pred_boxes = [p['boxes'].cpu() for p in preds]
    pred_labels = [p['labels'].cpu() for p in preds]
    pred_scores = [p['scores'].cpu() for p in preds]

    # Visualize the images with predictions
    visualize_images([img.cpu() for img in images], gt_boxes, gt_labels, pred_boxes, pred_labels, pred_scores, legend=True)
