In [None]:
%load_ext autoreload
%autoreload 2

# MIDOG 2025 Introduction and Exploratory Analysis

Welcome to the MIDOG 2025 challenge! This notebook is designed to help you take your first steps in participating in this year’s competition. Specifically, it introduces you to the [MIDOG++](https://www.nature.com/articles/s41597-023-02327-4) dataset, a large and diverse collection of mitotic figure annotations. The dataset includes 11,937 mitotic figures across 7 different tumor types, providing a robust foundation for developing your algorithms.

While the MIDOG++ dataset is a key resource, this year’s challenge allows the use of all publicly available datasets. For a non-exhaustive list of additional datasets, please refer to the [dataset section](https://midog2025.deepmicroscopy.org/datasets/) on the challenge website.

For a general overview of the competition, visit the [MIDOG 2025 homepage](https://midog2025.deepmicroscopy.org/). This notebook focuses on guiding you through the initial steps of participating in [Track 1: Mitotic Figure Detection](https://midog2025.deepmicroscopy.org/midog2025-track-1/).

In this notebook, you will:
1. Set up your environment.
2. Download the MIDOG++ dataset.
3. Perform an exploratory analysis of the dataset.

Let’s get started!

# 1. Set Up Your Environment

Before running this notebook, ensure that you have created a new virtual environment and installed the required dependencies. Follow the steps below to set up your environment:

1. Open your terminal and create a new virtual environment:
   ```
   python -m venv midog_env
   ```

2. Activate the environment:
   - On Linux/macOS:
     ```
     source midog_env/bin/activate
     ```
   - On Windows:
     ```
     midog_env\Scripts\activate
     ```

3. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

Once the setup is complete, return to this notebook and select the newly created kernel to ensure the environment is properly configured.

You can verify that you are using the correct environment by running the following cell. It should print a path such as `path/to/your/projects/MIDOG_2025_Guide/midog_env/bin/python`.

In [None]:
import sys
print(sys.executable)
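
As an additional check, here is a minimal sketch that assumes the environment was created with `venv` as described above. Inside such an environment, `sys.prefix` differs from `sys.base_prefix`. The helper name `in_virtualenv` is hypothetical and not part of the challenge code; other environment managers may behave differently.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, the notebook kernel is probably not the one created in the steps above.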

In [None]:
# Load libraries
import cv2
import json 
import numpy as np
import openslide 
import pandas as pd 
import plotly.express as px 

from pathlib import Path

# 2. Download the MIDOG++ Dataset

The first step is to download the dataset. Please note that the total size of the dataset is approximately 65 GB, so the download process may take some time. By default, it is recommended to store the downloaded files in the `images` directory. If you choose a different location, you will need to update the corresponding paths later in this notebook.

**Important:** While the MIDOG++ dataset is a key resource for this challenge, you are encouraged to explore and utilize additional datasets. In particular, whole-slide image datasets can be highly beneficial for your algorithms, as they may include tissue types not present in the MIDOG++ dataset.

You can download the dataset directly within this notebook by uncommenting and running the next cell, or alternatively, you can execute the `download_MIDOGpp.py` script in your terminal.

In [None]:
# # Download MIDOG++ to images/
# !python download_MIDOGpp.py --location images
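
Because the download is large, it can help to check the available disk space first. The following is a rough sketch using only the standard library. The helper `free_space_gb` and the constant `REQUIRED_GB` are illustrative names, and the 65 GB figure is the approximate size stated above.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 65  # approximate dataset size stated above


def free_space_gb(path: Path) -> float:
    """Free disk space in GB (10**9 bytes) on the volume that holds `path`."""
    # Walk up to the nearest existing directory so the check also works
    # before the target folder has been created.
    probe = path.resolve()
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


target = Path("images")
available = free_space_gb(target)

if available < REQUIRED_GB:
    print(f"Only {available:.1f} GB free; about {REQUIRED_GB} GB are needed.")
else:
    print(f"{available:.1f} GB free; enough room for the download.")
```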

# 3. Exploratory Analysis of the MIDOG++ Dataset

Let's begin with the exploratory analysis of the dataset. This should give you an idea of the variety of the images in the dataset and the variation of the mitotic figure class. 

Here is an overview of the images contained in the dataset. 

| No. Cases | Tumor Type | Origin | Species | Scanner | Resolution |
|-----------|------------|---------|----------|----------|------------|
| 50 | Breast Carcinoma | UMC Utrecht | Human | Hamamatsu XR (C12000-22) | 0.23 μm/px |
| 50 | Breast Carcinoma | UMC Utrecht | Human | Hamamatsu S360 | 0.23 μm/px |
| 50 | Breast Carcinoma | UMC Utrecht | Human | Leica ScanScope CS2 | 0.25 μm/px |
| 44 | Lung Carcinoma | VMU Vienna | Canine | 3DHistech Pannoramic Scan II | 0.25 μm/px |
| 55 | Lymphosarcoma | VMU Vienna | Canine | 3DHistech Pannoramic Scan II | 0.25 μm/px |
| 50 | Cutaneous Mast Cell Tumor | FU Berlin | Canine | Aperio ScanScope CS2 | 0.25 μm/px |
| 55 | Neuroendocrine Tumor | UMC Utrecht | Human | Hamamatsu XR (C12000-22) | 0.23 μm/px |
| 85 | Soft Tissue Sarcoma | AMC New York | Canine | 3DHistech Pannoramic Scan II | 0.25 μm/px |
| 15 | Soft Tissue Sarcoma | VMU Vienna | Canine | 3DHistech Pannoramic Scan II | 0.25 μm/px |
| 49 | Melanoma | UMC Utrecht | Human | Hamamatsu XR (C12000-22) | 0.23 μm/px |
| **503** | | | | | |


For more detailed information about the dataset, have a look at the [MIDOG++ paper](https://www.nature.com/articles/s41597-023-02327-4).


For easier handling and visualization of the annotations we will convert them to a pandas dataframe. 

In [None]:
# Path to your downloaded images
image_dir = Path('images')

# Path to your dataset file
dataset_file = image_dir / 'MIDOGpp.json'


with open(dataset_file, 'r') as file:
    database = json.load(file)

    # Set labels 
    categories = {1: 'mitotic figure', 2: 'hard negative'}

    # Read image data
    image_df = pd.DataFrame.from_dict(database['images']).drop(columns='license').rename({'id':'image_id'}, axis=1)

    # Read annotations
    annotations_df = pd.DataFrame.from_dict(database['annotations']).drop(columns=['labels', 'id']).rename({'category_id':'cat'}, axis=1)
    annotations_df['cat'] = annotations_df['cat'].map(categories)
 
    # Merge dataframes
    dataset = image_df.merge(annotations_df, how='right', on='image_id')

dataset.head()
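
To see what the conversion does without the full download, here is a small sketch on a toy dictionary. The field names mirror those used in the cell above; all values are invented for illustration and do not come from `MIDOGpp.json`.

```python
import pandas as pd

# Toy stand-in for MIDOGpp.json (invented values, same field names as above).
toy_database = {
    "images": [
        {"id": 1, "file_name": "001.tiff", "tumor_type": "toy tumor A", "license": 1},
        {"id": 2, "file_name": "002.tiff", "tumor_type": "toy tumor B", "license": 1},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 100, 150, 150], "labels": []},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [300, 320, 350, 370], "labels": []},
    ],
}
toy_categories = {1: "mitotic figure", 2: "hard negative"}

toy_images = (
    pd.DataFrame(toy_database["images"])
    .drop(columns="license")
    .rename(columns={"id": "image_id"})
)
toy_annotations = (
    pd.DataFrame(toy_database["annotations"])
    .drop(columns=["labels", "id"])
    .rename(columns={"category_id": "cat"})
)
toy_annotations["cat"] = toy_annotations["cat"].map(toy_categories)

# how="right" keeps one row per annotation. Image 2 has no annotations
# and therefore does not appear in the merged frame.
toy_dataset = toy_images.merge(toy_annotations, how="right", on="image_id")
print(toy_dataset)
```

The right merge means the resulting frame is annotation-centric: images without any annotation are dropped, which is worth keeping in mind when counting images.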

## Statistics 

The following examples will give you an idea of how the mitotic figure and hard negative annotations are distributed across the different tumor types.

**Note: Be aware that only the mitotic figures are relevant for the challenge. The hard negative (non-mitotic cell) annotations are only meant to visualise the problem.**

### The ratio of mitotic figures vs hard negative annotations

In [None]:
pie_df = pd.DataFrame([
          ["mitotic figure", len(dataset[dataset["cat"] == "mitotic figure"])],
          ["hard negative", len(dataset[dataset["cat"] == "hard negative"])]], columns=["cat", "total"])

fig = px.pie(pie_df, values='total', names='cat', title='Mitotic figures vs hard negatives')
fig.show()
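
The same ratio can also be read off as plain numbers. This is a minimal sketch on an invented toy column; with the real data you would call `value_counts` on `dataset["cat"]` instead.

```python
import pandas as pd

# Invented toy labels standing in for dataset["cat"].
toy = pd.DataFrame(
    {"cat": ["mitotic figure", "hard negative", "hard negative",
             "mitotic figure", "hard negative"]}
)

counts = toy["cat"].value_counts()                # absolute counts per class
shares = toy["cat"].value_counts(normalize=True)  # fractions summing to 1

print(counts)
print(shares.round(2))
```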

### The number of mitotic figures vs hard negatives per tumor type and image 

In [None]:
for tumortype in dataset['tumor_type'].unique():

    tumortype_annos = dataset[dataset['tumor_type'] == tumortype]

    row = []
    for image_id in tumortype_annos["image_id"].unique():
        image_annos = tumortype_annos[tumortype_annos["image_id"] == image_id]
        row.append([image_id, len(image_annos[image_annos['cat'] == 'mitotic figure']), "mitotic figure"])
        row.append([image_id, len(image_annos[image_annos['cat'] == 'hard negative']), "hard negative"])

    tumortype_meta = pd.DataFrame(row, columns=["image_id", "total", "type"])

    fig = px.bar(tumortype_meta, x="image_id", y="total", color="type", title=f"{tumortype}: Annotations per image")
    fig.show()
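
The nested loops above can also be expressed as a single `groupby`. The sketch below uses an invented toy frame with the same column names as `dataset`. Note one difference from the loop version: combinations with zero annotations do not get a row.

```python
import pandas as pd

# Invented toy annotations with the same column names as `dataset`.
toy = pd.DataFrame(
    {
        "tumor_type": ["A", "A", "A", "B", "B"],
        "image_id": [1, 1, 2, 3, 3],
        "cat": ["mitotic figure", "hard negative", "mitotic figure",
                "hard negative", "hard negative"],
    }
)

# One row per (tumor type, image, category) with the number of annotations.
per_image = (
    toy.groupby(["tumor_type", "image_id", "cat"])
    .size()
    .rename("total")
    .reset_index()
)
print(per_image)
```

Filtering `per_image` by `tumor_type` then gives a frame that could be passed to `px.bar` in the same way as `tumortype_meta`.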

## Visual Examples

The examples below showcase images of the various tumor types included in the dataset. It is important to note the differences in tissue appearance across these images, as these variations may result in domain shifts that could impact the performance of your algorithm on unseen domains.

In [None]:
num_images = 10
thumbnail_size = 512

for tumortype in dataset['tumor_type'].unique():
    images = []
    loaded = []  # file names of the thumbnails that were actually read

    tumor_filenames = dataset.query('tumor_type == @tumortype')['file_name'].unique()
    # Never request more samples than there are files for this tumor type
    samples = np.random.choice(tumor_filenames, size=min(num_images, len(tumor_filenames)), replace=False)

    for file in samples:
        file_path = image_dir / file
        if file_path.exists():
            slide = openslide.open_slide(file_path)
            image = slide.get_thumbnail((thumbnail_size, thumbnail_size))
            images.append(np.array(image))
            loaded.append(file)

    # Skip tumor types for which no image was found on disk
    if not images:
        continue

    # Resize all thumbnails to a common shape so they can be stacked
    max_x = max(img.shape[1] for img in images)
    max_y = max(img.shape[0] for img in images)

    imgs = np.array([cv2.resize(img, dsize=(max_x, max_y)) for img in images])

    fig = px.imshow(imgs, facet_col=0, facet_col_wrap=5, labels={'facet_col':'Image'}, title=tumortype)

    # Label each facet with the file it shows (only files that were loaded)
    for i, name in enumerate(loaded):
        fig.layout.annotations[i]['text'] = f'Image: {name}'

    fig.show()


Next, we look at the annotations for one file. Red boxes show mitotic figures, while blue boxes show hard negatives. You can set a different `image_id` to view another image.

In [None]:
thumbnail_size = 1024
image_id = 79
file_path = image_dir / f"{image_id:03d}.tiff"

slide = openslide.open_slide(file_path)
image = slide.get_thumbnail((thumbnail_size, thumbnail_size))

fig = px.imshow(image)

scale_x = slide.level_dimensions[0][0] / image.size[0]
scale_y = slide.level_dimensions[0][1] / image.size[1]

for _, anno in dataset[dataset["image_id"] == image_id].iterrows():
    # Scale the level-0 box coordinates down to thumbnail coordinates
    x0, y0, x1, y1 = anno.bbox[0] / scale_x, anno.bbox[1] / scale_y, anno.bbox[2] / scale_x, anno.bbox[3] / scale_y
    
    fig.add_shape(
        type='rect',
        x0=x0, x1=x1, y0=y0, y1=y1,
        xref='x', yref='y',
        line_color='red' if "mitotic" in anno["cat"] else "blue"
    )
fig.update_layout(
    autosize=False,
    width=image.size[0],
    height=image.size[1],
    )

fig.show()
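
The coordinate scaling used above can be checked in isolation. The sketch below assumes, as the cell above does, that boxes are stored as `(x0, y0, x1, y1)` in level-0 pixels. The function name `scale_bbox` and the slide dimensions are invented for illustration.

```python
def scale_bbox(bbox, level0_size, thumb_size):
    """Map an (x0, y0, x1, y1) box from level-0 pixels to thumbnail pixels."""
    # How many level-0 pixels one thumbnail pixel covers along each axis.
    scale_x = level0_size[0] / thumb_size[0]
    scale_y = level0_size[1] / thumb_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 / scale_x, y0 / scale_y, x1 / scale_x, y1 / scale_y)


# Invented numbers: a 10000 x 8000 px slide shown as a 1000 x 800 px thumbnail.
scaled = scale_bbox((2000, 4000, 2050, 4050), (10000, 8000), (1000, 800))
print(scaled)  # (200.0, 400.0, 205.0, 405.0)
```

A 50 px box shrinks to 5 px at this scale, which is why individual annotations appear as very small rectangles on the thumbnail.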

The following examples display extracted patches of mitotic figures. Pay close attention to the morphological differences between the mitotic figures, as these variations contribute to the complexity of this detection task.

In [None]:
num_samples = 5

for tumortype in dataset['tumor_type'].unique():
    patches = []
    tumor_dataset = dataset.query('tumor_type == @tumortype and cat == "mitotic figure"')
    samples = tumor_dataset.sample(n=num_samples)

    for idx, sample in samples.iterrows():
        file_path = image_dir / sample['file_name']
        slide = openslide.open_slide(file_path)
        center_x, center_y = sample.bbox[0] + (sample.bbox[2] - sample.bbox[0]) / 2, sample.bbox[1] + (sample.bbox[3] - sample.bbox[1]) / 2
        patch = np.array(slide.read_region((int(center_x-50), int(center_y-50)), level=0, size=(100, 100)))
        patches.append(patch)

    fig = px.imshow(np.array(patches), facet_col=0, facet_col_wrap=5, labels={'facet_col':'mitotic figure'}, title=tumortype)
    fig.show()
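
The patch extraction above centers a 100 x 100 px window on each bounding box. Here is a small sketch of that arithmetic, assuming boxes are given as `(x0, y0, x1, y1)`. The helper name `patch_origin` is hypothetical.

```python
def patch_origin(bbox, patch_size=100):
    """Top-left corner of a square patch centered on an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = bbox
    # The box center is the midpoint of opposite corners.
    center_x = (x0 + x1) / 2
    center_y = (y0 + y1) / 2
    half = patch_size // 2
    return int(center_x - half), int(center_y - half)


# Invented box with center (1025, 2025).
origin = patch_origin((1000, 2000, 1050, 2050))
print(origin)  # (975, 1975)
```

The returned tuple has the form expected as the first argument of `slide.read_region(..., level=0, size=(100, 100))`. For boxes close to the slide border the origin can become negative, so such cases may need separate handling.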

The following examples showcase hard negatives. These samples emphasize the challenges of the detection task, as there can be significant overlap in appearance between hard negatives and mitotic figures. This similarity can lead to algorithmic confusion and an increased likelihood of false positives.

In [None]:
num_samples = 5

for tumortype in dataset['tumor_type'].unique():
    patches = []
    tumor_dataset = dataset.query('tumor_type == @tumortype and cat == "hard negative"')
    samples = tumor_dataset.sample(n=num_samples)

    for idx, sample in samples.iterrows():
        file_path = image_dir / sample['file_name']
        slide = openslide.open_slide(file_path)
        center_x, center_y = sample.bbox[0] + (sample.bbox[2] - sample.bbox[0]) / 2, sample.bbox[1] + (sample.bbox[3] - sample.bbox[1]) / 2
        patch = np.array(slide.read_region((int(center_x-50), int(center_y-50)), level=0, size=(100, 100)))
        patches.append(patch)

    fig = px.imshow(np.array(patches), facet_col=0, facet_col_wrap=5, labels={'facet_col':'hard negative'}, title=tumortype)
    fig.show()