<a href="https://colab.research.google.com/github/DanielaSchacherer/IDC-Tutorials/blob/bmdeep_tutorial/notebooks/collections_demos/bonemarrowwsi_pediatricleukemia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BoneMarrowWSI-PediatricLeukemia


## Background

This notebook introduces the `BoneMarrowWSI-PediatricLeukemia` collection, which is presented in [this preprint](https://www.arxiv.org/pdf/2509.15895) and was recently added to IDC.

- **Images**: The `BoneMarrowWSI-PediatricLeukemia` dataset comprises bone marrow aspirate smear WSIs for 246 pediatric cases of leukemia, including acute lymphoid leukemia (ALL), acute myeloid leukemia (AML), and chronic myeloid leukemia (CML). The smears were prepared for the initial diagnosis (i.e., without prior treatment), stained in accordance with the Pappenheim method, and scanned at 40x magnification.
- **Annotations**: The images have been annotated with rectangular regions of interest (ROIs) marking the evaluable monolayer area, and a total of 45176 cell bounding boxes have been placed (with few exceptions) within the ROIs. For a subset of 232 ROIs, all cells and other haematological structures have been labelled by multiple experts in a consensus labelling approach with 49 distinct (cell type) classes. The consensus labelling approach worked as follows: each bounding box was successively labelled by different experts in so-called "annotation sessions" until (a) the bounding box had been labelled by at least two experts, and (b) the most frequent label constituted at least half of all labels given to that bounding box (this label is then termed the "consensus class"). In summary, the following annotations are available:

    - For each slide: ROI annotations of the monolayer area
    - For some slides: Unlabeled cell bounding boxes
    - For some slides: Cell bounding boxes with cell type labels for each annotation session plus the finally obtained consensus.
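
The consensus rule described above can be sketched in a few lines. This is only an illustration of conditions (a) and (b), not the authors' implementation: the helper name `consensus_class` and the placeholder labels `"A"`, `"B"`, `"C"` are hypothetical, and the text does not say how ties are resolved, so treating a tie as "no consensus yet" is an assumption.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_class(labels: Sequence[str]) -> Optional[str]:
    """Return the consensus label, or None if no consensus is reached yet."""
    # Rule (a): at least two experts must have labelled the bounding box.
    if len(labels) < 2:
        return None
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    # Assumption (not stated in the text): a tie for first place is not a consensus.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    # Rule (b): the most frequent label makes up at least half of all labels.
    if 2 * top_count >= len(labels):
        return top_label
    return None


print(consensus_class(["A"]))            # None: only one expert so far
print(consensus_class(["A", "A"]))       # A
print(consensus_class(["A", "B"]))       # None: tie, another session would be needed
print(consensus_class(["A", "B", "A"]))  # A
```

Under this reading, a bounding box keeps going through further annotation sessions for as long as the function returns `None`.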

This notebook concentrates on **how to access and work with the annotation data**, which is made available in **DICOM Microscopy Bulk Simple Annotations format (ANN)**. As a general introduction to this format, we recommend having a look at [this tutorial notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb).


<img src="https://raw.githubusercontent.com/ImagingDataCommons/IDC-Tutorials/master/notebooks/pathomics/bmdeep_annotations_example.png" alt="Example visualization of BoneMarrowWSI-PediatricLeukemia annotations" width="1000"/>



## Prerequisites
**Installations**
* **Install highdicom:** [highdicom](https://highdicom.readthedocs.io/en/latest/introduction.html) was specifically designed to work with DICOM objects holding image-derived information, e.g. annotations and measurements. Detailed information on highdicom's functionality can be found in its [user guide](https://highdicom.readthedocs.io/en/latest/usage.html).
* **Install wsidicom:** The [wsidicom](https://pypi.org/project/wsidicom/) Python package provides functionality to open WSIs and extract image data or metadata from them.
* **Install idc-index:** The Python package [idc-index](https://pypi.org/project/idc-index/) facilitates queries of the basic metadata and download of DICOM files hosted by the IDC.

In [1]:
%%capture
!pip install highdicom
!pip install wsidicom
!pip install idc-index --upgrade

## Imports

In [2]:
import os
import highdicom as hd
from idc_index import index
import pandas as pd
from collections import defaultdict
from google.cloud import storage
from pathlib import Path
from typing import List, Union, Tuple

## Finding the `BoneMarrowWSI-PediatricLeukemia` dataset on IDC
To access and download image and ANN files, we use the Python package [idc-index](https://github.com/ImagingDataCommons/idc-index).

In [3]:
idc_client = index.IDCClient() # set-up idc_client
idc_client.fetch_index('sm_instance_index')

First, we verify that there are indeed 246 WSIs (= distinct StudyInstanceUIDs) in the `BoneMarrowWSI-PediatricLeukemia` collection:

In [4]:
query_slide_count = '''
SELECT COUNT(DISTINCT StudyInstanceUID)
FROM
    index
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='SM'
'''
print(idc_client.sql_query(query_slide_count))

   count(DISTINCT StudyInstanceUID)
0                               246


Next, let's have a look at the available annotation (ANN) files:

In [5]:
query_anns = '''
SELECT
    SeriesDescription,
    SeriesInstanceUID,
    ARRAY_AGG(StudyInstanceUID) AS StudyInstanceUID,
    ARRAY_AGG(Modality) AS Modality
FROM
    index
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='ANN'
GROUP BY
    SeriesInstanceUID,
    SeriesDescription
ORDER BY
    StudyInstanceUID,
    SeriesDescription
'''
annotations = idc_client.sql_query(query_anns)
display(annotations)

Unnamed: 0,SeriesDescription,SeriesInstanceUID,StudyInstanceUID,Modality
0,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.76434139437749586...,[1.2.826.0.1.3680043.8.498.1074763298775112063...,[ANN]
1,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.76035111849294113...,[1.2.826.0.1.3680043.8.498.1110250475182573623...,[ANN]
2,Unlabeled cell bounding boxes,1.2.826.0.1.3680043.10.511.3.51699668688633439...,[1.2.826.0.1.3680043.8.498.1110250475182573623...,[ANN]
3,Consensus: cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.18476424701131582...,[1.2.826.0.1.3680043.8.498.1162778434880422268...,[ANN]
4,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.57387082213597634...,[1.2.826.0.1.3680043.8.498.1162778434880422268...,[ANN]
...,...,...,...,...
1028,Session 2: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.39451636835490582...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1029,Session 3: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.67965106709643031...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1030,Session 4: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.98267820174458043...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1031,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.86763164155160463...,[1.2.826.0.1.3680043.8.498.9996452406228816651...,[ANN]


We can see that for each slide (i.e. DICOM Study) there are multiple ANN Series. Looking at the SeriesDescription, we can confirm what is described in the [Background](#Background) section of this notebook.


*   Each slide has "Monolayer regions of interest for cell classification" annotations.
*   For some slides, there is one ANN Series with "Unlabeled cell bounding boxes", while for others, there are multiple ANN Series containing "Cell bounding boxes with cell type labels" for different annotation sessions and the consensus labels.
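
To make the distinction between these Series types concrete, here is a small sketch that sorts SeriesDescription strings into categories. The helper `classify_series` and its category names are hypothetical, and the full description strings are assumptions pieced together from the truncated table output and the bullets above.

```python
def classify_series(description: str) -> str:
    """Map a SeriesDescription to a coarse annotation category."""
    text = description.lower()
    if "monolayer" in text:
        return "roi"
    if "unlabeled" in text:
        return "unlabeled_cells"
    if text.startswith("consensus"):
        return "consensus_cells"
    if text.startswith("session"):
        return "session_cells"
    return "other"


# Assumed full descriptions (the table above shows them truncated).
examples = [
    "Monolayer regions of interest for cell classification",
    "Unlabeled cell bounding boxes",
    "Consensus: cell bounding boxes with cell type labels",
    "Session 2: Cell bounding boxes with cell type labels",
]
for description in examples:
    print(f"{classify_series(description):16s} <- {description}")
```

Note that the word "cell" occurs in all four descriptions, including the monolayer ROI one, so it cannot by itself separate cell annotations from ROI annotations.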



## Viewing annotations


Annotations can be viewed and explored in detail on their respective slide using the Slim viewer. At the bottom of the right sidebar in the Slim viewer's interface, select the ANN Series of interest from the drop-down menu, then click on `Annotation Groups` and toggle the slider(s) to make annotations visible.

In [None]:
from IPython.display import IFrame

# Row 3 of `annotations` is a consensus ANN Series; open the slide (DICOM Study) it belongs to.
viewer_url = idc_client.get_viewer_URL(studyInstanceUID=annotations['StudyInstanceUID'].iloc[3][0], viewer_selector='slim')
IFrame(viewer_url, width=1500, height=900)

## Accessing annotations

### Download complete annotation collection for local access
Since the annotation dataset is small, it can be downloaded in full using `idc_index`, as shown below, and then read from local disk using `highdicom`.

In [None]:
dcm_ann_dir = '/content/dicom_ann_annotations'
os.makedirs(dcm_ann_dir, exist_ok=True)

idc_client.download_from_selection(downloadDir=dcm_ann_dir,
                                   seriesInstanceUID=annotations['SeriesInstanceUID'].tolist(), dirTemplate=None)

Downloading data:  93%|█████████▎| 17.1M/18.4M [00:04<00:00, 4.08MB/s]


For guidance on how to read the downloaded annotation files see section "Reading DICOM ANNs" of [this tutorial notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb).

### Access annotations directly from the Cloud

A more desirable approach, especially for larger datasets, is to extract the relevant information directly from the objects in the cloud. The functions `get_roi_annotations()` and `get_cell_annotations()` defined below follow this approach. They extract ROI and cell annotations, respectively, and summarize them in an easy-to-use pandas DataFrame.
Note that the relevant annotation files (i.e. files containing ROI annotations, labeled cell annotations, or unlabeled cell annotations) are selected by filtering on the SeriesDescription.

The following two code cells define and use `get_roi_annotations()` to select all DICOM ANNs in the `BoneMarrowWSI-PediatricLeukemia` collection that contain ROI annotations of the monolayer area.
The resulting pandas DataFrame contains
- **'reference_SeriesInstanceUID'** and **'reference_SOPInstanceUID'**: the SeriesInstanceUID and SOPInstanceUID of the slide level the annotations refer to
- **'roi_id'**: the ID of the ROI
- **'roi_label'**: its label   
- **'roi_coordinates'**: the 2D coordinates in the image coordinate system of the referenced slide level.


In [21]:
def get_roi_annotations(demo: bool = False) -> pd.DataFrame:
    query_roi_anns = '''
    SELECT
        SeriesInstanceUID
    FROM
        index
    WHERE
        collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND Modality='ANN'
        AND LOWER(SeriesDescription) LIKE '%monolayer%'
    ORDER BY
        StudyInstanceUID,
        SeriesDescription
    '''
    roi_series = idc_client.sql_query(query_roi_anns)
    if demo:
        roi_series_to_extract = roi_series['SeriesInstanceUID'].tolist()[:10]
    else:
        roi_series_to_extract = roi_series['SeriesInstanceUID'].tolist()
    rois = extract_rois(roi_series_to_extract)
    return rois


def extract_rois(series_uids: List[str]) -> pd.DataFrame:
    gcs_client = storage.Client.create_anonymous_client()
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location='gcs')
        for file_url in file_urls:
            (_,_, bucket_name, folder_name, file_name) = file_url.split('/')
            bucket = gcs_client.bucket(bucket_name)
            blob = bucket.blob(f'{folder_name}/{file_name}')

            with blob.open('rb') as file_obj:
                ann = hd.ann.annread(file_obj)
                for ann_group in ann.get_annotation_groups():
                    coords = ann_group.get_graphic_data(coordinate_type='2D')
                    m_names, m_values, m_units = ann_group.get_measurements()
                    for c, m in zip(coords, m_values):
                        rows.append({
                            'reference_SeriesInstanceUID': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                            'reference_SOPInstanceUID': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                            'roi_id': int(m[0]),  # first measurement value holds the ROI ID
                            'roi_label': ann_group.label,
                            'roi_coordinates': c
                        })
    rois = pd.DataFrame(rows)
    return rois
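
`extract_rois()` unpacks each file URL with `file_url.split('/')`, which works as long as the URL has exactly the form `gs://<bucket>/<folder>/<file>`. As a slightly more defensive variant, the same split can be sketched with the standard library's `urlparse`. The helper name `split_gcs_url` and the example URL are made up for illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_url(file_url: str) -> Tuple[str, str]:
    """Split a gs:// URL into (bucket name, blob name)."""
    parsed = urlparse(file_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URL: {file_url!r}")
    # The blob name is the path without its leading slash; it may contain
    # any number of folder levels.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URL, only to show the shape of the result.
bucket_name, blob_name = split_gcs_url("gs://example-bucket/example-series/example-instance.dcm")
print(bucket_name)  # example-bucket
print(blob_name)    # example-series/example-instance.dcm
```

The returned pair maps directly onto the `gcs_client.bucket(bucket_name)` and `bucket.blob(...)` calls used above.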

In [7]:
# Without demo mode (demo=False), this cell may run for 1-2 minutes; please be patient :)
rois = get_roi_annotations(demo=True)
display(rois)

Unnamed: 0,reference_SeriesInstanceUID,reference_SOPInstanceUID,roi_id,roi_label,roi_coordinates
0,1.2.826.0.1.3680043.8.498.98377665788926698337...,1.2.826.0.1.3680043.8.498.70616662305497812223...,2271,region_of_interest,"[[72032.0, 160247.0], [74080.0, 160247.0], [74..."
1,1.2.826.0.1.3680043.8.498.98377665788926698337...,1.2.826.0.1.3680043.8.498.70616662305497812223...,2272,region_of_interest,"[[92857.0, 163260.0], [94905.0, 163260.0], [94..."
2,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,1070,region_of_interest,"[[48995.0, 80570.0], [51043.0, 80570.0], [5104..."
3,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,1071,region_of_interest,"[[99451.0, 126518.0], [101499.0, 126518.0], [1..."
4,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,290,region_of_interest,"[[21001.0, 175192.0], [23049.0, 175192.0], [23..."
...,...,...,...,...,...
807,1.2.826.0.1.3680043.8.498.25839405899256364708...,1.2.826.0.1.3680043.8.498.69266863984443253723...,2135,region_of_interest,"[[130389.0, 43037.0], [136209.0, 43037.0], [13..."
808,1.2.826.0.1.3680043.8.498.25839405899256364708...,1.2.826.0.1.3680043.8.498.69266863984443253723...,2136,region_of_interest,"[[82964.0, 46380.0], [92775.0, 46380.0], [9277..."
809,1.2.826.0.1.3680043.8.498.25839405899256364708...,1.2.826.0.1.3680043.8.498.69266863984443253723...,2137,region_of_interest,"[[38180.0, 46679.0], [45830.0, 46679.0], [4583..."
810,1.2.826.0.1.3680043.8.498.25839405899256364708...,1.2.826.0.1.3680043.8.498.69266863984443253723...,2138,region_of_interest,"[[23710.0, 49104.0], [32375.0, 49104.0], [3237..."


The following code cells define and use `get_cell_annotations()` to select all DICOM ANNs in the `BoneMarrowWSI-PediatricLeukemia` collection that contain cell annotations. By setting the parameter 'subset' to either 'labeled', 'unlabeled' or 'both', it's possible to extract either only labeled, unlabeled or all cell annotations.
The resulting pandas DataFrame contains
- **'reference_SeriesInstanceUID'** and **'reference_SOPInstanceUID'**: the SeriesInstanceUID and SOPInstanceUID of the slide level the annotations refer to
- **'annotation_session'**: 'n/a' for the unlabeled cells, otherwise the name of the annotation session (e.g. 'Session 2') or 'Consensus' for the final consensus.
- **'cell_id'**: the ID of the cell
- **'roi_id'**: if applicable, the ID of the monolayer ROI in which the cell is located
- **'cell_label_code_scheme'**: Tuple of code of the cell label and designator of the coding scheme, e.g. (414387006, SCT) which is code 414387006 from SNOMED CT ontology
- **'cell_label'**: Code meaning of the cell label e.g. 'Structure of haematological system'
- **'cell_coordinates'**: the 2D coordinates in the image coordinate system of the referenced slide level

In [22]:
def get_cell_annotations(subset: str = 'labeled', demo: bool = False) -> pd.DataFrame:
    assert subset in ['labeled', 'unlabeled', 'both']
    if subset == 'labeled':
        query_word = 'labels'
    elif subset == 'unlabeled':
        query_word = 'unlabeled'
    else:
        # 'cell' alone would also match the ROI series ("... for cell classification")
        query_word = 'cell bounding boxes'

    query_cell_anns = f'''
    SELECT
        SeriesInstanceUID
    FROM
        index
    WHERE
        collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND Modality='ANN'
        AND LOWER(SeriesDescription) LIKE '%{query_word}%'
    ORDER BY
        StudyInstanceUID,
        SeriesDescription
    '''
    cell_series = idc_client.sql_query(query_cell_anns)
    if demo:
        cell_series_to_extract = cell_series['SeriesInstanceUID'].tolist()[:10]
    else:
        cell_series_to_extract = cell_series['SeriesInstanceUID'].tolist()

    cells = extract_cells(cell_series_to_extract)
    return cells


def extract_cells(series_uids: List[str]) -> pd.DataFrame:
    gcs_client = storage.Client.create_anonymous_client()
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location='gcs')
        for file_url in file_urls:
            (_,_, bucket_name, folder_name, file_name) = file_url.split('/')
            bucket = gcs_client.bucket(bucket_name)
            blob = bucket.blob(f'{folder_name}/{file_name}')

            with blob.open('rb') as file_obj:
                ann = hd.ann.annread(file_obj)
                for ann_group in ann.get_annotation_groups():
                    coords = ann_group.get_graphic_data(coordinate_type='2D')
                    m_names, m_values, m_units = ann_group.get_measurements()
                    for c, m in zip(coords, m_values):
                        rows.append({
                            'reference_SeriesInstanceUID': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                            'reference_SOPInstanceUID': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                            'annotation_session': get_annotation_session(ann),
                            'cell_id': int(m[0]),
                            'roi_id': int(m[1]) if m.size > 1 else None, # allow empty roi_id,
                            'cell_label': ann_group.annotated_property_type.meaning,
                            'cell_label_code_scheme': (ann_group.annotated_property_type.value, ann_group.annotated_property_type.scheme_designator),
                            'cell_coordinates': c
                        })
    cells = pd.DataFrame(rows)
    return cells


def get_annotation_session(ann: hd.ann.sop.MicroscopyBulkSimpleAnnotations) -> str:
    if 'unlabeled' in ann.SeriesDescription.lower():
        return 'n/a'
    return ann.SeriesDescription.split(':')[0]

In [18]:
# Without demo mode (demo=False), this cell may run for 1-2 minutes; please be patient :)
unlabeled_cells = get_cell_annotations(subset='unlabeled', demo=True)
display(unlabeled_cells)

Unnamed: 0,reference_SeriesInstanceUID,reference_SOPInstanceUID,annotation_session,cell_id,roi_id,cell_label,cell_label_code_scheme,cell_coordinates
0,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,,104693,1070,Structure of haematological system,"(414387006, SCT)","[[49259.0, 80647.0], [49424.0, 80647.0], [4942..."
1,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,,104694,1070,Structure of haematological system,"(414387006, SCT)","[[49383.0, 80751.0], [49514.0, 80751.0], [4951..."
2,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,,104695,1070,Structure of haematological system,"(414387006, SCT)","[[49178.0, 80526.0], [49270.0, 80526.0], [4927..."
3,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,,104696,1070,Structure of haematological system,"(414387006, SCT)","[[49262.0, 80516.0], [49358.0, 80516.0], [4935..."
4,1.2.826.0.1.3680043.8.498.99045734331130228562...,1.2.826.0.1.3680043.8.498.52239720641745361153...,,104697,1070,Structure of haematological system,"(414387006, SCT)","[[49723.0, 80551.0], [49856.0, 80551.0], [4985..."
...,...,...,...,...,...,...,...,...
1655,1.2.826.0.1.3680043.8.498.69631392888544866375...,1.2.826.0.1.3680043.8.498.33316914385579021210...,,109220,1125,Structure of haematological system,"(414387006, SCT)","[[82138.0, 330089.0], [82200.0, 330089.0], [82..."
1656,1.2.826.0.1.3680043.8.498.69631392888544866375...,1.2.826.0.1.3680043.8.498.33316914385579021210...,,109221,1125,Structure of haematological system,"(414387006, SCT)","[[82988.0, 329818.0], [83057.0, 329818.0], [83..."
1657,1.2.826.0.1.3680043.8.498.69631392888544866375...,1.2.826.0.1.3680043.8.498.33316914385579021210...,,109222,1125,Structure of haematological system,"(414387006, SCT)","[[82997.0, 329744.0], [83113.0, 329744.0], [83..."
1658,1.2.826.0.1.3680043.8.498.69631392888544866375...,1.2.826.0.1.3680043.8.498.33316914385579021210...,,109223,1125,Structure of haematological system,"(414387006, SCT)","[[82103.0, 329810.0], [82205.0, 329810.0], [82..."
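
Since each cell carries a `roi_id`, a natural follow-up is to check whether a cell's bounding box actually lies inside its ROI (the Background section notes that there are a few exceptions). The sketch below assumes that each coordinate array is a list of (x, y) corner points in the same pixel coordinate system. The helpers `to_bbox` and `contains` are hypothetical, and the coordinates are made up rather than taken from the dataset.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_bbox(points: Sequence[Sequence[float]]) -> BBox:
    """Reduce a list of (x, y) corner points to an axis-aligned bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def contains(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies completely within `outer` (borders included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


# Made-up corner points for one ROI and one cell.
roi = to_bbox([[1000.0, 2000.0], [3048.0, 2000.0], [3048.0, 4048.0], [1000.0, 4048.0]])
cell = to_bbox([[1200.0, 2100.0], [1350.0, 2100.0], [1350.0, 2260.0], [1200.0, 2260.0]])

print(roi)                  # (1000.0, 2000.0, 3048.0, 4048.0)
print(contains(roi, cell))  # True
```

With the DataFrames from this notebook, the same check could be applied after joining cells and ROIs on `roi_id`.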


In [23]:
# Without demo mode (demo=False), this cell may run for 2-3 minutes; please be patient :)
labeled_cells = get_cell_annotations(subset='labeled', demo=True)
display(labeled_cells)

Unnamed: 0,reference_SeriesInstanceUID,reference_SOPInstanceUID,annotation_session,cell_id,roi_id,cell_label,cell_label_code_scheme,cell_coordinates
0,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,Consensus,40309,290.0,Artifact,"(47973001, SCT)","[[22542.0, 176021.0], [22603.0, 176021.0], [22..."
1,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,Consensus,40316,290.0,Damage,"(37782003, SCT)","[[22706.0, 175917.0], [22841.0, 175917.0], [22..."
2,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,Consensus,40336,290.0,Damage,"(37782003, SCT)","[[22711.0, 176620.0], [22834.0, 176620.0], [22..."
3,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,Consensus,40390,291.0,Damage,"(37782003, SCT)","[[86655.0, 154583.0], [86805.0, 154583.0], [86..."
4,1.2.826.0.1.3680043.8.498.36810224044030831386...,1.2.826.0.1.3680043.8.498.20301403784060697253...,Consensus,40477,291.0,Damage,"(37782003, SCT)","[[86298.0, 155920.0], [86504.0, 155920.0], [86..."
...,...,...,...,...,...,...,...,...
1585,1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...,Session 2,47205,322.0,Structure of haematological system,"(414387006, SCT)","[[42710.0, 31883.0], [42724.0, 31883.0], [4272..."
1586,1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...,Session 2,46688,322.0,Smudge cell,"(34717007, SCT)","[[42768.0, 32070.0], [42972.0, 32070.0], [4297..."
1587,1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...,Session 2,46709,323.0,Smudge cell,"(34717007, SCT)","[[41278.0, 28278.0], [41419.0, 28278.0], [4141..."
1588,1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...,Session 2,47204,322.0,Unusable - Quality renders image unusable,"(111235, DCM)","[[42780.0, 32381.0], [42827.0, 32381.0], [4282..."
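
Because the table mixes rows from individual sessions with the consensus rows, most analyses start by selecting one `annotation_session`. As a minimal sketch, the following counts cell labels for a chosen session using plain dictionaries. The helper `count_labels` is hypothetical and the rows are invented, although the label names are taken from the output above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def count_labels(rows: Iterable[Mapping[str, Any]], session: str = "Consensus") -> Dict[str, int]:
    """Count cell labels among the rows of one annotation session."""
    counts = Counter(
        str(row["cell_label"]) for row in rows if row["annotation_session"] == session
    )
    return dict(counts)


# Invented rows that mimic the columns of the DataFrame above.
rows = [
    {"cell_id": 1, "annotation_session": "Consensus", "cell_label": "Damage"},
    {"cell_id": 2, "annotation_session": "Consensus", "cell_label": "Artifact"},
    {"cell_id": 3, "annotation_session": "Consensus", "cell_label": "Damage"},
    {"cell_id": 3, "annotation_session": "Session 2", "cell_label": "Smudge cell"},
]

print(count_labels(rows))                       # {'Damage': 2, 'Artifact': 1}
print(count_labels(rows, session="Session 2"))  # {'Smudge cell': 1}
```

On the real DataFrame, the pandas equivalent would be `labeled_cells[labeled_cells['annotation_session'] == 'Consensus']['cell_label'].value_counts()`.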


## How to use the `BoneMarrowWSI-PediatricLeukemia` annotations


## Next steps

Share your feedback or ask questions about this notebook in the IDC Forum: https://discourse.canceridc.dev.

If you are interested in tissue type annotations or want to learn about DICOM Structured Reporting, you can take a look at [this notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/rms_mutation_prediction/RMS-Mutation-Prediction-Expert-Annotations_exploration.ipynb) navigating expert-generated region annotations for rhabdomyosarcoma tumor slides.