<a href="https://colab.research.google.com/github/DanielaSchacherer/IDC-Tutorials/blob/bmdeep_tutorial/notebooks/collections_demos/bonemarrowwsi_pediatricleukemia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BoneMarrowWSI-PediatricLeukemia


## Background

This notebook introduces the `BoneMarrowWSI-PediatricLeukemia` collection, which is presented in [this preprint](https://www.arxiv.org/pdf/2509.15895) and was recently added to IDC.

- **Images**: The `BoneMarrowWSI-PediatricLeukemia` dataset comprises bone marrow aspirate smear WSIs for 246 pediatric cases of leukemia, including acute lymphoid leukemia (ALL), acute myeloid leukemia (AML), and chronic myeloid leukemia (CML). The smears were prepared for the initial diagnosis (i.e., without prior treatment), stained in accordance with the Pappenheim method, and scanned at 40x magnification.
- **Annotations**: The images have been annotated with rectangular regions of interest (ROIs) covering the evaluable monolayer area, and a total of 45176 cell bounding boxes have been placed (with few exceptions) within the ROIs. For a subset of 232 ROIs, all cells and other haematological structures have been labelled by multiple experts in a consensus labelling approach with 49 distinct (cell type) classes. The consensus labelling approach worked as follows: each bounding box was successively labelled by different experts in so-called "annotation sessions" until (a) the bounding box had been labelled by at least two experts, and (b) the most frequent label constituted at least half of all labels given to that bounding box (this label is then termed the "consensus class"). In summary, the following annotations are available:  

    - For each slide: ROI annotations of the monolayer area
    - For some slides: Unlabeled cell bounding boxes
    - For some slides: Cell bounding boxes with cell type labels for each annotation session plus the finally obtained consensus.
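
The stopping rule of the consensus labelling approach described above can be sketched in a few lines. This is only an illustration of conditions (a) and (b) as stated in the text, not the authors' implementation: the function name `consensus_class` is hypothetical, the tie-breaking behaviour (the first label encountered wins) is an assumption, and apart from `artifact` the label names are placeholders.

```python
from collections import Counter
from typing import List, Optional


def consensus_class(labels: List[str]) -> Optional[str]:
    """Return the consensus label, or None if the stopping rule is not met yet."""
    # Condition (a): at least two experts must have labelled the bounding box.
    if len(labels) < 2:
        return None
    # Most frequent label and its count (assumption: on ties, the label seen first wins).
    label, count = Counter(labels).most_common(1)[0]
    # Condition (b): the most frequent label makes up at least half of all labels.
    if 2 * count >= len(labels):
        return label
    return None


print(consensus_class(["artifact"]))                            # None (only one expert so far)
print(consensus_class(["artifact", "artifact"]))                # artifact
print(consensus_class(["artifact", "class_b", "class_c"]))      # None (1 of 3 is below half)
```

Read literally, the rule is already satisfied when two experts disagree (one of two labels is "at least half"), which is why the tie-breaking assumption matters in this sketch.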

This notebook concentrates on **how to access and work with the annotation data**, which is made available in the DICOM Microscopy Bulk Simple Annotations format. As a general introduction to this format, we recommend having a look at [this tutorial notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb).


<img src="https://raw.githubusercontent.com/ImagingDataCommons/IDC-Tutorials/master/notebooks/pathomics/bmdeep_annotations_example.png" alt="Example visualization of BoneMarrowWSI-PediatricLeukemia annotations" width="1000"/>



## Prerequisites
**Installations**
* **Install highdicom:** [highdicom](https://highdicom.readthedocs.io/en/latest/introduction.html) was specifically designed to work with DICOM objects holding image-derived information, e.g. annotations and measurements. Detailed information on highdicom's functionality can be found in its [user guide](https://highdicom.readthedocs.io/en/latest/usage.html).
* **Install wsidicom:** The [wsidicom](https://pypi.org/project/wsidicom/) Python package provides functionality to open WSIs and extract image data or metadata from them.
* **Install idc-index:** The Python package [idc-index](https://pypi.org/project/idc-index/) facilitates queries of the basic metadata and download of DICOM files hosted by the IDC.

In [1]:
%%capture
!pip install highdicom
!pip install wsidicom
!pip install idc-index --upgrade

## Imports

In [2]:
import highdicom as hd
from idc_index import index
import pandas as pd
from collections import defaultdict
from google.cloud import storage
from pathlib import Path
from typing import List, Union, Tuple

In [10]:
from google.colab import auth
auth.authenticate_user()

## Finding the `BoneMarrowWSI-PediatricLeukemia` dataset on IDC
To access and download image and annotation (ANN) files, we use the Python package [idc-index](https://github.com/ImagingDataCommons/idc-index).

In [3]:
idc_client = index.IDCClient() # set-up idc_client
idc_client.fetch_index('sm_instance_index')

First, we verify that the `BoneMarrowWSI-PediatricLeukemia` collection indeed contains 246 WSIs (i.e., distinct StudyInstanceUIDs):

In [12]:
############################################## DEV TEST #######################################
from google.cloud import bigquery

# Initialize client (project is optional, uses default if not set)
client = bigquery.Client(project='idc-pathomics-000')

# Your SQL query
query = """
    SELECT COUNT(DISTINCT StudyInstanceUID)
    FROM `idc-dev-etl.idc_v23_pub.dicom_all`
    WHERE
      collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='SM'
"""

# Run the query
query_job = client.query(query).to_dataframe()

# Fetch and print results
display(query_job)

Unnamed: 0,f0_
0,246


In [4]:
query_slide_count = '''
SELECT COUNT(DISTINCT StudyInstanceUID)
FROM
    index
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='SM'
'''
print(idc_client.sql_query(query_slide_count))

   count(DISTINCT StudyInstanceUID)
0                               246


Next, let's have a look at the available annotation (ANN) files:

In [20]:
############################################## DEV TEST #######################################
from google.cloud import bigquery

# Initialize client (project is optional, uses default if not set)
client = bigquery.Client(project='idc-pathomics-000')

# Your SQL query
query = """
    SELECT
    SeriesDescription,
    SeriesInstanceUID,
    ARRAY_AGG(StudyInstanceUID) AS StudyInstanceUID,
    ARRAY_AGG(Modality) AS Modality
FROM
    `idc-dev-etl.idc_v23_pub.dicom_all`
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='ANN'
GROUP BY
    SeriesInstanceUID,
    SeriesDescription
ORDER BY
    StudyInstanceUID[OFFSET(0)]
"""

# Run the query
annotations = client.query(query).to_dataframe()

# Fetch and print results
display(annotations)

Unnamed: 0,SeriesDescription,SeriesInstanceUID,StudyInstanceUID,Modality
0,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.76434139437749586...,[1.2.826.0.1.3680043.8.498.1074763298775112063...,[ANN]
1,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.76035111849294113...,[1.2.826.0.1.3680043.8.498.1110250475182573623...,[ANN]
2,Unlabeled cell bounding boxes,1.2.826.0.1.3680043.10.511.3.51699668688633439...,[1.2.826.0.1.3680043.8.498.1110250475182573623...,[ANN]
3,Session 1: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.59892159514053659...,[1.2.826.0.1.3680043.8.498.1162778434880422268...,[ANN]
4,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.57387082213597634...,[1.2.826.0.1.3680043.8.498.1162778434880422268...,[ANN]
...,...,...,...,...
1028,Session 1: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.27471911554539798...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1029,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.73548391110725757...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1030,Session 2: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.10.511.3.39451636835490582...,[1.2.826.0.1.3680043.8.498.9975397932428013130...,[ANN]
1031,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.10.511.3.86763164155160463...,[1.2.826.0.1.3680043.8.498.9996452406228816651...,[ANN]


In [5]:
query_anns = '''
SELECT
    SeriesDescription,
    SeriesInstanceUID,
    ARRAY_AGG(StudyInstanceUID) AS StudyInstanceUID,
    ARRAY_AGG(Modality) AS Modality
FROM
    index
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='ANN'
GROUP BY
    SeriesInstanceUID,
    SeriesDescription
ORDER BY
    StudyInstanceUID[OFFSET(0)]
'''
annotations = idc_client.sql_query(query_anns)
print(annotations)

Empty DataFrame
Columns: [SeriesDescription, SeriesInstanceUID, StudyInstanceUID, Modality]
Index: []


We can see that for each slide (i.e., DICOM Study) there are multiple ANN Series. Looking at the SeriesDescription, we can confirm what is described in the [Background](#Background) section of this notebook.


*   Each slide has "Monolayer regions of interest for cell classification" annotations.
*   For some slides, there is one ANN Series with "Unlabeled cell bounding boxes", while for others, there are multiple ANN Series containing "Cell bounding boxes with cell type labels" for different annotation sessions and the consensus labels.
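
The observation above can also be checked programmatically by grouping the ANN Series per study. The following is a minimal sketch on toy data: the function name `summarize_studies`, the study identifiers, and the full "Session"/"Consensus" description strings (which appear only truncated in the query output) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_studies(series: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each study to 'labeled', 'unlabeled' or 'roi_only' based on its ANN Series."""
    descriptions: Dict[str, List[str]] = defaultdict(list)
    for study_uid, description in series:
        descriptions[study_uid].append(description.lower())

    summary = {}
    for study_uid, descs in descriptions.items():
        if any("cell type labels" in d for d in descs):
            summary[study_uid] = "labeled"
        elif any("unlabeled" in d for d in descs):
            summary[study_uid] = "unlabeled"
        else:
            summary[study_uid] = "roi_only"
    return summary


# Toy (StudyInstanceUID, SeriesDescription) pairs with hypothetical study identifiers
toy_series = [
    ("study-A", "Monolayer regions of interest for cell classification"),
    ("study-A", "Unlabeled cell bounding boxes"),
    ("study-B", "Monolayer regions of interest for cell classification"),
    ("study-B", "Session 1: Cell bounding boxes with cell type labels"),
    ("study-B", "Consensus: cell bounding boxes with cell type labels"),
]
print(summarize_studies(toy_series))
```

With the real query result, the same function could be fed with the `StudyInstanceUID` and `SeriesDescription` values of each row.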



## Viewing annotations


Annotations can be viewed and explored in detail on their respective slides using the Slim viewer. At the bottom of the right sidebar in the Slim viewer's interface, select the ANN Series of interest from the drop-down menu, then click on `Annotation Groups` and switch the slider(s) to make the annotations visible.

In [21]:
viewer_url = idc_client.get_viewer_URL(studyInstanceUID=annotations['StudyInstanceUID'].iloc[0][0], viewer_selector='slim')
from IPython.display import IFrame
IFrame(viewer_url, width=1500, height=900)

## Accessing annotations

Since the annotation dataset is of moderate size, it could be downloaded completely and accessed from the local disk. However, a more desirable approach, especially for larger datasets, is to extract the relevant information directly from the objects in the cloud. The following functions `get_roi_annotations()` and `get_cells()` (with `subset` set to `'labeled'`, `'unlabeled'` or `'both'`) can be used for this approach.

Note that the selection of the respective files, i.e. files containing ROI annotations, labeled cell annotations or unlabeled cell annotations, is done by filtering on the SeriesDescription.
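
To make this filtering concrete, here is a small sketch that mimics `LOWER(SeriesDescription) LIKE '%keyword%'` in plain Python. The helper `like_contains` is hypothetical, the equivalence only holds for keywords without the SQL wildcards `%` and `_`, and the full "Session"/"Consensus" strings are assumed from the truncated query output.

```python
from typing import List

# Example descriptions; the last two are assumed full forms of truncated output
DESCRIPTIONS = [
    "Monolayer regions of interest for cell classification",
    "Unlabeled cell bounding boxes",
    "Session 1: Cell bounding boxes with cell type labels",
    "Consensus: cell bounding boxes with cell type labels",
]


def like_contains(descriptions: List[str], keyword: str) -> List[str]:
    """Case-insensitive substring match, like LOWER(col) LIKE '%keyword%'."""
    return [d for d in descriptions if keyword.lower() in d.lower()]


for keyword in ["monolayer", "unlabeled", "labels", "cell"]:
    print(f"{keyword!r} -> {like_contains(DESCRIPTIONS, keyword)}")
```

Note that a broad keyword such as `cell` also matches the ROI series ("... for cell classification"), so a more specific phrase like `cell bounding boxes` is the safer choice when only cell annotations are wanted.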

In [None]:
def get_roi_annotations():
    query_roi_anns = '''
    SELECT
        SeriesInstanceUID
    FROM
        index
    WHERE
        collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND Modality='ANN'
        AND LOWER(SeriesDescription) LIKE '%monolayer%'
    ORDER BY
        StudyInstanceUID
    '''
    roi_series = idc_client.sql_query(query_roi_anns)
    rois = extract_rois(roi_series['SeriesInstanceUID'].tolist())
    return rois

In [22]:
def extract_rois(series_uids: List[str]) -> pd.DataFrame:
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location="gcs")
        for file_url in file_urls:
            ann = hd.ann.annread(file_url)
            for ann_group in ann.get_annotation_groups():
                coords = ann_group.get_graphic_data(coordinate_type='2D')
                m_names, m_values, m_units = ann_group.get_measurements()
                for c, m in zip(coords, m_values):
                    rows.append({
                        'reference_series_id': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                        'reference_sop_id': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                        'roi_id': int(m[0]),  # first measurement value is used as the ROI identifier
                        'roi_label': ann_group.label,
                        'roi_coordinates': c
                    })
    rois = pd.DataFrame(rows)
    return rois

In [None]:
rois = get_roi_annotations()
display(rois)
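
Each entry in `roi_coordinates` is a list of corner points of a rectangle. For cropping or overlap checks it is often handier to work with an axis-aligned box. The helper below is a minimal sketch: `to_bounding_box` is a hypothetical name, the corner values are made up, and it assumes each point is an (x, y) pair in the slide's pixel coordinate system.

```python
from typing import Iterable, Sequence, Tuple


def to_bounding_box(points: Iterable[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Collapse polygon vertices into (x_min, y_min, x_max, y_max)."""
    xs, ys = zip(*[(float(p[0]), float(p[1])) for p in points])
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical rectangle with the same layout as one `roi_coordinates` entry
corners = [[100.0, 200.0], [2148.0, 200.0], [2148.0, 2248.0], [100.0, 2248.0]]
x_min, y_min, x_max, y_max = to_bounding_box(corners)
print(x_max - x_min, y_max - y_min)  # 2048.0 2048.0
```

Because the helper only indexes into each point, it should work with NumPy arrays as well as with plain lists.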

In [31]:
def get_cells(subset: str = 'labeled') -> pd.DataFrame:
    assert subset in ['labeled', 'unlabeled', 'both']
    if subset == 'labeled':
        query_word = 'labels'
    elif subset == 'unlabeled':
        query_word = 'unlabeled'
    else:
        # plain 'cell' would also match the ROI series ("... for cell classification")
        query_word = 'cell bounding boxes'

    query_cell_anns = f'''
    SELECT
        SeriesInstanceUID
    FROM
        index
    WHERE
        collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND Modality='ANN'
        AND LOWER(SeriesDescription) LIKE '%{query_word}%'
    ORDER BY
        StudyInstanceUID
    '''
    cell_series = idc_client.sql_query(query_cell_anns)
    cells = extract_cells(cell_series['SeriesInstanceUID'].tolist())
    return cells

In [None]:
def extract_cells(series_uids: List[str]) -> pd.DataFrame:
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location="gcs")
        for file_url in file_urls:
            ann = hd.ann.annread(file_url)
            for ann_group in ann.get_annotation_groups():
                coords = ann_group.get_graphic_data(coordinate_type='2D')
                m_names, m_values, m_units = ann_group.get_measurements()
                for c, m in zip(coords, m_values):
                    rows.append({
                        'reference_series_id': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                        'reference_sop_id': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                        'annotation_session': get_annotation_session(ann),
                        'cell_id': int(m[0]),
                        'roi_id': int(m[1]) if m.size > 1 else None, # allow empty roi_id,
                        'cell_label': ann_group.label,
                        'cell_coordinates': c
                    })
    cells = pd.DataFrame(rows)
    return cells

In [None]:
def get_annotation_session(ann: hd.ann.sop.MicroscopyBulkSimpleAnnotations) -> str:
    return ann.SeriesDescription.split(':')[0]
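
To illustrate what `get_annotation_session()` yields, the same splitting rule can be applied to plain strings. This is a sketch: `session_from_description` is a hypothetical stand-in that takes a string instead of an annotation object, and the "Session 1" and "Consensus" descriptions are assumed full forms of the truncated strings in the query output.

```python
def session_from_description(series_description: str) -> str:
    """Same rule as get_annotation_session(), applied to a plain string."""
    return series_description.split(':')[0]


examples = [
    "Session 1: Cell bounding boxes with cell type labels",
    "Consensus: cell bounding boxes with cell type labels",
    "Unlabeled cell bounding boxes",
]
for description in examples:
    print(repr(session_from_description(description)))
# 'Session 1'
# 'Consensus'
# 'Unlabeled cell bounding boxes'
```

For descriptions without a colon, such as the unlabeled series, the whole description is returned unchanged.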

In [None]:
unlabeled_cells = get_cells(subset='unlabeled')
display(unlabeled_cells)

In [None]:
labeled_cells = get_cells(subset='labeled')
display(labeled_cells)
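
Since cell bounding boxes were placed within the ROIs with few exceptions, one may want to identify those exceptions. The following is a minimal sketch of such a containment check. The names `bounding_box` and `is_inside` and all coordinates are hypothetical, and points are assumed to be (x, y) pairs as in the `cell_coordinates` and `roi_coordinates` columns.

```python
from typing import Iterable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_box(points: Iterable[Sequence[float]]) -> Box:
    """Collapse polygon vertices into an axis-aligned box."""
    xs, ys = zip(*[(float(p[0]), float(p[1])) for p in points])
    return min(xs), min(ys), max(xs), max(ys)


def is_inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies completely within `outer` (borders included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


# Made-up ROI and cell rectangles
roi = bounding_box([[1000, 1000], [3048, 1000], [3048, 3048], [1000, 3048]])
cell_in = bounding_box([[1200, 1300], [1350, 1300], [1350, 1450], [1200, 1450]])
cell_out = bounding_box([[2990, 2000], [3100, 2000], [3100, 2120], [2990, 2120]])

print(is_inside(cell_in, roi))   # True
print(is_inside(cell_out, roi))  # False
```

Applied to real data, each cell would be compared against the ROI that shares its `roi_id` and `reference_sop_id`.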

## Test with private bucket

In [26]:
from google.cloud import storage
import os

def download_all_files_from_bucket(bucket_name, destination_folder):
    client = storage.Client()  # Make sure credentials are configured (e.g., GOOGLE_APPLICATION_CREDENTIALS)
    bucket = client.bucket(bucket_name)

    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    blobs = bucket.list_blobs()
    for blob in blobs:
        file_path = os.path.join(destination_folder, blob.name)
        os.makedirs(os.path.dirname(file_path), exist_ok=True)  # Create directories as needed
        blob.download_to_filename(file_path)
        print(f"Downloaded {blob.name} to {file_path}")

# Usage
bucket_name = "bmdeep_anns"
destination_folder = "./downloaded_files"
download_all_files_from_bucket(bucket_name, destination_folder)


Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_1.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_1.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_2.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_2.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_3.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_3.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_4.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_4.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_5.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_5.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_consensus.dcm to ./downloaded_files/02E74F10E0327AD868D138F2B4FDD6F0_1_bm_cells_ann_session_consensus.dcm
Downloaded 02E74F10E0327AD868D138F2B4FDD6F

In [27]:
# for local access
def extract_rois(series_uids: List[str]) -> pd.DataFrame:
    rows = []
    for file in os.listdir('./downloaded_files'):
        if 'roi' in file:
            ann = hd.ann.annread('./downloaded_files/'+ file)
            for ann_group in ann.get_annotation_groups():
                coords = ann_group.get_graphic_data(coordinate_type='2D')
                m_names, m_values, m_units = ann_group.get_measurements()
                for c, m in zip(coords, m_values):
                    rows.append({
                        'reference_series_id': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                        'reference_sop_id': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                        'roi_id': int(m[0]),  # first measurement value is used as the ROI identifier
                        'roi_label': ann_group.label,
                        'roi_coordinates': c
                    })
    rois = pd.DataFrame(rows)
    return rois

display(extract_rois(None))

Unnamed: 0,reference_series_id,reference_sop_id,roi_id,roi_label,roi_coordinates
0,1.2.826.0.1.3680043.8.498.81829257432420839308...,1.2.826.0.1.3680043.8.498.88303719051231458307...,2,region_of_interest,"[[63021.0, 166331.0], [65069.0, 166331.0], [65..."
1,1.2.826.0.1.3680043.8.498.81829257432420839308...,1.2.826.0.1.3680043.8.498.88303719051231458307...,3,region_of_interest,"[[128929.0, 184949.0], [130977.0, 184949.0], [..."
2,1.2.826.0.1.3680043.8.498.81829257432420839308...,1.2.826.0.1.3680043.8.498.88303719051231458307...,2002,region_of_interest,"[[44298.0, 145648.0], [75522.0, 145648.0], [75..."
3,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,82,region_of_interest,"[[8235.0, 164826.0], [10283.0, 164826.0], [102..."
4,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,83,region_of_interest,"[[134067.0, 152341.0], [136115.0, 152341.0], [..."
5,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,2195,region_of_interest,"[[87180.0, 160884.0], [95250.0, 160884.0], [95..."
6,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,2196,region_of_interest,"[[12720.0, 162063.0], [23165.0, 162063.0], [23..."
7,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,1028,region_of_interest,"[[99052.0, 80470.0], [101100.0, 80470.0], [101..."
8,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,1029,region_of_interest,"[[80013.0, 88503.0], [82061.0, 88503.0], [8206..."




In [28]:
## for local access
import os
def get_cells(series_uids: List[str], annotation_session: Union[int,str]) -> pd.DataFrame:
    rows = []
    for file in os.listdir('./downloaded_files'):
        if 'cell' in file:
            ann = hd.ann.annread('./downloaded_files/'+ file)
            for ann_group in ann.get_annotation_groups():
                coords = ann_group.get_graphic_data(coordinate_type='2D')
                m_names, m_values, m_units = ann_group.get_measurements()
                for c, m in zip(coords, m_values):
                    rows.append({
                        'reference_series_id': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                        'reference_sop_id': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                        'annotation_session': ann.SeriesDescription,
                        'cell_id': int(m[0]),
                        'roi_id': int(m[1]) if m.size > 1 else None, # allow empty roi_id,
                        'cell_label': ann_group.label,
                        'cell_coordinates': c
                    })
    cells = pd.DataFrame(rows)
    return cells

display(get_cells(None, None))

Unnamed: 0,reference_series_id,reference_sop_id,annotation_session,cell_id,roi_id,cell_label,cell_coordinates
0,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,Consensus: cell bounding boxes with cell type ...,12080,82.0,artifact,"[[8283.0, 164832.0], [8474.0, 164832.0], [8474..."
1,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,Consensus: cell bounding boxes with cell type ...,12095,82.0,artifact,"[[8583.0, 166228.0], [8624.0, 166228.0], [8624..."
2,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,Consensus: cell bounding boxes with cell type ...,12097,82.0,artifact,"[[8689.0, 166322.0], [8803.0, 166322.0], [8803..."
3,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,Consensus: cell bounding boxes with cell type ...,12098,82.0,artifact,"[[8944.0, 166190.0], [9178.0, 166190.0], [9178..."
4,1.2.826.0.1.3680043.8.498.77272910370076975556...,1.2.826.0.1.3680043.8.498.50197427852324449642...,Consensus: cell bounding boxes with cell type ...,12110,82.0,artifact,"[[8653.0, 164860.0], [8859.0, 164860.0], [8859..."
...,...,...,...,...,...,...,...
1544,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,Unlabeled cell bounding boxes,102455,1028.0,haematological_structure,"[[100424.0, 80821.0], [100555.0, 80821.0], [10..."
1545,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,Unlabeled cell bounding boxes,102456,1028.0,haematological_structure,"[[99723.0, 81037.0], [99879.0, 81037.0], [9987..."
1546,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,Unlabeled cell bounding boxes,102457,1028.0,haematological_structure,"[[99193.0, 81851.0], [99373.0, 81851.0], [9937..."
1547,1.2.826.0.1.3680043.8.498.95463846645682241501...,1.2.826.0.1.3680043.8.498.13783623809302100421...,Unlabeled cell bounding boxes,102458,1028.0,haematological_structure,"[[99197.0, 81691.0], [99343.0, 81691.0], [9934..."

