<a href="https://colab.research.google.com/github/DanielaSchacherer/IDC-Tutorials/blob/bmdeep_tutorial/notebooks/collections_demos/bonemarrowwsi_pediatricleukemia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BoneMarrowWSI-PediatricLeukemia


## Background

This notebook introduces the `BoneMarrowWSI-PediatricLeukemia` collection, which is presented in [this preprint](https://www.arxiv.org/pdf/2509.15895) and was recently added to [Imaging Data Commons](https://portal.imaging.datacommons.cancer.gov/).

- **Images**: The `BoneMarrowWSI-PediatricLeukemia` dataset comprises bone marrow aspirate smear WSIs for almost 250 pediatric cases of leukemia, including acute lymphoid leukemia (ALL), acute myeloid leukemia (AML), and chronic myeloid leukemia (CML). The smears were prepared for the initial diagnosis (i.e., without prior treatment), stained in accordance with the Pappenheim method, and scanned at 40x magnification.
- **Annotations**: The images have been annotated with rectangular regions of interest (ROIs) marking the evaluable monolayer area, and more than 40000 cell bounding boxes have been placed (with few exceptions) within the ROIs. For a subset of them, all cells and other haematological structures have additionally been labelled by multiple experts in a consensus labelling approach with 49 distinct (cell type) classes. The consensus labelling approach worked as follows: each bounding box was labelled successively by different experts in so-called "annotation sessions" until (a) the bounding box had been labelled by at least two experts, and (b) the most frequent label made up at least half of all labels given to that bounding box (this label was then termed the "consensus class"). In summary, the following annotations are available:

    - For each slide: ROI annotations of the monolayer area
    - For some slides: Unlabeled cell bounding boxes
    - For some slides: Cell bounding boxes with cell type labels for each annotation session plus the finally obtained consensus.
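
The stopping rule of the consensus labelling approach can be sketched in a few lines. This is only an illustration of conditions (a) and (b) as described above, not the authors' implementation; the function name `consensus_class` is hypothetical, and the handling of ties (the first-seen label among equally frequent ones is returned) is an assumption, since the description does not specify it.

```python
from collections import Counter
from typing import List, Optional


def consensus_class(labels: List[str]) -> Optional[str]:
    """Return the consensus class for one bounding box, or None if not yet reached.

    `labels` holds one label per expert who has labelled the box so far.
    """
    # (a) at least two experts must have labelled the bounding box
    if len(labels) < 2:
        return None
    # (b) the most frequent label must make up at least half of all labels
    # (assumption: on ties, Counter returns the label that was seen first)
    label, count = Counter(labels).most_common(1)[0]
    if 2 * count >= len(labels):
        return label
    return None


print(consensus_class(['Blast cell']))                               # None
print(consensus_class(['Blast cell', 'Blast cell']))                 # Blast cell
print(consensus_class(['Blast cell', 'Damage', 'Band neutrophil']))  # None
```

Under this reading, a bounding box keeps receiving labels in further annotation sessions for as long as the function returns `None`.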

This notebook concentrates on **how to access and work with the annotation data**, which is made available in **DICOM Microscopy Bulk Simple Annotation format (ANNs)**. As a general introduction to this format, we recommend having a look at [this tutorial notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb).


<img src="https://raw.githubusercontent.com/ImagingDataCommons/IDC-Tutorials/master/notebooks/pathomics/bmdeep_annotations_example.png" alt="Example visualization of BoneMarrowWSI-PediatricLeukemia annotations" width="1000"/>



## Prerequisites
**Installations**
* **Install highdicom:** [highdicom](https://highdicom.readthedocs.io/en/latest/introduction.html) was specifically designed to work with DICOM objects holding image-derived information, e.g. annotations and measurements. Detailed information on highdicom's functionality can be found in its [user guide](https://highdicom.readthedocs.io/en/latest/usage.html).
* **Install wsidicom:** The [wsidicom](https://pypi.org/project/wsidicom/) Python package provides functionality to open WSIs and extract image data or metadata from them.
* **Install idc-index:** The Python package [idc-index](https://pypi.org/project/idc-index/) facilitates queries of the basic metadata and download of DICOM files hosted by the IDC.

In [1]:
%%capture
!pip install highdicom
!pip install wsidicom
!pip install idc-index --upgrade

## Imports

In [2]:
import os
import highdicom as hd
from idc_index import IDCClient
import pandas as pd
from google.cloud import storage
from pathlib import Path
from typing import List, Union

## Finding the `BoneMarrowWSI-PediatricLeukemia` dataset on IDC
To access and download image and ANN files, we use the Python package [idc-index](https://github.com/ImagingDataCommons/idc-index) and fetch the `ann_index`, which is specific to DICOM ANN objects.

In [3]:
idc_client = IDCClient() # set-up idc_client
idc_client.fetch_index('ann_index') # fetch index for ANN objects

First, we count the number of slides (i.e., distinct SeriesInstanceUIDs with modality SM) available in the `BoneMarrowWSI-PediatricLeukemia` collection:

In [4]:
query_slide_count = '''
SELECT COUNT(DISTINCT SeriesInstanceUID)
FROM
    index
WHERE
    collection_id = 'bonemarrowwsi_pediatricleukemia' AND Modality='SM'
'''
print(idc_client.sql_query(query_slide_count))

   count(DISTINCT SeriesInstanceUID)
0                                246


Next, we have a look at the available annotation (ANN) files. The following query collects series-level information about ANN files from idc-index's `ann_index`.

In [9]:
query_anns = '''
SELECT
    ann_index.SeriesInstanceUID,
    ann_index.referenced_SeriesInstanceUID,
    index.SeriesDescription,
    index.StudyInstanceUID,
FROM
    ann_index
JOIN index
  ON index.SeriesInstanceUID = ann_index.SeriesInstanceUID
WHERE
    index.collection_id = 'bonemarrowwsi_pediatricleukemia'
ORDER BY
    ann_index.referenced_SeriesInstanceUID
'''
annotations = idc_client.sql_query(query_anns)
display(annotations)

Unnamed: 0,SeriesInstanceUID,SeriesDescription,StudyInstanceUID,referenced_SeriesInstanceUID
0,1.2.826.0.1.3680043.10.511.3.70465245248160893...,Session 0: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.32796004339808195676...,1.2.826.0.1.3680043.8.498.11363257719279170754...
1,1.2.826.0.1.3680043.10.511.3.91776015271042872...,Session 1: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.32796004339808195676...,1.2.826.0.1.3680043.8.498.11363257719279170754...
2,1.2.826.0.1.3680043.10.511.3.81099351624121702...,Monolayer regions of interest for cell classif...,1.2.826.0.1.3680043.8.498.32796004339808195676...,1.2.826.0.1.3680043.8.498.11363257719279170754...
3,1.2.826.0.1.3680043.10.511.3.94539084193943051...,Consensus: cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.32796004339808195676...,1.2.826.0.1.3680043.8.498.11363257719279170754...
4,1.2.826.0.1.3680043.10.511.3.75813793525322825...,Session 3: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.32796004339808195676...,1.2.826.0.1.3680043.8.498.11363257719279170754...
...,...,...,...,...
1028,1.2.826.0.1.3680043.10.511.3.12137144004756088...,Session 5: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.12485116558934689418...,1.2.826.0.1.3680043.8.498.99096006395418595522...
1029,1.2.826.0.1.3680043.10.511.3.38880129437460258...,Session 2: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.12485116558934689418...,1.2.826.0.1.3680043.8.498.99096006395418595522...
1030,1.2.826.0.1.3680043.10.511.3.12165779616169782...,Session 3: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.12485116558934689418...,1.2.826.0.1.3680043.8.498.99096006395418595522...
1031,1.2.826.0.1.3680043.10.511.3.53100158250225700...,Session 4: Cell bounding boxes with cell type ...,1.2.826.0.1.3680043.8.498.12485116558934689418...,1.2.826.0.1.3680043.8.498.99096006395418595522...


We can see that for each slide (i.e. each **referenced_SeriesInstanceUID**) there are multiple ANN Series. Looking at the **SeriesDescription**, we can confirm what is described in the [Background](#Background) section of this notebook.

*   Each slide has "Monolayer regions of interest for cell classification" annotations.
*   For some slides, there is one ANN Series with "Unlabeled cell bounding boxes", while for others, there are multiple ANN Series containing "Cell bounding boxes with cell type labels" for different annotation sessions and the consensus labels.

We will use this knowledge of the **SeriesDescription** later in this notebook to facilitate filtering directly for labeled or unlabeled cell annotations.
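
As a minimal sketch of this filtering idea, the helper below sorts a SeriesDescription into one of the three annotation kinds by keyword. The function name `classify_ann_series` and the category names are hypothetical, and the example strings are assembled from the phrases quoted above, so the exact descriptions stored in the index may differ.

```python
def classify_ann_series(series_description: str) -> str:
    """Map a SeriesDescription to 'roi', 'unlabeled_cells', 'labeled_cells' or 'unknown'."""
    desc = series_description.lower()
    if 'monolayer' in desc:
        return 'roi'
    # 'unlabeled' must be checked before 'labels' to keep the two cases apart
    if 'unlabeled' in desc:
        return 'unlabeled_cells'
    if 'labels' in desc:
        return 'labeled_cells'
    return 'unknown'


# Example descriptions (assumed wording, based on the phrases quoted above)
examples = [
    'Monolayer regions of interest for cell classification',
    'Unlabeled cell bounding boxes',
    'Session 0: Cell bounding boxes with cell type labels',
    'Consensus: cell bounding boxes with cell type labels',
]
for description in examples:
    print(f'{classify_ann_series(description):>16}  {description}')
```

The same keywords can be used in a SQL `LIKE` clause on the lower-cased SeriesDescription when querying the index.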



## Viewing annotations


Annotations can be viewed and explored in detail on their respective slide using the Slim viewer. At the bottom of the right sidebar in the Slim viewer's interface, select the ANN Series of interest from the drop-down menu, then click on `Annotation Groups` and switch the slider(s) to make the annotations visible.

In [11]:
from IPython.display import IFrame

viewer_url = idc_client.get_viewer_URL(studyInstanceUID=annotations['StudyInstanceUID'][0], viewer_selector='slim')
IFrame(viewer_url, width=1500, height=900)

## Accessing annotations

### Download complete annotation collection for local access
Since the complete set of annotation series is reasonably small, it can be downloaded in full using `idc_index` as shown below and then accessed from the local disk using `highdicom`.

In [12]:
dcm_ann_dir = Path('/content/dicom_ann_annotations')
os.makedirs(dcm_ann_dir, exist_ok=True)

idc_client.download_from_selection(downloadDir=dcm_ann_dir,
                                   seriesInstanceUID=annotations['SeriesInstanceUID'].tolist(), dirTemplate=None)

Downloading data:  93%|█████████▎| 17.1M/18.4M [00:02<00:00, 6.62MB/s]


For guidance on how to read the downloaded annotation files see section "Reading DICOM ANNs" of [this tutorial notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/microscopy_dicom_ann_intro.ipynb).
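
As a rough sketch of what reading the local files might look like, the snippet below lists the downloaded files and summarizes the annotation groups of a single file with `highdicom`. The helper names `list_ann_files` and `summarize_ann_file` are hypothetical, and the `.dcm` file suffix of the downloaded files is an assumption.

```python
from pathlib import Path
from typing import Dict, List


def list_ann_files(download_dir) -> List[Path]:
    """Collect all files with a .dcm suffix below download_dir (assumed naming)."""
    return sorted(Path(download_dir).rglob('*.dcm'))


def summarize_ann_file(path) -> List[Dict]:
    """Return one summary row per annotation group in a local ANN file."""
    import highdicom as hd  # imported lazily so the file listing works without it

    ann = hd.ann.annread(str(path))
    return [
        {
            'SeriesDescription': ann.SeriesDescription,
            'group_label': group.label,
            'n_annotations': group.number_of_annotations,
        }
        for group in ann.get_annotation_groups()
    ]


# Usage, assuming the download cell above has been run:
# for path in list_ann_files('/content/dicom_ann_annotations')[:3]:
#     print(summarize_ann_file(path))
```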

### Access annotations directly from the Cloud

A more desirable approach, especially for larger datasets, is to extract the relevant information directly from the objects in the cloud. The following functions `get_roi_annotations()` and `get_cell_annotations()` can be used for this approach. They extract and summarize ROI and cell annotations, respectively, in an easy-to-use pandas DataFrame.
Note that the respective annotation files, i.e. files containing ROI annotations or labeled or unlabeled cell annotations, are selected by filtering on the SeriesDescription.

The following two code cells define and use `get_roi_annotations()` to select all DICOM ANNs in the `BoneMarrowWSI-PediatricLeukemia` collection that contain ROI annotations of the monolayer area.
The resulting pandas DataFrame contains
- **'SeriesInstanceUID'**: SeriesInstanceUID of the DICOM ANN Series containing the ROI annotation.
- **'roi_id'**: the ID of the ROI
- **'roi_label'**: its label   
- **'roi_coordinates'**: the 2D coordinates in the image coordinate system of the referenced slide level
- **'reference_SeriesInstanceUID'** and **'reference_SOPInstanceUID'**: the SeriesInstanceUID and SOPInstanceUID of the slide level the annotations refer to. `reference_SeriesInstanceUID` can either be obtained from `ann_index` or read from the ANN file directly; for consistency with `reference_SOPInstanceUID`, the latter approach was chosen here.
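
Because the coordinates are plain corner points in the pixel coordinate system of the referenced slide level, simple geometric checks are straightforward. The sketch below tests whether one rectangle lies fully inside another, as one might do for a cell bounding box and its ROI. The helper names and all coordinate values are hypothetical and not taken from the collection.

```python
from typing import Sequence, Tuple

Point = Sequence[float]


def bounding_box(corners: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) corner points."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs), max(ys)


def is_inside(inner: Sequence[Point], outer: Sequence[Point]) -> bool:
    """True if the box spanned by `inner` lies fully within the box spanned by `outer`."""
    ix0, iy0, ix1, iy1 = bounding_box(inner)
    ox0, oy0, ox1, oy1 = bounding_box(outer)
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


# Hypothetical values for illustration only
roi = [[1000.0, 2000.0], [3048.0, 2000.0], [3048.0, 4048.0], [1000.0, 4048.0]]
cell = [[1500.0, 2500.0], [1650.0, 2500.0], [1650.0, 2640.0], [1500.0, 2640.0]]
print(bounding_box(roi))     # (1000.0, 2000.0, 3048.0, 4048.0)
print(is_inside(cell, roi))  # True
```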


In [38]:
def get_roi_annotations(ann_to_process: int = 10):
    # We use the term 'monolayer' to filter for ANNs with ROIs in the SeriesDescription
    query_roi_anns = f'''
    SELECT
        ann_index.SeriesInstanceUID
    FROM
        ann_index
    JOIN index
      ON index.SeriesInstanceUID = ann_index.SeriesInstanceUID
    WHERE
        index.collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND LOWER(index.SeriesDescription) LIKE '%monolayer%'
    ORDER BY
        ann_index.SeriesInstanceUID
    LIMIT {ann_to_process}
    '''
    roi_series = idc_client.sql_query(query_roi_anns)['SeriesInstanceUID']
    rois = extract_rois(roi_series)
    return rois


def extract_rois(series_uids: List[str]) -> pd.DataFrame:
    gcs_client = storage.Client.create_anonymous_client()
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location='gcs')
        for file_url in file_urls:
            (_,_, bucket_name, folder_name, file_name) = file_url.split('/')
            bucket = gcs_client.bucket(bucket_name)
            blob = bucket.blob(f'{folder_name}/{file_name}')

            with blob.open('rb') as file_obj:
                ann = hd.ann.annread(file_obj)
                for ann_group in ann.get_annotation_groups():
                    coords = ann_group.get_graphic_data(coordinate_type='2D')
                    m_names, m_values, m_units = ann_group.get_measurements()
                    for c, m in zip(coords, m_values):
                        rows.append({
                            'SeriesInstanceUID': ann.SeriesInstanceUID,
                            'roi_id': int(m[0]),  # first measurement value holds the ROI ID
                            'roi_label': ann_group.label,
                            'roi_coordinates': c,
                            'reference_SeriesInstanceUID': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                            'reference_SOPInstanceUID': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID,
                        })
    rois = pd.DataFrame(rows)
    return rois

In [39]:
# This code will run longer as you increase the number of ANN files to be processed
rois = get_roi_annotations(ann_to_process=10)
display(rois)

Unnamed: 0,SeriesInstanceUID,roi_id,roi_label,roi_coordinates,reference_SeriesInstanceUID,reference_SOPInstanceUID
0,1.2.826.0.1.3680043.10.511.3.10076145498370342...,1152,region_of_interest,"[[97220.0, 195768.0], [99268.0, 195768.0], [99...",1.2.826.0.1.3680043.8.498.32484296459223334560...,1.2.826.0.1.3680043.8.498.87476043951163326277...
1,1.2.826.0.1.3680043.10.511.3.10076145498370342...,1153,region_of_interest,"[[98892.0, 160318.0], [100940.0, 160318.0], [1...",1.2.826.0.1.3680043.8.498.32484296459223334560...,1.2.826.0.1.3680043.8.498.87476043951163326277...
2,1.2.826.0.1.3680043.10.511.3.10370508621567005...,1018,region_of_interest,"[[113423.0, 326838.0], [115471.0, 326838.0], [...",1.2.826.0.1.3680043.8.498.76034665511943175311...,1.2.826.0.1.3680043.8.498.10662747416486796732...
3,1.2.826.0.1.3680043.10.511.3.10370508621567005...,1019,region_of_interest,"[[30933.0, 380620.0], [32981.0, 380620.0], [32...",1.2.826.0.1.3680043.8.498.76034665511943175311...,1.2.826.0.1.3680043.8.498.10662747416486796732...
4,1.2.826.0.1.3680043.10.511.3.10509566419026275...,2037,region_of_interest,"[[48237.0, 89186.0], [50285.0, 89186.0], [5028...",1.2.826.0.1.3680043.8.498.17135957873481360405...,1.2.826.0.1.3680043.8.498.61055026268823152919...
5,1.2.826.0.1.3680043.10.511.3.10509566419026275...,2038,region_of_interest,"[[25890.0, 97475.0], [27938.0, 97475.0], [2793...",1.2.826.0.1.3680043.8.498.17135957873481360405...,1.2.826.0.1.3680043.8.498.61055026268823152919...
6,1.2.826.0.1.3680043.10.511.3.11224067190602751...,322,region_of_interest,"[[40807.0, 31720.0], [42855.0, 31720.0], [4285...",1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...
7,1.2.826.0.1.3680043.10.511.3.11224067190602751...,323,region_of_interest,"[[41251.0, 26908.0], [43299.0, 26908.0], [4329...",1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...
8,1.2.826.0.1.3680043.10.511.3.11224067190602751...,2017,region_of_interest,"[[25157.0, 41317.0], [30531.0, 41317.0], [3053...",1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...
9,1.2.826.0.1.3680043.10.511.3.11224067190602751...,2018,region_of_interest,"[[83712.0, 75137.0], [88720.0, 75137.0], [8872...",1.2.826.0.1.3680043.8.498.82223767803353692585...,1.2.826.0.1.3680043.8.498.72082594196695068782...


The following code cells define and use `get_cell_annotations()` to select all DICOM ANNs in the `BoneMarrowWSI-PediatricLeukemia` collection that contain cell annotations. By setting the parameter `subset` to `'labeled'`, `'unlabeled'` or `None`, it is possible to extract only labeled, only unlabeled, or all cell annotations.
The resulting pandas DataFrame contains
- **'SeriesInstanceUID'**: SeriesInstanceUID of the DICOM ANN Series containing the cell annotation.
- **'annotation_session'**: 'n/a' for the unlabeled cells, otherwise the number of the annotation session or 'consensus' for the final consensus.
- **'cell_id'**: the ID of the cell
- **'roi_id'**: if applicable, the ID of the monolayer ROI in which the cell is located
- **'cell_label_code_scheme'**: Tuple of code of the cell label and designator of the coding scheme, e.g. (414387006, SCT) which is code 414387006 from SNOMED CT ontology
- **'cell_label'**: Code meaning of the cell label defined in cell_label_code_scheme e.g. 'Structure of haematological system'. Have a look at the [SNOMED Browser](https://browser.ihtsdotools.org/?perspective=full&conceptId1=414387006&edition=MAIN&release=&languages=en) for this example.
- **'cell_coordinates'**: the 2D coordinates in the image coordinate system of the referenced slide level
- **'reference_SeriesInstanceUID'** and **'reference_SOPInstanceUID'**: the SeriesInstanceUID and SOPInstanceUID of the slide level the annotations refer to. `reference_SeriesInstanceUID` can either be obtained from `ann_index` or read from the ANN file directly; for consistency with `reference_SOPInstanceUID`, the latter approach was chosen here.
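
Many detection pipelines expect bounding boxes as `(x, y, width, height)` rather than as four corner points. The following sketch converts one `cell_coordinates` entry into that form; the helper name `to_xywh` is hypothetical, and the example corner points are those of cell 4648 as printed in the labeled-cell output of this notebook.

```python
from typing import Sequence, Tuple


def to_xywh(corners: Sequence[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Convert rectangle corner points to (x_min, y_min, width, height)."""
    xs = [float(p[0]) for p in corners]
    ys = [float(p[1]) for p in corners]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


# Corner points of cell 4648 from the labeled-cell output
coords = [[90693.0, 170083.0], [90867.0, 170083.0], [90867.0, 170251.0], [90693.0, 170251.0]]
print(to_xywh(coords))  # (90693.0, 170083.0, 174.0, 168.0)
```

The result is still expressed in pixels of the referenced slide level, so it has to be rescaled before being applied to other levels of the image pyramid.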

In [40]:
def get_cell_annotations(subset: Union[str, None] = None, ann_to_process: int = 10) -> pd.DataFrame:
    # subset: 'labeled', 'unlabeled', or None to select both
    if subset not in ('labeled', 'unlabeled', None):
        raise ValueError("subset must be 'labeled', 'unlabeled' or None")
    # Words matched against the lower-cased SeriesDescription. The None entry
    # matches all cell ANNs while excluding the monolayer ROI series.
    query_words = {'labeled': 'labels', 'unlabeled': 'unlabeled', None: 'cell bounding boxes'}
    query_word = query_words[subset]
    query_cell_anns = f'''
    SELECT
        ann_index.SeriesInstanceUID
    FROM
        ann_index
    JOIN index
      ON index.SeriesInstanceUID = ann_index.SeriesInstanceUID
    WHERE
        index.collection_id = 'bonemarrowwsi_pediatricleukemia'
        AND LOWER(index.SeriesDescription) LIKE '%{query_word}%'
    ORDER BY
        ann_index.SeriesInstanceUID
    LIMIT {ann_to_process}
    '''
    cell_series = idc_client.sql_query(query_cell_anns)['SeriesInstanceUID']
    cells = extract_cells(cell_series)
    return cells


def extract_cells(series_uids: List[str]) -> pd.DataFrame:
    gcs_client = storage.Client.create_anonymous_client()
    rows = []
    for series_uid in series_uids:
        file_urls = idc_client.get_series_file_URLs(seriesInstanceUID=series_uid, source_bucket_location='gcs')
        for file_url in file_urls:
            (_,_, bucket_name, folder_name, file_name) = file_url.split('/')
            bucket = gcs_client.bucket(bucket_name)
            blob = bucket.blob(f'{folder_name}/{file_name}')

            with blob.open('rb') as file_obj:
                ann = hd.ann.annread(file_obj)
                for ann_group in ann.get_annotation_groups():
                    coords = ann_group.get_graphic_data(coordinate_type='2D')
                    m_names, m_values, m_units = ann_group.get_measurements()
                    for c, m in zip(coords, m_values):
                        rows.append({
                            'SeriesInstanceUID': ann.SeriesInstanceUID,
                            'annotation_session': get_annotation_session(ann),
                            'cell_id': int(m[0]),
                            'roi_id': int(m[1]) if m.size > 1 else None, # allow empty roi_id,
                            'cell_label': ann_group.annotated_property_type.meaning,
                            'cell_label_code_scheme': (ann_group.annotated_property_type.value, ann_group.annotated_property_type.scheme_designator),
                            'cell_coordinates': c,
                            'reference_SeriesInstanceUID': ann.ReferencedSeriesSequence[0].SeriesInstanceUID,
                            'reference_SOPInstanceUID': ann.ReferencedImageSequence[0].ReferencedSOPInstanceUID
                        })
    cells = pd.DataFrame(rows)
    return cells


def get_annotation_session(ann: hd.ann.sop.MicroscopyBulkSimpleAnnotations) -> str:
    if 'unlabeled' in ann.SeriesDescription.lower():
        return 'n/a'
    return ann.SeriesDescription.split(':')[0]

In [41]:
# This code will run longer as you increase the number of ANN files to be processed
unlabeled_cells = get_cell_annotations(subset='unlabeled', ann_to_process=10)
display(unlabeled_cells)

Unnamed: 0,SeriesInstanceUID,annotation_session,cell_id,roi_id,cell_label,cell_label_code_scheme,cell_coordinates,reference_SeriesInstanceUID,reference_SOPInstanceUID
0,1.2.826.0.1.3680043.10.511.3.10323924147988442...,,114518,1174,Structure of haematological system,"(414387006, SCT)","[[92032.0, 215473.0], [92138.0, 215473.0], [92...",1.2.826.0.1.3680043.8.498.79371179824920998262...,1.2.826.0.1.3680043.8.498.81643514503975082734...
1,1.2.826.0.1.3680043.10.511.3.10323924147988442...,,114519,1174,Structure of haematological system,"(414387006, SCT)","[[91986.0, 215538.0], [92127.0, 215538.0], [92...",1.2.826.0.1.3680043.8.498.79371179824920998262...,1.2.826.0.1.3680043.8.498.81643514503975082734...
2,1.2.826.0.1.3680043.10.511.3.10323924147988442...,,114520,1174,Structure of haematological system,"(414387006, SCT)","[[92103.0, 215517.0], [92199.0, 215517.0], [92...",1.2.826.0.1.3680043.8.498.79371179824920998262...,1.2.826.0.1.3680043.8.498.81643514503975082734...
3,1.2.826.0.1.3680043.10.511.3.10323924147988442...,,114521,1174,Structure of haematological system,"(414387006, SCT)","[[92451.0, 215285.0], [92557.0, 215285.0], [92...",1.2.826.0.1.3680043.8.498.79371179824920998262...,1.2.826.0.1.3680043.8.498.81643514503975082734...
4,1.2.826.0.1.3680043.10.511.3.10323924147988442...,,114522,1174,Structure of haematological system,"(414387006, SCT)","[[92709.0, 215442.0], [92813.0, 215442.0], [92...",1.2.826.0.1.3680043.8.498.79371179824920998262...,1.2.826.0.1.3680043.8.498.81643514503975082734...
...,...,...,...,...,...,...,...,...,...
2052,1.2.826.0.1.3680043.10.511.3.20449813712117001...,,107523,1097,Structure of haematological system,"(414387006, SCT)","[[100110.0, 263966.0], [100302.0, 263966.0], [...",1.2.826.0.1.3680043.8.498.31088998307342714143...,1.2.826.0.1.3680043.8.498.98991581155690605814...
2053,1.2.826.0.1.3680043.10.511.3.20449813712117001...,,107524,1097,Structure of haematological system,"(414387006, SCT)","[[100814.0, 263911.0], [100991.0, 263911.0], [...",1.2.826.0.1.3680043.8.498.31088998307342714143...,1.2.826.0.1.3680043.8.498.98991581155690605814...
2054,1.2.826.0.1.3680043.10.511.3.20449813712117001...,,107525,1097,Structure of haematological system,"(414387006, SCT)","[[100782.0, 264035.0], [100938.0, 264035.0], [...",1.2.826.0.1.3680043.8.498.31088998307342714143...,1.2.826.0.1.3680043.8.498.98991581155690605814...
2055,1.2.826.0.1.3680043.10.511.3.20449813712117001...,,107526,1097,Structure of haematological system,"(414387006, SCT)","[[100683.0, 263013.0], [100858.0, 263013.0], [...",1.2.826.0.1.3680043.8.498.31088998307342714143...,1.2.826.0.1.3680043.8.498.98991581155690605814...


In [42]:
# This code will run longer as you increase the number of ANN files to be processed
labeled_cells = get_cell_annotations(subset='labeled', ann_to_process=10)
sorted_cell_labels = labeled_cells.sort_values(by=['reference_SOPInstanceUID', 'cell_id', 'annotation_session'])
display(sorted_cell_labels.style.hide(axis='index')) # don't show row index

SeriesInstanceUID,annotation_session,cell_id,roi_id,cell_label,cell_label_code_scheme,cell_coordinates,reference_SeriesInstanceUID,reference_SOPInstanceUID
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4648,39.0,Damage,"('37782003', 'SCT')",[[ 90693. 170083.]  [ 90867. 170083.]  [ 90867. 170251.]  [ 90693. 170251.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4648,39.0,Structure of haematological system,"('414387006', 'SCT')",[[ 90693. 170083.]  [ 90867. 170083.]  [ 90867. 170251.]  [ 90693. 170251.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4681,40.0,Structure of haematological system,"('414387006', 'SCT')",[[ 12468. 179750.]  [ 12617. 179750.]  [ 12617. 179937.]  [ 12468. 179937.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4681,40.0,Unusable - Quality renders image unusable,"('111235', 'DCM')",[[ 12468. 179750.]  [ 12617. 179750.]  [ 12617. 179937.]  [ 12468. 179937.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4687,40.0,Hypogranular white blood cell,"('250292003', 'SCT')",[[ 12818. 179412.]  [ 12938. 179412.]  [ 12938. 179540.]  [ 12818. 179540.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4687,40.0,Structure of haematological system,"('414387006', 'SCT')",[[ 12818. 179412.]  [ 12938. 179412.]  [ 12938. 179540.]  [ 12818. 179540.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4690,40.0,Damage,"('37782003', 'SCT')",[[ 12933. 179411.]  [ 13096. 179411.]  [ 13096. 179616.]  [ 12933. 179616.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4690,40.0,Structure of haematological system,"('414387006', 'SCT')",[[ 12933. 179411.]  [ 13096. 179411.]  [ 13096. 179616.]  [ 12933. 179616.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4697,40.0,Neutrophil with Cytoplasmic Hypogranularity,"('C37174', 'NCIt')",[[ 13086. 179884.]  [ 13225. 179884.]  [ 13225. 180005.]  [ 13086. 180005.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920
1.2.826.0.1.3680043.10.511.3.10273357893572711856545022632138034,Session 4,4697,40.0,Structure of haematological system,"('414387006', 'SCT')",[[ 13086. 179884.]  [ 13225. 179884.]  [ 13225. 180005.]  [ 13086. 180005.]],1.2.826.0.1.3680043.8.498.21282381063379764723444839472487253786,1.2.826.0.1.3680043.8.498.11740672268929623333414774170922030920


## How to use the `BoneMarrowWSI-PediatricLeukemia` annotations
The `BoneMarrowWSI-PediatricLeukemia` collection stands out due to the extensive amount of information contained in its annotations. More than 40000 cells are annotated with bounding boxes suitable for training **cell detection models**, and 28000 of those additionally received expert-generated class labels for **cell type classification** tasks. Particularly noteworthy is the uncertainty information embedded in the consensus labelling process, which gives insight into which cell types are particularly challenging to determine or easy to confuse with others.
In the cell below, we identify some of those cases by selecting cells that received more than one distinct label:

In [43]:
grouped_cell_labels = sorted_cell_labels.groupby('cell_id').agg({'cell_label': list, 'cell_label_code_scheme': list,
                                                              'reference_SOPInstanceUID': 'first',
                                                              'cell_coordinates': 'first'})
uncertain = grouped_cell_labels['cell_label'].apply(lambda x: len(set(x)) > 1)
display(grouped_cell_labels[uncertain])

Unnamed: 0_level_0,cell_label,cell_label_code_scheme,reference_SOPInstanceUID,cell_coordinates
cell_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1519,"[Band neutrophil, Structure of haematological ...","[(702697008, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[67463.0, 181988.0], [67607.0, 181988.0], [67..."
1553,"[Hypogranular white blood cell, Structure of h...","[(250292003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[68431.0, 181909.0], [68591.0, 181909.0], [68..."
1644,"[Hypogranular white blood cell, Structure of h...","[(250292003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[67903.0, 183726.0], [68035.0, 183726.0], [68..."
1714,"[Structure of haematological system, Blast cell]","[(414387006, SCT), (312256009, SCT)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[68699.0, 182652.0], [68812.0, 182652.0], [68..."
1835,"[Structure of haematological system, Unusable ...","[(414387006, SCT), (111235, DCM)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[86437.0, 189648.0], [86554.0, 189648.0], [86..."
1926,"[Damage, Structure of haematological system]","[(37782003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.80225091671086795258...,"[[85642.0, 189293.0], [85691.0, 189293.0], [85..."
4648,"[Damage, Structure of haematological system]","[(37782003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.11740672268929623333...,"[[90693.0, 170083.0], [90867.0, 170083.0], [90..."
4681,"[Structure of haematological system, Unusable ...","[(414387006, SCT), (111235, DCM)]",1.2.826.0.1.3680043.8.498.11740672268929623333...,"[[12468.0, 179750.0], [12617.0, 179750.0], [12..."
4687,"[Hypogranular white blood cell, Structure of h...","[(250292003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.11740672268929623333...,"[[12818.0, 179412.0], [12938.0, 179412.0], [12..."
4690,"[Damage, Structure of haematological system]","[(37782003, SCT), (414387006, SCT)]",1.2.826.0.1.3680043.8.498.11740672268929623333...,"[[12933.0, 179411.0], [13096.0, 179411.0], [13..."
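
Going one step further, one could count how often two labels were assigned to the same cell, which hints at label pairs that tend to co-occur. The sketch below does this on a handful of hypothetical `(cell_id, cell_label)` records that mimic rows of the labeled-cell table; the records are made up for illustration and do not reflect actual counts in the collection.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (cell_id, cell_label) records mimicking rows of the labeled-cell table
records = [
    (1, 'Blast cell'), (1, 'Damage'),
    (2, 'Blast cell'), (2, 'Blast cell'),
    (3, 'Damage'), (3, 'Blast cell'),
    (4, 'Band neutrophil'), (4, 'Damage'),
]

# Collect the set of distinct labels given to each cell
labels_per_cell = defaultdict(set)
for cell_id, label in records:
    labels_per_cell[cell_id].add(label)

# Count each unordered pair of distinct labels once per cell
pair_counts = Counter()
for labels in labels_per_cell.values():
    # sorted() makes (A, B) and (B, A) count as the same pair
    for pair in combinations(sorted(labels), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

With real data, the records could be taken from the `cell_id` and `cell_label` columns of the labeled-cell DataFrame.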


## Next steps

Share your feedback or ask questions about this notebook in the IDC Forum: https://discourse.canceridc.dev.

If you are interested in tissue type annotations or want to learn about DICOM Structured Reporting, you can take a look at [this notebook](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/rms_mutation_prediction/RMS-Mutation-Prediction-Expert-Annotations_exploration.ipynb) navigating expert-generated region annotations for rhabdomyosarcoma tumor slides.