<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started: Introduction to IDC data organization and main features

## Summary

[NCI Imaging Data Commons (IDC)](https://imaging.datacommons.cancer.gov) is a cloud-based repository of publicly available cancer imaging data co-located with the analysis and exploration tools and resources. IDC is a node within the broader [NCI Cancer Research Data Commons (CRDC)](https://datacommons.cancer.gov/) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.

Many datasets on IDC contain images and labels, which makes them useful for artificial intelligence research. You can view all available IDC data at https://imaging.datacommons.cancer.gov/explore/. You can interactively identify cases that contain annotations by selecting SEG (Segmentation object) and/or RTSTRUCT (RadioTherapy Structure set) in the Modality section of the search facets on the left.

This notebook focuses on programmatically identifying and retrieving IDC datasets that contain DICOM images and corresponding segmentation labels, which could be used to train models for auto-segmentation.

## Acknowledgements

This notebook was created by [Andrey Fedorov](https://github.com/fedorov) in response to interest from the [MONAI](https://monai.io/) Datasets Program in exploring the possibility of integrating IDC with MONAI.

If you leverage this notebook or any TCIA datasets in your work please be sure to comply with the Data Usage Policy of the individual collections, and please cite the IDC manuscript listed below.

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021). http://dx.doi.org/10.1158/0008-5472.CAN-21-0950

Initial version: Jul 20, 2022

Updated: Sept 2022


# Prerequisites

**VERY IMPORTANT**: To follow this notebook, you will need to complete the prerequisites described on this page: https://learn.canceridc.dev/introduction/getting-started-with-gcp.

# Getting started with exploring IDC data

Think of IDC as a library. Image files are books, and we have ~45 TB of those. When you go to a library, you want to check out just the books that you want to read. In order to find a book in a large library you need a catalog. 

In a similar way to the library catalog, IDC maintains a catalog that indexes a variety of metadata fields describing the files we curate. That metadata catalog is accessible in a large database table that you should be using to search and subset the images. Each row in that table corresponds to a file, and includes the location of the file alongside the metadata attributes describing that file.

Image files corresponding to the collections hosted by IDC are maintained in [GCP Storage](https://cloud.google.com/storage) buckets.

The image catalog is maintained in [GCP BigQuery](https://cloud.google.com/bigquery) tables. A tiny subset of the attributes indexed in the catalog is available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/).

Hosting of both the image files and the BigQuery table is sponsored by the Google Public Datasets program. You can see the IDC entry in the GCP Marketplace here: https://console.cloud.google.com/marketplace/product/bigquery-public-data/nci-idc-data.

Putting everything together, the process of retrieving the files consists of 2 steps.

* **Step 1: Define the list of files that need to be downloaded.** This can be done using either the IDC Portal, which provides an interactive search interface, or the BigQuery [`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table), which contains DICOM metadata extracted from those files and is accessible via the [standard SQL interface](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax). Since this notebook is about programmatic access to IDC data, we focus here on the SQL search interface rather than the capabilities of the IDC Portal. The `dicom_all` table contains one row per file and includes a column named `gcs_url`, which can be used to download the corresponding file.

* **Step 2: Download the files.** This can be done with a variety of tools that support the S3 interface to storage buckets. The most convenient options are the [`gsutil` command line tool](https://cloud.google.com/storage/docs/gsutil) from the Google Cloud SDK and the open source [`s5cmd` tool](https://github.com/peak/s5cmd), which may achieve better performance than `gsutil`.

In the following cells we will work through some of the main features of the "IDC catalog", run some queries to sample data from IDC, and will demonstrate how to download the indexed files.



## Authentication and initialization of GCP project ID

Before doing **anything** you **must** authorize the Colab runtime to act on your behalf, and initialize the variable pointing to your Google Cloud project ID.

The following cell initializes project ID that is needed for all operations with the cloud. You should have project ID if you completed the [prerequisites](https://learn.canceridc.dev/introduction/getting-started-with-gcp), as instructed earlier.


In [None]:
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "REPLACE_WITH_YOUR_PROJECT_ID"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
auth.authenticate_user()

## Introduction to IDC BigQuery `dicom_all` table: the IDC data catalog

[`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table) is perhaps the most important table that you will be using while searching IDC. 

Each row in this table corresponds to a DICOM file that IDC has. There are two categories of the attributes (columns) in that table:
  1. Columns that correspond to the DICOM attributes encountered across all of the IDC data. If you want to explore the attributes that may be available for specific image types, but are reluctant to read the DICOM standard directly, the [Innolitics DICOM Browser](https://dicom.innolitics.com/ciods) provides a convenient interface for navigating DICOM.
  2. Various non-DICOM metadata attributes available for the files (e.g., those identifying the program or collection a specific file belongs to, cancer type, license or GCS URL).

Any of the attributes in `dicom_all` can be used to subset IDC data!

The easiest way to query `dicom_all` is using the [BigQuery SQL workspace](https://console.cloud.google.com/bigquery), where you can debug the query interactively, confirm the syntax is correct, and explore the content of the tables.

Once you know the query that you need to select the data, or once you develop sufficient familiarity with BigQuery SQL, you can run the queries directly in the notebook using the `%%bigquery` magic, as we do below. The result of the query will be saved into a pandas DataFrame. In the following cell we run a query that lists all distinct collection IDs available in IDC. Keep in mind that the query is **not** sensitive to either capitalization of the query keywords or indentation!

Note that the `%%bigquery` cells below take the project ID from the `my_ProjectID` variable via `--project=$my_ProjectID`, so make sure you have replaced the text `REPLACE_WITH_YOUR_PROJECT_ID` with the ID of your GCP project in the initialization cell above!

In [None]:
%%bigquery --project=$my_ProjectID
-- get a list of all available collections

SELECT 
  DISTINCT(collection_id) 
FROM 
  bigquery-public-data.idc_current.dicom_all

Unnamed: 0,collection_id
0,covid_19_ny_sbu
1,tcga_stad
2,tcga_read
3,pdmr_833975_119_r
4,tcga_skcm
...,...
146,tcga_dlbc
147,cptac_ccrcc
148,tcga_thca
149,tcga_chol


In the following cell we use two DICOM attributes to search for the relevant data: [`BodyPartExamined`](https://dicom.innolitics.com/ciods/mr-image/general-series/00180015) and [`Modality`](https://dicom.innolitics.com/ciods/mr-image/general-series/00080060).

In [None]:
%%bigquery --project=$my_ProjectID

-- find out the modalities contained in a given collection

SELECT
  DISTINCT(BodyPartExamined),
  Modality
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  collection_id = "nsclc_radiomics"

Unnamed: 0,BodyPartExamined,Modality
0,,RTSTRUCT
1,LUNG,SEG
2,LUNG,CT


Every image in IDC was collected for a "patient", or case. That "patient" in most cases is a human, but sometimes corresponds to a dog or a mouse for pre-clinical studies! The identifier for the patient is stored in the `PatientID` attribute.

To be clear, those identifiers have been modified as part of data ingestion to ensure that the true identity of the patient is protected.

In [None]:
%%bigquery --project=$my_ProjectID

-- find all PatientID values that belong to the nsclc_radiomics collection and have segmentation modality

SELECT
  DISTINCT(PatientID)
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  collection_id = "nsclc_radiomics"
  AND Modality="SEG"
ORDER BY
  PatientID

Unnamed: 0,PatientID
0,LUNG1-001
1,LUNG1-002
2,LUNG1-003
3,LUNG1-004
4,LUNG1-005
...,...
416,LUNG1-418
417,LUNG1-419
418,LUNG1-420
419,LUNG1-421


IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by `PatientID`) undergo imaging exams (or _studies_, in DICOM nomenclature). 

Each patient will have one or more studies, with each study identified uniquely by the attribute `StudyInstanceUID`. During each of the imaging studies one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. Imaging series are uniquely identified by `SeriesInstanceUID`.

Finally, each imaging series contains one or more _instances_, where each instance maps to a file. Most often, one instance corresponds to a single slice from a cross-sectional image. Individual instances are identified by unique `SOPInstanceUID` values.
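
The patient, study, series and instance levels described above can be pictured as a nested structure. The following is only a minimal sketch on made-up rows: the identifiers are placeholders rather than real IDC UIDs, and `build_hierarchy` is a hypothetical helper, not part of any IDC tooling.

```python
from collections import defaultdict

# Made-up rows standing in for dicom_all records (one row per file).
# The identifiers are placeholders, not real DICOM UIDs.
rows = [
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se1", "SOPInstanceUID": "i1"},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se1", "SOPInstanceUID": "i2"},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se2", "SOPInstanceUID": "i3"},
    {"PatientID": "P2", "StudyInstanceUID": "st2", "SeriesInstanceUID": "se3", "SOPInstanceUID": "i4"},
]


def build_hierarchy(records):
    """Nest flat per-file records as patient -> study -> series -> [instances]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for r in records:
        tree[r["PatientID"]][r["StudyInstanceUID"]][r["SeriesInstanceUID"]].append(r["SOPInstanceUID"])
    return tree


hierarchy = build_hierarchy(rows)
for patient, studies in hierarchy.items():
    n_series = sum(len(series) for series in studies.values())
    n_instances = sum(len(inst) for series in studies.values() for inst in series.values())
    print(patient, "studies:", len(studies), "series:", n_series, "instances:", n_instances)
```

Because every row of `dicom_all` carries all four identifiers, the same grouping can be done directly in SQL with `GROUP BY`, which is what the queries in this notebook rely on.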

In the following we will select all distinct segmentation series that are available within the `nsclc_radiomics` collection.

In [None]:
%%bigquery --project=$my_ProjectID

-- find all DICOM segmentation series in the nsclc_radiomics collection

SELECT
  DISTINCT(SeriesInstanceUID)
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  collection_id = "nsclc_radiomics"
  AND Modality="SEG"

Unnamed: 0,SeriesInstanceUID
0,1.2.276.0.7230010.3.1.3.2323910823.22628.15972...
1,1.2.276.0.7230010.3.1.3.2323910823.23284.15972...
2,1.2.276.0.7230010.3.1.3.2323910823.22300.15972...
3,1.2.276.0.7230010.3.1.3.2323910823.25464.15972...
4,1.2.276.0.7230010.3.1.3.2323910823.9200.159725...
...,...
416,1.2.276.0.7230010.3.1.3.2323910823.23292.15972...
417,1.2.276.0.7230010.3.1.3.2323910823.4748.159726...
418,1.2.276.0.7230010.3.1.3.2323910823.24588.15972...
419,1.2.276.0.7230010.3.1.3.2323910823.22048.15972...


## Different mechanisms available for querying IDC BigQuery tables

In the above, we used the `%%bigquery` magic to run queries. This approach is probably the most convenient for running queries, but it has limitations. Fortunately, those queries can be executed using a variety of approaches covered below, along with their strengths and weaknesses.


### `%%bigquery` magic

Pros:
* you only need to write the SQL query
* output can be redirected to a pandas DataFrame

Cons:
* you cannot parameterize the query
* you cannot use it outside of Colab without extra setup

**Example**

In [None]:
%%bigquery selection_df --project=$my_ProjectID 

SELECT 
  DISTINCT(collection_id)
FROM 
  bigquery-public-data.idc_current.dicom_all

In [None]:
selection_df

Unnamed: 0,collection_id
0,prostatex
1,apollo_5_lscc
2,rembrandt
3,spie_aapm_lung_ct_challenge
4,tcga_meso
...,...
130,breast_diagnosis
131,pseudo_phi_dicom_data
132,tcga_stad
133,tcga_acc


### BigQuery Python API

Pros:
* highly configurable
* works the same way outside of Colab

Cons:
* more lines of code to write for the same query as compared to `%%bigquery`

**Example**

In [None]:
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query = """
SELECT 
  DISTINCT(collection_id) 
FROM 
  bigquery-public-data.idc_current.dicom_all
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

Unnamed: 0,collection_id
0,cmb_lca
1,midrc_ricord_1b
2,cptac_ov
3,breast_mri_nact_pilot
4,tcga_lihc
...,...
146,icdc_glioma
147,prostate_3t
148,pancreas_ct
149,lung_phantom
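
One way the configurability of the Python API might be used is to pass values as named query parameters instead of editing the SQL text. The sketch below is an illustration under assumptions: `build_modality_query` and `run_modality_query` are hypothetical helper names, and `run_modality_query` is defined but not called here, since it needs the `google-cloud-bigquery` package, credentials and a valid project ID.

```python
def build_modality_query(table="bigquery-public-data.idc_current.dicom_all"):
    """Return SQL that uses named placeholders (@collection_id, @modality)."""
    return f"""
SELECT
  DISTINCT(SeriesInstanceUID)
FROM
  `{table}`
WHERE
  collection_id = @collection_id
  AND Modality = @modality
"""


def run_modality_query(project_id, collection_id, modality):
    """Run the parameterized query; requires google-cloud-bigquery and GCP credentials."""
    # Imported lazily so the rest of this sketch loads without the package installed.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("collection_id", "STRING", collection_id),
            bigquery.ScalarQueryParameter("modality", "STRING", modality),
        ]
    )
    client = bigquery.Client(project_id)
    return client.query(build_modality_query(), job_config=job_config).result().to_dataframe()


query = build_modality_query()
print(query)
```

In a Colab session that has been authenticated as described earlier, a call such as `run_modality_query(my_ProjectID, "nsclc_radiomics", "SEG")` would be expected to return the same kind of DataFrame as the examples above.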


### Cloud SDK `bq` tool

Pros:
* queries can be scripted without writing Python code
* convenient for generating download manifests
  * by "download manifests" we refer to text files that contain the list of fully resolved Google Cloud Storage URLs that can be passed to the `gsutil` tool for download

Cons:
* not convenient for experimenting and exploring
* by default, the result of the query will be capped at 100 rows: the cap should be set to a large number to avoid incomplete query result

**Example**

In [None]:
!echo "SELECT DISTINCT(collection_id) FROM bigquery-public-data.idc_current.dicom_all" > query.sql
!cat query.sql| bq query --use_legacy_sql=false --format=csv -n 10000000 --project_id=$my_ProjectID > query_result.csv


In [None]:
!cat query_result.csv

collection_id
lidc_idri
tcga_gbm
tcga_tgct
tcga_sarc
tcga_paad
nsclc_radiogenomics
tcga_kirc
nlst
duke_breast_cancer_mri
apollo_5_thym
phantom_fda
cptac_luad
qin_prostate_repeatability
midrc_ricord_1c
ct_colonography
tcga_uvm
lung_pet_ct_dx
cmb_lca
midrc_ricord_1b
icdc_glioma
pseudo_phi_dicom_data
prostate_3t
lung_phantom
pancreas_ct
tcga_coad
aapm_rt_mac
upenn_gbm
dro_toolkit
tcga_dlbc
covid_19_ny_sbu
tcga_read
tcga_stad
opc_radiomics
pancreatic_ct_cbct_seg
hnscc_3dct_rt
cmmd
rider_neuro_mri
mouse_astrocytoma
acrin_6698
cptac_brca
pdmr_997537_175_t
gbm_dsc_mri_dro
qin_gbm_treatment_response
tcga_blca
b_mode_and_ceus_liver
apollo_5_paad
acrin_fmiso_brain
victre
qiba_ct_1c
brain_tumor_progression
tcga_ov
tcga_hnsc
apollo_5_lscc
vestibular_schwannoma_seg
tcga_brca
htan_wustl
tcga_acc
htan_hms
lung_fused_ct_pathology
head_neck_cetuximab
ispy1
acrin_nsclc_fdg_pet
rembrandt
pdmr_833975_119_r
tcga_skcm
qin_brain_dsc_mri
lgg_1p19qdeletion
tcga_prad
tcga_thym
ct_lymph_nodes
rider_lung_ct
hnscc

## Using SQL to identify datasets of interest for image segmentation

Now let's use the API to build a list of how many patients exist in each collection with DICOM segmentation data (SEG/RTSTRUCT modalities) and then decide which collection(s) to download and visualize. 

The following query returns the Collection ID, `Modality`, `BodyPartExamined`, and number of patients with DICOM segmentations (SEG/RTSTRUCT) to help a researcher decide what collection(s) they want to download and use to train a segmentation model.

This is a more complex two-stage query that first identifies all series that have modality "SEG" or "RTSTRUCT", and next walks up in the DICOM model hierarchy to get summaries for the studies that contain those modalities.

In [None]:
%%bigquery --project=$my_ProjectID

-- the query below first identifies DICOM studies that contain SEG or RTSTRUCT
-- modalities, and then creates a summary of the number of patients, modalities and BodyPartExamined
-- values for the matching studies

WITH
  collections_with_seg_rtstruct AS (
  SELECT
    DISTINCT(StudyInstanceUID)
  FROM
    bigquery-public-data.idc_current.dicom_all
  WHERE
    Modality = "SEG"
    OR Modality = "RTSTRUCT" )
SELECT
  dicom_all.collection_id,
  COUNT(DISTINCT(PatientID)) AS PatientID_cnt,
  STRING_AGG(DISTINCT(Modality),",") AS Modality,
  STRING_AGG(DISTINCT(BodyPartExamined),",") AS BodyPartsExamined,
  ROUND(SUM(instance_size)/POW(1024,4),2) AS collection_size_TB
FROM
  bigquery-public-data.idc_current.dicom_all AS dicom_all
JOIN
  collections_with_seg_rtstruct
ON
  dicom_all.StudyInstanceUID = collections_with_seg_rtstruct.StudyInstanceUID
GROUP BY
  collection_id
ORDER BY
  PatientID_cnt DESC

Unnamed: 0,collection_id,PatientID_cnt,Modality,BodyPartsExamined,collection_size_TB
0,lidc_idri,875,"SEG,SR,CT",CHEST,0.11
1,opc_radiomics,605,"CT,RTSTRUCT",Head-and-Neck,0.06
2,hnscc,604,"CT,RTDOSE,RTSTRUCT,RTPLAN,PT",HEADNECK,0.07
3,nsclc_radiomics,422,"RTSTRUCT,CT,SEG",LUNG,0.03
4,pediatric_ct_seg,359,"CT,RTSTRUCT",ABDOMEN,0.06
5,head_neck_pet_ct,298,"RTSTRUCT,RTDOSE,RTPLAN,CT,PT,REG",,0.07
6,vestibular_schwannoma_seg,242,"RTPLAN,RTSTRUCT,MR,RTDOSE",BRAIN,0.03
7,c4kc_kits,210,"CT,SEG","ABD PEL,ABD PELV,WO INTER,CT 3PHASE REN,CAP,AB...",0.04
8,ispy1,207,"SR,SEG,MR",BREAST,0.06
9,lgg_1p19qdeletion,159,"SEG,MR",BRAIN,0.0
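
To make the two-stage logic of the query above easier to follow, here is the same idea replayed in plain Python on a handful of made-up rows. This is only a sketch: the collection names, patient IDs and study identifiers are invented for illustration, and the size column is left out.

```python
from collections import defaultdict

# Made-up rows; each stands in for one file (row) of dicom_all.
rows = [
    {"collection_id": "col_a", "PatientID": "A1", "StudyInstanceUID": "s1", "Modality": "CT"},
    {"collection_id": "col_a", "PatientID": "A1", "StudyInstanceUID": "s1", "Modality": "SEG"},
    {"collection_id": "col_a", "PatientID": "A2", "StudyInstanceUID": "s2", "Modality": "CT"},
    {"collection_id": "col_b", "PatientID": "B1", "StudyInstanceUID": "s3", "Modality": "MR"},
    {"collection_id": "col_b", "PatientID": "B1", "StudyInstanceUID": "s3", "Modality": "RTSTRUCT"},
]

# Stage 1 (the WITH clause): studies containing at least one SEG or RTSTRUCT file.
studies_with_labels = {
    r["StudyInstanceUID"] for r in rows if r["Modality"] in ("SEG", "RTSTRUCT")
}

# Stage 2 (the JOIN and GROUP BY): summarize all files of those studies per collection.
summary = defaultdict(lambda: {"patients": set(), "modalities": set()})
for r in rows:
    if r["StudyInstanceUID"] in studies_with_labels:
        summary[r["collection_id"]]["patients"].add(r["PatientID"])
        summary[r["collection_id"]]["modalities"].add(r["Modality"])

for collection_id, s in sorted(summary.items()):
    print(collection_id, len(s["patients"]), ",".join(sorted(s["modalities"])))
```

Note that study `s2` is dropped in stage 1 because it has no SEG or RTSTRUCT file, so patient `A2` is not counted, while the CT and MR files of the remaining studies are still included in the modality summary.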


## DICOM Slide Microscopy (digital pathology images)

IDC contains digital pathology images stored in the [DICOM-TIFF dual personality format](https://learn.canceridc.dev/dicom/dicom-tiff-dual-personality-files), which are ingested into IDC as DICOM Slide Microscopy objects. Those images can be searched, visualized and downloaded using the same approach as radiology data.

To demonstrate this, in the following cell we query for a summary of all collections in IDC that contain Slide Microscopy modality, accompanied by their sizes and additional relevant collection-level attributes.

In [None]:
%%bigquery --project=$my_ProjectID

WITH
  collections_with_sm AS (
  SELECT
    DISTINCT(StudyInstanceUID)
  FROM
    bigquery-public-data.idc_current.dicom_all
  WHERE
    Modality = "SM" )
SELECT
  dicom_all.collection_id,
  COUNT(DISTINCT(PatientID)) AS PatientIDs,
  STRING_AGG(DISTINCT(Modality),",") AS Modalities,
  STRING_AGG(DISTINCT(BodyPartExamined),",") AS BodyPartsExamined,
  STRING_AGG(DISTINCT(Access),",") AS accessTypes,
  ROUND(SUM(instance_size)/POW(1024,4),2) AS collection_size_TB
FROM
  bigquery-public-data.idc_current.dicom_all AS dicom_all
JOIN
  collections_with_sm
ON
  dicom_all.StudyInstanceUID = collections_with_sm.StudyInstanceUID
GROUP BY
  collection_id
ORDER BY
  PatientIDs DESC

Unnamed: 0,collection_id,PatientIDs,Modalities,BodyPartsExamined,accessTypes,collection_size_TB
0,tcga_brca,1098,SM,,Public,1.56
1,tcga_gbm,607,SM,,Public,0.6
2,tcga_ov,590,SM,,Public,0.44
3,tcga_ucec,560,SM,,Public,1.0
4,tcga_kirc,537,SM,,Public,0.76
5,tcga_hnsc,523,SM,,Public,0.52
6,tcga_luad,522,SM,,Public,0.59
7,tcga_lgg,516,SM,,Public,1.09
8,tcga_thca,507,SM,,Public,0.7
9,tcga_lusc,504,SM,,Public,0.57


## Getting information about data usage terms

It is important to recognize that licenses and data usage terms vary across collections, and may also vary within a single collection. It is very important to comply with the data usage terms, and give proper (and required!) acknowledgements to the contributors of the individual datasets.

Fortunately, IDC metadata table contains information about the license at the granularity of individual files (rows in the IDC BigQuery catalog), and the details about usage terms (i.e., required citations) are typically present at the page pointed to by the Digital Object Identifier (DOI), which is also available at the granularity of the individual files.

Let's say you are interested in the LIDC collection. Below we get the distinct licenses and DOIs that apply to the items in that collection.

In [None]:
%%bigquery --project=$my_ProjectID

SELECT
  DISTINCT(license_short_name),
  source_doi,
  source_url
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  collection_id = "lidc_idri"

Unnamed: 0,license_short_name,source_doi,source_url
0,CC BY 3.0,10.7937/K9/TCIA.2015.1BUVFJR7,https://doi.org/10.7937/K9/TCIA.2015.1BUVFJR7
1,CC BY 3.0,10.7937/K9/TCIA.2015.LO9QL9SX,https://doi.org/10.7937/K9/TCIA.2015.LO9QL9SX
2,CC BY 3.0,10.7937/TCIA.2018.h7umfurq,https://doi.org/10.7937/TCIA.2018.h7umfurq


As you can see, all of the items are covered by the CC BY 3.0 license, but there are three distinct DOIs associated with the items included in the collection. The reason for this is that many collections include items that were contributed by the entity that collected and submitted the original content, but also analysis results contributed by another entity. You can (and should!) follow the URLs in the `source_url` column to familiarize yourself with the details and attribution requirements, whenever you use any content from IDC!

# Downloading IDC data

Hopefully, the previous section gave you some initial idea about how to navigate the "IDC catalog". Once you have narrowed down to the subset you need, the cloud locations of the files corresponding to the rows in the catalog are given by the `gcs_url` column, which contains the URL that can be used to download each file.

IDC documentation page on downloading data is here: https://learn.canceridc.dev/data/downloading-data, and in the following cells we will go over those steps.

In the following example we define the query that selects the rows of the table corresponding to the SEG series in the `C4KC-KiTS` collection, and then downloads the DICOM files corresponding to those rows from the cloud to the Colab VM. This time, for a change, we will use the BigQuery Python SDK to run the query.

So first step - run the selection query:



In [None]:
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query = """
SELECT 
  StudyInstanceUID, gcs_url 
FROM 
  bigquery-public-data.idc_current.dicom_all 
WHERE 
  collection_id = \"c4kc_kits\" 
  AND Modality = \"SEG\"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

Unnamed: 0,StudyInstanceUID,gcs_url
0,1.3.6.1.4.1.14519.5.2.1.6919.4624.225468060654...,gs://public-datasets-idc/826a5e9a-87cb-4bb3-bc...
1,1.3.6.1.4.1.14519.5.2.1.6919.4624.272299865959...,gs://public-datasets-idc/fbda7603-37c6-4ee7-ac...
2,1.3.6.1.4.1.14519.5.2.1.6919.4624.264826143622...,gs://public-datasets-idc/0fe88517-ff9b-460c-a0...
3,1.3.6.1.4.1.14519.5.2.1.6919.4624.746114323533...,gs://public-datasets-idc/eb4cb4fd-028f-458c-aa...
4,1.3.6.1.4.1.14519.5.2.1.6919.4624.112931913439...,gs://public-datasets-idc/c979a3e5-08d7-4e3d-bb...
...,...,...
205,1.3.6.1.4.1.14519.5.2.1.6919.4624.203830691945...,gs://public-datasets-idc/2c39893c-7b14-4516-9e...
206,1.3.6.1.4.1.14519.5.2.1.6919.4624.110805686568...,gs://public-datasets-idc/e9e590c8-7e19-47d2-b9...
207,1.3.6.1.4.1.14519.5.2.1.6919.4624.329466515207...,gs://public-datasets-idc/6bd390e5-49cb-498a-8a...
208,1.3.6.1.4.1.14519.5.2.1.6919.4624.212419072254...,gs://public-datasets-idc/5d937785-8dc2-450b-9e...


Second step - save the manifest:

In [None]:
selection_df["gcs_url"].to_csv("manifest.txt", header=False, index=False)

Third step - download the files using the manifest:

In [None]:


%%capture
!rm -rf downloaded_segs && mkdir downloaded_segs
!cat manifest.txt | gsutil -m cp -I downloaded_segs

If you want to download a large number of files, and gsutil seems to be slow, please check out alternatives to gsutil in the IDC documentation: https://learn.canceridc.dev/data/downloading-data.
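
As one illustration of such an alternative, the `s5cmd` tool mentioned earlier in this notebook can execute a file of `cp` commands. The sketch below only prepares those command lines from `gs://` URLs. It rests on assumptions that should be verified against the IDC and `s5cmd` documentation: that the bucket is reachable through the S3-compatible endpoint `https://storage.googleapis.com`, and that the invocation shown in the trailing comment is accepted by your `s5cmd` version. `gcs_to_s5cmd_commands` is a hypothetical helper name, and the URLs are placeholders.

```python
def gcs_to_s5cmd_commands(gcs_urls, destination="downloaded_segs/"):
    """Turn gs:// URLs into `cp` lines suitable for an s5cmd command file."""
    prefix = "gs://"
    commands = []
    for url in gcs_urls:
        if not url.startswith(prefix):
            raise ValueError(f"not a GCS URL: {url}")
        bucket_and_key = url[len(prefix):]
        commands.append(f"cp s3://{bucket_and_key} {destination}")
    return commands


# Placeholder URLs; in this notebook the real ones come from selection_df["gcs_url"].
example_urls = [
    "gs://public-datasets-idc/example-file-1.dcm",
    "gs://public-datasets-idc/example-file-2.dcm",
]
commands = gcs_to_s5cmd_commands(example_urls)
print("\n".join(commands))

# Assumed invocation once the lines are saved to s5cmd_manifest.txt
# (verify against the s5cmd and IDC documentation):
#   s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run s5cmd_manifest.txt
```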

If you are downloading multiple series/studies, you can use this tool to organize files into folders: https://github.com/pieper/dicomsort.

# Viewing studies of interest in IDC

Any of the studies or individual image series you downloaded can be opened in the IDC Viewer! 

You may remember we mentioned `StudyInstanceUID` as the unique identifier used in DICOM to refer to an imaging study. If you know that identifier for the study you want to visualize, opening it in the viewer is trivial.


## Viewing radiology images

In the following code, we define the function that helps you form the URL for the viewer, and visualize a random study from the result of the query above. You can pass this URL to your colleague, or bookmark it!

In [None]:
# helper function to view a study or a specific series hosted by IDC
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

import random
print(get_idc_viewer_url(random.choice(selection_df["StudyInstanceUID"].values)))

https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6919.4624.225468060654147110486404354267


## Viewing pathology images

Visualization of the slide microscopy images is very similar, except that the URL will be slightly different. The reason for this is that IDC relies on the OHIF Viewer for visualizing radiology images, and on the Slim Viewer for slide microscopy, since the radiology and pathology communities have somewhat different preferences for how they like to see images.

Let's first select `StudyInstanceUID`s corresponding to the slide microscopy image studies.

In [None]:

%%bigquery sm_studies --project=$my_ProjectID 

SELECT
  DISTINCT(StudyInstanceUID)
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  Modality = "SM"

The following cell will pick a random SM study, and will generate the viewer URL.

In [None]:
# helper function to view a slide microscopy study hosted by IDC
def get_idc_slim_viewer_url(studyUID):
  url = "https://viewer.imaging.datacommons.cancer.gov/slim/studies/"+studyUID
  return url

import random
print(get_idc_slim_viewer_url(random.choice(sm_studies["StudyInstanceUID"].values)))

https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.20396345804009928035460472238885694182


At the time of writing, IDC does not have any annotations for slide microscopy data. For examples of using DICOM SM data from IDC in analysis workflows, please see the pathomics-focused notebooks here: https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks/pathomics.

# Identifying image series corresponding to the segmentations

The following query exercises the provenance available within DICOM segmentation series attributes to match segmentations with the images those segmentations were generated from. If you want to know how and why this works - read the next section marked as "advanced", but if you just are interested in the result, you can skip it.

In [None]:
%%bigquery segs_with_referenced_image --project=$my_ProjectID 

WITH
  sampled_sops AS (
  SELECT
    collection_id,
    SeriesDescription,
    SeriesInstanceUID,
    SOPInstanceUID as seg_SOPInstanceUID,
    ReferencedSeriesSequence[SAFE_OFFSET(0)].ReferencedInstanceSequence[SAFE_OFFSET(0)].ReferencedSOPInstanceUID AS rss_one,
    ReferencedImageSequence[SAFE_OFFSET(0)].ReferencedSOPInstanceUID AS ris_one,
    SourceImageSequence[SAFE_OFFSET(0)].ReferencedSOPInstanceUID AS sis_one
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    Modality="SEG"
    AND SOPClassUID = "1.2.840.10008.5.1.4.1.1.66.4"
    AND Access = "Public"),
  coalesced_ref AS (
  SELECT
    *,
    COALESCE(rss_one,
      ris_one,
      sis_one) AS referenced_sop
  FROM
    sampled_sops)
SELECT
  dicom_all.collection_id,
  dicom_all.PatientID,
  dicom_all.SOPInstanceUID,
  segmentations.SegmentedPropertyCategory.CodeMeaning AS segmentation_category,
  segmentations.SegmentedPropertyType.CodeMeaning AS segmentation_type,
  segmentations.SegmentAlgorithmType AS segmentation_algorithm,
  dicom_all.StudyInstanceUID,
  coalesced_ref.SeriesInstanceUID AS seg_SeriesInstanceUID,
  dicom_all.SeriesInstanceUID AS ref_SeriesInstanceUID,
  dicom_all.Modality AS ref_Modality,
  CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID,"?seriesInstanceUID=",coalesced_ref.SeriesInstanceUID,",",dicom_all.SeriesInstanceUID) as viewer_url
FROM
  coalesced_ref
JOIN
  `bigquery-public-data.idc_current.dicom_all` AS dicom_all
ON
  coalesced_ref.referenced_sop = dicom_all.SOPInstanceUID
JOIN
  `bigquery-public-data.idc_current.segmentations` AS segmentations
ON
  segmentations.SOPInstanceUID = coalesced_ref.seg_SOPInstanceUID

In [None]:
segs_with_referenced_image[["collection_id", "segmentation_category","segmentation_type","segmentation_algorithm", "seg_SeriesInstanceUID", "ref_SeriesInstanceUID"]]


Unnamed: 0,collection_id,segmentation_category,segmentation_type,segmentation_algorithm,seg_SeriesInstanceUID,ref_SeriesInstanceUID
0,ispy1,Tissue,Breast,SEMIAUTOMATIC,1.3.6.1.4.1.14519.5.2.1.7695.1700.268186625666...,1.3.6.1.4.1.14519.5.2.1.7695.1700.211500920009...
1,ispy1,Tissue,Breast,SEMIAUTOMATIC,1.3.6.1.4.1.14519.5.2.1.7695.1700.228484230479...,1.3.6.1.4.1.14519.5.2.1.7695.1700.312032875865...
2,ispy1,Tissue,Breast,SEMIAUTOMATIC,1.3.6.1.4.1.14519.5.2.1.7695.1700.126117986062...,1.3.6.1.4.1.14519.5.2.1.7695.1700.156564945014...
3,ispy1,Tissue,Breast,SEMIAUTOMATIC,1.3.6.1.4.1.14519.5.2.1.7695.1700.646943712219...,1.3.6.1.4.1.14519.5.2.1.7695.1700.247815552390...
4,ispy1,Tissue,Breast,SEMIAUTOMATIC,1.3.6.1.4.1.14519.5.2.1.7695.1700.230738111333...,1.3.6.1.4.1.14519.5.2.1.7695.1700.321136429547...
...,...,...,...,...,...,...
20335,nsclc_radiomics,Anatomical Structure,Lung,SEMIAUTOMATIC,1.2.276.0.7230010.3.1.3.2323910823.18812.15972...,1.3.6.1.4.1.32722.99.99.1745927143060515205794...
20336,nsclc_radiomics,Anatomical Structure,Lung,SEMIAUTOMATIC,1.2.276.0.7230010.3.1.3.2323910823.18812.15972...,1.3.6.1.4.1.32722.99.99.1745927143060515205794...
20337,nsclc_radiomics,Anatomical Structure,Esophagus,MANUAL,1.2.276.0.7230010.3.1.3.2323910823.18812.15972...,1.3.6.1.4.1.32722.99.99.1745927143060515205794...
20338,nsclc_radiomics,Morphologically Altered Structure,"Neoplasm, Primary",MANUAL,1.2.276.0.7230010.3.1.3.2323910823.18812.15972...,1.3.6.1.4.1.32722.99.99.1745927143060515205794...




## Visual check of the selected image/segmentation sample

As explained earlier, to visualize segmentations over the referenced images you do not need to set up your own viewer or download anything. Given the `StudyInstanceUID` of the DICOM study containing the segmentation and the image it segments, you can populate an IDC viewer URL and see the segmentations in your browser! For convenience, we also generated `viewer_url` as a column in the table returned by the query above.

In [None]:
sample = segs_with_referenced_image.sample()

print("Random sample:\n-------\n")
print(". Collection: "+sample["collection_id"].values[0])
print(". PatientID: "+sample["PatientID"].values[0])
print(". Modality: "+ sample["ref_Modality"].values[0])
print(". Segmentation category: "+sample["segmentation_category"].values[0])
print(". Segmentation type: "+sample["segmentation_type"].values[0])
print(". Segmentation algorithm: "+sample["segmentation_algorithm"].values[0])

# if generating the viewer URL on your own, you can use the helper function we discussed earlier in this notebook
viewer_url = get_idc_viewer_url(sample["StudyInstanceUID"].values[0], seriesUID=sample["seg_SeriesInstanceUID"].values[0]+","+sample["ref_SeriesInstanceUID"].values[0])

print(". Viewer URL: "+viewer_url)


Random sample:
-------

. Collection: lidc_idri
. PatientID: LIDC-IDRI-0368
. Modality: CT
. Segmentation category: Morphological Abnormal Structure
. Segmentation type: Nodule
. Segmentation algorithm: MANUAL
. Viewer URL: https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6279.6001.267495169884268604035801498197?seriesInstanceUID=1.2.276.0.7230010.3.1.3.0.22691.1553305690.347661,1.3.6.1.4.1.14519.5.2.1.6279.6001.326084450802794941185111945926
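
The cell above joins the two series UIDs with a comma before building the viewer URL. A small variant of the helper that accepts a list of series UIDs makes this explicit. This is a sketch only: `build_viewer_url` is a hypothetical name, the UIDs are placeholders, and the comma-separated `seriesInstanceUID` format is taken from the URL pattern already used in this notebook.

```python
def build_viewer_url(study_uid, series_uids=None):
    """Build an IDC radiology viewer URL for a study, optionally limited to some series."""
    url = "https://viewer.imaging.datacommons.cancer.gov/viewer/" + study_uid
    if series_uids:
        # Multiple series are passed as one comma-separated query parameter.
        url += "?seriesInstanceUID=" + ",".join(series_uids)
    return url


# Placeholder UIDs for illustration only.
print(build_viewer_url("1.2.3"))
print(build_viewer_url("1.2.3", ["1.2.3.4", "1.2.3.5"]))
```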


## Advanced: How does the query above work?

One may assume that a segmentation series within a DICOM study corresponds to a single imaging series, and may wonder how to find that series. It is important to recognize that, in the general case, a segmentation may be derived from multiple imaging series, and may differ from them in orientation, resolution, and number of slices.

In the cases where segmentation series indeed corresponds to a single imaging series, it may not be easy to identify which one among several within a study. Segmentations can be stored in different objects (SEG vs RTSTRUCT), which in turn have different capabilities to communicate referenced image series. Finally, there are different implementations of the standard, and implementations that are not compliant with the standard.

All of that is to say that there is not a single DICOM attribute that can be used to identify the image series corresponding to the segmentation.

DICOM SEG objects encountered in IDC use one of three mechanisms to establish correspondence with the series being segmented:
1. `ReferencedSeriesSequence[0].SeriesInstanceUID`
2. `ReferencedImageSequence`
3. `SourceImageSequence`

It is also important to appreciate the meaning of the `FrameOfReferenceUID` attribute, which is the unique identifier that can be used to establish whether segmentation (encoded either as SEG or RTSTRUCT) and any image within the study share the same coordinate frame. Strictly speaking, when the standard is implemented correctly, you should be allowed to overlay two series if their `FrameOfReferenceUID` values match.

Note that when a reference to a source/referenced DICOM series is present, it means that this is the series that was used to create the segmentation. The segmentation is still expected to be applicable to other series that share the same `FrameOfReferenceUID`.
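
As a minimal sketch of the `FrameOfReferenceUID` check described above, the function below lists the image series in a study that share a frame of reference with a given segmentation. It operates on plain dictionaries whose keys mirror DICOM attribute names. That in-memory layout, the function name `overlay_candidates`, and the sample UIDs are assumptions made for illustration.

```python
def overlay_candidates(seg_series, study_series):
    """Return UIDs of image series sharing FrameOfReferenceUID with a segmentation.

    Both arguments use plain dicts with keys named after DICOM attributes
    (SeriesInstanceUID, Modality, FrameOfReferenceUID); this layout is an
    assumption of the sketch, not an IDC API.
    """
    seg_frame = seg_series.get("FrameOfReferenceUID")
    if not seg_frame:
        # Without a frame of reference nothing can be matched safely.
        return []
    return [
        s["SeriesInstanceUID"]
        for s in study_series
        if s.get("FrameOfReferenceUID") == seg_frame
        # Skip other annotation objects; we want images to overlay on.
        and s.get("Modality") not in ("SEG", "RTSTRUCT")
    ]


# Hypothetical study content, for illustration only.
study = [
    {"SeriesInstanceUID": "1.1", "Modality": "CT", "FrameOfReferenceUID": "9.1"},
    {"SeriesInstanceUID": "1.2", "Modality": "CT", "FrameOfReferenceUID": "9.2"},
    {"SeriesInstanceUID": "1.3", "Modality": "SEG", "FrameOfReferenceUID": "9.1"},
]
print(overlay_candidates(study[2], study))
```

A match on `FrameOfReferenceUID` only says that the coordinate frames agree. It does not say which series was actually used to produce the segmentation.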

Further, strictly speaking, per-frame references to the images used by segmentations are available in the `PerFrameFunctionalGroupsSequence`. However, for the datasets available in IDC, the approach discussed above is sufficient.

In [None]:
%%bigquery --project=$my_ProjectID

# find outlier SEG series (the cell magic must be the first line of the cell)

WITH
  refs_series_counted AS (
  SELECT
    collection_id,
    SeriesInstanceUID,
    SeriesDescription,
    ARRAY_LENGTH(ReferencedSeriesSequence) AS ref_len,
    ARRAY_LENGTH(ReferencedImageSequence) AS ref_img_len,
    ARRAY_LENGTH(SourceImageSequence) AS src_img_len,
    CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID) as viewer_url
    #ReferencedSeriesSequence[OFFSET(0)].SeriesInstanceUID
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    Modality="SEG"
    # this is important, since some of the series declare themselves as SEG via Modality
    #  attribute, but in fact are not Segmentations
    #  (you can experiment by commenting out the following line to see the collection
    #  containing the offending series! ;)
    AND SOPClassUID = "1.2.840.10008.5.1.4.1.1.66.4"
  ORDER BY
    ref_len ASC)
SELECT
  DISTINCT(collection_id)
FROM
  refs_series_counted
WHERE
  ref_len = 0
  AND ref_img_len = 0
  AND src_img_len = 0

Unnamed: 0,collection_id
0,qiba_ct_1c


To get more familiar with the data, we can take a few representative samples for each of the situations above, and take a look at the organization of DICOM metadata to help in parsing it.

In [None]:
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query = """
WITH
  refs_series_counted AS (
  SELECT
    collection_id,
    SeriesInstanceUID,
    SeriesDescription,
    ARRAY_LENGTH(ReferencedSeriesSequence) AS ref_len,
    ARRAY_LENGTH(ReferencedImageSequence) AS ref_img_len,
    ARRAY_LENGTH(SourceImageSequence) AS src_img_len,
    CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID) as viewer_url
    #ReferencedSeriesSequence[OFFSET(0)].SeriesInstanceUID
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    Modality="SEG"
    # this is important, since some of the series declare themselves as SEG via Modality
    #  attribute, but in fact are not Segmentations
    #  (you can experiment by commenting out the following line to see the collection
    #  containing the offending series! ;)
    AND SOPClassUID = "1.2.840.10008.5.1.4.1.1.66.4"
    AND Access = "Public"
  ORDER BY
    ref_len ASC)
SELECT
  ANY_VALUE(viewer_url) as viewer_url
FROM
  refs_series_counted
WHERE
  REPLACE_WITH_REF_TYPE <> 0
"""

selection_result = bq_client.query(selection_query.replace("REPLACE_WITH_REF_TYPE", "ref_len"))
ref_len_selection_df = selection_result.result().to_dataframe()

selection_result = bq_client.query(selection_query.replace("REPLACE_WITH_REF_TYPE", "ref_img_len"))
ref_img_len_selection_df = selection_result.result().to_dataframe()

selection_result = bq_client.query(selection_query.replace("REPLACE_WITH_REF_TYPE", "src_img_len"))
src_img_len_selection_df = selection_result.result().to_dataframe()



In [None]:

print("ReferencedSeriesSequence example: "+ref_len_selection_df["viewer_url"].values[0])
print("ReferencedImageSequence example: "+ref_img_len_selection_df["viewer_url"].values[0])
print("SourceImageSequence example: "+src_img_len_selection_df["viewer_url"].values[0])

ReferencedSeriesSequence example: https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178
ReferencedImageSequence example: https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.7695.1700.251725059071532256976802593346
SourceImageSequence example: https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.7695.1700.251725059071532256976802593346


For each of the URLs above, you can open the study in the IDC Viewer, and then click the "Tag Browser" button to look at the DICOM metadata for the SEG series.

In general, each of those mechanisms contains a list of sequence items that refer to the individual DICOM instances:

* `ReferencedSeriesSequence`: `ReferencedSeriesSequence > ReferencedInstanceSequence > ReferencedSOPInstanceUID`
* `ReferencedImageSequence`: `ReferencedImageSequence > ReferencedSOPInstanceUID`
* `SourceImageSequence`: `SourceImageSequence > ReferencedSOPInstanceUID`

Given the `ReferencedSOPInstanceUID` values located in those various places, we can look up the `SeriesInstanceUID` of the series that contains those instances, and use that series as the image corresponding to the segmentation.

For the sake of simplifying the matching task, we assumed that each SEG series references only one series, and we will use the first value of `ReferencedSOPInstanceUID` to dereference `SeriesInstanceUID`. Note that in the general case, a SEG series can reference instances that are from multiple series.
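
The matching strategy described above can be sketched in plain Python. The first function walks the three reference mechanisms and collects `ReferencedSOPInstanceUID` values. The second one takes the first value and looks up its series. The nested dict/list layout of `seg`, the `sop_to_series` lookup table, and both function names are assumptions made for this illustration; in the notebook the same lookup is done in BigQuery.

```python
def referenced_sop_instance_uids(seg):
    """Collect ReferencedSOPInstanceUID values from the three mechanisms.

    `seg` is a dict of nested lists/dicts keyed by DICOM attribute names,
    which is an assumed in-memory layout for this sketch.
    """
    uids = []
    # ReferencedSeriesSequence > ReferencedInstanceSequence > ReferencedSOPInstanceUID
    for series_item in seg.get("ReferencedSeriesSequence") or []:
        for inst in series_item.get("ReferencedInstanceSequence") or []:
            uids.append(inst["ReferencedSOPInstanceUID"])
    # ReferencedImageSequence and SourceImageSequence hold the UID one level down.
    for key in ("ReferencedImageSequence", "SourceImageSequence"):
        for inst in seg.get(key) or []:
            uids.append(inst["ReferencedSOPInstanceUID"])
    return uids


def referenced_series_uid(seg, sop_to_series):
    """Dereference the first referenced instance to its SeriesInstanceUID.

    Returns None when the SEG carries no references, or when the instance
    is missing from `sop_to_series` (SOPInstanceUID -> SeriesInstanceUID).
    """
    uids = referenced_sop_instance_uids(seg)
    if not uids:
        return None
    return sop_to_series.get(uids[0])
```

Using only the first referenced instance keeps the lookup cheap, at the cost of ignoring SEG series that reference instances from more than one series.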

# Data conversion

Next we will show how to convert the data from DICOM into some research formats that work with popular tools for visualizing and analyzing the data.

Although there are different tools available, in this example we will use the robust [`dcm2niix`](https://github.com/rordenlab/dcm2niix) tool to convert image series into NIfTI format, and the [`dcmqi`](https://github.com/QIICR/dcmqi) library to convert DICOM SEG series into the same NIfTI format.

## Install the conversion tools

In [None]:
!wget https://github.com/rordenlab/dcm2niix/releases/download/v1.0.20211006/dcm2niix_lnx.zip
!unzip dcm2niix_lnx.zip
!cp dcm2niix /usr/bin
!which dcm2niix

--2022-08-02 21:49:57--  https://github.com/rordenlab/dcm2niix/releases/download/v1.0.20211006/dcm2niix_lnx.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/25434012/44c8f69c-8f40-43a4-8928-42446850a78d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220802%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220802T214957Z&X-Amz-Expires=300&X-Amz-Signature=e58cf65580a8a280a8de55292754183d201a2603a331c1af2cbc41e6e55f22d9&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=25434012&response-content-disposition=attachment%3B%20filename%3Ddcm2niix_lnx.zip&response-content-type=application%2Foctet-stream [following]
--2022-08-02 21:49:57--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/25434012/44c8f69c-8f40-43a4-8928-42446850a78d?

In [None]:
!wget https://github.com/QIICR/dcmqi/releases/download/v1.2.5/dcmqi-1.2.5-linux.tar.gz
!tar zxf dcmqi-1.2.5-linux.tar.gz
!mv dcmqi-1.2.5-linux/bin/* /usr/bin
!which segimage2itkimage

--2022-08-02 21:50:29--  https://github.com/QIICR/dcmqi/releases/download/v1.2.5/dcmqi-1.2.5-linux.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/50675718/79d3ad95-9f0c-42a4-a1c5-bf5a63461894?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220802%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220802T215029Z&X-Amz-Expires=300&X-Amz-Signature=1668d4ac00d8beebfa26a9074a7bc80b49b4da94a7abd2e8942350f7e98f772f&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=50675718&response-content-disposition=attachment%3B%20filename%3Ddcmqi-1.2.5-linux.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-08-02 21:50:29--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/50675718/79d3ad95-9f0c-42a4-a1c5-bf5a6346189

## Download files corresponding to the segmentation and segmented image

In order to download files from IDC, we first need to create the manifest, and then download the files.

Our query resulted in a table that refers to the series corresponding to the segmentation and segmented image by their `SeriesInstanceUID` in the `seg_SeriesInstanceUID` and `ref_SeriesInstanceUID` columns. We can now use those UIDs to locate the files corresponding to the series.

Let's do it again for the same random sample selected in the previous cell. Here we follow the steps for downloading data from IDC described in the documentation page https://learn.canceridc.dev/data/downloading-data, using the sample query that selects the files corresponding to a specific series. While the documentation uses the `gcloud bq` command, here we will use the BigQuery Python API, since it makes it easy to parameterize the query.

First, let's remind ourselves what our sample is.


In [None]:
print("Random sample:\n-------\n")
print(". Collection: "+sample["collection_id"].values[0])
print(". PatientID: "+sample["PatientID"].values[0])
print(". Modality: "+ sample["ref_Modality"].values[0])
print(". Segmentation category: "+sample["segmentation_category"].values[0])
print(". Segmentation type: "+sample["segmentation_type"].values[0])
print(". Segmentation algorithm: "+sample["segmentation_algorithm"].values[0])

# if generating the viewer URL on your own, you can use the helper function we discussed earlier in this notebook
viewer_url = get_idc_viewer_url(sample["StudyInstanceUID"].values[0], seriesUID=sample["seg_SeriesInstanceUID"].values[0]+","+sample["ref_SeriesInstanceUID"].values[0])

print(". Viewer URL: "+viewer_url)

Random sample:
-------

. Collection: c4kc_kits
. PatientID: KiTS-00203
. Modality: CT
. Segmentation category: Anatomical Structure
. Segmentation type: Kidney
. Segmentation algorithm: SEMIAUTOMATIC
. Viewer URL: https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6919.4624.174492788567834321870551886230?seriesInstanceUID=1.2.276.0.7230010.3.1.3.0.76586.1588587457.457518,1.3.6.1.4.1.14519.5.2.1.6919.4624.334960895605466204042734854778


Next, download DICOM files corresponding to the image.

In [None]:
image_SeriesInstanceUID = sample["ref_SeriesInstanceUID"].values[0]

image_selection_query = f"""
SELECT
  gcs_url,
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID = \"{image_SeriesInstanceUID}\"
"""

image_selection_result = bq_client.query(image_selection_query)
image_selection_df = image_selection_result.result().to_dataframe()

# creating a manifest file for the subsequent download of files
image_selection_df["gcs_url"].to_csv("image_manifest.txt", header=False, index=False)

!mkdir -p image_files && cat image_manifest.txt | gsutil -m cp -I image_files

Copying gs://public-datasets-idc/d56e6902-4e01-409e-aef3-7507add780e0.dcm...
/ [0 files][    0.0 B/513.8 KiB]                                                Copying gs://public-datasets-idc/bfc9b0db-3357-4a73-87e1-818eac445663.dcm...
/ [0 files][    0.0 B/  1.0 MiB]                                                Copying gs://public-datasets-idc/401d096a-2881-4d20-b529-0abc835441ab.dcm...
Copying gs://public-datasets-idc/1ed96efe-c9a7-4429-8292-e08af82aa1cf.dcm...
Copying gs://public-datasets-idc/c7bf4e7d-7903-4b87-965e-0df1c8423f2e.dcm...
Copying gs://public-datasets-idc/253c5eb6-ba42-4f0a-b12f-454213e12bbf.dcm...
Copying gs://public-datasets-idc/2d96263d-9ff1-42f0-b287-282803260127.dcm...
Copying gs://public-datasets-idc/8554458c-9e5d-4de2-8cab-b7c7b2701ca9.dcm...
Copying gs://public-datasets-idc/043193f5-f1cc-4fb4-80ec-ed8666cde865.dcm...
Copying gs://public-datasets-idc/fcde5377-7deb-4768-ac09-99afbe4a74d7.dcm...
Copying gs://public-datasets-idc/0255cfb0-e3d7-4647-8860-2558cec7ee1

Next, convert the files corresponding to the image using `dcm2niix`.

In [None]:
!mkdir -p image_nifti && dcm2niix -o image_nifti image_files

Compression will be faster with 'pigz' installed
Chris Rorden's dcm2niiX version v1.0.20211006  (JP2:OpenJPEG) (JP-LS:CharLS) GCC7.5.0 x86-64 (64-bit Linux)
Found 620 DICOM file(s)
Convert 620 DICOM as image_nifti/image_files_CAP_W_IV_20030509103229_3 (512x512x620x1)
Conversion required 2.221893 seconds (1.984363 for core code).


In [None]:
seg_SeriesInstanceUID = sample["seg_SeriesInstanceUID"].values[0]

seg_selection_query = f"""
SELECT
  gcs_url,
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID = \"{seg_SeriesInstanceUID}\"
"""

seg_selection_result = bq_client.query(seg_selection_query)
seg_selection_df = seg_selection_result.result().to_dataframe()

# creating a manifest file for the subsequent download of files
seg_selection_df["gcs_url"].to_csv("seg_manifest.txt", header=False, index=False)

!mkdir -p seg_files && cat seg_manifest.txt | gsutil -m cp -I seg_files

Copying gs://public-datasets-idc/a67e7a27-f88b-428c-bdbc-5ab2e968269b.dcm...
- [1/1 files][ 39.5 MiB/ 39.5 MiB] 100% Done                                    
Operation completed over 1 objects/39.5 MiB.                                     
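
The two download cells above differ only in the series UID and in the output names, so the query construction can be factored into one function. The sketch below also checks that the value looks like a DICOM UID before placing it into the SQL text. `is_valid_dicom_uid` and `series_files_query` are hypothetical helpers, and the check is deliberately loose (digits and dots, at most 64 characters) rather than a full validation against the standard.

```python
import re

# Loose check: dot-separated groups of digits. DICOM limits UIDs to 64 characters.
_UID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)*")

IDC_TABLE = "bigquery-public-data.idc_current.dicom_all"


def is_valid_dicom_uid(uid):
    """Return True if `uid` consists of digits and dots and fits in 64 chars."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None


def series_files_query(series_uid, table=IDC_TABLE):
    """Build the query selecting `gcs_url` for all files of one series.

    Raises ValueError for values that do not look like a UID, so that
    arbitrary text never ends up inside the SQL string.
    """
    if not is_valid_dicom_uid(series_uid):
        raise ValueError(f"Not a valid DICOM UID: {series_uid!r}")
    return f"""
SELECT
  gcs_url
FROM
  `{table}`
WHERE
  SeriesInstanceUID = "{series_uid}"
"""
```

The returned string can be passed to `bq_client.query(...)` in the same way as `image_selection_query` and `seg_selection_query` above.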


Next, convert the file corresponding to the segmentation using the `segimage2itkimage` tool from `dcmqi`.

Note that DICOM Segmentation series consist of a single DICOM instance, and thus correspond to a single file. This is the case even when those segmentations contain multiple segments and cover multiple slices. The conversion tool accepts as input the name of the DICOM segmentation to convert.

In [None]:
!mkdir -p seg_nifti && segimage2itkimage --inputDICOM seg_files/`ls seg_files` --outputDirectory seg_nifti --outputType nifti

dcmqi repository URL: git@github.com:QIICR/dcmqi.git revision: 1153738 tag: v1.2.5
Row direction: 1 0 0
Col direction: 0 1 0
Z direction: 0 0 1
Total frames: 1240
Total frames with unique IPP: 620
Total overlapping frames: 620
Origin: [-162.682, -288.682, -600.5]


If we look at the content of the output directory, it should contain a NIfTI file for each of the segments, and a JSON file. The NIfTI files will contain the individual segments (labels), one per file, and the JSON file will have the DICOM metadata describing the content of the NIfTI files, indexed by the label ID.
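
To make the relationship between the JSON file and the NIfTI files concrete, here is a minimal sketch that pairs each segment described in `meta.json` with the file expected to hold it. It assumes that each NIfTI file is named `<labelID>.nii.gz`, which is consistent with the output of this notebook but is an assumption rather than a documented guarantee. The function name `segment_files` is hypothetical.

```python
import json
from pathlib import Path


def segment_files(output_dir):
    """Map each segment's labelID to its expected NIfTI path and SegmentLabel.

    Assumes the layout seen in this notebook: `meta.json` plus one
    `<labelID>.nii.gz` file per segment inside `output_dir`.
    """
    output_dir = Path(output_dir)
    meta = json.loads((output_dir / "meta.json").read_text())
    result = {}
    # segmentAttributes is a list of lists; each inner list holds segment dicts.
    for group in meta["segmentAttributes"]:
        for segment in group:
            label_id = segment["labelID"]
            result[label_id] = {
                "path": output_dir / f"{label_id}.nii.gz",
                "label": segment.get("SegmentLabel", ""),
            }
    return result
```

Calling `segment_files("seg_nifti")` after the conversion gives a dictionary that can be used to pick the file for a particular structure by its label.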

In [None]:

!ls seg_nifti

1.nii.gz  2.nii.gz  meta.json


In [None]:
!cat seg_nifti/meta.json

{
   "BodyPartExamined" : "ABDOMEN",
   "ClinicalTrialSeriesID" : "KiTS-00203",
   "ClinicalTrialTimePointID" : "-119",
   "ContentCreatorName" : "",
   "InstanceNumber" : "1",
   "SeriesDescription" : "Segmentation",
   "SeriesNumber" : "300",
   "segmentAttributes" : [
      [
         {
            "SegmentAlgorithmName" : "Custom",
            "SegmentAlgorithmType" : "SEMIAUTOMATIC",
            "SegmentDescription" : "Kidney",
            "SegmentLabel" : "Kidney",
            "SegmentedPropertyCategoryCodeSequence" : {
               "CodeMeaning" : "Anatomical Structure",
               "CodeValue" : "T-D000A",
               "CodingSchemeDesignator" : "SRT"
            },
            "SegmentedPropertyTypeCodeSequence" : {
               "CodeMeaning" : "Kidney",
               "CodeValue" : "T-71000",
               "CodingSchemeDesignator" : "SRT"
            },
            "SegmentedPropertyTypeModifierCodeSequence" : {
               "CodeMeaning" : "Right and left",
     

Here is how one can refer to the segment-level metadata that was loaded from DICOM.

In [None]:
import json

with open("seg_nifti/meta.json","r") as meta_f:
  meta = json.load(meta_f)

total_segments = len(meta["segmentAttributes"])

print(f"Total segments: {total_segments}")

for s in range(total_segments):
  print(f"\nMetadata for segment {s}")

  print(json.dumps(meta["segmentAttributes"][s][0], indent=2))

Total segments: 2

Metadata for segment 0
{
  "SegmentAlgorithmName": "Custom",
  "SegmentAlgorithmType": "SEMIAUTOMATIC",
  "SegmentDescription": "Kidney",
  "SegmentLabel": "Kidney",
  "SegmentedPropertyCategoryCodeSequence": {
    "CodeMeaning": "Anatomical Structure",
    "CodeValue": "T-D000A",
    "CodingSchemeDesignator": "SRT"
  },
  "SegmentedPropertyTypeCodeSequence": {
    "CodeMeaning": "Kidney",
    "CodeValue": "T-71000",
    "CodingSchemeDesignator": "SRT"
  },
  "SegmentedPropertyTypeModifierCodeSequence": {
    "CodeMeaning": "Right and left",
    "CodeValue": "G-A102",
    "CodingSchemeDesignator": "SRT"
  },
  "labelID": 1,
  "recommendedDisplayRGBValue": [
    255,
    86,
    123
  ]
}

Metadata for segment 1
{
  "SegmentAlgorithmName": "Custom",
  "SegmentAlgorithmType": "SEMIAUTOMATIC",
  "SegmentDescription": "Renal Tumor",
  "SegmentLabel": "Mass",
  "SegmentedPropertyCategoryCodeSequence": {
    "CodeMeaning": "Morphologically Altered Structure",
    "CodeValu

# Want to learn more?

* check out other notebooks: https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks
* join our community forum to ask any questions about IDC: https://discourse.canceridc.dev/
* ask your questions during live discussions with IDC developers at the IDC weekly office hours - join us on Google Meet at https://meet.google.com/xyt-vody-tvb every Tuesday 16:30 – 17:30 (New York) and Wednesday 10:30-11:30 (New York)
* browse IDC portal: https://imaging.datacommons.cancer.gov/explore/
* read IDC paper: http://dx.doi.org/10.1158/0008-5472.CAN-21-0950
* watch a recent presentation about IDC: https://youtu.be/P9ateg9ZUEs