<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC Google Colab cookbook notebook

This notebook collects short, reusable snippets that IDC users may find helpful when developing their own analysis notebooks.

Please email Andrey Fedorov (andrey dot fedorov at gmail dot com) if you have any questions or suggestions!

Prepared: Spring 2022

# Prerequisites

* To use Colab, and to access data in IDC, you will need a [Google Account](https://support.google.com/accounts/answer/27441?hl=en)
* Make sure your Colab instance has a GPU! To check, open "Runtime > Change runtime type" and make sure the GPU runtime is selected.
* To perform queries against IDC BigQuery tables you will need a Google Cloud project. You can get started with a free Google Cloud project by following these steps (they are also illustrated in [this short video](https://youtu.be/i08S0KJLnyw)):
  1. Go to https://console.cloud.google.com/ and accept the terms and conditions.
  2. Click the "Select a project" button in the upper left corner of the screen, and then click "New project".
  3. Open the console menu by clicking the ☰ menu icon in the upper left corner, and select "Dashboard". You will see information about your project, including your Project ID. Insert that project ID in the cell below in place of `REPLACE_ME_WITH_YOUR_PROJECT_ID`.

In [1]:
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "REPLACE_ME_WITH_YOUR_PROJECT_ID"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID
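
Before running any query, it can help to confirm that the placeholder has actually been replaced. The sketch below is not part of the original notebook: `looks_like_project_id` is a hypothetical helper that checks a value against the usual format of newly created Google Cloud project IDs (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen). It does not verify that the project exists or that you can access it.

```python
import re

PLACEHOLDER = "REPLACE_ME_WITH_YOUR_PROJECT_ID"

# 6-30 characters: a lowercase letter first, then lowercase letters,
# digits or hyphens, and no hyphen in the last position.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(project_id: str) -> bool:
    """Return True if the value is plausibly a Google Cloud project ID."""
    if project_id == PLACEHOLDER:
        return False
    return _PROJECT_ID_RE.fullmatch(project_id) is not None


# "my-idc-project-123" is a made-up ID used only for illustration.
for candidate in (PLACEHOLDER, "my-idc-project-123"):
    print(candidate, "->", looks_like_project_id(candidate))
```

In the notebook, a call such as `looks_like_project_id(my_ProjectID)` right after the cell above would catch the most common mistake, which is leaving the placeholder in place.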

# Authentication

In [2]:
# you will need to authenticate with your Google ID to do anything meaningful with IDC
from google.colab import auth
auth.authenticate_user()

# Query

First, instantiate the BigQuery client, which is then used to run queries.

In [3]:
# the Python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(project=my_ProjectID)

## Select by specific UID

In [7]:
# select rows corresponding to a specific patient, as defined by the PatientID value
# similarly, you can select by StudyInstanceUID, SeriesInstanceUID or SOPInstanceUID
# by replacing the PatientID line below with one of the following (as examples):
#   SOPInstanceUID = "1.3.6.1.4.1.14519.5.2.1.6450.2626.226637977389233552278537838820"
#   SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741"
#   StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785"
selection_query = """
  SELECT
    StudyInstanceUID,
    SeriesInstanceUID,
    SOPInstanceUID,
    gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    PatientID = "R01-001"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()
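
Swapping the `WHERE` line by hand is easy to get wrong (a stray tab or quote inside the UID is enough to return no rows). As a minimal sketch, assuming you only ever filter on one of the three UID columns mentioned above, the hypothetical helper `build_uid_query` below builds the same query for a given column and UID. It rejects unknown column names and values that do not look like a DICOM UID (digits separated by dots, at most 64 characters) before inserting them into the query text.

```python
import re

UID_COLUMNS = {"StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"}
_UID_RE = re.compile(r"[0-9]+(\.[0-9]+)*")


def build_uid_query(column: str, uid: str) -> str:
    """Build a dicom_all query that filters on a single UID column."""
    if column not in UID_COLUMNS:
        raise ValueError(f"unsupported column: {column!r}")
    uid = uid.strip()  # drop accidental whitespace, e.g. from copy and paste
    if len(uid) > 64 or _UID_RE.fullmatch(uid) is None:
        raise ValueError(f"not a valid DICOM UID: {uid!r}")
    return (
        "SELECT StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, gcs_url\n"
        "FROM `bigquery-public-data.idc_current.dicom_all`\n"
        f'WHERE {column} = "{uid}"'
    )


example_query = build_uid_query(
    "SeriesInstanceUID",
    "1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741",
)
print(example_query)
```

The returned string can be passed to `bq_client.query(...)` in the same way as `selection_query`.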

## Select by availability of segmentations

What segmentations do we have anyway? Let's look at the distinct combinations of segmentation property category, type and anatomic location, which are the metadata attributes that describe segmentations.

In [None]:
%%bigquery --project=$my_ProjectID

SELECT DISTINCT
  SegmentedPropertyCategory.CodeMeaning as SegmentedPropertyCategory_CodeMeaning,
  SegmentedPropertyType.CodeMeaning as SegmentedPropertyType_CodeMeaning,
  AnatomicRegion.CodeMeaning as AnatomicRegion_CodeMeaning
FROM
  `bigquery-public-data.idc_current.segmentations`

| | SegmentedPropertyCategory_CodeMeaning | SegmentedPropertyType_CodeMeaning | AnatomicRegion_CodeMeaning |
|---|---|---|---|
| 0 | Morphologically Altered Structure | Neoplasm, Primary | hypopharynx |
| 1 | Morphologically Altered Structure | Neoplasm, Primary | base of tongue |
| 2 | Morphologically Altered Structure | Lesion | Peripheral zone of the prostate |
| 3 | Morphologically Altered Structure | Neoplasm, Primary | Head and Neck |
| 4 | Anatomical Structure | Spinal cord | |
| 5 | Morphologically Altered Structure | Neoplasm, Secondary | lymph node of head and neck |
| 6 | Anatomical Structure | Lung | |
| 7 | Spatial and Relational Concept | Reference Region | Cerebellum |
| 8 | Morphologically Altered Structure | Edema | Brain |
| 9 | Anatomical Structure | Kidney | |


Select all rows that correspond to the instances of segmentations of anything in the prostate.

In [None]:
# select rows corresponding to segmentation instances where either the segmented
# property type or the anatomic region mentions the prostate (note: LIKE is case-sensitive)
selection_query = """
  SELECT
    dicom_all.StudyInstanceUID,
    dicom_all.SeriesInstanceUID,
    dicom_all.SOPInstanceUID,
    gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all` as dicom_all
  JOIN
    `bigquery-public-data.idc_current.segmentations` as segmentations
  ON
    dicom_all.SOPInstanceUID = segmentations.SOPInstanceUID
  WHERE
    segmentations.SegmentedPropertyType.CodeMeaning LIKE "%prostate%" OR
    segmentations.AnatomicRegion.CodeMeaning LIKE "%prostate%"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

In [None]:
selection_df

| | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 1 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 2 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 3 | 1.3.6.1.4.1.14519.5.2.1.7310.5101.130276529947... | 1.2.276.0.7230010.3.1.3.1070885483.11412.15991... | 1.2.276.0.7230010.3.1.4.1070885483.11412.15991... | gs://public-datasets-idc/59a0d450-f21a-433d-8a... |
| 4 | 1.3.6.1.4.1.14519.5.2.1.7310.5101.130276529947... | 1.2.276.0.7230010.3.1.3.1070885483.11412.15991... | 1.2.276.0.7230010.3.1.4.1070885483.11412.15991... | gs://public-datasets-idc/59a0d450-f21a-433d-8a... |
| ... | ... | ... | ... | ... |
| 525 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.726872428105... | 1.2.276.0.7230010.3.1.3.1070885483.16388.15991... | 1.2.276.0.7230010.3.1.4.1070885483.16388.15991... | gs://public-datasets-idc/864543fe-9efe-4515-85... |
| 526 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.726872428105... | 1.2.276.0.7230010.3.1.3.1070885483.16388.15991... | 1.2.276.0.7230010.3.1.4.1070885483.16388.15991... | gs://public-datasets-idc/864543fe-9efe-4515-85... |
| 527 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.236131511359... | 1.2.276.0.7230010.3.1.3.1070885483.17072.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17072.15991... | gs://public-datasets-idc/3fa71302-051c-4900-97... |
| 528 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.236131511359... | 1.2.276.0.7230010.3.1.3.1070885483.17072.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17072.15991... | gs://public-datasets-idc/3fa71302-051c-4900-97... |


# Visualization

In [8]:
# helper function to view a study or a specific series hosted by IDC
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

# use the first row of the query result as the example study and series
my_StudyInstanceUID = selection_df["StudyInstanceUID"].iloc[0]
my_SeriesInstanceUID = selection_df["SeriesInstanceUID"].iloc[0]

print("URL to view the entire study:")
print(get_idc_viewer_url(my_StudyInstanceUID))
print()
print("URL to view the specific series:")
print(get_idc_viewer_url(my_StudyInstanceUID, my_SeriesInstanceUID))

URL to view the entire study:
https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785

URL to view the specific series:
https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785?seriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.4334.1501.208304249098874719086628767652
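
A query result usually contains one row per instance, so the same series appears many times. The sketch below shows one way to produce a single viewer URL per distinct series. It repeats the URL helper from the cell above (with snake_case names) so that it runs on its own. `unique_series_urls` is a hypothetical helper, and the UIDs in the example are made up.

```python
def get_idc_viewer_url(study_uid, series_uid=None):
    """Same URL scheme as the helper in the cell above."""
    url = "https://viewer.imaging.datacommons.cancer.gov/viewer/" + study_uid
    if series_uid is not None:
        url += "?seriesInstanceUID=" + series_uid
    return url


def unique_series_urls(rows):
    """rows: iterable of (StudyInstanceUID, SeriesInstanceUID), one per instance."""
    seen = set()
    urls = []
    for study_uid, series_uid in rows:
        if (study_uid, series_uid) not in seen:
            seen.add((study_uid, series_uid))
            urls.append(get_idc_viewer_url(study_uid, series_uid))
    return urls


# Made-up UIDs for illustration: two instances of one series, one of another.
example_rows = [("1.2.3", "1.2.3.1"), ("1.2.3", "1.2.3.1"), ("1.2.3", "1.2.3.2")]
example_urls = unique_series_urls(example_rows)
for url in example_urls:
    print(url)
```

With the query result from this notebook, the rows could be taken from `selection_df[["StudyInstanceUID", "SeriesInstanceUID"]].itertuples(index=False)`.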


# Downloading

In [13]:
import os
os.environ["DOWNLOAD_DEST"] = "/content/IDC_downloads"
os.environ["MANIFEST"] = "/content/idc_manifest.txt"

In [14]:
!mkdir -p ${DOWNLOAD_DEST}
!echo "gsutil cp \$* $DOWNLOAD_DEST" > gsutil_download.sh
!chmod +x gsutil_download.sh

In [15]:
# creating a manifest file for the subsequent download of files
selection_df["gcs_url"].to_csv(os.environ["MANIFEST"], header=False, index=False)

In [16]:
%%capture
# download is this simple

!cat ${MANIFEST} | gsutil -m cp -I ${DOWNLOAD_DEST}

If you want to download a large number of files, you can additionally split the manifest into batches and run several `gsutil` processes in parallel, as illustrated below.

In [None]:
!cat ${MANIFEST} | xargs -n 25 -P 10 ./gsutil_download.sh
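
In the command above, `xargs -n 25` passes at most 25 URLs to each invocation of the script, and `-P 10` runs up to 10 invocations at the same time. The sketch below expresses the same batching idea in Python. It is not the notebook's method: `batches`, `gsutil_copy` and `download_in_parallel` are hypothetical names, and the example swaps in a stand-in copy function and made-up `gs://example-bucket` URLs, so nothing is downloaded when it runs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def gsutil_copy(urls, dest):
    """Copy one batch with a single gsutil call (requires gsutil on PATH)."""
    subprocess.run(["gsutil", "cp", *urls, dest], check=True)


def download_in_parallel(urls, dest, batch_size=25, workers=10, copy=gsutil_copy):
    """Run `copy` on each batch, with up to `workers` batches in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all batches and re-raises any worker exception
        list(pool.map(lambda batch: copy(batch, dest), batches(urls, batch_size)))


# Dry run: record the batch sizes instead of calling gsutil.
fake_urls = [f"gs://example-bucket/file{i}.dcm" for i in range(60)]
batch_sizes = []
download_in_parallel(
    fake_urls, "/tmp/dest", copy=lambda batch, dest: batch_sizes.append(len(batch))
)
print(sorted(batch_sizes))
```

To use it for a real download, omit the `copy` argument and pass the list of `gcs_url` values together with the destination directory.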

# Sorting

In [9]:
%%capture
!pip install pydicom
!git clone https://github.com/pieper/dicomsort
!sudo apt-get install -y dcmtk

In [18]:
import os
os.environ["SORTED_DEST"] = "/content/IDC_sorted"

!mkdir -p $SORTED_DEST
!rm -rf $SORTED_DEST/*
!python dicomsort/dicomsort.py -k -u $DOWNLOAD_DEST ${SORTED_DEST}/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm

100% 919/919 [00:04<00:00, 196.51it/s]
Files sorted
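
After sorting, it can be useful to check how many instances ended up in each series folder. The sketch below is an illustration rather than a feature of dicomsort. It assumes the `%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm` layout used above, and `count_instances_per_series` is a hypothetical helper. The example runs on a temporary directory with made-up names so that it does not depend on any downloaded data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_instances_per_series(sorted_root):
    """Return {(study_uid, series_uid): number of .dcm files} for a sorted tree."""
    counts = Counter()
    for path in Path(sorted_root).glob("*/*/*.dcm"):
        series_dir = path.parent
        counts[(series_dir.parent.name, series_dir.name)] += 1
    return dict(counts)


# Build a throwaway tree with the same study/series/instance layout.
with tempfile.TemporaryDirectory() as root:
    for study, series, sop in [("1.2", "1.2.1", "a"), ("1.2", "1.2.1", "b"), ("1.2", "1.2.2", "c")]:
        target = Path(root) / study / series
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{sop}.dcm").touch()
    demo_counts = count_instances_per_series(root)

print(demo_counts)
```

In this notebook the call would be `count_instances_per_series(os.environ["SORTED_DEST"])`.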
