<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC Google Colab cookbook notebook

This notebook collects short, reusable snippets that IDC users may find helpful when developing their own analysis notebooks.

Please email Andrey Fedorov (andrey dot fedorov at gmail dot com) if you have any questions or suggestions!

Prepared: Spring 2022

Updated: Sept 2023

# Prerequisites

Please complete the prerequisites as described in this documentation page: https://learn.canceridc.dev/introduction/getting-started-with-gcp.

Once you have a Google Cloud project, enter its project ID in the cell below.

In [1]:
#@title Enter your Project ID and authenticate with GCP
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "" #@param {type:"string"}

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
auth.authenticate_user()
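
If the project ID is left blank, the problem may only surface later, when the first query is submitted. A minimal sketch of an early check follows; the helper name `require_project_id` is hypothetical and is not part of any IDC or Google library.

```python
def require_project_id(project_id: str) -> str:
    """Return the stripped project ID, or raise if it was left blank."""
    cleaned = project_id.strip()
    if not cleaned:
        raise ValueError(
            "my_ProjectID is empty: enter your Google Cloud Project ID in the cell above."
        )
    return cleaned


# Example with a placeholder value (replace it with your own project ID).
print(require_project_id("  my-example-project "))
```

In the notebook, this could be called as `require_project_id(my_ProjectID)` right after the form cell.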

# Query

First, instantiate the BigQuery client, which is then used to run the queries.

In [8]:
# python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

## Select by specific UID

Queries below are against the [`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table) that has one row per DICOM file stored in IDC. That table contains the metadata extracted from DICOM files, collection-level attributes (e.g., ID of the collection, license, DOI of the collection), and URLs pointing to the location in the cloud where the file is stored.

In [None]:
# select rows corresponding to a specific patient, as defined by the PatientID value
# similarly, you can select by StudyInstanceUID, SeriesInstanceUID or SOPInstanceUID,
# replacing the PatientID line below with one of the following (as examples):
#   SOPInstanceUID = "1.3.6.1.4.1.14519.5.2.1.6450.2626.226637977389233552278537838820"
#   SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741"
#   StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785"
selection_query = """
  SELECT
    StudyInstanceUID,
    SeriesInstanceUID,
    SOPInstanceUID,
    instance_size,
    gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    PatientID = "R01-001"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()
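
Switching between `PatientID` and the UID attributes by hand-editing the query is easy to get wrong. The following is a minimal sketch of a helper that builds the same query text for any of these attributes and rejects malformed values. The names `build_selection_query`, `ALLOWED_ATTRIBUTES` and `UID_PATTERN` are hypothetical, and the UID check assumes the usual DICOM form (digits and dots, at most 64 characters).

```python
import re

# Attributes named in the comments above as possible selection keys.
ALLOWED_ATTRIBUTES = {
    "PatientID",
    "StudyInstanceUID",
    "SeriesInstanceUID",
    "SOPInstanceUID",
}

# Assumption: a DICOM UID is made of digits and dots, at most 64 characters.
UID_PATTERN = re.compile(r"[0-9.]{1,64}")


def build_selection_query(attribute: str, value: str) -> str:
    """Build the selection query text for one attribute/value pair."""
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"Unsupported attribute: {attribute}")
    if attribute.endswith("UID"):
        if not UID_PATTERN.fullmatch(value):
            raise ValueError(f"Not a valid DICOM UID: {value!r}")
    elif '"' in value or "\\" in value:
        # Refuse characters that could break out of the quoted SQL literal.
        raise ValueError(f"Unsupported characters in value: {value!r}")
    return (
        "SELECT StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, instance_size, gcs_url\n"
        "FROM `bigquery-public-data.idc_current.dicom_all`\n"
        f'WHERE {attribute} = "{value}"'
    )


print(build_selection_query("PatientID", "R01-001"))
```

The returned string can be passed to `bq_client.query(...)` exactly as `selection_query` is above. For values that come from untrusted input, BigQuery query parameters are likely the more robust choice than string formatting.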

In [None]:
# instance_size is in bytes; 1024**3 bytes = 1 GiB
size_gib = selection_df["instance_size"].sum() / 1024**3
print(f"Cohort size on disk: {size_gib:.2f} GiB")

## Select by availability of segmentations

What segmentations do we have anyway? Let's look at the distinct combinations of segmentation property category, type and anatomic location, which are the metadata attributes that describe segmentations.

In [None]:
%%bigquery --project=$my_ProjectID

SELECT DISTINCT
  SegmentedPropertyCategory.CodeMeaning AS SegmentedPropertyCategory_CodeMeaning,
  SegmentedPropertyType.CodeMeaning AS SegmentedPropertyType_CodeMeaning,
  AnatomicRegion.CodeMeaning AS AnatomicRegion_CodeMeaning
FROM
  `bigquery-public-data.idc_current.segmentations`

| | SegmentedPropertyCategory_CodeMeaning | SegmentedPropertyType_CodeMeaning | AnatomicRegion_CodeMeaning |
|---|---|---|---|
| 0 | Morphologically Altered Structure | Neoplasm, Primary | hypopharynx |
| 1 | Morphologically Altered Structure | Neoplasm, Primary | base of tongue |
| 2 | Morphologically Altered Structure | Lesion | Peripheral zone of the prostate |
| 3 | Morphologically Altered Structure | Neoplasm, Primary | Head and Neck |
| 4 | Anatomical Structure | Spinal cord | |
| 5 | Morphologically Altered Structure | Neoplasm, Secondary | lymph node of head and neck |
| 6 | Anatomical Structure | Lung | |
| 7 | Spatial and Relational Concept | Reference Region | Cerebellum |
| 8 | Morphologically Altered Structure | Edema | Brain |
| 9 | Anatomical Structure | Kidney | |


Select all rows that correspond to the instances of segmentations of anything in the prostate.

In [None]:
# select rows corresponding to segmentation instances that reference the prostate,
# either in the segmented property type or in the anatomic region
selection_query = """
  SELECT
    dicom_all.StudyInstanceUID,
    dicom_all.SeriesInstanceUID,
    dicom_all.SOPInstanceUID,
    dicom_all.gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all` AS dicom_all
  JOIN
    `bigquery-public-data.idc_current.segmentations` AS segmentations
  ON
    dicom_all.SOPInstanceUID = segmentations.SOPInstanceUID
  WHERE
    segmentations.SegmentedPropertyType.CodeMeaning LIKE "%prostate%" OR
    segmentations.AnatomicRegion.CodeMeaning LIKE "%prostate%"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()
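
Note that `LIKE` in BigQuery is case-sensitive, so the pattern `%prostate%` matches `Peripheral zone of the prostate` but would not match a capitalized value. Several values in the output above, such as `Lung` and `Kidney`, are capitalized, so this may matter. The sketch below mimics the comparison in plain Python to make the difference visible. The helper `like_contains` is hypothetical, and the value `Prostate` is an assumed example, not a value confirmed to exist in the table.

```python
def like_contains(value, needle, case_sensitive=True):
    """Mimic SQL `value LIKE '%needle%'` for a plain substring needle."""
    if value is None:
        # In SQL, NULL LIKE ... is NULL, which a WHERE clause treats as no match.
        return False
    if not case_sensitive:
        value, needle = value.lower(), needle.lower()
    return needle in value


# "Prostate" is a hypothetical value used only to show the case issue.
region_values = ["Peripheral zone of the prostate", "Head and Neck", "Prostate", None]

for value in region_values:
    sensitive = like_contains(value, "prostate")
    insensitive = like_contains(value, "prostate", case_sensitive=False)
    print(f"{value!r}: case-sensitive={sensitive}, case-insensitive={insensitive}")
```

In SQL, the case-insensitive variant would wrap the column in `LOWER(...)` before applying `LIKE "%prostate%"`.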

In [None]:
selection_df

| | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.316302757120... | 1.2.276.0.7230010.3.1.3.1070885483.17576.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17576.15991... | gs://public-datasets-idc/c591abfc-85c5-41d4-aa... |
| 1 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.316302757120... | 1.2.276.0.7230010.3.1.3.1070885483.17576.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17576.15991... | gs://public-datasets-idc/c591abfc-85c5-41d4-aa... |
| 2 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.316302757120... | 1.2.276.0.7230010.3.1.3.1070885483.17576.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17576.15991... | gs://public-datasets-idc/c591abfc-85c5-41d4-aa... |
| 3 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.740797223389... | 1.2.276.0.7230010.3.1.3.1070885483.12356.15991... | 1.2.276.0.7230010.3.1.4.1070885483.12356.15991... | gs://public-datasets-idc/b3469be1-c3fa-4b04-af... |
| 4 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.740797223389... | 1.2.276.0.7230010.3.1.3.1070885483.12356.15991... | 1.2.276.0.7230010.3.1.4.1070885483.12356.15991... | gs://public-datasets-idc/b3469be1-c3fa-4b04-af... |
| ... | ... | ... | ... | ... |
| 525 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.334472970012... | 1.2.276.0.7230010.3.1.3.1070885483.13052.15991... | 1.2.276.0.7230010.3.1.4.1070885483.13052.15991... | gs://public-datasets-idc/967d17d9-37a5-43d9-83... |
| 526 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.334472970012... | 1.2.276.0.7230010.3.1.3.1070885483.13052.15991... | 1.2.276.0.7230010.3.1.4.1070885483.13052.15991... | gs://public-datasets-idc/967d17d9-37a5-43d9-83... |
| 527 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.394942238206... | 1.2.276.0.7230010.3.1.3.1426846371.688.1513205... | 1.2.276.0.7230010.3.1.4.1426846371.688.1513205... | gs://public-datasets-idc/2272a95a-b436-4f50-91... |
| 528 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.394942238206... | 1.2.276.0.7230010.3.1.3.1426846371.688.1513205... | 1.2.276.0.7230010.3.1.4.1426846371.688.1513205... | gs://public-datasets-idc/2272a95a-b436-4f50-91... |


# Visualization

In [None]:
# helper function to view a study or a specific series hosted by IDC
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

my_StudyInstanceUID = selection_df["StudyInstanceUID"][0]
# the first row belongs to the study selected above, so its series is part of that study
my_SeriesInstanceUID = selection_df["SeriesInstanceUID"][0]

print("URL to view the entire study:")
print(get_idc_viewer_url(my_StudyInstanceUID))

URL to view the entire study:
https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.7311.5101.316302757120840825688456720609


# Downloading

To download data, you need to:

1. create a manifest listing the files you want to download
2. use the `s5cmd` tool to download the files defined by the manifest.

These steps are discussed in detail on this documentation page: https://learn.canceridc.dev/data/downloading-data.

You can generate the manifest using [IDC Portal](https://portal.imaging.datacommons.cancer.gov) or BigQuery SQL. In the following, we demonstrate the second approach, which provides maximum flexibility.

## Prerequisites: install `s5cmd`

`s5cmd` installation instructions are available here: https://github.com/peak/s5cmd#installation.

`s5cmd` is a command-line tool with no interactive interface. It takes the manifest and downloads the files listed in it very efficiently. Below, we download the binary package for Linux and install the `s5cmd` binary for subsequent use.

In [2]:
version = "s5cmd_2.2.2_Linux-64bit"
!wget https://github.com/peak/s5cmd/releases/download/v2.2.2/{version}.tar.gz
!tar zxf {version}.tar.gz
!mv s5cmd /usr/bin

--2023-09-23 02:36:49--  https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/e095ae85-9acf-4dcc-b744-128b3311849c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230923%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230923T023649Z&X-Amz-Expires=300&X-Amz-Signature=ad24031af8660f8377f32d71fd02f88dbb33fbc5803c92c935a2c5414e298dce&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=73909333&response-content-disposition=attachment%3B%20filename%3Ds5cmd_2.2.2_Linux-64bit.tar.gz&response-content-type=application%2Foctet-stream [following]
--2023-09-23 02:36:49--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/e095ae85-9acf-4dcc-b744-

The following cell confirms `s5cmd` is installed properly.

In [4]:
!s5cmd

NAME:
   s5cmd - Blazing fast S3 and local filesystem execution tool

USAGE:
   s5cmd [global options] command [command options] [arguments...]

COMMANDS:
   ls              list buckets and objects
   cp              copy objects
   rm              remove objects
   mv              move/rename objects
   mb              make bucket
   rb              remove bucket
   select          run SQL queries on objects
   du              show object size usage
   cat             print remote object content
   pipe            stream to remote from stdin
   run             run commands in batch
   sync            sync objects
   version         print version
   bucket-version  configure bucket versioning
   presign         print remote object presign url
   help, h         Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --credentials-file value       use the specified credentials file instead of the default credentials file
   --dry-run                      fake run; show wha

## Generate manifest

The following query generates a manifest suitable for `s5cmd`. It selects the files of every DICOM series in the study identified by a specific value of `StudyInstanceUID` and emits one `cp` command per series. You can use the `WHERE` clause to define other criteria, such as `Modality` or `collection_id`.

Note that in this case the files are downloaded and, at the same time, organized into the hierarchy `collection_id/PatientID/StudyInstanceUID/SeriesInstanceUID`, so there is no need to sort them after downloading.

In [3]:
# python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query ="""
SELECT
  # Organize the files in-place right after downloading
  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(aws_url, "(://.*)/"),"/* ",collection_id,"/",PatientID,"/",StudyInstanceUID,"/",SeriesInstanceUID)) AS s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  # Use any filtering criteria here
  StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571"
GROUP BY
  SeriesInstanceUID
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df.to_csv("/content/s5cmd_aws_manifest.txt", header=False, index=False)
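
To make the shape of a manifest line concrete, the SQL expression above can be mirrored in plain Python. This is a sketch only: the function name `s5cmd_cp_command`, the bucket name and all identifiers in the example are hypothetical, and it assumes `aws_url` has the form `s3://bucket/folder/file`.

```python
import re


def s5cmd_cp_command(aws_url, collection_id, patient_id, study_uid, series_uid):
    """Mirror the CONCAT/REGEXP_SUBSTR expression used in the query."""
    # With one capturing group, REGEXP_SUBSTR returns that group: everything
    # from "://" up to, but not including, the last "/".
    match = re.search(r"(://.*)/", aws_url)
    if match is None:
        raise ValueError(f"Unexpected URL format: {aws_url!r}")
    series_prefix = match.group(1)
    destination = "/".join([collection_id, patient_id, study_uid, series_uid])
    return f"cp s3{series_prefix}/* {destination}"


# Placeholder values, not real IDC identifiers.
example_line = s5cmd_cp_command(
    "s3://example-bucket/series-folder/instance.dcm",
    "example_collection",
    "PATIENT-1",
    "1.2.3",
    "1.2.3.4",
)
print(example_line)
# cp s3://example-bucket/series-folder/* example_collection/PATIENT-1/1.2.3/1.2.3.4
```

Each line copies every object under one series folder into a matching local directory, which is why grouping by `SeriesInstanceUID` yields one command per series.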

## Download files defined by the manifest

Once the manifest is ready, use the following command to download the files.

In [None]:
!mkdir -p downloaded_content
!cd downloaded_content && s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run ../s5cmd_aws_manifest.txt
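
After the download finishes, it can be useful to check how many files landed in each directory. The following is a minimal sketch using only the standard library. The helper name `count_files_per_directory` is hypothetical, and the demonstration runs on a throwaway directory tree rather than on real downloaded data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_files_per_directory(root):
    """Map each directory under root (as a relative path) to its file count."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)


# Demonstration on a temporary tree with placeholder names.
with tempfile.TemporaryDirectory() as tmp:
    series_dir = Path(tmp) / "collection" / "patient" / "study" / "series"
    series_dir.mkdir(parents=True)
    for name in ("a.dcm", "b.dcm"):
        (series_dir / name).write_bytes(b"")
    demo_counts = count_files_per_directory(tmp)

print(demo_counts)
```

In the notebook, the same function could be pointed at `downloaded_content` to list one entry per downloaded series directory.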

# Sorting

In [7]:
%%capture
!pip install pydicom
!git clone https://github.com/pieper/dicomsort

In [9]:
# recreate an empty output directory, using the same absolute path throughout
!mkdir -p /content/sorted_content
!rm -rf /content/sorted_content/*
!python dicomsort/dicomsort.py -k -u /content/downloaded_content /content/sorted_content/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm

100% 290/290 [00:01<00:00, 192.01it/s]
Files sorted


# Misc

## Mount Google Drive

Since everything you save in your Colab instance will disappear after restart, you may want to use some persistent location, such as Google Drive, for saving your artifacts.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')
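
Once the drive is mounted, artifacts can be copied to it like to any other directory. The sketch below shows one way to do this with the standard library. The helper name `copy_to_persistent` is hypothetical, and the demonstration uses a temporary directory in place of a Drive folder. In Colab, the destination would be a folder under the `/content/gdrive` mount point, for example `/content/gdrive/MyDrive/idc_artifacts` (an assumed path).

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_persistent(source, destination_dir):
    """Copy one file into destination_dir, creating the directory if needed."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(source, destination_dir))


# Demonstration with temporary directories standing in for Colab and Drive.
with tempfile.TemporaryDirectory() as tmp:
    source_file = Path(tmp) / "s5cmd_aws_manifest.txt"
    source_file.write_text("cp s3://example-bucket/series-folder/* out\n")
    copied = copy_to_persistent(source_file, Path(tmp) / "persistent")
    copied_name = copied.name
    copied_ok = copied.read_text() == source_file.read_text()

print(copied_name, copied_ok)
```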