<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/Cohort_download.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with IDC cohorts

This notebook is one of the examples that accompany the NCI Imaging Data Commons (IDC). IDC example notebooks are located in this repository: https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks.

In this example we show how a cohort manifest defined using the [IDC Portal](https://portal.imaging.datacommons.cancer.gov/) can be used to download the data to a cloud VM instance.

To proceed with the cells below you will need to:
* either upload your manifest to the connected runtime file system and set `manifestLocalPath` in the cell below to the path of the uploaded file, or export the cohort into a BigQuery table and set `cohortBQTable` to the table name
* initialize `myProjectID` in the cell below with your project ID (note: you do not need to configure billing for that project!)

In [None]:
manifestLocalPath = "##MANIFEST_LOCAL_PATH##"
cohortBQTable = "##COHORT_BQ_TABLE##"
myProjectID="##MY_PROJECT_ID##"
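
Before moving on, it can help to confirm that the settings above no longer hold their template values. The following is a minimal sketch, assuming that an unset value still looks like `##SOME_NAME##`; the helper name `unset_settings` and the example values are hypothetical and not part of IDC.

```python
import re

# Template placeholders in this notebook look like "##SOME_NAME##".
PLACEHOLDER = re.compile(r"^##[A-Z_]+##$")

def unset_settings(settings):
    """Return the names of settings that are empty or still hold a placeholder."""
    return [name for name, value in settings.items()
            if not value or PLACEHOLDER.match(value)]

# Hypothetical example: only the project ID has been filled in.
example = {
    "manifestLocalPath": "##MANIFEST_LOCAL_PATH##",
    "cohortBQTable": "##COHORT_BQ_TABLE##",
    "myProjectID": "my-project",
}
print(unset_settings(example))
```

Only one of `manifestLocalPath` and `cohortBQTable` needs to be set, depending on which approach below you follow.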

## Prerequisites

You will need to authenticate with Google to be able to follow this example.

In [None]:
from google.colab import auth
auth.authenticate_user()

## Approach 1 (recommended): Get GCS URLs from a BigQuery table manifest

Starting with the December 2020 release, the IDC Portal can export a cohort manifest either as a BigQuery table or as one or more files.

BigQuery export of the cohort manifest is the recommended approach. When a manifest is exported as a file, cohorts larger than 65,000 items are split across multiple files, and at most 10 files can be exported for a multi-part cohort manifest. BigQuery manifest export has no such limits and works the same way no matter how large or small the cohort is.

First, let's get the GCS URLs for the items included in the cohort from the cohort manifest table.

In [None]:
%%bigquery --project=$myProjectID cohort_df

SELECT * 
FROM `##COHORT_BQ_TABLE##`

## Approach 2 (not recommended): Get GCS URLs from a manifest file

We do not recommend this approach: a manifest exported as a file is limited to 650,000 rows, and any manifest with more than 65,000 rows is split into chunks of 65,000 rows. See https://learn.canceridc.dev/portal/data-exploration-and-cohorts for more details.
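
The limits quoted in this notebook (65,000 rows per file, at most 10 files) can be turned into a quick estimate of how many files a cohort would need. This is only an arithmetic sketch; the function name `manifest_file_count` is hypothetical, and the constants are the figures stated in the text, not values read from the portal.

```python
import math

ROWS_PER_FILE = 65_000  # rows per exported manifest file, as stated in this notebook
MAX_FILES = 10          # maximum number of files per export, as stated in this notebook

def manifest_file_count(n_rows):
    """Return the number of files needed for n_rows, or None if the cohort
    is too large to be exported as files."""
    n_files = math.ceil(n_rows / ROWS_PER_FILE)
    return n_files if n_files <= MAX_FILES else None

for n in (29, 65_000, 65_001, 650_000, 650_001):
    print(n, manifest_file_count(n))
```

A cohort for which this returns `None` would have to be exported as a BigQuery table instead.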



In [None]:
!head $manifestLocalPath

You can import an IDC cohort manifest in CSV format like any other CSV file, but check the header first to confirm how many lines need to be skipped. The header length may change in the lead-up to the public release of the portal.

In [None]:
import pandas as pd

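def cohort_as_df(manifest_filename):
  # header=5: the column names are on row index 5 (0-based); earlier rows are skipped.
  # Adjust this value if the output of `head` above shows a different header length.
  df = pd.read_csv(manifest_filename, header=5)
  return df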

cohort_df = cohort_as_df(manifestLocalPath)
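
Since the hard-coded `header=5` depends on the current header length, one option is to locate the header row instead of counting lines by hand. The sketch below assumes that the column header row is the first line containing a `gcs_url` field; the function name `find_header_row` and the sample manifest text are made up for illustration, and the real preamble may look different.

```python
def find_header_row(lines, column="gcs_url"):
    """Return the 0-based index of the first line that has `column`
    as one of its comma-separated fields."""
    for i, line in enumerate(lines):
        if column in [field.strip() for field in line.split(",")]:
            return i
    raise ValueError(f"no header row containing {column!r} found")

# Made-up manifest text; not an actual IDC export.
sample = (
    "# hypothetical preamble line 1\n"
    "# hypothetical preamble line 2\n"
    "PatientID,gcs_url\n"
    "patient-1,gs://example-bucket/a.dcm#123\n"
)
header_row = find_header_row(sample.splitlines())
print(header_row)
```

With a real manifest, the lines could be read with `open(manifestLocalPath)` and the result passed on as `pd.read_csv(manifestLocalPath, skiprows=header_row)`.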

## Save the GCS URLs and download the corresponding instances

Due to a known issue in the current release of IDC, we need to remove the generation suffix (the part after `#`) from `gcs_url`.

In [None]:
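# Split each URL at the first '#' into the object path and the generation suffix.
# `n` is passed by keyword because recent pandas versions no longer accept it positionally.
cohort_df = cohort_df.join(cohort_df["gcs_url"].str.split('#', n=1, expand=True).rename(columns={0:'gcs_url_no_revision', 1:'gcs_revision'}))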
cohort_df["gcs_url_no_revision"].to_csv("gcs_paths.txt", header=False, index=False)
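
To make the effect of the split concrete, here is the same operation on single strings in plain Python. It is an illustrative sketch only: the function name `strip_generation` and the URLs are hypothetical, and the cell above remains the code that actually processes the cohort.

```python
def strip_generation(gcs_url):
    """Split 'gs://bucket/object#generation' into
    (url_without_suffix, generation_or_None)."""
    url, sep, generation = gcs_url.partition("#")
    return url, (generation if sep else None)

# Hypothetical URLs, for illustration only.
print(strip_generation("gs://example-bucket/dicom/a.dcm#1597260515421000"))
print(strip_generation("gs://example-bucket/dicom/b.dcm"))
```

A URL without a `#` is returned unchanged, with `None` in place of the generation.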

In [None]:
!head /content/gcs_paths.txt

gs://idc-tcia-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.203715003805996641695765332389135385095/1.2.276.0.7230010.3.1.3.2323910823.11504.1597260515.421/1.2.276.0.7230010.3.1.4.2323910823.11504.1597260515.422.dcm
gs://idc-tcia-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.247726286795860121686796574974227334270/1.2.276.0.7230010.3.1.3.2323910823.23864.1597260522.316/1.2.276.0.7230010.3.1.4.2323910823.23864.1597260522.317.dcm
gs://idc-tcia-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.71961866280433925571019872464419293819/1.2.276.0.7230010.3.1.3.2323910823.11644.1597260534.485/1.2.276.0.7230010.3.1.4.2323910823.11644.1597260534.486.dcm
gs://idc-tcia-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.270361505197008655909592732352678399263/1.2.276.0.7230010.3.1.3.2323910823.21456.1597260540.379/1.2.276.0.7230010.3.1.4.2323910823.21456.1597260540.380.dcm
gs://idc-tcia-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.282967364651788470277412461462049836277/1.2.276.0.7230010.3.1.3.2323910823.22

To download the files to the VM filesystem we can use the standard `gsutil` command, which is preinstalled on Colab instances.

IDC-hosted data is available from free Google Cloud Storage buckets maintained under the [Google Public Dataset Program](https://console.cloud.google.com/marketplace/product/gcp-public-data-idc/nci-idc-data), which sponsors free egress of the data either within or out of Google Cloud.

In [None]:
# https://cloud.google.com/storage/docs/gsutil/commands/cp
!mkdir downloaded_cohort
!cat gcs_paths.txt | gsutil -m cp -I ./downloaded_cohort

Now the data is located in the file storage local to the VM, but all of the files are in the same directory, which is not the most convenient layout.
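
Because every file lands in one directory, two objects that share a file name would presumably map to the same local path. The sketch below checks a list of URLs for such name collisions; the function name `duplicate_basenames` and the URLs are hypothetical, and whether collisions occur at all depends on how the objects in your cohort are named.

```python
from collections import Counter
import posixpath

def duplicate_basenames(gcs_urls):
    """Return the sorted file names that occur more than once in gcs_urls."""
    counts = Counter(posixpath.basename(url) for url in gcs_urls)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical URLs, for illustration only.
urls = [
    "gs://example-bucket/dicom/study-1/series-1/1.dcm",
    "gs://example-bucket/dicom/study-2/series-1/1.dcm",
    "gs://example-bucket/dicom/study-2/series-1/2.dcm",
]
print(duplicate_basenames(urls))
```

For the actual cohort, the list could be read from `gcs_paths.txt`, one URL per line. An empty result means that every file name is unique.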

You can use the DICOM metadata to organize those instances, or use one of the existing tools to do this. One such tool is used below to organize the flat list of DICOM files into the PatientID-StudyInstanceUID-SeriesInstanceUID-SOPInstanceUID hierarchy.

In [None]:
!git clone https://github.com/pieper/dicomsort.git
!pip install pydicom
!python dicomsort/dicomsort.py --help

Cloning into 'dicomsort'...
remote: Enumerating objects: 126, done.
remote: Total 126 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (126/126), 37.03 KiB | 1.61 MiB/s, done.
Resolving deltas: 100% (63/63), done.
Collecting pydicom
  Downloading https://files.pythonhosted.org/packages/d3/56/342e1f8ce5afe63bf65c23d0b2c1cd5a05600caad1c211c39725d3a4cc56/pydicom-2.0.0-py3-none-any.whl (35.4MB)
     |████████████████████████████████| 35.5MB 1.3MB/s
Installing collected packages: pydicom
Successfully installed pydicom-2.0.0

% dicomsort.py --help
dicomsort [options...] sourceDir targetDir/<patterns>

 where [options...] can be:
    [-z,--compressTargets] - create a .zip file in the target directory
    [-d,--deleteSource] - remove source files/directories after sorting
    [-f,--forceDelete] - remove source without confirmation
    [-k,--keepGoing] - report but ignore dupicate target files
    [-v,--verbose] - print diagnostics while processing
  

The command below will sort instances into folders based on the DICOM metadata stored in the corresponding files.

In [None]:
!python dicomsort/dicomsort.py -u downloaded_cohort cohort_sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm

100% 29/29 [00:02<00:00, 11.31it/s]
Files sorted
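
If you prefer to organize the files with the DICOM metadata directly, as mentioned earlier, the same layout can be sketched with `pydicom`. This is a simplified illustration and not how `dicomsort` is implemented: the function names `target_path` and `sort_file` are hypothetical, the file is copied rather than moved, and attribute values are assumed to be safe to use as directory names.

```python
import os
import shutil

def target_path(root, patient_id, study_uid, series_uid, sop_uid):
    """Build root/PatientID/StudyInstanceUID/SeriesInstanceUID/SOPInstanceUID.dcm."""
    return os.path.join(root, patient_id, study_uid, series_uid, sop_uid + ".dcm")

def sort_file(src, root):
    """Copy one DICOM file into the hierarchy under root and return the new path."""
    import pydicom  # third-party package, installed earlier in this notebook

    # Read only the metadata; pixel data is not needed to build the path.
    ds = pydicom.dcmread(src, stop_before_pixels=True)
    dst = target_path(root, str(ds.PatientID), str(ds.StudyInstanceUID),
                      str(ds.SeriesInstanceUID), str(ds.SOPInstanceUID))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)
    return dst

# Path construction only, with made-up identifiers.
print(target_path("cohort_sorted", "P1", "1.2", "1.2.3", "1.2.3.4"))
```

To sort the whole download, `sort_file` could be called for each file in `downloaded_cohort`.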
