<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/Cohort_download.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

This notebook is one of the examples that accompany NCI Imaging Data Commons. IDC example notebooks are located in this repository: https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks.

# Working with IDC cohorts

In this example we show how a cohort manifest defined using the [IDC Portal](https://portal.imaging.datacommons.cancer.gov/) can be used to download the data to a cloud VM instance.

This example was prepared using a pre-release version of IDC Portal, and it has not yet been tested using the publicly released version.

To proceed with the cells below, you will need to:
* upload your manifest to the connected runtime file system
* initialize `manifestLocalPath` in the cell below with the actual path to the uploaded manifest
* initialize `myProjectID` in the cell below with a project ID that you can bill

In [21]:
manifestLocalPath = "##MANIFEST_LOCAL_PATH##"
myProjectID = "##MY_PROJECT_ID##"

In [None]:
!head $manifestLocalPath

You can load an IDC cohort manifest in CSV format like any other CSV file, but check the header first to confirm how many lines need to be skipped. The header length may change in the lead-up to the public release of the portal.

In [23]:
import pandas as pd

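def cohort_as_df(manifest_filename):
    # header=5 takes the column names from row 5 (0-based; pandas does not
    # count blank lines) and skips the lines above it. Adjust this value if
    # the manifest header length changes.
    return pd.read_csv(manifest_filename, header=5)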

cohort_df = cohort_as_df(manifestLocalPath)
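
Since the header length may change, one option is to locate the column-name row instead of hard-coding its position. The sketch below assumes that the column-name row is the first line of the manifest that contains a `gcs_path` field (the column used later in this notebook). `find_header_row` and `cohort_as_df_autodetect` are hypothetical helper names, not part of any IDC tooling.

```python
def find_header_row(lines, marker="gcs_path"):
    """Return the 0-based index of the first line whose CSV fields include `marker`."""
    for index, line in enumerate(lines):
        # Naive comma split: enough to spot a plain column-name row.
        fields = [field.strip().strip('"') for field in line.split(",")]
        if marker in fields:
            return index
    raise ValueError(f"no line with a {marker!r} column was found")


def cohort_as_df_autodetect(manifest_filename, marker="gcs_path"):
    """Load the manifest, skipping every line above the detected column-name row."""
    import pandas as pd  # imported here so find_header_row has no dependencies

    with open(manifest_filename, newline="") as manifest:
        header_row = find_header_row(manifest, marker)
    # skiprows counts physical lines in the file, so it matches the index above.
    return pd.read_csv(manifest_filename, skiprows=header_row)
```

If the assumption holds for your manifest, `cohort_as_df_autodetect(manifestLocalPath)` should return the same table as `cohort_as_df(manifestLocalPath)` without depending on the preamble length.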

The manifest contains a Google Cloud Storage URI (`gs://`) for each of the files corresponding to the individual DICOM instances. Here we save the column containing those URIs to a text file, one URI per line, so that it can be passed to the download command.

In [None]:
print(cohort_df["gcs_path"])
cohort_df["gcs_path"].to_csv("gcs_paths.txt", header=False, index=False)

In [None]:
!head gcs_paths.txt
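
Before starting a long download, it may help to confirm that every saved line really is a `gs://` URI and to see which buckets are involved. The following is a minimal sketch under that assumption; `split_gcs_uri` and `count_by_bucket` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI is missing a bucket or object name: {uri!r}")
    return bucket, object_name


def count_by_bucket(uris):
    """Count URIs per bucket, ignoring blank lines; raises on malformed URIs."""
    return Counter(split_gcs_uri(u.strip())[0] for u in uris if u.strip())


# Example usage in the notebook (file name taken from the cell above):
# with open("gcs_paths.txt") as paths:
#     print(count_by_bucket(paths))
```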

To download the files to the VM filesystem we can use the standard `gsutil` command, which is preinstalled on Colab instances.

In [None]:
# https://cloud.google.com/storage/docs/gsutil/commands/cp
!cat gcs_paths.txt | gsutil -m cp -I ./downloaded_cohort

The command above will fail because IDC-hosted data is stored in US multi-region [requester-pays Storage buckets](https://cloud.google.com/storage/docs/requester-pays). This means that you need to provide the ID of a project with billing configured in order to download the data. If you are using Google Colab from the US, the corresponding VM instance will likely be in the US, and data egress charges will be $0.01/GB (see [GCP network egress charges](https://cloud.google.com/storage/pricing#network-buckets) for full details).

Note that if you want to download the data to your own computer, the costs will be much higher.
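
To get a rough sense of the cost before downloading, the per-GB rate quoted above can be multiplied by the cohort size. The sketch below takes the $0.01/GB figure at face value and assumes that 1 GB is billed as 2^30 bytes; both values are parameters you should check against the current pricing page. `estimate_egress_cost` is a hypothetical helper.

```python
def estimate_egress_cost(total_bytes, usd_per_gb=0.01, bytes_per_gb=2**30):
    """Estimate the egress charge in USD for downloading `total_bytes` bytes.

    usd_per_gb:   rate quoted in the text above (verify against current pricing).
    bytes_per_gb: assumed billing unit (2**30 bytes); this is an assumption.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    return total_bytes / bytes_per_gb * usd_per_gb


# Example: a 50 GB cohort at the quoted rate.
cost = estimate_egress_cost(50 * 2**30)
print(f"${cost:.2f}")  # prints "$0.50"
```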

Before you can refer to a project that you own, you need to authenticate.

In [15]:
from google.colab import auth
auth.authenticate_user()

In [None]:
# https://cloud.google.com/storage/docs/gsutil/commands/cp
!mkdir -p downloaded_cohort
!cat gcs_paths.txt | gsutil -u $myProjectID -m cp -I ./downloaded_cohort
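
After the copy finishes, a quick comparison of the URI list against the download directory can reveal files that did not arrive. This is a minimal sketch that assumes each object is saved under the last path component of its URI, which also means that two objects with the same file name would collide in a flat directory. `missing_downloads` is a hypothetical helper.

```python
import posixpath


def missing_downloads(uris, downloaded_names):
    """Return the URIs whose file name is absent from `downloaded_names`."""
    present = set(downloaded_names)
    stripped = (line.strip() for line in uris)
    return [uri for uri in stripped
            if uri and posixpath.basename(uri) not in present]


# Example usage in the notebook (paths taken from the cells above):
# import os
# with open("gcs_paths.txt") as paths:
#     print(missing_downloads(paths, os.listdir("downloaded_cohort")))
```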

Now the data is located in the file storage local to the VM, but all of the files are in the same directory, which is not the most convenient layout.

You can use the DICOM metadata to organize those instances, or use one of the existing tools to do this. One such tool is used below to organize the flat list of DICOM files into the PatientID-StudyInstanceUID-SeriesInstanceUID-SOPInstanceUID hierarchy.
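
As an illustration of the first option, the sketch below reads the identifying attributes with `pydicom` and copies each file into the same hierarchy. It assumes that `pydicom` is installed, that every file in the source directory is a valid DICOM instance, and that the attribute values are safe to use as directory names (no sanitization is done). `target_path` and `sort_instances` are hypothetical helpers, not part of `dicomsort`.

```python
import os
import shutil

FOLDER_ATTRIBUTES = ("PatientID", "StudyInstanceUID", "SeriesInstanceUID")


def target_path(attributes, root="cohort_sorted"):
    """Build root/PatientID/StudyInstanceUID/SeriesInstanceUID/SOPInstanceUID.dcm."""
    folders = [str(attributes[name]) for name in FOLDER_ATTRIBUTES]
    file_name = str(attributes["SOPInstanceUID"]) + ".dcm"
    return os.path.join(root, *folders, file_name)


def sort_instances(source_dir, root="cohort_sorted"):
    """Copy every file in `source_dir` into the hierarchy built by target_path."""
    import pydicom  # third-party; imported here so target_path has no dependencies

    for name in os.listdir(source_dir):
        source = os.path.join(source_dir, name)
        if not os.path.isfile(source):
            continue
        # Pixel data is not needed to read the identifiers, so skip it.
        dataset = pydicom.dcmread(source, stop_before_pixels=True)
        keys = FOLDER_ATTRIBUTES + ("SOPInstanceUID",)
        attributes = {key: getattr(dataset, key) for key in keys}
        destination = target_path(attributes, root)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.copy2(source, destination)
```

Copying rather than moving keeps the flat download intact, at the cost of temporarily doubling the disk usage.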

In [None]:
!git clone https://github.com/pieper/dicomsort.git
!pip install pydicom
!python dicomsort/dicomsort.py --help

Cloning into 'dicomsort'...
remote: Enumerating objects: 126, done.
remote: Total 126 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (126/126), 37.03 KiB | 1.61 MiB/s, done.
Resolving deltas: 100% (63/63), done.
Collecting pydicom
  Downloading https://files.pythonhosted.org/packages/d3/56/342e1f8ce5afe63bf65c23d0b2c1cd5a05600caad1c211c39725d3a4cc56/pydicom-2.0.0-py3-none-any.whl (35.4MB)
     |████████████████████████████████| 35.5MB 1.3MB/s
Installing collected packages: pydicom
Successfully installed pydicom-2.0.0

% dicomsort.py --help
dicomsort [options...] sourceDir targetDir/<patterns>

 where [options...] can be:
    [-z,--compressTargets] - create a .zip file in the target directory
    [-d,--deleteSource] - remove source files/directories after sorting
    [-f,--forceDelete] - remove source without confirmation
    [-k,--keepGoing] - report but ignore dupicate target files
    [-v,--verbose] - print diagnostics while processing
  

The command below will sort instances into folders based on the DICOM metadata stored in the corresponding files.

In [None]:
!python dicomsort/dicomsort.py -u downloaded_cohort cohort_sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm

100% 29/29 [00:02<00:00, 11.31it/s]
Files sorted
