<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part2_searching_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with IDC - Part 2: Basics of searching IDC data

---

## Summary

This notebook is part of [the "Getting started with IDC" notebook series](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started), which introduces NCI Imaging Data Commons to users who want to interact with IDC programmatically.

In [Part 1](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started/part1_prerequisites.ipynb) of this tutorial series, you learned how to complete the prerequisites required to search IDC data. If you have not completed Part 1, that's OK: this notebook is self-contained, and you don't need to go back to Part 1! This notebook continues at the beginner level, showing you how to use [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that aims to simplify access to the data available from NCI Imaging Data Commons.

By the end of this tutorial, you will have learned:
* how to download images from IDC
* how to load and view DICOM images available from IDC
* what metadata attributes are available for filtering and how to filter data programmatically
* how to better leverage IDC visualization capabilities by programmatically generating viewer URLs for selected items, and embedding the viewer in your notebook
* how to get the licensing information for the individual files, and how to comply with the attribution requirement, which is in place for most of the data in IDC

---
Initial version: Nov 2022

Updated: June 2024


## Warnings: read these to avoid errors!

In order to avoid errors, please keep in mind the following warnings:

1. **Execute each cell of the notebook in order without skipping** any code cells - otherwise you will likely encounter runtime errors!
2. If you are going through this notebook for the first time, **do not change any code until you have successfully completed the notebook** (unless you are specifically asked to change something). Otherwise you may run into errors.



## Prerequisites

This tutorial relies on [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that accompanies IDC and provides basic functionality for searching and accessing data from IDC.

We will install `idc-index` using `pip`. This package is under active development, and so we will use the `--upgrade` option to access the latest version. It should take around 1-2 minutes to install the dependencies and complete the next cell.

In [None]:
!pip install idc-index --upgrade

## Downloading data from IDC

One of the first questions we get from the users of IDC is "How do I download the images?"

With `idc-index`, answering this question is easy.

We will head over to the IDC Portal, which provides a basic interface to explore the data available in IDC. As you move around the IDC Portal explore page, you will be able to copy identifiers corresponding to the content available in IDC at the different levels of the data hierarchy, from entire collections down to individual image series.

Try this out yourself! Open the IDC Portal explore page [https://portal.imaging.datacommons.cancer.gov/explore/](https://portal.imaging.datacommons.cancer.gov/explore/) in a separate window, and experiment with copying these identifiers. Identifiers for collections and patients/cases will be short strings, while unique identifiers for studies and series will be rather long strings that contain numbers and "." characters.

![IDC Portal demo](https://github.com/ImagingDataCommons/IDC-Tutorials/releases/download/0.2.0/Tutorial-copy.gif)

The identifier you have in the clipboard is all you need to access the corresponding images. To download them, we will first instantiate `IDCClient`, which is provided by `idc-index`.

In [3]:
from idc_index import index

idc_client = index.IDCClient()

In the following cell, we pass the identifier of a study to the function `download_from_selection()` to fetch the corresponding files. For the sake of the tutorial, please proceed with the identifier used in the following cell. You can experiment with the identifiers you copied from the portal after you have completed the notebook.


In [None]:
studyInstanceUID = "1.2.840.113654.2.55.68425808326883186792123057288612355322"


idc_client.download_from_selection(studyInstanceUID=studyInstanceUID, downloadDir=".")

# You can experiment with downloading all of the data for a patient by its PatientID, or for entire collections
# by uncommenting the lines below!
#idc_client.download_from_selection(patientId="LUNG1-025", downloadDir=".")
#idc_client.download_from_selection(collection_id="ct_phantom4radiomics", downloadDir=".")


Downloading data:   0%|          | 0.00/314M [00:03<?, ?B/s]


You can also download from IDC using the command-line helper tool that is installed as part of the `idc-index` package. The cell below shows how to download that same study from the command line.

In [6]:
!idc download-from-selection --study-instance-uid 1.2.840.113654.2.55.68425808326883186792123057288612355322 --download-dir .

2024-06-17 20:31:29,827 - Total size of files to download: 314.45 MB
2024-06-17 20:31:29,827 - Total free space on disk: 212.901072896GB
2024-06-17 20:31:29,987 - 
Temporary download manifest is generated and is passed to self._s5cmd_run

2024-06-17 20:31:29,987 - Not using s5cmd sync dry run as the destination folder is empty or sync dry or progress bar is not requested
2024-06-17 20:31:29,987 - Inputs received for tracking download:
2024-06-17 20:31:29,988 - size_MB: 314.45
2024-06-17 20:31:29,988 - downloadDir: /content
2024-06-17 20:31:29,988 - show_progress_bar: True
2024-06-17 20:31:29,988 - Initial size of the directory: 0 bytes
2024-06-17 20:31:29,988 - Approx. Size of the files need to be downloaded: 314450000.0 bytes
Downloading data: 100% 314M/314M [00:09<00:00, 34.3MB/s]
2024-06-17 20:31:39,153 - Successfully downloaded files to /content


Once the download is completed, you can check the current directory to examine the content. Note that files are downloaded into a folder hierarchy to make navigation easier. The default hierarchy is `collection_id` > `PatientID` > `StudyInstanceUID` > `Modality_SeriesInstanceUID`, and the files you downloaded should be organized as shown below.

```
├── nlst   <--- collection ID
│   └── 100002   <--- Patient ID
│       └── 1.2.840.113654.2.55.68425808326883186792123057288612355322   <--- Study ID
│           ├── CT_1.2.840.113654.2.55.229650531101716203536241646069123704792   <--- Series ID
│           │   ├── 0025b198-6198-4b33-85cf-92582531ad28.dcm   <--- individual instances/files in the series
│           │   ├── 00a93dcb-4cd0-46ca-ae41-9240421cb0c7.dcm        (corresponding to slices in radiology and
│           │   ├── 00aa5957-1b4e-4b8b-879f-c505e12f2dcc.dcm         resolution layers in digital pathology)
│           │   ├── 02da9050-622c-4e00-a0d4-a70dd418973c.dcm
...
```

The DICOM study we downloaded is from the National Lung Screening Trial (NLST) collection available in IDC. This study contains two Computed Tomography (CT) series, two DICOM Segmentation (SEG) series with the results of segmenting the CT series using [TotalSegmentator](https://github.com/wasserth/TotalSegmentator), and four DICOM Structured Reporting (SR) series with the radiomics features extracted from the segmented images. Most of the CT images in the NLST collection in IDC are accompanied by TotalSegmentator segmentations and the corresponding radiomics features!
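
To check what actually landed on disk, the folder hierarchy can be walked with the Python standard library. The following is a minimal sketch, assuming the default `Modality_SeriesInstanceUID` naming of series folders described above; `summarize_download` is a hypothetical helper written for this illustration and is not part of `idc-index`.

```python
from collections import Counter
from pathlib import Path


def summarize_download(study_dir):
    """Count series folders and .dcm files per modality under one study folder.

    Assumes series folders are named <Modality>_<SeriesInstanceUID>,
    as in the default download hierarchy.
    """
    series_per_modality = Counter()
    files_per_modality = Counter()
    for series_dir in sorted(Path(study_dir).iterdir()):
        if not series_dir.is_dir() or "_" not in series_dir.name:
            continue  # skip anything that does not look like a series folder
        modality = series_dir.name.split("_", 1)[0]
        series_per_modality[modality] += 1
        files_per_modality[modality] += sum(1 for _ in series_dir.glob("*.dcm"))
    return dict(series_per_modality), dict(files_per_modality)


# Path of the study downloaded earlier in this notebook.
study_dir = Path("./nlst/100002/1.2.840.113654.2.55.68425808326883186792123057288612355322")
if study_dir.is_dir():
    series_counts, file_counts = summarize_download(study_dir)
    print("Series per modality:", series_counts)
    print("Files per modality:", file_counts)
```

If the download completed, the series counts should reflect the mix of CT, SEG, and SR series described above.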

## Loading and visualizing IDC DICOM images and segmentations

In this tutorial we will focus on visualization of radiology images.

Images available from IDC are stored in the DICOM format. This format is used for virtually all of the images produced by radiological imaging equipment, and it is increasingly adopted by manufacturers of digital pathology equipment.

DICOM format is supported by many open source tools and libraries. In the following cells we will use:
* [ITK](https://itk.org/) for loading DICOM images
* [pydicom-seg](https://github.com/razorx89/pydicom-seg) for loading DICOM segmentations
* and [itkWidgets](https://itkwidgets.readthedocs.io/en/latest/) for visualization.

In the following cell, which should take about a minute to complete, we will install these prerequisites. You may see some errors about incompatible packages at the end of the installation. Those should be harmless.

In [None]:
# Install the packages required for sorting and loading the data as well as visualization
!pip install -q pydicom-seg "itk-io>=5.3.0" "itkwidgets[all]>=1.0a32"


Next we will import the necessary packages, and will load the CT image from the folder with the files we downloaded earlier.

In [None]:
import itk
from itkwidgets import view
# Pydicom-Seg is a layer on top of pydicom that handles DICOM SEG objects.
import pydicom
import pydicom_seg
import numpy as np

In [None]:
ct_image_path = "./nlst/100002/1.2.840.113654.2.55.68425808326883186792123057288612355322/CT_1.2.840.113654.2.55.229650531101716203536241646069123704792"

ct_image = itk.imread(ct_image_path)

We can next print the details about the loaded image. If the size of the image you loaded shows up as [512, 512, 126], then everything worked as expected!

In [None]:
print(ct_image)

Image (0x593a2bfc1eb0)
  RTTI typeinfo:   itk::Image<short, 3u>
  Reference Count: 1
  Modified Time: 76609
  Debug: Off
  Object Name: 
  Observers: 
    none
  Source: (none)
  Source output name: (none)
  Release Data: Off
  Data Released: False
  Global Release Data: Off
  PipelineMTime: 51268
  UpdateMTime: 76608
  RealTimeStamp: 0 seconds 
  LargestPossibleRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 126]
  BufferedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 126]
  RequestedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 126]
  Spacing: [0.703125, 0.703125, 2.5]
  Origin: [-169.6, -180, -301.545]
  Direction: 
1 0 0
0 1 0
0 0 1

  IndexToPointMatrix: 
0.703125 0 0
0 0.703125 0
0 0 2.5

  PointToIndexMatrix: 
1.42222 0 0
0 1.42222 0
0 0 0.4

  Inverse Direction: 
1 0 0
0 1 0
0 0 1

  PixelContainer: 
    ImportImageContainer (0x799cc007f3a0)
      RTTI typeinfo:   itk::ImportImageContainer<unsigned long, short>
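
The `Size` and `Spacing` values in the printout above are enough to work out the physical extent of the volume. Here is a small worked example that uses only the numbers shown (plain Python, no ITK calls):

```python
# Values copied from the printed image description above.
size_voxels = (512, 512, 126)
spacing_mm = (0.703125, 0.703125, 2.5)

# Extent along each axis = number of voxels x voxel spacing (in mm).
extent_mm = tuple(n * s for n, s in zip(size_voxels, spacing_mm))
print(extent_mm)  # (360.0, 360.0, 315.0)
```

In other words, this scan covers a 36 × 36 cm field of view in-plane and 31.5 cm along the third axis.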
   

Next we will load one of the DICOM SEG series included in the study. We will use the `pydicom-seg` library to load the segmentation as an array, which we will then convert into an ITK image and resample onto the grid of the CT image we loaded earlier.

In [None]:
# Read the DICOM SEG object using pydicom and pydicom_seg.
seg_image_path = "./nlst/100002/1.2.840.113654.2.55.68425808326883186792123057288612355322/SEG_1.2.276.0.7230010.3.1.3.313263360.15851.1706325185.577017/9629abf6-d1de-4931-bc6a-061890ae275c.dcm"
seg_dicom = pydicom.dcmread(seg_image_path)
seg_reader = pydicom_seg.MultiClassReader()
seg_obj = seg_reader.read(seg_dicom)

# Convert the DICOM SEG object into an itk image, with correct voxel origin, spacing, and directions in physical space.
seg_image = itk.image_from_array(seg_obj.data.astype(np.float32))
seg_image.SetOrigin(seg_obj.origin)
seg_image.SetSpacing(seg_obj.spacing)
seg_image.SetDirection(seg_obj.direction)
interpolator = itk.NearestNeighborInterpolateImageFunction.New(seg_image)
seg_image = itk.resample_image_filter(Input=seg_image,
                                            Interpolator=interpolator,
                                            reference_image=ct_image,
                                            use_reference_image=True)
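
The resampling step above uses a nearest-neighbor interpolator. The notebook does not spell out why, but a common reason is that segmentation voxels hold categorical labels, and averaging neighboring labels can produce values that were never assigned to any structure. The toy 1-D sketch below illustrates the difference; it is plain Python with hypothetical helper names and is unrelated to the ITK API.

```python
def resample_nearest(values, positions):
    """Sample a 1-D array at fractional positions, picking the closest sample."""
    last = len(values) - 1
    return [values[min(int(p + 0.5), last)] for p in positions]


def resample_linear(values, positions):
    """Sample a 1-D array at fractional positions with linear interpolation."""
    last = len(values) - 1
    out = []
    for p in positions:
        lo = min(int(p), last - 1)  # left neighbor (clamped at the right edge)
        frac = p - lo               # distance from the left neighbor
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


labels = [1, 1, 3, 3]  # two structures, labeled 1 and 3
positions = [0.0, 0.5, 1.5, 2.5, 3.0]

nearest = resample_nearest(labels, positions)
linear = resample_linear(labels, positions)
print(nearest)  # [1, 1, 3, 3, 3]
print(linear)   # [1.0, 1.0, 2.0, 3.0, 3.0]
```

The linear result contains the value 2.0 at the boundary, a label that does not exist in the input, while nearest-neighbor only ever returns labels that were already present.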

Now that we have both the CT image and the SEG image loaded as ITK images, we can visualize them using itkWidgets `view()` function.

Note that the following cell may take about a minute to complete. Once the cell execution is done, you will need to wait a bit longer for the visualization widget to populate in the output cell, and a bit longer again (up to several minutes) for the image to properly render. While everything is being loaded, you will see a spinning icon next to the word "Image" in the upper left corner of the widget to indicate the content is being processed.

**Please be patient!** Until the image is fully processed by the widget, the visualization may not show up correctly. The image below shows what the viewer should look like after everything has loaded correctly.

<img src="https://github.com/ImagingDataCommons/IDC-Tutorials/releases/download/0.2.0/itkwidgets_demo.gif" alt="select" width="800"/>

In [None]:
viewer = view(ct_image, label_image=seg_image, ui_collapsed=False)


As you can see, IDC data is easy to load and visualize using popular open source libraries!

## Searching IDC data

In the previous exercise, we identified the data by navigating the IDC Portal. IDC Portal can be very helpful if you want to browse through the data, quickly check out individual images, or get an idea about what is available.

You can also filter IDC data and explore what is available programmatically: `idc-index` to the rescue!

`idc-index` is named this way because it wraps an _index_ of IDC data: a table containing the most important metadata attributes describing the files available in IDC. This metadata index is available in the `index` attribute of `IDCClient` (a pandas `DataFrame`), which contains the following columns.




In [None]:
print(idc_client.index.columns.values)

We will discuss just a few of those columns - you can learn about those not discussed in [this `idc-index` documentation page](https://idc-index.readthedocs.io/en/latest/?badge=latest#the-index-of-idc-index).

IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by **`PatientID`** column) undergo imaging exams (or _studies_, in DICOM nomenclature).

Each patient will have one or more studies, with each study identified uniquely by the attribute **`StudyInstanceUID`**. During each imaging study, one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. Imaging series are uniquely identified by **`SeriesInstanceUID`**.

Individual collections within IDC group patients/cases, and are recognized by their **`collection_id`** values.

The function we used earlier - `download_from_selection()` - can be used to download images given any of these identifiers: `collection_id`,
`PatientID`, `StudyInstanceUID`, or `SeriesInstanceUID`.

In [None]:
help(idc_client.download_from_selection)

Help on method download_from_selection in module idc_index.index:

download_from_selection(downloadDir, dry_run=False, collection_id=None, patientId=None, studyInstanceUID=None, seriesInstanceUID=None, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID') method of idc_index.index.IDCClient instance
    Download the files corresponding to the selection. The filtering will be applied in sequence (but does it matter?) by first selecting the collection(s), followed by
    patient(s), study(studies) and series. If no filtering is applied, all the files will be downloaded.
    
    Args:
        downloadDir: string containing the path to the directory to download the files to
        dry_run: calculates the size of the cohort but download does not start
        collection_id: string or list of strings containing the values of collection_id to filter by
        patientId: string or list of strings con

Any of the columns included in the index can be used to build filters, or selection queries, subsetting content that meets specific requirements.

For any of the index columns, we can check the unique values to first understand what is available.

As an example, let's look at the `Modality` column, which contains an abbreviation encoding the type of image.

In [None]:
print(idc_client.index['Modality'].unique())

['MR' 'PT' 'CT' 'MG' 'US' 'RTPLAN' 'DX' 'SEG' 'SR' 'RTSTRUCT' 'SM' 'PR'
 'M3D' 'RTDOSE' 'SC' 'CR' 'REG' 'NM' 'FUSION' 'XC' 'OT' 'XA' 'RF' 'RWV'
 'KO']


As an exercise in using the index to subset IDC data, we will search for all Magnetic Resonance series. This can be done using the `Modality` column we used above. If you want to know what each of the abbreviations above stands for, check out this page from the DICOM standard: https://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.7.3.html#sect_C.7.3.1.1.1. For the sake of this exercise, all you need to know is that the `MR` value corresponds to "Magnetic Resonance".
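
For quick reference while exploring, a few of the abbreviations can be kept in a small lookup table. The sketch below is a partial, hand-written mapping for illustration only (the DICOM standard page linked above remains the authoritative list); `describe_modality` is a hypothetical helper, not part of `idc-index`.

```python
# A few modality codes and their meanings as defined in the DICOM standard.
# Partial list for illustration; consult the standard for the full set.
MODALITY_NAMES = {
    "CT": "Computed Tomography",
    "MR": "Magnetic Resonance",
    "PT": "Positron emission tomography (PET)",
    "US": "Ultrasound",
    "MG": "Mammography",
    "SM": "Slide Microscopy",
    "SEG": "Segmentation",
    "SR": "SR Document",
    "RTSTRUCT": "Radiotherapy Structure Set",
}


def describe_modality(code):
    """Return a readable name for a modality code, if it is in the partial lookup."""
    return MODALITY_NAMES.get(code, "not in this partial lookup, see the DICOM standard")


for code in ["MR", "CT", "SEG", "KO"]:
    print(code, "->", describe_modality(code))
```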

`index` is just a pandas `DataFrame`, and you can use pandas syntax to do the selection, as shown in the next cell.

In [None]:
mr_selection_pd = idc_client.index[idc_client.index['Modality'] == "MR"]

As an alternative, you can use Structured Query Language (SQL). The following cell accomplishes the same task using an SQL query.

In [None]:
query = """
SELECT *
FROM index
WHERE Modality = 'MR'
"""

mr_selection_sql = idc_client.sql_query(query)

In both cases, the result is returned as a pandas `DataFrame` containing the rows that have `Modality` set to `MR`. You can see that in both cases 122707 series meet this selection criterion.
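
To see why the two approaches give the same selection, here is a self-contained toy sketch. It uses Python's built-in `sqlite3` module and a three-row table named `toy_index` as stand-ins: neither is what `idc-index` uses internally, and the rows are made-up values for illustration.

```python
import sqlite3

# Toy rows: (collection_id, Modality, BodyPartExamined). Illustrative values only.
rows = [
    ("tcga_lihc", "MR", "LIVER"),
    ("nlst", "CT", "CHEST"),
    ("duke_breast_cancer_mri", "MR", "BREAST"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_index (collection_id TEXT, Modality TEXT, BodyPartExamined TEXT)"
)
conn.executemany("INSERT INTO toy_index VALUES (?, ?, ?)", rows)

# SQL-style selection, analogous to the query passed to sql_query().
sql_result = conn.execute(
    "SELECT collection_id FROM toy_index WHERE Modality = 'MR' ORDER BY rowid"
).fetchall()
conn.close()

# Equivalent boolean-condition selection in plain Python,
# analogous to index[index['Modality'] == "MR"] in pandas.
mask_result = [(r[0],) for r in rows if r[1] == "MR"]

print(sql_result == mask_result)  # True
```

Both forms express the same condition; which one to use is mostly a matter of preference and of how complex the selection becomes.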

In [None]:
mr_selection_sql

Unnamed: 0,collection_id,analysis_result_id,PatientID,SeriesInstanceUID,StudyInstanceUID,source_DOI,PatientAge,PatientSex,StudyDate,StudyDescription,...,Modality,Manufacturer,ManufacturerModelName,SeriesDate,SeriesDescription,SeriesNumber,instanceCount,license_short_name,series_aws_url,series_size_MB
0,prostate_mri_us_biopsy,,Prostate-MRI-US-Biopsy-1055,1.3.6.1.4.1.14519.5.2.1.1899844476410134559385...,1.3.6.1.4.1.14519.5.2.1.2628817600927956679189...,10.7937/TCIA.2020.A61IOC1A,066Y,M,2010-06-06,MR PROSTATE WO+W CONTRAST,...,MR,Siemens HealthCare GmbH,MAGNETOM Vida,2010-06-06,t2_spc_rst_axial obl_Prostate,9001,60,CC BY 4.0,s3://idc-open-data/2629b726-1d5e-46bb-ad83-84e...,12.98
1,duke_breast_cancer_mri,,Breast_MRI_366,1.3.6.1.4.1.14519.5.2.1.1967937938883908309677...,1.3.6.1.4.1.14519.5.2.1.5267683554571147092228...,10.7937/TCIA.e3sv-re93,052Y,F,1990-01-01,MRI BREAST BILATERAL W + W/O,...,MR,GE MEDICAL SYSTEMS,Optima MR450w,1990-01-01,ax dynamic,600,176,CC BY-NC 4.0,s3://idc-open-data-cr/91104ca9-fff2-4aba-a7bc-...,92.68
2,ispy2,,ISPY2-420523,1.3.6.1.4.1.14519.5.2.1.1955129313409963897130...,1.3.6.1.4.1.14519.5.2.1.1651229999144140646995...,10.7937/TCIA.D8Z0-9T85,,F,2000-09-07,ISPY2_MRI_T0,...,MR,Philips Medical Systems,Intera,2000-09-07,ISPY2: 13ML OMNI eTHRIVE_HR,601,1176,CC BY 4.0,s3://idc-open-data/5a105b2f-839c-429d-86d1-289...,747.35
3,ispy2,,ISPY2-597873,1.3.6.1.4.1.14519.5.2.1.1947902686914004457229...,1.3.6.1.4.1.14519.5.2.1.1626221496437855563089...,10.7937/TCIA.D8Z0-9T85,,F,2004-09-02,ISPY2_MRI_T0,...,MR,GE MEDICAL SYSTEMS,Optima MR450w,2004-09-02,ISPY2: Ph4/Ax VIBRANT MPh +C,704,84,CC BY 4.0,s3://idc-open-data/fa385258-6e8e-41f7-9c35-d1d...,44.59
4,ea1141,,EA1141-7499349,1.3.6.1.4.1.14519.5.2.1.1620.1225.211176807854...,1.3.6.1.4.1.14519.5.2.1.1620.1225.320692405837...,10.7937/2BAS-HR33,,,1945-01-24,MR BREAST RESEARCH EXAM,...,MR,GE MEDICAL SYSTEMS,DISCOVERY MR750,,Axial T1 FS post Asset,6,160,CC BY 4.0,s3://idc-open-data/a8f2cc40-90de-4ead-9d2c-ca6...,84.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122702,breast_mri_nact_pilot,,UCSF-BR-11,1.3.6.1.4.1.14519.5.2.1.7695.2311.130574808628...,1.3.6.1.4.1.14519.5.2.1.7695.2311.330179937982...,10.7937/K9/TCIA.2016.QHSYHJKY,046Y,F,1992-04-30,"MR BREAS, UNIT",...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1992-04-30,T2-FSE-Sagittal,2,38,CC BY 3.0,s3://idc-open-data/68eb71f1-46f6-4f73-97d1-68c...,5.15
122703,acrin_contralateral_breast_mr,,ACRIN-Contralateral-Breast-MR-858,1.3.6.1.4.1.14519.5.2.1.7009.2406.301896262402...,1.3.6.1.4.1.14519.5.2.1.7009.2406.282325974534...,10.7937/Q1EE-J082,053Y,F,1959-12-31,MRI BREAST UNI 1ST EXA,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1959-12-31,SAG FSE T2 FS,3,32,CC BY 4.0,s3://idc-open-data/efd7ae42-ee5c-4d24-8d08-687...,4.36
122704,acrin_contralateral_breast_mr,,ACRIN-Contralateral-Breast-MR-201,1.3.6.1.4.1.14519.5.2.1.7009.2406.158618405866...,1.3.6.1.4.1.14519.5.2.1.7009.2406.282649122373...,10.7937/Q1EE-J082,044Y,F,1959-12-31,66671F,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1959-12-31,,105,28,CC BY 4.0,s3://idc-open-data/72efcb16-d619-48a0-80b2-756...,3.82
122705,ispy1,,ISPY1_1094,1.3.6.1.4.1.14519.5.2.1.7695.1700.206374498355...,1.3.6.1.4.1.14519.5.2.1.7695.1700.151515454573...,10.7937/K9/TCIA.2016.HdHpgJLK,049Y,F,1986-09-05,MR BREASTUNI UE,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,,PJN,915,11,CC BY 3.0,s3://idc-open-data/33eb967f-7a92-4505-8a96-ee2...,5.81


As you search the data, you can combine multiple columns to build your cohort!

The `BodyPartExamined` column describes the anatomy that was imaged. Similar to what we did to understand the values available in the `Modality` column, let's find the distinct values, and this time also count the number of times each value is encountered.

In [None]:
sorted_unique_values = idc_client.index['BodyPartExamined'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

BodyPartExamined
CHEST               350556
BREAST               99444
PROSTATE             22588
LUNG                 10823
ABDOMEN               8362
PELVIS                5080
COLON                 3544
KIDNEY                3380
LIVER                 2698
HEAD                  1153
EXTREMITY              924
BLADDER                846
OVARY                  844
UTERUS                 838
HEADNECK               819
CHESTABDPELVIS         720
TSPINE                 618
BRAIN                  539
CERVIX                 488
PHANTOM                364
PANCREAS               348
STOMACH                308
SKULL                  286
MEDIASTINUM            180
ABDOMENPELVIS          177
WHOLEBODY              172
ESOPHAGUS              126
ADRENAL                121
CSPINE                  51
NECKCHESTABDPELV        46
CHESTABDOMEN            44
LSPINE                  38
RECTUM                  34
THYROID                 28
THORAX                  28
SPINE                   20
FUSION     

For the sake of example, let's filter images that meet two criteria: Magnetic Resonance (`MR`) as `Modality`, and `LIVER` as `BodyPartExamined`.

In [4]:
query = """
SELECT *
FROM index
WHERE Modality = 'MR' AND BodyPartExamined = 'LIVER'
"""

liver_mr_selection_sql = idc_client.sql_query(query)

In [None]:
liver_mr_selection_sql

Unnamed: 0,collection_id,analysis_result_id,PatientID,SeriesInstanceUID,StudyInstanceUID,source_DOI,PatientAge,PatientSex,StudyDate,StudyDescription,...,Modality,Manufacturer,ManufacturerModelName,SeriesDate,SeriesDescription,SeriesNumber,instanceCount,license_short_name,series_aws_url,series_size_MB
0,tcga_lihc,,TCGA-DD-A4NP,1.3.6.1.4.1.14519.5.2.1.3344.4008.204609291887...,1.3.6.1.4.1.14519.5.2.1.3344.4008.268191322858...,10.7937/K9/TCIA.2016.IMMQW8UQ,032Y,M,1998-01-22,*MRI - ABDOMEN,...,MR,GE MEDICAL SYSTEMS,SIGNA EXCITE,1998-01-22,lava3d POST,9,372,CC BY 3.0,s3://idc-open-data/a65a1bf3-22b1-4611-8198-ac9...,197.16
1,tcga_lihc,,TCGA-BC-A216,1.3.6.1.4.1.14519.5.2.1.8421.4008.215687072128...,1.3.6.1.4.1.14519.5.2.1.8421.4008.431019104846...,10.7937/K9/TCIA.2016.IMMQW8UQ,063Y,F,1999-04-12,Outside Read or Comparison MRI,...,MR,SIEMENS,Espree,1999-04-12,t1_vibe_fs_tra_bh 1 + c,9,72,CC BY 3.0,s3://idc-open-data/4ee1e8c4-84a2-4b66-a40c-3d9...,37.98
2,tcga_lihc,,TCGA-G3-AAV1,1.3.6.1.4.1.14519.5.2.1.1079.4008.154857437339...,1.3.6.1.4.1.14519.5.2.1.1079.4008.110146226557...,10.7937/K9/TCIA.2016.IMMQW8UQ,051Y,M,2007-01-25,MR ABDOMEN NONENHANCED & ENHANCED-BODY,...,MR,SIEMENS,Aera,2007-01-25,HASTE LOCALIZER,1,15,CC BY 3.0,s3://idc-open-data/6fa7d854-1d3a-41ad-a36e-98b...,2.02
3,tcga_lihc,,TCGA-DD-A4NQ,1.3.6.1.4.1.14519.5.2.1.3344.4008.629712585201...,1.3.6.1.4.1.14519.5.2.1.3344.4008.398081275943...,10.7937/K9/TCIA.2016.IMMQW8UQ,061Y,M,2000-09-14,MRI Abd wo&w,...,MR,GE MEDICAL SYSTEMS,SIGNA EXCITE,2000-09-14,ASSET calibration,3,45,CC BY 3.0,s3://idc-open-data/57911ce7-f541-4fdb-9969-81d...,0.63
4,cptac_ccrcc,,C3N-01989,1.3.6.1.4.1.14519.5.2.1.4801.5885.132870867316...,1.3.6.1.4.1.14519.5.2.1.4801.5885.114156767912...,10.7937/K9/TCIA.2018.OBLAMN27,070Y,F,2001-10-08,MR CholangoPancreagram with without Contrast,...,MR,Philips Medical Systems,Ingenia,2001-10-08,SURVEY BH,301,11,CC BY 3.0,s3://idc-open-data/fe1d29a9-a6ad-4325-b0b7-71a...,1.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
928,tcga_lihc,,TCGA-DD-A1EK,1.3.6.1.4.1.14519.5.2.1.3344.4008.231491102753...,1.3.6.1.4.1.14519.5.2.1.3344.4008.412308107972...,10.7937/K9/TCIA.2016.IMMQW8UQ,064Y,F,2001-06-15,"MRI ,ABD W&W/O, CONT 74183",...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,2001-06-15,THICK SLAB,4,6,CC BY 3.0,s3://idc-open-data/23f3df1d-1f3a-4038-a180-757...,3.18
929,tcga_lihc,,TCGA-DD-A3A0,1.3.6.1.4.1.14519.5.2.1.3344.4008.248858193777...,1.3.6.1.4.1.14519.5.2.1.3344.4008.906156291448...,10.7937/K9/TCIA.2016.IMMQW8UQ,072Y,M,1996-03-14,MRI Abd wo&w,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1996-03-14,ax rt FSE,5,24,CC BY 3.0,s3://idc-open-data/0dab553d-740a-49e7-8831-f0e...,3.29
930,tcga_lihc,,TCGA-DD-A3A0,1.3.6.1.4.1.14519.5.2.1.3344.4008.299412241240...,1.3.6.1.4.1.14519.5.2.1.3344.4008.264677955052...,10.7937/K9/TCIA.2016.IMMQW8UQ,071Y,M,1995-07-23,MRI Abd wo&w,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1995-07-23,ssfse,2,30,CC BY 3.0,s3://idc-open-data/c5520ee4-8365-47c0-9d12-acc...,4.12
931,tcga_lihc,,TCGA-DD-A3A0,1.3.6.1.4.1.14519.5.2.1.3344.4008.255590388443...,1.3.6.1.4.1.14519.5.2.1.3344.4008.264677955052...,10.7937/K9/TCIA.2016.IMMQW8UQ,071Y,M,1995-07-23,MRI Abd wo&w,...,MR,GE MEDICAL SYSTEMS,GENESIS_SIGNA,1995-07-23,Ax spgr,8,20,CC BY 3.0,s3://idc-open-data/d0806bb2-7cdb-4a94-a0c7-1a8...,2.74


## Working with the search results

Now that we have learned the basics of searching, we will go over some of the operations you can apply to the search results. We will learn how to download your selection, how to visualize individual images, and how to find the terms of use and attribution requirements. This will be important if you use images from IDC in publications or commercial work!

### Downloading selected images

Earlier we learned how to download a study using an identifier copied from the IDC Portal.

You can use the same function to download images that you found by searching the index. In the following cell we will use `SeriesInstanceUID` passed to the `download_from_selection()` function to download the first 10 series.

In [None]:
idc_client.download_from_selection(seriesInstanceUID = list(liver_mr_selection_sql['SeriesInstanceUID'].values[:10]), downloadDir=".")

Downloading data: 100%|██████████| 331M/331M [00:05<00:00, 55.4MB/s]
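
Before downloading a larger selection, it can help to estimate how much data is involved. The index includes a `series_size_MB` column, so the estimate is a simple sum. A minimal sketch in plain Python, using the sizes of the first five rows shown in the table above (with the real index you would sum that column of your selection `DataFrame` instead):

```python
# series_size_MB values of the first five rows of the liver MR selection shown above.
series_sizes_mb = [197.16, 37.98, 2.02, 0.63, 1.54]

total_mb = round(sum(series_sizes_mb), 2)
print(f"Estimated download size: {total_mb} MB")
```

According to its help text, `download_from_selection()` also accepts `dry_run=True`, which calculates the size of the cohort without starting the download.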


### Visualizing selected images

To better understand the images available, you may want to take a look at them first. You do not need to leave the notebook to do that! `idc-index` provides a convenience function to generate a URL to open the viewer, and to even embed it in your notebook cell!

In the following cell we will pick a random row from the selection we made earlier, and generate the URL to open that image series in the IDC-maintained image viewer.

In [None]:
import random

random_series = random.choice(liver_mr_selection_sql['SeriesInstanceUID'].values)
viewer_url = idc_client.get_viewer_URL(seriesInstanceUID=random_series)

print(viewer_url)

https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.8421.4008.303533339368406310446855637599?SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.8421.4008.268265087138563433402419981942
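
Judging from the printed URL, the link places the `StudyInstanceUID` in the path and passes the `SeriesInstanceUID` as a query parameter. The sketch below rebuilds that pattern by hand. Note that `build_viewer_url` is a hypothetical helper that only mirrors the output shown here; `get_viewer_URL()` may behave differently for other kinds of data, so prefer the library function in practice.

```python
# Base address taken from the URL printed above.
VIEWER_BASE = "https://viewer.imaging.datacommons.cancer.gov/viewer"


def build_viewer_url(study_uid, series_uid):
    """Assemble a viewer link following the pattern of the URL printed above."""
    return f"{VIEWER_BASE}/{study_uid}?SeriesInstanceUID={series_uid}"


url = build_viewer_url(
    "1.3.6.1.4.1.14519.5.2.1.8421.4008.303533339368406310446855637599",
    "1.3.6.1.4.1.14519.5.2.1.8421.4008.268265087138563433402419981942",
)
print(url)
```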


In the following cell, we embed the viewer that opens the URL for the selected series into the output cell. This way you can examine items you selected without leaving the notebook page!

In [None]:
from IPython.display import IFrame
IFrame(viewer_url, width=1600, height=900)

### Getting license information for the selected images

Each of the files available from IDC is accompanied by a license that defines the terms of use. In most cases, those are generic, broadly accepted Creative Commons licenses.

The abbreviated license code is available in the `license_short_name` column. In the following cell we list the licenses encountered across all of the selected series. You will see that the only license encountered is the Creative Commons Attribution 3.0 license (https://creativecommons.org/licenses/by/3.0/deed.en), which does not restrict commercial use, but requires you to credit the author if you reuse the item.

Most (>95%) of the images in IDC are shared under the permissive CC-BY license, which allows commercial use (but does require attribution)!

In [None]:
sorted_unique_values = liver_mr_selection_sql['license_short_name'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

license_short_name
CC BY 3.0    933
Name: count, dtype: int64
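
When a selection mixes several licenses, a quick screen for the NonCommercial (`NC`) element can be useful. The sketch below is a simple heuristic on the short license name, not a substitute for reading the license text; `allows_commercial_use` is a hypothetical helper, and the list of licenses is a made-up sample built from names that appear in the outputs of this notebook.

```python
from collections import Counter


def allows_commercial_use(license_short_name):
    """Heuristic: a name containing the NC (NonCommercial) element is treated as restricted."""
    tokens = license_short_name.replace("-", " ").split()
    return "NC" not in tokens


# Made-up sample of license names, one entry per series.
licenses = ["CC BY 3.0", "CC BY 4.0", "CC BY-NC 4.0", "CC BY 3.0"]

for name, count in Counter(licenses).most_common():
    status = "commercial use allowed" if allows_commercial_use(name) else "non-commercial only"
    print(f"{name}: {count} series, {status}")
```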


### Getting citation information

To get more details about how the data was collected, you should use the `source_DOI` column, which contains the Digital Object Identifier (DOI) of the dataset from which the given file originates.

We can easily access the complete list of DOIs that accompany the items in our selection.

In [None]:
sorted_unique_values = liver_mr_selection_sql['source_DOI'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

source_DOI
10.7937/K9/TCIA.2016.IMMQW8UQ    910
10.7937/K9/TCIA.2018.OBLAMN27     23
Name: count, dtype: int64
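
A DOI can be turned into a clickable link by prefixing it with the `https://doi.org/` resolver address. A minimal sketch using the two DOIs listed above (`doi_to_url` is a hypothetical helper name):

```python
def doi_to_url(doi):
    """Turn a bare DOI into a resolvable https://doi.org link."""
    return f"https://doi.org/{doi}"


# The two DOIs found in the selection above.
dois = ["10.7937/K9/TCIA.2016.IMMQW8UQ", "10.7937/K9/TCIA.2018.OBLAMN27"]

for doi in dois:
    print(doi_to_url(doi))
```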


If you use data from IDC, you should also acknowledge IDC as the source of the data, and cite individual datasets that you used.

To help you comply with the attribution requirements, `idc-index` provides a convenience function `citations_from_selection` that will look up the DOIs and generate the list of citations.

WARNING: As of May 30, 2024, due to server issues at api.crossref.org, the following cell may not work. In the future, we will replace the API call to CrossRef with a cached list of publications to address this issue.


In [None]:
idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values))

['Erickson, B. J., Kirk, S., Lee, Y., Bathe, O., Kearns, M., Gerdes, C., Rieger-Christ, K., &amp; Lemmerman, J. (2016). <i>The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC)</i> (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.IMMQW8UQ',
 'National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). <i>The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC)</i> (Version 13) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.OBLAMN27',
 'Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., … Kikinis, R. (2023). National Cancer Institute Imaging Data Commons: Toward Transparency,

You can also request the citation list in BibTeX format using the `citation_format` argument (to learn more, see the documentation of the function, for example with `help(idc_client.citations_from_selection)`).

In [None]:
idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values), citation_format=index.IDCClient.CITATION_FORMAT_BIBTEX)

['@misc{https://doi.org/10.7937/k9/tcia.2016.immqw8uq,\n  doi = {10.7937/K9/TCIA.2016.IMMQW8UQ},\n  url = {https://www.cancerimagingarchive.net/collection/tcga-lihc/},\n  author = {Erickson, Bradley J. and Kirk, Shanah and Lee, Yueh and Bathe, Oliver and Kearns, Melissa and Gerdes, Cindy and Rieger-Christ, Kimberly and Lemmerman, John},\n  title = {The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC)},\n  publisher = {The Cancer Imaging Archive},\n  year = {2016},\n  copyright = {Creative Commons Attribution 3.0 Unported}\n}\n',
 '@misc{https://doi.org/10.7937/k9/tcia.2018.oblamn27,\n  doi = {10.7937/K9/TCIA.2018.OBLAMN27},\n  url = {https://www.cancerimagingarchive.net/collection/cptac-ccrcc/},\n  author = {{National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC)}},\n  title = {The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC)},\n  publisher = {The Cancer Imaging Archive},\n  year =

## Summary

That's it! We hope you learned how to search, visualize, and download images from IDC, and how to comply with the usage terms by understanding which license covers a specific dataset and how to attribute its authors.

We hope you enjoyed this tutorial! If something didn't work as expected, or if you have any feedback or suggestions about what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or posting your question on the [IDC User forum](https://discourse.canceridc.dev).

## Acknowledgments

Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

If you use IDC in your research, please cite the following publication:

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180