<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part2_searching_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with IDC - Part 2: Basics of searching IDC data

---

## Summary

This notebook is part of [the "Getting started with IDC" notebook series](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started), which introduces NCI Imaging Data Commons to users who want to interact with IDC programmatically.

In [Part 1](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started/part1_prerequisites.ipynb) of this tutorial series, you learned how to complete the prerequisites required to search IDC data. If you have not completed Part 1, that's OK: this notebook is self-contained, and you don't need to go back to Part 1! This notebook continues at the beginner level and shows you how to use [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that aims to simplify access to the data available from NCI Imaging Data Commons.

Upon completion of this tutorial, you will have learned:
* how to download images from IDC
* how to load and view DICOM images available from IDC
* what metadata attributes are available for filtering and how to filter data programmatically
* how to better leverage IDC visualization capabilities by programmatically generating viewer URLs for selected items, and embedding the viewer in your notebook
* how to get the licensing information for the individual files, and how to comply with the attribution requirement, which is in place for most of the data in IDC

---
Initial version: Nov 2022

Updated: August 2025


## Warnings: read these to avoid errors!

In order to avoid errors, please keep in mind the following warnings:

1. **Execute each cell of the notebook in order without skipping** any code cells - otherwise you will likely encounter runtime errors!
2. If you are going through this notebook for the first time, **do not change any code until you have successfully completed the notebook** (unless you are specifically asked to change something). Otherwise you may run into errors.



## Prerequisites

This tutorial relies on [`idc-index`](https://github.com/ImagingDataCommons/idc-index) - a Python package that accompanies IDC and provides basic functionality for searching and accessing data from IDC.

We will install `idc-index` using `pip`. This package is under active development, and so we will use the `--upgrade` option to access the latest version. It should take around 1-2 minutes to install the dependencies and complete the next cell.

In [2]:
%%capture
!pip install idc-index --upgrade

## Downloading data from IDC

One of the first questions we get from the users of IDC is "How do I download the images?"

With `idc-index`, answering this question is easy.

We will head over to the IDC Portal, which provides a basic interface to explore the data available in IDC. As you move around the IDC Portal explore page, you will be able to copy identifiers corresponding to the content available in IDC at the different levels of the data hierarchy - starting from entire collections, down to the individual image series.

Try this out yourself! Open the IDC Portal explore page [https://portal.imaging.datacommons.cancer.gov/explore/](https://portal.imaging.datacommons.cancer.gov/explore/) in a separate window, and experiment with copying these identifiers. Identifiers for collections and patients/cases will be short strings, while unique identifiers for studies and series will be rather long strings of text that contain numbers and "." characters.

![IDC Portal demo](https://github.com/ImagingDataCommons/IDC-Tutorials/releases/download/0.2.0/Tutorial-copy.gif)
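If you are not sure which kind of identifier you copied, a simple shape check can help. The following is a minimal sketch and is not part of `idc-index`; the function name `looks_like_dicom_uid` is hypothetical. It relies on the DICOM rule that a UID consists of dot-separated numeric components and is at most 64 characters long.

```python
import re

# DICOM UIDs: dot-separated numeric components, no leading zeros in
# multi-digit components, at most 64 characters in total.
_UID_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))+")


def looks_like_dicom_uid(identifier: str) -> bool:
    """Return True if the string has the shape of a DICOM UID.

    Hypothetical helper: it checks the shape only, not whether the
    identifier actually exists in IDC.
    """
    if len(identifier) > 64:
        return False
    return _UID_PATTERN.fullmatch(identifier) is not None
```

A study or series identifier copied from the portal should pass this check, while a collection or patient identifier such as `LUNG1-025` should not.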

The identifier you have in the clipboard is all you need to access the corresponding images. To download those, we will first instantiate `IDCClient`, which is provided by `idc-index`.

In [3]:
from idc_index import IDCClient

idc_client = IDCClient()

In the following cell, we pass the identifier of a study to the function `download_from_selection()` to fetch the corresponding files. For the sake of the tutorial, please proceed with the identifier used in the following cell. You can experiment with the identifiers you copied from the portal once you have completed the notebook.


In [4]:
studyInstanceUID = "1.2.840.113654.2.55.68425808326883186792123057288612355322"


idc_client.download_from_selection(studyInstanceUID=studyInstanceUID, downloadDir=".")

# You can experiment with downloading all of the data for a patient by its PatientID, or for entire collections
# by uncommenting the lines below!
#idc_client.download_from_selection(patientId="LUNG1-025", downloadDir=".")
#idc_client.download_from_selection(collection_id="ct_phantom4radiomics", downloadDir=".")


Downloading data: 100%|██████████| 314M/314M [00:03<00:00, 88.4MB/s]


You can also download from IDC using the command-line helper tool that is installed as part of the `idc-index` package. The cell below shows how to download that same study from the command line.

In [None]:
!idc download 1.2.840.113654.2.55.68425808326883186792123057288612355322 --download-dir .

2025-08-29 14:58:13,898 - Downloading from IDC v21 index
2025-08-29 14:58:14,002 - Identified matching StudyInstanceUID: ['1.2.840.113654.2.55.68425808326883186792123057288612355322']
2025-08-29 14:58:14,055 - Disk size needed: 314.45 MB
2025-08-29 14:58:14,055 - Disk size available: 200.0 GB
2025-08-29 14:58:14,201 - Not using s5cmd sync as the destination folder is empty or sync or progress bar is not requested
2025-08-29 14:58:14,207 - Initial size of the directory: 66.35 MB
2025-08-29 14:58:14,207 - Approximate size of the files that need to be downloaded: 314.45 MB
Downloading data:  79% 248M/314M [00:02<00:00, 94.9MB/s]
2025-08-29 14:58:16,828 - Successfully downloaded files to /content


Once the download is completed, you can check the current directory to examine the content. Note that files are downloaded into a folder hierarchy to make navigation easier. The default hierarchy is `collection_id` > `PatientID` > `StudyInstanceUID` > `Modality`_`SeriesInstanceUID`, and the files you downloaded should be organized as shown below.

```
├── nlst   <--- collection ID
│   └── 100002   <--- Patient ID
│       └── 1.2.840.113654.2.55.68425808326883186792123057288612355322   <--- Study ID
│           ├── CT_1.2.840.113654.2.55.229650531101716203536241646069123704792   <--- Series ID
│           │   ├── 0025b198-6198-4b33-85cf-92582531ad28.dcm   <--- individual instances/files of the series
│           │   ├── 00a93dcb-4cd0-46ca-ae41-9240421cb0c7.dcm        (corresponding to slices in radiology and
│           │   ├── 00aa5957-1b4e-4b8b-879f-c505e12f2dcc.dcm         resolution layers in digital pathology)
│           │   ├── 02da9050-622c-4e00-a0d4-a70dd418973c.dcm
...
```
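Because the folder names encode the identifiers, you can recover them from the paths alone. The sketch below assumes the default hierarchy shown above; the function name `summarize_download` is hypothetical and not part of `idc-index`. It counts the `.dcm` files in each series folder.

```python
from collections import defaultdict
from pathlib import Path


def summarize_download(download_dir):
    """Count .dcm files per series folder.

    Assumes the default layout:
    collection_id/PatientID/StudyInstanceUID/Modality_SeriesInstanceUID/*.dcm
    Returns {(collection, patient, study, modality, series): file_count}.
    """
    counts = defaultdict(int)
    root = Path(download_dir)
    for dcm in root.glob("*/*/*/*/*.dcm"):
        collection, patient, study, series_dir = dcm.relative_to(root).parts[:4]
        # Split on the first underscore only: "CT_1.2.3" -> ("CT", "1.2.3")
        modality, _, series_uid = series_dir.partition("_")
        counts[(collection, patient, study, modality, series_uid)] += 1
    return dict(counts)
```

Calling `summarize_download(".")` after the download cell should return one entry per downloaded series. If you change the folder layout with the `dirTemplate` argument, this sketch no longer applies.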

The DICOM study we downloaded is from the National Lung Screening Trial (NLST) collection available in IDC. This study contains two Computed Tomography (CT) DICOM series, two DICOM Segmentations (SEG) with the results of segmenting the CT series using [TotalSegmentator](https://github.com/wasserth/TotalSegmentator), and four DICOM Structured Reporting series with the radiomics features extracted from the segmented images. Most of the CT images in the NLST collection in IDC are accompanied by TotalSegmentator segmentations and corresponding radiomics features!

## Searching IDC data

In the previous exercise, we identified the data by navigating the IDC Portal. IDC Portal can be very helpful if you want to browse through the data, quickly check out individual images, or get an idea about what is available.

You can also filter IDC data and explore what is available programmatically: `idc-index` to the rescue!

`idc-index` is named this way because it wraps an _index_ of IDC data: a table containing the most important metadata attributes describing the files available in IDC. This metadata index is available in the `index` attribute of `IDCClient` (a pandas `DataFrame`), which contains the following columns.




In [5]:
print(idc_client.index.columns.values)

['collection_id' 'analysis_result_id' 'PatientID' 'SeriesInstanceUID'
 'StudyInstanceUID' 'source_DOI' 'PatientAge' 'PatientSex' 'StudyDate'
 'StudyDescription' 'BodyPartExamined' 'Modality' 'Manufacturer'
 'ManufacturerModelName' 'SeriesDate' 'SeriesDescription' 'SeriesNumber'
 'instanceCount' 'license_short_name' 'aws_bucket' 'crdc_series_uuid'
 'series_aws_url' 'series_size_MB']


We will discuss just a few of those columns - you can learn about those not discussed in [this `idc-index` documentation page](https://idc-index.readthedocs.io/en/latest/column_descriptions.html).

IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by **`PatientID`** column) undergo imaging exams (or _studies_, in DICOM nomenclature).

Each patient will have one or more studies, with each study identified uniquely by the attribute **`StudyInstanceUID`**. During each of the imaging studies, one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. Imaging series are uniquely identified by **`SeriesInstanceUID`**.

Individual collections within IDC group patients/cases, and are recognized by their **`collection_id`** values.

The function we used earlier - `download_from_selection()` - can be used to download images given any of these identifiers: `collection_id`,
`PatientID`, `StudyInstanceUID`, or `SeriesInstanceUID`.

In [6]:
help(idc_client.download_from_selection)

Help on method download_from_selection in module idc_index.index:

download_from_selection(downloadDir, dry_run=False, collection_id=None, patientId=None, studyInstanceUID=None, seriesInstanceUID=None, sopInstanceUID=None, crdc_series_uuid=None, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws') method of idc_index.index.IDCClient instance
    Download the files corresponding to the selection. The filtering will be applied in sequence (but does it matter?) by first selecting the collection(s), followed by
    patient(s), study(studies) and series. If no filtering is applied, all the files will be downloaded.

    Args:
        downloadDir: string containing the path to the directory to download the files to
        dry_run: calculates the size of the cohort but download does not start
        collection_id: string or list of strings containing the values of colle
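The `dirTemplate` argument in the signature above determines the folder layout of the download. To illustrate how the placeholders map to a path, here is a minimal sketch. It is not how `idc-index` implements the template; the function name `expand_dir_template` is hypothetical, and the example only assumes that each `%name` placeholder is replaced by the value of the matching index column.

```python
def expand_dir_template(template, attributes):
    """Replace each %name placeholder with the matching attribute value.

    Illustrative only: not the idc-index implementation.
    """
    path = template
    # Substitute longer names first so that a short name cannot clobber
    # a longer placeholder that starts with the same characters.
    for name in sorted(attributes, key=len, reverse=True):
        path = path.replace("%" + name, str(attributes[name]))
    return path


default_template = (
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"
)
example = {
    "collection_id": "nlst",
    "PatientID": "100002",
    "StudyInstanceUID": "1.2.840.113654.2.55.68425808326883186792123057288612355322",
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.840.113654.2.55.229650531101716203536241646069123704792",
}
print(expand_dir_template(default_template, example))
```

The printed path matches the folder names of the CT series in the directory listing shown earlier.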

Any of the columns included in the index can be used to build filters, or selection queries, subsetting content that meets specific requirements.

For any column of the index, we can check its unique values to first understand what is available.

As an example, let's look at the `Modality` column, which contains an abbreviation encoding the type of image.

In [7]:
print(idc_client.index['Modality'].unique())

['CT' 'MR' 'M3D' 'PT' 'RTSTRUCT' 'SM' 'SEG' 'MG' 'US' 'SR' 'PR' 'DX'
 'RTPLAN' 'NM' 'REG' 'CR' 'RTDOSE' 'RWV' 'SC' 'FUSION' 'KO' 'XA' 'XC' 'OT'
 'RF' 'ANN']


As an exercise in using the index to subset IDC data, we will search for all Magnetic Resonance series. This can be done using the `Modality` column we used above. If you want to know what each of the abbreviations above stands for, check out this page from the DICOM standard: https://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.7.3.html#sect_C.7.3.1.1.1. For the sake of this exercise, the "MR" value corresponds to "Magnetic Resonance".
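If you prefer not to look up the abbreviations by hand, a small lookup table can help. The sketch below covers only a handful of the codes listed above and is deliberately incomplete; the names `MODALITY_NAMES` and `describe_modality` are hypothetical, and the DICOM standard page remains the authoritative source.

```python
# A partial, hand-written mapping of DICOM Modality codes to their meaning.
# Not exhaustive: consult the DICOM standard for the complete list.
MODALITY_NAMES = {
    "CT": "Computed Tomography",
    "MR": "Magnetic Resonance",
    "PT": "Positron emission tomography (PET)",
    "US": "Ultrasound",
    "MG": "Mammography",
    "CR": "Computed Radiography",
    "DX": "Digital Radiography",
    "NM": "Nuclear Medicine",
    "SM": "Slide Microscopy",
    "SEG": "Segmentation",
    "SR": "SR Document",
    "RTSTRUCT": "Radiotherapy Structure Set",
}


def describe_modality(code):
    """Return the meaning of a Modality code, or None if it is not listed here."""
    return MODALITY_NAMES.get(code.strip().upper())


print(describe_modality("MR"))
```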

`index` is just a pandas `DataFrame`, and you can use pandas syntax to do the selection, as shown in the next cell.

In [8]:
mr_selection_pd = idc_client.index[idc_client.index['Modality'] == "MR"]

As an alternative, you can use Structured Query Language (SQL). The following cell accomplishes the same task using an SQL query.

In [9]:
query = """
SELECT *
FROM index
WHERE Modality = 'MR'
"""

mr_selection_sql = idc_client.sql_query(query)

In both cases, the result is returned as a pandas `DataFrame` containing the rows that have `Modality` set to `MR`. Both approaches select the same series: the output below lists more than 123,000 rows, although the exact count depends on the IDC data version you are using.

In [10]:
mr_selection_sql

Unnamed: 0,collection_id,analysis_result_id,PatientID,SeriesInstanceUID,StudyInstanceUID,source_DOI,PatientAge,PatientSex,StudyDate,StudyDescription,...,ManufacturerModelName,SeriesDate,SeriesDescription,SeriesNumber,instanceCount,license_short_name,aws_bucket,crdc_series_uuid,series_aws_url,series_size_MB
0,upenn_gbm,,UPENN-GBM-00370,1.3.6.1.4.1.14519.5.2.1.2124080571797641357837...,1.3.6.1.4.1.14519.5.2.1.2257953328133520977052...,10.7937/tcia.709x-dn49,068Y,M,2010-07-17,BrainTumor,...,TrioTim,2010-07-17,ep2d_perf BOLUS_T,13,900,CC BY 4.0,idc-open-data,b7403fcb-0f50-4676-88f6-d1b3366044e4,s3://idc-open-data/b7403fcb-0f50-4676-88f6-d1b...,33.25
1,remind,,ReMIND-067,1.3.6.1.4.1.14519.5.2.1.1079071675559977261971...,1.3.6.1.4.1.14519.5.2.1.6126101052383745395058...,10.7937/3rag-d070,,,1982-12-25,Intraop,...,Unknown model,2007-03-08,3D_AX_T1_postcontrast,6,176,CC BY 4.0,idc-open-data,594c5012-75b9-41a8-9739-d0bdad208976,s3://idc-open-data/594c5012-75b9-41a8-9739-d0b...,23.46
2,breast_mri_nact_pilot,,UCSF-BR-66,1.3.6.1.4.1.14519.5.2.1.7695.2311.136826935299...,1.3.6.1.4.1.14519.5.2.1.7695.2311.228643639548...,10.7937/k9/tcia.2016.qhsyhjky,048Y,F,1992-01-27,"MR BREAS, UNIT",...,GENESIS_SIGNA,1992-01-27,Dynamic-3dfgre: PE1,31001,60,CC BY 3.0,idc-open-data,3b9997d6-1c93-431d-83a2-80bc72ed35e5,s3://idc-open-data/3b9997d6-1c93-431d-83a2-80b...,8.19
3,tcga_brca,,TCGA-E2-A14N,1.3.6.1.4.1.14519.5.2.1.3023.4002.204593323712...,1.3.6.1.4.1.14519.5.2.1.3023.4002.322707965124...,10.7937/k9/tcia.2016.ab2nazrp,,,2001-07-11,"MR BREAST, UNILATERAL W/WO CONT",...,SIGNA EXCITE,2001-07-11,SAG 3D (PRE-CONTRAST),5,84,CC BY 3.0,idc-open-data,a464b375-ab0b-41fd-a108-3a46ec3494a4,s3://idc-open-data/a464b375-ab0b-41fd-a108-3a4...,11.57
4,cptac_ucec,,C3N-01761,1.3.6.1.4.1.14519.5.2.1.3320.3273.115066127656...,1.3.6.1.4.1.14519.5.2.1.3320.3273.325425058352...,10.7937/k9/tcia.2018.3r3juisw,,F,2000-06-08,MIEDNICA,...,Avanto_fit,2000-06-08,t1_fl3d_tra fs_dyn CM,14,88,CC BY 3.0,idc-open-data,fe228590-bffb-48f9-b845-b2534464b663,s3://idc-open-data/fe228590-bffb-48f9-b845-b25...,12.61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123283,acrin_6698,,ACRIN-6698-942017,1.3.6.1.4.1.14519.5.2.1.7695.4164.226131662079...,1.3.6.1.4.1.14519.5.2.1.7695.4164.245512235300...,10.7937/tcia.kk02-6d95,,F,2001-02-23,ACRIN-6698_ISPY2_MRI_T2,...,Signa HDxt,2001-02-23,ACRIN-6698: DWI MASK: from S4: Whole Tumor Manual,49000,11,CC BY 4.0,idc-open-data,94e1499a-38e1-40ed-92de-fd30e63f6a5b,s3://idc-open-data/94e1499a-38e1-40ed-92de-fd3...,1.47
123284,acrin_6698,,ACRIN-6698-490023,1.3.6.1.4.1.14519.5.2.1.7695.4164.203665373427...,1.3.6.1.4.1.14519.5.2.1.7695.4164.277230754402...,10.7937/tcia.kk02-6d95,,F,2001-09-13,ACRIN-6698_ISPY2_MRI_T0,...,Avanto,2001-09-13,ISPY2: VOLSER: uni-lateral cropped: PE6,71006,80,CC BY 4.0,idc-open-data,0e8f6fff-6736-489c-ad9c-19258d40ac02,s3://idc-open-data/0e8f6fff-6736-489c-ad9c-192...,10.95
123285,acrin_6698,,ACRIN-6698-146987,1.3.6.1.4.1.14519.5.2.1.7695.4164.173316862766...,1.3.6.1.4.1.14519.5.2.1.7695.4164.984568341780...,10.7937/tcia.kk02-6d95,,F,2002-02-09,ACRIN-6698_ISPY2_MRI_T1,...,Signa HDxt,2002-02-09,ISPY2: IP:T2fseidealarc_BP,3995,47,CC BY 4.0,idc-open-data,c93e95a7-5ce8-4deb-bed8-7782e66f795a,s3://idc-open-data/c93e95a7-5ce8-4deb-bed8-778...,24.93
123286,acrin_6698,,ACRIN-6698-384830,1.3.6.1.4.1.14519.5.2.1.7695.4164.178469555093...,1.3.6.1.4.1.14519.5.2.1.7695.4164.118962673347...,10.7937/tcia.kk02-6d95,,F,2001-11-17,ACRIN-6698_ISPY2_MRI_T1,...,Avanto,2001-11-17,ISPY2: VOLSER: uni-lateral cropped: PE2,71002,80,CC BY 4.0,idc-open-data,a53accf9-ec5a-45fe-9d55-d17f7a402df3,s3://idc-open-data/a53accf9-ec5a-45fe-9d55-d17...,10.94


As you search the data, you can combine multiple columns to build your cohort!

The `BodyPartExamined` column describes the anatomy that was imaged. Similar to what we did to understand the values available in the `Modality` column, let's find the distinct values, and this time also count the number of times each value is encountered.

In [12]:
sorted_unique_values = idc_client.index['BodyPartExamined'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

BodyPartExamined
CHEST               353505
BREAST              100341
PROSTATE             22984
LUNG                 10971
ABDOMEN               9036
                     ...  
WO INTER                 2
BD CT ABD WO_W C         2
EAR                      2
BODY                     1
FEMUR                    1
Name: count, Length: 63, dtype: int64


For the sake of example, let's filter images that meet two criteria: Magnetic Resonance (`MR`) as `Modality`, and `LIVER` as `BodyPartExamined`.

In [13]:
query = """
SELECT *
FROM index
WHERE Modality = 'MR' AND BodyPartExamined = 'LIVER'
"""

liver_mr_selection_sql = idc_client.sql_query(query)

In [14]:
liver_mr_selection_sql

Unnamed: 0,collection_id,analysis_result_id,PatientID,SeriesInstanceUID,StudyInstanceUID,source_DOI,PatientAge,PatientSex,StudyDate,StudyDescription,...,ManufacturerModelName,SeriesDate,SeriesDescription,SeriesNumber,instanceCount,license_short_name,aws_bucket,crdc_series_uuid,series_aws_url,series_size_MB
0,tcga_lihc,,TCGA-DD-A4NO,1.3.6.1.4.1.14519.5.2.1.3344.4008.205544194556...,1.3.6.1.4.1.14519.5.2.1.3344.4008.131847076153...,10.7937/k9/tcia.2016.immqw8uq,068Y,M,2002-12-26,MRI Abd wo&w,...,Signa HDxt,2002-12-26,ax FSE,7,38,CC BY 3.0,idc-open-data,0c873968-1e73-42ad-8058-8414bd3ad8a1,s3://idc-open-data/0c873968-1e73-42ad-8058-841...,20.17
1,tcga_lihc,,TCGA-DD-A4NF,1.3.6.1.4.1.14519.5.2.1.3344.4008.601240630711...,1.3.6.1.4.1.14519.5.2.1.3344.4008.146126813486...,10.7937/k9/tcia.2016.immqw8uq,071Y,M,2004-02-24,MR ABDOMEN W/WO CONTRAST,...,Symphony,2004-02-24,testbolus,17,40,CC BY 3.0,idc-open-data,995562b0-565c-433c-a5c9-bff81399968b,s3://idc-open-data/995562b0-565c-433c-a5c9-bff...,5.37
2,tcga_lihc,,TCGA-G3-AAV7,1.3.6.1.4.1.14519.5.2.1.1079.4008.268033413554...,1.3.6.1.4.1.14519.5.2.1.1079.4008.943908897077...,10.7937/k9/tcia.2016.immqw8uq,038Y,M,2007-06-10,MR ABDOMEN NONENHANCED & ENHANCED-BODY,...,Aera,2007-06-10,Test_bolus,11,60,CC BY 3.0,idc-open-data,3bec1f96-0b81-4ee3-bcce-f43483d9efef,s3://idc-open-data/3bec1f96-0b81-4ee3-bcce-f43...,6.63
3,tcga_lihc,,TCGA-DD-A4NP,1.3.6.1.4.1.14519.5.2.1.3344.4008.219996741371...,1.3.6.1.4.1.14519.5.2.1.3344.4008.228865235786...,10.7937/k9/tcia.2016.immqw8uq,032Y,M,1998-04-26,*MRI - ABDOMEN,...,SIGNA EXCITE,1998-04-26,(epi dwi b100),6,100,CC BY 3.0,idc-open-data,a243d0e5-f721-41c2-989a-d6f447067cd8,s3://idc-open-data/a243d0e5-f721-41c2-989a-d6f...,13.67
4,tcga_lihc,,TCGA-DD-A11B,1.3.6.1.4.1.14519.5.2.1.3344.4008.193275788161...,1.3.6.1.4.1.14519.5.2.1.3344.4008.227223611743...,10.7937/k9/tcia.2016.immqw8uq,073Y,M,1998-01-30,*MRI - ABDOMEN,...,SIGNA EXCITE,1998-01-30,improved IP/OP,4,60,CC BY 3.0,idc-open-data,f8c4737c-c2c0-4ac0-b6c1-ac200c682dbc,s3://idc-open-data/f8c4737c-c2c0-4ac0-b6c1-ac2...,8.20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
928,tcga_lihc,,TCGA-DD-A4NF,1.3.6.1.4.1.14519.5.2.1.3344.4008.248956317862...,1.3.6.1.4.1.14519.5.2.1.3344.4008.820801077661...,10.7937/k9/tcia.2016.immqw8uq,072Y,M,2005-04-04,MRI ABDOMEN W/O AND WITH CONTRAST,...,SIGNA EXCITE,2005-04-04,POST-T2 FRFSE AXIAL FAT/SAT,6,34,CC BY 3.0,idc-open-data,57ef2f77-ba85-490a-8999-0335623e22e4,s3://idc-open-data/57ef2f77-ba85-490a-8999-033...,4.65
929,tcga_lihc,,TCGA-BC-A216,1.3.6.1.4.1.14519.5.2.1.8421.4008.157935428926...,1.3.6.1.4.1.14519.5.2.1.8421.4008.299584989659...,10.7937/k9/tcia.2016.immqw8uq,062Y,F,1998-10-19,Outside Read or Comparison MRI,...,Espree,1998-10-19,t1_vibe_fs_tra_bh 3,10,72,CC BY 3.0,idc-open-data,ae3ff38b-b711-45cf-8c03-9ba97c9db333,s3://idc-open-data/ae3ff38b-b711-45cf-8c03-9ba...,37.97
930,tcga_lihc,,TCGA-G3-AAV2,1.3.6.1.4.1.14519.5.2.1.1079.4008.297303823931...,1.3.6.1.4.1.14519.5.2.1.1079.4008.242945140477...,10.7937/k9/tcia.2016.immqw8uq,060Y,M,2007-01-11,MR ABDOMEN NONENHANCED & ENHANCED-BODY,...,Aera,2007-01-11,COR T2 HASTE NAV,4,52,CC BY 3.0,idc-open-data,f4866927-427a-4740-b9fc-f53efcebd05f,s3://idc-open-data/f4866927-427a-4740-b9fc-f53...,7.04
931,tcga_lihc,,TCGA-BC-A217,1.3.6.1.4.1.14519.5.2.1.8421.4008.293137743511...,1.3.6.1.4.1.14519.5.2.1.8421.4008.196083297365...,10.7937/k9/tcia.2016.immqw8uq,074Y,F,2000-06-19,Outside Read or Comparison MRI,...,SIGNA EXCITE,2000-06-19,Asset cal NON BH,5,42,CC BY 3.0,idc-open-data,890a79d6-6500-4918-ad25-9ea1a7ef56f7,s3://idc-open-data/890a79d6-6500-4918-ad25-9ea...,0.58
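When you combine several equality conditions, writing the `WHERE` clause by hand gets repetitive. The following minimal sketch builds the same kind of query from a dictionary of column names and values. The helper `build_index_query` is hypothetical and not part of `idc-index`; it only handles exact-match conditions joined with `AND`.

```python
def build_index_query(filters):
    """Build a SELECT over the index table from {column: value} pairs.

    Hypothetical helper: supports only equality conditions joined by AND.
    """
    if not filters:
        return "SELECT * FROM index"
    conditions = []
    for column, value in filters.items():
        # Accept only plain identifiers as column names.
        if not column.isidentifier():
            raise ValueError(f"unexpected column name: {column!r}")
        # Escape single quotes inside the value by doubling them.
        escaped = str(value).replace("'", "''")
        conditions.append(f"{column} = '{escaped}'")
    return "SELECT * FROM index WHERE " + " AND ".join(conditions)


query = build_index_query({"Modality": "MR", "BodyPartExamined": "LIVER"})
print(query)
```

The resulting string can be passed to `idc_client.sql_query(query)` in the same way as the hand-written query above.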


## Working with the search results

Now that we have learned the basics of searching, we will go over some of the operations you can apply to the search results. We will learn how to download your selection, how to visualize individual images, and how to learn about the terms of use and attribution - this will be important if you use images from IDC in publications or commercial work!

### Downloading selected images

Earlier we learned how to download a selected study, where the study identifier was copied from the IDC Portal.

You can use the same function to download images that you found by searching the index. In the following cell we will use `SeriesInstanceUID` passed to the `download_from_selection()` function to download the first 10 series.

In [15]:
idc_client.download_from_selection(seriesInstanceUID = list(liver_mr_selection_sql['SeriesInstanceUID'].values[:10]), downloadDir=".")

Downloading data: 100%|██████████| 189M/189M [00:03<00:00, 60.5MB/s]


### Visualizing selected images

To better understand the images available, you may want to take a look at them first. You do not need to leave the notebook to do that! `idc-index` provides a convenience function to generate a URL to open the viewer, and to even embed it in your notebook cell!

In the following cell we will pick a random row from the selection we made earlier, and generate the URL to open that image series in the IDC-maintained image viewer.

In [16]:
import random

random_series = random.choice(liver_mr_selection_sql['SeriesInstanceUID'].values)
viewer_url = idc_client.get_viewer_URL(seriesInstanceUID=random_series, viewer_selector="ohif_v3")

print(viewer_url)

https://viewer.imaging.datacommons.cancer.gov/v3/viewer/?StudyInstanceUIDs=1.3.6.1.4.1.14519.5.2.1.8421.4008.250364384362743777013715074894&SeriesInstanceUIDs=1.3.6.1.4.1.14519.5.2.1.8421.4008.145491249984459210883479000945
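The printed URL shows that the viewer link is simply a base address followed by the study and series identifiers as query parameters. The sketch below reproduces that pattern for illustration only. The base address is copied from the output above and may change in the future, and the function name `build_ohif_v3_url` is hypothetical, so in practice you should rely on `get_viewer_URL()`.

```python
from urllib.parse import urlencode

# Base URL copied from the output above. This is an assumption that may
# not hold for other viewers or future IDC releases.
OHIF_V3_BASE = "https://viewer.imaging.datacommons.cancer.gov/v3/viewer/"


def build_ohif_v3_url(study_uid, series_uid=None):
    """Assemble a viewer URL following the pattern seen in the output above."""
    params = {"StudyInstanceUIDs": study_uid}
    if series_uid is not None:
        params["SeriesInstanceUIDs"] = series_uid
    return OHIF_V3_BASE + "?" + urlencode(params)


print(
    build_ohif_v3_url(
        "1.3.6.1.4.1.14519.5.2.1.8421.4008.250364384362743777013715074894",
        "1.3.6.1.4.1.14519.5.2.1.8421.4008.145491249984459210883479000945",
    )
)
```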


In the following cell, we embed the viewer that opens the URL for the selected series into the output cell. This way you can examine items you selected without leaving the notebook page!

In [17]:
from IPython.display import IFrame
IFrame(viewer_url, width=1600, height=900)

### Getting license information for the selected images

Each of the files available from IDC is accompanied by a license that defines the terms of use. In most cases, those are generic, broadly accepted Creative Commons licenses.

The abbreviated license code is available in the `license_short_name` column. In the following cell we list the licenses encountered across all of the selected series. You will see that the only license encountered is the Creative Commons Attribution license https://creativecommons.org/licenses/by/3.0/deed.en, which does not restrict commercial use, but requires you to acknowledge the author if you reuse the item.

Most (>95%) of the images in IDC are shared under the permissive CC-BY license, which allows commercial use (but does require attribution)!

In [18]:
sorted_unique_values = liver_mr_selection_sql['license_short_name'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

license_short_name
CC BY 3.0    933
Name: count, dtype: int64
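To read the full terms of a license, it helps to turn the short name into a link. The sketch below assumes that the short names follow the shape seen in this notebook, such as `CC BY 3.0` or `CC BY 4.0`. The function name `cc_license_url` is hypothetical, and names that do not match this shape are returned as `None` rather than guessed.

```python
import re

# Matches short names such as "CC BY 3.0" or "CC BY-NC 4.0".
_CC_PATTERN = re.compile(r"CC (BY(?:-NC)?(?:-SA|-ND)?) (\d+\.\d+)")


def cc_license_url(short_name):
    """Return the Creative Commons URL for a short name, or None if unrecognized."""
    match = _CC_PATTERN.fullmatch(short_name.strip())
    if match is None:
        return None
    variant, version = match.groups()
    return f"https://creativecommons.org/licenses/{variant.lower()}/{version}/"


print(cc_license_url("CC BY 3.0"))
```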


### Getting citation information

To get more details about how the data was collected, you should use the `source_DOI` column, which contains the Digital Object Identifier (DOI) of the dataset from which the given file originates.

We can easily access the complete list of DOIs that accompany the items in our selection.

In [19]:
sorted_unique_values = liver_mr_selection_sql['source_DOI'].value_counts().sort_values(ascending=False)

print(sorted_unique_values)

source_DOI
10.7937/k9/tcia.2016.immqw8uq    910
10.7937/k9/tcia.2018.oblamn27     23
Name: count, dtype: int64
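A DOI can be opened in a browser by prefixing it with the `https://doi.org/` resolver address. The following minimal sketch turns a list of DOIs into one link per distinct dataset; the function name `doi_urls` is hypothetical.

```python
def doi_urls(dois):
    """Return one resolver URL per distinct DOI, preserving first-seen order."""
    # dict.fromkeys removes duplicates while keeping insertion order.
    # Non-string entries (for example missing values) are skipped.
    distinct = dict.fromkeys(
        doi.strip() for doi in dois if isinstance(doi, str) and doi.strip()
    )
    return ["https://doi.org/" + doi for doi in distinct]


print(
    doi_urls(
        [
            "10.7937/k9/tcia.2016.immqw8uq",
            "10.7937/k9/tcia.2018.oblamn27",
            "10.7937/k9/tcia.2016.immqw8uq",
        ]
    )
)
```

You can pass the column directly, for example `doi_urls(liver_mr_selection_sql['source_DOI'])`.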


If you use data from IDC, you should also acknowledge IDC as the source of the data, and cite individual datasets that you used.

To help you comply with the attribution requirements, `idc-index` provides a convenience function `citations_from_selection` that will look up the DOIs and generate the list of citations.

WARNING: As of May 30, 2024, due to server issues at api.crossref.org, the following cell may not work. In the future, we will replace the API call to CrossRef with a cached list of publications to address this issue.


In [20]:
idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values))

['Erickson, B. J., Kirk, S., Lee, Y., Bathe, O., Kearns, M., Gerdes, C., Rieger-Christ, K., &amp; Lemmerman, J. (2016). <i>The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC)</i> (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.IMMQW8UQ',
 'National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). <i>The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC)</i> (Version 14) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.OBLAMN27',
 'Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., â\x80¦ Kikinis, R. (2023). National Cancer Institute Imaging Data Commons: Toward Transparency,              
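The citation strings above contain HTML markup such as `<i>` tags and `&amp;` entities. If you need plain text, you can clean them up with the standard library. This is a minimal sketch; the function name `clean_citation` is hypothetical, and it does not attempt to repair other encoding problems.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")


def clean_citation(text):
    """Strip HTML tags and decode HTML entities in a citation string."""
    # Remove tags first, so that escaped characters such as &lt; are not
    # mistaken for markup after decoding.
    without_tags = _TAG.sub("", text)
    return html.unescape(without_tags)


print(clean_citation("Kearns, M., &amp; Lemmerman, J. (2016). <i>TCGA-LIHC</i> (Version 5)"))
```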

You can also request the citation list in BibTeX format (see the documentation of the function to learn more).

In [21]:
from idc_index import index
idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values), citation_format=index.IDCClient.CITATION_FORMAT_BIBTEX)

['@misc{https://doi.org/10.7937/k9/tcia.2016.immqw8uq,\n  doi = {10.7937/K9/TCIA.2016.IMMQW8UQ},\n  url = {https://www.cancerimagingarchive.net/collection/tcga-lihc/},\n  author = {Erickson, Bradley J. and Kirk, Shanah and Lee, Yueh and Bathe, Oliver and Kearns, Melissa and Gerdes, Cindy and Rieger-Christ, Kimberly and Lemmerman, John},\n  title = {The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC)},\n  publisher = {The Cancer Imaging Archive},\n  year = {2016},\n  copyright = {Creative Commons Attribution 3.0 Unported}\n}\n',
 '@misc{https://doi.org/10.7937/k9/tcia.2018.oblamn27,\n  doi = {10.7937/K9/TCIA.2018.OBLAMN27},\n  url = {https://www.cancerimagingarchive.net/collection/cptac-ccrcc/},\n  author = {{National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC)}},\n  title = {The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC)},\n  publisher = {The Cancer Imaging Archive},\n  year =
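Since the function returns a list of BibTeX strings, you may want to save them to a `.bib` file for use with your reference manager. The following is a minimal sketch; the function name `write_bibtex` and the output file name in the usage note are hypothetical.

```python
from pathlib import Path


def write_bibtex(entries, path):
    """Write BibTeX entries to a file, separated by blank lines.

    Returns the number of entries written.
    """
    entries = [entry.strip() for entry in entries]
    text = "\n\n".join(entries) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return len(entries)
```

For example, you could store the list returned by the cell above in a variable and call `write_bibtex(bibtex_entries, "idc_citations.bib")`.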

## Summary

This is it! We hope you learned how to search, visualize, and download images from IDC, and how to comply with the usage terms by understanding which license covers a specific dataset and how to attribute its authors.

We hope you enjoyed this tutorial! If you want to learn more about IDC, you can check out the [Getting Started documentation page](https://learn.canceridc.dev/getting-started-with-idc), or take a look at other tutorials we have in the [IDC-Tutorials GitHub repository](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master).

If something didn't work as expected, or if you have any feedback or suggestions about what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or by posting your question on the [IDC User forum](https://discourse.canceridc.dev).

## Acknowledgments

Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

If you use IDC in your research, please cite the following publication:

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180