<a href="https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries_for_Public_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary


Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers. 

**This notebook covers basic use cases for querying TCIA's REST APIs to learn about open-access datasets that don't require a user account.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks. You can also view a list of GitHub repositories that have tagged themselves as relevant to TCIA at https://github.com/topics/tcia-dac.

# 1 Learn About Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of datasets on TCIA are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  

# 2 REST API Overview 
TCIA uses software called NBIA to manage DICOM data. The NBIA REST APIs expose the search and download functions used in the TCIA radiology portal and provide access to both public and limited-access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.

This notebook will focus on the fully public [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB). If you'd like to see examples using the APIs that require authentication, check out [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/ACNS0332/ACNS0332.ipynb), which shows many similar examples with the additional steps necessary to create a secure token using your TCIA login credentials.

***Note:*** Many of the examples below allow for additional query parameters to refine your results. These are covered in the documentation links above.

## 2.1 Setting the Base URLs
The URL for accessing the Search APIs changes slightly depending on whether you want to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection, which lives on its own server due to its size (26,000+ patients, ~13 TB). Here are the base URLs:

* All other Collections - https://services.cancerimagingarchive.net/nbia-api/services/v1/
* NLST - https://services.cancerimagingarchive.net/nlst-api/services/v1/

Let's set those as variables and also import a few modules we'll need later.


In [None]:
# set API base URLs

base_url = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"
nlst_url = "https://services.cancerimagingarchive.net/nlst-api/services/v1/"

# imports

import requests
import pandas as pd
import json
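
Every query in this notebook is the base URL plus an endpoint name plus query parameters. As a minimal sketch, the helper below assembles such a URL and URL-encodes the parameter values. The name `build_query_url` is an assumption made for illustration and is not part of the NBIA API or of this notebook's other cells.

```python
from urllib.parse import urlencode

BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"


def build_query_url(endpoint, base=BASE_URL, **params):
    """Return base + endpoint with URL-encoded query parameters.

    Parameters whose value is None or "" are dropped, so optional
    filters (for example Modality) can be passed through unchanged.
    """
    filtered = {k: v for k, v in params.items() if v not in (None, "")}
    query = urlencode(filtered)
    return base + endpoint + ("?" + query if query else "")


# Example: a getSeries query restricted to one collection and modality
print(build_query_url("getSeries", Collection="TCGA-LUAD", Modality="CT"))
```

Encoding the values matters mainly for parameters that contain spaces or other reserved characters; for simple collection names the result matches plain string concatenation.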

# 3 Query Examples
Let's start by getting a list of available Collections (datasets) and then we'll dig into specific Collections in more detail.

## 3.1 Get a List of Collections 

In [None]:
# get list of available collections as JSON

data_url = base_url + "getCollectionValues"
data = requests.get(data_url).json()
print(json.dumps(data, indent=2))
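
The JSON printed above is a list of objects that each carry a `Collection` key (the later cells in this notebook read `x['Collection']`). A short sketch of turning that structure into a sorted list of names follows. The `sample_payload` is a hypothetical stand-in for a real response, and `collection_names` is an illustrative helper, not an API function.

```python
# Hypothetical sample shaped like a getCollectionValues response;
# a real response contains many more entries.
sample_payload = [
    {"Collection": "TCGA-LUAD"},
    {"Collection": "CPTAC-CCRCC"},
    {"Collection": "LIDC-IDRI"},
]


def collection_names(payload):
    """Return the collection names from the payload, sorted alphabetically."""
    return sorted(item["Collection"] for item in payload)


names = collection_names(sample_payload)
print(names)
```

With a live response, pass the `data` variable from the cell above instead of `sample_payload`.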


## 3.2 List Modalities or Body Parts Examined in a Collection
Let's choose a Collection from the list above and find out more about what modalities and body parts it contains. We'll define these as functions so we can use them as part of more complex queries later in the notebook.

In [None]:
# Define a function to return modalities for a collection as JSON

# Choose a collection of interest
collection = "TCGA-LUAD"

# create the function
def getModality(collection):
    data_url = base_url + "getModalityValues?Collection=" + collection
    data = requests.get(data_url)
    if data.text != "":
        return data.json()
    else:
        print("Collection not found.")

# call the function
getModality(collection)

In [None]:
# Define a function to return body parts examined for a collection as JSON
collection = "TCGA-LUAD"

def getBodyPart(collection):
    data_url = base_url + "getBodyPartValues?Collection=" + collection
    data = requests.get(data_url)
    if data.text != "":
        return data.json()
    else:
        print("Collection not found.")

getBodyPart(collection)
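
The two functions above differ only in the endpoint name, so they could be folded into one. The sketch below is one possible way to do that, under some assumptions: `get_values` and `fetch_text` are hypothetical names, the default fetcher uses the standard library's `urllib` rather than `requests`, and an empty response body is treated as "no results" (the same check the cells above make).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"


def fetch_text(url):
    """Default fetcher: return the response body as text."""
    with urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def get_values(endpoint, collection, fetch=fetch_text):
    """Query `endpoint` for one collection.

    Returns the decoded JSON list, or an empty list when the server
    sends back an empty body.
    """
    url = BASE_URL + endpoint + "?" + urlencode({"Collection": collection})
    body = fetch(url)
    return json.loads(body) if body.strip() else []
```

Calling `get_values("getModalityValues", "TCGA-LUAD")` or `get_values("getBodyPartValues", "TCGA-LUAD")` would then cover both cases. Returning an empty list instead of printing a message lets callers loop over the result without first checking for `None`.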

## 3.3 Exploring Patient, Study, and Scan Metadata
You can use the /getPatient endpoint to obtain details about species, gender, and ethnicity where available for a given collection. You can also learn whether the subject is a [phantom](https://www.nist.gov/physics/what-are-imaging-phantoms) or not.

In [None]:
# Retrieve patient details as JSON and create pandas dataframe w/ optional file export

# Choose a collection of interest
collection="CPTAC-CCRCC"

data_url = base_url + "getPatient?Collection=" + collection
data = requests.get(data_url)

if data.text != "":
    df = pd.DataFrame(data.json())
    display(df)
    # optional - save to JSON or CSV file
    df.to_csv(collection+'_patient_metadata.csv')
    # df.to_json(collection+'_patient_metadata.json')
else:
    print("Collection not found.")


Here's an example that does the same thing with the NLST collection, which lives on its own server and uses a slightly different API URL.  Any of the other queries in this notebook should work with NLST if you set the collection variable to "NLST" and replace "base_url" with "nlst_url" when building "data_url", as shown here.

In [None]:
# Retrieve patient details as JSON and create pandas dataframe w/ optional file export

# Choose a collection of interest
collection="NLST"

# NOTE: we are using the "nlst_url" variable rather than the general "base_url"
data_url = nlst_url + "getPatient?Collection=" + collection
data = requests.get(data_url)

if data.text != "":
    df = pd.DataFrame(data.json())
    display(df)
    # optional - save to JSON or CSV file
    df.to_csv(collection+'_patient_metadata.csv')
    # df.to_json(collection+'_patient_metadata.json')
else:
    print("Collection not found.")
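
If you switch between NLST and other collections often, the choice of base URL can be made automatic. This is a small sketch of that idea; `api_base_for` is a hypothetical helper name, and it assumes the collection is spelled exactly "NLST".

```python
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"
NLST_URL = "https://services.cancerimagingarchive.net/nlst-api/services/v1/"


def api_base_for(collection):
    """Return the NLST server URL for the NLST collection, else the general one."""
    return NLST_URL if collection == "NLST" else BASE_URL


# Build getPatient URLs for two collections hosted on different servers
for name in ("NLST", "CPTAC-CCRCC"):
    print(api_base_for(name) + "getPatient?Collection=" + name)
```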

The /getPatientStudy endpoint can be used to obtain study/visit details such as the anonymized study date, subject's age at the time of visit, number of scans acquired at each timepoint, and more.

In [None]:
# getPatientStudy details for collection and create pandas dataframe w/ optional file export

collection="CPTAC-CCRCC"

data_url = base_url + "getPatientStudy?Collection=" + collection
data = requests.get(data_url)

if data.text != "":
    df = pd.DataFrame(data.json())
    display(df)
    # optional - save to JSON or CSV file
    df.to_csv(collection+'_study_metadata.csv')
    # df.to_json(collection+'_study_metadata.json')
else:
    print("Collection not found.")

We can also create reports that give useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer and software version, number of images). You must choose a collection, but modality is optional.

We'll define this as a function so we can use the JSON output in a more complex query later, but we'll display it in a dataframe here so it's easier to view and export to a file (if desired).  

In [None]:
# Define a function to return scan/series metadata for a collection as JSON 
# modality is optional

collection = "LIDC-IDRI"
modality = ""

def getSeries(collection, modality=""):
    # build the query URL, adding the modality filter only when one is given
    data_url = base_url + "getSeries?Collection=" + collection
    if modality != "":
        data_url += "&Modality=" + modality
    data = requests.get(data_url)
    if data.text != "":
        return data.json()
    # an empty response body means the query matched nothing
    if modality != "":
        print("No results: Please check to make sure the Collection " + collection + " exists and it contains " + modality + " modality.")
    else:
        print("Collection not found.")

# call the function and save the results
data = getSeries(collection, modality)

# load results to a dataframe
df = pd.DataFrame(data)
display(df)

# optional - save to JSON or CSV file
# df.to_csv(collection+'_scan_metadata.csv')
# df.to_json(collection+'_scan_metadata.json')
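
The series records can also be summarized without a dataframe. The sketch below totals `ImageCount` per `Modality` using only the standard library. The `sample_series` records are hypothetical and contain just a few of the fields returned by getSeries, and `images_per_modality` is an illustrative name.

```python
from collections import Counter

# Hypothetical records with a subset of the getSeries fields
sample_series = [
    {"PatientID": "P-001", "Modality": "CT", "ImageCount": 120},
    {"PatientID": "P-001", "Modality": "SEG", "ImageCount": 1},
    {"PatientID": "P-002", "Modality": "CT", "ImageCount": 95},
]


def images_per_modality(series):
    """Return a dict mapping each modality to its total image count."""
    totals = Counter()
    for record in series:
        modality = record.get("Modality", "Not Specified")
        totals[modality] += int(record.get("ImageCount", 0))
    return dict(totals)


print(images_per_modality(sample_series))
```

With live data, pass the list returned by `getSeries(collection)` instead of `sample_series`.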

## 3.4 Advanced Queries
Here are some additional examples that can be useful to address common questions about TCIA's datasets. For these we will rely on some of the functions we defined earlier in the notebook.

In [None]:
# Find out how many patients, which modalities and which body parts are in a collection

# set collection of interest
collection = "QIN-PROSTATE-Repeatability"

# get list of patients in Collection
data_url = base_url + "getPatient?Collection=" + collection
data = requests.get(data_url)

if data.text != "":
    # get modalities for collection
    modalities = getModality(collection)
    clean_modalities = set(item['Modality'] for item in modalities)
    # get body parts for collection
    bodyParts = getBodyPart(collection)
    clean_bodyParts = set()
    # replace null bodyParts with "Not Specified"
    for item in bodyParts:
        if len(item):
            clean_bodyParts.add(item['BodyPartExamined'])
        else:
            clean_bodyParts.add('Not Specified')
    # print cleaned up results
    print(collection, 'has', len(data.json()), 'patients,',
        clean_modalities, 'modalities, and',
        clean_bodyParts, 'anatomic entities')
else:
    print("Collection not found.")
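
The "replace missing body parts" step in the cell above can be packaged as a reusable helper. This is a sketch under a stated assumption: `clean_values` is a hypothetical name, and unlike the loop above it also substitutes the placeholder when the key is present but empty.

```python
def clean_values(items, key, placeholder="Not Specified"):
    """Collect the distinct values of `key` across `items`.

    Records where the key is missing or empty contribute `placeholder`.
    """
    return {item.get(key) or placeholder for item in items}


# Hypothetical response in which one record has no body part recorded
sample_body_parts = [
    {"BodyPartExamined": "PROSTATE"},
    {},
    {"BodyPartExamined": "PROSTATE"},
]

print(sorted(clean_values(sample_body_parts, "BodyPartExamined")))
```

The same helper works for modalities, for example `clean_values(modalities, "Modality")`.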

In [None]:
# Calculate summary statistics for a given collection 
collection = "CPTAC-LSCC"

# Call the getSeries function we created above
data = getSeries(collection)

# convert the output to dataframe
df = pd.DataFrame(data)

# Summarize patients
print('Summary Statistics\n')
print('Subjects: ', df['PatientID'].nunique(), 'subjects')
print('Studies: ', df['StudyInstanceUID'].nunique(), 'studies')
print('Series: ', df['SeriesInstanceUID'].nunique(), 'series')
print('Images: ', df['ImageCount'].sum(), 'images\n')

# Summarize modalities
print("Series Counts - Modalities:")
print(df['Modality'].value_counts(dropna=False),'\n')

# Summarize body parts
print("Series Counts - Body Parts Examined:")
print(df['BodyPartExamined'].value_counts(dropna=False),'\n')

# Summarize manufacturers
print("Series Counts - Device Manufacturers:")
print(df['Manufacturer'].value_counts(dropna=False))

In [None]:
# get patient counts for a given modality across all Collections
# this is particularly useful for finding Collections w/ segmentation labels (SEG/RTSTRUCT)

modality = "SEG"

data_url = base_url + "getCollectionValues"
data = requests.get(data_url)

if data.text != "":
    notFound=[]
    data = data.json()
    for x in data:
        collection = x['Collection']
        patient_url = base_url + "getPatientByCollectionAndModality?Collection=" + collection + "&Modality=" + modality
        patients = requests.get(patient_url)
        if patients.text != "":
            patients = patients.json()
            print(collection, 'has', len(patients), 'patients with', modality, 'modality') 
        else:
            notFound.append(collection)
    print('The following collections have no patients with', modality, 'modality:', notFound)
else:
    print("No collections returned.")
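
The cell above prints results as it goes. If you would rather keep the counts for sorting or export, the same loop can return them. The sketch below assumes a caller-supplied `fetch_patients(collection, modality)` function that returns a list of patient records (empty when there are none); both that function and `count_patients_by_collection` are hypothetical names.

```python
def count_patients_by_collection(collections, modality, fetch_patients):
    """Return (ranked_counts, not_found) for one modality.

    ranked_counts maps collection name to patient count, largest first.
    not_found lists collections with no patients for that modality.
    """
    counts, not_found = {}, []
    for name in collections:
        patients = fetch_patients(name, modality)
        if patients:
            counts[name] = len(patients)
        else:
            not_found.append(name)
    ranked = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
    return ranked, not_found
```

In this notebook, `fetch_patients` could wrap the getPatientByCollectionAndModality request shown above and return `[]` when the response body is empty.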

In [None]:
# create a dataframe that shows patient counts, modalities, body parts for all collections
# note: this can take >6 minutes to run

resultsList = []

# get list of available collections
collection_url = base_url + "getCollectionValues"
collection_data = requests.get(collection_url).json()

# loop through list of collections to populate dataframe
for x in collection_data:
    collectionName = x['Collection']
    patient_url = base_url + "getPatient?Collection=" + collectionName
    patients = requests.get(patient_url).json()
    clean_PatientIds = set(item['PatientId'] for item in patients)
    patientCount = len(clean_PatientIds)
    modality_url = base_url + "getModalityValues?Collection=" + collectionName
    modalities = requests.get(modality_url).json()
    clean_modalities = set(item['Modality'] for item in modalities)
    bodyPart_url = base_url + "getBodyPartValues?Collection=" + collectionName
    bodyParts = requests.get(bodyPart_url).json()
    clean_bodyParts = set()
    for item in bodyParts:
        if len(item):
            clean_bodyParts.add(item['BodyPartExamined'])
        else:
            clean_bodyParts.add('Not Specified')
    data = [collectionName, patientCount, clean_modalities, clean_bodyParts]
    resultsList.append(data)

# build the dataframe once, after the loop has collected every row
df = pd.DataFrame(columns=['Collection', 'Subjects', 'Modalities', 'BodyParts'], data=resultsList)

display(df)

# optional export to CSV
# df.to_csv('collection_metadata.csv')
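
Most of the run time in the cell above is spent waiting on one request after another. One possible way to shorten it is to process several collections concurrently with a small thread pool. This is only a sketch: `summarize_all` and `summarize_one` are hypothetical names, and the source does not say how many simultaneous requests the TCIA servers tolerate, so keep `max_workers` low.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_all(collection_names, summarize_one, max_workers=4):
    """Apply `summarize_one` to every collection name using a thread pool.

    Results are returned in the same order as `collection_names`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize_one, collection_names))
```

Here `summarize_one` would be a function that takes a collection name, issues the three requests from the loop above, and returns one row such as `[collectionName, patientCount, clean_modalities, clean_bodyParts]`. The returned list can then be passed to `pd.DataFrame` in the same way as `resultsList`.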

# Conclusion
Working with TCIA's REST APIs can be useful for building custom queries and cohorts of imaging data.  The next logical step would be to download and visualize the data.  You can learn more about how to do that in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks. 

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/) and Qinyan Pan. If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7