<a href="https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers. 

**This notebook is focused on basic use cases for leveraging the REST APIs to execute queries to learn about TCIA datasets.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of TCIA datasets are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  

# 2 REST API Overview 
TCIA uses software called NBIA to manage DICOM data. The NBIA REST APIs are provided for the search and download functions used in the TCIA radiology portal and allow access to both public and limited access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.
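
To make the first option more concrete, here is a minimal sketch that builds a Search API request URL without sending it. The base URL, the `getSeries` endpoint, and the `Collection` and `Modality` parameter names are assumptions based on the public NBIA Search API documentation linked above, so please verify them there. `build_query_url` is a hypothetical helper and is not part of tcia_utils.

```python
from urllib.parse import urlencode

# Assumed base URL for the public NBIA Search API (verify against the wiki)
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"

def build_query_url(endpoint, **params):
    """Hypothetical helper: return the URL for an endpoint, skipping None values."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"

# Build (but do not send) a query for CT series in one collection
url = build_query_url("getSeries", Collection="TCGA-LUAD", Modality="CT")
print(url)
```

The functions in tcia_utils presumably assemble similar URLs for you, so you normally won't need to build them by hand.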

# 3 Import tcia_utils

The following cells import [**tcia_utils**](https://github.com/kirbyju/TCIA_Notebooks/raw/main/tcia_utils.py), which contains a variety of useful functions for accessing TCIA via Jupyter/Python. We'll step through many of its functions in the following section.

Note that nearly all functions allow you to specify **api_url** as a query parameter.  This allows you to specify if you'd like a query to use the API that requires authentication or if you want to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection.  NLST lives on a separate server due to its size (>26,000 patients!).  We'll provide examples to show how this works later in the notebook.


In [None]:
# imports
import requests
import pandas as pd

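# download tcia_utils
tcia_utils_text = requests.get("https://github.com/kirbyju/TCIA_Notebooks/raw/main/tcia_utils.py")
# stop here if the download failed, so an error page is never saved as tcia_utils.py
tcia_utils_text.raise_for_status()
with open('tcia_utils.py', 'wb') as f:
    f.write(tcia_utils_text.content)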

In [None]:
import tcia_utils as tcia

# 4 Query Examples

## 4.1 getCollections() 
The **getCollections()** function returns a list of collections as JSON. 

In [None]:
tcia.getCollections()
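
The returned list is often easier to scan as plain names. The sketch below assumes each item is a dictionary with a `Collection` key; the `sample` list is made-up stand-in data so that the cell runs without a network call. Swap in the output of **getCollections()** to try it on real results.

```python
# Stand-in for the output of tcia.getCollections() (illustrative values;
# the "Collection" key is an assumption about the JSON structure)
sample = [{"Collection": "TCGA-LUAD"}, {"Collection": "CPTAC-LUAD"}, {"Collection": "NLST"}]

# Pull out the collection names and sort them alphabetically
names = sorted(item["Collection"] for item in sample)
print(names)

# Case-insensitive search for collections whose name mentions "luad"
matches = [name for name in names if "luad" in name.lower()]
print(matches)
```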


## 4.2 getBodyPart()
The **getBodyPart()** function returns a list of available body parts that were examined as JSON. Query parameters include **collection** and **modality**.

Let's look at the **TCGA-LUAD** collection from the list above and find out which body parts were examined.

In [None]:
tcia.getBodyPart(collection = "TCGA-LUAD")

## 4.3 getModality()
The **getModality()** function returns a list of available modalities as JSON. Query parameters include **collection** and **bodyPart**.

In [None]:
tcia.getModality(collection = "TCGA-LUAD")

## 4.4 getPatient()
The **getPatient()** function returns available patient information (e.g. species, gender, and ethnicity) as JSON. You can also learn whether the subject is a [phantom](https://www.nist.gov/physics/what-are-imaging-phantoms) or not.  The only query parameter for this function is **collection**.

Let's try looking at the **CPTAC-LUAD** collection this time.  We'll also convert the JSON output to a dataframe to make it easier to view.

In [None]:
data = tcia.getPatient(collection = "CPTAC-LUAD")

# convert JSON to dataframe
df = pd.DataFrame(data)
display(df)


Here's an example that does the same thing with the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  In this case we have to set **api_url = "nlst"** to talk to the NLST server, but everything else works the same. 

In [None]:
data = tcia.getPatient(collection = "NLST", api_url = "nlst")

# convert JSON to dataframe
df = pd.DataFrame(data)
display(df)
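
Once the patient data is in a dataframe, a couple of lines of pandas can summarize it. The following is a minimal sketch that drops phantoms and counts patients by sex. The field names (`PatientId`, `PatientSex`, `Phantom`) and the `"YES"`/`"NO"` values are assumptions about the JSON structure, and the `sample` records are invented for illustration.

```python
import pandas as pd

# Stand-in for tcia.getPatient() output; field names and values are assumptions
sample = [
    {"PatientId": "P-001", "PatientSex": "F", "Phantom": "NO"},
    {"PatientId": "P-002", "PatientSex": "M", "Phantom": "NO"},
    {"PatientId": "P-003", "PatientSex": "F", "Phantom": "NO"},
    {"PatientId": "PHANTOM-1", "PatientSex": None, "Phantom": "YES"},
]
patients_df = pd.DataFrame(sample)

# Keep real subjects only, then count patients per recorded sex
subjects = patients_df[patients_df["Phantom"] != "YES"]
sex_counts = subjects["PatientSex"].value_counts().to_dict()
print(sex_counts)
```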

## 4.5 getStudy()

The **getStudy()** function returns study/visit details such as the anonymized study date, subject's age at the time of visit, and number of scans acquired at each timepoint as JSON. Query parameters include **collection (required)**, **patientId**, and **studyUid**.

In [None]:
data = tcia.getStudy(collection = "CPTAC-LUAD")

# convert JSON to dataframe
df = pd.DataFrame(data)
display(df)
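
Because each row describes one study, grouping by patient shows how many timepoints and scans each subject has. This sketch assumes the fields are named `PatientID`, `StudyInstanceUID`, and `SeriesCount`, and it uses invented `sample` records in place of real **getStudy()** output.

```python
import pandas as pd

# Stand-in for tcia.getStudy() output; field names and values are assumptions
sample = [
    {"PatientID": "P-001", "StudyInstanceUID": "1.2.3.1", "SeriesCount": 3},
    {"PatientID": "P-001", "StudyInstanceUID": "1.2.3.2", "SeriesCount": 2},
    {"PatientID": "P-002", "StudyInstanceUID": "1.2.3.3", "SeriesCount": 4},
]
studies_df = pd.DataFrame(sample)

# Count distinct studies and total scans for each patient
per_patient = studies_df.groupby("PatientID").agg(
    studies=("StudyInstanceUID", "nunique"),
    series=("SeriesCount", "sum"),
)
print(per_patient)
```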

## 4.6 getSeries()

The **getSeries()** function returns metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer and software version, number of images) as JSON. Query parameters include **collection**, **patientId**, **studyUid**, **seriesUid**, **modality**, **bodyPart**, **manufacturer**, and **manufacturerModel**.  

In addition to converting the JSON to a dataframe, let's also add the option to export the dataframe to a CSV to share the results with a colleague.

In [None]:
data = tcia.getSeries(collection = "CPTAC-LUAD")

# convert JSON to dataframe
df = pd.DataFrame(data)
display(df)

# optional - save to CSV file
# index=False leaves out the unnamed row-number column
df.to_csv('scan_metadata.csv', index=False)

## 4.7 makeSeriesReport()

This function ingests the output from **getSeries()** and creates a summary report.  Let's try it using the results that we saved to the **data** variable in our last query.

Once you have a better understanding of the contents of your query you can use this info to refine your results before you download data. For example, you might decide that you are only interested in the CT scans from GE scanners rather than downloading the entire CPTAC-LUAD collection.

In [None]:
tcia.makeSeriesReport(data)
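
As a rough sketch of the refinement described above, the cell below narrows a series list down to CT scans from GE scanners using pandas. The field names (`SeriesInstanceUID`, `Modality`, `Manufacturer`, `ImageCount`) are assumptions about the **getSeries()** output, and the `sample` records are invented so that the cell runs on its own.

```python
import pandas as pd

# Stand-in for tcia.getSeries() output; field names and values are assumptions
sample = [
    {"SeriesInstanceUID": "1.2.3.10", "Modality": "CT",
     "Manufacturer": "GE MEDICAL SYSTEMS", "ImageCount": 120},
    {"SeriesInstanceUID": "1.2.3.11", "Modality": "CT",
     "Manufacturer": "SIEMENS", "ImageCount": 200},
    {"SeriesInstanceUID": "1.2.3.12", "Modality": "MR",
     "Manufacturer": "GE MEDICAL SYSTEMS", "ImageCount": 60},
]
series_df = pd.DataFrame(sample)

# Boolean masks: CT modality, and manufacturer names starting with "GE"
is_ct = series_df["Modality"] == "CT"
is_ge = series_df["Manufacturer"].str.upper().str.startswith("GE", na=False)
ge_ct = series_df[is_ct & is_ge]

# Series UIDs of the matching scans, plus the number of images they contain
series_uids = ge_ct["SeriesInstanceUID"].tolist()
total_images = int(ge_ct["ImageCount"].sum())
print(series_uids, total_images)
```

Manufacturer strings are free text recorded by the scanner, so check the actual values in your report before settling on a matching rule.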

# 5 Querying "Limited Access" Collections (optional)
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page. 

The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account and have access to restricted collections you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions. **<font color='red'>Tokens are valid for 2 hours and must be refreshed after that point.</font>**

In [None]:
tcia.getToken()
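
Since tokens expire, it can help to note when you created one. The following sketch uses a hypothetical helper, `token_needs_refresh`, which is not part of tcia_utils. It only does the date arithmetic, based on the 2-hour lifetime stated above plus an arbitrary 5-minute safety margin.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(hours=2)  # lifetime stated in this notebook

def token_needs_refresh(created_at, now=None, margin=timedelta(minutes=5)):
    """Hypothetical helper: True if the token has expired or is about to expire."""
    now = now or datetime.now()
    return now >= created_at + TOKEN_LIFETIME - margin

# Example: a token created at 09:00 on an arbitrary date
created = datetime(2024, 1, 1, 9, 0)
one_hour_in = token_needs_refresh(created, now=datetime(2024, 1, 1, 10, 0))
near_expiry = token_needs_refresh(created, now=datetime(2024, 1, 1, 10, 56))
print(one_hour_in, near_expiry)
```

In practice you could record `datetime.now()` right before calling **getToken()** and call it again whenever the helper returns `True`.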

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval, you can use **tcia.getSeries()** to get a full list of series UIDs in this restricted collection by including **api_url = "restricted"** as a parameter.

In [None]:
# getSeries with query parameters
data = tcia.getSeries(collection = "QIN-Breast-02", 
                      api_url = "restricted")

print(len(data), 'Series returned')

**Note:** If you'd like to do further exploration of restricted datasets, you can modify any of the previously discussed queries in the notebook by adding the **api_url = "restricted"** parameter as shown above.

# 6 Downloading and Visualizing Data
Once you've mastered querying for data, the next logical step is to download and visualize it.  You can learn more about how to do that in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks. 

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/) and Qinyan Pan. If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7