<a href="https://colab.research.google.com/github/kirbyju/AI-Deep-Learning-Lab-2022/blob/tcia/TCIA_RSNA_Deep_Learning_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Course Description

Access to large, high-quality data is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services, taking major burdens of data sharing off researchers.

TCIA has published over 175 unique data collections containing more than 60 million images. Recognizing that images alone are not enough to conduct meaningful research, most collections are linked to rich supporting data including patient outcomes, treatment information, genomic / proteomic analyses, and expert image analyses (segmentations, annotations, and radiomic / radiogenomic features). **In this course we will address basic use cases for identifying TCIA datasets of interest and downloading them via Jupyter Notebooks.** 

# Learning Objectives

* Learn how TCIA makes data sharing easier for researchers, and hear a summary of existing datasets that are freely available for download
* Practice utilizing TCIA for data exploration, cohort definition, and downloading of data
* Learn how to access public and restricted access datasets using TCIA's REST APIs and other command line tools via Google Colab

# TCIA Overview
We'll begin with a very brief slide presentation to provide an overview of The Cancer Imaging Archive.

# TCIA Notebooks
In this course we will summarize the available notebooks at https://github.com/kirbyju/TCIA_Notebooks and walk through these three in detail:

1.  [TCIA_Linux_Data_Retriever_App.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb) - A tutorial on how to install the [command-line NBIA Data Retriever utility on Linux](https://wiki.cancerimagingarchive.net/x/2QKPBQ) and use it to download TCIA datasets
2.  [TCIA_REST_API_Queries.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb) - A Python tutorial on how to use TCIA's REST API to **query** datasets
3.  [TCIA_REST_API_Downloads.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb) - A Python tutorial on how to use TCIA's REST API to **download** datasets

# Quiz Questions

### 1. Let's say you're a researcher interested in kidney cancer who wants to train a model that can predict proteomic subtypes from radiology images. Can you identify a kidney dataset that also has proteomics on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page? Once you've done so, click on the collection and find the link to download the manifest.  From there, can you demonstrate how to download the full collection using the manifest file with NBIA Data Retriever?

**Note:** Since this is just for learning purposes, use the code we leveraged previously to shorten the manifest file to only download a few scans so it doesn't take as long to run.

### 2. Let's say you are a researcher who wants to train a model to do tumor detection on mammograms. Can you run an API query to find all mammogram data (modality = MG) and then determine the distribution of manufacturers? Once you've done that, can you download the mammograms for whichever manufacturer has the most data available?

(**Note:** For the download, use the function that only downloads the first 3 scans so it completes quickly.)


### 3. Let's say that you'd like to create a Shared Cart that includes a specific set of scans to share with others. Use https://nbia.cancerimagingarchive.net to create a small "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" with scans of your choosing and then demonstrate how to download it via the API.

**Note:** Keep your cart small (e.g. 3-5 scans) so the download completes quickly, or if you created a large cart, use the function to only download the first 3 scans.

# Quiz Answers

### 1. Let's say you're a researcher interested in kidney cancer who wants to train a model that can predict proteomic subtypes from radiology images. Can you identify a kidney dataset that also has proteomics on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page? Once you've done so, click on the collection and find the link to download the manifest.  From there, can you demonstrate how to download the full collection using the manifest file with NBIA Data Retriever?

**Note:** Since this is just for learning purposes, use the code we leveraged previously to shorten the manifest file to only download a few scans so it doesn't take as long to run.

In [None]:
# install NBIA Data Retriever CLI software 

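# create the directory the installer expects (-p avoids an error if it already exists)
!mkdir -p /usr/share/desktop-directories/
# -P saves the .deb into the given directory so the dpkg path below resolves
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.deb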

In [None]:
# download the CPTAC-CCRCC manifest file

import requests

# download manifest file from CPTAC-CCRCC page
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/33948213/TCIA-CPTAC-CCRCC-batch8-Sep2019.tcia?version=2&modificationDate=1569606470298&api=v2")
# stop here if the server returned an error instead of the manifest
manifest.raise_for_status()
with open('CPTAC-CCRCC.tcia', 'wb') as f:
    f.write(manifest.content)

In [None]:
# Run this cell to edit the manifest file to download only the first three scans.

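# open the sample file in write mode so re-running this cell does not append duplicate lines
with open('CPTAC-CCRCC.tcia', 'r') as firstfile, open('CPTAC-CCRCC-Sample.tcia', 'w') as secondfile:
    count = 0
    for line in firstfile:
        # copy the current line to the sample manifest
        secondfile.write(line)
        # stop after the header lines plus the first 3 series UIDs (9 lines in total)
        count += 1
        if count == 9:
            break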
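
The fixed line count above depends on the header length of this particular manifest. As a hedged alternative, the sketch below cuts the manifest by looking for the line that introduces the series list. It assumes the header ends with a `ListOfSeriesToDownload=` line; the helper name `truncate_manifest` and the sample UIDs are hypothetical and used only for illustration.

```python
from typing import Iterable, List

# Assumption: in a TCIA manifest, this line marks the end of the header.
SERIES_MARKER = "ListOfSeriesToDownload="


def truncate_manifest(lines: Iterable[str], max_series: int = 3) -> List[str]:
    """Return the header lines plus at most `max_series` series UID lines.

    If the marker line is never found, every line is treated as header
    and returned unchanged.
    """
    kept = []
    in_series = False
    series_kept = 0
    for line in lines:
        if not in_series:
            kept.append(line)
            if line.strip() == SERIES_MARKER:
                in_series = True
            continue
        if not line.strip():
            continue  # ignore blank lines inside the series list
        if series_kept >= max_series:
            break
        kept.append(line)
        series_kept += 1
    return kept


# Hypothetical manifest content, for illustration only.
sample_manifest = [
    "downloadServerUrl=https://example.org/download\n",
    "includeAnnotation=true\n",
    "manifestVersion=3.0\n",
    "ListOfSeriesToDownload=\n",
    "1.2.3.1\n",
    "1.2.3.2\n",
    "1.2.3.3\n",
    "1.2.3.4\n",
    "1.2.3.5\n",
]

short_manifest = truncate_manifest(sample_manifest, max_series=3)
print("".join(short_manifest))
```

To apply this to the downloaded file, the lines of `CPTAC-CCRCC.tcia` could be read with `readlines()` and the returned list written to `CPTAC-CCRCC-Sample.tcia` with `writelines()`.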

In [None]:
# download the sample CPTAC-CCRCC data using NBIA Data Retriever

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/CPTAC-CCRCC-Sample.tcia' -d /content/

### 2. Let's say you are a researcher who wants to train a model to do tumor detection on mammograms. Can you run an API query to find all mammogram data (modality = MG) and then determine the distribution of manufacturers? Once you've done that, can you download the mammograms for whichever manufacturer has the most data available?

(**Note:** For the download, use the function that only downloads the first 3 scans so it completes quickly.)


In [None]:
# imports
import requests
import pandas as pd

# download tcia_utils
tcia_utils_text = requests.get("https://github.com/kirbyju/TCIA_Notebooks/raw/main/tcia_utils.py")
with open('tcia_utils.py', 'wb') as f:
    f.write(tcia_utils_text.content)

In [None]:
# import tcia_utils 
import tcia_utils as tcia

In [None]:
# use getSeries to find all mammograms
data = tcia.getSeries(modality = "MG")

In [None]:
tcia.makeSeriesReport(data)
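
If you would rather compute the manufacturer distribution yourself, a minimal sketch follows. It assumes the series metadata is a list of dictionaries and that each record may carry a `Manufacturer` key; the helper name `manufacturer_counts` and the sample records are hypothetical and do not come from the API.

```python
from collections import Counter


def manufacturer_counts(series_records):
    """Count series per manufacturer; missing or empty values count as 'Unknown'."""
    # Assumption: each record is a dict that may contain a "Manufacturer" key.
    return Counter(rec.get("Manufacturer") or "Unknown" for rec in series_records)


# Hypothetical records, for illustration only.
sample_records = [
    {"SeriesInstanceUID": "1.2.3.1", "Manufacturer": "Vendor A"},
    {"SeriesInstanceUID": "1.2.3.2", "Manufacturer": "Vendor B"},
    {"SeriesInstanceUID": "1.2.3.3", "Manufacturer": "Vendor A"},
    {"SeriesInstanceUID": "1.2.3.4"},
]

counts = manufacturer_counts(sample_records)
top_manufacturer, top_count = counts.most_common(1)[0]
print(counts)
print("Most data:", top_manufacturer, top_count)
```

The manufacturer with the highest count is the one to pass to the next query.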

In [None]:
# use getSeries to find all mammograms from "HOLOGIC, Inc."
data = tcia.getSeries(modality = "MG", manufacturer = "HOLOGIC, Inc.")
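
For readers curious about what such a query might look like as a raw REST call, here is a rough sketch that only builds the request URL and does not contact the server. The base URL and the `Modality` and `Manufacturer` parameter names are assumptions based on the public NBIA API and should be checked against the current TCIA API guide; `build_series_query` is a hypothetical helper, not part of `tcia_utils`.

```python
from urllib.parse import urlencode

# Assumption: public NBIA REST endpoint; confirm against the current TCIA API guide.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"


def build_series_query(**filters):
    """Build a getSeries URL from keyword filters, skipping any set to None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)  # percent-encodes commas and encodes spaces as '+'
    return BASE_URL + "getSeries" + ("?" + query if query else "")


url = build_series_query(Modality="MG", Manufacturer="HOLOGIC, Inc.")
print(url)
```

Under these assumptions, the resulting URL could be passed to `requests.get(url)` and the JSON response inspected directly.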

In [None]:
tcia.downloadSampleSeries(data)

### 3. Let's say that you'd like to create a Shared Cart that includes a specific set of scans to share with others. Use https://nbia.cancerimagingarchive.net to create a small "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" with scans of your choosing and then demonstrate how to download it via the API.

**Note:** Keep your cart small (e.g. 3-5 scans) so the download completes quickly, or if you created a large cart, use the function to only download the first 3 scans.

In [None]:
# getSharedCart metadata (replace the name with your actual cart name)
data = tcia.getSharedCart(name = "nbia-49121659384603347")
print(len(data), 'Series returned')

In [None]:
# pass the cart's series metadata (data) to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data)
display(df)

# Additional Resources
The following pages on TCIA may be of special interest to deep learning researchers:

1. [Finding Annotated Data for AI/ML on TCIA](https://wiki.cancerimagingarchive.net/x/TAGJAw) provides basic guidance for finding datasets that could be useful for deep learning tasks.
2. [Challenge Competitions using TCIA data](https://wiki.cancerimagingarchive.net/x/nYIaAQ) can be useful for benchmarking your model's performance.
3. [ACR Data Science Institute's Define AI Directory](https://www.acrdsi.org/DSI-Services/Define-AI) links clinically relevant AI use-cases to TCIA datasets that can be used to address them.
4. [Additional TCIA Notebooks](https://github.com/kirbyju/TCIA_Notebooks) about accessing and visualizing data are available.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/). It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7