<a href="https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Series_UID_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary

This notebook can be used to summarize TCIA data given a set of Series Instance UIDs (e.g. from a TCIA manifest file). The output includes:

1.   A detailed report (CSV) containing the Collection Name, Subject ID, Study UID, Study Description, Study Date, Series UID, Series Description, Number of Images, File Size (Bytes), Modality, and Manufacturer for each scan
2.   A report summarizing how many Patients/Studies/Series/Images are represented along with a breakdown of collections, modalities, body parts and manufacturers that are included

You can import Series UIDs into the notebook in any of the following ways:

1.   Upload a TCIA manifest file.
2.   Use wget with the URL of a manifest file on TCIA.
3.   Upload a text file with a list of Series UIDs (one per row).

# Create a Credential File and Token
To ensure you can obtain info about all series UIDs in your list, you must provide your TCIA login/password and create a token using the following steps.






In [None]:
# Create the credential file
# NOTE: You must enter your real user name and password before you run this,
# or edit the resulting text file with your real credentials after it's created.

lines = ['userName=YourUserName', 'passWord=YourPassword']
with open('credentials.txt', 'w') as f:
    f.write('\n'.join(lines))

After the file is created, you can find it by clicking the folder icon in the left sidebar. Double click to edit the file, enter your login credentials, and then close it to save the file before proceeding to the next step.
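
Because the credentials are read back from this file, it can be worth checking that the placeholders were really replaced. The following is a minimal sketch, assuming the two-line `key=value` layout written by the cell above; `parse_credentials` is a hypothetical helper for illustration and not part of any TCIA library.

```python
def parse_credentials(text):
    """Parse 'key=value' lines into a dict and reject unedited placeholders.

    Hypothetical helper: assumes the userName=/passWord= layout shown above.
    """
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # split on the first '=' only so passwords may contain '='
        key, sep, value = line.partition('=')
        if not sep:
            raise ValueError(f"Expected key=value, got: {line!r}")
        creds[key] = value

    # the placeholder strings are the ones written by the credential-file cell
    for key, placeholder in (('userName', 'YourUserName'),
                             ('passWord', 'YourPassword')):
        if key not in creds:
            raise ValueError(f"Missing {key} in credential file")
        if creds[key] in ('', placeholder):
            raise ValueError(f"{key} still has its placeholder value")
    return creds
```

For example, `parse_credentials(open('credentials.txt').read())` would raise `ValueError` until both values have been edited.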

In [None]:
# extract the user name and password from the credential file to variables for use in subsequent API calls and downloads

credentialFilePath = 'credentials.txt'
with open(credentialFilePath, 'rt') as myfile:
    mylines = myfile.read().splitlines()

# split on the first '=' only, so values that contain '=' are kept intact
userName = mylines[0].split('=', 1)[1]
passWord = mylines[1].split('=', 1)[1]

Now we'll use the credential file you created to generate an access token to query restricted collections on TCIA.  

_**Note:**_ Tokens are valid for two hours and must be refreshed after that point. See https://wiki.cancerimagingarchive.net/x/X4ATBg for more details. 
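
Since a long session can outlive the token, one option is to record when the token was issued and check its age before starting a lengthy batch of requests. This is only a sketch: `token_needs_refresh` and the five-minute safety margin are assumptions made for illustration, not values or functions defined by TCIA.

```python
import time

TOKEN_LIFETIME_S = 2 * 60 * 60  # two hours, per the note above


def token_needs_refresh(issued_at, now=None,
                        lifetime_s=TOKEN_LIFETIME_S, margin_s=300):
    """Return True once the token is within margin_s seconds of expiring.

    issued_at and now are Unix timestamps in seconds. margin_s is an
    assumed safety buffer, not a value defined by TCIA.
    """
    if now is None:
        now = time.time()
    return (now - issued_at) >= (lifetime_s - margin_s)
```

In practice you would store `time.time()` in a variable right after the token request succeeds, then re-run the token cell whenever this check returns `True`.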

In [None]:
# imports

import requests
import pandas as pd
import io

# set API URL
base_url = "https://services.cancerimagingarchive.net/nbia-api/services/v2/"
adv_url = "https://services.cancerimagingarchive.net/nbia-api/services/"

In [None]:
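# request token (passing values via params= URL-encodes special characters in the credentials)

token_url = "https://services.cancerimagingarchive.net/nbia-api/oauth/token"
token_params = {'username': userName, 'password': passWord, 'grant_type': 'password',
                'client_id': 'nbiaRestAPIClient', 'client_secret': 'ItsBetweenUAndMe'}
token_response = requests.get(token_url, params=token_params)
token_response.raise_for_status()  # stop here if the login was rejected
access_token = token_response.json()["access_token"]
print('Token created successfully.')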

# set API call headers to use the access token we created
api_call_headers = {'Authorization': 'Bearer ' + access_token}

# Import Series UID file 

_**The file must contain only Series Instance UIDs with one UID per line (no commas).**_

To import a file directly to Colab, use the menu on the left sidebar to upload it.  Once it's uploaded, right click and rename it to "series-uids.txt".

To import a file already posted on TCIA, use the wget command in the next cell by updating it with the URL of the manifest you want to analyze.

In [None]:
# optional: use wget to download the manifest
# replace the URL: wget -O /directory_path/series-uids.txt https://URL_on_TCIA/manifest.tcia

!wget -O /content/series-uids.txt "https://wiki.cancerimagingarchive.net/download/attachments/52757630/CrowdsCureCancer2018-DICOM.TCIA?api=v2"

If using a TCIA manifest file, run the step below to remove the header (six lines of text that precede the UID list). You can skip this if you have created a custom UID file in some other application like Excel.  

In [None]:
with open('series-uids.txt') as f:
    first_line = f.readline()

if "downloadServerUrl" in first_line:
    !sed -i -e 1,6d /content/series-uids.txt
    print("Header text removed.")
else:
    print("This is not a TCIA manifest file, or you've already removed the header lines.")
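
The `sed` call above relies on the Colab shell. Outside Colab, the same header removal could be sketched in plain Python. This assumes, as described above, that a manifest begins with a line containing `downloadServerUrl` and that the header is six lines long; `strip_manifest_header` is a hypothetical helper name.

```python
def strip_manifest_header(lines, marker="downloadServerUrl", header_len=6):
    """Return the lines without the manifest header, if one is present.

    Assumes the header is exactly header_len lines long and that its
    first line contains marker. Lines are returned unchanged otherwise.
    """
    if lines and marker in lines[0]:
        return list(lines[header_len:])
    return list(lines)


# Usage sketch (the file name matches the one used in this notebook):
# with open('series-uids.txt') as f:
#     cleaned = strip_manifest_header(f.read().splitlines())
# with open('series-uids.txt', 'w') as f:
#     f.write('\n'.join(cleaned) + '\n')
```

Unlike the in-place `sed` edit, this version is safe to run twice, because a file whose first line lacks the marker is returned as-is.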

# Read the File

Read the series-uids.txt file for analysis in subsequent steps.

In [None]:
# initialize variable
uids = []

# save series to "uids" variable (used for option #2), skipping blank lines
with open("series-uids.txt") as file:
    for line in file:
        uid = line.strip()
        if uid:
            uids.append(uid)

# create "csvUids" variable to add commas to submit to the API for option #1
csvUids = ",".join(uids)
print(csvUids)
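
Because the file must contain only Series Instance UIDs, a quick format check can catch leftover header lines or comma-separated rows before anything is sent to the API. The sketch below uses a deliberately loose rule (DICOM UIDs are made of digits and periods and are at most 64 characters long); `split_valid_uids` is a hypothetical helper, and it checks format only.

```python
import re

# Loose format check: dot-separated groups of digits.
UID_PATTERN = re.compile(r"\d+(\.\d+)*")


def split_valid_uids(candidates):
    """Separate plausible Series Instance UIDs from other lines.

    Returns a (valid, rejected) pair of lists. This checks format only;
    it cannot tell whether a UID actually exists on TCIA.
    """
    valid, rejected = [], []
    for item in candidates:
        if len(item) <= 64 and UID_PATTERN.fullmatch(item):
            valid.append(item)
        else:
            rejected.append(item)
    return valid, rejected
```

Calling `split_valid_uids(uids)` and inspecting the second list is a cheap way to spot a manifest whose header was not removed.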

# Create a Report of Series Metadata (Option 1)

Create "scan_metadata.csv" containing the Collection Name, Subject ID, Study UID, Study Description, Study Date, Series UID, Series Description, Number of images, File Size (Bytes), Modality and Manufacturer for each scan.

**Note: This report is the fastest to complete if you only need the aforementioned columns in the report.**

In [None]:
try:
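    # get series metadata from API
    param = {'list': csvUids}
    data_url = adv_url + "getSeriesMetadata2"
    data = requests.post(data_url, headers = api_call_headers, data = param)
    # raise HTTPError on 4xx/5xx responses so the handlers below can report them
    data.raise_for_status()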

    # save output
    df = pd.read_csv(io.StringIO(data.text), sep=',')
    df.to_csv('scan_metadata.csv')
    print("Metadata report saved successfully")

    # optional: display sample of csv
    display(df)

except requests.exceptions.HTTPError as errh:
    print(errh)
except requests.exceptions.ConnectionError as errc:
    print(errc)
except requests.exceptions.Timeout as errt:
    print(errt)
except requests.exceptions.RequestException as err:
    print(err)

# Create a Report of Series Metadata with DOIs and Licenses (Option 2)

This report will take more time to complete, especially with large manifests, but includes additional columns that are not available in Option 1, including Data Description URI, SOP Class UID, License Name and License URL.

In [None]:
try:
    # create dataframe/csv by calling getSeriesMetaData for each UID
    rows = []
    for seriesUID in uids:
        metadata_url = base_url + "getSeriesMetaData"
        response = requests.get(metadata_url, headers = api_call_headers,
                                params = {'SeriesInstanceUID': seriesUID})
        # raise HTTPError on 4xx/5xx responses so the handlers below can report them
        response.raise_for_status()
        rows.append(pd.DataFrame.from_dict(response.json()))

    # concatenate once at the end instead of growing the dataframe inside the loop
    df = pd.concat(rows, ignore_index = True) if rows else pd.DataFrame()
    df.to_csv('scan_metadata_with_DOIs_Licenses.csv')
    display(df)

except requests.exceptions.HTTPError as errh:
    print(errh)
except requests.exceptions.ConnectionError as errc:
    print(errc)
except requests.exceptions.Timeout as errt:
    print(errt)
except requests.exceptions.RequestException as err:
    print(err)

# Create a Summary of Collections, Patients, Modalities, Body Parts, and Manufacturers

In [None]:
# Summarize Collections
print("Number of Series per Collection:")
print(df['Collection Name'].value_counts(dropna=False),'\n')

# Summarize patients
print('Patient/Study/Series Counts\n')
print('Subjects: ', len(df['Subject ID'].value_counts()), 'subjects')
print('Studies: ', len(df['Study UID'].value_counts()), 'studies')
print('Series: ', len(df['Series ID'].value_counts()), 'series')
print('Images: ', df['Number of images'].sum(), 'images\n')

# Summarize modalities
print("Series Counts - Modalities:")
print(df['Modality'].value_counts(dropna=False),'\n')

# Summarize manufacturers
print("Series Counts - Device Manufacturers:")
print(df['Manufacturer'].value_counts(dropna=False))

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7