# GDC April 2021 Webinar: Using the GDC API

### Monday, April 26, 2021<br>2:00 PM - 3:00 PM (EST)<br>Bill Wysocki, Lead for GDC User Services <br>University of Chicago

## <a id='toc'>Table of Contents</a>

- [API User's Guide and Other Helpful Links](#links)
- [Notebook Overview](#overview)
    - [About this notebook](#about_notebook)
    - [Using the Python requests package and interpreting request response messages](#requests_package)
- [GDC API Overview](#api_overview)
    - [GDC API Format](#api_format)
    
- [Using the GDC API to Query Data in GDC](#query_data)
    - [Search and Retrieval Endpoints Examples](#search_retrieve)
    - [Data Analysis Endpoints Examples](#analysis)
- [Using the GDC API to Submit Data to GDC](#submit)

## <a id='links'>API User's Guide and Other Helpful Links</a>

[GDC API User's Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)

[GDC Support Website](https://gdc.cancer.gov/support)

support@nci-gdc.datacommons.io - GDC Helpdesk E-mail

[Requests Python Package User's Guide](https://2.python-requests.org/en/master/)

[Python Documentation Website](https://www.python.org)

[Jupyter Notebook Documentation](https://jupyter.org/documentation)

# <a id='overview'>Notebook Overview</a>


### <a id='about_notebook'>About this notebook</a>

- This notebook is a resource for GDC users to become familiar with GDC API endpoints; users can edit and create custom queries or submission tasks "in-place" with the provided template functions
- The provided functional templates can facilitate downstream data analyses and visualizations within the Jupyter Notebook interface and other Python packages
- Commands and functions in this notebook will rely on the following Python packages:
    - `requests` - if not already installed on your system, can install with command `pip install requests` from command line or using a new code cell in this notebook
    - `json` - part of Python standard library, should already be installed on system
    - `urllib` - part of Python standard library, should already be installed on system
- To execute code in a code cell, press either 'Cmd + Enter' or 'Control + Enter' depending on operating system and keyboard layout
- If using notebook to aid in submission requests, will need to download token file from the [GDC Submission Portal](https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Data_Submission_Process/#authentication)

In [None]:
#import packages to use in this notebook

import requests
import json
import urllib.parse  #urllib.parse.quote is used below for percent-encoding

### <a id='requests_package'>Using the Python `requests` package and interpreting request response messages</a>

- The `requests` package allows users to communicate with the GDC API using the standard `POST`, `PUT`, `GET` and `DELETE` HTTP methods
- Need to specify request method as part of function (e.g. `requests.get()` for `GET` method, `requests.post()` for `POST` method etc.)
- When making a request with `requests` package, can save results of request as variable, i.e.:
    - `response = requests.get(url)`
- Example `GET` request:

In [None]:
response = requests.get('https://api.gdc.cancer.gov/cases?filters=%7B%22op%22%3A%20%22%3D%22%2C%20%22content%22%3A%20%7B%22field%22%3A%20%22cases.project.program.name%22%2C%20%22value%22%3A%20%22TCGA%22%7D%7D&fields=submitter_slide_ids&size=1&format=json&pretty=true')

- Displaying the `response` variable on its own will only show the HTTP status code of the request, such as `<Response [200]>` or `<Response [400]>`; need to use the `response.text` attribute (or `response.json()` for JSON responses) to get the returned message or data

In [None]:
response

In [None]:
json.loads(response.text)

- Typically, successful responses begin with `'2'`, like `200` or `201` and unsuccessful requests begin with `'4'`, like `400` (bad request) or `403` ('forbidden' error, result of bad or insufficient credentials)
- A list and accompanying explanations of HTTP status codes can be [found here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
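
As a minimal sketch of the rule of thumb above, the standard-library `http.HTTPStatus` enum can turn a status code into its reason phrase. The helper name `describe_status` is hypothetical and is not part of the GDC API or the `requests` package; for a live request, `response.status_code` holds the same integer.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, reason_phrase) for an HTTP status code.

    Hypothetical helper for illustration: codes in the 2xx range count as success.
    """
    status = HTTPStatus(code)
    return 200 <= code < 300, status.phrase


for code in (200, 201, 400, 403):
    ok, phrase = describe_status(code)
    print(code, phrase, "success" if ok else "failure")
```

With `requests`, `response.ok` and `response.raise_for_status()` provide similar checks on a real response.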

# <a id='api_overview'>GDC API Overview</a>


- The GDC Application Programming Interface (API) is the external facing REpresentational State Transfer (REST) interface for the GDC
- The GDC API supports user interactions with the GDC Submission and Data Portals, and provides developers with a programmatic interface to query and download GDC data, metadata and annotations and to submit data to the GDC.
- The [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) client also relies on the GDC API for user authentication, reading manifests, and for download and upload features


### <a id='api_format'>GDC API Format</a>

- The HTTP URL that corresponds to the GDC API is: https://api.gdc.cancer.gov/
- GDC API format for search and retrieval use is: <b>API_URL + ENDPOINT + QUERY_PARAMETERS</b>
- To use the GDC API, calls are made to the specific API 'endpoint' for a given query, e.g. for retrieving data about cases in the GDC, calls are made to the `cases` endpoint, https://api.gdc.cancer.gov/cases/
- For search and retrieval API calls, query parameters can be included, such as <b>filters</b> on endpoint fields, and the <b>fields</b> parameter to specify fields to return from query
    - List of all indexed data fields that can be specified as filters or fields for search and retrieval endpoints can be found at https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/
    - Can also view available fields for both Search and Retrieval and Data Analysis endpoints by [using the `_mapping` endpoint for a given endpoint](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#_mapping-endpoint) or at the corresponding pages at the [GDC API Documentation site](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)
    - Formatting parameters can be specified such as <b>format</b> (TSV or JSON format) and <b>size</b> (number of hits to return)
- For submitting data, the GDC API Submission endpoint is built from the program name and project code: https://api.gdc.cancer.gov/submission/<b>program_name/project_code</b>, e.g. https://api.gdc.cancer.gov/submission/TCGA/LUAD or https://api.gdc.cancer.gov/submission/CPTAC/3
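
The API_URL + ENDPOINT + QUERY_PARAMETERS pattern can be sketched with the standard library alone. This is only an illustration: the variable names are assumptions, and the filter and field are the ones used in the earlier `GET` example.

```python
import json
from urllib.parse import quote, urlencode

API_URL = "https://api.gdc.cancer.gov/"
ENDPOINT = "cases"

# The filter is written as a Python dict and serialized to a JSON string.
filters = {
    "op": "=",
    "content": {"field": "cases.project.program.name", "value": "TCGA"},
}

query_parameters = {
    "filters": json.dumps(filters),
    "fields": "submitter_slide_ids",
    "size": "1",
    "format": "json",
}

# urlencode percent-encodes every value and joins the pairs with '&'.
url = API_URL + ENDPOINT + "?" + urlencode(query_parameters, quote_via=quote)
print(url)
```

Nothing is sent over the network here; the printed URL can be passed to `requests.get()` or pasted into a browser.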


# <a id='query_data'>Using the GDC API to Query Data in GDC</a>

### Overview

- Submitters can make use of several GDC API endpoints to retrieve various data indexed in the GDC API, including biospecimen, clinical and annotation metadata
- The HTTP `GET` method will be used to retrieve data
- Additional parameters can be specified to tailor the returned data, such as number of returned entries and filters on data at endpoint
- Data can be retrieved in `JSON` or `TSV` format by specifying in the request the format desired (see below)
- Additional features and more information regarding using the GDC API can be found at this link: https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/



### Endpoints

There are two 'types' of endpoints that can be used to query data in the GDC:


[GDC Search and Retrieval Endpoints](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#endpoints) - includes endpoints that index project, file and case information, including clinical and biospecimen metadata, as well as file version and history

[GDC Analysis Endpoints](https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/) - endpoints that are used by the GDC data analysis, visualization and exploration (DAVE) tools in the Exploration tab of the GDC Data Portal to access indexed data including gene, mutation, copy number variation and survival data. 


### Steps

1. Specify and percent-encode `filters`
2. Specify `fields` to be returned
3. Specify additional parameters (`size`, `format` of results etc.)
4. Concatenate parameters to build query url
5. Submit query and save response text to file

Note: all parameters are optional; not specifying `filters` will return all instances at a given endpoint, and not specifying `fields` will return all fields at the endpoint, while other parameters will be set to their default values (i.e. `size` = 10, `format` = JSON)

### Template queryBuilder() function

- `GET` requests can be built as a URL with the endpoint and other parameters specified using a Python function
- In the notebook, need to first run the code cell containing the queryBuilder() function so that the function is defined
- Arguments must be passed inside the parentheses in the order in which the parameters are listed in the function definition
- To use the default value for a parameter, users can pass an empty string, i.e. `''`, for that argument when calling the queryBuilder() function
- Users can edit the template queryBuilder() function to build url request for querying data in GDC API to include other parameters, such as `facets`, `expand`, `from` (pagination) and `sort`: 
https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#request-parameters

In [None]:
#format is specified as 'frmat' in function as format is an already declared object in python [the format() function]

def queryBuilder(endpoint, filters, fields, size, frmat):
    api_url = 'https://api.gdc.cancer.gov/'
    
    if frmat.lower() == 'json':
        request_query = api_url + endpoint + '?filters=' + filters + '&fields=' + fields + '&size=' + size + '&format=' + frmat + '&pretty=true'
    else:
        request_query = api_url + endpoint + '?filters=' + filters + '&fields=' + fields + '&size=' + size + '&format=' + frmat
    return request_query
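
As an alternative sketch (not the webinar's template), a builder can take keyword arguments and leave out any parameter that is empty, so defaults do not need placeholder strings. The name `query_builder_optional` is hypothetical. It does the percent-encoding itself, so `filters` should be passed as a plain JSON string rather than a pre-encoded one.

```python
from urllib.parse import urlencode

API_URL = "https://api.gdc.cancer.gov/"


def query_builder_optional(endpoint, **params):
    """Build a GDC query URL, skipping parameters that are '' or None.

    Values are percent-encoded here, so do not encode them beforehand.
    """
    kept = {name: str(value) for name, value in params.items()
            if value not in ("", None)}
    query = urlencode(kept)
    return API_URL + endpoint + ("?" + query if query else "")


print(query_builder_optional("cases", size=2, format="tsv"))
print(query_builder_optional("projects"))
```

Because `from` is a reserved word in Python, a pagination offset would have to be passed as `**{"from": 10}`.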

### Templates for query `filters`

- `Filters` are used to specify which hits to return from an endpoint, such as cases of a certain project or files from a certain workflow
- Filters need to be created in JSON format and then percent-encoded to be sent in the URL request (can use the `urllib` Python package for percent-encoding)
- JSON filters use [operators](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#filters-specifying-the-query) to specify relationships between a field and their possible values
- For a given endpoint, need to use indexed fields at that endpoint
    - For Search and Retrieval endpoints, can reference [Appendix A at GDC API Documentation site](https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/)
    - Can also view available fields for both Search and Retrieval and Data Analysis endpoints by [using the `_mapping` endpoint for a given endpoint](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#_mapping-endpoint)
- Specifying no filters will return all instances for a given endpoint (default)
- Below are several examples users can edit to build filters for a `GET` request

In [None]:
#one filter applied to endpoint

#one filter 
one_filter = {
            "op":"=",
            "content":{
                "field": "cases.project.project_id", 
                "value": "TCGA-BRCA"
    }
}

In [None]:
#combination of two filters applied to endpoint, i.e. (x AND/OR y) must be met

combination_two = {
    "op" : "and",
    "content":[{
        "op":"=",
         "content":{
              "field": "cases.project.project_id", 
                "value": "TCGA-BRCA"
            }
        }, 
        {
            "op":"=", 
            "content":{
                "field":"cases.disease_type",
                "value": "ductal and lobular neoplasms"
            }
        }
    ]
}

In [None]:
#combination of three filters applied to endpoint, i.e. (x AND/OR y AND/OR z) must be met

combination_three = {
    "op" : "and",
    "content":[{
        "op":"=",
         "content":{
              "field": "cases.project.project_id", 
                "value": "TCGA-BRCA"
            }
        }, 
        {
            "op":"=", 
            "content":{
                "field":"cases.disease_type",
                "value": "ductal and lobular neoplasms"
            }
        },
        {
            "op":">", 
            "content":{
                "field":"diagnoses.age_at_diagnosis",
                "value": "15000"
            }
        }
        
    ]
}

In [None]:
#complex combination of three filters applied to endpoint, i.e. (x AND/OR [y AND/OR z]) must be met

combination_three_2 = {
    "op": "and",
    "content": [{
            "op": "=",
            "content": {
                "field": "cases.project.project_id",
                "value": "TCGA-BRCA"
            }
        },
        {
            "op": "or",
            "content": [{
                    "op": "=",
                    "content": {
                        "field": "cases.disease_type",
                        "value": "cystic, mucinous and serous neoplasms"
                    }
                },
                {
                    "op": "=",
                    "content": {
                        "field": "cases.disease_type",
                        "value": "ductal and lobular neoplasms"
                    }
                }
            ]
        }
    ]
}
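
The nested dictionaries in these templates follow one repeating shape, so they can also be generated by small helper functions. The sketch below is an assumption-laden convenience, not part of the GDC API: `make_filter` and `combine` are hypothetical names, and the result reproduces the two-filter template.

```python
def make_filter(op, field, value):
    """Return a single-field filter such as field = value."""
    return {"op": op, "content": {"field": field, "value": value}}


def combine(op, *filters):
    """Join several filters with a logical operator ('and' / 'or')."""
    return {"op": op, "content": list(filters)}


project = make_filter("=", "cases.project.project_id", "TCGA-BRCA")
ductal = make_filter("=", "cases.disease_type", "ductal and lobular neoplasms")

# Same structure as the combination_two template.
rebuilt_combination_two = combine("and", project, ductal)
print(rebuilt_combination_two)
```

Nesting works the same way: passing the result of `combine("or", ...)` as one argument of `combine("and", ...)` yields the (x AND [y OR z]) form.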

### Template commands for formatting filter parameters

In [None]:
#percent encoding of filters
json_string = json.dumps(one_filter) #replace one_filter with input filter variable here; json.dumps already returns a str
example_filter = urllib.parse.quote(json_string) #characters such as braces, quotes and spaces become %XX sequences
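
To see what percent-encoding does, the short round trip below encodes a filter and decodes it again. It is a self-contained sketch using only the standard library; the variable names are illustrative.

```python
import json
from urllib.parse import quote, unquote

example = {
    "op": "=",
    "content": {"field": "cases.project.project_id", "value": "TCGA-BRCA"},
}

# Encode: JSON text -> URL-safe string (braces, quotes, spaces become %XX).
encoded = quote(json.dumps(example))

# Decode: URL-safe string -> JSON text -> Python dict.
decoded = json.loads(unquote(encoded))

print(encoded)
print(decoded == example)
```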

### Template for formatting `fields` to be returned by query

- The `fields` parameter is passed to the API request URL as a comma-delimited list of fields to be returned
- For a given endpoint, can only specify indexed fields at that endpoint
    - For Search and Retrieval endpoints, can reference [Appendix A at GDC API Documentation site](https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/)
    - Can also view available fields for both Search and Retrieval and Data Analysis endpoints by [using the `_mapping` endpoint for a given endpoint](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#_mapping-endpoint)
- Specifying no fields will return all available fields for entities that match `filters` for a given endpoint (default)

In [None]:
#specify fields to be returned
example_fields = ",".join([
    "submitter_id",
    "disease_type",
    "samples.submitter_id",
    "samples.sample_type", 
    "samples.tissue_type",
    "diagnoses.age_at_diagnosis"
])

### Template API `GET` Request 

In [None]:
#build API query: queryBuilder(endpoint, filters, fields, size, frmat)

#to specify no filters and/or no fields to return, replace variable with ''

template_request = queryBuilder('cases', example_filter, example_fields, '11315', "json")

template_request

##### <font color="red">Note: You can also copy and paste formatted request URL into browser url bar to  return results in browser</font>

In [None]:
#send request
result = requests.get(template_request)

#write request results to file, edit file name and type 
#the with statement closes the file automatically, so no explicit close() is needed
with open("ffpe.json", "w") as output: 
    output.write(result.text)
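
The template request above asks for every record in one call by using a large `size`. An alternative is to page through results with the `from` parameter mentioned earlier. The sketch below only builds the page URLs; it assumes `from` is a zero-based offset and that the total number of hits is already known (for example from the pagination information in a first response). `paged_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

API_URL = "https://api.gdc.cancer.gov/"


def paged_urls(endpoint, total, page_size, **params):
    """Return one query URL per page of `page_size` records."""
    urls = []
    for offset in range(0, total, page_size):
        page_params = dict(params, size=page_size)
        page_params["from"] = offset  # 'from' is a Python keyword, so set it by key
        urls.append(API_URL + endpoint + "?" + urlencode(page_params))
    return urls


for url in paged_urls("cases", total=25, page_size=10, format="json"):
    print(url)
```

Each URL can then be requested in turn and the hits from every page concatenated.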

## <a id='search_retrieve'>Search and Retrieval Endpoints Examples</a>

### Example 1: Retrieve case barcode, sample type and primary diagnosis data for DNA-seq files in TCGA-BRCA project

- For this example, we would like to retrieve whether BAM files in the TCGA-BRCA project are for normal or tumor samples, as well as what disease cases were diagnosed as
- Use 'files' endpoint, as this endpoint contains metadata related to files in the GDC (such as experimental strategy and data category)
- Need to filter down to files that are of the data category "sequencing reads" and experimental strategy type "WXS" (whole exome) to filter out other categories (like copy number variation, gene expression) and other experimental strategies (like RNA-Seq). 

In [None]:
#step 1: specify and encode filters

filters = {
    "op" : "and",
    "content":[{
        "op":"=",
         "content":{
              "field": "cases.project.project_id", 
                "value": "TCGA-BRCA"
            }
        }, 
        {
            "op":"=", 
            "content":{
                "field":"files.data_category",
                "value": "sequencing reads"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"files.experimental_strategy",
                "value": "WXS"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"files.data_format",
                "value": "BAM"
            }
        }
        
    ]
}

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: specify fields to be returned
fields = ",".join([
    "cases.submitter_id",
    "file_name",
    "cases.samples.sample_type",
    "cases.diagnoses.primary_diagnosis"
])

#step 3+4: specify size=1 and format=tsv, build query url with 'files' endpoint
brca_request = queryBuilder('files', filters_format, fields, '1', "tsv")

#step 5: send request
brca_result = requests.get(brca_request)

print(brca_result.text)
brca_request
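
TSV responses such as the one printed above can be read into rows with the standard `csv` module. The sketch below uses placeholder text in place of `brca_result.text`; the column names and values are invented for illustration and are not real GDC output.

```python
import csv
import io

# Placeholder standing in for brca_result.text (invented columns and values).
sample_tsv = (
    "file_name\tsample_type\n"
    "placeholder_1.bam\tPrimary Tumor\n"
    "placeholder_2.bam\tBlood Derived Normal\n"
)

# DictReader uses the header line as keys; delimiter="\t" selects TSV.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

for row in rows:
    print(row["file_name"], row["sample_type"])
```

With a real response, the column names should be taken from the header line that the API returns.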

### Example 2: Retrieve FFPE data for samples and portions for TCGA projects

- In this example, we will retrieve whether case samples and portions taken from cases in TCGA projects were Formalin-Fixed Paraffin-Embedded (FFPE) specimens or not
- Use the 'cases' endpoint, as this endpoint contains biospecimen and clinical information related to cases and samples in the GDC

In [None]:
#step 1: specify and encode filters
filters = {
            "op":"=",
            "content":{
                "field": "cases.project.program.name", 
                "value": "TCGA"
    }
}

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: specify fields to be returned
fields = ",".join([
    "submitter_id",
    "samples.submitter_id",
    "samples.is_ffpe",
    "samples.portions.submitter_id",
    "samples.portions.is_ffpe"
])

#step 3+4: specify size=1 and format=json, build query url with 'cases' endpoint
ffpe_request = queryBuilder('cases', filters_format, fields, '1', "json")

#step 5: send request
ffpe_result = requests.get(ffpe_request)

print(ffpe_result.text)
ffpe_request
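
JSON responses from the search and retrieval endpoints place the matching records under `data` > `hits`. The sketch below walks that structure to collect the FFPE flag per sample; the payload is a made-up placeholder with invented identifiers, standing in for `ffpe_result.text`.

```python
import json

# Placeholder payload (invented IDs) standing in for ffpe_result.text.
sample_json = json.dumps({
    "data": {
        "hits": [
            {
                "submitter_id": "CASE-PLACEHOLDER",
                "samples": [
                    {"submitter_id": "SAMPLE-A", "is_ffpe": False},
                    {"submitter_id": "SAMPLE-B", "is_ffpe": True},
                ],
            }
        ]
    }
})

payload = json.loads(sample_json)

# Map each sample's submitter_id to its is_ffpe value.
ffpe_by_sample = {
    sample["submitter_id"]: sample.get("is_ffpe")
    for case in payload["data"]["hits"]
    for sample in case.get("samples", [])
}
print(ffpe_by_sample)
```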

### Example 3: Age at Diagnosis, Days to Death after Diagnosis, Vital Status and other clinical data for cases in TCGA-KIRC project

- In this example, we will retrieve age, survival and other clinical data for cases in the TCGA-KIRC project
- Use the 'cases' endpoint, as this endpoint contains biospecimen and clinical information related to cases and samples in the GDC
- Results will only show data for `demographic.days_to_death` if case is deceased

In [None]:
#step 1: specify and encode filters
filters = {
            "op":"=",
            "content":{
                "field": "cases.project.project_id", 
                "value": "TCGA-KIRC"
    }
}

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: specify fields to be returned
fields = ",".join([
    "submitter_id",
    "diagnoses.age_at_diagnosis",
    "demographic.days_to_death",
    "demographic.vital_status", 
    "demographic.ethnicity",
    "demographic.race",
    "demographic.gender"
])

#step 3+4: specify size=2 and format=tsv, build query url with 'cases' endpoint
age_request = queryBuilder('cases', filters_format, fields, '2', "tsv")

#step 5: send request
age_result = requests.get(age_request)

print(age_result.text)
age_request

## <a id='analysis'>Data Analysis Endpoints Examples</a>

### Example 4: Gene information

- In this example, we will retrieve gene IDs and positions of genes present on chromosome 8 of the human genome
- Use the 'genes' endpoint, as this endpoint contains gene information indexed in the GDC API

In [None]:
#step 1: specify and encode filters
filters = {
            "op":"=",
            "content":{
                "field": "gene_chromosome", 
                "value": "8"
    }
}

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: specify fields to be returned
fields = ",".join([
    "id",
    "symbol",
    "gene_start",
    "gene_end"
])

#step 3+4: specify size=10 and format=tsv, build query url with 'genes' endpoint
genes_request = queryBuilder('genes', filters_format, fields, '10', "tsv")

#step 5: send request
genes_result = requests.get(genes_request)

print(genes_result.text)
genes_request

In [None]:
#Can also use the gene_id to query information about an individual gene from the 'genes' endpoint
#by appending the gene_id to the end of the 'genes' endpoint and specifying parameters

individual_gene_request = requests.get('https://api.gdc.cancer.gov/genes/ENSG00000160948?pretty=true')

print(individual_gene_request.text)

### Example 5: Simple Somatic Mutation Information 

- In this example, we will retrieve information on a specific mutation using its COSMIC ID
- Use the 'ssms' endpoint, as this endpoint contains mutation information indexed in the GDC API

In [None]:
#step 1: specify and encode filters
filters =  {
   "op":"in",
   "content":{
      "field":"cosmic_id",
      "value":[
         "COSM4860838"
      ]
   }
}

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: specify all fields to be returned (default =  "")
fields = ",".join([
    ""
])

#step 3+4: specify size=1 and format=json, build query url with 'ssms' endpoint
mutation_request = queryBuilder('ssms', filters_format, fields, '1', "json")

#step 5: send request
mutation_result = requests.get(mutation_request)

print(mutation_result.text)
mutation_request

### Example 6: Compare survival data for TCGA-SKCM cases with and without the `chr7:g.140753336A>T` mutation 

- For this example we wish to use the survival analysis endpoint to compare two survival plots for TCGA-SKCM cases: one plot for cases having the `chr7:g.140753336A>T` mutation, and the other plot for cases without the mutation. 
- Can retrieve the `ssm_id` for a mutation from the [GDC Data Portal > Exploration](https://portal.gdc.cancer.gov/exploration) tab. 
- The API query will also print the results of a chi-squared analysis between the two subsets of cases
    - Note that results of chi-square test are dependent on number of cases returned for each plot (`size` parameter); to choose all cases, use total number of cases in project for `size` parameter

In [None]:
#step 1: specify and encode filters
filters = [  
  {  
    "op":"and",
    "content":[  
      {  
        "op":"=",
        "content":{  
          "field":"cases.project.project_id",
          "value":"TCGA-SKCM"
        }
      },
      {  
        "op":"=",
        "content":{  
          "field":"gene.ssm.ssm_id",
          "value":"84aef48f-31e6-52e4-8e05-7d5b9ab15087"
        }
      }
    ]
  },
  {  
    "op":"and",
    "content":[  
      {  
        "op":"=",
        "content":{  
          "field":"cases.project.project_id",
          "value":"TCGA-SKCM"
        }
      },
      {  
        "op":"excludeifany",
        "content":{  
          "field":"gene.ssm.ssm_id",
          "value":"84aef48f-31e6-52e4-8e05-7d5b9ab15087"
        }
      }
    ]
  }
]

json_string=str(json.dumps(filters))
filters_format = urllib.parse.quote(json_string.encode('utf-8'))

#step 2: return all fields (default) by passing an empty string
fields = ""

#step 3+4: specify size=10 and format=JSON, build query url with 'analysis/survival' endpoint
survival_request = queryBuilder('analysis/survival', filters_format, fields, '10', "JSON")

#step 5: send request
survival_result = requests.get(survival_request)

#print(survival_result.text)
survival_request
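
The two subsets in this filter differ only in the operator applied to the mutation ID (`=` versus `excludeifany`), so the list can be generated for any project and mutation. This is a sketch under that assumption; `survival_filters` is a hypothetical helper, not a GDC function.

```python
def survival_filters(project_id, ssm_id):
    """Return [cases with the mutation, cases without it] for one project."""

    def subset(ssm_op):
        return {
            "op": "and",
            "content": [
                {"op": "=",
                 "content": {"field": "cases.project.project_id", "value": project_id}},
                {"op": ssm_op,
                 "content": {"field": "gene.ssm.ssm_id", "value": ssm_id}},
            ],
        }

    return [subset("="), subset("excludeifany")]


skcm_filters = survival_filters("TCGA-SKCM", "84aef48f-31e6-52e4-8e05-7d5b9ab15087")
print(len(skcm_filters))
```

The returned list can be serialized with `json.dumps` and percent-encoded in the same way as the hand-written filter.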

In [None]:
#can parse out the chi-squared test from results
#by loading results as a JSON object and selecting
#overallStats from results

json.loads(survival_result.text)['overallStats']

# <a id='submit'>Using the GDC API to Submit Data to GDC</a>

### Overview

- For projects that have been approved to be included in the GDC, submitters can make use of the `submission` GDC API endpoint to submit node entities to submission projects
- Submission will require a token downloaded from the [GDC Submission Portal](https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Data_Submission_Process/#authentication)
- Data can be submitted in `JSON` or `TSV` format; depending on the data format, users will need to edit the `"Content-Type"` in the request command (see below)
- Additionally, `JSON` and `TSV` templates for nodes to be submitted can be downloaded from the GDC Data Dictionary Viewer webpage: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?_top=1
- Submittable files (such as FASTQ or BAM files) should be uploaded with the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
- Additional features and more information regarding submission using the GDC API can be found here: https://docs.gdc.cancer.gov/API/Users_Guide/Submission/ 
- [Strategies for Submitting in Bulk](https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Data_Submission_Walkthrough/#strategies-for-submitting-in-bulk)

### Endpoint

- The format for using the GDC API Submission endpoint uses the project information, i.e. `https://api.gdc.cancer.gov/submission/<program_name>/<project_code>`
- For example: https://api.gdc.cancer.gov/submission/TCGA/LUAD or https://api.gdc.cancer.gov/submission/CPTAC/3 
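
The endpoint format above can be sketched as a small function. `submission_url` is a hypothetical helper; the optional `/_dry_run` suffix mirrors the one used in the TSV submission example later in this notebook.

```python
API_URL = "https://api.gdc.cancer.gov/"


def submission_url(program_name, project_code, dry_run=False):
    """Build the submission endpoint URL for a program/project pair."""
    url = f"{API_URL}submission/{program_name}/{project_code}"
    # Append the dry-run suffix only when requested.
    return url + "/_dry_run" if dry_run else url


print(submission_url("TCGA", "LUAD"))
print(submission_url("CPTAC", "3"))
```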

### Steps

1. Read in token file
2. Read in submission file
3. Edit endpoint with project ID information and submit data using `POST` (JSON file submission) or `PUT` (TSV file submission) request
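
Step 3 pairs each file format with an HTTP method and a `Content-Type` header. A minimal sketch of that pairing, using the values from the examples in this notebook, is shown below; `submission_request_parts` and the token string are placeholders for illustration.

```python
# Format -> (HTTP method, Content-Type), as used in the submission examples.
SUBMISSION_FORMATS = {
    "json": ("POST", "application/json"),
    "tsv": ("PUT", "text/tsv"),
}


def submission_request_parts(data_format, token):
    """Return the HTTP method and headers for a submission in `data_format`."""
    method, content_type = SUBMISSION_FORMATS[data_format.lower()]
    headers = {"X-Auth-Token": token, "Content-Type": content_type}
    return method, headers


method, headers = submission_request_parts("TSV", "PLACEHOLDER-TOKEN")
print(method, headers["Content-Type"])
```

An unsupported format raises `KeyError`, which makes a mistyped format visible before any request is sent.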

### Example 7: Submitting a JSON Data File

In [None]:
#1. Read in token file

with open("../gdc-user-token.txt") as token_file:
    token = token_file.read().strip()

In [None]:
#2. Read in submission file

with open("example_file.json") as json_file:
    example_file_json = json.load(json_file)

In [None]:
#3. Edit endpoint and submit data using POST request

ENDPT = "https://api.gdc.cancer.gov/submission/GDC/INTERNAL"

#submission request if data is in JSON format
response = requests.post(url = ENDPT, json = example_file_json, headers={'X-Auth-Token': token, "Content-Type": "application/json"})
print(response.text)

### Example 8: Submitting a TSV Data File

In [None]:
#1. Read in token file

with open("../gdc-user-token.txt") as token_file:
    token = token_file.read().strip()

In [None]:
#2. Read in submission file

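#read the file contents as bytes so the file is closed before the request is sent
with open("example_file.txt", "rb") as tsv_file:
    example_file_tsv = tsv_file.read()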

In [None]:
#3. Edit endpoint and submit data using PUT request

ENDPT = "https://api.gdc.cancer.gov/submission/GDC/INTERNAL/_dry_run"

#submission request if data is in TSV format
res = requests.put(url = ENDPT, data = example_file_tsv, headers={'X-Auth-Token': token, "Content-Type": "text/tsv"})

res.text