<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/API/notebooks/How_to_use_IDC_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Use the IDC APIs

## Overview of this notebook
This notebook is designed as a quick introduction to the IDC APIs and how to access them with Python.

Topics covered:
* Overviews of APIs, Swagger, JSON, endpoints
* Use cases for IDC APIs
* Examples of IDC API endpoints

### Overview of APIs
An API or application-programming interface is a software intermediary that allows two applications to talk to each other. In other words, an API is the messenger that delivers your request to the provider that you’re requesting it from and then delivers the response back to you [(Wikipedia)](https://en.wikipedia.org/wiki/Application_programming_interface). Each action that an API can take is called an "endpoint".

Some useful tutorials and quick start guides on APIs are:
* [GDC's Getting Started guide for APIs](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)
* [API Integration in Python](https://realpython.com/api-integration-in-python/)
* [Python API Tutorial: Getting Started with APIs](https://www.dataquest.io/blog/python-api-tutorial/)

### What is an HTTP Message?

Clients and the IDC API server communicate via HTTP messages. An overview of HTTP messaging can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages).

### What is JSON?

JSON, or JavaScript Object Notation, is a lightweight data-interchange format that is easy for humans and machines to work with. More information can be found at [json.org](https://www.json.org/).
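
As a small illustration of the format, the sketch below round-trips a Python object through JSON text using the standard `json` module. The payload is invented for this example and is not taken from any IDC API response.

```python
import json

# Hypothetical payload; the keys below are made up for illustration
# and are not taken from any IDC API response.
payload = {"name": "example", "tags": ["a", "b"], "count": 2}

# Serialize the Python object to JSON text.
text = json.dumps(payload, sort_keys=True)

# Parse the JSON text back into Python objects.
restored = json.loads(text)

print(text)
print(restored["tags"][0])
```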

### What is SwaggerUI?

[SwaggerUI](https://swagger.io/tools/swagger-ui/) is a user interface that allows users to try out the APIs and view their documentation easily. You can access the IDC API SwaggerUI [here](https://api.imaging.datacommons.cancer.gov/v1/swagger).

### What is an endpoint?

An endpoint is the *call* for a specific functionality of an API. For example, `/collections` at the end of the API request URL `https://api.imaging.datacommons.cancer.gov/v1/collections` is an endpoint that returns (or GETs) information about the available collections.
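
To make the relationship between a base URL and an endpoint concrete, here is a minimal sketch of a helper that joins the two. The function name `endpoint_url` is hypothetical and is not part of the IDC API or of the `requests` library.

```python
def endpoint_url(base_url: str, endpoint: str) -> str:
    """Join an API base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


base = "https://api.imaging.datacommons.cancer.gov/v1"
print(endpoint_url(base, "/collections"))
print(endpoint_url(base + "/", "about"))
```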

### IDC API Documentation

Detailed documentation on the IDC API can be found in the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev).

### IDC API URL Preamble
We define the IDC API URL preamble as a variable so that we can easily change it.

In [None]:
idc_api_preamble = 'https://api.imaging.datacommons.cancer.gov/v1'

### Python library `requests`

In this notebook, we use the Python Requests HTTP library to access IDC API endpoints.

In [None]:
# Install requests if needed
# pip install requests

# Import the requests library
import requests

## Use cases for IDC APIs

The IDC APIs can be used for a number of different tasks that involve interacting with the Google Cloud Platform and BigQuery. They can be used to subset data into cohorts or to access cohorts that have been created using the IDC WebApp. They can also be used to obtain the location of the DICOM objects associated with a cohort.

## Example: GET `/about` Endpoint



We will first explore the `/about` endpoint by sending a GET request to the API. This endpoint returns information about the IDC API, such as links to the Swagger UI interface and to the IDC User Guide.

In [None]:
# First submit the 'get' request to the API
response = requests.get('{}/about'.format(idc_api_preamble))

Now that we have the response, we will check whether the request was successful or returned an error code. If the request was successful, the status code will be 200. If something went wrong, the status code may be something like 404 or 503. If you have received an error code, you can check out Google's Troubleshooting response errors guide.
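
As an aside, the standard library's `http.HTTPStatus` enumeration can translate a numeric status code into its standard phrase. The sketch below is only an illustration: the helper name `describe_status` is hypothetical, and treating every 2xx code as a success is a simplifying assumption.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not a registered HTTP status.
        phrase = "Unknown status"
    # Assumption for this sketch: any 2xx code counts as success.
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {phrase} ({outcome})"


for code in (200, 404, 503):
    print(describe_status(code))
```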



In [None]:
# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {} {}'.format(response.status_code, response.reason))

Finally, we will define a small pretty-printing function and print out the information that we have received from the API. This response is returned as a dictionary, though responses can also be a combination of dictionaries and lists, depending on which endpoint is called. This means that you can access different data in the response the same way that you would access dictionaries and lists.

In [None]:
# import json
import json
def pretty(response):
  print(json.dumps(response.json(), sort_keys=True, indent=4))

# Print the response
pretty(response)
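
Because a parsed response is built from ordinary dictionaries and lists, it can be navigated with normal Python indexing. The sketch below uses an invented structure. The keys `items`, `id`, and `labels` are hypothetical and do not reflect the actual payload of any IDC endpoint.

```python
# Hypothetical parsed response; the keys and values are invented for
# illustration and do not reflect the real payload of any IDC endpoint.
data = {
    "code": 200,
    "items": [
        {"id": "first", "labels": ["x", "y"]},
        {"id": "second", "labels": []},
    ],
}

# Dictionary access by key, list access by position.
first_id = data["items"][0]["id"]

# Iterate over a list of dictionaries.
all_ids = [item["id"] for item in data["items"]]

# Use .get() to avoid a KeyError when a key may be absent.
missing = data.get("next_page")

print(first_id, all_ids, missing)
```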

The requests library makes it easy to use the APIs. Next we will cover a few of the other informational APIs.

## Example: GET `/versions` endpoint

Over time, the set of data hosted by the IDC will change. For the most part, such changes will be due to new data having been added. The totality of IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set. 

The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and the corresponding DICOM instances added to the IDC hosted data. Moreover, additional patients might well be added to the IDC data set over time. This means that the set of subjects defined by some filter set will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a filter set and an IDC data version.
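
One way to picture a cohort as "a filter set plus a data version" is the sketch below. It is only an illustration: the class `CohortDefinition` and the helper `make_definition` are hypothetical, and treating the order of values within an attribute as irrelevant is an assumption made for this example, not a documented property of the API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortDefinition:
    """A filter set pinned to one IDC data version (illustrative only)."""
    filters_json: str      # canonical JSON text of the filters object
    idc_data_version: str  # for example "4.0"


def make_definition(filters: dict, idc_data_version: str) -> CohortDefinition:
    # Sort attribute names and values so that equivalent filter sets
    # produce the same canonical text (an assumption of this sketch).
    canonical = {key: sorted(values) for key, values in filters.items()}
    return CohortDefinition(json.dumps(canonical, sort_keys=True), idc_data_version)


a = make_definition({"Modality": ["CT", "MR"]}, "4.0")
b = make_definition({"Modality": ["MR", "CT"]}, "4.0")
c = make_definition({"Modality": ["CT", "MR"]}, "3.0")

# The same filters and version compare equal; a different version does not.
print(a == b, a == c)
```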

The `/versions` endpoint returns information about the defined IDC versions. This information includes the data sources (BQ tables) containing the data of each version. This endpoint returns a more complicated JSON object which has a combination of lists and dictionaries. We will first retrieve the response and then check whether it contains an error code.

In [None]:
# Retrieve the response from the API endpoint
versions_req = requests.get('{}/versions'.format(idc_api_preamble))

# Check that there wasn't an error with the request
if versions_req.status_code != 200:
  # Print the error payload if something went wrong
  print(versions_req.json())

In [None]:
# Print the versions JSON text
pretty(versions_req)

The returned data is a combination of dictionaries and lists. We see that, as of this writing, there are four IDC versions. The current, or active, version is idc_data_version 4.0.

## Example: GET `/collections` endpoint

An *original collection* is a set of DICOM data provided by a single source.
Original collections are comprised primarily of DICOM image data that was obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.

Analysis results are comprised of derived DICOM data that was generated by analyzing other (typically Original) collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in Analysis results include segmentations, annotations and further processing of original images. Note that some Original collections include such analysis results, though most of the data in Original collections are original images.

The `/collections` endpoint returns data about all the Original collections for the active IDC data version.


In [None]:
response = requests.get('{}/collections'.format(idc_api_preamble))
# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))

In [None]:
# Print the collections JSON text
pretty(response)

Metadata for Original collections is derived from the [TCIA Data Collections page](https://www.cancerimagingarchive.net/collections/), and for Analysis results from the [TCIA Analysis Results page](https://www.cancerimagingarchive.net/tcia-analysis-results/). 

A similar /analysis_results API returns metadata on analysis results.

## Example: POST `/cohorts/manifest/preview` endpoint

The POST `/cohorts/manifest/preview` endpoint returns a manifest of *access_methods* for the objects in a preview cohort. Please refer to the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev) for further information on manifests.

The `/cohorts/manifest/preview` API does not actually create a cohort, but acts as if a cohort were created. Creating a cohort requires authenticating to the API; that process is addressed in a subsequent example describing the `/cohorts/manifest/{cohort_id}` API.

A manifest is a list of *access methods*. Each access method describes how to access the study, series and/or instance DICOM objects in the cohort. A manifest can optionally include additional metadata per DICOM object. 

The objects in the preview cohort are defined by a *filters* object that is implicitly applied against the current (active) IDC data version.
A *filters* object is a list of *attribute*, *values* pairs, where *values* is a list of one or more values, at least one of which must be satisfied by the associated attribute.

In the following, we construct a dict, `cohortSpec`, containing the name and a description for the preview cohort, as well as a *filters* object that selects subjects who are in either the TCGA-LUAD or the TCGA-KIRC collection, who have DICOM data with a CT or MR modality, and whose race is Asian.
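
The selection logic described above (any listed value within an attribute, all attributes together) can be sketched locally as follows. This is only an illustration of the semantics as read from the paragraph above. The real filtering is performed by the IDC API, and the helper `matches` and the sample records are hypothetical.

```python
def matches(record: dict, filters: dict) -> bool:
    """True if the record satisfies every attribute in the filters.

    A record satisfies an attribute when its value is one of the listed
    values, so values are OR-ed within an attribute and attributes are AND-ed.
    """
    return all(record.get(attr) in values for attr, values in filters.items())


example_filters = {
    "collection_id": ["TCGA-LUAD", "TCGA-KIRC"],
    "Modality": ["CT", "MR"],
    "race": ["ASIAN"],
}

# Hypothetical records, invented for illustration.
records = [
    {"collection_id": "TCGA-LUAD", "Modality": "CT", "race": "ASIAN"},
    {"collection_id": "TCGA-KIRC", "Modality": "PT", "race": "ASIAN"},
    {"collection_id": "TCGA-BRCA", "Modality": "MR", "race": "ASIAN"},
]

selected = [r for r in records if matches(r, example_filters)]
print(selected)
```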



In [None]:
filters = {
  "collection_id": [ "TCGA-LUAD", "TCGA-KIRC" ],
  "Modality": ["CT", "MR"],
  "race": ["ASIAN"]
}

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}



The query string selects additional metadata to be returned. In addition, the amount of data returned by each call can be limited. When this is done, the API can be called iteratively until all data has been received. The params object below selects a limited set of data to be returned; refer to the API documentation for details. In this example, we will limit the returned data to 2 rows.
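
To see roughly what a params dictionary looks like once it is appended to a URL, the sketch below encodes a few of these parameters with the standard library's `urlencode`. It approximates the encoding that `requests` performs when given `params=`. The variable name `example_params` is only for this illustration.

```python
from urllib.parse import urlencode

# A subset of the query parameters used in this notebook.
example_params = {"sql": True, "CRDC_Series_GUID": True, "page_size": 2}

# urlencode converts each value with str(), so True becomes "True".
query = urlencode(example_params)
url = "https://api.imaging.datacommons.cancer.gov/v1/cohorts/manifest/preview?" + query

print(query)
print(url)
```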

In [None]:
params = dict(
    sql = True,
    Collection_ID = True,
    Patient_ID = True,
    CRDC_Series_GUID = True,
    CRDC_Instance_GUID = False,
    GCS_URL = True,
    page_size = 2
)

We are now ready to call the endpoint. Note that `/cohorts/manifest/preview` is a POST method, so we call `requests.post()`.

In [None]:
response = requests.post('{}/cohorts/manifest/preview'.format(idc_api_preamble),
                    params=params, json=cohortSpec)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  pretty(response)

The returned data includes the cohort name, description and filter set which we passed as parameters, as well as the BigQuery SQL that produced these results.

We can see that there are 1581 total rows, but only 2 were returned.

## Paged results

It can be seen that the above result includes a non-null `next_page` token. When the `next_page` token is non-null, it indicates that more data is available. This token can be passed as a parameter in a subsequent invocation of `/cohorts/manifest/nextPage` to obtain additional data.

The /cohorts/manifest/nextPage endpoint only takes page_size and next_page query parameters. Other parameters to the original /cohorts/manifest/preview are implicit in the next_page token. Similarly, the /cohorts/manifest/nextPage endpoint does not return the cohort and sql metadata.


In [None]:
query_string = dict(
    page_size = 2,
    next_page = response.json()['next_page']
)
response = requests.get('{}/cohorts/manifest/nextPage'.format(idc_api_preamble),
                    params=query_string)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  pretty(response)
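
The call-until-done pattern can be sketched as a small generator that keeps following `next_page` tokens until one comes back null. The function `iter_pages`, the stand-in fetchers, and the `rows` key are all hypothetical. Only the `next_page` key is taken from the responses shown in this notebook.

```python
from typing import Callable, Iterator


def iter_pages(fetch_first: Callable[[], dict],
               fetch_next: Callable[[str], dict]) -> Iterator[dict]:
    """Yield each page, following next_page tokens until none remains."""
    page = fetch_first()
    yield page
    while page.get("next_page"):
        page = fetch_next(page["next_page"])
        yield page


# Stand-in data simulating a three-page result without any network access.
# The "rows" key is invented for this sketch.
_fake_pages = {
    None: {"rows": [1, 2], "next_page": "t1"},
    "t1": {"rows": [3, 4], "next_page": "t2"},
    "t2": {"rows": [5], "next_page": None},
}

pages = list(iter_pages(lambda: _fake_pages[None],
                        lambda token: _fake_pages[token]))
rows = [row for page in pages for row in page["rows"]]
print(rows)
```

In the notebook, `fetch_first` could wrap the `requests.post` call to `/cohorts/manifest/preview` and `fetch_next` could wrap the `requests.get` call to `/cohorts/manifest/nextPage`, each returning `response.json()`.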



The same `/cohorts/manifest/nextPage` endpoint can be used to obtain additional data resulting from accessing the `/cohorts/manifest/{cohort_id}` endpoint, described in the next section.

## Notes on Authorization and Credentials
In this section we will focus on the GET `/cohorts/manifest/{cohort_id}` endpoint. Unlike the `/cohorts/manifest/preview` endpoint, the `/cohorts/manifest/{cohort_id}` endpoint returns a manifest for a cohort that was previously defined via the IDC POST `/cohorts` API or through the IDC WebApp.

In order to be able to create a cohort or access a previously defined cohort from the API, the user must authenticate their identity with the API. This section will step through the authentication/authorization process. Perform the following steps on your local machine:

1. Use the [idc_auth.py](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/scripts/idc_auth.py) Python script to create a Credential File on your local machine. 
  * This script can be run from the command line or from within Python but should be run on your local machine.
  * Refer to the script for execution instructions
2. Find the location of the Credential File on your local machine
  * By default, the script will save the credential file in the user's home folder with the file name: ".idc_credentials"

We can now proceed to load the credential file into the cloud environment you are using:

In [None]:
# If you skipped earlier sections, you will need these two packages to run the
# code below
# Install requests if needed
#pip install requests

# Import json
#import json

# Import the requests library
#import requests

In [None]:
# import os
import os
# Import files helper for Colab
from google.colab import files

In [None]:
# First delete any existing .idc_credentials. If we do not do this, files.upload() will name the new file ".idc_credentials (1)"
try:
  os.remove('./.idc_credentials')
except FileNotFoundError:
  print('.idc_credentials not found')

# Upload your credentials to the cloud environment. Click on the Choose Files button to bring up a file browser
uploaded = files.upload()

Now that we have the Credentials file created and uploaded to the cloud environment, we can extract the ID token that identifies you for the purpose of authorization.

In [None]:
# Open the credentials file and parse its JSON content
with open(".idc_credentials", "r") as credentials_file:
  token = json.load(credentials_file)
# Get the ID token from the parsed credentials
creds = token['token_response']['id_token']
# Create a headers dictionary for requests
head = {'Authorization': 'Bearer ' + creds}

**Note:** the credentials file will expire after 1 hour and a new one will need to be generated. If a new file is not generated with the idc_auth script, you can delete the original file and try running the script again.
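
Because the credentials expire, it can be convenient to wrap the loading step in a function and to check the file's age before using it. The sketch below is illustrative only. The helper names are hypothetical, and using the file's modification time as a proxy for when the token was issued is an assumption, not something the IDC API guarantees.

```python
import json
import os
import tempfile
import time


def load_auth_header(path: str) -> dict:
    """Build an Authorization header from a credentials file."""
    with open(path, "r") as f:
        credentials = json.load(f)
    return {"Authorization": "Bearer " + credentials["token_response"]["id_token"]}


def is_probably_expired(path: str, max_age_seconds: int = 3600) -> bool:
    """Guess expiry from the file's modification time (an assumption)."""
    return time.time() - os.path.getmtime(path) > max_age_seconds


# Demonstrate with a temporary file holding a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".idc_credentials")
    with open(demo_path, "w") as f:
        json.dump({"token_response": {"id_token": "PLACEHOLDER"}}, f)
    header = load_auth_header(demo_path)
    expired = is_probably_expired(demo_path)

print(header, expired)
```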

## Example: POST `/cohorts` and GET `/cohorts/manifest/{cohort_id}` Endpoints
We can now proceed to use the POST `/cohorts` endpoint to create a cohort. We will use the same cohort definition as previously.

In [None]:
filters = {
  "collection_id": [ "TCGA-LUAD", "TCGA-KIRC" ],
  "Modality": ["CT", "MR"],
  "race": ["ASIAN"]
}
  
cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

response = requests.post(f'{idc_api_preamble}/cohorts/',
                      json=cohortSpec, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  print(json.dumps(response.json(), sort_keys=True, indent=1))

  cohort_id = response.json()['cohort_properties']['cohort_id']

Note that the response includes the cohort_id of the newly created cohort. 

We will use this ID when querying for the cohort's manifest. Note also that the response repeats the filter and other cohort metadata. This time we will return only series GUIDs.

In [None]:
query_string = dict(
    CRDC_Series_GUID = True,
    page_size = 2
)

response = requests.get('{}/cohorts/manifest/{}'.format(idc_api_preamble, cohort_id),
                      params=query_string, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error payload if something went wrong
    print(response.json())
else:
    print(json.dumps(response.json(), sort_keys=True, indent=4))

As with POST `/cohorts/manifest/preview/`, the returned next_page token can be used in subsequent calls to obtain additional data.

## Resolving a CRDC GUID
A CRDC GUID can be resolved to a GA4GH DrsObject, and ultimately used to obtain the URLs of all IDC DICOM instance objects in the corresponding study, series or instance. Please refer to the [API](https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids) section of the [IDC User Guide](https://learn.canceridc.dev) for further information on DrsObjects and access methods.

We will resolve a series GUID, obtaining some set of instance GUIDs. We will then resolve each of the instance GUIDs to a GCS URL. We could have parameterized the `/cohorts/manifest/{cohort_id}` call to return GCS URLs directly. When those URLs will be used soon, this is fine. However, IDC may, from time to time, change the GCS bucket in which a DICOM instance is stored; in that case a retained GCS URL will be invalid. In contrast, the GUID of an IDC DICOM study, series, or instance can always be resolved to the current location of the corresponding data. Moreover, a cohort composed of study and/or series GUIDs is more space efficient than a cohort of instance GUIDs or URLs.

In the following, we demonstrate the steps to resolving a series GUID in some manifest.

In the first step we resolve each series GUID in the above response from the `/cohorts/manifest/{cohort_id}` endpoint.

In [None]:
dcf_url = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects"
# Create a list of the series GUIDs in the previous response.
series_guids = [row['CRDC_Series_GUID'] for row in response.json()['manifest']['json_manifest']]
series_drs_objects= []
# Resolving a GUID simply requires presenting it to the nci-crdc.datacommons.io server.
for series_guid in series_guids:
  series_drs_object = requests.get(f'{dcf_url}/{series_guid}')
  series_drs_objects.append(series_drs_object)

# Print one of the DrsObjects
print('This series has {} instances'.format(len(series_drs_objects[1].json()['contents'])))
pretty(series_drs_objects[1])

Now we can resolve each of the instance DrsObjects. Each of the elements in the *contents* list is a *ContentsObject*. We can resolve the drs_uri component in each ContentsObject to a DrsObject that will contain the GCS URL of the corresponding DICOM instance. In this demo, we will do this for just a single series.

In [None]:
instance_drs_objects = []
for contents_object in series_drs_objects[1].json()['contents']:
  instance_drs_object = requests.get('{}/{}'.format(dcf_url,contents_object['id']))
  instance_drs_objects.append(instance_drs_object)

# Print one of the instance Objects
pretty(instance_drs_objects[0])

The *`access_methods`* component of this DrsObject is an array of *`access_method`* objects, each of which contains a URL. There is only a single gs type *`access_method`* for this instance, though the DrsObject structure allows for more than one in the case that the referenced data is available from multiple sources. The gs *`type`* and the gs:// prefix indicate that this instance is in a Google Cloud Storage bucket.
The bucket name is *idc-tcia-tcga-kirc* and is a good example of the need for GUID based manifests: IDC DICOM data is being migrated to a single bucket named idc-open and the objects are being renamed with shorter UUID based names. Thus URLs such as the above will eventually become "stale", but GUIDs will continue to be resolvable to the correct URLs.
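
Since a DrsObject may list several access methods, one way to pick out the Google Cloud Storage URLs is to filter on the `type` field. The sketch below is a minimal illustration. The helper `gs_urls` and the sample DrsObject fragment, including its bucket and object names, are hypothetical.

```python
def gs_urls(drs_object: dict) -> list:
    """Return the URLs of all access methods whose type is 'gs'."""
    return [
        method["access_url"]["url"]
        for method in drs_object.get("access_methods", [])
        if method.get("type") == "gs" and "access_url" in method
    ]


# Hypothetical DrsObject fragment; the id, bucket, and object names are
# invented and do not refer to real IDC data.
example = {
    "id": "dg.0000/example",
    "access_methods": [
        {"type": "gs", "access_url": {"url": "gs://example-bucket/path/to/instance.dcm"}},
        {"type": "https", "access_url": {"url": "https://example.org/instance.dcm"}},
    ],
}

print(gs_urls(example))
```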

## Accessing IDC DICOM data in Google Cloud Storage

The above URL can be used to access the corresponding DICOM object. You can use gsutil, curl, or Google APIs for this purpose.
Note that IDC GCS buckets have the Requester Pays attribute. This means that the user bears the cost of accessing the bucket; in particular, the user must pay any egress charges.

For a gs type URL of the form `gs://{bucket}/{object_name}`, the following command line methods, or an equivalent programmatic method, can be used to obtain the object:
*   gsutil -u {project_id} cp gs://{bucket}/{object_name} {save_location}
*   curl -X GET \\\
    -H "Authorization: Bearer {oauth2_token}" \\\
    -o {save_location} \\\
    -k "https://storage.googleapis.com/storage/v1/b/{bucket}/o/{object_name}?alt=media&userProject={project_id}"
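
Converting a gs type URL to an https URL can also be done with the standard library's URL parser rather than by manual string splitting. The following is a sketch under stated assumptions: the helper `gs_to_https` is hypothetical, the bucket and object names are invented, and the `https://storage.googleapis.com/{bucket}/{object_name}` form is the one used later in this notebook.

```python
from urllib.parse import quote, urlparse


def gs_to_https(gs_url: str) -> str:
    """Convert gs://bucket/object to an https URL on storage.googleapis.com."""
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URL: {gs_url!r}")
    bucket = parsed.netloc
    # urlparse separates any '#fragment', so only the object path remains.
    object_name = parsed.path.lstrip("/")
    # Percent-encode the object name but keep '/' separators.
    return "https://storage.googleapis.com/{}/{}".format(bucket, quote(object_name))


# Invented bucket and object names, for illustration only.
print(gs_to_https("gs://example-bucket/dir/file.dcm"))
print(gs_to_https("gs://example-bucket/dir/file.dcm#fragment"))
```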

We will continue to use the same requests.get method. However, we need to generate a new oauth2 token. The Bearer token that we generated was specifically for accessing the IDC API. See the [Google documentation](https://cloud.google.com/storage/docs/using-requester-pays#using) on accessing Requester Pays buckets for more details.
In this case, we need to authenticate because the access to the bucket is set to AllAuthenticatedUsers.

Note that the authentication step below is specific to Colab. In other contexts, authentication will likely require a different process.

In [None]:
from google.colab import auth
auth.authenticate_user()
auth_token = !gcloud auth application-default print-access-token
# Build a dictionary of headers
headers = {'Authorization':'Bearer {}'.format(auth_token[0])}

In [None]:
# Get the gs type URL from one of the returned DrsObjects
gs_url = instance_drs_objects[0].json()['access_methods'][0]['access_url']['url'].split('#')[0]
# Convert the gs type URL to an https type url
bucket = gs_url.split('/')[2]
object_name = gs_url.split('/',4)[-1]
url = 'https://storage.googleapis.com/{}/{}'.format(bucket,object_name)
query_string = 'userProject=YOUR_PROJECT_ID'
response = requests.get(url,
  # params=query_string,
  headers=headers)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))



To verify that we have downloaded the instance, we will inspect it using the pydicom package.

In [None]:
!pip install pydicom
import pydicom

We can access the instance while it is in memory by converting the returned bytes to a binary I/O stream.

In [None]:
import io
dcm = io.BytesIO(response.content)
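
Before handing the bytes to pydicom, a quick sanity check is possible: a DICOM Part 10 file starts with a 128-byte preamble followed by the four characters `DICM`. The sketch below checks for that marker using an in-memory stream. The helper `looks_like_dicom` is hypothetical, and the test bytes are fabricated rather than downloaded data.

```python
import io


def looks_like_dicom(data: bytes) -> bool:
    """Check for the 'DICM' marker that follows the 128-byte preamble."""
    stream = io.BytesIO(data)
    stream.seek(128)
    return stream.read(4) == b"DICM"


# Fabricated bytes: a zero-filled preamble, the marker, then filler.
fake = bytes(128) + b"DICM" + b"rest"

print(looks_like_dicom(fake), looks_like_dicom(b"not dicom"))
```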

Then we read the object and dump its contents.

In [None]:
ds = pydicom.dcmread(dcm)
ds

## Example: POST `/cohorts/query/preview`
The POST /cohorts/query/preview API returns selected metadata for the elements in a specified cohort. Like the POST /cohorts/manifest/preview API, the `/cohorts/query/preview` API does not actually create a cohort, but acts as if a cohort were created.

The values that can be queried are the Original and Derived values that define a filter. Currently these are:
Modality, BodyPartExamined, StudyDescription, StudyInstanceUID, PatientID, SeriesInstanceUID, SOPInstanceUID, SeriesDescription, SliceThickness, SeriesNumber, StudyDate, SOPClassUID, collection_id, AnatomicRegionSequence, SegmentedPropertyCategoryCodeSequence, SegmentedPropertyTypeCodeSequence, FrameOfReferenceUID, SegmentNumber, SegmentAlgorithmType, SUVbw, Volume, Diameter, Surface_area_of, Total_Lesion, Standardized_Added_Metabolic, Percent_Within_First_Quarter_of_Intensity_Range, Percent_Within_Third_Quarter_of_Intensity_Range, Percent_Within_Fourth_Quarter_of_Intensity_Range, Percent_Within_Second_Quarter_of_Intensity_Range, Standardized_Added_Metabolic_Activity, Glycolysis_Within_First_Quarter_of_Intensity_Range, Glycolysis_Within_Third_Quarter_of_Intensity_Range, Glycolysis_Within_Fourth_Quarter_of_Intensity_Range, Glycolysis_Within_Second_Quarter_of_Intensity_Range, Internal, Sphericity, Calcification, Lobular, Spiculation, Margin, Texture, Subtlety, Malignancy 



In the following, the filter defines a cohort of LIDC_IDRI cases that have spiculation values of 4 or 5 out of 5. The queried values include the PatientID, SOPInstanceUID, Modality and Spiculation.

In [None]:
filters = {
  "collection_id": [
    "LIDC_IDRI"
   ],
   "Spiculation": [
     "4 Out of 5",
     "5 out of 5 (Marked spiculation)"
    ]
}

cohort_def = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

queryFields = {
    "fields": [
      "PatientID","SOPInstanceUID","Modality","Spiculation"
    ]
  }
queryPreviewBody = {"cohort_def": cohort_def,
                    "queryFields": queryFields}

query_string = {
    'sql': False,
    'page_size': 10
}

response = requests.post(f'{idc_api_preamble}/cohorts/query/preview',
                      json=queryPreviewBody, 
                      params=query_string)
# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error payload if something went wrong
    print(response.json())
else:
    print(json.dumps(response.json(), sort_keys=True, indent=4))

Like the manifest APIs, there is also a GET `/cohorts/query/{cohort_id}` form of this API that returns additional data about an existing cohort. Also like the manifest APIs, the query APIs are paged. The `/cohorts/query/nextPage` endpoint can be used to obtain the additional pages.

## Example: GET `/dicomMetadata`
For completeness, this last section reviews the GET /dicomMetadata endpoint. This endpoint returns a fixed selection of DICOM metadata for all IDC DICOM instances. It is intended for use by other CRDC resources that might need such information for the purpose of aggregating imaging data with other data types.

Because it returns metadata on all DICOM instances, paging must be used to obtain the complete set of results.

In [None]:
query_string = dict(
    page_size = 10
)

response = requests.get('{}/dicomMetadata'.format(idc_api_preamble),
                      params=query_string, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error payload if something went wrong
    print(response.json())
else:
    print(json.dumps(response.json(), sort_keys=True, indent=4))

Subsequent pages can be obtained using the /dicomMetadata/nextPage endpoint.

In [None]:
query_string = dict(
    page_size = 10,
    next_page = response.json()['next_page']
)
response = requests.get('{}/dicomMetadata/nextPage'.format(idc_api_preamble),
                    params=query_string)

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error payload if something went wrong
  print(response.json())
else:
  print(json.dumps(response.json(), sort_keys=True, indent=4))

