<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/API/notebooks/How_to_use_IDC_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Use the IDC APIs

## Overview of this notebook
This notebook is designed as a quick introduction to the IDC APIs and how to access them with Python.

Topics covered:
* Overviews of APIs, Swagger, JSON, endpoints
* Use cases for IDC APIs
* Examples of IDC API endpoints

### Overview of APIs
An API, or application programming interface, is a software intermediary that allows two applications to talk to each other. In other words, an API is the messenger that delivers your request to the provider you are requesting it from and then delivers the response back to you [(Wikipedia)](https://en.wikipedia.org/wiki/Application_programming_interface). Each action that an API can take is called an "endpoint".

Some useful tutorials and quick start guides on APIs are:
* [GDC's Getting Started guide for APIs](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)
* [API Integration in Python](https://realpython.com/api-integration-in-python/)
* [Python API Tutorial: Getting Started with APIs](https://www.dataquest.io/blog/python-api-tutorial/)

### What is an HTTP Message?

Clients and the IDC API server communicate via HTTP messages. An overview of HTTP messaging can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages).

### What is JSON?

JSON, or JavaScript Object Notation, is a lightweight data-interchange format that is easy for humans and machines to work with. More information can be found at [json.org](https://www.json.org/).

### What is SwaggerUI?

[SwaggerUI](https://swagger.io/tools/swagger-ui/) is a user interface that allows users to try out the APIs and view their documentation easily. You can access the IDC API SwaggerUI [here](https://api-dot-idc-dev.appspot.com/v1/swagger).

### What is an endpoint?

An endpoint is the *call* for a specific functionality of an API. For example, `/collections` at the end of the API request URL `https://api-dot-idc.appspot.com/v1/collections` is an endpoint that returns (or GETs) information about the available collections.

### IDC API Documentation

Detailed documentation on the IDC API can be found in the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev/).

### IDC API URL Preamble
We define the IDC API URL preamble as a variable so that we can easily change it.

In [2]:
idc_api_preamble = 'https://api-dot-idc-dev.appspot.com/v1'
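
A minimal sketch of how the preamble and an endpoint path combine into a full request URL. The helper name `endpoint_url` is an assumption made for illustration; it is not part of the IDC API or of any library.

```python
def endpoint_url(preamble, endpoint):
    """Join the API preamble and an endpoint path with exactly one slash."""
    return '{}/{}'.format(preamble.rstrip('/'), endpoint.lstrip('/'))


# Same value as idc_api_preamble above, repeated so the sketch runs on its own
idc_api_preamble = 'https://api-dot-idc-dev.appspot.com/v1'

about_url = endpoint_url(idc_api_preamble, '/about')
collections_url = endpoint_url(idc_api_preamble, 'collections')
print(about_url)
print(collections_url)
```

Changing the preamble in one place then changes every URL built this way.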

### Python library `requests`

In order to use the IDC APIs with Python, the `requests` library needs to be installed and then imported.

In [3]:
# Install requests if needed
# pip install requests

# Import the requests library
import requests

## Use cases for IDC APIs

The IDC APIs can be used for a number of different tasks that involve the Google Cloud Platform and BigQuery. They can be used to subset data into cohorts or to access cohorts that have been created using the IDC WebApp. The location of the DICOM objects associated with a cohort can also be obtained.

## Example: GET `/about` Endpoint



We will first explore the `/about` endpoint by sending a GET request to the API. This endpoint returns information about the IDC API, such as links to the SwaggerUI interface and to the IDC User Guide.

In [4]:
# First submit the 'get' request to the API
about_req = requests.get('{}/about'.format(idc_api_preamble))

Now that we have the response, we check whether the request was successful. If it was, the status code will be 200; if something went wrong, the status code may be something like 404 or 503. If you have received an error code, you can check out Google's Troubleshooting response errors guide.



In [5]:
# Check that there wasn't an error with the request
if about_req.status_code != 200:
  # Print the error code if something went wrong
  print(about_req.status_code)
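
A bare number such as 503 is not very informative. The following sketch, which uses only the Python standard library, turns a status code into a readable summary. The helper names `describe_status` and `is_success` are assumptions made for illustration.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable summary such as '404 Not Found' for an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes that are not registered in the standard library's table
        phrase = 'Unknown status'
    return '{} {}'.format(status_code, phrase)


def is_success(status_code):
    """Treat any 2xx code as success, not only 200."""
    return 200 <= status_code < 300


for code in (200, 404, 503):
    print(describe_status(code), '-> success' if is_success(code) else '-> error')
```

With a real response, `describe_status(about_req.status_code)` would give the summary. The `requests` library also offers `Response.raise_for_status()`, which raises an exception for 4xx and 5xx codes instead of returning a value to check.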

Finally, we will print out the information that we have received from the API. The response is returned as a dictionary, though responses can also be a combination of dictionaries and lists depending on which endpoint is called. This means that you can access different data in the response the same way that you would access dictionaries and lists, as demonstrated below.

In [7]:
# Print the full response
print("Full response:\n")
print(about_req.json(), end='\n\n')

# Print the message portion of the response
print("Message:\n")
print(about_req.json()['message'], end='\n\n')

# Print the documentation portion of the response
print("Documentation:\n")
print(about_req.json()['documentation'])

Full response:

{'code': 200, 'documentation': 'SwaggerUI interface available at <https://api-dot-idc-dev.appspot.com/v1/swagger/>.Documentation available at <https://https://app.gitbook.com/login/imagingdatacommons/idc-user-guide>', 'message': 'Welcome to the NCI IDC API, Version 1'}

Message:

Welcome to the NCI IDC API, Version 1

Documentation:

SwaggerUI interface available at <https://api-dot-idc-dev.appspot.com/v1/swagger/>.Documentation available at <https://https://app.gitbook.com/login/imagingdatacommons/idc-user-guide>


The requests library makes it easy to use the APIs! Next we will cover a few of the other informational APIs.

## Example: GET `/versions` endpoint

Over time, the set of data hosted by the IDC will change. For the most part, such changes will be due to new data having been added. The totality of IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set. 

The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and the corresponding DICOM instances added to the IDC hosted data. Moreover, additional patients might well be added to the IDC data set over time. This means that the set of subjects defined by some filter set will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a filter set and an IDC data version.

The `/versions` endpoint returns information about the defined IDC versions. This information includes the data sources (BQ tables) containing the data of each version, as well as the set of programs (sets of collections) belonging to a version. This endpoint returns a more complicated JSON object which has a combination of lists and dictionaries. We will first send the request and then check whether the response carries an error code.

In [13]:
# Retrieve the response from the API endpoint
versions_req = requests.get('{}/versions'.format(idc_api_preamble))

# Check that there wasn't an error with the request
if versions_req.status_code != 200:
  # Print the error code and message if something went wrong
  print(versions_req.json())

We are going to use the `json` library in order to view the response more easily.

In [14]:
# json is part of the Python standard library, so no installation is needed
import json

In [15]:
# Create a variable with the JSON output
versions_json = json.dumps(versions_req.json(), sort_keys=True, indent=4)

# Print the versions JSON text
print(versions_json)

{
    "code": 200,
    "versions": [
        {
            "active": true,
            "data_sources": [
                {
                    "data_type": "Clinical, Biospecimen, and Mutation Data",
                    "name": "isb-cgc.TCGA_bioclin_v0.Biospecimen"
                },
                {
                    "data_type": "Clinical, Biospecimen, and Mutation Data",
                    "name": "isb-cgc.TCGA_bioclin_v0.clinical_v1"
                },
                {
                    "data_type": "Image Data",
                    "name": "idc-dev.metadata.dicom_pivot_wave1"
                }
            ],
            "date_active": "2020-10-06",
            "idc_data_version": "1.0",
            "programs": [
                {
                    "description": null,
                    "name": "The Cancer Genome Atlas",
                    "short_name": "TCGA"
                },
                {
                    "description": null,
                    "name": "Quan

The returned data is a combination of dictionaries and lists. We see that, as of this writing, there is a single IDC version, "1.0", that was activated on 2020-10-06. Next we will iterate over the JSON object to neatly list the data sources and programs in each version.

In [16]:
# Print out the number of IDC data versions
print('Number of IDC data versions: {}'.format(len(versions_req.json()['versions'])))

#...and for each version, print out a count of the version's programs, and a list of program names.
for version in versions_req.json()['versions']:
  print('version {} has {} programs:'.format(version['idc_data_version'], len(version['programs'])))
  for program in version['programs']:
    print('\tProgram short name {}, full name {}'.format(program['short_name'],program['name']))


Number of IDC data versions: 1
version 1.0 has 5 programs:
	Program short name TCGA, full name The Cancer Genome Atlas
	Program short name QIN, full name Quantitative Imagine Network
	Program short name ISPY, full name I-SPY TRIAL
	Program short name LIDC, full name Lung Image Database Consortium
	Program short name NSCLCR, full name NSCLC-Radiomics
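
The same kind of traversal works for the data sources of each version. The sketch below runs on an abbreviated copy of the `/versions` output shown earlier (only one program is kept), so it needs no network access. The helper name `sources_by_type` is an assumption made for illustration.

```python
import json

# Abbreviated copy of the /versions output shown above (one program kept)
sample_text = '''
{
  "code": 200,
  "versions": [
    {
      "active": true,
      "data_sources": [
        {"data_type": "Clinical, Biospecimen, and Mutation Data",
         "name": "isb-cgc.TCGA_bioclin_v0.Biospecimen"},
        {"data_type": "Clinical, Biospecimen, and Mutation Data",
         "name": "isb-cgc.TCGA_bioclin_v0.clinical_v1"},
        {"data_type": "Image Data",
         "name": "idc-dev.metadata.dicom_pivot_wave1"}
      ],
      "date_active": "2020-10-06",
      "idc_data_version": "1.0",
      "programs": [
        {"description": null, "name": "The Cancer Genome Atlas", "short_name": "TCGA"}
      ]
    }
  ]
}
'''
sample = json.loads(sample_text)


def sources_by_type(version):
    """Group a version's data source names by their data_type."""
    grouped = {}
    for source in version['data_sources']:
        grouped.setdefault(source['data_type'], []).append(source['name'])
    return grouped


for version in sample['versions']:
    print('version {} (active since {}):'.format(
        version['idc_data_version'], version['date_active']))
    for data_type, names in sources_by_type(version).items():
        print('\t{}: {}'.format(data_type, ', '.join(names)))
```

Replacing `sample` with `versions_req.json()` would apply the same grouping to a live response.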


## Example: GET `/collections` endpoint

A *collection* is a set of DICOM data provided by a single source. Collections are further categorized as Original collections or Analysis collections. 

Original collections are comprised primarily of DICOM image data that was obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.

Analysis collections are comprised of derived DICOM data that was generated by analyzing other (typically Original) collections. Typically such analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis collections include segmentations, annotations and further processing of original images. Note that some Original collections include such data, though most of the data in Original collections are original images.

### Programs

The programs that we listed above are sets of original collections. The collections in a program are produced by a single source. Some programs provide additional non-imaging data. For example, the TCGA program provides extensive ancillary clinical and genomics data about each of the patients in the program. 

The `/collections` endpoint returns data about collections in a specified program for some IDC data version. If no program is specified, it returns data about the collections in all programs for some IDC data version. If a version is not specified then `/collections` defaults to the current IDC data version.


We will request the collection data for the TCGA program in IDC data version 1.0.

The program and version are passed as *query parameters* in a *query string*. The requests library accepts a dictionary of query parameters.

In [17]:
query_string = dict(
    program_name = "TCGA",
    idc_data_version = "1.0"
    )

collections_req = requests.get('{}/collections'.format(idc_api_preamble),
                    params=query_string)
# Check that there wasn't an error with the request
if collections_req.status_code != 200:
  # Print the error code and message if something went wrong
  print(collections_req.json())
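
To see what a query string looks like on the wire, the following sketch encodes the same dictionary with the standard library. This approximates what `requests` does with the `params` argument; it is an illustration, not the library's internal code.

```python
from urllib.parse import urlencode

# Same values as in the cell above, repeated so the sketch runs on its own
idc_api_preamble = 'https://api-dot-idc-dev.appspot.com/v1'
query_string = dict(program_name="TCGA", idc_data_version="1.0")

# urlencode percent-encodes each value and joins the pairs with '&'
full_url = '{}/collections?{}'.format(idc_api_preamble, urlencode(query_string))
print(full_url)
```

After a real call, `collections_req.url` shows the URL that `requests` actually sent.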

In [18]:
# Create a variable with the JSON output
collections_json = json.dumps(collections_req.json(), sort_keys=True, indent=4)

# Print the collections JSON text
print(collections_json)

{
    "code": 200,
    "programs": [
        {
            "collections": [
                {
                    "active": true,
                    "cancer_type": "Prostate Cancer",
                    "collection_id": "tcga_prad",
                    "collection_type": "Original",
                    "date_updated": "2016-08-29",
                    "description": "<div>\n\t<strong>Note:&nbsp;This collection has special restrictions on its usage. See <a href=\"https://wiki.cancerimagingarchive.net/x/c4hF\" target=\"_blank\">Data Usage Policies and Restrictions</a>.</strong></p>\n<div>\n\t&nbsp;</p>\n<div>\n\t<span>The <a href=\"http://imaging.cancer.gov/\" target=\"_blank\"><u>Cancer Imaging Program (CIP)</u></a></span><span>&thinsp;</span><span> is working directly with primary investigators from institutes participating in TCGA to obtain and load images relating to the genomic, clinical, and pathological data being stored within the <a href=\"http://tcga-data.nci.nih.gov/\" target

Metadata for original collections has been obtained from the [TCIA Data Collections page](https://www.cancerimagingarchive.net/collections/), and for analysis collections from the [TCIA Analysis Results page](https://www.cancerimagingarchive.net/tcia-analysis-results/). Note that the `idc_data_versions` component of each collection is a list because any particular collection will almost certainly be available in multiple IDC data versions.

## Example: POST `/cohorts/preview` endpoint

The `/cohorts/preview` endpoint takes a *filterSet* and other values as a *body* parameter and returns a JSON-encoded hierarchical representation of the collections, patients, studies, series and instances in the cohort defined by the filterSet, and other metadata. The filterSet includes the IDC data version against which it is applied. 

A cohort is not actually created by this endpoint. Doing so requires authenticating to the API. That process is addressed in a subsequent section.

A filter is a list of *attribute*,*values* pairs, where *values* is a list of one or more values which must be satisfied. 

In the following, we construct a dictionary, `cohortSpec`, containing the name and a description for the cohort, as well as a filterSet that selects for subjects in either the TCGA-LUAD or TCGA-KIRC collections and further restricts the cohort by modality (CT or MR) and race (ASIAN).



In [19]:
filterSet = {
    "idc_data_version": "1.0",
    "filters": {
        "collection_id": [ "TCGA-LUAD", "TCGA-KIRC" ],
        "Modality": ["CT", "MR"],
        "race": ["ASIAN"]}}

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filterSet": filterSet}



This and other related endpoints are paged. The caller can limit the amount of data returned by each call, and iteratively call the endpoint until all data has been received. We will limit the returned data to 10 rows.

We can also control the depth of the hierarchy returned by the endpoint. Below we set the level to "Instance".

In [20]:
query_string = dict(
    page_size = 10,
    return_level = "Instance"
)
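
With the query parameters defined, the remaining input is the request body. The sketch below shows one way to assemble a `cohortSpec` dictionary from its parts and to check, before anything is sent, that every filter value is a list. The helper `make_cohort_spec` and its validation rule are assumptions made for illustration; the API may apply different or additional checks.

```python
import json


def make_cohort_spec(name, description, idc_data_version, filters):
    """Build the request body for /cohorts/preview from its parts."""
    for attribute, values in filters.items():
        # Assumption: each filter value must be a non-empty list,
        # even when only a single value is wanted
        if not isinstance(values, list) or not values:
            raise ValueError('Filter "{}" needs a non-empty list of values'.format(attribute))
    return {
        "name": name,
        "description": description,
        "filterSet": {"idc_data_version": idc_data_version, "filters": filters},
    }


spec = make_cohort_spec(
    "testcohort", "Test description", "1.0",
    {"collection_id": ["TCGA-LUAD", "TCGA-KIRC"],
     "Modality": ["CT", "MR"],
     "race": ["ASIAN"]})

# This is the text that would be sent as the JSON body of the POST request
print(json.dumps(spec, indent=1))
```

The resulting dictionary has the same shape as the `cohortSpec` defined by hand above.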

We are now ready to call the endpoint. Note that /cohorts/preview is a POST method, so we call requests.post()

In [21]:
cohort_req = requests.post('{}/cohorts/preview/'.format(idc_api_preamble),
                    params=query_string, json=cohortSpec)

# Check that there wasn't an error with the request
if cohort_req.status_code != 200:
  # Print the error code and message if something went wrong
  print(cohort_req.json())

We will prettyprint the results for easier comprehension:

In [22]:
print(json.dumps(cohort_req.json(), indent=1))

{
 "code": 200,
 "cohort": {
  "description": "Test description",
  "filterSet": {
   "filters": {
    "Modality": [
     "CT",
     "MR"
    ],
    "collection_id": [
     "TCGA-LUAD",
     "TCGA-KIRC"
    ],
    "race": [
     "ASIAN"
    ]
   },
   "idc_data_version": "1.0"
  },
  "name": "testcohort",
  "sql": ""
 },
 "cohortObjects": {
  "collections": [
   {
    "collection_id": "tcga-kirc",
    "patients": [
     {
      "patient_id": "TCGA-CJ-4899",
      "studies": [
       {
        "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.222390132567115084999595655241",
        "series": [
         {
          "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.304062258378118200791444181901",
          "instances": [
           {
            "SOPInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.102158136938707631673416536391"
           },
           {
            "SOPInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.103667544438182365352458935319"
           },
           

The returned data includes the cohort name, description and filterSet that we passed as parameters, including the IDC data version (1.0) that we specified in the filterSet.

The cohortObjects dictionary is a hierarchical representation of the first 10 rows of the cohort data. We can see that there are a total of 1581 DICOM instances in the cohort.

Because we limited the page size to 10, this cohortObjects dictionary includes data from only a single collection, patient and study. That study has at least three series, each having several instances.

We could also have requested that the BigQuery SQL that produced these results be returned.


### Paged results

It can be seen that the above results include a `next_page` token. This token can be passed as a parameter in a subsequent invocation of /cohorts/preview to obtain additional data.

Following is a utility that merges the separate hierarchies returned by /cohorts/preview into a single hierarchy.

In [23]:
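def merge_cohort_hierarchies(src, dst, level):
    """Merge the hierarchy `src` into `dst` in place, starting at the given level."""
    # Names of the child lists and of the ID keys at each level of the hierarchy
    levels = ["collections", "patients", "studies", "series", "instances"]
    keys = ["collection_id", "patient_id", "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"]
    for src_item in src:
        found = False
        for dst_item in dst:
            if src_item[keys[level]] == dst_item[keys[level]]:
                # Same object on both sides: merge its children, if it has any.
                # The child list is absent below the requested return_level.
                if level + 1 < len(levels) and levels[level+1] in src_item:
                    merge_cohort_hierarchies(src_item[levels[level+1]], dst_item[levels[level+1]], level+1)
                found = True
                break
        if not found:
            # Object not seen before: add it together with all of its children
            dst.append(src_item)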
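
To make the merge behavior concrete, the sketch below applies the same idea to two tiny hand-made pages. The function `merge_pages` is a compact restatement written for this illustration so that the block runs on its own, and all IDs are placeholders rather than real IDC identifiers.

```python
LEVELS = ["collections", "patients", "studies", "series", "instances"]
KEYS = ["collection_id", "patient_id", "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"]


def merge_pages(src, dst, level=0):
    """Merge the list `src` into `dst` in place, matching items by their ID at this level."""
    for src_item in src:
        match = next((d for d in dst if d[KEYS[level]] == src_item[KEYS[level]]), None)
        if match is None:
            # New object: keep it with all of its children
            dst.append(src_item)
        elif level + 1 < len(LEVELS) and LEVELS[level + 1] in src_item:
            # Known object: descend and merge the child lists
            child = LEVELS[level + 1]
            merge_pages(src_item[child], match.setdefault(child, []), level + 1)


# Two made-up pages at return level "Series"
page_1 = [{"collection_id": "c1", "patients": [
    {"patient_id": "p1", "studies": [
        {"StudyInstanceUID": "st1", "series": [{"SeriesInstanceUID": "se1"}]}]}]}]
page_2 = [{"collection_id": "c1", "patients": [
    {"patient_id": "p1", "studies": [
        {"StudyInstanceUID": "st1", "series": [{"SeriesInstanceUID": "se2"}]}]},
    {"patient_id": "p2", "studies": [
        {"StudyInstanceUID": "st2", "series": [{"SeriesInstanceUID": "se3"}]}]}]}]

merge_pages(page_2, page_1)
series_of_st1 = page_1[0]["patients"][0]["studies"][0]["series"]
print([s["SeriesInstanceUID"] for s in series_of_st1])
print([p["patient_id"] for p in page_1[0]["patients"]])
```

The second page adds a series to a study that already exists and adds a new patient; after the merge both appear under the single collection in `page_1`.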


We are now ready to call /cohorts/preview repeatedly to obtain and merge all data. We are going to intentionally limit the total objects in the returned data by setting the return level to `Series`, so that the printed results will not be too long.

In [24]:
def paged_cohorts_preview(query_string, cohortSpec):

    # Get the first page
    response = requests.post('{}/cohorts/preview/'.format(idc_api_preamble),
                        params=query_string, json=cohortSpec)

    # Check that there wasn't an error with the request
    if response.status_code != 200:
      # Print the error code and message if something went wrong
      print(response.json())
      return None

    cohort= response.json()['cohort']
    cohortObjects = response.json()['cohortObjects']
    print("Total objects found: {}".format(cohortObjects['totalFound']))
    print("Objects returned: {}".format(cohortObjects['rowsReturned']))

    # We will merge all results into allCollections
    allCollections = cohortObjects["collections"]

    #Get the next_page token
    next_page = response.json()['next_page']
    if not next_page:
      # We are done
      return allCollections

    #Now get subsequent pages
    totalCollections = allCollections
    # Keep a running total of objects returned
    totalRowsReturned = cohortObjects['rowsReturned']

    # next_page will be null when all data has been returned
    while next_page:
        query_string['next_page'] = next_page
        response = requests.post('{}/cohorts/preview/'.format(idc_api_preamble),
                      params=query_string, json=cohortSpec)
  

        # Check that there wasn't an error with the request
        if response.status_code != 200:
            # Print the error code and message if something went wrong
            print(response.json())
            return None

        cohort = response.json()['cohort']
        cohortObjects = response.json()['cohortObjects']

        rowsReturned = cohortObjects["rowsReturned"]
        totalRowsReturned += rowsReturned

        collections = cohortObjects["collections"]

        # Merge the new results with previously returned results
        merge_cohort_hierarchies(collections, allCollections, 0)
        # allCollections.extend(collections)
        next_page = response.json()['next_page']

    print("Total objects: {}".format(totalRowsReturned))

    return allCollections
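
The paging pattern itself can be separated from the HTTP call. The sketch below is one possible arrangement, not the notebook's method: a generator follows `next_page` tokens and receives the page-fetching step as a function, so it can be exercised here with fake pages instead of the live API. The name `iter_pages`, the token strings and the row counts are all made up for illustration.

```python
def iter_pages(fetch_page):
    """Yield each page's JSON, following next_page tokens until none is returned.

    fetch_page(token) must return the decoded JSON of one page; token is None
    for the first page.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield page
        token = page.get('next_page')
        if not token:
            break


# Stand-in for the API: three fake pages keyed by the token that requests them
fake_pages = {
    None: {'cohortObjects': {'rowsReturned': 10}, 'next_page': 'tok-1'},
    'tok-1': {'cohortObjects': {'rowsReturned': 10}, 'next_page': 'tok-2'},
    'tok-2': {'cohortObjects': {'rowsReturned': 6}, 'next_page': None},
}

total_rows = sum(page['cohortObjects']['rowsReturned']
                 for page in iter_pages(lambda token: fake_pages[token]))
print('Total objects: {}'.format(total_rows))
```

Against the real API, `fetch_page` would be a small function that adds the token to the query parameters, posts to `/cohorts/preview/` and returns `response.json()`.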


In [25]:
filterSet = {
  "idc_data_version": "1.0",
  "filters": {
      "collection_id": [ "TCGA-LUAD", "TCGA-KIRC" ],
      "Modality": ["CT", "MR"],
      "race": ["ASIAN"]}}

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filterSet": filterSet}

# Get the first page
query_string = dict(
    page_size = 10,
    return_level = "Series"
)
merged_results = paged_cohorts_preview(query_string, cohortSpec)

if not merged_results:
  print("paged_cohorts_preview failed")

print(json.dumps(merged_results, indent=1))



Total objects found: 26
Objects returned: 10
Total objects: 
[
 {
  "collection_id": "tcga-kirc",
  "patients": [
   {
    "patient_id": "TCGA-CJ-4899",
    "studies": [
     {
      "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.222390132567115084999595655241",
      "series": [
       {
        "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.209568983417342975065810158572"
       },
       {
        "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.304062258378118200791444181901"
       },
       {
        "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.338369725732670089585867585271"
       },
       {
        "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.1706.4004.613317179370101961283881481386"
       }
      ]
     }
    ]
   },
   {
    "patient_id": "TCGA-B0-4821",
    "studies": [
     {
      "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6450.4004.136179540832399945436413884813",
      "series": [
       {
        "SeriesInstanceUID": "1.3.6.1.4.1

## Example: GET `/cohorts/{cohort_id}/manifest` Endpoint

This last section focuses on the GET `/cohorts/{cohort_id}/manifest` endpoint, which returns a manifest of the objects in the cohort identified by some *cohort_id*. Unlike the `/cohorts/preview` endpoint (and the corresponding `/cohorts/preview/manifest` endpoint), `/cohorts/{cohort_id}/manifest` returns a manifest for a cohort that was previously defined via the IDC API or the IDC webapp.

In order to be able to save a cohort definition in the IDC webapp, the user must authenticate their identity with the web app. Similarly, in order to create, query and/or delete a cohort via the API, the user must authenticate their identity with the API. This section will first step through the authentication/authorization process when working with the IDC API.


### Notes on Authorization and Credentials
The following steps are required to use an API that Requires Authorization:

1. Create a Credential File on your local machine by using the [idc_auth.py](https://github.com/ImagingDataCommons/IDC-API/tree/master/scripts/idc_auth.py) script from the [IDC-API Repository](https://github.com/ImagingDataCommons/IDC-API.git)
  * This script can be run from the command line or from within Python but should be run on your local machine.
2. Find the location of the Credential File on your local machine
  * By default, the script will save the file in the user's folder with the file name: ".idc_credentials"
3. Load the Credential file into the cloud environment you are using.

In [None]:
# If you skipped earlier sections, you will need these two packages to run the
# code below
# Install requests if needed
#pip install requests

# Import the json library (part of the Python standard library)
#import json

# Import the requests library
#import requests

In [57]:
# import os
import os
# Import files helper for Colab
from google.colab import files

In [58]:
# First delete any existing .idc_credentials. Otherwise files.upload() will name the new file .idc_credentials (1)
try:
  os.remove('./.idc_credentials')
except FileNotFoundError:
  print('.idc_credentials not found')

# Upload your credentials to the cloud environment
uploaded = files.upload()

.idc_credentials not found


Saving .idc_credentials to .idc_credentials


Now that we have the Credentials file created and uploaded to the cloud environment, we can open the file to create the header information needed for the API to verify that you have Authorization.

In [59]:
# Open the credentials file and parse its JSON contents
with open(".idc_credentials", "r") as credentials_file:
    token = json.load(credentials_file)
# Get Credentials from the token
creds = token['token_response']['id_token']
# Create a json object for requests header
head = {'Authorization': 'Bearer ' + creds}
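
The same steps can be wrapped in a function that reports a clear error when the file does not have the expected layout. This is a minimal sketch: the helper name `bearer_header` is an assumption, and the credentials text below is a placeholder, not a real token.

```python
import json


def bearer_header(credentials_text):
    """Build the Authorization header from the text of a credentials file."""
    credentials = json.loads(credentials_text)
    try:
        id_token = credentials['token_response']['id_token']
    except KeyError as missing:
        raise ValueError('Credentials file has no {} entry'.format(missing)) from None
    return {'Authorization': 'Bearer ' + id_token}


# Placeholder contents; a real file holds a long token generated by idc_auth.py
fake_credentials = json.dumps({'token_response': {'id_token': 'PLACEHOLDER_TOKEN'}})
print(bearer_header(fake_credentials))
```

The returned dictionary is what gets passed as `headers` in the authenticated requests that follow.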

**Note:** the credentials file will expire after 1 hour and a new one will need to be generated. If a new file is not generated with the idc_auth script, you can delete the original file and try running the script again.

We can now proceed to create a cohort. The POST `/cohorts` endpoint creates a cohort. We will use the same cohort definition as previously.

In [66]:
filterSet = {
  "idc_data_version": "1.0",
  "filters": {
      "collection_id": [ "TCGA-LUAD", "TCGA-KIRC" ],
      "Modality": ["CT", "MR"],
      "race": ["ASIAN"]}}

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filterSet": filterSet}

response = requests.post('{}/cohorts/'.format(idc_api_preamble),
                      json=cohortSpec, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print(response.json())

print(json.dumps(response.json(), sort_keys=True, indent=4))

cohort_id = response.json()['cohort_properties']['cohort_id']

{
    "code": 200,
    "cohort_properties": {
        "cohort_id": 49,
        "description": "Test description",
        "filterSet": {
            "filters": {
                "Modality": [
                    "CT",
                    "MR"
                ],
                "collection_id": [
                    "TCGA-LUAD",
                    "TCGA-KIRC"
                ],
                "race": [
                    "ASIAN"
                ]
            },
            "idc_data_version": "1.0"
        },
        "name": "testcohort"
    }
}


Note that the response includes the cohort_id of the newly created cohort. We will use this ID when querying for the cohort's manifest. Note also that the response repeats the filterSet and other cohort metadata.

The GET `/cohorts/{cohort_id}/manifest` endpoint returns a manifest of *access_methods* for the objects in cohort cohort_id. Please refer to the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev/) for further information on manifests.

The API is configurable with respect to the data which it returns for each object. 

In [67]:
query_string = dict(
    sql = True,
    Collection_IDs = True,
    Patient_IDs = True,
    StudyInstanceUIDs = True,
    SeriesInstanceUIDs = True,
    SOPInstanceUIDs = True,
    Collection_DOIs = True,
    access_method =  'url',
    page_size = 10
)

response = requests.get('{}/cohorts/{}/manifest'.format(idc_api_preamble, cohort_id),
                      params=query_string, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print(response.json())

print(json.dumps(response.json(), sort_keys=True, indent=4))


{
    "code": 200,
    "cohort": {
        "cohort_id": 49,
        "description": "Test description",
        "filterSet": {
            "filters": {
                "Modality": [
                    "CT",
                    "MR"
                ],
                "collection_id": [
                    "TCGA-LUAD",
                    "TCGA-KIRC"
                ],
                "race": [
                    "ASIAN"
                ]
            },
            "idc_data_version": "1.0"
        },
        "name": "testcohort",
        "sql": "\n            #standardSQL\n    \n        SELECT dicom_pivot_wave1.collection_id,dicom_pivot_wave1.PatientID,dicom_pivot_wave1.StudyInstanceUID,dicom_pivot_wave1.SeriesInstanceUID,dicom_pivot_wave1.SOPInstanceUID,dicom_pivot_wave1.source_DOI,dicom_pivot_wave1.gcs_url\n        FROM `idc-dev.metadata.dicom_pivot_wave1` dicom_pivot_wave1 \n        \n        JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` clinical_v1\n        ON dicom_pivot_wave1.Patien

As with POST `/cohorts/preview/` (and POST `/cohorts/preview/manifest` and GET `/cohorts/{cohort_id}`), the returned next_page token can be used in subsequent calls to obtain additional data.

The query string specified that the API should return url access_methods. As can be seen, these are URLs with the 'gs' scheme and can be used directly by the Google gsutil CLI.
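
If the bucket and object name are needed separately, for example to hand them to a storage client, a `gs` URL can be taken apart with the standard library. The URL below is hypothetical and the helper name `split_gcs_url` is an assumption made for illustration.

```python
from urllib.parse import urlparse


def split_gcs_url(url):
    """Split a gs:// URL into (bucket, object_name)."""
    parsed = urlparse(url)
    if parsed.scheme != 'gs' or not parsed.netloc:
        raise ValueError('Not a gs:// URL: {}'.format(url))
    return parsed.netloc, parsed.path.lstrip('/')


# Hypothetical URL; real manifests contain the actual bucket and object names
bucket, object_name = split_gcs_url('gs://example-bucket/some/folder/object.dcm')
print(bucket)
print(object_name)
```

The unsplit URL is the form that `gsutil cp` accepts as a source.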

Alternatively we can request that the API return DOIs:

In [68]:
query_string = dict(
    sql = True,
    Collection_IDs = True,
    Patient_IDs = True,
    StudyInstanceUIDs = True,
    SeriesInstanceUIDs = True,
    SOPInstanceUIDs = True,
    Collection_DOIs = True,
    access_method =  'doi',
    page_size = 10
)

response = requests.get('{}/cohorts/{}/manifest'.format(idc_api_preamble, cohort_id),
                      params=query_string, headers = head)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print(response.json())

print(json.dumps(response.json(), sort_keys=True, indent=4))

{
    "code": 200,
    "cohort": {
        "cohort_id": 49,
        "description": "Test description",
        "filterSet": {
            "filters": {
                "Modality": [
                    "CT",
                    "MR"
                ],
                "collection_id": [
                    "TCGA-LUAD",
                    "TCGA-KIRC"
                ],
                "race": [
                    "ASIAN"
                ]
            },
            "idc_data_version": "1.0"
        },
        "name": "testcohort",
        "sql": "\n            #standardSQL\n    \n        SELECT dicom_pivot_wave1.collection_id,dicom_pivot_wave1.PatientID,dicom_pivot_wave1.StudyInstanceUID,dicom_pivot_wave1.SeriesInstanceUID,dicom_pivot_wave1.SOPInstanceUID,dicom_pivot_wave1.source_DOI,dicom_pivot_wave1.crdc_instance_uuid\n        FROM `idc-dev.metadata.dicom_pivot_wave1` dicom_pivot_wave1 \n        \n        JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` clinical_v1\n        ON dicom_pivot_w

Do not confuse the `source_DOI` returned value with the `doi` returned value. A `source_DOI` is the DOI of a TCIA page that describes the collection. A `source_DOI` can be resolved at https://doi.org.

A `doi` can be resolved to a DICOM object. Please refer to the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev/) for further information on access methods.