<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master-whc/notebooks/idc_api/How_to_use_the_IDC_V2_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Use the IDC V2 API

## Overview of this notebook
This notebook is designed as a quick introduction to version 2 of the IDC API and how to access it from Python.

Topics covered:
* Overviews of APIs, Swagger, JSON, endpoints
* Use cases for IDC APIs
* Examples of IDC API endpoints

### Overview of APIs
An API or application-programming interface is a software intermediary that allows two applications to talk to each other. In other words, an API is the messenger that delivers your request to the provider that you’re requesting it from and then delivers the response back to you [(Wikipedia)](https://en.wikipedia.org/wiki/Application_programming_interface). Each action that an API can take is called an "endpoint".

Some useful tutorials and quick start guides on APIs are:
* [GDC's Getting Started guide for APIs](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)
* [API Integration in Python](https://realpython.com/api-integration-in-python/)
* [Python API Tutorial: Getting Started with APIs](https://www.dataquest.io/blog/python-api-tutorial/)

### What is an HTTP Message?

Clients and the IDC API server communicate via HTTP messages. An overview of HTTP messaging can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages).
The IDC API uses the GET, POST and DELETE HTTP methods.

### What is JSON?

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans and machines to work with. More information can be found at [json.org](https://www.json.org/).
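
As a minimal sketch of how JSON maps onto Python objects, the following uses only the standard `json` module. The JSON text is a made-up example for illustration, not an actual IDC API response.

```python
import json

# A JSON document as it might arrive over the network (a plain string).
# The content is illustrative only.
text = '{"Modality": ["MR", "PT"], "page_size": 10}'

# json.loads parses the string into Python objects (dict, list, str, int, ...).
data = json.loads(text)
modalities = data["Modality"]

# json.dumps serializes Python objects back into a JSON string.
round_trip = json.dumps(data, sort_keys=True)

print(modalities)
print(round_trip)
```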

### What is SwaggerUI?

[SwaggerUI](https://swagger.io/tools/swagger-ui/) is a user interface that allows users to try out the APIs and view their documentation easily. You can access the IDC API SwaggerUI [here](https://api.imaging.datacommons.cancer.gov/v2/swagger).

### What is an endpoint?

An endpoint is the *call* for a specific functionality of an API. For example, `/collections` at the end of the API request URL `https://api.imaging.datacommons.cancer.gov/v2/collections` is an endpoint that returns (or GETs) information about the available IDC collections.

### IDC API Documentation

Detailed documentation on the IDC API can be found in the [API](https://learn.canceridc.dev/api/getting-started) section of the [IDC User Guide](https://learn.canceridc.dev/).

### IDC API URL Preamble
We define the IDC API URL preamble as a variable so that we can easily change it.

In [None]:
idc_api_preamble = 'https://api.imaging.datacommons.cancer.gov/v2'
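
One possible way to build endpoint URLs from the preamble is a small helper that joins the two parts with exactly one slash. The function name `endpoint_url` is our own choice for this sketch and is not part of the IDC API or of `requests`.

```python
idc_api_preamble = 'https://api.imaging.datacommons.cancer.gov/v2'

def endpoint_url(path, preamble=idc_api_preamble):
    """Join the API preamble and an endpoint path with exactly one slash."""
    return f"{preamble.rstrip('/')}/{path.lstrip('/')}"

# Both spellings of the path give the same URL.
print(endpoint_url('/collections'))
print(endpoint_url('collections'))
```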

### Python library `requests`

In this notebook, we use the Python Requests HTTP library to access IDC API endpoints.

In [None]:
# Install requests if needed
# pip install requests

# Import the requests library
import requests

Finally, we define a small pretty-printing function that we will use to print the JSON received from the API.

In [None]:
# import json
import json
def pretty(response):
  print(json.dumps(response.json(), sort_keys=True, indent=4))


## Use cases for IDC APIs

The IDC API can be used for a number of different tasks that involve interacting with the Google Cloud Platform and BigQuery. It can be used to subset data into cohorts or to access cohorts that have been created using the IDC WebApp. It can also return the location of the DICOM objects associated with a cohort.

## Example: `/about` GET Endpoint



We are first going to explore the `about` endpoint using the 'GET' request to the API. This API will give you such information about the IDC API as links to the Swagger UI interface and to the IDC User Guide.

In [None]:
# First submit the 'get' request to the API
response = requests.get('{}/about'.format(idc_api_preamble))

Now that we have the response, we check whether the request was successful. If it was, the status code will be 200; if something went wrong, the status code may be something like 404 or 503. If you have received any error codes, you can check out Google's Troubleshooting response errors guide.



In [None]:
# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))


# Print the response
pretty(response)
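
As an aside on reading status codes, the standard library's `http.HTTPStatus` can translate a numeric code into its standard phrase. The helper `describe_status` below is a hypothetical name used only for this sketch; it works on plain integers, so it could be applied to `response.status_code`.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the standard HTTP status codes.
        return f"{status_code} (unrecognized status code)"
    if 200 <= status_code < 300:
        outcome = "success"
    elif 400 <= status_code < 500:
        outcome = "client error"
    elif status_code >= 500:
        outcome = "server error"
    else:
        outcome = "informational or redirect"
    return f"{status_code} {phrase} ({outcome})"

for code in (200, 404, 503):
    print(describe_status(code))
```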

This response is returned as a dictionary, though responses can also be a combination of dictionaries and lists, depending on which endpoint is called. This means that you can access the data in a response the same way that you would access Python dictionaries and lists.
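
As a small illustration of that access pattern, the sketch below walks a nested structure of dictionaries and lists. The `payload` is a hypothetical stand-in for `response.json()`; its keys are placeholders and are not taken from any IDC API response.

```python
# Hypothetical payload shaped like a parsed JSON response.
# The keys and values are placeholders for illustration only.
payload = {
    "message": "example",
    "links": ["https://example.org/docs", "https://example.org/ui"],
    "items": [{"id": "a", "count": 3}, {"id": "b", "count": 5}],
}

# Index into a list stored under a dictionary key.
first_link = payload["links"][0]

# Iterate over a list of dictionaries.
ids = [item["id"] for item in payload["items"]]
total = sum(item["count"] for item in payload["items"])

# dict.get returns a default instead of raising KeyError for a missing key.
missing = payload.get("not_present", "n/a")

print(first_link, ids, total, missing)
```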

The requests library makes it easy to use the APIs. Next we will cover a few of the other informational APIs.

# Getting API metadata
Several API endpoints can be used to obtain metadata about IDC hosted data. Some of the metadata is also used in the parameterization of other API endpoints.

## `/versions` GET endpoint

IDC data is organized as a hierarchy. The top level of the hierarchy is the IDC version. An IDC version is a set of (versions of) collections, where a collection is a set of DICOM data resulting from some investigation. The following levels in the hierarchy (patient, study, series, instance) follow the DICOM model of the real world.

Over time, the set of collections and the data in collections may change. For the most part, such changes will be due to new data having been added. The totality of available IDC hosted data resulting from any such change is represented by a unique IDC data version ID. That is, each time that the set of publicly available data changes, a new IDC version is created that exactly defines the revised data set.

The IDC data version is intended to enable the reproducibility of research results. For example, consider a patient in the DICOM data model. Over time, new studies might be performed on a patient and become associated with that patient, and with the corresponding DICOM instances added to the IDC hosted data. Or additional patients might  be added to some collection over time. This means that the set of patients defined by some search operation (as defined by a "filter set") will change over time. Thus, for purposes of reproducibility, we define a cohort in terms of a filter set **and** an IDC data version.

The `/versions` endpoint returns information about the current IDC version and each of the previously defined IDC versions. This information includes the data sources (BigQuery tables) that contain the data of each version. We will first retrieve the response and then check whether it contains an error code.

In [None]:
# Retrieve the response from the API endpoint
response = requests.get('{}/versions'.format(idc_api_preamble))

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the versions JSON text
pretty(response)

We see that, as of this writing, there are seventeen IDC versions. The current or active version is idc_data_version 17.0.

## `/collections` GET endpoint

An *Original* collection is a set of DICOM data provided by a single source. Original collections are comprised primarily of DICOM image data, such as scans, obtained from some set of patients. Typically, the patients in an Original collection are related by a common disease.

The /collections endpoint returns data about all the Original collections for the active IDC data version.


In [None]:

response = requests.get('{}/collections'.format(idc_api_preamble))
# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))

# Print the collections JSON text
pretty(response)

## `/analysis_results` GET endpoint

An Analysis Result is a kind of collection comprised of derived DICOM data that was generated by analyzing other (usually Original) collections. Typically such an analysis is performed by a different entity than that which provided the original collection(s) on which the analysis is based. Examples of data in analysis results include segmentations, annotations and further processing of original images. Note that some Original Collections include such analysis results, though most of the data in Original collections are original images.

The /analysis_results API returns metadata on each available analysis result.

In [None]:
# Retrieve the response from the API endpoint
response = requests.get('{}/analysis_results'.format(idc_api_preamble))

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the analysis results JSON text
pretty(response)

## /filters GET endpoint

As mentioned earlier, and described in the IDC User Guide, a cohort is defined in terms of a filterset.  

A filterset is a list of filters, where each filter is given as a JSON encoded dict having the form:
```
{ attribute: [value,...]}
```
For example: `{"Modality": ["MR", "PT"]}`

The filters endpoint returns a list of the available filter attributes.
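
To make the filterset shape concrete, here is a minimal sketch of a structural check. It assumes the form used in the request bodies later in this notebook, where the filters are combined into a single dict mapping each attribute name to a list of values. The function `validate_filterset` is a hypothetical helper, not part of the IDC API, and it checks only the shape, not whether the attribute names or values are accepted by the server.

```python
import json

def validate_filterset(filterset):
    """Raise ValueError unless filterset maps attribute names to lists of values."""
    if not isinstance(filterset, dict):
        raise ValueError("a filterset must be a dict")
    for attribute, values in filterset.items():
        if not isinstance(attribute, str):
            raise ValueError(f"attribute name {attribute!r} is not a string")
        if not isinstance(values, list):
            raise ValueError(f"values for {attribute!r} must be given as a list")
    return filterset

filterset = validate_filterset({"Modality": ["MR", "PT"]})
# The validated filterset can be serialized for use in a request body.
print(json.dumps(filterset))
```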

In [None]:
# Retrieve the response from the API endpoint
response = requests.get('{}/filters'.format(idc_api_preamble))

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the filters JSON text
pretty(response)

The name of each attribute is returned along with its type and units where appropriate.

## /filters/values/{filter} GET endpoint

Some filters recognize only certain defined values. The /filters/values/{filter} endpoint returns the values accepted by a specified Categorical String or Categorical Numeric filter attribute.

For example, the following gets the values accepted by the *Modality* Categorical String filter attribute:

In [None]:
# Retrieve the response from the API endpoint
# Use a name that does not shadow Python's built-in filter()
filter_name = 'Modality'
response = requests.get(f'{idc_api_preamble}/filters/values/{filter_name}')

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(f'{response.reason}: {response.json()["message"]}')  # Print the error code if something went wrong

# Print the filter values JSON text
pretty(response)

## /fields GET endpoint
The IDC API V2 enables getting a manifest of metadata about the objects in some cohort. The kinds of metadata to be included in a manifest are specified as a list of `fields`. The `fields` endpoint returns a list of available fields.

In [None]:
# Retrieve the response from the API endpoint
response = requests.get(f'{idc_api_preamble}/fields')

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the fields JSON text
pretty(response)

# Managing Cohorts
The API includes endpoints for creating and managing cohorts. Using these endpoints requires authenticating to the API by providing credentials. A cohort that is created in this way will be saved under the account associated with the credentials. Such a cohort can be accessed through the IDC web app when signed in with the same account. Similarly, cohorts created through the IDC web app can be accessed from the API with appropriate credentials.

## Notes on Authorization and Credentials

In order to be able to create a cohort or access a previously defined cohort from the API, the user must authenticate their identity with the API. This section will step through the authentication/authorization process. Perform the following steps on your local machine:

1. Use the [idc_auth.py](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/scripts/idc_auth.py) Python script to create a Credential File on your local machine.
  * This script must be executed from the command line
  * Refer to the script for execution instructions
2. Find the location of the Credential File on your local machine
  * By default, the script will save the credential file in the user's home folder with the file name: `.idc_credentials`

We can now proceed to load the credential file into the cloud environment you are using:

In [None]:
import os
# Import files helper for Colab
from google.colab import files

In [None]:
# First delete any local instances of your credentials.
# If we do not do this, files.upload() will name the new file ".idc_credentials (1)"
try:
  os.remove('./.idc_credentials')
except FileNotFoundError:
  print('.idc_credentials not found')
# The above assumes that your credential file is named '.idc_credentials'. Edit as needed.

# Upload your credentials to the cloud environment. Click on the Choose Files button to bring up a file browser
uploaded = files.upload()
#Click on the "Choose File" button to open a file browser:

Now that we have the Credentials file created and uploaded to the cloud environment, we can extract the ID token that identifies you for the purpose of authorization.

In [None]:
# Open the credentials file and parse its JSON contents
with open(".idc_credentials", "r") as credentials_file:
    token = json.load(credentials_file)
# Get the ID token from the parsed credentials
creds = token['token_response']['id_token']
# Create the authorization header for requests
auth_header = {
    'Authorization': 'Bearer ' + creds
    }

**Note:** the credentials file will expire after 1 hour and a new one will need to be generated. If a new file is not generated with the idc_auth script, you can delete the original file and try running the script again.
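
Since the credentials are short-lived, it can help to check the age of the file before using it. The sketch below assumes that the file's modification time approximates the time the token was issued, which may not hold if the file was copied or edited later. The helper name `credentials_look_fresh` is hypothetical, and the demonstration uses a temporary file rather than a real credentials file.

```python
import os
import tempfile
import time

MAX_AGE_SECONDS = 60 * 60  # the one-hour lifetime noted above

def credentials_look_fresh(path, now=None):
    """Return True if the file exists and was modified less than an hour ago."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < MAX_AGE_SECONDS

# Demonstrate with a temporary file standing in for the credentials file.
with tempfile.NamedTemporaryFile() as tmp:
    fresh_now = credentials_look_fresh(tmp.name)
    # Pretend two hours have passed since the file was written.
    fresh_later = credentials_look_fresh(tmp.name, now=time.time() + 2 * MAX_AGE_SECONDS)

print(fresh_now, fresh_later)
```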

## /cohorts POST Endpoint
We can now proceed to use the POST `/cohorts` endpoint to create a cohort. We will use the same cohort definition as previously.

In [None]:
filters = {
    "collection_id": [
      "tcga_luad",
      "tcga_kirc"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis_btw": [
      65,
      75
    ]
  }


mimetype = 'application/json'
headers = {
    'Accept': mimetype,
    'Content-Type': mimetype
    }


cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}


In [None]:
response = requests.post(f'{idc_api_preamble}/cohorts/',
            data=json.dumps(cohortSpec), headers = headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  print(json.dumps(response.json(), sort_keys=True, indent=1))

  cohort_id = response.json()['cohort_properties']['cohort_id']

The response repeats the cohort definition and includes both the cohort_id of the newly created cohort and the current IDC version. Including the IDC version in the cohort definition ensures that the objects in the cohort are uniquely defined.

We will use this ID in some subsequent examples.
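
A note on the `headers|auth_header` expression used in these requests: the `|` operator merges two dictionaries and is available for dicts from Python 3.9 onward. On older interpreters, dictionary unpacking gives the same result, as sketched here. The token value is a placeholder.

```python
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
auth_header = {'Authorization': 'Bearer <id_token>'}  # placeholder, not a real token

# Works on Python 3.5+; equivalent to headers | auth_header on Python 3.9+.
merged = {**headers, **auth_header}

# Neither input dictionary is modified by the merge.
print(sorted(merged))
print(sorted(headers))
```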

The /cohorts GET endpoint returns a list of the caller's previously created cohorts:

In [None]:
# Retrieve the response from the API endpoint
response = requests.get(f'{idc_api_preamble}/cohorts',
            headers=headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the list of cohorts
pretty(response)

The /cohorts/{cohort_id} DELETE endpoint and the /cohorts DELETE endpoint can be used to delete, respectively, a single cohort or a list of cohorts. We'll demonstrate deleting a single cohort:

In [None]:
# Submit the DELETE request to the API endpoint
response = requests.delete(f'{idc_api_preamble}/cohorts/{cohort_id}',
            headers=headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
  # Print the error code and message if something went wrong
  print(response.json())  # Print the error code if something went wrong

# Print the response to the DELETE request
pretty(response)

# Getting a manifest
As described in detail in the [IDC User Guide](https://learn.canceridc.dev/), the manifest is a table of metadata about the objects in a cohort. A manifest obtained from the API can include more fields than a manifest obtained from the IDC web app. We first need to create another cohort because we deleted the cohort that we previously created:

In [None]:
response = requests.post(f'{idc_api_preamble}/cohorts/',
                      data=json.dumps(cohortSpec), headers=headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  cohort_id = response.json()['cohort_properties']['cohort_id']

The /manifest/{cohort_id} endpoint takes a cohort_id. It also accepts a JSON encoded specification of the manifest to be returned. This includes a list of the fields in the manifest and some additional parameters.
In the following example, we request a manifest which includes the fields listed in the `fields` list. `fields` is a required component of the `manifestBody`.

`manifestBody` can include several other optional components.

- If the "counts" parameter is included and is True, the manifest will include the counts of the objects which make up each row in the manifest. The default is False.

- If the "group_size" parameter is included and is True, the manifest will include the total size in bytes of the instances that are represented by each row of the manifest. The default is False.

- If the "sql" parameter is included and is True, the BigQuery SQL that was used to generate the manifest is returned. The default is False.

- If the "page_size" parameter is included and is a valid integer, at most page_size rows of the manifest will be returned. The default is 1000 rows.
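
The options above can be gathered into a small builder, sketched here under a few assumptions. The function `make_manifest_body` is a hypothetical helper, its defaults simply mirror the defaults listed above, and treating a "valid integer" as a positive `int` is our assumption rather than documented API behavior.

```python
import json

def make_manifest_body(fields, counts=False, group_size=False, sql=False, page_size=1000):
    """Build a JSON-serializable manifest request body."""
    fields = list(fields)
    if not fields:
        raise ValueError("'fields' is required and must not be empty")
    # Assumption: a valid page size is a positive integer (bool is excluded).
    if isinstance(page_size, bool) or not isinstance(page_size, int) or page_size < 1:
        raise ValueError("'page_size' must be a positive integer")
    return {
        "fields": fields,
        "counts": bool(counts),
        "group_size": bool(group_size),
        "sql": bool(sql),
        "page_size": page_size,
    }

body = make_manifest_body(['collection_id', 'modality'], counts=True, page_size=10)
print(json.dumps(body, indent=2))
```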

In [None]:
fields = [
  'collection_id',
  'modality',
  'race',
  'age_at_diagnosis',
  'studyinstanceuid',
]

manifestBody = {
    "fields": fields,
    "counts": True,
    "group_size": True,
    "sql": True,
    'page_size': 10,
}

response = requests.post(f'{idc_api_preamble}/cohorts/manifest/{cohort_id}',
            data=json.dumps(manifestBody), headers=headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  # Print the manifest
  pretty(response)

This is a "study granularity" manifest because the filters do not include instance or series level filters (SOPInstanceUID, crdc_sop_uuid, SeriesInstanceUID, crdc_series_uuid). (See the IDC User Guide for an in-depth discussion of manifest granularity.)

Because the manifest has study granularity, each row includes the number of series (`series_count`) and the number of instances (`instance_count`) having the values of the reported fields. For example, the first row of the manifest shows that 2 series and 54 instances have the 'CT' modality, are in the study having StudyInstanceUID "1.3.6.1.4.1.14519.5.2.1.8421.4004.694473393945766391895523698595", etc.

Because the `group_size` parameter is True, each row includes a `group_size` value. This is the size, in bytes, of all instances in the group having the specified filter values.
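
Given rows that carry `series_count`, `instance_count` and `group_size`, totals for the returned rows can be computed with ordinary Python. The rows below are hypothetical: the first reuses the 2 series and 54 instances mentioned above, while the remaining values, including every `group_size`, are invented for illustration.

```python
# Hypothetical manifest rows; only the first row's counts come from the text above.
rows = [
    {"Modality": "CT", "series_count": 2, "instance_count": 54, "group_size": 28000000},
    {"Modality": "MR", "series_count": 3, "instance_count": 120, "group_size": 15500000},
]

total_series = sum(row["series_count"] for row in rows)
total_instances = sum(row["instance_count"] for row in rows)
total_bytes = sum(row["group_size"] for row in rows)

print(f"{total_series} series, {total_instances} instances, {total_bytes / 1e6:.1f} MB")
```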

Because the `sql` parameter is True, the BigQuery SQL is included in the `cohort_def`.

Because the `page_size` parameter is 10, the response shows that there are 105 rows in the manifest but only 10 rows were returned. The response includes a `next_page` token. This token can be submitted to the /manifest/nextPage endpoint to obtain additional rows of the manifest:


In [None]:
next_page_token = response.json()['next_page']

params = {
    "next_page": next_page_token,
    "page_size": 10
}

response = requests.get(f'{idc_api_preamble}/cohorts/manifest/nextPage',
                      params=params, headers=headers|auth_header)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print(response.json())
else:
  pretty(response)

We received another 10 rows, but `next_page` is not null, indicating that there are still more rows in the manifest.
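
Following the `next_page` token until it is empty can be wrapped in a generator. This is a minimal sketch, not the API's own client: it assumes that every page has the same layout, with rows under `["manifest"]["manifest_data"]` and the token under `["next_page"]`, and it takes a caller-supplied `fetch_next_page` function so that it can be exercised here with fake pages instead of live requests.

```python
def iter_manifest_rows(first_page, fetch_next_page):
    """Yield every manifest row, requesting further pages while a token remains."""
    page = first_page
    while True:
        yield from page["manifest"]["manifest_data"]
        token = page.get("next_page")
        if not token:
            break
        # fetch_next_page takes a token and returns the next parsed JSON page.
        page = fetch_next_page(token)

# Fake pages keyed by token, standing in for calls to the nextPage endpoint.
fake_pages = {
    "t1": {"manifest": {"manifest_data": [{"row": 3}, {"row": 4}]}, "next_page": "t2"},
    "t2": {"manifest": {"manifest_data": [{"row": 5}]}, "next_page": None},
}
first = {"manifest": {"manifest_data": [{"row": 1}, {"row": 2}]}, "next_page": "t1"}

all_rows = list(iter_manifest_rows(first, fake_pages.__getitem__))
print(len(all_rows))
```

With the live API, `fetch_next_page` could wrap the `requests.get` call to the nextPage endpoint and return `response.json()`.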

## Preview manifest

The POST `/cohorts/manifest/preview` endpoint behaves as if it created a cohort, generated a manifest for that cohort and then deleted the cohort. However, no cohort is actually created or saved, so this endpoint does not require user authentication.

In the following we create a preview manifest for the same cohort as in previous examples. The manifestPreviewBody is like the manifestBody passed to the /manifest endpoint except that it also includes the "cohort_def" component. Note that the headers do not include the auth_header:



In [None]:
filters = {
    "collection_id": [
      "tcga_luad",
      "tcga_kirc"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis_btw": [
      65,
      75
    ]
  }


mimetype = 'application/json'
headers = {
    'Accept': mimetype,
    'Content-Type': mimetype
    }

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

manifestPreviewBody = {
  "cohort_def": cohortSpec,
  "fields": fields,
  "counts": True,
  "group_size": True,
  "sql": True,
  'page_size': 10
}

response = requests.post(f'{idc_api_preamble}/cohorts/manifest/preview',
                      data=json.dumps(manifestPreviewBody), headers=headers)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))
else:
  # Print the manifest
  pretty(response)

These are the same initial 10 rows obtained from the /manifest endpoint. Like the /manifest/nextPage endpoint, the /manifest/preview/nextPage endpoint can be accessed to obtain the additional rows of the manifest.

# Obtaining IDC DICOM data
An important goal of these manifests is to provide the information needed to obtain IDC DICOM data, specifically DICOM instance data. There are several `fields` that are important for this purpose.




## Instance level access
The gcs_url and aws_url fields provide the GCS and AWS URL respectively of each instance in the cohort. For example:

In [None]:
filters = {
    "collection_id": [
      "tcga_luad",
      "tcga_kirc"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis_btw": [
      65,
      75
    ]
  }

mimetype = 'application/json'
headers = {
    'Accept': mimetype,
    'Content-Type': mimetype
    }

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

fields = [
    "gcs_url",
    "aws_url"
]

manifestPreviewBody = {
  "cohort_def": cohortSpec,
  "fields": fields,
  "page_size": 10
}

urls = requests.post(f'{idc_api_preamble}/cohorts/manifest/preview',
                      data=json.dumps(manifestPreviewBody), headers=headers)

# Check that there wasn't an error with the request
if urls.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(urls.reason))
else:
  # Print the manifest
  pretty(urls)

Given a gcs_url, there are various methods and tools, such as gsutil and curl, for obtaining the corresponding DICOM object, e.g.:

* gsutil -u {project_id} cp gs://{bucket}/{object_name} {save_location}
* curl -X GET -H "Authorization: Bearer {oauth2_token}" -o {save_location} -k "https://storage.googleapis.com/storage/v1/b/{bucket}/o/{object_name}?alt=media&userProject={project_id}"

We will continue to use the same requests.get method. For this purpose we need to convert the gs type URL to an https URL:

In [None]:
# Get the gs type URL from the first row of the manifest
gcs_url = urls.json()["manifest"]["manifest_data"][0]['gcs_url']
# Convert the gs type URL to an https type url
https_url = gcs_url.replace("gs://", "https://storage.googleapis.com/")

# Now we can get the DICOM object
response = requests.get(https_url)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))


To verify that we have downloaded the instance, we will inspect it using the pydicom package.

In [None]:
!pip install pydicom
import pydicom

We can access the instance while it is in memory by converting the returned bytes to a binary I/O stream.

In [None]:
import io
dcm = io.BytesIO(response.content)

Then we read the object and dump its contents.

In [None]:
print(pydicom.dcmread(dcm))

The process for getting the DICOM object from an AWS bucket is similar:

In [None]:
# Get the s3 type URL from the second row of the manifest
aws_url = urls.json()["manifest"]["manifest_data"][1]['aws_url']
# Convert the s3 type URL to an https type url
https_url = aws_url.replace("s3://", "https://s3.amazonaws.com/")

# Now we can get the DICOM object
response = requests.get(https_url)

# Check that there wasn't an error with the request
if response.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(response.reason))


Verify that we have obtained the DICOM object:

In [None]:
dcm = io.BytesIO(response.content)
print(pydicom.dcmread(dcm))
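
The two URL conversions used in this section can be folded into one helper. This is a sketch that mirrors the string replacements shown above; `to_https_url` is a hypothetical name, and the bucket and object names in the usage lines are placeholders.

```python
def to_https_url(url):
    """Convert a gs:// or s3:// object URL to the https form used above."""
    prefixes = {
        "gs://": "https://storage.googleapis.com/",
        "s3://": "https://s3.amazonaws.com/",
    }
    for scheme, https_prefix in prefixes.items():
        if url.startswith(scheme):
            return https_prefix + url[len(scheme):]
    raise ValueError(f"unsupported URL scheme: {url!r}")

# Placeholder bucket and object names, for illustration only.
print(to_https_url("gs://example-bucket/series-id/instance-id.dcm"))
print(to_https_url("s3://example-bucket/series-id/instance-id.dcm"))
```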

## Series level access
Each version of each IDC DICOM series is assigned a unique `crdc_series_uuid`. Similarly, each version of each IDC DICOM instance is assigned a unique `crdc_instance_uuid`. (To be specific, these are [Version 4 UUIDs](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random))).

IDC uses these UUIDs to give each version of each IDC DICOM instance a unique, hierarchical name. The names of IDC instances in GCS and in AWS are, respectively, of the form:
* gs://\<bucket\>/\<crdc_series_uuid\>/\<crdc_instance_uuid\>.dcm
* s3://\<bucket\>/\<crdc_series_uuid\>/\<crdc_instance_uuid\>.dcm

Some tools can take advantage of this hierarchical naming to copy all instances in a series, given just the bucket name and `crdc_series_uuid`. We will demonstrate using the `s5cmd` tool for this purpose.
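
The naming scheme can be written out as two small functions, one for a single instance and one for the wildcard that matches every instance in a series. The function names are hypothetical, and the bucket name and UUIDs in the usage lines are placeholders rather than real IDC identifiers.

```python
def instance_url(scheme, bucket, crdc_series_uuid, crdc_instance_uuid):
    """Return the object URL of one instance, e.g. gs://bucket/series/instance.dcm."""
    if scheme not in ("gs", "s3"):
        raise ValueError("scheme must be 'gs' or 's3'")
    return f"{scheme}://{bucket}/{crdc_series_uuid}/{crdc_instance_uuid}.dcm"

def series_wildcard(scheme, bucket, crdc_series_uuid):
    """Return a wildcard URL matching all instances of one series."""
    if scheme not in ("gs", "s3"):
        raise ValueError("scheme must be 'gs' or 's3'")
    return f"{scheme}://{bucket}/{crdc_series_uuid}/*"

# Placeholder identifiers, for illustration only.
series = "11111111-2222-4333-8444-555555555555"
instance = "aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee"
print(instance_url("gs", "example-bucket", series, instance))
print(series_wildcard("s3", "example-bucket", series))
```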

First we generate a manifest that includes just the bucket and crdc_series_uuids. (We reduce the size of the cohort for demonstration purposes.):

In [None]:
filters = {
    "collection_id": [
      "tcga_luad",
      "tcga_kirc"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis": [
      34
    ]
  }

mimetype = 'application/json'
headers = {
    'Accept': mimetype,
    'Content-Type': mimetype
    }

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

fields = [
    "gcs_bucket",
    "aws_bucket",
    "crdc_series_uuid"
]

manifestPreviewBody = {
  "cohort_def": cohortSpec,
  "fields": fields
}

series_uuids = requests.post(f'{idc_api_preamble}/cohorts/manifest/preview',
                      data=json.dumps(manifestPreviewBody), headers=headers)

# Check that there wasn't an error with the request
if series_uuids.status_code != 200:
    # Print the error code and message if something went wrong
    print('Request failed: {}'.format(series_uuids.reason))
else:
  # Print the manifest
  pretty(series_uuids)

We'll use the [s5cmd](https://github.com/peak/s5cmd) tool to copy the instances in this manifest from GCS and AWS buckets to the local device. First, we install s5cmd:

In [None]:
!curl -o s5cmd.tar.gz -L https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-32bit.tar.gz
!tar zxf s5cmd.tar.gz

We first show how to copy all the instances in a single series using s5cmd. s5cmd expects `s3` type URLs for both AWS and GCS. A parameter controls whether s5cmd accesses an AWS or a GCS bucket. We'll demonstrate copying from AWS first. Construct the URL:

In [None]:
# Get the data from the first row of the manifest
series_data = series_uuids.json()['manifest']['manifest_data'][0]
# Construct the URL
aws_url = f's3://{series_data["aws_bucket"]}/{series_data["crdc_series_uuid"]}/*'
print(aws_url)
# We need a directory into which to copy the data:
!mkdir -vp /tmp/dicom_data
# Execute s5cmd
!./s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp $aws_url /tmp/dicom_data
# Delete the directory and contents
!rm -vr /tmp/dicom_data

We use a similar script to copy data from GCS. There are two differences:
* The URL contains the GCS bucket rather than the AWS bucket
* The command has a different value for the --endpoint-url parameter



In [None]:
# Get the data from the first row of the manifest
series_data = series_uuids.json()['manifest']['manifest_data'][0]
# Construct the URL
gcs_url = f's3://{series_data["gcs_bucket"]}/{series_data["crdc_series_uuid"]}/*'
print(gcs_url)
# We need a directory into which to copy the data:
!mkdir -vp /tmp/dicom_data
# Execute s5cmd
!./s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp $gcs_url /tmp/dicom_data
# Delete the directory and contents
!rm -vr /tmp/dicom_data

The manifest includes several series. We can get all the data from all the series using the s5cmd `run` command:

In [None]:
# We need a directory into which to copy the data:
!mkdir -vp /tmp/dicom_data
# We first build a manifest of the series URLS. Each row includes the target directory
aws_urls = [f'cp s3://{row["aws_bucket"]}/{row["crdc_series_uuid"]}/* /tmp/dicom_data/.' for row in series_uuids.json()['manifest']['manifest_data']]
# Export the manifest to a file
with open('/tmp/s5cmd_manifest.s5cmd', "w") as f:
  for row in aws_urls:
    f.write(f'{row}\n')
#Here's what our manifest looks like:
! cat /tmp/s5cmd_manifest.s5cmd
# Execute s5cmd
! ./s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run /tmp/s5cmd_manifest.s5cmd
! ls -l /tmp/dicom_data
# Delete the directory and contents
! rm -vr /tmp/dicom_data

Copying all the series in the manifest from GCS is similar.
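
As a sketch of the GCS variant, the command list can be built from the `gcs_bucket` field instead of `aws_bucket`. The helper `s5cmd_copy_commands` is a hypothetical name, and the manifest rows below are placeholders standing in for `series_uuids.json()['manifest']['manifest_data']`.

```python
def s5cmd_copy_commands(rows, bucket_field, destination):
    """Return one s5cmd 'cp' command line per manifest row."""
    return [
        f'cp s3://{row[bucket_field]}/{row["crdc_series_uuid"]}/* {destination}'
        for row in rows
    ]

# Placeholder manifest rows, for illustration only.
rows = [
    {"gcs_bucket": "example-gcs-bucket", "aws_bucket": "example-aws-bucket",
     "crdc_series_uuid": "11111111-2222-4333-8444-555555555555"},
    {"gcs_bucket": "example-gcs-bucket", "aws_bucket": "example-aws-bucket",
     "crdc_series_uuid": "66666666-7777-4888-8999-000000000000"},
]

gcs_commands = s5cmd_copy_commands(rows, "gcs_bucket", "/tmp/dicom_data/.")
print("\n".join(gcs_commands))
```

The resulting lines would be written to a file and passed to `s5cmd run` with `--endpoint-url https://storage.googleapis.com`, as in the single-series GCS example.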

## Resolving DRS IDs
A `DRS ID` is a token that can be presented to a `DRS server` to obtain information on accessing some data object represented by that token.
Please refer to the [API](https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids) section of the [IDC User Guide](https://learn.canceridc.dev) for further information on `DRS`, `DrsObjects` and DRS `access methods`. In addition, links to DRS specifications can be found [here](https://ga4gh.github.io/data-repository-service-schemas/).

As discussed in the IDC User Guide, it is possible that the gcs_url or aws_url of an IDC DICOM instance can change over time. Therefore such URLs should not be used for long term documentation of the source of data. However, the `DRS ID` of some IDC DICOM instance can always be used to locate the corresponding 'bits' of that instance.

For this purpose, a `DRS ID` is 'resolved' to obtain a JSON structure called a `DrsObject`, which contains information for accessing the DICOM instance wherever it is.

As it turns out, the `crdc_instance_uuid` of an instance is also its `DRS ID`. We will resolve the `crdc_instance_uuid`s of the instances in the cohort we have been working with to obtain their URLs. We will then use s5cmd to copy the instances to the local file system.

To do so, we change the `fields` in our manifest request to obtain the `crdc_instance_uuid`s of the instances in the cohort. We also reduce the cohort size even further, because resolving DRS IDs is relatively slow:


In [None]:
filters = {
    "collection_id": [
      "tcga_luad",
      "tcga_kirc"
    ],
    "Modality": [
      "CT",
      "MR"
    ],
    "race": [
      "WHITE"
    ],
    "age_at_diagnosis": [
      82
    ]
  }

mimetype = 'application/json'
headers = {
    'Accept': mimetype,
    'Content-Type': mimetype
    }

cohortSpec = {"name": "testcohort",
              "description": "Test description",
              "filters": filters}

fields = [
  "crdc_instance_uuid"
]

manifestPreviewBody = {
  "cohort_def": cohortSpec,
  "fields": fields
}

instance_uuids = requests.post(f'{idc_api_preamble}/cohorts/manifest/preview',
                      data=json.dumps(manifestPreviewBody), headers=headers)

# Check that there wasn't an error with the request
if instance_uuids.status_code != 200:
  # Print the error code and message if something went wrong
  print('Request failed: {}'.format(instance_uuids.reason))
else:
  # Print the manifest of instance UUIDs
  pretty(instance_uuids)

Now we can resolve each crdc_instance_uuid/DRS ID. To resolve such a DRS ID, we present it to the CRDC DRS server:

In [None]:
crdc_drs_server_url = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects"
# Create a list of the instance UUIDs (DRS IDs) in the previous response.
drs_ids = [row['crdc_instance_uuid'] for row in instance_uuids.json()['manifest']['manifest_data']]
# We'll accumulate the DrsObjects here:
drs_objects = []
# Resolving a DRS ID simply requires presenting it to the nci-crdc.datacommons.io server.
for drs_id in drs_ids:
  drs_object = requests.get(f'{crdc_drs_server_url}/{drs_id}').json()
  drs_objects.append(drs_object)

# Print one of the DrsObjects
print('This cohort has {} instances'.format(len(drs_objects)))
print(json.dumps(drs_objects[0], indent=2))

Each DrsObject includes an "access_methods" list, each element of which describes one method for accessing the object represented by the DrsObject. The "type" component describes the access method, and the "url" in the "access_url" is a "fully resolvable URL that can be used to fetch the actual object bytes."

Note that the "access_id" is used when obtaining a signed URL to the object. IDC's AWS and GCS data do not require a signed URL.
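
To make the structure concrete, here is a minimal sketch that pulls the URL for each access method out of a DrsObject. The `sample_drs_object` below is a made-up example with placeholder bucket and file names, not a real response from the CRDC DRS server. The helper names `access_urls_by_type` and `to_s5cmd_url` are likewise hypothetical.

```python
def access_urls_by_type(drs_object):
    """Map each access method's "type" to the "url" inside its "access_url"."""
    return {
        method["type"]: method["access_url"]["url"]
        for method in drs_object.get("access_methods", [])
        if "access_url" in method  # skip methods that only carry an access_id
    }


def to_s5cmd_url(url):
    """s5cmd expects the s3:// scheme, so rewrite a leading gs:// if present."""
    if url.startswith("gs://"):
        return "s3://" + url[len("gs://"):]
    return url


# Made-up DrsObject; the bucket and object names are placeholders.
sample_drs_object = {
    "id": "hypothetical-drs-id",
    "access_methods": [
        {"access_id": "gs", "type": "gs",
         "access_url": {"url": "gs://example-gcs-bucket/series-uuid/instance-uuid.dcm"}},
        {"access_id": "s3", "type": "s3",
         "access_url": {"url": "s3://example-aws-bucket/series-uuid/instance-uuid.dcm"}},
    ],
}

urls = access_urls_by_type(sample_drs_object)
gcs_line = f'cp {to_s5cmd_url(urls["gs"])} /tmp/dicom_data/.'
print(urls)
print(gcs_line)
```

Rewriting only a leading `gs://` avoids accidentally changing a "gs" that appears later in a bucket or object name.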

In the following we will assemble a manifest for obtaining data from GCS. We'll then use s5cmd to obtain the data objects:

In [None]:
drs_object

In [None]:
# We need a directory into which to copy the data:
!mkdir -vp /tmp/dicom_data
# We first build a manifest of the instance URLs. Each row includes the target directory
gcs_urls = []
for drs_object in drs_objects:
  # Find the GCS access method. For s5cmd we need to change the "gs" scheme to "s3"
  gcs_url = next(access_method["access_url"]["url"] for access_method in drs_object["access_methods"] if access_method["type"]=="gs").replace('gs://', 's3://', 1)
  gcs_urls.append(f'cp {gcs_url} /tmp/dicom_data/.')
# Export the manifest to a file
with open('/tmp/s5cmd_manifest.s5cmd', "w") as f:
  for row in gcs_urls:
    f.write(f'{row}\n')

# Here's what our manifest looks like:
! cat /tmp/s5cmd_manifest.s5cmd


Now we copy the instances from GCS to our local disk:

In [None]:
# Execute s5cmd, using the GCS endpoint because the manifest contains GCS URLs
! ./s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run /tmp/s5cmd_manifest.s5cmd
! ls -l /tmp/dicom_data
# Delete the directory and contents
! rm -vr /tmp/dicom_data