# Using the Azul DRS Endpoint

This notebook demonstrates how to use the GA4GH Data Repository Service (DRS) endpoints of the Azul service.

## Introduction

The [HCA Data Storage Service](https://github.com/HumanCellAtlas/data-store) (DSS) provides software for cloud-agnostic data storage. When used with the [Azul](https://github.com/DataBiosphere/azul) service, DSS metadata becomes easy to query.

The Global Alliance for Genomics and Health (GA4GH) supports the [Data Repository Service](https://github.com/ga4gh/data-repository-service-schemas) (DRS). This interface provides a set of standard methods for basic data interchange.

This notebook will demonstrate the `GetDataObject` method to download a file from its DRS URL.

## Initialize a client

This notebook presumes you have a functioning version of the Azul *service lambda* running locally. See the [Azul docs](https://github.com/DataBiosphere/azul/tree/develop/src/azul).

In [45]:
import requests
base_url = "http://localhost:8000"
drs_endpoint = "ga4gh/dos/v1/dataobjects"

health_check = requests.get("{}/health".format(base_url))
print(health_check.json())

{u'status': u'UP', u'elasticsearch': {u'status': u'UP', u'domain': u'azul-index-dev'}}


## Using a DRS URL

A DRS URL will often be made available as part of "interchange metadata" like a file manifest. Using DRS URLs allows applications to separate the cloud location of a file from its identity.

In [46]:
# The identifier is the part of the DRS URL after the scheme.
drs_url = "drs://b3ebf536-e8a6-4796-a66a-7b3c088680a3"
drs_id = drs_url.replace('drs://', '')
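
The plain string replacement above accepts any input. A slightly stricter variant is sketched below for Python 3. It assumes the DRS URL has the bare `drs://<id>` form used in this notebook and that the identifier is a UUID, as it is in the example; neither assumption comes from the DRS specification. The helper name `drs_id_from_url` is hypothetical.

```python
import uuid
from urllib.parse import urlsplit


def drs_id_from_url(drs_url):
    """Return the identifier from a URL of the form drs://<id>."""
    parts = urlsplit(drs_url)
    if parts.scheme != "drs":
        raise ValueError("not a DRS URL: {!r}".format(drs_url))
    drs_id = parts.netloc
    # Assumption: identifiers are UUIDs, as in this notebook's example.
    # uuid.UUID raises ValueError for anything that is not a UUID.
    uuid.UUID(drs_id)
    return drs_id
```

Calling `drs_id_from_url(drs_url)` can stand in for the `replace` call when malformed input should fail early.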


Now with our DRS identifier we can construct a `GetDataObject` request.

In [24]:
# Build the request URL once and reuse it for the request and the printout.
data_object_url = "{base_url}/{drs_endpoint}/{drs_id}".format(
    base_url=base_url,
    drs_endpoint=drs_endpoint,
    drs_id=drs_id)
drs_response = requests.get(data_object_url)
print(data_object_url)
print(drs_response.json())

http://localhost:8000/ga4gh/dos/v1/dataobjects/b3ebf536-e8a6-4796-a66a-7b3c088680a3
{u'data_object': {u'name': u'AB-HE0202B-CZI-day3-Drop_S3_R1_001.fastq.gz', u'version': u'2018-12-05T230803.983133Z', u'urls': [{u'url': u'http://localhost:8000/fetch/dss/files/b3ebf536-e8a6-4796-a66a-7b3c088680a3?version=2018-12-05T230803.983133Z&replica=aws'}], u'checksums': [{u'checksum': u'a91d88ac03b52649d7299f8a5efcd74fc5b4f8d1901214c2a80b29f325573a37', u'type': u'sha256'}], u'size': u'2067772560', u'id': u'b3ebf536-e8a6-4796-a66a-7b3c088680a3', u'aliases': [u'AB-HE0202B-CZI-day3-Drop_S3_R1_001.fastq.gz']}}
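
The request and response above can be wrapped in two small helpers. This is a minimal Python 3 sketch, not part of Azul: the names `build_data_object_url` and `unwrap_data_object` are hypothetical, and the response layout is assumed to match the output printed above (a top-level `data_object` key, with `size` encoded as a string).

```python
from urllib.parse import quote


def build_data_object_url(base_url, drs_endpoint, drs_id):
    """Join the pieces into a GetDataObject URL, tolerating stray slashes."""
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        drs_endpoint.strip("/"),
        quote(drs_id, safe=""))


def unwrap_data_object(payload):
    """Return the data object from a decoded JSON response.

    Assumption: the payload has the shape printed above.
    """
    if "data_object" not in payload:
        raise ValueError("response has no 'data_object' key")
    data_object = dict(payload["data_object"])
    # The size arrives as a string in the example response; convert it.
    data_object["size"] = int(data_object["size"])
    return data_object
```

With these, the cell above reduces to `unwrap_data_object(requests.get(build_data_object_url(base_url, drs_endpoint, drs_id)).json())`.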


## Getting a fetch URL from the DRS endpoint

DRS Data Objects may contain a list of URLs for accessing the files. These are in an array called `urls`.

In [47]:
data_object = drs_response.json()['data_object']
print(data_object['urls'])

[{u'url': u'http://localhost:8000/fetch/dss/files/b3ebf536-e8a6-4796-a66a-7b3c088680a3?version=2018-12-05T230803.983133Z&replica=aws'}]


In [28]:
fetch_url = data_object['urls'][0]['url']
print(fetch_url)

http://localhost:8000/fetch/dss/files/b3ebf536-e8a6-4796-a66a-7b3c088680a3?version=2018-12-05T230803.983133Z&replica=aws
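
The fetch URL printed above carries the file identifier in its path and the `version` and `replica` values in its query string. The Python 3 sketch below pulls these apart with the standard library. The helper name `describe_fetch_url` is hypothetical, and the URL layout is assumed from this one example rather than from Azul documentation.

```python
from urllib.parse import parse_qs, urlsplit


def describe_fetch_url(fetch_url):
    """Split a fetch URL into file ID, version, and replica."""
    parts = urlsplit(fetch_url)
    query = parse_qs(parts.query)
    return {
        # Last path segment, e.g. .../fetch/dss/files/<file_id>
        "file_id": parts.path.rsplit("/", 1)[-1],
        "version": query["version"][0],
        "replica": query["replica"][0],
    }
```

This can be useful for checking that the file ID in the fetch URL matches the DRS identifier that was requested.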


## Get a pre-signed URL

The Azul service acts as a proxy for the DSS. The pre-signed URL issued by the DSS is returned in the `Location` field of the response to the `fetch` request.

In [41]:
fetch_response = requests.get(fetch_url)
print(fetch_response.json())
presigned_url = fetch_response.json()['Location']

{u'Status': 302, u'Location': u'https://org-hca-dss-checkout-prod.s3.amazonaws.com/blobs/a91d88ac03b52649d7299f8a5efcd74fc5b4f8d1901214c2a80b29f325573a37.2a3eff0e5f41f2bc50bfb7fafcb811d4d4914aee.74ed217ade9c93560af2f5e1853221be-31.3337d3ca?AWSAccessKeyId=ASIARSZHKI4KPZQ3WXQM&Signature=DCcUp9dEZn7UzFJYNFiWAm6H8gY%3D&x-amz-security-token=REDACTED&Expires=1547691807'}
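
The response above reports `Status` and `Location` in the JSON body, and the pre-signed URL carries an `Expires` query parameter. The Python 3 sketch below reads both. Two assumptions are made: any `Status` other than 302 is treated as "not ready" (the notebook only shows the 302 case), and `Expires` is read as seconds since the Unix epoch, which matches the AWS query-string signing format visible in the URL. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_location(fetch_json):
    """Return the pre-signed URL from a decoded fetch response."""
    # Assumption: only Status 302 means the Location is ready to download.
    if fetch_json.get("Status") != 302:
        raise RuntimeError("fetch not ready: {!r}".format(fetch_json))
    return fetch_json["Location"]


def presigned_expiry(presigned_url):
    """Return the expiry time of a pre-signed URL as an aware datetime."""
    query = parse_qs(urlsplit(presigned_url).query)
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)
```

Comparing `presigned_expiry(presigned_url)` with `datetime.now(timezone.utc)` shows whether the link is still usable before a large download is started.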


## Downloading from the pre-signed URL

Once we have a pre-signed URL, we can download the file using `wget`, `curl`, or a similar tool.

In [44]:
!wget "$presigned_url"

The name is too long, 674 chars total.
Trying to shorten...
New name is a91d88ac03b52649d7299f8a5efcd74fc5b4f8d1901214c2a80b29f325573a37.2a3eff0e5f41f2bc50bfb7fafcb811d4d4914aee.74ed217ade9c93560a.
--2019-01-16 17:24:33--  https://org-hca-dss-checkout-prod.s3.amazonaws.com/blobs/a91d88ac03b52649d7299f8a5efcd74fc5b4f8d1901214c2a80b29f325573a37.2a3eff0e5f41f2bc50bfb7fafcb811d4d4914aee.74ed217ade9c93560af2f5e1853221be-31.3337d3ca?AWSAccessKeyId=ASIARSZHKI4KPZQ3WXQM&Signature=DCcUp9dEZn7UzFJYNFiWAm6H8gY%3D&x-amz-security-token=REDACTED&Expires=1547691807
Resolving org-hca-dss-checkout-prod.s3.amazonaws.com (org-hca-dss-checkout-prod.s3.amazonaws.com)... 52.216.179.75
Connecting to org-hca-dss-checkout-prod.s3.amazonaws.com (org-hca-dss-checkout-prod.s3.amazonaws.com)|52.216.179.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2067772560 (1.9G) [binary/octet-stream]
Saving to: ‘a91d88ac03b52649d7299f8a5efcd74fc5b4f8d1901214c2a80b29f325573a37.2a3eff0e5f41f2bc50bfb
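
As the output above shows, `wget` derives a long file name from the URL and has to shorten it. An alternative is to download from Python, save under the `name` reported in the data object, and compare the result against its `sha256` checksum. The Python 3 sketch below uses only the standard library; the function name `download_and_verify` is hypothetical, and it assumes the checksum supplied is a hex-encoded SHA-256 digest like the one in the example response.

```python
import hashlib
from urllib.request import urlopen


def download_and_verify(url, dest_path, expected_sha256, chunk_size=1024 * 1024):
    """Stream url to dest_path and check its SHA-256 digest.

    Returns the hex digest on success and raises ValueError on a mismatch.
    """
    digest = hashlib.sha256()
    with urlopen(url) as response, open(dest_path, "wb") as dest:
        while True:
            # Read in chunks so a multi-gigabyte file never sits in memory.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError("checksum mismatch: expected {}, got {}".format(
            expected_sha256, actual))
    return actual
```

With the variables from the earlier cells, a call might look like `download_and_verify(presigned_url, data_object['name'], data_object['checksums'][0]['checksum'])`, assuming the first checksum entry has type `sha256` as it does here.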

## More reading

[GA4GH Data Repository Service schemas](https://github.com/ga4gh/data-repository-service-schemas)

