# Using the dos-azul-lambda

This notebook demonstrates usage of the dos-azul-lambda, which exposes metadata from the dss-azul-index. This index provides a "file-based" view of HCA DSS data.

For this demonstration, we will use the lambda at the following url:

In [29]:
lambda_url = "https://5ybh0f5iai.execute-api.us-west-2.amazonaws.com/api/"

For convenience, a swagger schema of the API is made available:

In [30]:
!curl $lambda_url/swagger.json

{"info": {"version": "0.2.0", "title": "Data Object Service"}, "paths": {"/databundles/{data_bundle_id}/versions": {"get": {"x-swagger-router-controller": "ga4gh.dos.server", "responses": {"200": {"description": "The versions for the Data Bundle were found successfully.", "schema": {"$ref": "#/definitions/GetDataBundleVersionsResponse"}}, "404": {"description": "The requested Data Bundle wasn't found.", "schema": {"$ref": "#/definitions/ErrorResponse"}}, "403": {"description": "The requester is not authorized to perform this action.", "schema": {"$ref": "#/definitions/ErrorResponse"}}, "401": {"description": "The request is unauthorized.", "schema": {"$ref": "#/definitions/ErrorResponse"}}, "400": {"description": "The request is malformed.", "schema": {"$ref": "#/definitions/ErrorResponse"}}, "500": {"description": "An unexpected error occurred.", "schema": {"$ref": "#/definitions/ErrorResponse"}}}, "parameters": [{"required": true, "type": "string", "name": "data_bundle_id", "in": "pa

We'll use this later to instantiate a DOS client to access data in the lambda.

## Listing Data Objects

Data Object Services provide a `ListDataObjects` method that can be used to get an idea of what the system provides. We'll start by making a list request.

In [31]:
import requests
response = requests.get("{}/ga4gh/dos/v1/dataobjects".format(lambda_url))
print(response)

<Response [200]>


In [32]:
print(response.json())

{u'next_page_token': u'1', u'data_objects': [{u'updated': u'2018-01-23T20:08:16.647500Z', u'name': u'NWD259170.recab.cram.crai', u'version': u'2018-01-31T081722.854147Z', u'urls': [{u'url': u's3://commons-dss-commons/blobs/1c7d249c9123007d693857eab3dd4646bc8d742e76c6716c80debbbdd5d48e8b.be947abb597d1a21f2da9d97d96f58e7ca07a214.b933d8fb97268c951e610b6dfa20924d.6fbcd0b3'}], u'checksums': [{u'checksum': u'be947abb597d1a21f2da9d97d96f58e7ca07a214', u'type': u'md5'}], u'aliases': [u'repoDataBundleId:0d6371a8-fc4f-5232-9660-e655903b17ea', u'center_name:UW', u'submitter_donor_id:', u'sampleId:9f1e5d7d-90f8-57c6-8ccb-ca1d89d34611', u'submittedSampleId:HG01110_sample', u'repoBaseUrl:', u'analysis_type:sequence_upload', u'repoCode:Redwood-AWS-Oregon', u'repoCountry:US', u'file_type:crai', u'submittedSpecimenId:HG01110', u'file_version:2018-01-31T081722.854147Z', u'workflow:spinnaker:1.1.2', u'access:public', u'fileMd5sum:be947abb597d1a21f2da9d97d96f58e7ca07a214', u'program:TOPMed', u'repoType:Bl

There's a lot of information in that response, but it follows the Data Object Service schemas, so we can iterate through it without having to inspect the JSON structure first.

In [33]:
for data_object in response.json()['data_objects']:
    print(data_object['id'])

46c8a5f1-15ab-48fa-8d1c-63099422e3c7
a62ee491-489d-405a-8a3b-83765f9e91fb
fff5a29f-d184-4e3b-9c5b-6f44aea7f527
24f5248b-d1b0-4348-94e5-1476db16fd8a
2aa56d6f-66a4-42c1-b8b7-ba3fe86c2974
252b5eb3-d46a-4f74-8b55-51aa9b3ef702
141b2f96-e2f5-42c2-bb3f-63c6df8fca72
2ce291de-4f12-4cf3-95ab-e75522007958
a3db2c85-8e79-4f03-af2c-e82c1a9c6b0f
17bcd11f-a0e3-4568-8fea-cbe998a3dea7


### Paging through results

More than 10 results exist in the dataset. To iterate past the initial 10, we use the Data Object Service `page_size` parameter to control how many results come back per request, and `page_token` to select the page.

In [34]:
response = requests.get("{}/ga4gh/dos/v1/dataobjects?page_size=100".format(lambda_url))
print(len(response.json()['data_objects']))

100


The response returns a token that can be used to get the next page:

In [35]:
print(response.json()['next_page_token'])
next_page_token = response.json()['next_page_token']

1


In [36]:
import urllib
list_request = urllib.urlencode({'page_size': 100, 'page_token': next_page_token})
response_2 = requests.get("{}/ga4gh/dos/v1/dataobjects?{}".format(lambda_url, list_request))
print(len(response_2.json()['data_objects']))
print(response_2.json()['data_objects'][0]['id'])
print(response.json()['data_objects'][99]['id'])

100
a62ee491-489d-405a-8a3b-83765f9e91fb
1529dcf9-8f61-40fd-9eb0-89a480cbf156
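
The token-following pattern above generalizes to a loop. What follows is a minimal sketch rather than one of the notebook's original cells: `iter_data_objects`, `fetch_page`, and the fake pages are illustrative names, and the loop assumes that a missing or empty `next_page_token` marks the last page, which the responses shown here do not confirm. The service is replaced by an in-memory stand-in so the sketch runs offline.

```python
def iter_data_objects(fetch_page, page_size=100, max_pages=None):
    """Yield every Data Object, following next_page_token until it runs out.

    fetch_page is any callable that takes a dict of query parameters and
    returns the decoded JSON body of a ListDataObjects response.
    """
    params = {'page_size': page_size}
    pages = 0
    while max_pages is None or pages < max_pages:
        body = fetch_page(params)
        for data_object in body.get('data_objects', []):
            yield data_object
        pages += 1
        token = body.get('next_page_token')
        if not token:
            break  # assumption: no token means there are no further pages
        params = {'page_size': page_size, 'page_token': token}


# Stand-in for the service: three fake pages keyed by the incoming page_token.
FAKE_PAGES = {
    None: {'data_objects': [{'id': 'a'}, {'id': 'b'}], 'next_page_token': '1'},
    '1': {'data_objects': [{'id': 'c'}, {'id': 'd'}], 'next_page_token': '2'},
    '2': {'data_objects': [{'id': 'e'}]},
}


def fake_fetch_page(params):
    return FAKE_PAGES[params.get('page_token')]


ids = [obj['id'] for obj in iter_data_objects(fake_fetch_page, page_size=2)]
print(ids)  # ['a', 'b', 'c', 'd', 'e']
```

Against the live service, `fetch_page` could be something like `lambda params: requests.get("{}/ga4gh/dos/v1/dataobjects".format(lambda_url), params=params).json()`. The `max_pages` argument is a safety valve for large indexes.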


### Using alias search

One of the main features of the dss-azul-index is facet-based search on fields from the nested metadata in the DSS. This DOS layer over the azul-index attempts to present a similar feature using tags, without modifying the azul-index.

To construct an example query, let's look at some example aliases:

In [37]:
print(response.json()['data_objects'][99]['aliases'][9:15])
len(response.json()['data_objects'][99]['aliases'])

[u'file_type:crai', u'submittedSpecimenId:SRS1231092', u'file_version:2018-02-28T060238.897266Z', u'workflow:topmed-spinnaker:Alpha Build 1', u'access:public', u'fileMd5sum:7d89245522666b3beea7684a7fbad04a26a575c2']


36

In this version of the azul-index there are 36 fields that can help construct a search. DOS alias search is meant to be simple to implement, and performs simple string matching. Here, we can constrain results to a certain specimen.

In [38]:
list_request = urllib.urlencode({'alias': 'submittedSpecimenId:SRS1304405'})
response = requests.get("{}/ga4gh/dos/v1/dataobjects?{}".format(lambda_url, list_request))
print(len(response.json()['data_objects']))
print(response.json()['data_objects'][0]['aliases'])
print('submittedSpecimenId:SRS1304405' in response.json()['data_objects'][0]['aliases'])

2
[u'repoDataBundleId:3f107d36-0ad5-525a-8755-f132a5fcb979', u'center_name:Broad', u'submitter_donor_id:', u'sampleId:f1781010-452b-5372-8bef-3056153b7d6b', u'submittedSampleId:NWD467250', u'repoBaseUrl:', u'analysis_type:alignment', u'repoCode:Redwood-AWS-Oregon', u'repoCountry:US', u'file_type:crai', u'submittedSpecimenId:SRS1304405', u'file_version:2018-02-28T090140.829798Z', u'workflow:topmed-spinnaker:Alpha Build 1', u'access:public', u'fileMd5sum:95f7049e2d21f1c3055ea704ac3cc9034826a8ca', u'program:NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish', u'repoType:Blue Box', u'repoName:Redwood-AWS-Oregon', u'donor:81c975fa-be54-5098-8723-c35eb7f0c188', u'workflowVersion:Alpha Build 1', u'experimentalStrategy:Seq_DNA_SNP_CNV; Seq_DNA_WholeGenome', u'download_id:3f107d36-0ad5-525a-8755-f132a5fcb979', u'repoOrg:UCSC', u'fileSize:1366325', u'submittedDonorId:DBG00880', u'specimen_type:Normal - Blood', u'metadataJson:', u'lastModified:2018-02-22T16:36:15.032391', u'study:Amish

Here, we see that two Data Objects match this `submittedSpecimenId`, and that the first one does include the alias in its response.
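
Because each alias is a `key:value` string, it can be convenient to turn the list into a dictionary on the client side. The sketch below is illustrative only: `parse_aliases` and `has_aliases` are hypothetical helper names, not part of DOS or the lambda. It splits on the first colon only, since values such as the `workflow` alias contain colons themselves, and it assumes each key appears at most once (a repeated key would keep its last value).

```python
def parse_aliases(aliases):
    """Turn ['key:value', ...] into a dict, splitting on the first colon only."""
    parsed = {}
    for alias in aliases:
        key, _, value = alias.partition(':')
        parsed[key] = value
    return parsed


def has_aliases(aliases, **wanted):
    """Return True if every wanted key is present with exactly that value."""
    parsed = parse_aliases(aliases)
    return all(parsed.get(key) == value for key, value in wanted.items())


# A few aliases copied from the response above.
sample_aliases = [
    'submitter_donor_id:',
    'file_type:crai',
    'submittedSpecimenId:SRS1304405',
    'workflow:topmed-spinnaker:Alpha Build 1',
    'access:public',
]

parsed = parse_aliases(sample_aliases)
print(parsed['workflow'])  # topmed-spinnaker:Alpha Build 1
print(has_aliases(sample_aliases, file_type='crai', access='public'))  # True
```

This kind of client-side filtering can narrow down results further after a single-alias query to the service.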

## Getting Data Objects by id

Using a `GetDataObjectRequest` we can request individual Data Objects from the dos-azul-lambda.

First, we'll grab an identifier from our previous list response.

In [39]:
data_object_id = response.json()['data_objects'][0]['id']
print(data_object_id)

ef1543a1-fbb5-44f5-b941-ccef5cf53962


We can then use simple HTTP requests to get that individual Data Object:

In [40]:
!curl $lambda_url/ga4gh/dos/v1/dataobjects/$data_object_id

{"data_object": {"updated": "2018-02-22T16:36:15.032391Z", "name": "NWD467250.b38.irc.v1.cram.crai", "version": "2018-02-28T090140.829798Z", "urls": [{"url": "s3://nih-nhlbi-datacommons/NWD467250.b38.irc.v1.cram.crai"}], "checksums": [{"checksum": "95f7049e2d21f1c3055ea704ac3cc9034826a8ca", "type": "md5"}], "size": "1366325", "id": "ef1543a1-fbb5-44f5-b941-ccef5cf53962", "aliases": ["repoDataBundleId:3f107d36-0ad5-525a-8755-f132a5fcb979", "center_name:Broad", "submitter_donor_id:", "sampleId:f1781010-452b-5372-8bef-3056153b7d6b", "submittedSampleId:NWD467250", "repoBaseUrl:", "analysis_type:alignment", "repoCode:Redwood-AWS-Oregon", "repoCountry:US", "file_type:crai", "submittedSpecimenId:SRS1304405", "file_version:2018-02-28T090140.829798Z", "workflow:topmed-spinnaker:Alpha Build 1", "access:public", "fileMd5sum:95f7049e2d21f1c3055ea704ac3cc9034826a8ca", "program:NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish", "repoType:Blue Box", "repoName:Redwood-AWS-Oregon", "donor:

We can also use the `requests` module to perform the same action:

In [41]:
response = requests.get('{}/ga4gh/dos/v1/dataobjects/{}'.format(lambda_url, data_object_id))
print(response.json())
print("")
print('The first URL to access the Data Object')
print(response.json()['data_object']['urls'][0])

{u'data_object': {u'updated': u'2018-02-22T16:36:15.032391Z', u'name': u'NWD467250.b38.irc.v1.cram.crai', u'version': u'2018-02-28T090140.829798Z', u'urls': [{u'url': u's3://nih-nhlbi-datacommons/NWD467250.b38.irc.v1.cram.crai'}], u'checksums': [{u'checksum': u'95f7049e2d21f1c3055ea704ac3cc9034826a8ca', u'type': u'md5'}], u'aliases': [u'repoDataBundleId:3f107d36-0ad5-525a-8755-f132a5fcb979', u'center_name:Broad', u'submitter_donor_id:', u'sampleId:f1781010-452b-5372-8bef-3056153b7d6b', u'submittedSampleId:NWD467250', u'repoBaseUrl:', u'analysis_type:alignment', u'repoCode:Redwood-AWS-Oregon', u'repoCountry:US', u'file_type:crai', u'submittedSpecimenId:SRS1304405', u'file_version:2018-02-28T090140.829798Z', u'workflow:topmed-spinnaker:Alpha Build 1', u'access:public', u'fileMd5sum:95f7049e2d21f1c3055ea704ac3cc9034826a8ca', u'program:NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish', u'repoType:Blue Box', u'repoName:Redwood-AWS-Oregon', u'donor:81c975fa-be54-5098-8723-c35eb7
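
Two small conveniences are sketched below under stated assumptions: `data_object_url` and `first_url` are hypothetical helpers, not part of the DOS client. The first avoids the doubled slash that results when the base URL already ends in `/`, as `lambda_url` does here; the second returns the URL string itself rather than the enclosing dictionary, and tolerates a response without URLs.

```python
def data_object_url(base_url, data_object_id):
    """Build the GetDataObject URL without doubling the slash after the base."""
    return '{}/ga4gh/dos/v1/dataobjects/{}'.format(
        base_url.rstrip('/'), data_object_id)


def first_url(response_body):
    """Return the first access URL in a GetDataObject response, or None."""
    urls = response_body.get('data_object', {}).get('urls', [])
    return urls[0]['url'] if urls else None


# Trimmed copy of the response shown above.
body = {
    'data_object': {
        'id': 'ef1543a1-fbb5-44f5-b941-ccef5cf53962',
        'name': 'NWD467250.b38.irc.v1.cram.crai',
        'urls': [
            {'url': 's3://nih-nhlbi-datacommons/NWD467250.b38.irc.v1.cram.crai'},
        ],
    }
}

print(data_object_url('https://example.org/api/', body['data_object']['id']))
print(first_url(body))
```

Here `https://example.org/api/` is a placeholder base URL used only for illustration.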

## Using the DOS Client to access dos-azul-lambda

Version 0.2.1 of ga4gh-dos-schemas includes a client that can be used to access services like the dos-azul-lambda. First, we import the client and instantiate it to point at the service.

In [42]:
from ga4gh.dos.client import Client
client = Client(lambda_url)
c = client.client
models = client.models

Now we can access models in the Data Object Service schemas and issue requests without having to write out the endpoints!

In [43]:
response = c.GetDataObject(data_object_id=data_object_id).result()
print(data_object_id)
print(response.data_object.id)
print(response.data_object.aliases[10])

ef1543a1-fbb5-44f5-b941-ccef5cf53962
ef1543a1-fbb5-44f5-b941-ccef5cf53962
submittedSpecimenId:SRS1304405


The client can also be used to make a `ListDataObjectsRequest`. We'll search using the alias `submittedSpecimenId:SRS1304405` and should get this Data Object back somewhere in the response.

In [44]:
list_response = c.ListDataObjects(alias=response.data_object.aliases[10]).result()
print(len(list_response.data_objects))

2


Two Data Objects were returned, presumably the cram and crai files for the specimen. We can confirm that our `data_object_id` appears in the results:

In [45]:
print(data_object_id in [x.id for x in list_response.data_objects])

True


## Updating a Data Object with a new Alias

One of the advantages of having a request schema like the one DOS provides is that we can easily use it to interoperate with underlying systems in ways that suit a use case.

Here we want to be able to "tag" an object after the fact, without having to modify the store that feeds into the azul-index. This is done by sending an `UpdateDataObjectRequest`, which the dos-azul-lambda knows how to convert to the underlying index.

The update attempts to respect existing keys: a list of `safe_keys` maintained in the code names the keys that are allowed to be modified.
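
To make the rule concrete, here is a sketch of how such a merge could behave. It is not the lambda's actual code: `merge_aliases` and the contents of `SAFE_KEYS` are assumptions made for illustration, with `doi` taken from the example later in this notebook. New keys are added, keys in the safe list are overwritten, and any other existing key keeps its original value.

```python
SAFE_KEYS = ('doi',)  # hypothetical; the real list lives in the lambda's code


def merge_aliases(existing, incoming, safe_keys=SAFE_KEYS):
    """Merge incoming 'key:value' aliases into the existing ones."""
    merged = []
    position = {}  # alias key -> index in merged
    for alias in existing:
        position[alias.partition(':')[0]] = len(merged)
        merged.append(alias)
    for alias in incoming:
        key = alias.partition(':')[0]
        if key not in position:
            # A brand-new key is always accepted.
            position[key] = len(merged)
            merged.append(alias)
        elif key in safe_keys:
            # An existing key may only be rewritten if it is a safe key.
            merged[position[key]] = alias
        # Otherwise the existing value is left untouched.
    return merged


existing = ['center_name:Broad', 'doi:10.0.0.1/old']
incoming = ['center_name:UW', 'doi:10.0.0.1/12345', 'note:hello']
print(merge_aliases(existing, incoming))
# ['center_name:Broad', 'doi:10.0.0.1/12345', 'note:hello']
```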

### Authorizing a Request

Since this service makes modifications to an index, an `access_token` is required. The token is set in the chalice `config.json` and passed as an environment variable to each instance of the lambda.

A convenience endpoint for testing tokens is provided:

In [46]:
access_token = "f4ce9d3d23f4ac9dfdc3c825608dc660"
headers = {'access_token': access_token}
auth_check = requests.get("{}/test_token".format(lambda_url), headers=headers)
print(auth_check.json())

{u'authorized': True}
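
For intuition, a server-side check of this kind might look like the sketch below. This is an assumption about the general approach, not the lambda's implementation: the function name `is_authorized` and the environment variable name `ACCESS_TOKEN` are hypothetical. The comparison uses the standard library's `hmac.compare_digest`, which avoids leaking information through timing.

```python
import hmac
import os


def is_authorized(headers, environ=os.environ):
    """Compare the access_token header against a token from the environment."""
    expected = environ.get('ACCESS_TOKEN')  # hypothetical variable name
    supplied = headers.get('access_token')
    if not expected or not supplied:
        return False
    # Constant-time comparison of the two tokens as bytes.
    return hmac.compare_digest(supplied.encode('utf-8'),
                               expected.encode('utf-8'))


# Use a plain dict in place of the real environment for the demonstration.
fake_environ = {'ACCESS_TOKEN': 'example-token'}
print(is_authorized({'access_token': 'example-token'}, environ=fake_environ))
print(is_authorized({'access_token': 'wrong-token'}, environ=fake_environ))
```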


We can now add aliases to our Data Object as needed to support various use cases. We'll add a key called `doi` which is meant to represent a Digital Object Identifier, but could be any string. `doi` is one of the safe keys we can rewrite.

### Getting an Object to Update

We can reuse a Data Object from above for this demonstration.

In [47]:
data_object = response.data_object
print(data_object['id'])
print(data_object['aliases'][0:5])

ef1543a1-fbb5-44f5-b941-ccef5cf53962
[u'repoDataBundleId:3f107d36-0ad5-525a-8755-f132a5fcb979', u'center_name:Broad', u'submitter_donor_id:', u'sampleId:f1781010-452b-5372-8bef-3056153b7d6b', u'submittedSampleId:NWD467250']


It already has a number of aliases. To upsert a new value, we append it to the alias list, much as you might expect:

### Modifying a Data Object to include a DOI

In [73]:
fake_doi_alias = 'doi:10.0.0.1/12345'
data_object['aliases'].append(fake_doi_alias)

### Making an UpdateDataObjectRequest

We then make an `UpdateDataObjectRequest` which includes the fields we want to update. The dos-azul-lambda allows writing to new fields, or to fields named in its list of `safe_keys`.

In [74]:
UpdateDataObjectRequest = models.get_model('UpdateDataObjectRequest')
update_request = UpdateDataObjectRequest(data_object=data_object)
update_response = c.UpdateDataObject(
    data_object_id=data_object.id,
    body=update_request,
    _request_options={"headers": {"access_token": access_token}}).result()
print(update_response['data_object_id'])

ef1543a1-fbb5-44f5-b941-ccef5cf53962


It returns the identifier of the Data Object updated.

### Verifying the new alias


First, we'll get the Data Object by identifier to make sure that it has our new metadata. We do this by performing a `GetDataObject` request using the server-specific identifier (a UUID).

In [76]:
doi = c.GetDataObject(data_object_id=data_object.id).result().data_object.aliases[27]
print(doi == fake_doi_alias)
print(doi)

True
doi:10.0.0.1/12345


The returned alias agrees with the alias we wanted to set!
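
The check above reads the alias at a fixed position, which holds only as long as the index returns aliases in the same order. A position-independent lookup is sketched below; `find_alias` is a hypothetical helper, not part of the DOS client, and the sample list is a shortened copy of the aliases shown earlier with the new `doi` alias added.

```python
def find_alias(aliases, key):
    """Return the first alias whose key (text before the first colon) matches."""
    prefix = key + ':'
    for alias in aliases:
        if alias.startswith(prefix):
            return alias
    return None


sample_aliases = [
    'center_name:Broad',
    'file_type:crai',
    'submittedSpecimenId:SRS1304405',
    'doi:10.0.0.1/12345',
]

print(find_alias(sample_aliases, 'doi'))  # doi:10.0.0.1/12345
print(find_alias(sample_aliases, 'study'))  # None
```

With the client response from above, the call would be `find_alias(response.data_object.aliases, 'doi')`.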

### Finding Data Objects by Alias

Now that we know the alias is there as expected, we can make a request to find Data Objects by that alias. We would like to be able to tag items with GUIDs, and have them be immediately and easily findable using the same interface.

To do that we perform a `ListDataObjectsRequest` with the alias we would like to filter for.

In [79]:
list_response = c.ListDataObjects(alias=fake_doi_alias).result()
print(list_response.data_objects[0].id)
print(data_object.id == list_response.data_objects[0].id)
print(list_response.data_objects[0].aliases[27])

ef1543a1-fbb5-44f5-b941-ccef5cf53962
True
doi:10.0.0.1/12345


The returned Data Object matched our alias request and contained the expected metadata!

## Looking forward

For more information on the Data Object Service schemas in general, check out the [DOS github](https://github.com/ga4gh/data-object-service-schemas)!

This effort is meant to make it easy to tag data from the HCA DSS. Future work to maintain authentication across accounts would allow for collaborative metadata tagging and filtering.

The azul-index currently flattens to a map of strings. As the mapping of the index changes, more features may become possible.