# GA4GH DOS - GDC Example

Data stored in the signpost service are issued identifiers and made available for use in the NCI Genomic Data Commons.

In an effort to maintain an interoperability layer that is inclusive of all implementations of data access services, we offer the Data Object Service.

## Design

```                                                                                         
+------------------+      +--------------+        +-------------------+
| ga4gh-dos-client |------|dos-gdc-lambda|--------|api.gdc.cancer.gov |
+--------|---------+      +--------------+        +-------------------+
         |                        |                                                         
         |                        |                                                         
         |------------------swagger.json                                                    
```

For this pilot we have created a lambda that provides a lightweight layer for accessing data in signpost using GA4GH libraries.

There is an example of using Python with the [GDC API here](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/).

The lambda accepts GA4GH requests and converts them into requests against the requisite signpost endpoints. The results are then translated into GA4GH-style messages before being returned to the client.
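The translation step can be pictured with a small sketch. This is not the lambda's actual code: the input keys (`did`, `urls`, `hashes`) and the `type` field on each checksum are assumptions made for illustration, and `signpost_to_dos` is a hypothetical helper name. The output fields `id`, `urls[].url` and `checksums[].checksum` mirror the attributes used on Data Objects elsewhere in this notebook.

```python
def signpost_to_dos(record):
    """Translate a signpost-style record (dict) into a DOS-style Data Object (dict).

    The input keys ('did', 'urls', 'hashes') are assumed for illustration
    and may not match the real signpost response.
    """
    return {
        'id': record['did'],
        'urls': [{'url': url} for url in record.get('urls', [])],
        # One checksum entry per hash algorithm present in the record.
        'checksums': [
            {'checksum': digest, 'type': algorithm}
            for algorithm, digest in sorted(record.get('hashes', {}).items())
        ],
    }


# Hypothetical record that reuses an identifier, URL and checksum
# appearing in the signpost section of this notebook.
example_record = {
    'did': '00000009-abcb-554e-8a9a-4610e946e548',
    'urls': ['https://s3.amazonaws.com/noaa-nexrad-level2/2002/12/31/KBYX/KBYX20021231_203851.gz'],
    'hashes': {'md5': 'f8d7524668e9fb2580b809052e509694'},
}
example_object = signpost_to_dos(example_record)
print(example_object['urls'][0]['url'])
```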

To make it easy for developers to create clients against this API, the OpenAPI description is made available, which we will see later.
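As a rough illustration of what such a description offers a client author, the sketch below lists the operations declared in a Swagger 2.0 document. The operation names `ListDataObjects` and `GetDataObject` are the ones used later in this notebook; the paths and HTTP methods in `example_swagger` are assumed, and `list_operations` is a hypothetical helper, not part of the DOS client.

```python
def list_operations(swagger):
    """Return sorted (operationId, HTTP method, path) tuples from a Swagger 2.0 dict."""
    http_methods = {'get', 'put', 'post', 'delete', 'options', 'head', 'patch'}
    operations = []
    for path, path_item in swagger.get('paths', {}).items():
        for method, operation in path_item.items():
            # Path items may also hold non-operation keys such as 'parameters'.
            if method in http_methods and 'operationId' in operation:
                operations.append((operation['operationId'], method.upper(), path))
    return sorted(operations)


# Hypothetical fragment of a swagger.json; the paths and methods are assumed.
example_swagger = {
    'swagger': '2.0',
    'paths': {
        '/dataobjects/list': {'post': {'operationId': 'ListDataObjects'}},
        '/dataobjects/{data_object_id}': {'get': {'operationId': 'GetDataObject'}},
    },
}
for operation_id, method, path in list_operations(example_swagger):
    print('{} {} {}'.format(operation_id, method, path))
```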

## Initializing the DOS GA4GH Client

* Note, the URLs are subject to change!

We begin by initializing the client, which will access the lambda to get the swagger description and may take a moment.

To install this client use `pip install git+git://github.com/david4096/data-object-schemas@fixes-cleanup --process-dependency-links`.

In [11]:
from ga4gh.dos.client import Client
local_client = Client('https://gmyakqsfp8.execute-api.us-west-2.amazonaws.com/api/')

For convenience, we then initialize a few objects that will make it easier to use the DOS endpoint.

In [12]:
client = local_client.client
models = local_client.models

## Listing data from GDC via GA4GH

Now that we have initialized the DOS client against the DOS-GDC lambda, we can access data using GA4GH methods.

In [13]:
ListDataObjectsRequest = models.get_model('ga4ghListDataObjectsRequest')
list_request = client.ListDataObjects(body=ListDataObjectsRequest(page_size=100))
list_response = list_request.result()
print("Number of Data Objects: {} ".format(len(list_response.data_objects)))

Number of Data Objects: 100 


In [14]:
data_object = client.GetDataObject(
    data_object_id=list_response.data_objects[1].id).result().data_object
print(data_object.urls)

[ga4ghURL(system_metadata=protobufStruct(fields=None), url=u'https://api.gdc.cancer.gov/data/8df6b042-a108-4fc8-8419-084250b2418e', user_metadata=None)]


## Downloading data using DOS

For publicly available data, we can quickly download the files using the DOS client.

In [15]:
# https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
import requests

def download_file(url, filename):
    """Stream the contents of url into filename and return filename."""
    # stream=True avoids loading the whole file into memory at once.
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
    return filename

In [16]:
download_file(data_object.urls[0].url, data_object.id)

u'8df6b042-a108-4fc8-8419-084250b2418e'

## Verifying a checksum

Now that we have downloaded a file we can verify the checksum on that file against what is in the DOS record.

In [17]:
import hashlib
# https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

In [18]:
print(md5(data_object.id))
print(data_object.checksums[0].checksum)
# If these don't match you probably tried to download a controlled access file.

9372f1fec5a04f91428815f3d35e075e
9372f1fec5a04f91428815f3d35e075e


# Accessing data from signpost

```                                                                                         
+------------------+      +-------------------+        +----------------------------------+
| ga4gh-dos-client |------|dos-signpost-lambda|--------|signpost.opensciencedatacloud.org |
+--------|---------+      +-------------------+        +----------------------------------+
         |                        |                                                         
         |                        |                                                         
         |------------------swagger.json                                                    
```

A lambda similar to the one arranged for the GDC public API is created for signpost.

## Listing data from signpost via GA4GH DOS lambda

We'll now instantiate a client that is directed at the lambda service that will make data from `signpost.opensciencedatacloud.org` available.

The source for this lambda is available at https://github.com/david4096/dos-signpost-lambda

In [19]:
signpost_client = Client('https://wfzf7mc8i2.execute-api.us-west-2.amazonaws.com/api/')
client = signpost_client.client

ListDataObjectsRequest = models.get_model('ga4ghListDataObjectsRequest')
list_request = client.ListDataObjects(body=ListDataObjectsRequest(page_size=10))

list_response = list_request.result()
print("Number of Data Objects: {} ".format(len(list_response.data_objects)))

Number of Data Objects: 10 


Signpost returns a list of identifiers by default, a pattern we can copy using DOS. However, this means that the `DataObjects` returned from the list request need to be materialized into full documents with a follow-up `GetDataObject` call.

In [20]:
data_object = client.GetDataObject(data_object_id=list_response.data_objects[0].id).result().data_object

In [21]:
print(data_object.urls)

[ga4ghURL(system_metadata=None, url=u'https://s3.amazonaws.com/noaa-nexrad-level2/2002/12/31/KBYX/KBYX20021231_203851.gz', user_metadata=None)]
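The same call can be repeated for every entry in a list response. The sketch below assumes only the call shape already used in this notebook, `client.GetDataObject(data_object_id=...).result().data_object`. The `materialize` helper and the `Fake*` stand-ins are hypothetical and exist so the example runs without network access.

```python
from collections import namedtuple


def materialize(client, stubs):
    """Fetch the full Data Object for each stub returned by a list request."""
    return [
        client.GetDataObject(data_object_id=stub.id).result().data_object
        for stub in stubs
    ]


# Minimal stand-ins that mimic the shape of the DOS client's responses.
Stub = namedtuple('Stub', ['id'])
FullObject = namedtuple('FullObject', ['id', 'urls'])
Response = namedtuple('Response', ['data_object'])


class FakeFuture(object):
    def __init__(self, response):
        self._response = response

    def result(self):
        return self._response


class FakeClient(object):
    """Stand-in for the real client; returns a placeholder URL per identifier."""

    def GetDataObject(self, data_object_id):
        full = FullObject(id=data_object_id,
                          urls=['https://example.org/' + data_object_id])
        return FakeFuture(Response(data_object=full))


full_objects = materialize(FakeClient(), [Stub('a'), Stub('b')])
print([obj.id for obj in full_objects])
```

With the real client this would be called as `materialize(client, list_response.data_objects)`. It issues one request per object, so a small `page_size` keeps the number of round trips manageable.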


## Download data from signpost

We can reuse the `download_file` function defined above. This feature will be offered by a standalone [dos-downloader](https://github.com/david4096/dos-downloader).

In [22]:
download_file(data_object.urls[0].url, data_object.id)

u'00000009-abcb-554e-8a9a-4610e946e548'

## Verifying data from signpost

Signpost offers MD5 checksums to verify files. Again, we can reuse the `md5` function defined above. Checksum verification could be included with any downloader.

In [23]:
print(md5(data_object.id))
print(data_object.checksums[0].checksum)

f8d7524668e9fb2580b809052e509694
f8d7524668e9fb2580b809052e509694
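One way a downloader could fold verification into the transfer is to update the hash as each chunk is written, so the file is not read a second time. The following is a minimal sketch under that assumption, not the design of dos-downloader. `write_and_hash` is a hypothetical helper, and the demonstration uses in-memory chunks rather than a network response.

```python
import hashlib
import os
import tempfile


def write_and_hash(chunks, filename):
    """Write an iterable of byte chunks to filename and return their MD5 hex digest."""
    digest = hashlib.md5()
    with open(filename, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                digest.update(chunk)
    return digest.hexdigest()


# Demonstration with in-memory chunks in a temporary directory.
example_chunks = [b'hello ', b'', b'world']
example_path = os.path.join(tempfile.mkdtemp(), 'example.bin')
example_md5 = write_and_hash(example_chunks, example_path)

# The streamed digest equals the digest of the concatenated content.
print(example_md5 == hashlib.md5(b'hello world').hexdigest())
```

With `requests`, the chunks could come from `r.iter_content(chunk_size=1024)` on a response opened with `stream=True`, and the returned digest would be compared against `data_object.checksums[0].checksum`.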


# Performing an analysis

Since some data in the GDC API are publicly available and DOS replicates their metadata, we can access public HTSeq counts to perform an analysis. An example is available in the following notebook:

https://github.com/david4096/dos-gdc-lambda/blob/master/gdc-analyze-htseq.ipynb