# GA4GH DOS - GDC Example

Data stored in the signpost service are issued identifiers and made available for use in the NCI Genomic Data Commons.

In an effort to maintain an interoperability layer that is inclusive of all implementations of data access services, we offer the Data Object Service.

## Design

```                                                                                         
+------------------+      +--------------+        +-------------------+
| ga4gh-dos-client |------|dos-gdc-lambda|--------|api.gdc.cancer.gov |
+--------|---------+      +--------------+        +-------------------+
         |                        |                                                         
         |                        |                                                         
         |------------------swagger.json                                                    
```

For this pilot we have created a lambda that provides a lightweight layer for accessing data in signpost using GA4GH libraries.

There is an example of using Python with the GDC API [here](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/).

The lambda accepts GA4GH requests and converts them into requests against requisite signpost endpoints. The results are then translated into GA4GH style messages before being returned to the client.
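The translation step can be pictured with a small sketch. This is not the lambda's actual code. It assumes a GDC file record that carries `file_id`, `file_name`, `file_size`, and `md5sum` fields, and maps it onto the DOS fields used in this document (`id`, `urls`, `checksums`). The function name `gdc_to_dos` and the sample file name and size are hypothetical.

```python
# Download URL pattern taken from the URLs shown in this document.
GDC_DATA_URL = "https://api.gdc.cancer.gov/data/"


def gdc_to_dos(record):
    """Map a GDC-style file record (assumed field names) to a DOS-style dict."""
    return {
        "id": record["file_id"],
        "name": record.get("file_name"),
        "size": record.get("file_size"),
        "checksums": [{"checksum": record["md5sum"], "type": "md5"}],
        "urls": [{"url": GDC_DATA_URL + record["file_id"]}],
    }


example = gdc_to_dos({
    "file_id": "c5c4b4a3-3224-4a72-a883-c99c7747e47b",
    "file_name": "example.txt",  # hypothetical value
    "file_size": 1024,           # hypothetical value
    "md5sum": "f7beee5951c58b5c99ce0c5ae6c2c5f1",
})
print(example["urls"][0]["url"])
```

Keeping the mapping in one pure function makes it easy to test the translation without calling either service.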

To make it easy for developers to create clients against this API, the lambda serves its OpenAPI (swagger) description, which the client loads when it is initialized below.
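As an illustration of what a client can learn from such a description, here is a minimal sketch that lists the operations declared in a Swagger 2.0 document. The `spec` dictionary is a hand-written stand-in, not the lambda's real description, and the paths in it are assumptions. Only the operation names `ListDataObjects` and `GetDataObject` come from the calls used in this document. The helper name `list_operations` is hypothetical.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def list_operations(spec):
    """Return sorted (operationId, METHOD, path) tuples from a Swagger 2.0 dict."""
    operations = []
    for path, path_item in spec.get("paths", {}).items():
        for key, value in path_item.items():
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((value.get("operationId", ""), key.upper(), path))
    return sorted(operations)


# Stand-in description; the paths are assumed for illustration only.
spec = {
    "swagger": "2.0",
    "paths": {
        "/dataobjects/list": {
            "post": {"operationId": "ListDataObjects"},
        },
        "/dataobjects/{data_object_id}": {
            "parameters": [],
            "get": {"operationId": "GetDataObject"},
        },
    },
}

for operation_id, method, path in list_operations(spec):
    print(operation_id, method, path)
```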

## Initializing the DOS GA4GH Client

* Note, the URLs are subject to change!

We begin by initializing the client. Initialization fetches the swagger description from the lambda, so it may take a moment.

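To install this client use `pip install git+git://github.com/david4096/data-object-schemas@dos-minimal2 --process-dependency-links`. Note that pip 19.0 removed the `--process-dependency-links` option and GitHub no longer serves the unauthenticated `git://` protocol, so with current tooling the URL may need to be `git+https://github.com/david4096/data-object-schemas@dos-minimal2`, without that option.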

In [1]:
from ga4gh.dos.client import Client
local_client = Client('https://gmyakqsfp8.execute-api.us-west-2.amazonaws.com/api/')

For convenience, we then initialize a few objects that will make it easier to use the DOS endpoint.

In [2]:
client = local_client.client
models = local_client.models

## Listing data from GDC via GA4GH

Now that we have initialized the DOS client against the DOS-GDC lambda, we can access data using GA4GH methods.

In [4]:
ListDataObjectsRequest = models.get_model('ga4ghListDataObjectsRequest')
list_request = client.ListDataObjects(body=ListDataObjectsRequest(page_size=100))
list_response = list_request.result()
print("Number of Data Objects: {} ".format(len(list_response.data_objects)))

Number of Data Objects: 100 
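The request above asks for a single page of 100 objects. If the service supports paging through a `page_token` request field and a `next_page_token` response field (an assumption; this document does not confirm that the lambda implements them), the pages could be walked with a generator like the sketch below. `iter_pages` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a call such as `client.ListDataObjects(...).result()` so that the example runs without network access.

```python
def iter_pages(fetch_page, page_size=100):
    """Yield items page by page until the fetcher returns no next token."""
    token = None
    while True:
        items, token = fetch_page(page_size, token)
        for item in items:
            yield item
        if not token:
            break


# Stand-in data source with five objects.
DATA = ["obj-{}".format(i) for i in range(5)]


def fake_fetch(page_size, token):
    """Return (items, next_token); the token is simply the next start index."""
    start = int(token) if token else 0
    end = start + page_size
    next_token = str(end) if end < len(DATA) else None
    return DATA[start:end], next_token


all_items = list(iter_pages(fake_fetch, page_size=2))
print(all_items)
```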


In [5]:
data_object = client.GetDataObject(
    data_object_id=list_response.data_objects[0].id).result().data_object
print(data_object.urls)

[ga4ghURL(system_metadata=protobufStruct(fields=None), url=u'https://api.gdc.cancer.gov/data/c5c4b4a3-3224-4a72-a883-c99c7747e47b', user_metadata=None)]


## Downloading data using DOS

For publicly available data, we can quickly download the files using the DOS client.

In [6]:
# https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
import requests
def download_file(url, filename):
    """Stream the content at `url` to `filename` and return the filename."""
    # stream=True avoids loading the whole file into memory
    r = requests.get(url, stream=True)
    r.raise_for_status()  # fail on HTTP errors instead of saving an error page
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return filename

In [7]:
download_file(data_object.urls[0].url, data_object.id)

u'c5c4b4a3-3224-4a72-a883-c99c7747e47b'

## Verifying a checksum

Now that we have downloaded a file we can verify the checksum on that file against what is in the DOS record.

In [8]:
import hashlib
# https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

In [9]:
print(md5(data_object.id))
print(data_object.checksums[0].checksum)

f7beee5951c58b5c99ce0c5ae6c2c5f1
f7beee5951c58b5c99ce0c5ae6c2c5f1
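Comparing the two printed values by eye works for one file. For scripted use, the comparison could be wrapped in a helper that raises on a mismatch. The sketch below assumes the checksum in the DOS record is an MD5 hex digest, as it is for the object above. The names `file_md5` and `verify_md5` are hypothetical, and the demo runs against a temporary file rather than a GDC download.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=4096):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file matches `expected`, otherwise raise ValueError."""
    actual = file_md5(path)
    if actual.lower() != expected.strip().lower():
        raise ValueError(
            "checksum mismatch: expected {}, got {}".format(expected, actual))
    return True


# Demo on a temporary file so the example runs offline.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    ok = verify_md5(tmp.name, hashlib.md5(b"hello").hexdigest())
    try:
        verify_md5(tmp.name, "0" * 32)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
finally:
    os.remove(tmp.name)

print(ok, mismatch_detected)
```

With the objects from this document, the call would look like `verify_md5(data_object.id, data_object.checksums[0].checksum)`.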
