# Accessing DSS via DOS

Data in the HCA DSS is replicated across cloud stores, so it can be downloaded from the "nearest" location to avoid egress fees.

The data in the DSS is made available through the Data Object Service (DOS) schemas, which provide an interoperable way to expose replicated and versioned data over a simple HTTP API.

## Using the requests module

To access services over HTTP we use the requests module.

In [4]:
from pprint import pprint
import requests
SERVICE_URL = "https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api"
print(SERVICE_URL)

https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api


The `ListDataObjects` method has not been implemented yet. However, the DSS's bundle-oriented index can be accessed using `ListDataBundles`.

### Listing Data Bundles

In [5]:
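# Path prefix shared by all DOS endpoints on this service.
BASE_URL = "ga4gh/dos/v1"
LIST_DATA_BUNDLES_URL = "{}/{}/{}".format(SERVICE_URL, BASE_URL, "databundles")
# The response is a JSON object; the bundle list sits under 'data_bundles'.
data_bundles = requests.get(LIST_DATA_BUNDLES_URL).json()['data_bundles']
pprint(data_bundles)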

[{'id': 'e9556c3d-53cb-4c24-856e-951085735d45', 'version': '2018-06-07T001704'},
 {'id': '15a0ce60-261d-4bc1-8463-f4d87aa483f0', 'version': '2018-06-07T001659'},
 {'id': 'b7003567-37a6-4f70-8be3-0e8ee5c1f020', 'version': '2018-06-07T001844'},
 {'id': 'd1dca21c-71a3-466d-8690-1212c22491c3', 'version': '2018-06-07T001714'},
 {'id': '41680495-06a3-4963-9d2c-9280c6e9979b', 'version': '2018-06-07T001749'},
 {'id': 'c3a74a9d-aebb-4fcb-a664-c49f8abbaa8c', 'version': '2018-06-07T001808'},
 {'id': '4eb3190b-de14-4248-b143-65084c930741', 'version': '2018-06-07T001848'},
 {'id': 'c70c2f7d-5e68-468e-bc66-36e1c4caac80', 'version': '2018-06-07T001908'},
 {'id': '29d2be81-e0e0-484b-81ac-efa385f1e9bc', 'version': '2018-06-07T001812'},
 {'id': '65b0e718-6b30-4833-8039-526bd9ced487', 'version': '2018-06-07T001822'}]
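
The endpoint URLs in this notebook are all assembled from the same parts: the service URL, the DOS base path, and one or more resource segments. A small helper can keep that construction in one place. This is only a sketch; the name `dos_url` is hypothetical and not part of any library.

```python
def dos_url(service_url, base_url, *parts):
    """Join a service URL, a DOS base path, and resource segments with single slashes."""
    segments = [service_url.rstrip("/"), base_url.strip("/")]
    # Strip stray slashes so that callers may pass "databundles" or "/databundles/".
    segments.extend(str(part).strip("/") for part in parts)
    return "/".join(segments)
```

With the variables defined above, `dos_url(SERVICE_URL, BASE_URL, "databundles")` should produce the same string as `LIST_DATA_BUNDLES_URL`.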


### Creating a BDBag whose `fetch.txt` links to every data object in the listed data bundles

In [6]:
ls -l

total 72
-rw-rw-r-- 1 michael michael  8910 Jul 10 13:16 app.py
-rw-rw-r-- 1 michael michael  6448 Jul 11 11:50 example-usage.ipynb
-rw-rw-r-- 1 michael michael 12085 Jul 20 13:30 example-usage-mk.ipynb
-rw-rw-r-- 1 michael michael  1072 Jul 10 13:16 LICENSE
drwxrwxr-x 2 michael michael  4096 Jul 20 13:28 __pycache__/
-rw-rw-r-- 1 michael michael  4565 Jul 10 13:16 README.md
-rw-rw-r-- 1 michael michael  7205 Jul 20 12:59 remote_to_bag.py
-rw-rw-r-- 1 michael michael    22 Jul 11 13:24 requirements.txt
-rw-rw-r-- 1 michael michael   816 Jul 13 15:32 test_data_object.json
-rw-rw-r-- 1 michael michael  6533 Jul 20 11:28 test_RemoteToBag.py


In [7]:
from remote_to_bag import make_bag
make_bag(data_bundles, SERVICE_URL, BASE_URL)

In [8]:
ls -l

total 76
-rw-rw-r-- 1 michael michael  8910 Jul 10 13:16 app.py
drwxrwxr-x 3 michael michael  4096 Jul 20 13:32 bag_path/
-rw-rw-r-- 1 michael michael  6448 Jul 11 11:50 example-usage.ipynb
-rw-rw-r-- 1 michael michael 11036 Jul 20 13:32 example-usage-mk.ipynb
-rw-rw-r-- 1 michael michael  1072 Jul 10 13:16 LICENSE
drwxrwxr-x 2 michael michael  4096 Jul 20 13:28 __pycache__/
-rw-rw-r-- 1 michael michael  4565 Jul 10 13:16 README.md
-rw-rw-r-- 1 michael michael  7205 Jul 20 12:59 remote_to_bag.py
-rw-rw-r-- 1 michael michael    22 Jul 11 13:24 requirements.txt
-rw-rw-r-- 1 michael michael   816 Jul 13 15:32 test_data_object.json
-rw-rw-r-- 1 michael michael  6533 Jul 20 11:28 test_RemoteToBag.py


In [9]:
ls -l bag_path/

total 28
-rw-rw-r-- 1 michael michael  290 Jul 20 13:32 bag-info.txt
-rw-rw-r-- 1 michael michael   55 Jul 20 13:32 bagit.txt
drwxrwxr-x 2 michael michael 4096 Jul 20 13:32 data/
-rw-rw-r-- 1 michael michael 1496 Jul 20 13:32 fetch.txt
-rw-rw-r-- 1 michael michael  656 Jul 20 13:32 manifest-sha1.txt
-rw-rw-r-- 1 michael michael  896 Jul 20 13:32 manifest-sha256.txt
-rw-rw-r-- 1 michael michael  396 Jul 20 13:32 tagmanifest-sha256.txt


In [10]:
!cat bag_path/fetch.txt

https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/8ff23235-4435-4929-8fb2-5d55b4564999	5897	data/dss_data_object_0
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/e8d5b3b5-1a49-4765-8670-e39860cce6c5	5897	data/dss_data_object_12
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/8012e3d0-e6b6-4052-b079-2b9ab66def30	5894	data/dss_data_object_15
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/35a5c756-cb0c-4f43-be1d-967988151676	5894	data/dss_data_object_18
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/ccccf270-1b5b-49a0-abc8-c89674434c75	5889	data/dss_data_object_21
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/1ebc2aa4-de12-452e-a02b-0ebfccd0b053	5888	data/dss_data_object_24
https://ekivlnizh1.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/c28005a1-91
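
Each complete line of `fetch.txt` above carries three whitespace-separated fields: a URL, a length in bytes, and a path inside the bag. As a rough sketch of how those entries could be read back into Python, the helpers below split each line into its fields. `FetchEntry`, `parse_fetch_line`, and `parse_fetch_text` are hypothetical names introduced here for illustration; they are not part of `remote_to_bag.py` or bdbag.

```python
from collections import namedtuple

FetchEntry = namedtuple("FetchEntry", ["url", "length", "filename"])


def parse_fetch_line(line):
    """Split one fetch.txt line into URL, length, and bag-relative filename."""
    # maxsplit=2 keeps the filename whole even if it contains spaces.
    url, length, filename = line.strip().split(None, 2)
    # BagIt permits "-" in place of the length when the size is unknown.
    size = None if length == "-" else int(length)
    return FetchEntry(url, size, filename)


def parse_fetch_text(text):
    """Parse the contents of a fetch.txt file, skipping blank lines."""
    return [parse_fetch_line(line) for line in text.splitlines() if line.strip()]
```

A line with fewer than three fields, such as a truncated one, raises `ValueError` rather than being silently accepted.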

### Getting Data Bundle details

Now that we have some Data Bundle identifiers, we can use `GetDataBundle` to retrieve more information about a bundle.

In [15]:
DATA_BUNDLE_URL = "{}/{}/databundles/{}".format(SERVICE_URL, BASE_URL, data_bundles[0]['id'])
data_bundle = requests.get(DATA_BUNDLE_URL).json()['data_bundle']
pprint(data_bundle['data_object_ids'])

['8ff23235-4435-4929-8fb2-5d55b4564999',
 '25065a60-6b7d-4c50-abf9-0cf86c5b483a',
 '44ec7963-5c7a-4974-bb4d-c8b5c002019c']


### Getting Data Object Details

We can now access Data Objects for download using the data object identifiers from the Data Bundle. Both signed URLs and cloud-native URLs are available.

In [16]:
data_object_id = data_bundle['data_object_ids'][0]
DATA_OBJECT_URL = "{}/{}/dataobjects/{}".format(SERVICE_URL, BASE_URL, data_object_id)
data_object = requests.get(DATA_OBJECT_URL).json()['data_object']

The Data Object contains a list of URLs and checksums that can be used to download and access the file.

In [17]:
print("-----------URLS------------")
for url in data_object['urls']:
    pprint(url)
print("-----------checksums-----------")
for checksum in data_object['checksums']:
    pprint(checksum)

-----------URLS------------
{'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=aws'}
{'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=azure'}
{'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=gcp'}
-----------checksums-----------
{'checksum': 'c873835a74cea9c811cc7799f8897ac480cccf84f631c99b5293900f7a071b53',
 'type': 'sha256'}
{'checksum': '57db2e71deb4dab5e4b3f251ac9243b0', 'type': 'etag'}
{'checksum': '05f818a54510272c17dcda69c948f8d904b5aae3', 'type': 'sha1'}
{'checksum': '63439d51', 'type': 'crc32c'}


Now, using an HTTP, S3, or GCP downloader, one can access these files.
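
Since each Data Object lists one URL per replica together with several checksums, a download step could pick the replica closest to the caller and then check the bytes it received. The sketch below assumes the replica is named in a `replica` query parameter, as in the URLs printed above, and that the `sha256` entry is a hex digest of the file contents. The function names are hypothetical.

```python
import hashlib
from urllib.parse import parse_qs, urlparse


def pick_replica_url(data_object, preferred="aws"):
    """Return the URL for the preferred replica, or the first URL if none matches."""
    urls = [entry["url"] for entry in data_object["urls"]]
    for url in urls:
        replica = parse_qs(urlparse(url).query).get("replica", [None])[0]
        if replica == preferred:
            return url
    return urls[0]


def expected_checksum(data_object, kind="sha256"):
    """Look up the checksum of the given type in a Data Object."""
    for entry in data_object["checksums"]:
        if entry["type"] == kind:
            return entry["checksum"]
    raise KeyError("no {} checksum listed".format(kind))


def sha256_matches(content, data_object):
    """Compare the SHA-256 of downloaded bytes with the Data Object's checksum."""
    return hashlib.sha256(content).hexdigest() == expected_checksum(data_object)
```

In this notebook, the bytes could come from something like `requests.get(pick_replica_url(data_object, "gcp")).content`, assuming the endpoint serves the file without further authentication.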

### Printing the data bundle and data object to see what gets mapped

In [6]:
pprint(data_bundle)

{'data_object_ids': ['8ff23235-4435-4929-8fb2-5d55b4564999',
                     '25065a60-6b7d-4c50-abf9-0cf86c5b483a',
                     '44ec7963-5c7a-4974-bb4d-c8b5c002019c'],
 'id': 'e9556c3d-53cb-4c24-856e-951085735d45',
 'version': '2018-06-07T001704.126624Z'}


In [7]:
pprint(data_object_id)

'8ff23235-4435-4929-8fb2-5d55b4564999'


In [8]:
pprint(data_object)

{'checksums': [{'checksum': 'c873835a74cea9c811cc7799f8897ac480cccf84f631c99b5293900f7a071b53',
                'type': 'sha256'},
               {'checksum': '57db2e71deb4dab5e4b3f251ac9243b0', 'type': 'etag'},
               {'checksum': '05f818a54510272c17dcda69c948f8d904b5aae3',
                'type': 'sha1'},
               {'checksum': '63439d51', 'type': 'crc32c'}],
 'content_type': 'application/json',
 'id': '8ff23235-4435-4929-8fb2-5d55b4564999',
 'urls': [{'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=aws'},
          {'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=azure'},
          {'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=gcp'}],
 'version': '2018-06-07T001700.470245Z'}


In [12]:
from bdbag import bdbag_api
bdbag_api.make_bag

<function bdbag.bdbag_api.make_bag(bag_path, algs=None, update=False, save_manifests=True, prune_manifests=False, metadata=None, metadata_file=None, remote_file_manifest=None, config_file='/home/michael/.bdbag/bdbag.json', ro_metadata=None, ro_metadata_file=None)>
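
The signature above includes a `remote_file_manifest` parameter. As the bdbag documentation describes it, this names a JSON file whose entries each carry a `url`, a `length`, a `filename`, and at least one checksum keyed by algorithm (`md5`, `sha1`, `sha256`, or `sha512`). The sketch below shows one way a Data Object could be mapped onto such an entry. How `remote_to_bag.py` actually performs the mapping is not shown here, and because the Data Object printed above has no size field, `length` is passed in separately as an assumption. Both function names are hypothetical.

```python
import json

# Checksum types that a bdbag remote file manifest accepts as keys.
MANIFEST_ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def manifest_entry(data_object, url, length, filename):
    """Build one remote-file-manifest entry from a DOS Data Object."""
    entry = {"url": url, "length": length, "filename": filename}
    for checksum in data_object["checksums"]:
        # Skip types such as 'etag' and 'crc32c' that the manifest does not use.
        if checksum["type"] in MANIFEST_ALGORITHMS:
            entry[checksum["type"]] = checksum["checksum"]
    return entry


def manifest_json(entries):
    """Serialize a list of entries in the JSON layout of a remote file manifest."""
    return json.dumps(entries, indent=2)
```

Writing the result of `manifest_json` to a file and passing that file's path as `remote_file_manifest` would, under these assumptions, let bdbag generate `fetch.txt` and the checksum manifests.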

In [13]:
from remote_to_bag import RemoteToBag