# Accessing Data Object Service via Identifiers.org

This notebook demonstrates how identifiers.org can be used to create stable URLs for Data Object Service (DOS) objects. The DOS objects used here expose data hosted in the Genomic Data Commons (GDC), and their metadata are public.

## Access data directly from dos-gdc-lambda

We'll begin by accessing a Data Object provided by the dos-gdc-lambda. Note that the service URL used here is not expected to be stable.

In [4]:
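import requests

# Identifier of a Data Object whose data are hosted in the Genomic Data Commons.
test_id = "23fa7b4b-9d68-429b-aece-658b11124bb3"
# Base URL of the dos-gdc-lambda service (not expected to be stable).
DOS_GDC_URL = "https://dos-gdc.ucsc-cgp-dev.org/ga4gh/dos/v1"
# Request the Data Object directly from the service and print its name.
response = requests.get("{}/dataobjects/{}".format(DOS_GDC_URL, test_id))
print(response.json()['data_object']['name'])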

jhu-usc.edu_OV.HumanMethylation27.1.lvl-3.TCGA-09-0364-01A-02D-0359-05.gdc_hg38.txt


## Create a stable URL

Identifiers.org works by setting up redirects that follow a specific regex scheme. These redirects are manually curated and are meant to provide stable URIs that can be used across various platforms.

You can see the entry for the prefix `dev.ga4ghdos` here: https://identifiers.org/dev.ga4ghdos.

In [5]:
base = "https://identifiers.org"
prefix = "dev.ga4ghdos"
url = "{}/{}:{}".format(base, prefix, test_id)
print(url)

https://identifiers.org/dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3
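
The redirect scheme described above can be pictured as a lookup from a prefix to a pattern and a URL template. The following is only a minimal sketch under stated assumptions, not how identifiers.org is implemented: `REGISTRY` and `resolve` are hypothetical names, the UUID pattern is a guess at what a curated regex might look like, and the template is built from the dos-gdc-lambda URL used earlier in this notebook.

```python
import re

# Hypothetical registry entry for illustration only; the real entry is curated
# by identifiers.org and may use a different pattern.
REGISTRY = {
    "dev.ga4ghdos": {
        # Assumed pattern: a lowercase hexadecimal UUID.
        "pattern": re.compile(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}"),
        # Template built from the dos-gdc-lambda URL used earlier in this notebook.
        "template": "https://dos-gdc.ucsc-cgp-dev.org/ga4gh/dos/v1/dataobjects/{id}",
    }
}


def resolve(compact_id):
    """Map 'prefix:identifier' to the URL a resolver would redirect to."""
    prefix, sep, identifier = compact_id.partition(":")
    if not sep:
        raise ValueError("expected 'prefix:identifier'")
    entry = REGISTRY.get(prefix)
    if entry is None:
        raise KeyError("unknown prefix: {}".format(prefix))
    if not entry["pattern"].fullmatch(identifier):
        raise ValueError("identifier does not match the pattern for {}".format(prefix))
    return entry["template"].format(id=identifier)


target = resolve("dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3")
print(target)
```

Keeping the pattern check separate from the template makes it possible to reject malformed identifiers before any request is sent to the underlying service.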


As mentioned above, identifiers.org redirects requests that match a pattern to the underlying service. Requesting this URL should therefore redirect us to the dos-gdc-lambda service used above.

In [11]:
identifiers_org_response = requests.get(url)
print(identifiers_org_response.json()['data_object']['name'])
print("Are they equivalent?")
print(identifiers_org_response.json()['data_object'] == response.json()['data_object'])

jhu-usc.edu_OV.HumanMethylation27.1.lvl-3.TCGA-09-0364-01A-02D-0359-05.gdc_hg38.txt
Are they equivalent?
True


## Using cURL and wget with identifiers.org and DOS

Simple HTTP requests from the command line can be used to find our data:

In [21]:
!curl -v $url

*   Trying 193.62.193.83...
* Connected to identifiers.org (193.62.193.83) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 697 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: identifiers.org (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: OU=Domain Control Validated,CN=identifiers.org
* 	 start date: Thu, 21 Sep 2017 14:01:01 GMT
* 	 expire date: Sun, 27 Sep 2020 12:12:39 GMT
* 	 issuer: C=US,ST=Arizona,L=Scottsdale,O=GoDaddy.com\, Inc.,OU=http://certs.godaddy.com/repository/,CN=Go Daddy Secure Certificate Authority - G2
* 	 compression: NULL
* ALPN, server did not agree to a protocol
> GET /dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3 HTTP/1.1
> Host: i

Performing the request, we see that cURL receives a redirect to a different URL with status code `302`. To follow the redirect, we add the `-L` flag.

In [19]:
!curl -L $url

{"data_object": {"name": "jhu-usc.edu_OV.HumanMethylation27.1.lvl-3.TCGA-09-0364-01A-02D-0359-05.gdc_hg38.txt", "version": "2017-03-24T18:43:16.886826-05:00", "urls": [{"url": "https://api.gdc.cancer.gov/data/23fa7b4b-9d68-429b-aece-658b11124bb3", "system_metadata": {"data_type": "Methylation Beta Value", "updated_datetime": "2017-03-24T18:43:16.886826-05:00", "created_datetime": "2016-10-27T21:58:12.297090-05:00", "file_name": "jhu-usc.edu_OV.HumanMethylation27.1.lvl-3.TCGA-09-0364-01A-02D-0359-05.gdc_hg38.txt", "md5sum": "9163285d8eadc921d7244f29faca50da", "data_format": "TXT", "acl": ["open"], "access": "open", "platform": "Illumina Human Methylation 27", "state": "live", "file_id": "23fa7b4b-9d68-429b-aece-658b11124bb3", "data_category": "DNA Methylation", "file_size": 9951504, "submitter_id": "cde73b7c-0a50-4444-bb33-11e3debd3f79-beta-value", "type": "methylation_beta_value", "file_state": "submitted", "experimental_strategy": "Methylation Array"}}], "checksums": [{"checksum": "91
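
The redirect step can also be observed from Python. This is a small sketch, assuming the `requests` library already used in this notebook: passing `allow_redirects=False` to `requests.get` returns the redirect response itself instead of following it. The `inspect_redirect` helper and its `get` parameter are hypothetical, added only so the function can be exercised without network access.

```python
def inspect_redirect(url, get=None):
    """Return (status_code, Location header) without following the redirect."""
    if get is None:
        # Imported lazily so the helper can be exercised offline with a stand-in.
        import requests
        get = requests.get
    # allow_redirects=False stops requests from following the redirect.
    response = get(url, allow_redirects=False)
    return response.status_code, response.headers.get("Location")
```

Calling `inspect_redirect(url)` with the identifiers.org URL built earlier should return the redirect status and the address of the underlying DOS service, provided both services are still reachable.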

`wget`, on the other hand, follows redirects automatically and saves the response to a file in the local directory, named after our prefix and identifier.

In [25]:
!wget $url

--2018-04-13 16:54:52--  https://identifiers.org/dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3
Resolving identifiers.org (identifiers.org)... 193.62.193.83, 193.62.192.83
Connecting to identifiers.org (identifiers.org)|193.62.193.83|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dos-gdc.ucsc-cgp-dev.org/ga4gh/dos/v1/dataobjects/23fa7b4b-9d68-429b-aece-658b11124bb3 [following]
--2018-04-13 16:54:53--  https://dos-gdc.ucsc-cgp-dev.org/ga4gh/dos/v1/dataobjects/23fa7b4b-9d68-429b-aece-658b11124bb3
Resolving dos-gdc.ucsc-cgp-dev.org (dos-gdc.ucsc-cgp-dev.org)... 52.84.237.242, 52.84.237.116, 52.84.237.244, ...
Connecting to dos-gdc.ucsc-cgp-dev.org (dos-gdc.ucsc-cgp-dev.org)|52.84.237.242|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115 (1.1K) [application/json]
Saving to: ‘dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3’


2018-04-13 16:54:54 (253 MB/s) - ‘dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3’ saved [1
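
The file saved by `wget` contains the same JSON document returned by the service, so it can be read back in Python. The sketch below assumes only the fields visible in the response shown earlier (`data_object`, `name`, and `urls`); the helper names `summarize_data_object` and `load_saved_response` are hypothetical.

```python
import json


def summarize_data_object(document):
    """Pull the name and access URLs out of a DOS Data Object response."""
    data_object = document["data_object"]
    return {
        "name": data_object["name"],
        # Each entry in "urls" carries the location of the underlying data.
        "urls": [entry["url"] for entry in data_object.get("urls", [])],
    }


def load_saved_response(path):
    """Read a JSON response saved to disk, for example by wget."""
    with open(path) as handle:
        return summarize_data_object(json.load(handle))
```

Here `load_saved_response("dev.ga4ghdos:23fa7b4b-9d68-429b-aece-658b11124bb3")` would read the file created above. Since the file name contains a colon, it may be more convenient to pick a different name with `wget -O`.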

## Using a DOS identifier resolver

The identifiers.org resolver points at a single service, but multiple DOS services are available. To make the best use of stable identifiers, an identifier resolver can request the same identifier from multiple services and, for example, return the first match to the client.

By curating the list of DOS services that are being resolved by the identifiers.org gateway, it is possible to control which data are served under the controlled prefix.
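
The "first found" strategy described above might look like the following. This is a hedged sketch rather than an existing resolver: `resolve_first` and `fetch_data_object` are hypothetical names, the `/dataobjects/{id}` path is the one used earlier in this notebook, and any service URL other than the dos-gdc-lambda one would have to be supplied by whoever curates the list.

```python
def fetch_data_object(service_url, identifier):
    """Fetch one Data Object, or None if the service lacks it or cannot be reached."""
    import requests  # imported lazily so resolve_first can be tested offline
    try:
        response = requests.get(
            "{}/dataobjects/{}".format(service_url, identifier), timeout=10
        )
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    try:
        return response.json().get("data_object")
    except ValueError:
        # The service answered, but not with JSON.
        return None


def resolve_first(identifier, service_urls, fetch=None):
    """Return (service_url, data_object) for the first service that has the identifier."""
    if fetch is None:
        fetch = fetch_data_object
    # Services are tried in order, so the curated list also sets the priority.
    for service_url in service_urls:
        data_object = fetch(service_url, identifier)
        if data_object is not None:
            return service_url, data_object
    return None
```

A call such as `resolve_first(test_id, [DOS_GDC_URL])` would query only the service used in this notebook. Adding or removing entries in the list is the curation step that controls which data are served.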