 ### Basic DRS
 
#### Learning Objectives
Workshop attendees will learn how to use the GA4GH Data Repository Service (DRS).

What will participants do as part of the exercise?

 - Understand the two main DRS methods
 - Find where a file is available
 - Use a Python client to access DRS and return results
 
 
     
 
 #### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
 #### 1. Run a cell in a Jupyter notebook
 To run a cell in a Jupyter notebook:
 - Click to the left of the cell
 - Click the Run icon in the toolbar below the menu bar.
 
 🖐 Try it out with the following cell

In [1]:
host_url = 'https://locate.be-md.ncbi.nlm.nih.gov'
drs_id = 'fb1cfb04d3ef99d07c21f9dbf87ccc68'

full_url = host_url + '/ga4gh/drs/v1/objects/' + drs_id
print(full_url)

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/fb1cfb04d3ef99d07c21f9dbf87ccc68


The result of the cell is printed out below the cell.

The Python code above built a URL to access an API function that provides information about where a file is available.
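
The same URL construction can be wrapped in a small helper so the path is written in only one place. This is a minimal sketch: `drs_object_url` is a hypothetical name introduced here for illustration and is not part of any DRS library.

```python
# Hypothetical helper (not part of any DRS library): build the URL
# for the first DRS call from a host URL and a DRS id.
def drs_object_url(host_url, drs_id):
    # rstrip('/') avoids a double slash if the host ends with '/'
    return '{}/ga4gh/drs/v1/objects/{}'.format(host_url.rstrip('/'), drs_id)

example_url = drs_object_url('https://locate.be-md.ncbi.nlm.nih.gov',
                             'fb1cfb04d3ef99d07c21f9dbf87ccc68')
print(example_url)
```

The printed URL should match the one produced by the cell above.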

 #### 2. Call the API using the link above
 🖐 Open the link above in a new web browser window.

Notice that a response is produced, but it is not a formatted web page. It is a response intended to be read by a computer program.

We will look at the response more closely below.

Close the browser window.

 #### 3. Call the API from Python

The URL we built is stored in the variable called full_url.

In the next cell we can use the Python requests module to make the request to the DRS server.

 🖐 Click the cell and run it to see the response

In [2]:
# The requests module is imported first so that we can make requests to a web server
import requests

response = requests.get(full_url)
print(response.json())

{'access_methods': [{'access_id': '1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743', 'region': 'gs.US', 'type': 'https'}, {'access_id': 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8', 'type': 'https'}, {'access_id': '0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85', 'region': 's3.us-east-1', 'type': 'https'}], 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68', 'type': 'md5'}], 'created_time': '2013-02-25T23:24:10Z', 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68', 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam', 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68', 'size': 8752606127}


That's still not very readable. We can define a function to print the response in a more readable form.

#### 4. Understanding the DRS response

🖐 Click and run the next two cells in turn.

In [3]:
import json
def pretty_print(a_dict):
    print(json.dumps(a_dict, indent=3))

In [4]:
pretty_print(response.json())

{
   "access_methods": [
      {
         "access_id": "1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743",
         "region": "gs.US",
         "type": "https"
      },
      {
         "access_id": "b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8",
         "type": "https"
      },
      {
         "access_id": "0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85",
         "region": "s3.us-east-1",
         "type": "https"
      }
   ],
   "checksums": [
      {
         "checksum": "fb1cfb04d3ef99d07c21f9dbf87ccc68",
         "type": "md5"
      }
   ],
   "created_time": "2013-02-25T23:24:10Z",
   "id": "fb1cfb04d3ef99d07c21f9dbf87ccc68",
   "name": "NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam",
   "self_url": "drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68",
   "size": 8752606127
}


The most relevant section of the response is the access_methods.

This example shows that there are three ways the file could be accessed.
The 'region' tells us that the file is available in the US region of Google Cloud Storage (gs.US) and in Amazon S3 storage in the us-east-1 region (s3.us-east-1).

We'll pass on the second of the three access methods for now.
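
The observation that one access method has no region can be checked in code. The sketch below copies the three access methods from the response above into a literal so that it runs without contacting the server; `sample_access_methods` and `list_regions` are names introduced here for illustration only.

```python
# Copy of the access_methods from the response shown above,
# so this cell runs without contacting the server.
sample_access_methods = [
    {'access_id': '1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743',
     'region': 'gs.US', 'type': 'https'},
    {'access_id': 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8',
     'type': 'https'},
    {'access_id': '0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85',
     'region': 's3.us-east-1', 'type': 'https'},
]

def list_regions(methods):
    # dict.get returns None when 'region' is absent instead of raising KeyError
    return [method.get('region') for method in methods]

regions = list_regions(sample_access_methods)
print(regions)
```

The second entry comes back as `None`, which is why code that filters by region needs to allow for access methods without one.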

#### 5. Making the second DRS call - getting a URL to access the file

Let's say we have credits available to compute on one of the clouds where the file is available. We would pick the corresponding access_id from above and use the second API call to obtain a URL to access the file.

Note that we say access and not download. Because the BAM file is large, and we may want to work with many such files, we may prefer to run the analysis where the file is stored. We will come back to this later.

For now we'll just get the URL.

🖐 As before, click on the cell and run it to build the URL for the second call

In [5]:
access_id = "1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743"
full_url = '{}/ga4gh/drs/v1/objects/{}/access/{}'.format(host_url, drs_id, access_id)

print(full_url)

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/fb1cfb04d3ef99d07c21f9dbf87ccc68/access/1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743


🖐 And click on the cell below to send the request and print the response

In [6]:
url_response = requests.get(full_url)
print(url_response.json())

{'url': 'https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}


Note the size of the BAM file reported in the first response (8752606127 bytes). Although we have a URL for it, we won't download it.
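
To put the 'size' field of the DRS response in perspective, the byte count can be converted to a more familiar unit. This is a small sketch; `human_size` is a helper name made up for this example, and it uses binary units (1 GiB = 1024**3 bytes).

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, ...)."""
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB']
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return '{:.2f} {}'.format(size, unit)
        size /= 1024

# 'size' value taken from the DRS response shown earlier
print(human_size(8752606127))
```

At roughly 8 GiB for a single exome alignment, it is easy to see why working with many such files favours running the analysis near the data.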


🖐 Using what you learned above, add code to the example below to retrieve the URL for the access ID of each access method.

In [7]:
drs_response = response.json()
for access_method in drs_response['access_methods']:
    # Add code here to make the DRS call to retrieve the URL for each access_id
    access_id = access_method['access_id']
    print(access_id)


1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743
b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8
0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85
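
One possible way to approach the exercise above is to first build the URL of the second DRS call for every access method. The sketch below does only that step, using a made-up response with placeholder ids and a placeholder host (`drs.example.org`), so it runs offline; `access_request_urls` is a hypothetical helper name.

```python
def access_request_urls(host_url, drs_id, drs_response):
    """Return the URL of the second DRS call for every access method."""
    template = '{}/ga4gh/drs/v1/objects/{}/access/{}'
    return [template.format(host_url, drs_id, method['access_id'])
            for method in drs_response['access_methods']]

# Placeholder response with made-up access ids (not real DRS data).
sample_response = {'access_methods': [{'access_id': 'example-id-1'},
                                      {'access_id': 'example-id-2'}]}

request_urls = access_request_urls('https://drs.example.org',
                                   'example-object', sample_response)
for url in request_urls:
    print(url)
```

In the notebook, the same function could be called with host_url, drs_id and response.json(), and each resulting URL passed to requests.get() as in the cell that made the second DRS call.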


#### 6. Optional - stretch goal - for Python experts

🖐 Imagine you have a preference for working in a particular cloud provider and region. Complete the following function to use DRS to obtain the URL for the file in a specific region.

In [8]:
def get_url_for_region(drs_id, region):
    full_url = '{}/ga4gh/drs/v1/objects/{}'.format(host_url, drs_id)
    r = requests.get(full_url)
    drs_response = r.json()
    # add code here - find the access_id for the region
    # Watch out that not all access_methods have region
    # make the DRS call to get the url
    ai = [am['access_id'] for am in drs_response['access_methods'] if 'region' in am and am['region'] == region]
    if len(ai) > 0:
        am_url = '{}/ga4gh/drs/v1/objects/{}/access/{}'.format(host_url, drs_id, ai[0])
        r2 = requests.get(am_url)
        url = r2.json()['url']
    else:
        print("File not available in region {}".format(region))
        url = None
        
    return url

#### 🖐  Test it

In [9]:
get_url_for_region(drs_id, 'gs.US')

'https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'

In [10]:
get_url_for_region(drs_id, 's3.us-east-1')

'https://1000genomes.s3.amazonaws.com/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'

In [11]:
get_url_for_region(drs_id, 's3.us-west-1')

File not available in region s3.us-west-1
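
The region lookup inside the function above can also be separated from the network calls, which makes it easy to try on its own. This is a sketch under that assumption; `find_access_id` is a hypothetical name, and the access ids in the sample are placeholders rather than real DRS values.

```python
def find_access_id(drs_response, region):
    """Return the access_id of the first access method in `region`, or None."""
    for method in drs_response.get('access_methods', []):
        # .get() copes with access methods that have no 'region' key
        if method.get('region') == region:
            return method['access_id']
    return None

# Placeholder response: real region names, made-up access ids.
sample_drs_response = {'access_methods': [
    {'access_id': 'example-id-1', 'region': 'gs.US', 'type': 'https'},
    {'access_id': 'example-id-2', 'type': 'https'},
    {'access_id': 'example-id-3', 'region': 's3.us-east-1', 'type': 'https'},
]}

print(find_access_id(sample_drs_response, 's3.us-east-1'))
print(find_access_id(sample_drs_response, 's3.us-west-1'))
```

Returning `None` for an unavailable region mirrors the behaviour of the function above, leaving it to the caller to decide whether to print a message.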


#### 7. Using a DRS Python Client
The steps above showed how individual calls to DRS can be made. As we are likely to make these calls repeatedly, we created a set of functions that call DRS for us, so we can focus on more interesting aspects of the task.

We can still make use of variables like host_url and drs_id defined previously, but now we will pass them to our client.

🖐 Click on the following to make the first DRS request

In [12]:
from fasp.loc import DRSClient
cl = DRSClient(host_url, public=True)
cl.get_object(drs_id)

{'access_methods': [{'access_id': '1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8',
   'type': 'https'},
  {'access_id': '0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'size': 8752606127}

🖐 and again to get the access URL

In [13]:
cl.get_access_url(drs_id, 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8')

'https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'

Our client also includes the function we set as a task above.

🖐 Click to test it

In [14]:
cl.get_url_for_region(drs_id, 's3.us-east-1')

'https://1000genomes.s3.amazonaws.com/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'
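
To give a feel for what a client like this might do internally, here is a simplified sketch that wraps the two DRS calls in a class. It is not the fasp `DRSClient` implementation: `MiniDRSClient`, `fetch_json` and the `fetch` parameter are hypothetical names, the standard library's `urllib` is used instead of requests so the block has no dependencies, and the demonstration uses a stand-in fetcher with made-up data so that it runs offline.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


class MiniDRSClient:
    """Hypothetical, simplified client; NOT the fasp DRSClient."""

    def __init__(self, host_url, fetch=fetch_json):
        self.host_url = host_url.rstrip('/')
        self.fetch = fetch  # injectable, so the class can be tried offline

    def get_object(self, drs_id):
        # First DRS call: metadata and access methods for the object
        return self.fetch('{}/ga4gh/drs/v1/objects/{}'.format(self.host_url, drs_id))

    def get_access_url(self, drs_id, access_id):
        # Second DRS call: a URL for one specific access method
        url = '{}/ga4gh/drs/v1/objects/{}/access/{}'.format(
            self.host_url, drs_id, access_id)
        return self.fetch(url)['url']


# Offline demonstration with a stand-in fetcher returning made-up data.
def fake_fetch(url):
    if url.endswith('/access/example-access-id'):
        return {'url': 'https://files.example.org/example.bam'}
    return {'id': 'example-object',
            'access_methods': [{'access_id': 'example-access-id', 'type': 'https'}]}


client = MiniDRSClient('https://drs.example.org', fetch=fake_fetch)
print(client.get_object('example-object')['id'])
print(client.get_access_url('example-object', 'example-access-id'))
```

Passing the fetcher in as a parameter is a design choice made for this sketch only; it lets the URL-building logic be exercised without a server.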

#### Extra
🖐 Find a DRS id for a pdf file. Use the functions above to:
 - Find where the file is available
 - Obtain the url
 - Download and view the file (you could write code for this last step, but it's not really necessary).

The point is that DRS ids can be used to point to any kind of file - not just genomic files.

In [15]:
from fasp.loc import crdcDRSClient
cl2 = crdcDRSClient("~/.keys/crdc_credentials.json")
image_drs = "bd7cdca3-fd5f-4d72-8612-eeec3de560a5"
cl2.get_object(image_drs)

{'access_methods': [{'access_id': 'gs',
   'access_url': {'url': 'gs://gdc-tcga-phs000178-open/bd7cdca3-fd5f-4d72-8612-eeec3de560a5/TCGA-06-5418-01A-01-TS1.6600b787-bac7-4ad1-8711-f27bae721e7a.svs'},
   'region': '',
   'type': 'gs'},
  {'access_id': 's3',
   'access_url': {'url': 's3://tcga-2-open/bd7cdca3-fd5f-4d72-8612-eeec3de560a5/TCGA-06-5418-01A-01-TS1.6600b787-bac7-4ad1-8711-f27bae721e7a.svs'},
   'region': '',
   'type': 's3'},
  {'access_id': 'https',
   'access_url': {'url': 'https://api.gdc.cancer.gov/data/bd7cdca3-fd5f-4d72-8612-eeec3de560a5'},
   'region': '',
   'type': 'https'}],
 'aliases': [],
 'checksums': [{'checksum': '695103a19f08f9d60f7edd845904a9d3',
   'type': 'md5'}],
 'created_time': '2021-11-29T22:34:36.131709',
 'description': None,
 'form': 'object',
 'id': 'bd7cdca3-fd5f-4d72-8612-eeec3de560a5',
 'mime_type': 'application/json',
 'name': None,
 'self_uri': 'drs://dg.4DFC:bd7cdca3-fd5f-4d72-8612-eeec3de560a5',
 'size': 6007209,
 'updated_time': '2022-02-03T

In [16]:
cl2.get_access_url(image_drs,'s3')

'https://tcga-2-open.s3.amazonaws.com/bd7cdca3-fd5f-4d72-8612-eeec3de560a5/TCGA-06-5418-01A-01-TS1.6600b787-bac7-4ad1-8711-f27bae721e7a.svs?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAINBJ6QVTSWMR7UZQ%2F20220629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220629T205603Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&user_id=2417&username=forei&X-Amz-Signature=c43c642efa67c3eb2fcc0380d9b3e989fdbe6c1576db1622d330421985eef0a3'
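
For the optional last step of the Extra task, a URL like the one above could be downloaded with a few lines of code. The sketch below is one way to do it with the standard library; `download` is a hypothetical helper, and the demonstration copies a local temporary file through a `file://` URL so that it runs without network access. The URL returned above carries `X-Amz-Expires=3600`, which suggests it is only valid for a limited time, so a real download would need to start soon after the URL is requested.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream the content at `url` to `dest_path` in chunks."""
    with urlopen(url) as resp, open(dest_path, 'wb') as out:
        # copyfileobj reads and writes chunk by chunk, so the whole
        # file is never held in memory at once
        shutil.copyfileobj(resp, out, chunk_size)
    return dest_path


# Offline demonstration: a local file stands in for the remote one.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'example content')
    copy_path = download(source.as_uri(), str(Path(tmp) / 'copy.bin'))
    copied_bytes = Path(copy_path).read_bytes()

print(copied_bytes)
```

In the notebook, the first argument would be the URL returned by get_access_url and the second a local file name of your choice.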