## Using DRS to access files from SRA
This notebook explores two approaches to locating specific objects or files via DRS.

For context, another notebook shows how the files identified by the approaches here can be submitted for compute via a WES service.

The data and files used are from the Thousand Genomes project. The following Data Connect query shows how the DRS ids of mapped BAM files from whole exome sequencing, for subjects from a particular population, can be obtained in a single step.

In [2]:
from fasp.search import DataConnectClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'JPT' '''

resultRows = searchClient.runQuery(query, returnType='dataframe')
resultRows

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________


Unnamed: 0,sample_name,bam_drs_id,acc
0,NA18948,fb1cfb04d3ef99d07c21f9dbf87ccc68,SRR1601121
1,NA18945,9327fb44eb81b49a41e38c8d86eb3b3a,SRR1601115
2,NA18943,9f38253b281c7e9c99e4bdbececd8e2f,SRR1606910
3,NA18944,5aff9cee759c930666e94e65dbb0af94,SRR1601113
4,NA18940,333a651b55970c9402db51ebb5e55d09,SRR1607212
...,...,...,...
99,NA19074,0805baa0849485a2a63ea41429b9b37c,SRR1604135
100,NA19081,cb072733f15565af2790a90efe60b0e1,SRR1598082
101,NA19080,6f9f1fc52166530ed0568d61451b032f,SRR1598080
102,NA19087,b5f9609124241ade815fe49e2eb38c4f,SRR1603951


In [11]:
searchClient.listTableInfo('thousand_genomes.onek_genomes.sra_drs_files', verbose=True)

_Schema for tablethousand_genomes.onek_genomes.sra_drs_files_
{
   "name": "thousand_genomes.onek_genomes.sra_drs_files",
   "description": "Automatically generated schema",
   "data_model": {
      "$id": "https://ga4gh-search-adapter-presto-public.prod.dnastack.com/table/thousand_genomes.onek_genomes.sra_drs_files/info",
      "description": "Automatically generated schema",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "properties": {
         "acc": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "filename": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "drs_id": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "filetype": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "sample_

<fasp.search.data_connect_client.SearchSchema at 0x13084aee0>

The SRA DRS server can be used to determine where the files can be obtained from. The example below does this for a DRS id from the query results.

In [72]:
from fasp.loc import DRSClient

# Set up a client to access NCBI's DRS server for the Sequence Read Archive (SRA)
drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', debug=True, public=True)
# DRS id of the object to look up
test_id = '59eb87314f05d99a4ef8cd250353d151'
# Use the DRS GetObject function to find out where the file is available for access
objInfo = drsClient.getObject(test_id)
objInfo

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/59eb87314f05d99a4ef8cd250353d151


{'access_methods': [{'access_id': 'aed1336035380817df6565a2e1ed72cad908160aea401422a22faef6f99df92e',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '6d0c7ff04ed6411c1bc01f0a443cbd02314b9fc03e18826b15aba89dd7eeb0e5',
   'type': 'https'},
  {'access_id': '9a6d6ddec179ac5abfcbd3a98820476edf2eb81e1efeb65ebff474ba262db06a',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': '59eb87314f05d99a4ef8cd250353d151',
   'type': 'md5'}],
 'created_time': '2012-11-18T05:32:45Z',
 'id': '59eb87314f05d99a4ef8cd250353d151',
 'name': 'NA19077.mapped.ILLUMINA.bwa.JPT.exome.20120522.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/59eb87314f05d99a4ef8cd250353d151',
 'size': 9831919221}

A second DRS call can be used to obtain a url to access the file from one of the above locations.

Note that, unlike other DRS servers, the SRA DRS server uses arbitrary access_ids (which is consistent with the DRS specification). Our helper for obtaining a URL therefore takes the region we want to use and looks up the matching access_id.


In [73]:
def getAccessForRegion(drs_response, region):
    access_methods = drs_response['access_methods']
    access_method = [am for am in access_methods if ('region' in am and am['region'] == region)]
    if len(access_method) == 0:
        print ('object not in region {}'.format(region))
        return None
    return access_method[0]['access_id']

In [74]:
access_id = getAccessForRegion(objInfo, 'gs.US')
access_id

'aed1336035380817df6565a2e1ed72cad908160aea401422a22faef6f99df92e'

In [76]:
print('access_id:{}'.format(access_id))
#url = drsClient.getAccessURL(test_id, access_id=access_id)
url = drsClient.getAccessURL(test_id, access_id)
print('Access url: {}'.format(url))

access_id:aed1336035380817df6565a2e1ed72cad908160aea401422a22faef6f99df92e
https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/59eb87314f05d99a4ef8cd250353d151/access/aed1336035380817df6565a2e1ed72cad908160aea401422a22faef6f99df92e
<Response [200]>
Access url: https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA19077/exome_alignment/NA19077.mapped.ILLUMINA.bwa.JPT.exome.20120522.bam
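
The debug output above shows the request the client issued to obtain the access URL. As a minimal sketch, assuming only the path pattern visible in that output (`/ga4gh/drs/v1/objects/{object_id}/access/{access_id}`), the request URL can be assembled as follows. The function name `build_access_request_url` is hypothetical and is not part of `fasp`. The sketch only builds the string and makes no network call.

```python
from urllib.parse import quote


def build_access_request_url(base_url, object_id, access_id):
    """Assemble the DRS access endpoint URL for an object and one of its access_ids.

    Hypothetical helper for illustration; the path pattern is taken from the
    debug output shown above.
    """
    base = base_url.rstrip('/')
    # Percent-encode the ids so unexpected characters cannot alter the path
    return '{}/ga4gh/drs/v1/objects/{}/access/{}'.format(
        base, quote(object_id, safe=''), quote(access_id, safe=''))


request_url = build_access_request_url(
    'https://locate.be-md.ncbi.nlm.nih.gov',
    '59eb87314f05d99a4ef8cd250353d151',
    'aed1336035380817df6565a2e1ed72cad908160aea401422a22faef6f99df92e')
print(request_url)
```

The printed URL should match the request line in the debug output above. The response to that request is what carries the final download URL.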


## The SRA Identity Exchange and DRS services
Can we take an SRA accession number from above and see what it looks like through the SRA IDentity eXchange service (IDX), and how that carries through to DRS? We'll start with a run accession, an SRR.

The SRADRSClient has an additional function to access the IDX service with an SRA accession number.

In [7]:
import json
from fasp.loc import SRADRSClient
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', debug=True, public=True)

accession = 'SRR1601121'
idx = drsClient.acc2drs(accession)
print(json.dumps(idx, indent=3))
drsId = idx['response'][accession]['drs']
print (drsId)


{
   "drs-base": "drs://locate.be-md.ncbi.nlm.nih.gov",
   "response": {
      "SRR1601121": {
         "drs": "9466d7c1ec8fde019ce630c9bd88582e",
         "status_code": 200
      }
   }
}
9466d7c1ec8fde019ce630c9bd88582e
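
The IDX response above carries a `status_code` for each accession, so the DRS id can be extracted a little more defensively than by direct indexing. The following is a minimal sketch, assuming the response shape shown above and assuming that any `status_code` other than 200 means the lookup did not succeed. The helper name `drs_id_from_idx` is hypothetical, and the sample dictionary is copied from the output above so the example runs without a network call.

```python
def drs_id_from_idx(idx_response, accession):
    """Return the DRS id for an accession, or None if the lookup did not succeed.

    Hypothetical helper; treats a missing entry or a non-200 status_code
    as a failed lookup (an assumption, not documented IDX behaviour).
    """
    entry = idx_response.get('response', {}).get(accession)
    if entry is None or entry.get('status_code') != 200:
        return None
    return entry.get('drs')


# Sample response copied from the IDX output above
sample_idx = {
    "drs-base": "drs://locate.be-md.ncbi.nlm.nih.gov",
    "response": {
        "SRR1601121": {
            "drs": "9466d7c1ec8fde019ce630c9bd88582e",
            "status_code": 200
        }
    }
}

print(drs_id_from_idx(sample_idx, 'SRR1601121'))
print(drs_id_from_idx(sample_idx, 'SRR0000000'))  # accession not in the response
```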


<del>Note: the base URI returned in the result above suggests the DRS service could be accessed at the URL https://locate.ncbi.nlm.nih.gov . At present, for performance purposes the SRA DRS service should be accessed at https://locate.be-md.ncbi.nlm.nih.gov. See the example above</del>

Now use the DRS service with that id.

In [8]:
drsClient.getObject(drsId)

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/9466d7c1ec8fde019ce630c9bd88582e


{'checksums': [{'checksum': '9466d7c1ec8fde019ce630c9bd88582e',
   'type': 'md5'}],
 'contents': [{'id': '519de9933298caa8bdf551351426d120',
   'name': 'NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
  {'id': 'a027e7c2a917cba582a9684244ad339d',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai'},
  {'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': '9466d7c1ec8fde019ce630c9bd88582e',
 'name': 'SRR1601121',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/9466d7c1ec8fde019ce630c9bd88582e',
 'size': 8763581919}

Our intent, as in the first approach, is to work with the mapped BAM file. We can see by eye, from the filenames, which DRS id we need.
#### An issue
This highlights the first issue with this approach: the information needed to identify the right file exists only in the file name. That is fine for low-throughput work done by eye, but it does not scale to larger, machine-actionable use cases.
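
As a stopgap, the filename inspection can be scripted. The following is a minimal sketch, assuming the naming convention seen in the `contents` listing above (`<sample>.<mapped|unmapped>. ... .bam`, with `.bam.bai` for index files). The helper name `find_mapped_bam_ids` is hypothetical, and the sample dictionary is copied from the output above. Because it still depends on the file name, it inherits the fragility just described.

```python
def find_mapped_bam_ids(drs_object):
    """Return the ids of contents entries whose name looks like a mapped BAM file.

    Hypothetical helper; relies entirely on the filename convention
    '<sample>.<mapped|unmapped>. ... .bam' seen in the listing above.
    """
    matches = []
    for item in drs_object.get('contents', []):
        parts = item['name'].split('.')
        # parts[1] is 'mapped' or 'unmapped'; the last part separates
        # BAM files ('bam') from their index files ('bai')
        if len(parts) > 2 and parts[1] == 'mapped' and parts[-1] == 'bam':
            matches.append(item['id'])
    return matches


# Contents copied from the DRS object for SRR1601121 shown above
sample_object = {
    'contents': [
        {'id': '519de9933298caa8bdf551351426d120',
         'name': 'NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
        {'id': 'a027e7c2a917cba582a9684244ad339d',
         'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai'},
        {'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
         'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
    ]
}

mapped_ids = find_mapped_bam_ids(sample_object)
print(mapped_ids)
```

Returning a list rather than a single id makes it explicit when the convention matches zero or several files, which a caller would need to handle.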

#### Moving on
We pass the manually identified id to DRS to find out how to obtain the file of interest. This works exactly as it did in the first approach.

In [83]:
drsClient.getObject('fb1cfb04d3ef99d07c21f9dbf87ccc68')

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/fb1cfb04d3ef99d07c21f9dbf87ccc68


{'access_methods': [{'access_id': '1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8',
   'type': 'https'},
  {'access_id': '0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'size': 8752606127}