 ### Combining GA4GH standards to perform an end-to-end workflow
 
#### Learning Objectives
Combine Data Connect, WES and DRS services  

What will participants do as part of the exercise?

 - Search for files with Data Connect
 - Obtain links to access files
 - Submit the files to a WES workflow
 - Retrieve the results of the analysis
 
 
 #### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
 #### 1. Run a cell in a Jupyter notebook
 
 ## Obtain Thousand Genomes files from SRA DRS and submit to Seven Bridges WES

🖐 Set your project name, the path to your API key file, and your download location

In [2]:
SB_PROJECT = 'forei/ismb-tutorial'
SB_API_KEY_PATH = '~/.keys/sbcgc_key.json'
DOWNLOAD_LOCATION = '~/Downloads'

In [3]:
from fasp.search import DataConnectClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'PUR' LIMIT 3'''

json_result = searchClient.run_query(query, returnType='json')
json_result

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________


[{'sample_name': 'HG00731',
  'bam_drs_id': '515ae091f29ac699a4d2e272812cea47',
  'acc': 'SRR1606560'},
 {'sample_name': 'HG00637',
  'bam_drs_id': '475dfc02f643c368036df6816d05afe4',
  'acc': 'SRR1596919'},
 {'sample_name': 'HG00640',
  'bam_drs_id': '58e2964f2a0adbf41ab0e8c7a95e7d0c',
  'acc': 'SRR1596923'}]

### Convert the result into a Dataframe

In [4]:
import pandas as pd
first_df = pd.DataFrame(json_result)
first_df

|   | sample_name | bam_drs_id | acc |
|---|---|---|---|
| 0 | HG00731 | 515ae091f29ac699a4d2e272812cea47 | SRR1606560 |
| 1 | HG00637 | 475dfc02f643c368036df6816d05afe4 | SRR1596919 |
| 2 | HG00640 | 58e2964f2a0adbf41ab0e8c7a95e7d0c | SRR1596923 |


### Use DRS to obtain file details

The SRA DRS server can tell us where each file can be obtained from. The next cell looks up the first DRS id from the query results.

In [5]:
from fasp.loc import DRSClient

drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True, debug=False)
test_id = json_result[0]['bam_drs_id']
print(test_id)
objInfo = drsClient.get_object(test_id)
objInfo

515ae091f29ac699a4d2e272812cea47


{'access_methods': [{'access_id': '8cc282a3e09887491fa5aa7ff1c209b1a4b9bf1cc55dd9767075e625968f364a',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '2c8e9f0f20117987660e677c0d3556c198ec109447b378dfcc6f8639b6a0b5e2',
   'type': 'https'},
  {'access_id': 'fa57eb71b1f001a479f2462a0ca9fc9f35f64e544150086e4a55fc86d8eeaed3',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': '515ae091f29ac699a4d2e272812cea47',
   'type': 'md5'}],
 'created_time': '2013-05-08T10:25:13Z',
 'id': '515ae091f29ac699a4d2e272812cea47',
 'name': 'HG00731.mapped.ILLUMINA.bwa.PUR.exome.20130422.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/515ae091f29ac699a4d2e272812cea47',
 'size': 32108614682}
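
The object above lists one access method per storage location, and one of them has no `region` key. A minimal sketch of choosing an `access_id` for a given region follows. It is an illustration only and makes no claim about how the `fasp` client does this internally; `access_id_for_region` is a hypothetical helper and the `access_id` values are shortened placeholders, not the real ids.

```python
from typing import Dict, Optional


def access_id_for_region(drs_object: Dict, region: str) -> Optional[str]:
    """Return the access_id of the first access method in the given region."""
    for method in drs_object.get("access_methods", []):
        # Some access methods carry no 'region' key at all, so use .get().
        if method.get("region") == region:
            return method.get("access_id")
    return None


# Same shape as the DRS object shown above; access_id values are placeholders.
example_object = {
    "access_methods": [
        {"access_id": "id-gs", "region": "gs.US", "type": "https"},
        {"access_id": "id-noregion", "type": "https"},
        {"access_id": "id-s3", "region": "s3.us-east-1", "type": "https"},
    ]
}
print(access_id_for_region(example_object, "s3.us-east-1"))
```

Returning `None` when no method matches lets the caller skip a file that is not available in the wanted region.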

A second DRS call, passing one of the `access_id` values above, returns a URL from which the file can be accessed at that location.

In [6]:
access_id = objInfo['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
url = drsClient.get_access_url(test_id, access_id=access_id)
print('url:{}'.format(url))

access_id:8cc282a3e09887491fa5aa7ff1c209b1a4b9bf1cc55dd9767075e625968f364a
url:https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00731/exome_alignment/HG00731.mapped.ILLUMINA.bwa.PUR.exome.20130422.bam


### Set up a WES client

In [7]:
from fasp.workflow import sbcgcWESClient
wesClient = sbcgcWESClient(SB_PROJECT, api_key_path=SB_API_KEY_PATH)

#### Define a function to submit the workflow

In the following cell it may be necessary to point to your own copy of the application.
Instructions will be provided.

In [10]:
import json
import requests

def runWorkflow(wesClient, fileurl, outfile):
    '''Submit a SAMtools View run for the file at fileurl and return the run id.'''

    # Replace with the path to your own copy of the app if you have one
    sam_view_app = 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'

    # DRS id of the reference FASTA file
    #ref_drs_id = 'drs://cgc-ga4gh-api.sbgenomics.com/5caf7ebec80cb0e41b007adf'
    ref_drs_id = 'drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5'

    params = {
        "project": wesClient.project_id,
        "inputs": {
          "alignment_file_url": fileurl,
          "count_alignments": True,
          "reference_file": {
            "path": ref_drs_id,
            "name": "references-hs37d5-hs37d5.fasta",
            "class": "File"
          },
          "output_file_path": outfile
        }
     }


    # Submit the run through the WES client and return the id of the new run
    run_id = wesClient.run_generic_workflow(
        workflow_url=sam_view_app,
        workflow_params=json.dumps(params),
        workflow_type="CWL",
        workflow_type_version="sbg:draft-2",
        verbose=False
    )
    return run_id
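
For reference, a WES `RunWorkflow` request is a multipart form whose fields include `workflow_params`, `workflow_type`, `workflow_type_version` and `workflow_url`, with the parameters passed as a JSON-encoded string. The sketch below only assembles those fields and sends nothing. `build_wes_run_fields` is a hypothetical helper, not part of `fasp`, and how the client actually transmits the request is not shown in this guide.

```python
import json
from typing import Dict


def build_wes_run_fields(workflow_url: str, params: Dict,
                         workflow_type: str = "CWL",
                         workflow_type_version: str = "sbg:draft-2") -> Dict[str, str]:
    """Assemble the form fields carried by a WES RunWorkflow request."""
    return {
        # WES expects the parameters as a JSON string, not a nested object.
        "workflow_params": json.dumps(params),
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        "workflow_url": workflow_url,
    }


example_fields = build_wes_run_fields(
    "sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url",
    {"project": "forei/ismb-tutorial", "inputs": {"count_alignments": True}},
)
print(example_fields["workflow_params"])
```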

#### For each result of the query above submit a task to the Cancer Genomics Cloud

In [11]:
# set the region we want to access data from
region = 's3.us-east-1'
my_runs = []

for row in json_result:

    print("subject={}, drsID={}".format(row['sample_name'], row['bam_drs_id']))
    drs_id = row['bam_drs_id']

    # Step 2 - Use DRS to get a URL for the file in the chosen region
    objInfo = drsClient.get_object(drs_id)
    url = drsClient.get_url_for_region(drs_id, region)

    # Step 3 - Run a pipeline on the file at the DRS URL
    if url is not None:
        outfile = "{}.txt".format(row['sample_name'])
        run_id = runWorkflow(wesClient, url, outfile)
        print('Submitted run {} to {}'.format(run_id, wesClient.__class__.__name__))
        my_runs.append(run_id)
        row['run_id'] = run_id
    print('_________________________________________________________________________')

subject=HG00731, drsID=515ae091f29ac699a4d2e272812cea47
Submitted run b9dfbc06-e31c-4f22-b5d5-fbcd517ceb7d to sbcgcWESClient
_________________________________________________________________________
subject=HG00637, drsID=475dfc02f643c368036df6816d05afe4
Submitted run 55574c1a-a47c-42e5-b146-a46faf20487c to sbcgcWESClient
_________________________________________________________________________
subject=HG00640, drsID=58e2964f2a0adbf41ab0e8c7a95e7d0c
Submitted run 26d015b0-17d8-4d41-be58-c5cba814af1c to sbcgcWESClient
_________________________________________________________________________


In [13]:
json_result

[{'sample_name': 'HG00731',
  'bam_drs_id': '515ae091f29ac699a4d2e272812cea47',
  'acc': 'SRR1606560',
  'run_id': 'b9dfbc06-e31c-4f22-b5d5-fbcd517ceb7d'},
 {'sample_name': 'HG00637',
  'bam_drs_id': '475dfc02f643c368036df6816d05afe4',
  'acc': 'SRR1596919',
  'run_id': '55574c1a-a47c-42e5-b146-a46faf20487c'},
 {'sample_name': 'HG00640',
  'bam_drs_id': '58e2964f2a0adbf41ab0e8c7a95e7d0c',
  'acc': 'SRR1596923',
  'run_id': '26d015b0-17d8-4d41-be58-c5cba814af1c'}]

In [13]:
for run in json_result:
    status = wesClient.get_task_status(run['run_id'])
    print("Run {} {}".format(run['run_id'], status))

Run b9dfbc06-e31c-4f22-b5d5-fbcd517ceb7d COMPLETE
Run 55574c1a-a47c-42e5-b146-a46faf20487c COMPLETE
Run 26d015b0-17d8-4d41-be58-c5cba814af1c COMPLETE


### Check status above until completion
Re-run the status cell above until every run reports COMPLETE. Expect these runs to take 7-10 minutes to complete.

## Getting the results

In [14]:
runLog = wesClient.get_run_log(my_runs[0])
runLog

{'request': {'tags': {},
  'workflow_params': {'name': 'SAMtools View 1.8 run - 06-30-22 13:04:09',
   'project': 'forei/ismb-tutorial',
   'inputs': {'total_memory_GB': None,
    'coverage_limit': None,
    'count_alignments': True,
    'include_only_read_group': None,
    'remove_duplicates': None,
    'max_insert_size': None,
    'reference_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5',
     'name': 'Homo_sapiens_assembly19_1000genomes_decoy.fasta',
     'class': 'File'},
    'output_file_path': 'HG00731.txt',
    'alignment_file_url': 'https://1000genomes.s3.amazonaws.com/phase3/data/HG00731/exome_alignment/HG00731.mapped.ILLUMINA.bwa.PUR.exome.20130422.bam'}},
  'workflow_type': 'CWL',
  'workflow_engine_params': {},
  'workflow_url': 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'},
 'state': 'COMPLETE',
 'outputs': {'counts': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62bda38c14b0e420a0e2ea10',
   'name': '_5_HG00731.txt',
   'class': '

Use the Seven Bridges CGC DRS service to retrieve the output file

In [15]:
from fasp.loc import sbcgcDRSClient
results_DRS_client = sbcgcDRSClient(SB_API_KEY_PATH, 's3')
resultsDRSID = runLog['outputs']['counts']['path']
resultsDRSID = resultsDRSID.split('/')[-1]
print(resultsDRSID)
fileDetails = results_DRS_client.get_object(resultsDRSID)
print(fileDetails)
url = results_DRS_client.get_access_url(resultsDRSID, 's3')

62bda38c14b0e420a0e2ea10
{'id': '62bda38c14b0e420a0e2ea10', 'name': '_5_HG00731.txt', 'size': 10, 'checksums': [{'type': 'etag', 'checksum': '67e4fcf39160d5fc84b28ef28d3eac7a-1'}], 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/62bda38c14b0e420a0e2ea10', 'created_time': '2022-06-30T13:22:20Z', 'updated_time': '2022-06-30T13:22:20Z', 'mime_type': 'application/json', 'access_methods': [{'type': 's3', 'region': 'us-east-1', 'access_id': 'aws-us-east-1'}]}
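
The cell above takes the object id from the end of the `drs://` URI with `split('/')`. A slightly more defensive sketch uses `urllib.parse` to separate the host from the id. The mapping to an `https://.../ga4gh/drs/v1/objects/...` URL follows the DRS convention for hostname-based URIs; `split_drs_uri` and `drs_object_url` are hypothetical helpers, and whether a given server exposes exactly that path is an assumption.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_drs_uri(drs_uri: str) -> Tuple[str, str]:
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError("Not a hostname-based DRS URI: {}".format(drs_uri))
    return parsed.netloc, parsed.path.lstrip("/")


def drs_object_url(drs_uri: str) -> str:
    """Build the HTTPS URL of the DRS object, assuming the standard base path."""
    host, object_id = split_drs_uri(drs_uri)
    return "https://{}/ga4gh/drs/v1/objects/{}".format(host, object_id)


example_uri = "drs://cgc-ga4gh-api.sbgenomics.com/62bda38c14b0e420a0e2ea10"
print(split_drs_uri(example_uri))
print(drs_object_url(example_uri))
```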


In [17]:
wesClient.get_run_log('4c1d705f-c9bb-4347-818b-a1e84af60255')

{'request': {'tags': {},
  'workflow_params': {'name': 'SAMtools View 1.8 run - 06-29-22 23:50:13',
   'project': 'forei/ismb-tutorial',
   'inputs': {'total_memory_GB': None,
    'coverage_limit': None,
    'count_alignments': True,
    'include_only_read_group': None,
    'remove_duplicates': None,
    'max_insert_size': None,
    'reference_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5',
     'name': 'Homo_sapiens_assembly19_1000genomes_decoy.fasta',
     'class': 'File'},
    'output_file_path': 'NA18948.txt',
    'alignment_file_url': 'https://1000genomes.s3.amazonaws.com/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}},
  'workflow_type': 'CWL',
  'workflow_engine_params': {},
  'workflow_url': 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'},
 'state': 'COMPLETE',
 'outputs': {'counts': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62bce6d14e3edb6b1c423d75',
   'name': 'NA18948.txt',
   'class': 'Fil

The next cell defines a function that retrieves the result of a run. It will:

* Retrieve the run log from the WES server
* Download the result file
* Extract the count from the file
* Return the count


In [19]:
import requests

def get_sam_view_result(run_id):
    # WES API call to retrieve the log of the run - including the results
    log = wesClient.get_run_log(run_id)
    # The output path is a DRS URI; its last path segment is the DRS object id
    resultsDRSID = log['outputs']['counts']['path']
    resultsDRSID = resultsDRSID.split('/')[-1]

    # DRS API call to get a URL for the results file
    url = results_DRS_client.get_access_url(resultsDRSID, 's3')

    # Download the results file and return its content, the alignment count
    response = requests.get(url)
    return response.text.strip()

    

In [20]:
for run in json_result:
    status = wesClient.get_task_status(run['run_id'])
    if status == 'COMPLETE':
        count_result = get_sam_view_result(run['run_id'])
        run['count_result'] = count_result
    else:
        run['count_result'] = status


In [21]:
json_result

[{'sample_name': 'HG00731',
  'bam_drs_id': '515ae091f29ac699a4d2e272812cea47',
  'acc': 'SRR1606560',
  'run_id': 'b9dfbc06-e31c-4f22-b5d5-fbcd517ceb7d',
  'count_result': '432255472'},
 {'sample_name': 'HG00637',
  'bam_drs_id': '475dfc02f643c368036df6816d05afe4',
  'acc': 'SRR1596919',
  'run_id': '55574c1a-a47c-42e5-b146-a46faf20487c',
  'count_result': '102554431'},
 {'sample_name': 'HG00640',
  'bam_drs_id': '58e2964f2a0adbf41ab0e8c7a95e7d0c',
  'acc': 'SRR1596923',
  'run_id': '26d015b0-17d8-4d41-be58-c5cba814af1c',
  'count_result': '102424655'}]
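
Because the loop above stores either the count or the run status in `count_result`, and the count arrives as a string, it can help to separate the two before doing any arithmetic. The following is a minimal sketch; `split_counts` is a hypothetical helper and the `PENDING-SAMPLE` row is made up to show the unfinished case.

```python
from typing import Dict, List, Tuple


def split_counts(rows: List[Dict]) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Separate finished counts (as ints) from runs that only have a status."""
    counts: Dict[str, int] = {}
    unfinished: Dict[str, str] = {}
    for row in rows:
        value = str(row.get("count_result", ""))
        if value.isdigit():
            counts[row["sample_name"]] = int(value)
        else:
            unfinished[row["sample_name"]] = value
    return counts, unfinished


example_rows = [
    {"sample_name": "HG00731", "count_result": "432255472"},
    {"sample_name": "HG00637", "count_result": "102554431"},
    # Hypothetical row for a run that has not finished yet.
    {"sample_name": "PENDING-SAMPLE", "count_result": "RUNNING"},
]
demo_counts, demo_unfinished = split_counts(example_rows)
print(demo_counts, demo_unfinished)
```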

In [22]:
import pandas as pd
df = pd.DataFrame(json_result)
df

|   | sample_name | bam_drs_id | acc | run_id | count_result |
|---|---|---|---|---|---|
| 0 | HG00731 | 515ae091f29ac699a4d2e272812cea47 | SRR1606560 | b9dfbc06-e31c-4f22-b5d5-fbcd517ceb7d | 432255472 |
| 1 | HG00637 | 475dfc02f643c368036df6816d05afe4 | SRR1596919 | 55574c1a-a47c-42e5-b146-a46faf20487c | 102554431 |
| 2 | HG00640 | 58e2964f2a0adbf41ab0e8c7a95e7d0c | SRR1596923 | 26d015b0-17d8-4d41-be58-c5cba814af1c | 102424655 |
