 ### Combining GA4GH standards to perform an end-to-end workflow
 
#### Learning Objectives
Combine Data Connect, WES and DRS services  

What will participants do as part of the exercise?

 - Search for files with Data Connect
 - Obtain links to access files
 - Submit the files to a WES workflow
 - Retrieve the results of the analysis
 
 
 #### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
 #### 1. Run a cell in a Jupyter notebook
 
 ## Obtain Thousand Genomes files from SRA DRS and submit to Seven Bridges WES

🖐 Set your project name, the location of your API key file, and your download folder

In [None]:
SB_PROJECT = 'forei/ismb-tutorial'
SB_API_KEY_PATH = '~/.keys/sbcgc_key.json'
DOWNLOAD_LOCATION = '~/Downloads'

In [None]:
from fasp.search import DataConnectClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'PUR' LIMIT 3'''

json_result = searchClient.run_query(query, returnType='json')
json_result
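
A Data Connect server returns query results in pages: each response carries a `data` array and, when more rows remain, a `pagination.next_page_url` to follow. The sketch below illustrates how a client could gather rows across pages. It shows the response shape only and is not the `fasp` implementation; `collect_rows` and the `fetch_page` callable are hypothetical names, and the pages are canned dictionaries rather than live server responses.

```python
def collect_rows(first_page, fetch_page):
    """Gather rows from a paged Data Connect style response.

    first_page -- dict with a 'data' list and an optional 'pagination' dict
    fetch_page -- hypothetical callable taking a URL and returning the next page dict
    """
    rows = []
    page = first_page
    while page is not None:
        # A page may legitimately hold no rows, so always check for a next page
        rows.extend(page.get('data', []))
        next_url = (page.get('pagination') or {}).get('next_page_url')
        page = fetch_page(next_url) if next_url else None
    return rows


# Canned pages standing in for server responses (illustrative values only)
pages = {
    'page2': {'data': [{'sample_name': 'S2'}], 'pagination': {'next_page_url': 'page3'}},
    'page3': {'data': [{'sample_name': 'S3'}]},
}
first = {'data': [{'sample_name': 'S1'}], 'pagination': {'next_page_url': 'page2'}}

all_rows = collect_rows(first, pages.get)
print(len(all_rows))
```

In a real client, `fetch_page` would issue an HTTP GET for the URL and decode the JSON body.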

### Convert the result into a DataFrame

In [None]:
import pandas as pd
first_df = pd.DataFrame(json_result)
first_df

### Use DRS to obtain file details

The following shows how the SRA DRS server reports where a file can be obtained from, using the first DRS id from the query results.

In [None]:
from fasp.loc import DRSClient

drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True, debug=False)
test_id = json_result[0]['bam_drs_id']
print(test_id)
objInfo = drsClient.get_object(test_id)
objInfo

A second DRS call obtains a URL to access the file from one of the locations listed above.

In [None]:
access_id = objInfo['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
url = drsClient.get_access_url(test_id, access_id=access_id)
print('url:{}'.format(url))
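
The cell above takes the first access method in the list. A DRS object can list several access methods, each with a `type`, an `access_id` and often a `region`, so it can be useful to choose one deliberately. The following is a minimal sketch of that selection, assuming the object is a plain dictionary like the one returned by `get_object`. The helper `pick_access_method` and the example object (including the `gs.US` region and the `a1`/`a2` ids) are hypothetical and exist only for illustration.

```python
def pick_access_method(drs_object, region=None, access_type=None):
    """Return the first access method matching the optional filters, else None."""
    for method in drs_object.get('access_methods', []):
        if region is not None and method.get('region') != region:
            continue
        if access_type is not None and method.get('type') != access_type:
            continue
        return method
    return None


# Hypothetical DRS object, trimmed to the fields used here
example_object = {
    'id': 'example-id',
    'access_methods': [
        {'type': 'https', 'access_id': 'a1', 'region': 'gs.US'},
        {'type': 'https', 'access_id': 'a2', 'region': 's3.us-east-1'},
    ],
}

chosen = pick_access_method(example_object, region='s3.us-east-1')
print(chosen['access_id'])
```

The `access_id` of the chosen method is what would then be passed to `get_access_url`.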

### Set up a WES client

In [None]:
from fasp.workflow import sbcgcWESClient
wesClient = sbcgcWESClient(SB_PROJECT, api_key_path=SB_API_KEY_PATH)

#### Define a function to submit the workflow

🖐 In the following cell you may need to point to your own copy of the application.
Instructions will be provided.

In [None]:
import json
import requests

def runWorkflow(wesClient, fileurl, outfile):

    sam_view_app = 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'
    #replace with your copy of the app
    #sam_view_app = 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'

    ref_drs_id = 'drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5'

    params = {
        "project": SB_PROJECT,
        "inputs": {
          "alignment_file_url": fileurl,
          "count_alignments": True,
          "reference_file": {
            "path": ref_drs_id,
            "name": "references-hs37d5-hs37d5.fasta",
            "class": "File"
          },
          "output_file_path": outfile
        }
     }


    # Submit the run through the WES client and return its run id
    run_id = wesClient.run_generic_workflow(
        workflow_url=sam_view_app,
        workflow_params=json.dumps(params),
        workflow_type="CWL",
        workflow_type_version="sbg:draft-2",
        verbose=False
    )
    return run_id

#### For each result of the query above, submit a task to the Cancer Genomics Cloud

In [None]:
import datetime

# set the region we want to access data from
region = 's3.us-east-1'
my_runs = []
        
for row in json_result:

    print("subject={}, drsID={}".format(row['sample_name'], row['bam_drs_id']))
    drs_id = row['bam_drs_id']


    # Step 2 - Use DRS to get a URL for the file in the chosen region
    objInfo = drsClient.get_object(drs_id)
    url = drsClient.get_url_for_region(drs_id, region)

    # Step 3 - Run a pipeline on the file at the DRS URL
    if url is not None:
        outfile = "{}.txt".format(row['sample_name'])
        time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        run_id = runWorkflow(wesClient, url, outfile)
        print('Submitted run {} to {}'.format(run_id, wesClient.__class__.__name__))
        my_runs.append(run_id)
        row['run_id']=run_id
    print('_________________________________________________________________________')

In [None]:
json_result

In [None]:
for run in json_result:
    # Rows with no URL in the chosen region were never submitted
    if 'run_id' not in run:
        continue
    status = wesClient.get_task_status(run['run_id'])
    print("Run {} {}".format(run['run_id'], status))

### Check status above until completion
Expect these runs to take 7-10 minutes to complete.
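
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, assuming the status call returns WES state strings such as `RUNNING` and `COMPLETE`. `wait_for_runs` is a hypothetical helper, and the demo uses a fake status function so that it runs without a server.

```python
import time

# States in the WES state enumeration after which a run will not change again
TERMINAL_STATES = {'COMPLETE', 'EXECUTOR_ERROR', 'SYSTEM_ERROR', 'CANCELED'}


def wait_for_runs(get_status, run_ids, poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll until every run reaches a terminal state or max_polls is exhausted.

    Returns (final, pending): final maps run id to terminal state,
    pending is the set of run ids that had not finished.
    """
    pending = set(run_ids)
    final = {}
    for _ in range(max_polls):
        for run_id in sorted(pending):
            status = get_status(run_id)
            if status in TERMINAL_STATES:
                final[run_id] = status
        pending -= set(final)
        if not pending:
            break
        sleep(poll_seconds)
    return final, pending


# Demo with a fake status function: each run completes on its second poll
calls = {}

def fake_status(run_id):
    calls[run_id] = calls.get(run_id, 0) + 1
    return 'COMPLETE' if calls[run_id] >= 2 else 'RUNNING'

final, pending = wait_for_runs(fake_status, ['run-a', 'run-b'], sleep=lambda seconds: None)
print(final, pending)
```

In this notebook the call would look like `wait_for_runs(wesClient.get_task_status, my_runs)`. The defaults of 30 seconds and 40 polls are arbitrary choices that cover roughly 20 minutes.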

## Getting the results

Use the Seven Bridges CGC DRS service to retrieve the output file.

The next cell defines a function to retrieve the results from the WES server. The function performs these steps:

* Retrieve the result
* Download the result file
* Extract the count from the file
* Return the count


In [None]:
import tempfile

def get_sam_view_result(run_id):
    # WES API call to retrieve the log of the run - including the results
    log = wesClient.get_run_log(run_id)
    resultsDRSID = log['outputs']['counts']['path']
    resultsDRSID = resultsDRSID.split('/')[-1]
    
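    # DRS API call to get the results file.
    # NOTE: results_DRS_client is not defined in this section; create a DRS client
    # for the Seven Bridges CGC DRS service under that name before calling this function.
    url = results_DRS_client.get_access_url(resultsDRSID,'s3')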
    
    with tempfile.NamedTemporaryFile(mode='r+') as file:
        response = requests.get(url)
        file.write(response.text)
        file.seek(0)
        x = file.read()
    return x.strip()
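
The function above recovers the DRS id by taking the text after the last `/` in the output path. Where the path is a full `drs://` URI, as with the reference file used earlier, the host and object id can also be separated with the standard library. This is a minimal sketch, not part of `fasp`; `split_drs_uri` is a hypothetical helper, and it assumes the simple hostname-based `drs://host/id` form.

```python
from urllib.parse import urlparse


def split_drs_uri(drs_uri):
    """Split a hostname-based drs:// URI into (host, object_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != 'drs':
        raise ValueError('not a drs:// URI: {}'.format(drs_uri))
    # The path starts with '/', which is not part of the object id
    return parsed.netloc, parsed.path.lstrip('/')


host, object_id = split_drs_uri('drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5')
print(host, object_id)
```

Checking the scheme makes the function fail loudly on unexpected input, whereas `split('/')[-1]` would return the last path segment of any string.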

    

In [None]:
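for run in json_result:
    # Skip rows that were never submitted (no URL in the chosen region)
    if 'run_id' not in run:
        continue
    status = wesClient.get_task_status(run['run_id'])
    if status == 'COMPLETE':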
        count_result = get_sam_view_result(run['run_id'])
        run['count_result'] = count_result
    else:
        run['count_result'] = status


In [None]:
json_result

In [None]:
import pandas as pd
df = pd.DataFrame(json_result)
df