## Running samtools view via the WES API

#### Learning Objectives
Workshop attendees will learn how to use the GA4GH Workflow Execution Service (WES).

What will participants do as part of the exercise?

 - Understand how to run a workflow via WES
 - Adjust some parameters of the workflow
 - Check the status of the runs
 - Access the workflow results via DRS
 
#### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
 
 
Just as we used a Python client to submit DRS requests in the previous notebook, we will use a similar client from the fasp package to run workflows.

Setting the debug flag to True on the client shows the actual HTTP calls it makes.


In [30]:
from fasp.workflow import sbcgcWESClient
SB_PROJECT = 'forei/ismb-tutorial'
SB_API_KEY_PATH = '~/.keys/sbcgc_key.json'

cl = sbcgcWESClient(SB_PROJECT, debug=True)

In [2]:
task_name = "Tutorial run 1 test via WES - header only"

drs_uris  = ['drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806',
'drs://cgc-ga4gh-api.sbgenomics.com/5772b6f8507c175267448709']

params = {
"project": SB_PROJECT,
"name": task_name,
"inputs": {
    "output_header_only": True,
    "include_header": True,
        "in_alignments": {
          "path": drs_uris[0],
          "class": "File"
        }
    }
}
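
The same body can be generated for any input file and any set of tool options. The following is a minimal sketch, assuming a hypothetical helper named `make_task_params` (it is not part of fasp); the option names passed as keyword arguments are the app input ids used above.

```python
import json


def make_task_params(project, task_name, drs_uri, **tool_options):
    """Build a workflow_params body for one alignment file.

    Hypothetical helper for illustration, not part of fasp.
    tool_options are passed through unchanged as app inputs.
    """
    inputs = dict(tool_options)
    # The input file is referenced by its DRS URI, as in the cell above
    inputs["in_alignments"] = {"path": drs_uri, "class": "File"}
    return {"project": project, "name": task_name, "inputs": inputs}


example = make_task_params(
    "forei/ismb-tutorial",
    "Tutorial run 1 test via WES - header only",
    "drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806",
    output_header_only=True,
    include_header=True,
)
print(json.dumps(example, indent=2))
```

Keeping the file reference and the tool options separate makes it easier to reuse one body for several runs that differ only in their options.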

## Calling WES from Python

The body is now in a form that can be serialized with `json.dumps` and passed to the client's `run_generic_workflow` function, as follows.

In [3]:
import json

#sam_view_app = 'sbg://admin/sbg-public-data/samtools-view-1-9-cwl1-0'
#sam_view_app = 'sbg://yasasvinip/test-1/samtools-view-1-9-cwl1-0'
sam_view_app = 'sbg://forei/ismb-tutorial/samtools-view-1-9-cwl1-0'

run_id = cl.run_generic_workflow(
    workflow_url=sam_view_app,
    workflow_params = json.dumps(params),
    workflow_type = "CWL",
    workflow_type_version = "v1.0",
    verbose=False
)
run_id

sending to https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs
Workflow Parameters
"{\"project\": \"forei/ismb-tutorial\", \"name\": \"Tutorial run 1 test via WES - header only\", \"inputs\": {\"output_header_only\": true, \"include_header\": true, \"in_alignments\": {\"path\": \"drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806\", \"class\": \"File\"}}}"
BODY
{
   "workflow_url": [
      null,
      "sbg://forei/ismb-tutorial/samtools-view-1-9-cwl1-0",
      "text/plain"
   ],
   "workflow_params": [
      null,
      "{\"project\": \"forei/ismb-tutorial\", \"name\": \"Tutorial run 1 test via WES - header only\", \"inputs\": {\"output_header_only\": true, \"include_header\": true, \"in_alignments\": {\"path\": \"drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806\", \"class\": \"File\"}}}",
      "application/json"
   ],
   "workflow_engine_params": [
      null,
      null,
      "application/json"
   ],
   "workflow_type": [
      null,
      "CWL",
      "text/

'a5e6968e-72ea-4846-ad72-010878fb5d94'
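
The BODY printed in the debug output is a set of multipart form fields, each shown as a `[filename, value, content type]` triple with no filename. A minimal sketch of how such fields could be assembled is shown below; `wes_form_fields` is a hypothetical name, and the sketch leaves out the empty `workflow_engine_params` field that appears in the debug output.

```python
import json


def wes_form_fields(workflow_url, params,
                    workflow_type="CWL", workflow_type_version="v1.0"):
    """Return WES run-request fields as (filename, value, content_type) tuples.

    Hypothetical helper for illustration; a filename of None marks a
    plain form field rather than an uploaded file.
    """
    return {
        "workflow_url": (None, workflow_url, "text/plain"),
        "workflow_params": (None, json.dumps(params), "application/json"),
        "workflow_type": (None, workflow_type, "text/plain"),
        "workflow_type_version": (None, workflow_type_version, "text/plain"),
    }


fields = wes_form_fields(
    "sbg://forei/ismb-tutorial/samtools-view-1-9-cwl1-0",
    {"project": "forei/ismb-tutorial", "name": "demo", "inputs": {}},
)
for name, (_, value, content_type) in fields.items():
    print(name, content_type, value)
```

A dictionary of this shape is what the `requests` library accepts as its `files=` argument when posting multipart form data, which is consistent with the structure the client prints.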

In [10]:
cl.get_task_status(run_id)

Get request sent to: https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/a5e6968e-72ea-4846-ad72-010878fb5d94


'COMPLETE'
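
A run does not finish immediately, so the status call usually has to be repeated. The following sketch polls until the run reaches one of the terminal states defined in the WES specification. `wait_for_run` is a hypothetical helper, not a fasp function; it takes the status function as an argument so that it can be used with `cl.get_task_status`.

```python
import time

# Terminal run states defined by the WES specification
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_status, run_id, poll_seconds=30, max_polls=120,
                 sleep=time.sleep):
    """Poll get_status(run_id) until the run reaches a terminal state.

    Hypothetical helper for illustration. Raises TimeoutError if the
    run is still active after max_polls status checks.
    """
    for _ in range(max_polls):
        state = get_status(run_id)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} not finished after {max_polls} polls")


# Usage with the client above (not executed here):
# final_state = wait_for_run(cl.get_task_status, run_id)
```

Checking for all terminal states, rather than only `COMPLETE`, stops the loop promptly when a run fails or is canceled.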

#### Adjust a parameter of the run

Using the description of the app on the Seven Bridges Platform,
identify the parameter that directs samtools view to output only the count of matching records:
https://cgc.sbgenomics.com/u/forei/ismb-tutorial/apps/#forei/ismb-tutorial/samtools-view-1-9-cwl1-0

Alter the details in the following copy of the previous run:
* Edit the parameters section below to set the value of the parameter you have identified to True.
* Delete the other parameters from the previous run.
* Enter a task name that will help you identify the task.

In [8]:
task_name2 = "samtools view count only"

params2 = {
"project": SB_PROJECT,
"name": task_name2,
"inputs": {
    "count_alignments": True,
        "in_alignments": {
          "path": drs_uris[0],
          "class": "File"
        }
    }
}

#### Submit the revised task and make a note of the run_id

In [11]:
run_id2 = cl.run_generic_workflow(
    workflow_url=sam_view_app,
    workflow_params = json.dumps(params2),
    workflow_type = "CWL",
    workflow_type_version = "v1.0",
    verbose=False
)
run_id2

sending to https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs
Workflow Parameters
"{\"project\": \"forei/ismb-tutorial\", \"name\": \"samtools view count only\", \"inputs\": {\"count_alignments\": true, \"in_alignments\": {\"path\": \"drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806\", \"class\": \"File\"}}}"
BODY
{
   "workflow_url": [
      null,
      "sbg://forei/ismb-tutorial/samtools-view-1-9-cwl1-0",
      "text/plain"
   ],
   "workflow_params": [
      null,
      "{\"project\": \"forei/ismb-tutorial\", \"name\": \"samtools view count only\", \"inputs\": {\"count_alignments\": true, \"in_alignments\": {\"path\": \"drs://cgc-ga4gh-api.sbgenomics.com/5832fef8507c17de5bfc5806\", \"class\": \"File\"}}}",
      "application/json"
   ],
   "workflow_engine_params": [
      null,
      null,
      "application/json"
   ],
   "workflow_type": [
      null,
      "CWL",
      "text/plain"
   ],
   "workflow_type_version": [
      null,
      "v1.0",
      "text/plain"
  

'60d1acc0-13f3-4647-82dc-2948cec56d4f'

#### Noting the name of the variable in which the id of the new run was stored, write a line to check the status of the run

In [16]:
cl.get_task_status(run_id2)

Get request sent to: https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/60d1acc0-13f3-4647-82dc-2948cec56d4f


'COMPLETE'

## Getting the results - via DRS
Once the first run is complete, further steps can use DRS to obtain the file output from the workflow.

In [18]:
runLog = cl.get_run_log(run_id)
runLog['outputs']

Get request sent to: https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/a5e6968e-72ea-4846-ad72-010878fb5d94


{'reads_not_selected_by_filters': None,
 'alignement_count': None,
 'out_alignments': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62bcbd19f08fea4770666e94',
  'basename': '_17_G20479.HCC1143_1M.aligned.header.sam',
  'nameext': '.sam',
  'class': 'File',
  'nameroot': '_17_G20479.HCC1143_1M.aligned.header'}}

In [19]:
results_drs_uri = runLog['outputs']['out_alignments']['path']
results_drs_uri

'drs://cgc-ga4gh-api.sbgenomics.com/62bcbd19f08fea4770666e94'

We'll pass over the question of how one would determine which DRS server that URI needs to be sent to, because:
* In this case it's fairly obvious: it's the CGC DRS server.
* We want to get something up and working.
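
For a hostname-based DRS URI such as this one, the server is the host part of the URI, and the DRS specification maps it to an `https` URL under `/ga4gh/drs/v1/objects/`. The sketch below shows that mapping using only the standard library. The function names are hypothetical, and compact identifier-based DRS URIs, which do need a metaresolver, are not handled.

```python
from urllib.parse import urlparse


def split_drs_uri(drs_uri):
    """Split a hostname-based DRS URI into (host, object_id).

    Hypothetical helper for illustration; it rejects anything that is
    not of the form drs://<host>/<id>.
    """
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri}")
    return parsed.netloc, object_id


def drs_object_url(drs_uri):
    """Return the https URL of the DRS object record for this URI."""
    host, object_id = split_drs_uri(drs_uri)
    return f"https://{host}/ga4gh/drs/v1/objects/{object_id}"


uri = "drs://cgc-ga4gh-api.sbgenomics.com/62bcbd19f08fea4770666e94"
print(split_drs_uri(uri))
print(drs_object_url(uri))
```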

In [20]:
from fasp.loc import sbcgcDRSClient
drsClient = sbcgcDRSClient(SB_API_KEY_PATH, 's3')

### DRS GetObject
Here's how we then get the details of the file. Note that only the id portion of the DRS URI is passed to the client. Normally it is the job of a metaresolver to look at the URI and determine where to send the id. As noted, we are passing up the opportunity to use a metaresolver and extract the bare id as follows:

In [21]:
# get the id part of the URI
out_alignments_drs_id = results_drs_uri.split('/')[-1]
print(f"Getting {out_alignments_drs_id} from DRS Client")
fileDetails = drsClient.get_object(out_alignments_drs_id)
fileDetails

Getting 62bcbd19f08fea4770666e94 from DRS Client


{'id': '62bcbd19f08fea4770666e94',
 'name': '_17_G20479.HCC1143_1M.aligned.header.sam',
 'size': 3598,
 'checksums': [{'type': 'etag',
   'checksum': '159acae2ce81efee40f0f89ee58f2f95-1'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/62bcbd19f08fea4770666e94',
 'created_time': '2022-06-29T20:59:05Z',
 'updated_time': '2022-06-29T20:59:05Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}
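
The `access_methods` list in this record is what links an access type such as `s3` to the `access_id` needed to request a download URL. The sketch below shows that lookup on a trimmed copy of the record above. `find_access_id` is a hypothetical helper; how the fasp client resolves the `'s3'` argument internally is not shown in this notebook.

```python
def find_access_id(drs_object, access_type):
    """Return the access_id of the first access method of the given type.

    Hypothetical helper for illustration. Returns None when the object
    has no access method of that type.
    """
    for method in drs_object.get("access_methods", []):
        if method.get("type") == access_type:
            return method.get("access_id")
    return None


# Trimmed copy of the DRS object record shown above
file_details = {
    "id": "62bcbd19f08fea4770666e94",
    "access_methods": [
        {"type": "s3", "region": "us-east-1", "access_id": "aws-us-east-1"}
    ],
}
print(find_access_id(file_details, "s3"))
```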

In [22]:
url = drsClient.get_access_url(out_alignments_drs_id,'s3')

### Warning - results files can be approx 700-800 MB (the header-only result here is only 3598 bytes)

### Downloading the file
Now we can use the url obtained to download the file. We'll create a small function to encapsulate the download.

In [23]:
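import os
import requests

def download(url, file_path):
    # Fetch first so that a failed request does not leave an empty file behind
    response = requests.get(url)
    # Raise an exception for HTTP error responses instead of saving them
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)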

In [24]:
fullPath = '~/Downloads/' + fileDetails['name']
download(url, fullPath)
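
For larger result files it may be preferable not to hold the whole response in memory. The following is a sketch of a streaming alternative that uses only the standard library; `download_streaming` is a hypothetical name, and the sketch assumes the access URL is a plain string that can be opened directly.

```python
import os
import shutil
import urllib.request


def download_streaming(url, file_path, chunk_size=1024 * 1024):
    """Copy the content at url to file_path in chunks.

    Hypothetical helper for illustration. Returns the number of bytes
    written so the caller can compare it with the DRS 'size' field.
    """
    target = os.path.expanduser(file_path)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(target)


# Usage with the values above (not executed here):
# written = download_streaming(url, fullPath)
# assert written == fileDetails['size']
```

Comparing the byte count with the `size` reported by DRS is a cheap check that the download was not truncated.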

### Repeat the steps above to retrieve the results of the second run

In [25]:
runLog = cl.get_run_log(run_id2)
runLog['outputs']

Get request sent to: https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/60d1acc0-13f3-4647-82dc-2948cec56d4f


{'reads_not_selected_by_filters': None,
 'alignement_count': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/62bcbec3f08fea4770666edc',
  'basename': '_7_G20479.HCC1143_1M.aligned.count.txt',
  'nameext': '.txt',
  'class': 'File',
  'nameroot': '_7_G20479.HCC1143_1M.aligned.count'},
 'out_alignments': None}
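
The two runs populate different output keys, with the unused outputs set to `None`. A small sketch for collecting whichever outputs are present is shown below; `output_drs_uris` is a hypothetical helper, and the sample dictionary is a trimmed copy of the outputs above.

```python
def output_drs_uris(outputs):
    """Map each populated output name to its DRS URI, skipping None outputs.

    Hypothetical helper for illustration.
    """
    return {
        name: value["path"]
        for name, value in outputs.items()
        if isinstance(value, dict) and "path" in value
    }


# Trimmed copy of runLog['outputs'] for the second run
outputs = {
    "reads_not_selected_by_filters": None,
    "alignement_count": {
        "path": "drs://cgc-ga4gh-api.sbgenomics.com/62bcbec3f08fea4770666edc",
        "class": "File",
    },
    "out_alignments": None,
}
print(output_drs_uris(outputs))
```

Note that the key is spelled `alignement_count` in the app's outputs, so that spelling has to be used when reading the run log.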

In [26]:
results_drs_uri = runLog['outputs']['alignement_count']['path']
results_drs_uri

'drs://cgc-ga4gh-api.sbgenomics.com/62bcbec3f08fea4770666edc'

In [27]:
# get the id part of the URI
alignment_count_drs_id = results_drs_uri.split('/')[-1]
print(f"Getting {alignment_count_drs_id} from DRS Client")
fileDetails = drsClient.get_object(alignment_count_drs_id)
fileDetails

Getting 62bcbec3f08fea4770666edc from DRS Client


{'id': '62bcbec3f08fea4770666edc',
 'name': '_7_G20479.HCC1143_1M.aligned.count.txt',
 'size': 8,
 'checksums': [{'type': 'etag',
   'checksum': 'f5c995e0afde0c351122a317dde6d44d-1'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/62bcbec3f08fea4770666edc',
 'created_time': '2022-06-29T21:06:11Z',
 'updated_time': '2022-06-29T21:06:11Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}

In [28]:
url = drsClient.get_access_url(alignment_count_drs_id,'s3')

In [29]:
fullPath = '~/Downloads/' + fileDetails['name']
download(url, fullPath)