# Using HealthOmics Workflows & Runs
### The goal of this notebook is to get you acquainted with HealthOmics Workflows and Runs.

#### If you complete this notebook you will have:
+ Created a HealthOmics Workflow
+ Created an IAM role that HealthOmics assumes to run the workflow
+ Run the nf-core scrnaseq workflow


## Prerequisites
#### Python requirements
+ Python >= 3.8
#### Packages:
+ boto3 >= 1.26.19
+ botocore >= 1.29.19
#### AWS requirements
+ AWS CLI: you will need the AWS CLI installed and configured in your environment. Supported AWS CLI versions are:
    - AWS CLI v2 >= 2.9.3 (Recommended)
    - AWS CLI v1 >= 1.27.19
+ AWS Region

<div class="alert alert-block alert-info">
<b>NOTE:</b> AWS HealthOmics only allows importing data within the same region. AWS HealthOmics is currently available in Oregon (us-west-2), N. Virginia (us-east-1), Dublin (eu-west-1), London (eu-west-2), Frankfurt (eu-central-1), and Singapore (ap-southeast-1).</div>
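
Because data can only be imported within the same Region, it can help to check the configured Region before creating any resources. The sketch below hard-codes the Region list from the note above, so it may be out of date; `SUPPORTED_REGIONS` and `is_supported_region` are illustrative names, not part of boto3.

```python
# Regions copied from the note above; this list is an assumption and may be stale.
SUPPORTED_REGIONS = frozenset({
    "us-west-2",
    "us-east-1",
    "eu-west-1",
    "eu-west-2",
    "eu-central-1",
    "ap-southeast-1",
})


def is_supported_region(region_name):
    """Return True if region_name appears in the hard-coded list above."""
    return region_name in SUPPORTED_REGIONS


print(is_supported_region("us-east-1"))
```

In this notebook the check would be called as `is_supported_region(region)` once `region` has been read from the boto3 session.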

## Getting Started
### Step 1. Import libraries

In [54]:
#Import necessary libraries and python SDK
from datetime import datetime
import json
import os
import time

import boto3
import botocore.exceptions

In [82]:
bucket_name = "nigms-scrnaseq-bucket-demo"
bucket_name_out = bucket_name+"-out"
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
workflow_name = 'scrnaseq-workflow-test-john'
# We will use this as the base name for our role and policy
omics_iam_name = 'SageMaker_HealthOmics_test_john'


### Step 2. Create Input and Output S3 Bucket
HealthOmics run inputs and outputs must be stored in an S3 bucket.

In [83]:
!aws s3 mb s3://$bucket_name_out

make_bucket: nigms-scrnaseq-bucket-demo-out


In [84]:
!aws s3 mb s3://$bucket_name

make_bucket: nigms-scrnaseq-bucket-demo
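
The same buckets could also be created from Python instead of the CLI. The sketch below only builds the keyword arguments for the S3 client's `create_bucket` call, since `us-east-1` must be created without a location constraint while other Regions need one. `create_bucket_kwargs` is a hypothetical helper name, and the actual call is left commented out because it needs AWS credentials.

```python
def create_bucket_kwargs(bucket, region_name):
    """Build keyword arguments for the S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region_name != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    return kwargs


print(create_bucket_kwargs("example-bucket", "eu-west-1"))

# Usage in this notebook (requires AWS credentials, so it is commented out):
# s3 = boto3.client("s3", region_name=region)
# s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
```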


### Step 3. Stage and package Workflow into .zip Folder

Clone base repos

In [85]:
!git clone https://github.com/nf-core/scrnaseq --branch 2.3.0 --single-branch

Cloning into 'scrnaseq'...
remote: Enumerating objects: 5147, done.
remote: Counting objects: 100% (1473/1473), done.
remote: Compressing objects: 100% (243/243), done.
remote: Total 5147 (delta 1317), reused 1230 (delta 1230), pack-reused 3674 (from 1)
Receiving objects: 100% (5147/5147), 35.50 MiB | 24.73 MiB/s, done.
Resolving deltas: 100% (3192/3192), done.
Note: switching to 'cfada12a6f5d773a76d5f1793d70661e95852c53'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false



In [86]:
!git clone https://github.com/CBIIT/omx-ecr-helper

Cloning into 'omx-ecr-helper'...
remote: Enumerating objects: 139, done.
remote: Counting objects: 100% (139/139), done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 139 (delta 66), reused 104 (delta 44), pack-reused 0 (from 0)
Receiving objects: 100% (139/139), 111.28 KiB | 10.12 MiB/s, done.
Resolving deltas: 100% (66/66), done.


In [87]:
!git clone https://github.com/aws-samples/amazon-omics-tutorials.git

Cloning into 'amazon-omics-tutorials'...
remote: Enumerating objects: 2412, done.
remote: Counting objects: 100% (645/645), done.
remote: Compressing objects: 100% (354/354), done.
remote: Total 2412 (delta 302), reused 559 (delta 266), pack-reused 1767 (from 1)
Receiving objects: 100% (2412/2412), 101.76 MiB | 32.45 MiB/s, done.
Resolving deltas: 100% (907/907), done.
Updating files: 100% (1444/1444), done.


#### Copy namespace file

In [88]:
!cp ./omx-ecr-helper/lib/lambda/parse-image-uri/public_registry_properties.json scrnaseq/namespace.config

#### Generate omics.config

In [89]:
!python3 amazon-omics-tutorials/utils/scripts/inspect_nf.py \
--output-manifest-file scrnaseq/scrnaseq_230_docker_image_manifest.json \
-n scrnaseq/namespace.config \
--output-config-file scrnaseq/conf/omics.config \
--region $region \
scrnaseq/

Creating container image manifest: scrnaseq/scrnaseq_230_docker_image_manifest.json
Creating nextflow config file: scrnaseq/conf/omics.config


In [90]:
!aws stepfunctions start-execution\
    --state-machine-arn arn:aws:states:$region:$account_id:stateMachine:omx-container-puller\
    --input file://scrnaseq/scrnaseq_230_docker_image_manifest.json

{
    "executionArn": "arn:aws:states:us-east-1:664418964547:execution:omx-container-puller:8f423fbb-831e-4a09-9d94-7a9361f290bd",
    "startDate": 1727805698.087
}


In [105]:
!echo "includeConfig 'conf/omics.config'" >> scrnaseq/nextflow.config 
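
Because `>>` appends unconditionally, re-running the cell above adds the `includeConfig` line a second time. A minimal sketch of an idempotent alternative follows; `ensure_line` is a hypothetical helper, and the demo runs against a throwaway file rather than the real `scrnaseq/nextflow.config`.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Append line to the file at path unless an identical line is already there.

    Returns True if the file was modified.
    """
    p = Path(path)
    text = p.read_text() if p.exists() else ""
    if line in text.splitlines():
        return False
    # Make sure the appended line starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    p.write_text(text + line + "\n")
    return True


# Demo on a temporary file with made-up starting content.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "nextflow.config"
    cfg.write_text("params.outdir = 'results'")
    include = "includeConfig 'conf/omics.config'"
    first = ensure_line(cfg, include)   # appends the line
    second = ensure_line(cfg, include)  # finds the line, changes nothing
    final_text = cfg.read_text()

print(first, second)
```

In the notebook the call would be `ensure_line("scrnaseq/nextflow.config", "includeConfig 'conf/omics.config'")`.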

### Step 4. Create parameter-description.json file
Create a file named `parameter-description.json` and paste the content below into it. The code cell that follows writes the same content from Python.

```json
{
    "input": {"description": "Samplesheet with sample locations.",
                "optional": false},
    "protocol" : {"description": "10X Protocol used: 10XV1, 10XV2, 10XV3",
                "optional": false},
    "aligner": {"description": "choice of aligner: alevin, star, kallisto",
            "optional": false},
    "whitelist": {"description": "Optional whitelist if 10X protocol is not used.",
            "optional": true},
    "gtf": {"description": "S3 path to GTF file",
            "optional": false},
    "fasta": {"description": "S3 path to FASTA file",
            "optional": false}
}
```

In [106]:
with open('parameter-description.json',"w") as f:
    f.write(json.dumps({
        "input": {"description": "Samplesheet with sample locations.",
                    "optional": False},
        "protocol" : {"description": "10X Protocol used: 10XV1, 10XV2, 10XV3",
                    "optional": False},
        "aligner": {"description": "choice of aligner: alevin, star, kallisto",
                "optional": False},
        "whitelist": {"description": "Optional whitelist if 10X protocol is not used.",
                "optional": True},
        "gtf": {"description": "S3 path to GTF file",
                "optional": False},
        "fasta": {"description": "S3 path to FASTA file",
                "optional": False}
    }))

### Step 5. Stage the Workflow
Package the workflow directory into a .zip archive and upload it to the S3 bucket created in Step 2, so that HealthOmics can read the workflow definition from there.

In [107]:
!zip -r scrnaseq-workflow.zip scrnaseq

updating: scrnaseq/ (stored 0%)
updating: scrnaseq/workflows/ (stored 0%)
updating: scrnaseq/workflows/scrnaseq.nf (deflated 78%)
updating: scrnaseq/subworkflows/ (stored 0%)
updating: scrnaseq/subworkflows/local/ (stored 0%)
updating: scrnaseq/subworkflows/local/input_check.nf (deflated 60%)
updating: scrnaseq/subworkflows/local/fastqc.nf (deflated 57%)
updating: scrnaseq/subworkflows/local/alevin.nf (deflated 65%)
updating: scrnaseq/subworkflows/local/align_cellranger.nf (deflated 63%)
updating: scrnaseq/subworkflows/local/align_universc.nf (deflated 60%)
updating: scrnaseq/subworkflows/local/kallisto_bustools.nf (deflated 66%)
updating: scrnaseq/subworkflows/local/mtx_conversion.nf (deflated 61%)
updating: scrnaseq/subworkflows/local/starsolo.nf (deflated 60%)
updating: scrnaseq/.prettierrc.yml (stored 0%)
updating: scrnaseq/.devcontainer/ (stored 0%)
updating: scrnaseq/.devcontainer/devcontainer.json (deflated 65%)
updating: scrnaseq/main.nf (deflated 75%)
updating: scrnaseq/nextfl

In [None]:
# If the zip file is larger than 4 MB, upload it to the bucket you created during ECR setup (replace <yourbucket>)
!aws s3 cp scrnaseq-workflow.zip s3://<yourbucket>/workshop/scrnaseq-workflow.zip

In [108]:
!aws s3 cp scrnaseq-workflow.zip s3://$bucket_name/demo_workflow/scrnaseq-workflow.zip

upload: ./scrnaseq-workflow.zip to s3://nigms-scrnaseq-bucket-demo/demo_workflow/scrnaseq-workflow.zip
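
Whether the S3 upload is needed depends on the archive size. The sketch below checks a file against the 4 MB threshold mentioned in the comment above; the exact byte value in `INLINE_LIMIT_BYTES` is an assumption, and `needs_s3_upload` is a hypothetical helper. The demo uses a small temporary file in place of `scrnaseq-workflow.zip`.

```python
import os
import tempfile

# Assumed threshold, taken from the 4 MB figure in the comment above.
INLINE_LIMIT_BYTES = 4 * 1024 * 1024


def needs_s3_upload(zip_path, limit_bytes=INLINE_LIMIT_BYTES):
    """Return True if the archive at zip_path is larger than limit_bytes."""
    return os.path.getsize(zip_path) > limit_bytes


# Demo: a 1 KiB placeholder file standing in for the workflow archive.
with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
    demo_zip = tmp.name

small_result = needs_s3_upload(demo_zip)                   # 1 KiB vs. 4 MiB
forced_result = needs_s3_upload(demo_zip, limit_bytes=100)  # 1 KiB vs. 100 bytes
os.remove(demo_zip)

print(small_result, forced_result)
```

In the notebook the call would be `needs_s3_upload("scrnaseq-workflow.zip")`.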


### Step 6. Create Workflow using zipped workflow and parameter-description.json


In [109]:
!aws omics create-workflow \
    --name $workflow_name \
    --definition-uri s3://$bucket_name/demo_workflow/scrnaseq-workflow.zip \
    --parameter-template file://parameter-description.json  \
    --engine NEXTFLOW

{
    "arn": "arn:aws:omics:us-east-1:664418964547:workflow/2169252",
    "id": "2169252",
    "status": "CREATING",
    "tags": {}
}


In [None]:
# List the workflow and confirm that its status is ACTIVE before starting a run
!aws omics list-workflows --name $workflow_name

#### Retrieve Workflow ID

In [111]:
client = boto3.client('omics')
workflow_id = client.list_workflows(
    type='PRIVATE',
    name=workflow_name,
)['items'][0]['id']
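
Taking `['items'][0]` raises an `IndexError` when no workflow matches and picks an arbitrary entry when several share the name. A more defensive selection could look like the sketch below, which assumes each item carries the `name`, `status`, `creationTime`, and `id` fields returned by `list_workflows`. `pick_workflow_id` and the sample items are hypothetical.

```python
from datetime import datetime, timezone


def pick_workflow_id(items, name):
    """Return the id of the newest ACTIVE workflow called name.

    items is the 'items' list from a list_workflows response.
    Raises LookupError when nothing matches.
    """
    matches = [
        w for w in items
        if w.get("name") == name and w.get("status") == "ACTIVE"
    ]
    if not matches:
        raise LookupError(f"no ACTIVE workflow named {name!r}")
    return max(matches, key=lambda w: w["creationTime"])["id"]


# Made-up response items, for illustration only.
sample_items = [
    {"id": "111", "name": "demo", "status": "ACTIVE",
     "creationTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "222", "name": "demo", "status": "ACTIVE",
     "creationTime": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"id": "333", "name": "demo", "status": "CREATING",
     "creationTime": datetime(2024, 9, 1, tzinfo=timezone.utc)},
]

print(pick_workflow_id(sample_items, "demo"))  # prints 222
```

In the notebook this would be called as `pick_workflow_id(client.list_workflows(type='PRIVATE', name=workflow_name)['items'], workflow_name)`.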

### Step 7. Setup Inputs
We need an `input.json` file that specifies the values of our input parameters.

In [112]:
with open('input.json',"w") as f:
    f.write(json.dumps({
        "input": "s3://aws-genomics-static-us-east-1/workflow_migration_workshop/nfcore-scrnaseq-v2.3.0/samplesheet-2-0.csv",
        "protocol": "10XV2",
        "aligner": "star",
        "fasta": "s3://aws-genomics-static-us-east-1/workflow_migration_workshop/nfcore-scrnaseq-v2.3.0/GRCm38.p6.genome.chr19.fa",
        "gtf": "s3://aws-genomics-static-us-east-1/workflow_migration_workshop/nfcore-scrnaseq-v2.3.0/gencode.vM19.annotation.chr19.gtf"
}))
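
Before starting a run, the inputs can be compared with the parameter template from Step 4 to catch a missing required parameter early. This is a minimal sketch: `check_parameters` is a hypothetical helper, the template below keeps only the `optional` flags from Step 4, and the parameter values are placeholders.

```python
def check_parameters(template, params):
    """Compare run parameters with a parameter template.

    Returns (missing, unknown): required names absent from params, and
    names in params that the template does not declare.
    """
    missing = sorted(
        name for name, spec in template.items()
        if not spec.get("optional", False) and name not in params
    )
    unknown = sorted(name for name in params if name not in template)
    return missing, unknown


# Same parameter names and optional flags as the template in Step 4.
template = {
    "input": {"optional": False},
    "protocol": {"optional": False},
    "aligner": {"optional": False},
    "whitelist": {"optional": True},
    "gtf": {"optional": False},
    "fasta": {"optional": False},
}

# Placeholder values; the real ones live in input.json.
params = {
    "input": "s3://example-bucket/samplesheet.csv",
    "protocol": "10XV2",
    "aligner": "star",
    "fasta": "s3://example-bucket/genome.fa",
    "gtf": "s3://example-bucket/annotation.gtf",
}

print(check_parameters(template, params))  # prints ([], [])
```

With the files from this notebook, `template` and `params` would instead be loaded with `json.load` from `parameter-description.json` and `input.json`.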

### Step 8. Setup new role
For the purposes of this demo, we will use the following policy and trust policy, which restrict access to only the required S3 buckets. You will need to customize permissions as required.

In [98]:
# Define demo policies
omics_demo_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::"+bucket_name+"/*",
                "arn:aws:s3:::"+bucket_name_out+"/*",
                "arn:aws:s3:::aws-genomics-static-us-east-1/workflow_migration_workshop/nfcore-scrnaseq-v2.3.0/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            # s3:ListBucket is a bucket-level action, so each resource is a bucket ARN
            "Resource": [
                "arn:aws:s3:::"+bucket_name,
                "arn:aws:s3:::"+bucket_name_out,
                "arn:aws:s3:::aws-genomics-static-us-east-1"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::"+bucket_name+"/*",
                "arn:aws:s3:::"+bucket_name_out+"/*",
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:"+region+":"+account_id+":log-group:/aws/omics/WorkflowLog:log-stream:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup"
            ],
            "Resource": [
                "arn:aws:logs:"+region+":"+account_id+":log-group:/aws/omics/WorkflowLog:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchCheckLayerAvailability"
            ],
            "Resource": [
                "arn:aws:ecr:"+region+":"+account_id+":repository/*"
            ]
        }
    ]
}

scrnaseq_workflow_demo_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": account_id
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:omics:"+region+":"+account_id+":run/*"
                }
            }
        }
    ]
}
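
Since `s3:ListBucket` applies to buckets while `s3:GetObject` and `s3:PutObject` apply to objects, a policy document can be scanned for bucket-level actions paired with object ARNs. The sketch below is one way to do that; `BUCKET_LEVEL_ACTIONS` covers only the action used in this notebook, and `misplaced_bucket_resources` and the sample policy are hypothetical.

```python
# Only the bucket-level action used in this notebook; extend as needed.
BUCKET_LEVEL_ACTIONS = {"s3:ListBucket"}


def misplaced_bucket_resources(policy):
    """Return (action, resource) pairs where a bucket-level action targets an object ARN."""
    problems = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # IAM accepts a bare string in place of a one-element list.
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        for action in actions:
            if action not in BUCKET_LEVEL_ACTIONS:
                continue
            for res in resources:
                # A bucket ARN is arn:aws:s3:::<bucket>; a "/" after that names objects.
                if "/" in res.split(":::", 1)[-1]:
                    problems.append((action, res))
    return problems


# Made-up policy, for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": ["arn:aws:s3:::example-bucket/*"]},
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": ["arn:aws:s3:::example-bucket",
                      "arn:aws:s3:::example-bucket-out/*"]},
    ],
}

print(misplaced_bucket_resources(sample_policy))
```

Calling `misplaced_bucket_resources(omics_demo_policy)` lists any entries in the demo policy that deserve a second look.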

In [99]:
# Create the iam client
iam = boto3.resource('iam')

# Check if the role already exists; if not, create it
try:
    role = iam.Role(omics_iam_name)
    role.load()
    
except botocore.exceptions.ClientError as ex:
    if ex.response["Error"]["Code"] == "NoSuchEntity":
        #Create the role with the corresponding trust policy
        role = iam.create_role(
            RoleName=omics_iam_name, 
            AssumeRolePolicyDocument=json.dumps(scrnaseq_workflow_demo_trust_policy))
        
        #Create policy
        policy = iam.create_policy(
            PolicyName='{}-policy'.format(omics_iam_name), 
            Description="Policy for AWS HealthOmics demo",
            PolicyDocument=json.dumps(omics_demo_policy))
        
        #Attach the policy to the role
        policy.attach_role(RoleName=omics_iam_name)
    else:
        # Any other error is unexpected: report it and re-raise so the failure is not hidden
        print('Something went wrong, please retry and check your account settings and permissions')
        raise

In [100]:
#Retrieve the role arn, which grants AWS HealthOmics the proper permissions to access the resources it needs in your AWS account.
def get_role_arn(role_name):
    try:
        iam = boto3.resource('iam')
        role = iam.Role(role_name)
        role.load()  # calls GetRole to load attributes
    except botocore.exceptions.ClientError:
        print("Couldn't get role named %s."%role_name)
        raise
    else:
        print(role.arn)
        return role.arn

In [101]:
# Print the role ARN, which is passed to start-run in the next step
role_arn = get_role_arn(omics_iam_name)

arn:aws:iam::664418964547:role/SageMaker_HealthOmics_test_john


### Step 9. Run the workflow

In [None]:
!aws omics start-run --workflow-id [workflow id] \
     --role-arn [role arn] \
     --name [workflow name] \
     --parameters [input parameter JSON File] \
     --output-uri [s3 bucket output]

In [113]:
!aws omics start-run \
  --name scrnaseq_john_workshop_test_run_1 \
  --role-arn $role_arn \
  --workflow-id $workflow_id \
  --parameters file://input.json \
  --output-uri s3://$bucket_name

{
    "arn": "arn:aws:omics:us-east-1:664418964547:run/5530249",
    "id": "5530249",
    "status": "PENDING",
    "tags": {},
    "uuid": "f2c924a7-8436-64bd-65a6-9e4033ce6d5e",
    "runOutputUri": "s3://nigms-scrnaseq-bucket-demo/5530249"
}
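
The run can also be started from Python with the boto3 `start_run` call, which returns the run `id` directly instead of requiring it to be copied from CLI output. The sketch below only assembles the keyword arguments; `build_start_run_kwargs` is a hypothetical helper, and the identifiers in the demo are placeholders. The real call is commented out because it needs AWS credentials.

```python
import json
import os
import tempfile


def build_start_run_kwargs(workflow_id, role_arn, run_name, parameters_path, output_uri):
    """Assemble keyword arguments for the omics client's start_run call."""
    if not output_uri.startswith("s3://"):
        raise ValueError(f"output_uri must be an S3 URI, got {output_uri!r}")
    # The parameters are passed as a dict, so load them from the JSON file.
    with open(parameters_path) as f:
        parameters = json.load(f)
    return {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "name": run_name,
        "parameters": parameters,
        "outputUri": output_uri,
    }


# Demo with placeholder identifiers and a temporary parameters file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "input.json")
    with open(demo_path, "w") as f:
        json.dump({"protocol": "10XV2", "aligner": "star"}, f)
    demo_kwargs = build_start_run_kwargs(
        "1234567",
        "arn:aws:iam::111122223333:role/example-role",
        "example-run",
        demo_path,
        "s3://example-bucket/out",
    )

print(sorted(demo_kwargs))

# Usage in this notebook (requires AWS credentials, so it is commented out):
# run = client.start_run(**build_start_run_kwargs(
#     workflow_id, role_arn, "scrnaseq_john_workshop_test_run_1",
#     "input.json", f"s3://{bucket_name}"))
# run_id = run["id"]
```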


In [114]:
!aws omics list-runs

{
    "items": [
        {
            "arn": "arn:aws:omics:us-east-1:664418964547:run/5530249",
            "id": "5530249",
            "status": "PENDING",
            "workflowId": "2169252",
            "name": "scrnaseq_john_workshop_test_run_1",
            "creationTime": "2024-10-01T18:19:29.24713Z",
            "storageType": "STATIC"
        },
        {
            "arn": "arn:aws:omics:us-east-1:664418964547:run/3102853",
            "id": "3102853",
            "status": "RUNNING",
            "workflowId": "7679861",
            "name": "scrnaseq_john_workshop_test_run_1",
            "creationTime": "2024-10-01T18:02:41.28628Z",
            "startTime": "2024-10-01T18:14:06.58400Z",
            "storageType": "STATIC"
        },
        {
            "arn": "arn:aws:omics:us-east-1:664418964547:run/7764429",
            "id": "7764429",
            "status": "FAILED",
            "workflowId": "8364307",
            "name": "fastqc_demo_test_run_2",
            "creati

In [None]:
# The run can take a while to complete. We can wait for it to finish using a waiter.
# Look up the most recent run with the name used in the start-run cell above.
runs = client.list_runs(name='scrnaseq_john_workshop_test_run_1')['items']
run_id = max(runs, key=lambda r: r['creationTime'])['id']
print(f"waiting for run {run_id} to complete")
try:
    # The run_completed waiter polls the run until it reaches COMPLETED
    waiter = client.get_waiter('run_completed')
    waiter.wait(id=run_id)

    print(f"run {run_id} complete")
except botocore.exceptions.WaiterError as e:
    print(f"run {run_id} FAILED:")
    print(e)