# RNA-Seq Analysis using Nextflow and AWS HealthOmics

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow on a prokaryotic data set. The workflow trims reads, runs read QC, maps reads, and counts mapped reads per gene to quantify gene expression. It uses a popular workflow manager called [Nextflow](https://www.nextflow.io), run via [AWS HealthOmics](https://aws.amazon.com/healthomics/). If you completed the other tutorials in this repo, you will see that it is similar to Tutorial 2, but instead of running Snakemake locally, we switch to Nextflow and run it serverlessly with HealthOmics.

Before beginning this tutorial, if you have not set up ECR for your AWS SageMaker notebook, please click [here](https://github.com/NIGMS/AWS-HealthOmics-Module-Template/blob/scrnaseq_demo/healthomics_ecr_setup_scrnaseq.ipynb) to set that up.

## Prerequisites
#### Python requirements
+ Python >= 3.8
#### Packages
+ boto3 >= 1.26.19
+ botocore >= 1.29.19
#### AWS requirements
+ AWS CLI, installed and configured in your environment. Supported AWS CLI versions are:
    - AWS CLI v2 >= 2.9.3 (Recommended)
    - AWS CLI v1 >= 1.27.19
+ AWS Region: a region where AWS HealthOmics is available (see the note below)

<div class="alert alert-block alert-info">
<b>NOTE:</b> AWS HealthOmics only allows importing data within the same region. AWS HealthOmics is currently available in Oregon (us-west-2), N. Virginia (us-east-1), Dublin (eu-west-1), London (eu-west-2), Frankfurt (eu-central-1), and Singapore (ap-southeast-1).</div>

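Because data can only be imported within the same region, it can help to check the region of your session before creating any resources. The following is a minimal sketch, assuming the region list in the note above is still current; `HEALTHOMICS_REGIONS` and `is_supported_region` are hypothetical names introduced here for illustration.

```python
# Regions copied from the note above; AWS may have added regions since,
# so treat this set as an assumption rather than an authoritative list.
HEALTHOMICS_REGIONS = {
    "us-west-2",
    "us-east-1",
    "eu-west-1",
    "eu-west-2",
    "eu-central-1",
    "ap-southeast-1",
}


def is_supported_region(region_name):
    """Return True if region_name appears in the list from the note."""
    return region_name in HEALTHOMICS_REGIONS


print(is_supported_region("us-east-1"))  # True
print(is_supported_region("us-west-1"))  # False
```

In the notebook, the value to check would be the `region` variable defined in Step 1.
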
## Getting Started
### Step 1. Import libraries

##### Import relevant libraries


In [None]:
# Import necessary libraries and python SDK
from datetime import datetime
import json
import os
import time

import boto3
import botocore.exceptions

For AWS bucket naming conventions, please click [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).

In [None]:
# Replace each quoted placeholder with your own value before running this cell
bucket_name = "<REPLACE with bucket name>"
bucket_name_out = bucket_name+"-out"
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
workflow_name = "<REPLACE with workflow name>"
# We will use this as the base name for our role and policy
omics_iam_name = "<REPLACE with omics IAM name>"
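
Since the output bucket name is derived by appending `-out`, both names need to satisfy the S3 naming rules linked above. The sketch below checks only a subset of those rules (length, allowed characters, no consecutive dots, not an IP address); `is_valid_bucket_name` and `check_bucket_names` are hypothetical helpers, and the linked AWS page remains the full reference.

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, dots and hyphens, starting and ending with a letter or digit.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name):
    """Check a bucket name against the subset of rules described above."""
    if not _NAME_RE.fullmatch(name):
        return False
    # Consecutive dots and IP-address-like names are also rejected by S3
    if ".." in name or _IP_RE.fullmatch(name):
        return False
    return True


def check_bucket_names(base_name, suffix="-out"):
    """Validate both the input bucket name and the derived output bucket name."""
    return is_valid_bucket_name(base_name) and is_valid_bucket_name(base_name + suffix)


print(check_bucket_names("my-rnaseq-bucket"))  # True
print(check_bucket_names("My_Bucket"))         # False
```

Passing this check does not guarantee the name is available, because bucket names must also be globally unique.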

### Step 2: Create a new S3 bucket to store input and output files and enable AWS HealthOmics
Note that your bucket name has to be globally unique, so do not simply reuse an example name or the command will fail.

In [None]:
# This uses the bucket name variables from above; the second bucket receives the run output in Step 10
!aws s3 mb s3://$bucket_name
!aws s3 mb s3://$bucket_name_out

### Step 3: Review input files
To keep this tutorial fast, we analyze only 50,000 reads from one sample in each of the two sample groups instead of all the reads from all six samples. These files are posted in an AWS S3 bucket that we made publicly accessible. All other files needed to run the pipeline are also hosted in this public bucket and are staged at runtime by Nextflow. To see the locations of all these files, open `nextflow.config`. You can modify any of these paths as desired, and you can also create a new `samplesheet.csv` to point the pipeline at different samples. The samplesheet can be stored locally or in an S3 bucket.
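
If you do create your own samplesheet, it can be generated with a few lines of Python. This is a minimal sketch, assuming the four-column layout used by nf-core/rnaseq 3.x (`sample,fastq_1,fastq_2,strandedness`); check the samplesheet referenced in Step 8 for the layout this pipeline actually expects. The sample names and S3 paths below are made up for illustration.

```python
import csv
import io

# Assumed column layout (nf-core/rnaseq 3.x); verify against the real samplesheet
FIELDS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(rows):
    """Return samplesheet CSV text for a list of row dictionaries."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical samples and paths, not files from the tutorial bucket
rows = [
    {"sample": "sample_a", "fastq_1": "s3://my-bucket/reads/sample_a_1.fastq.gz",
     "fastq_2": "s3://my-bucket/reads/sample_a_2.fastq.gz", "strandedness": "unstranded"},
    {"sample": "sample_b", "fastq_1": "s3://my-bucket/reads/sample_b_1.fastq.gz",
     "fastq_2": "s3://my-bucket/reads/sample_b_2.fastq.gz", "strandedness": "unstranded"},
]

print(build_samplesheet(rows))
```

The returned text can be written to a local `samplesheet.csv` or uploaded to your bucket with `aws s3 cp`.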

In [None]:
# # If you want a local copy of the data, download it here; otherwise skip this step
# !aws s3 cp s3://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/ data/ --recursive

### Step 4. Stage and package Workflow into .zip Folder

Clone base repos

In [None]:
!git clone https://github.com/nf-core/rnaseq --branch 3.11.0 --single-branch

In [None]:
!git clone https://github.com/aws-samples/amazon-omics-tutorials.git

In [None]:
# # If you do not have the omx-ecr-helper GitHub repository in your current directory,
# # clone the repository
# !git clone https://github.com/CBIIT/omx-ecr-helper

In [None]:
# Copy the public registry properties file to rnaseq/namespace.config; it is used to generate the omics.config file
!cp ./omx-ecr-helper/lib/lambda/parse-image-uri/public_registry_properties.json rnaseq/namespace.config

##### Generate omics.config

In [None]:
# Generate manifest and omics.config files
!python3 amazon-omics-tutorials/utils/scripts/inspect_nf.py \
--output-manifest-file rnaseq/rnaseq_docker_image_manifest.json \
-n rnaseq/namespace.config \
--output-config-file rnaseq/conf/omics.config \
--region $region \
rnaseq/

In [None]:
# Pull containers from manifest file generated in last step into ECR
!aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:$region:$account_id:stateMachine:omx-container-puller \
    --input file://rnaseq/rnaseq_docker_image_manifest.json

In [None]:
# Append the includeConfig statement to the bottom of nextflow.config. Run this cell only once; otherwise the statement is added multiple times
!echo "includeConfig 'conf/omics.config'" >> rnaseq/nextflow.config
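
To make this step safe to rerun, the append can be made idempotent. The sketch below is one way to do that in Python rather than with `echo`; `ensure_line` is a hypothetical helper, not part of Nextflow or the AWS tooling.

```python
from pathlib import Path


def ensure_line(config_path, line):
    """Append line to the file unless an identical line is already present.

    Returns True if the file was modified, False if the line already existed.
    """
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    # Start on a fresh line if the file does not end with a newline
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + separator + line + "\n")
    return True


# Usage in the notebook (path and statement taken from the cell above):
# ensure_line("rnaseq/nextflow.config", "includeConfig 'conf/omics.config'")
```

Running the call twice leaves a single `includeConfig` statement in the file.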

### Step 5. Create parameter-description.json file
Run the code cell below to write the following *.json* formatted content to a *parameter-description.json* file.

```json
{
    "input": {"description": "(string)Path to comma-separated file containing information about the samples in the experiment. Samplesheet with sample locations.",
                "optional": false},
    "ecr_registry": {
        "description": "(string)Name of ECR private registry containing docker containers.",
        "optional": false},
    "aligner": {"description": "(string[star_salmon|star_rsem|hisat2]:star_salmon)Specifies the alignment algorithm to use. Choices: star_salmon, star_rsem, hisat2",
            "optional": true},
    "gtf": {"description": "(string)Path to GTF annotation file. S3 path to GTF file",
            "optional": true},
    "fasta": {"description": "(string)Path to FASTA genome file. S3 path to FASTA file",
            "optional": false}
}
```

In [None]:
with open('parameter-description.json',"w") as f:
    f.write(json.dumps({
    "input": {"description": "(string)Path to comma-separated file containing information about the samples in the experiment. Samplesheet with sample locations.",
                "optional": False},
    "ecr_registry": {
        "description": "(string)Name of ECR private registry containing docker containers.",
        "optional": False},
    "aligner": {"description": "(string[star_salmon|star_rsem|hisat2]:star_salmon)Specifies the alignment algorithm to use. Choices: star_salmon, star_rsem, hisat2",
            "optional": True},
    "gtf": {"description": "(string)Path to GTF annotation file. S3 path to GTF file",
            "optional": True},
    "fasta": {"description": "(string)Path to FASTA genome file. S3 path to FASTA file",
            "optional": False}
    }))

### Step 6. Stage the Workflow
Zip the contents of the workflow directory and copy the archive to an S3 bucket. If the zipped folder is larger than 4 MB, it must be staged in an S3 bucket.

In [None]:
!zip -r rnaseq-workflow.zip rnaseq 

In [None]:
# Upload to S3 bucket
!aws s3 cp rnaseq-workflow.zip s3://$bucket_name/rnaseq-workflow.zip
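
To see whether a given archive crosses the size threshold mentioned above, you can check it directly. This is a minimal sketch; the exact byte count used for "4 MB" is an assumption, and `needs_s3_staging` is a hypothetical helper.

```python
import os

# Threshold from the text above; interpreting it as 4 * 1024 * 1024 bytes is an assumption
MAX_DIRECT_UPLOAD_BYTES = 4 * 1024 * 1024


def needs_s3_staging(zip_path, limit=MAX_DIRECT_UPLOAD_BYTES):
    """Return True if the archive is larger than the limit and must be staged in S3."""
    return os.path.getsize(zip_path) > limit


# Usage in the notebook:
# print(needs_s3_staging("rnaseq-workflow.zip"))
```

Staging in S3, as this tutorial does, works regardless of the archive size.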

### Step 7. Create Workflow using zipped workflow and parameter-description.json

In [None]:
!aws omics create-workflow \
    --name $workflow_name \
    --definition-uri s3://$bucket_name/rnaseq-workflow.zip \
    --parameter-template file://parameter-description.json  \
    --engine NEXTFLOW

In [None]:
# See workflow and make sure status is Active
!aws omics list-workflows --name $workflow_name

Retrieve the workflow ID and store it in a `workflow_id` variable to be passed to the `start-run` command.

In [None]:
client = boto3.client('omics')
workflow_id = client.list_workflows(
    type='PRIVATE',
    name=workflow_name,
)['items'][0]['id']
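
Workflow creation is asynchronous, so the workflow may still be in `CREATING` status right after it is submitted. Rather than re-running `list-workflows` by hand, you can poll until it becomes `ACTIVE`. The sketch below uses a hypothetical helper, `wait_for_status`, that takes any function returning the current status; the timeout and interval values are arbitrary choices.

```python
import time


def wait_for_status(get_status, target="ACTIVE", failed=("FAILED", "DELETED"),
                    timeout=600, interval=10, sleep=time.sleep):
    """Poll get_status() until it returns target, a failure status, or the timeout passes."""
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError("Reached terminal status: %s" % status)
        if waited >= timeout:
            raise TimeoutError("Still %s after %d seconds" % (status, waited))
        sleep(interval)
        waited += interval


# In the notebook, get_status could wrap the HealthOmics client from the cell above:
# wait_for_status(lambda: client.get_workflow(id=workflow_id)["status"])
```

The `sleep` argument exists only so the helper can be exercised without real waiting.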

### Step 8. Setup Inputs
Write an *input.json* file that specifies the input parameter values. Here we read the samplesheet from the public tutorial bucket; inputs can also come from your own S3 buckets or from reference and sequence stores that you have set up on the account. In each case, just provide the appropriate URI for the given input.

In [None]:
with open('input.json',"w") as f:
    f.write(json.dumps({
        "input": "s3://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/samplesheet.csv",
        "ecr_registry": account_id + ".dkr.ecr."+region+".amazonaws.com/"+workflow_name
}))
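
Before starting a run, it is worth confirming that *input.json* supplies every parameter that the template from Step 5 marks as non-optional. The following is a minimal sketch of that check; `missing_required` is a hypothetical helper, the template is reduced to its `optional` flags, and the parameter values are placeholders.

```python
def missing_required(template, params):
    """Return the names of non-optional template parameters absent from params."""
    return sorted(
        name
        for name, spec in template.items()
        if not spec.get("optional", False) and name not in params
    )


# Same parameter names and optional flags as the Step 5 template (descriptions omitted)
template = {
    "input": {"optional": False},
    "ecr_registry": {"optional": False},
    "aligner": {"optional": True},
    "gtf": {"optional": True},
    "fasta": {"optional": False},
}

# Placeholder values standing in for the contents of input.json
params = {"input": "<samplesheet URI>", "ecr_registry": "<registry>"}

print(missing_required(template, params))  # ['fasta']
```

With only `input` and `ecr_registry` supplied, the check reports `fasta` as missing, so either add a `fasta` value to *input.json* or mark it optional in the template if your pipeline configuration already provides it.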

### Step 9. Setup new role
For the purposes of this demo, we will use the following permissions policy and trust policy, which restrict usage to only the required S3 buckets. You will need to customize permissions as required.

In [None]:
# Define demo policies
omics_demo_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::"+bucket_name+"/*",
                "arn:aws:s3:::"+bucket_name_out+"/*",
                "arn:aws:s3:::nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/*" 
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            # s3:ListBucket is a bucket-level action, so these are bucket ARNs (no object path)
            "Resource": [
                "arn:aws:s3:::"+bucket_name,
                "arn:aws:s3:::"+bucket_name_out,
                "arn:aws:s3:::nigms-sandbox"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::"+bucket_name+"/*",
                "arn:aws:s3:::"+bucket_name_out+"/*",
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:"+region+":"+account_id+":log-group:/aws/omics/WorkflowLog:log-stream:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup"
            ],
            "Resource": [
                "arn:aws:logs:"+region+":"+account_id+":log-group:/aws/omics/WorkflowLog:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchCheckLayerAvailability"
            ],
            "Resource": [
                "arn:aws:ecr:"+region+":"+account_id+":repository/*"
            ]
        }
    ]
}

rnaseq_workflow_demo_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": account_id
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:omics:"+region+":"+account_id+":run/*"
                }
            }
        }
    ]
}
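
One easy mistake in S3 policies is mixing up bucket-level and object-level resources: `s3:ListBucket` acts on a bucket, so its resources should be bucket ARNs without an object path, while `s3:GetObject` and `s3:PutObject` act on object ARNs such as `bucket/*`. The sketch below is a simple sanity check for that one case; `find_listbucket_object_arns` is a hypothetical helper and the bucket names are made up.

```python
def find_listbucket_object_arns(policy):
    """Return s3:ListBucket resources that look like object ARNs (contain a '/')."""
    suspicious = []
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "s3:ListBucket" not in actions:
            continue
        resources = statement.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        # Everything after 'arn:aws:s3:::' should be a bare bucket name
        suspicious.extend(r for r in resources if "/" in r.split(":::", 1)[-1])
    return suspicious


# Made-up policy for illustration
example_policy = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket-out/*"]},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": ["arn:aws:s3:::my-bucket/*"]},
    ]
}

print(find_listbucket_object_arns(example_policy))  # ['arn:aws:s3:::my-bucket-out/*']
```

In the notebook, calling it on `omics_demo_policy` should return an empty list when every `s3:ListBucket` resource is a bucket ARN.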

In [None]:
# Create the iam client
iam = boto3.resource('iam')

# Check if the role already exists; if not, create it
try:
    role = iam.Role(omics_iam_name)
    role.load()
    
except botocore.exceptions.ClientError as ex:
    if ex.response["Error"]["Code"] == "NoSuchEntity":
        # Create the role with the corresponding trust policy
        role = iam.create_role(
            RoleName=omics_iam_name, 
            AssumeRolePolicyDocument=json.dumps(rnaseq_workflow_demo_trust_policy))
        
        # Create policy
        policy = iam.create_policy(
            PolicyName='{}-policy'.format(omics_iam_name), 
            Description="Policy for AWS HealthOmics demo",
            PolicyDocument=json.dumps(omics_demo_policy))
        
        # Attach the policy to the role
        policy.attach_role(RoleName=omics_iam_name)
    else:
        print('Something went wrong, please retry and check your account settings and permissions')
        # Re-raise so the underlying error is not silently swallowed
        raise

In [None]:
# Retrieve the role arn, which grants AWS HealthOmics the proper permissions to access the resources it needs in your AWS account.
def get_role_arn(role_name):
    try:
        iam = boto3.resource('iam')
        role = iam.Role(role_name)
        role.load()  # calls GetRole to load attributes
    except botocore.exceptions.ClientError:
        print("Couldn't get role named %s."%role_name)
        raise
    else:
        print(role.arn)
        return role.arn

In [None]:
# Print the role arn and store it for use when starting the run
role_arn = get_role_arn(omics_iam_name)

### Step 10. Start the run

In [None]:
!aws omics start-run \
  --name <REPLACE with run name> \
  --role-arn $role_arn \
  --workflow-id $workflow_id \
  --parameters file://input.json \
  --output-uri s3://$bucket_name_out
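
The same run can also be started from Python with the `omics` client's `start_run` call. The sketch below only assembles the keyword arguments so it can be inspected without AWS access; `build_start_run_kwargs` is a hypothetical helper, and the example values are placeholders.

```python
def build_start_run_kwargs(run_name, role_arn, workflow_id, parameters, output_uri):
    """Assemble keyword arguments for the boto3 omics client's start_run call."""
    if not output_uri.startswith("s3://"):
        raise ValueError("output_uri must be an S3 URI, got: %s" % output_uri)
    return {
        "name": run_name,
        "roleArn": role_arn,
        "workflowId": workflow_id,
        "parameters": parameters,  # dictionary with the contents of input.json
        "outputUri": output_uri,
    }


# In the notebook, with the variables defined in earlier steps:
# with open("input.json") as f:
#     parameters = json.load(f)
# run = client.start_run(**build_start_run_kwargs(
#     "my-run", role_arn, workflow_id, parameters, "s3://" + bucket_name_out))
# print(run["id"], run["status"])
```

Keeping the run ID returned by `start_run` makes it easy to look the run up later.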

### STEP 2: Install mambaforge and nextflow
First install mambaforge, then use mamba to install nextflow. Skip this as needed if you have already completed this step.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
# Add to your path, do this every time you restart your kernel
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow

### STEP 4: Modify config file to allow your project to interact with Google Batch
Create and modify your own config file to include a 'gbatch' profile block that tells Nextflow to submit the job to Google Batch instead of running locally.

The config file allows Nextflow to use executors like Google Batch. In this tutorial the config file is named 'nextflow.config'. Open this file and update the account-specific `<VARIABLES>`. In this case we will only replace `<YOUR_PROJECT>` with your Project ID. We will specify an outdir and work directory on the command line at run time.

Make sure that your region is one where Google Batch is available!
Specify the machine type you would like to use, ensuring that it has enough memory and CPUs for the workflow. In this case 16 CPUs is plenty (otherwise Google Batch will automatically use 1 CPU).
```
profiles{
  gbatch{
      process.executor = 'google-batch'
      google.location = 'us-central1'
      google.region  = 'us-central1'
      google.project = '<YOUR_PROJECT>'
      process.machineType = 'c2-standard-16'
     }
}
```
Note: Make sure your working directory and output directory are different! Google Batch creates temporary files in the working directory within your bucket, and these take up space, so once your pipeline has completed successfully feel free to delete them.

### STEP 5: Submit Nextflow Job to Google Batch

A few things to note here: 
+ --input points to a samplesheet in GS. We could also point to a local samplesheet. This just tells Nextflow where to get the fastq files. 
+ The profile comes from nextflow.config. It tells the pipeline what to use as execution environment (conda, singularity, or docker) and then you give it a compute environment (in this case gbatch, but if left blank would run locally). 
+ We specify an outdir. This can point to a local folder if run locally, but since we are using the serverless Google Batch, we need to point the output to a bucket. 
+ We specify a work dir. Like the outdir, this can be local if run locally, but needs to be in a bucket when running with Batch. 
+ If you need to rerun your pipeline, you can always add `-resume` and it will search the workdir and not rerun any processes that you have already run. 

In [None]:
%%time
! nextflow run main.nf --input gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/samplesheet.csv  -profile docker,gbatch  --outdir gs://$BUCKET/outdir/ -w gs://$BUCKET/work/

### STEP 9: Report the top 10 most highly expressed genes in the samples.

Top 10 most highly expressed genes in the wild-type sample. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`


In [None]:
!gsutil ls gs://$BUCKET/outdir/data/quant

In [None]:
!gsutil cp -r gs://$BUCKET/outdir/data/quant .

View the top 10 most highly expressed genes in each sample.


In [None]:
%%bash
# Sort each sample's quant.sf by NumReads (column 5) and show the top 10 rows
for samp in quant/*/quant.sf; do
    echo "$samp"
    sort -nrk 5,5 "$samp" | head -10
done
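
The same ranking can be done in Python, which makes it easy to sort by `TPM` instead of `NumReads`. This is a minimal sketch, assuming the tab-separated `quant.sf` layout with the header shown above; `top_expressed` is a hypothetical helper and the example values are made up.

```python
import csv


def top_expressed(quant_lines, n=10, column="TPM"):
    """Return the n (gene name, value) pairs with the highest value in column."""
    reader = csv.DictReader(quant_lines, delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row[column]), reverse=True)
    return [(row["Name"], float(row[column])) for row in rows[:n]]


# Made-up records in quant.sf layout, for illustration only
example = [
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n",
    "gene_a\t900\t700.0\t12.5\t40\n",
    "gene_b\t1200\t1000.0\t250.0\t900\n",
    "gene_c\t300\t150.0\t3.0\t2\n",
]

print(top_expressed(example, n=2))  # [('gene_b', 250.0), ('gene_a', 12.5)]
```

To use it on real output, pass an open file handle, for example one opened on a sample's `quant.sf` under the `quant` directory.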

### STEP 10: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This gene was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![RNA-Seq workflow](images/table-cushman.png)

Use `grep` to report the expression in each sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
%%bash
# Print the header, then the matching record, for each sample's quant.sf
for samp in quant/*/quant.sf; do
    echo "$samp"
    echo "Name    Length  EffectiveLength TPM     NumReads"
    grep 'BB28_RS16545' "$samp"
done
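
To compare the two samples numerically, the same lookup can be done in Python. This is a minimal sketch, assuming the `quant/<sample>/quant.sf` layout used above; `gene_expression` is a hypothetical helper that, like `grep`, matches the gene ID as a substring of the `Name` field.

```python
import csv
import glob


def gene_expression(quant_path, gene_id):
    """Return (TPM, NumReads) for the first record whose Name contains gene_id, else None."""
    with open(quant_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if gene_id in row["Name"]:
                return float(row["TPM"]), float(row["NumReads"])
    return None


# Report the gene of interest for every sample found under quant/
for path in sorted(glob.glob("quant/*/quant.sf")):
    print(path, gene_expression(path, "BB28_RS16545"))
```

A lower `TPM` in the double lysogen sample than in the wild-type sample would be consistent with the downregulation reported in the paper.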