# RNA-Seq Analysis using Nextflow and AWS Batch

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called [Nextflow](https://www.nextflow.io) run via [AWS Batch](https://aws.amazon.com/batch/). If you completed the other tutorials in this repo, you will see that it is similar to Tutorial 2, but instead of running Snakemake locally, we switch to Nextflow and run it using Batch. 

Running Nextflow on AWS Batch in a serverless manner requires a set of permissions, roles, and resources. You can set these up manually or deploy them **automatically** with a stack template. The Launch Stack button below takes you to the AWS CloudFormation "Create stack" page with the template and its required resources already linked.



<div style="border: 1px solid #ffe69c; padding: 0px; border-radius: 4px;">
  <div style="background-color: #fff3cd; padding: 5px; font-weight: bold;">
    <i class="fas fa-exclamation-triangle" style="color: #664d03;margin-right: 5px;"></i><a style="color: #664d03">Before using AWS Batch </a>
  </div>
  <p style="margin-left: 5px;">
Before beginning this tutorial, if you do not have the required roles, policies, permissions, or compute environment and would like to set them up <b>manually</b>, please click <a href="https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md">here</a>.
  </p>
</div>



## Prerequisites
#### Python requirements
+ Python >= 3.8

#### AWS requirements
+ Please ensure you have a VPC, subnets, and security group set up before running this tutorial.
+ Role with AdministratorAccess, AmazonSageMakerFullAccess, S3 access and AWSBatchServiceRole.
+ Instance Role with AmazonECS_FullAccess, AmazonEC2ContainerRegistryFullAccess, and S3 access.
+ If you do not have the required set-up for AWS Batch, please follow this tutorial [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/zbyosufzai-awsbatch-1/notebooks/AWSBatch/Intro_AWS_Batch.ipynb#install_nextflow). ***When making the instance role, make another for SageMaker notebooks with the following permissions: AdministratorAccess, AmazonEC2ContainerRegistryFullAccess, AmazonECS_FullAccess, AmazonS3FullAccess, AmazonSageMakerFullAccess, and AWSBatchServiceRole.***
We recommend granting access to specific buckets and folders through an inline policy. An example policy document in JSON is shown below:

<pre>
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSageMakerS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:CreateBucket"
            ],
            "Resource": [
                "arn:aws:s3:::batch-bucket",
                "arn:aws:s3:::batch-bucket/*",
                "arn:aws:s3:::nigms-sandbox-healthomics",
                "arn:aws:s3:::nigms-sandbox-healthomics/*",
                "arn:aws:s3:::ngi-igenomes",
                "arn:aws:s3:::ngi-igenomes/*"
            ]
        }
    ]
}
</pre>
For AWS bucket naming conventions, please click [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).
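
The resource list in the policy above follows a fixed pattern: each bucket appears once as the bucket ARN and once with `/*` for the objects inside it. As a minimal sketch, the same document could be generated in Python for your own bucket names. The function name `build_s3_inline_policy` is hypothetical and not part of this repository; the actions and `Sid` are copied from the example above.

```python
import json


def build_s3_inline_policy(bucket_names, sid="AllowSageMakerS3Access"):
    """Return an IAM policy document (as a dict) covering the given buckets."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # every object in the bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:CreateBucket",
                ],
                "Resource": resources,
            }
        ],
    }


# Bucket names taken from the example policy above
policy = build_s3_inline_policy(
    ["batch-bucket", "nigms-sandbox-healthomics", "ngi-igenomes"]
)
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console when adding an inline policy to the role.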

### Step 0. Setting up AWS Batch
AWS Batch manages the provisioning of compute environments (EC2, Fargate), container orchestration, job queues, IAM roles, and permissions. We can deploy a full environment either:
- Automatically using a preconfigured AWS CloudFormation stack (**recommended**)
- Manually by setting up roles, queues, and buckets

The Launch Stack button below takes you to the AWS CloudFormation "Create stack" page with the template and its required resources already linked.

If you prefer to skip manual deployment and deploy automatically in the cloud, click the **Launch Stack** button below. For a walkthrough of the screens during automatic deployment please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take ~5 min and then the resources will be ready for use. 

[![Launch Stack](images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )
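
Once the stack has been launched, its status can also be checked from the notebook instead of the console. The sketch below is an illustration, not part of the tutorial code: `stack_is_ready` is a hypothetical helper that inspects a response shaped like the one returned by boto3's CloudFormation `describe_stacks` call, and the sample response is hand-written.

```python
# Stack states in which the resources are fully deployed
READY_STATES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def stack_is_ready(describe_stacks_response):
    """Return True if the first stack in the response is fully deployed."""
    status = describe_stacks_response["Stacks"][0]["StackStatus"]
    return status in READY_STATES


# Hand-written sample shaped like a describe_stacks response
example = {
    "Stacks": [
        {"StackName": "aws-batch-nigms", "StackStatus": "CREATE_IN_PROGRESS"}
    ]
}
print(stack_is_ready(example))

# With AWS credentials available, a real response could be fetched like this:
#   import boto3
#   cfn = boto3.client("cloudformation")
#   stack_is_ready(cfn.describe_stacks(StackName="aws-batch-nigms"))
```

Note that `describe_stacks` raises an error if no stack with that name exists, so the stack name has to match the one entered on the "Create stack" page.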

### Step 1. Install required dependencies, update paths and create a new S3 Bucket to store input and output files (if needed)
After setting up an AWS CloudFormation stack, we need to tell the Nextflow workflow where those resources are by providing the configuration:
<div style="border: 1px solid #e57373; padding: 0px; border-radius: 4px;">
  <div style="background-color: #ffcdd2; padding: 5px; ">
    <i class="fas fa-exclamation-triangle" style="color: #b71c1c;margin-right: 5px;"></i><a style="color: #b71c1c"><b>Important</b> - Customize Required</a>
  </div>
  <p style="margin-left: 5px;">
After successful creation of your stack, you must attach a new role to SageMaker to be able to submit Batch jobs. Please follow these steps to change your SageMaker role:<br>
<ol> <li>Navigate to your SageMaker AI notebook dashboard (where you initially created and launched your VM)</li> <li>Locate your instance and click the <b>Stop</b> button</li> <li>Once the instance is stopped: <ul> <li>Click <b>Edit</b></li> <li>Scroll to the "Permissions and encryption" section</li> <li>Click the IAM role dropdown</li> <li>Select the new role created during stack formation (named something like <b>aws-batch-nigms-SageMakerExecutionRole</b>)</li> </ul> </li> 
<li>Click <b>Update notebook instance</b> to save your changes</li> 
<li>After the update completes: <ul> <li>Click <b>Start</b> to relaunch your instance</li> <li>Reconnect to your instance</li> <li>Resume your work from this point</li> </ul> </li> </ol>

<b>Warning:</b> Make sure to replace the <b>stack name</b> with the name of the stack that you just created. <code>STACK_NAME = "your-stack-name-here"</code>
  </p>
</div>

In [None]:
# define a stack name variable
STACK_NAME = "aws-batch-nigms-test1"

In [None]:
import boto3
# Get account ID and region 
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

In [None]:
# Set variable names 
# These variables should come from the Intro AWS Batch tutorial (or leave as-is if using the launch stack button)
BUCKET_NAME = f"{STACK_NAME}-batch-bucket-{account_id}"
AWS_QUEUE = f"{STACK_NAME}-JobQueue"
INPUT_FOLDER = 'nigms-sandbox/ovarian-cancer-example-fastqs'
AWS_REGION = region

#### Install dependencies
This step installs Nextflow and Java, which are required to execute the pipeline. In environments like SageMaker, Java is usually pre-installed, but if you're running outside SageMaker (e.g., EC2 or local), you'll need to install it manually.

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow

<details>
<summary>Install Java and Nextflow if needed on other systems</summary>
If you are using a system other than an AWS SageMaker Notebook, you might need to install Java and Nextflow using the code below:
<br> <i># Install java</i><pre>
    sudo apt update
    sudo apt-get install default-jdk -y
    java -version
    </pre>
    <i># Install Nextflow</i><pre>
    curl https://get.nextflow.io | bash
    chmod +x nextflow
    ./nextflow self-update
    ./nextflow plugin update
    </pre>
</details>

In [None]:
# replace batch bucket name in nextflow configuration file
! sed -i "s/aws-batch-nigms-batch-bucket-/$BUCKET_NAME/g" nextflow.config
# replace job queue name in configuration file 
! sed -i "s/aws-batch-nigms-JobQueue/$AWS_QUEUE/g" nextflow.config
# replace the region placeholder with the region you are in 
! sed -i "s/aws-region/$AWS_REGION/g" nextflow.config
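
The three `sed` commands are plain text substitutions on `nextflow.config`. For readers who prefer to stay in Python, a minimal equivalent sketch is shown here. The function name `fill_placeholders` is hypothetical, and the sample configuration text is an assumption used only to demonstrate the substitution; it is not the actual content of this repository's `nextflow.config`.

```python
def fill_placeholders(config_text, bucket_name, job_queue, region):
    """Replace the three placeholder strings used by the sed commands above."""
    replacements = {
        "aws-batch-nigms-batch-bucket-": bucket_name,
        "aws-batch-nigms-JobQueue": job_queue,
        "aws-region": region,
    }
    for placeholder, value in replacements.items():
        config_text = config_text.replace(placeholder, value)
    return config_text


# Hypothetical configuration text, for illustration only
sample_config = (
    "workDir = 's3://aws-batch-nigms-batch-bucket-/work'\n"
    "process.queue = 'aws-batch-nigms-JobQueue'\n"
    "aws.region = 'aws-region'\n"
)
filled = fill_placeholders(
    sample_config,
    bucket_name="aws-batch-nigms-test1-batch-bucket-123456789012",
    job_queue="aws-batch-nigms-test1-JobQueue",
    region="us-east-1",
)
print(filled)
```

Either way, the substitution only works on a file that still contains the placeholders, so it is worth opening `nextflow.config` afterwards to confirm the bucket, queue, and region look as expected.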

### Step 2. Enable AWS Batch for the nextflow script 
Run the pipeline in a cloud-native, serverless manner using AWS Batch, which offloads the burden of provisioning and managing compute resources. When you execute the command below:
- Nextflow submits each task to AWS Batch as a job.
- AWS Batch pulls the necessary containers.
- Each process/task in the pipeline runs as an isolated job in the cloud.

In [None]:
# Run the Nextflow script with parameters
# (use ./nextflow instead if you installed Nextflow with the curl method above)
! nextflow run main.nf --input s3://$INPUT_FOLDER/data/raw_fastqSub/samplesheet.csv -profile docker,awsbatch -c nextflow.config --awsqueue $AWS_QUEUE --awsregion $AWS_REGION
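
While the pipeline runs, each Nextflow task shows up as a job in the Batch job queue. As a rough sketch of how progress could be summarised from the notebook, the hypothetical helper `count_by_status` below tallies job summaries shaped like the entries in the `jobSummaryList` returned by boto3's Batch `list_jobs` call. The sample entries and job names are hand-written for illustration.

```python
from collections import Counter


def count_by_status(job_summaries):
    """Tally AWS Batch job summaries by their 'status' field."""
    return Counter(job["status"] for job in job_summaries)


# Hand-written sample entries with made-up job names
sample_jobs = [
    {"jobId": "1", "jobName": "trim_sample1", "status": "SUCCEEDED"},
    {"jobId": "2", "jobName": "trim_sample2", "status": "SUCCEEDED"},
    {"jobId": "3", "jobName": "quant_sample1", "status": "RUNNING"},
    {"jobId": "4", "jobName": "quant_sample2", "status": "RUNNABLE"},
]
counts = count_by_status(sample_jobs)
print(dict(counts))

# With AWS credentials available, real summaries could be collected like this
# (list_jobs returns one status per call and is paginated via nextToken):
#   import boto3
#   batch = boto3.client("batch", region_name=AWS_REGION)
#   summaries = []
#   for status in ("RUNNABLE", "RUNNING", "SUCCEEDED", "FAILED"):
#       summaries += batch.list_jobs(jobQueue=AWS_QUEUE, jobStatus=status)["jobSummaryList"]
#   count_by_status(summaries)
```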

### Step 3. Report the top 10 most highly expressed genes in the samples

In [None]:
# View output files in folder
! aws s3 ls s3://$BUCKET_NAME/nextflow_output/data/quant --recursive | cut -c32-

In [None]:
# Copy output to here 
! aws s3 sync s3://$BUCKET_NAME/nextflow_output/data/quant quant

View the top 10 most highly expressed genes in the double lysogen sample.

In [None]:
%%bash
# Print the 10 rows with the highest NumReads (column 5) for each sample
for samp in quant/*/quant.sf; do
    echo "$samp"
    sort -nrk 5,5 "$samp" | head -10
done
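
The same ranking can be done in Python, which makes it easy to switch between sorting on `NumReads` and `TPM`. This is a minimal sketch assuming the tab-separated `quant.sf` layout with the header `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`. The helper name `top_expressed` and the gene names and values in the sample are made up for illustration.

```python
import csv
import io


def top_expressed(quant_file, n=10, column="NumReads"):
    """Return the n (name, value) pairs with the highest value in `column`."""
    reader = csv.DictReader(quant_file, delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row[column]), reverse=True)
    return [(row["Name"], float(row[column])) for row in rows[:n]]


# Made-up rows in quant.sf format, for illustration only
sample = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t900\t750.0\t120.5\t300.0\n"
    "geneB\t1200\t1050.0\t800.0\t2800.0\n"
    "geneC\t600\t450.0\t10.0\t15.0\n"
)
top_two = top_expressed(sample, n=2)
print(top_two)

# On the real output, iterate over the synced files, for example:
#   from pathlib import Path
#   for path in sorted(Path("quant").glob("*/quant.sf")):
#       with open(path) as handle:
#           print(path, top_expressed(handle, n=10, column="TPM"))
```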

### Step 4. Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
An acyl-transferase was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top 20 upregulated and downregulated genes](images/table-cushman.png)

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`
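
For orientation, TPM normalises each transcript's read count by its effective length and then rescales so that all values in a sample sum to one million: TPM_i = 10^6 × (reads_i / length_i) / Σ_j (reads_j / length_j). The sketch below applies this standard definition to made-up numbers; `tpm` is a hypothetical helper, and Salmon's own values come from its estimated counts, so treat this only as a way to understand how the columns relate.

```python
def tpm(num_reads, effective_lengths):
    """Compute TPM values from read counts and effective lengths."""
    # Reads per base of effective length for each transcript
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    # Rescale so the values sum to one million
    return [rate / total * 1_000_000 for rate in rates]


# Made-up example: two transcripts of equal effective length
values = tpm([100.0, 300.0], [1000.0, 1000.0])
print(values)
```

Because both transcripts have the same effective length here, the TPM values split one million in the same 1:3 ratio as the read counts.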

In [None]:
%%bash
# Print the header and the matching row for each sample
for samp in quant/*/quant.sf; do
    echo "$samp"
    echo "Name    Length  EffectiveLength TPM     NumReads"
    grep 'BB28_RS16545' "$samp"
done
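
A Python version of the lookup returns the values as numbers, which is convenient for comparing samples. This is a minimal sketch under the same `quant.sf` layout assumption as above. `gene_expression` is a hypothetical helper, and the sample rows use placeholder gene names and made-up values. Unlike `grep`, it matches the `Name` column exactly instead of matching a substring anywhere in the line.

```python
import csv
import io


def gene_expression(quant_file, gene_id):
    """Return (TPM, NumReads) for gene_id, or None if it is not present."""
    for row in csv.DictReader(quant_file, delimiter="\t"):
        if row["Name"] == gene_id:
            return float(row["TPM"]), float(row["NumReads"])
    return None


# Made-up rows in quant.sf format, for illustration only
sample = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t900\t750.0\t120.5\t300.0\n"
    "geneB\t1200\t1050.0\t800.0\t2800.0\n"
)
result = gene_expression(sample, "geneB")
print(result)

# On the real output, for example:
#   from pathlib import Path
#   for path in sorted(Path("quant").glob("*/quant.sf")):
#       with open(path) as handle:
#           print(path, gene_expression(handle, "BB28_RS16545"))
```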

## Conclusion: Why Use AWS Batch?
<table border="1" cellpadding="8" cellspacing="0">
  <thead>
    <tr>
      <th>Benefit</th>
      <th>Explanation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Scalability</strong></td>
      <td>Process large RNA-Seq datasets with multiple jobs in parallel</td>
    </tr>
    <tr>
      <td><strong>Reproducibility</strong></td>
      <td>Ensures the exact same Docker containers and config are used every time</td>
    </tr>
    <tr>
      <td><strong>Ease of Management</strong></td>
      <td>No need to manually manage EC2 instances or storage mounts</td>
    </tr>
    <tr>
      <td><strong>Integration with S3</strong></td>
      <td>Input/output seamlessly handled via S3 buckets</td>
    </tr>
  </tbody>
</table>

Running on AWS Batch is ideal when your dataset grows beyond what your local notebook or server can handle, or when you want reproducible, cloud-native workflows that are easier to scale, share, and manage.

## Clean Up the AWS Environment

Once you've successfully run your analysis and downloaded the results, it's a good idea to clean up unused resources to avoid unnecessary charges.

#### Recommended Cleanup Steps:

- **Delete Output Files from S3 (Optional)**  
    If you've downloaded your results locally and no longer need them stored in the cloud, delete them from the bucket.
- **Delete the S3 Bucket (Optional)**    
  To remove the entire bucket, empty it first and then delete it (only do this if you're sure!).
- **Shut Down AWS Batch Resources (Optional but Recommended):**    
  If you used a CloudFormation stack to set up AWS Batch, you can delete all associated resources in one step (⚠️ Note: Deleting the stack will also remove IAM roles and compute environments created by the template.):
  + Go to the <a href="https://console.aws.amazon.com/cloudformation/">AWS CloudFormation Console</a>
  + Select your stack (e.g., <code>aws-batch-nigms-test1</code>)
  + Click Delete
  + Wait for all resources (compute environments, roles, queues) to be removed
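
The cleanup steps above can also be scripted. The sketch below is an illustration under stated assumptions, not part of the tutorial code: `delete_requests` is a hypothetical helper that groups object keys into payloads for boto3's S3 `delete_objects` call, which accepts at most 1000 keys per request. The commented usage assumes the `BUCKET_NAME` and `STACK_NAME` variables defined in Step 1 and that the outputs live under the `nextflow_output/` prefix.

```python
def delete_requests(keys, batch_size=1000):
    """Yield 'Delete' payloads for s3.delete_objects, batch_size keys at most each."""
    keys = list(keys)
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


# Made-up keys to show how the batching behaves
payloads = list(delete_requests(f"nextflow_output/file{i}" for i in range(2500)))
print([len(payload["Objects"]) for payload in payloads])

# With AWS credentials available (this permanently deletes data, use with care):
#   import boto3
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(
#       Bucket=BUCKET_NAME, Prefix="nextflow_output/")
#   keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
#   for payload in delete_requests(keys):
#       s3.delete_objects(Bucket=BUCKET_NAME, Delete=payload)
#   boto3.client("cloudformation").delete_stack(StackName=STACK_NAME)
```

If the bucket has versioning enabled, older object versions would need to be removed separately; this sketch only handles current objects.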
  
<div style="border: 1px solid #659078; padding: 0px; border-radius: 4px;">
  <div style="background-color: #d4edda; padding: 5px; font-weight: bold;">
    <i class="fas fa-lightbulb" style="color: #0e4628;margin-right: 5px;"></i><a style="color: #0e4628">Tips</a>
  </div>
  <p style="margin-left: 5px;">
It’s always good practice to periodically review your <b>EC2 instances</b>, <b>ECR container images</b>, <b>S3 storage</b>, and <b>CloudWatch logs</b> to ensure no stray resources are incurring charges.
  </p>
</div>