
![Biofilm image](../images/Biofilm_Website_2.png)

# SubModule #6: Running workflows at scale with AWS Batch


## Overview

One of the greatest benefits of cloud computing is the ability to pay only for the resources you need, and only for as long as you need them. In the previous submodules we relied on a single VM that we manually turned on at the beginning of our work and off at the end. If we forget to turn the VM off, we can incur an unnecessarily large bill. Similarly, some stages of our analysis may require large computational resources while others require very little. Rather than stopping and resizing the VM at each stage, we can rely on the AWS Batch scheduler to reserve the resources required for each step of the analysis. Those resources are automatically shut down and deleted when the work finishes, so you don't pay for idle compute.

In the previous submodules, many of the steps had to be run one at a time in the correct order. That is traditionally how bioinformatics pipelines have been executed. The arrival of workflow managers like [Nextflow](https://www.nextflow.io/) has changed that. Nextflow allows developers to chain many commands together into a single pipeline. The flow of input and output files through the pipeline governs the order of execution, so a step is not launched until its inputs have been generated by a previous step. A growing number of curated Nextflow workflows have been collected into a library of commonly used pipelines called [nf-core](https://nf-co.re/). From nf-core, you can easily launch workflows that run canonical tools for analyses such as RNA-Seq, methyl-seq, and, in our case, 16S metagenomic sequencing. For this submodule, we will run a classic 16S metagenomics analysis suite using nf-core's [ampliseq](https://nf-co.re/ampliseq) pipeline. We have made some adjustments to the original scripts so that it can run on AWS Batch, so we will pull our version from an S3 bucket rather than launch it from GitHub.
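
To make the idea of file-driven ordering concrete, here is a minimal Python sketch. It is not Nextflow code and does not reflect how Nextflow is implemented; the step names and file names are hypothetical and chosen only for illustration. Each step declares the files it consumes and produces, and a valid execution order is derived from those declarations alone.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps (assumption, not the real ampliseq processes):
# each step maps to (input files, output files).
STEPS = {
    "summarize": ({"asv_table.tsv", "qc_report.html"}, {"summary.html"}),
    "denoise": ({"trimmed.fastq.gz"}, {"asv_table.tsv"}),
    "trim": ({"reads.fastq.gz"}, {"trimmed.fastq.gz"}),
    "fastqc": ({"reads.fastq.gz"}, {"qc_report.html"}),
}


def execution_order(steps):
    """Return step names in an order where every input exists before it is used."""
    # Map each output file to the step that produces it.
    producer = {
        out: name for name, (_, outputs) in steps.items() for out in outputs
    }
    # A step depends on whichever steps produce its inputs.
    # Inputs that no step produces (raw data) add no dependency.
    dependencies = {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in steps.items()
    }
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STEPS)))
```

Note that the steps are deliberately listed out of order: the order of execution comes from the file dependencies, not from the order in which the steps were written down. Steps that do not depend on each other (here `trim` and `fastqc`) could also run at the same time, which is what a scheduler like AWS Batch takes advantage of.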

## Learning Objectives:
- Understand the benefit of using schedulers like AWS Batch
- Identify the key components of an AWS Batch profile in a Nextflow config file
- Launch an end-to-end nf-core pipeline on AWS Batch

## Prerequisites

**Software**

* **Nextflow:** For running the `ampliseq` pipeline.

**APIs**

* **Amazon S3:** The notebook uses `aws s3` commands to download input files (the `*.fastq.gz` reads and the `ampliseq` directory), and the Nextflow profile points its working and output directories at S3 paths, so you need access to Amazon S3.

* **AWS Batch Compute Environment and Job Queue:** You must have an AWS Batch compute environment and job queue configured. The CloudFormation template automates this. You can also set these up manually by following the instructions linked in the notebook, but using the template is *recommended* for ease of setup.

## Get Started

### Install Nextflow

Run the following cell to install Nextflow version `23.10.0`.

In [None]:
! mamba create  -n nextflow -c bioconda nextflow=23.10.0 -y
! mamba install -n nextflow ipykernel -y

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> Remember to change your kernel to <b>conda_nextflow</b> to run Nextflow.
</div>

### Download data

We need to download some data and our updated version of the ampliseq pipeline. For this submodule, we are using a human fecal microbiome sample from an ulcerative colitis experiment. Information about the sample can be found on its [SRA listing](https://www.ncbi.nlm.nih.gov/sra/?term=SRR24091844).

In [None]:
! aws s3 cp s3://nigms-sandbox/nosi-usd-biofilms/  Core_Dataset_Prep/  --exclude "*" --include "*.fastq.gz"  --recursive
! aws s3 cp s3://nigms-sandbox/nosi-usd-biofilms/ampliseq  ./ampliseq/    --recursive

### AWS Batch Setup

To run Nextflow on AWS Batch in a serverless manner, you need the appropriate permissions, roles, and resources in place. You can set these up manually or deploy them **automatically** with a CloudFormation stack template. The Launch Stack button below takes you to the CloudFormation Create Stack page with the template for the required resources already linked.

To deploy automatically, click the Launch Stack button below. For a walkthrough of the screens during automatic deployment, click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take about 5 minutes, after which the resources will be ready for use.

[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml)


If you do not have the required roles, policies, permissions, or compute environment and would like to set them up **manually**, click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md) and complete that setup before beginning this tutorial.

Below is the `aws` profile that we have added to the `nextflow.config` file. Config files specify a wide range of parameters that Nextflow uses to run your workflow. The profile sets AWS Batch as the executor for the pipeline (`executor = 'awsbatch'`) and also specifies the AWS Batch job queue (`queue`), the AWS region (`aws.region`), the working directory (`workDir`), and the output directory (`params.outdir`). It is important that you update these values to match your own AWS resources before launching your run in the cell below.

```
aws {
    process {
        executor = 'awsbatch'
        queue = 'nextflow-batch-job-queue'  // Name of your job queue
        container = 'quay.io/nextflow/rnaseq-nf:v1.1'
    }
    workDir = 's3://your_bucket_name/rna-tmp/'            // Path of your working directory
    params.outdir = 's3://your_bucket_name/rna-outputs/'  // Path of your output directory

    fusion.enabled = true
    wave.enabled = true
    aws.region = 'us-east-1'  // Your AWS region
}
```
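
Because the profile ships with placeholder values, it can help to check that none remain before launching. The following is a minimal sketch of such a check, not part of Nextflow or the ampliseq pipeline: the helper names `find_placeholder_lines` and `s3_paths` are hypothetical, and the sketch assumes the config lives at `ampliseq/nextflow.config` and uses single-quoted paths as in the profile above.

```python
import re
from pathlib import Path

# Placeholder bucket name used in the profile shown above.
PLACEHOLDER = "your_bucket_name"


def find_placeholder_lines(config_text, placeholder=PLACEHOLDER):
    """Return (line_number, stripped_line) for every line still containing the placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if placeholder in line
    ]


def s3_paths(config_text):
    """Return all single-quoted s3:// paths found in the config text."""
    return re.findall(r"'(s3://[^']+)'", config_text)


if __name__ == "__main__":
    # Assumed location of the config downloaded earlier in this notebook.
    config_path = Path("ampliseq/nextflow.config")
    if config_path.exists():
        text = config_path.read_text()
        for number, line in find_placeholder_lines(text):
            print(f"Line {number} still has a placeholder: {line}")
        print("S3 paths in config:", s3_paths(text))
    else:
        print(f"{config_path} not found; download the pipeline files first.")
```

This only catches the literal `your_bucket_name` placeholder. It does not verify that the bucket, job queue, or region actually exist in your account.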

### Run `ampliseq` pipeline

In [None]:
! nextflow run nf-core/ampliseq -r 2.6.0 \
    -profile aws \
    -c ampliseq/nextflow.config \
    --input "Core_Dataset_Prep/" \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --skip_taxonomy "true" \
    --skip_cutadapt "true"

After the pipeline completes, you can see the output of each stage in the directory you specified as `params.outdir` in the config file. The output files generated by fastqc, dada2, and the other tools are organized so that each tool has its own directory. As you scale the size and number of metagenomic runs that you execute, using AWS Batch with Nextflow will improve the efficiency and reproducibility of your analysis.

## Conclusion

You can see from the progress updates above that each step of the pipeline was given a process name and allocated appropriate resources. Once a process was done, its resources were shut down and deleted so you don't have to continue paying for them. Output files and logging information were stored safely in the storage buckets we specified in the config file. Using resource managers like AWS Batch is a great way to scale your workflows as your datasets get bigger.

## Clean up

Remember to stop your notebook instance when you are done!