# Introduction to AWS Batch

**Skill Level:** Beginner

## Overview

This tutorial walks you through setting up and using AWS Batch to run bioinformatics pipelines with Nextflow. By the end, you'll have a functional AWS Batch environment and understand the basics of job submission.

AWS Batch is a fully managed service that enables you to run batch computing workloads on the AWS Cloud. It's designed to handle large-scale, compute-intensive tasks efficiently without the need to manage the underlying infrastructure. Its key benefits include:

- **Simplified workload management**: eliminates the need to install and manage batch computing software.
- **Cost optimization**: automatically scales resources based on job requirements, so you only pay for the compute resources you actually use; instances are terminated once the jobs are done.
- **Scalability and flexibility**: handles workloads of any scale, from a few jobs to hundreds of thousands, automatically managing the required infrastructure. It supports various job types, including single-node and multi-node parallel jobs, allowing you to run diverse workloads.
- **No upfront commitments**: there are no upfront costs or commitments required to use AWS Batch.
- **Monitoring and logging**: provides built-in monitoring and logging through integration with Amazon CloudWatch.

## Learning Objectives

In this tutorial you will learn how to:
- Set up AWS Batch via the console
- Run a batch job via the console
- Run a batch job using Nextflow

## Prerequisites <a id='prereq'></a>

Please ensure you have a VPC, subnets, and security group set up before running this tutorial.

This tutorial requires the **'AWSServiceRoleForBatch'** role with the following policies:
- AdministratorAccess
- AmazonSageMakerFullAccess
- AWSBatchServiceRole

You must also have an **"Instance role"**. To create an instance role, follow the instructions below:

Create IAM Role
- Navigate to IAM in the console
- In the left navigation pane, click on "Roles"
- Click the "Create role" button
- Under "Trusted entity type", select "AWS Service"
- Select "EC2" as the use case
- Click "Next"

Attach Permissions Policies
- Search for and select the following policies:
    - AmazonECS_FullAccess
    - AmazonEC2ContainerRegistryFullAccess
- Click "Next"

Name and Review the Role
- Give your role a name (e.g., "AWSBatchInstanceRole")
- Review the role details
- Click "Create role"
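
The console steps above can also be described as data. The sketch below builds the trust policy that lets the EC2 service assume the role and expands the two managed policy names into ARNs. The helper names are hypothetical, and the role name is only the example from the steps above. The snippet makes no AWS calls; passing these values to IAM (for example through an SDK) is left to your own setup.

```python
import json

# Example role name taken from the steps above.
ROLE_NAME = "AWSBatchInstanceRole"

# Managed policies selected in the "Attach Permissions Policies" step.
MANAGED_POLICY_NAMES = [
    "AmazonECS_FullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
]


def ec2_trust_policy():
    """Return a trust policy that allows the EC2 service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def managed_policy_arns(names):
    """Expand AWS managed policy names into their ARNs."""
    return [f"arn:aws:iam::aws:policy/{name}" for name in names]


trust_policy_json = json.dumps(ec2_trust_policy(), indent=2)
policy_arns = managed_policy_arns(MANAGED_POLICY_NAMES)
print(trust_policy_json)
print(policy_arns)
```

Selecting "EC2" as the use case in the console corresponds to the `ec2.amazonaws.com` principal in this document.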

## Pricing

If you complete this tutorial in one sitting, it will cost approximately $2.00. Completing it over multiple sessions, or using a method different from the one described here, may increase the cost.

## Get Started

### Creating a Compute Environment

Navigate to **'AWS Batch'** in the console. In the left side menu of AWS Batch, go to **'Compute environments'** and select **'Create'**. AWS Batch gives you three options for a compute environment:

![batch1](../../docs/images/aws_batch_1.png)

- **AWS Fargate**: A serverless compute engine for containers that eliminates the need to manage underlying infrastructure. It automatically scales and manages compute resources, making it ideal for running containerized applications without server management.

- **Amazon EC2 (Elastic Compute Cloud)**: Provides resizable compute capacity in the cloud and offers a wide variety of instance types optimized for different use cases. Gives you full control over the underlying virtual machines. Allows customization of operating systems, network, and storage configurations. Suitable for a broad range of workloads, from web servers to high-performance computing.

- **Amazon EKS (Elastic Kubernetes Service)**: Managed Kubernetes service. Simplifies the deployment, management, and scaling of containerized applications using Kubernetes. Provides a fully managed control plane for Kubernetes clusters. Ideal for organizations already using Kubernetes or requiring its advanced orchestration capabilities.


For this tutorial we will work with EC2, so select 'EC2'.
- Select "Managed" for Orchestration type
- Enter a name for your compute environment
- Under Service role, select "AWSServiceRoleForBatch"
- Under Instance role, select an available instance role (if you don't already have one, review the instructions [here](#prereq))
- Click 'Next'

![batch2](../../docs/images/aws_batch_2.png)

Now we will configure our compute environment to use **Spot Instances** to save on costs. Spot Instances are spare EC2 capacity available for less than the On-Demand price. We have left the fields shown in the image below at their default values. Click 'Next'.

![batch3](../../docs/images/aws_batch_3.png)

For Network Configuration, select the VPC, subnets, and security groups you would like to use. This allows AWS Batch to create instances that can communicate with each other and that can reach only the networks you permit.

![batch4](../../docs/images/aws_batch_4.png)

The last step is to review the configuration of your compute environment. Once you are satisfied, click "Create compute environment".

![batch5](../../docs/images/aws_batch_5.png)
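
As a rough summary of the choices made in the console, the sketch below assembles the same settings as a request dictionary. The keys follow the AWS Batch `CreateComputeEnvironment` API. The helper name, the environment name, the subnet and security group IDs, and the vCPU limit of 256 are placeholders assumed for illustration, not values from this tutorial. Nothing is sent to AWS; the dictionary could be passed to an SDK call if you prefer scripting over the console.

```python
def compute_environment_request(name, instance_role, subnets,
                                security_group_ids, max_vcpus=256):
    """Build a request for a managed EC2 compute environment on Spot."""
    if not subnets or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",               # "Managed" orchestration type
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",              # use Spot Instances to save on costs
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,               # scale down to zero when idle
            "maxvCpus": max_vcpus,       # assumed limit, adjust as needed
            "instanceTypes": ["optimal"],
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }


# Placeholder identifiers; replace them with values from your own account.
request = compute_environment_request(
    name="tutorial-compute-env",
    instance_role="AWSBatchInstanceRole",
    subnets=["subnet-EXAMPLE"],
    security_group_ids=["sg-EXAMPLE"],
)
print(request["computeResources"]["type"])
```

Setting `minvCpus` to 0 is one way to ensure no instances stay running between jobs, which fits the cost goal described above.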

### Creating a Job Queue

Now that we have created a compute environment, let's create a **job queue**. Job queues help Batch stay organized by holding jobs until they can be scheduled to run in a compute environment.

In the AWS Batch console, go to the left side menu, click "Job queues", and click "Create".
   - Set orchestration type to EC2
   - Give your queue a name and set its priority. For this tutorial we have set it to '1000' to give it the highest priority
   - Associate the compute environment you created in the previous step
   - Review and create the job queue

![batch6](../../docs/images/aws_batch_6.png)
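
The job queue settings above can be summarized the same way. This is a minimal sketch whose keys follow the AWS Batch `CreateJobQueue` API; the helper name and the compute environment name are hypothetical, and the queue name `test` is assumed only because it matches the queue used in the Nextflow config later in this tutorial. Queues with a higher integer priority are evaluated first.

```python
def job_queue_request(name, compute_environment, priority=1000):
    """Build a request for a job queue tied to a single compute environment."""
    if priority < 0:
        raise ValueError("priority must be a non-negative integer")
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,  # higher values are scheduled first
        "computeEnvironmentOrder": [
            # A queue may list several environments; here there is only one.
            {"order": 1, "computeEnvironment": compute_environment},
        ],
    }


# Placeholder names; use the names you chose in the console.
queue = job_queue_request(name="test", compute_environment="tutorial-compute-env")
print(queue["jobQueueName"], queue["priority"])
```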

### Applying Permissions

In this step we enable AWS Batch permissions on EC2 clusters; without this, our jobs will not run.
- On the left side menu under 'Control settings' click 'Permissions'
- Next to Container insights click 'Edit'
- Using the toggles, select which compute environments should have these permissions and click 'Save changes'

![batch7](../../docs/images/aws_batch_7.png)

Now that you have set the permissions, head back to the compute environment console and ensure that your environment is flagged as valid.

![batch8](../../docs/images/aws_batch_8.png)

### Install Nextflow

In [None]:
#Run if you don't have Java installed
! sudo apt update
! sudo apt-get install default-jdk -y
! java -version

In [None]:
#Install Nextflow, make it executable, and update it
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update
#Add Nextflow to your path (copied rather than moved so ./nextflow still works in later cells)
! sudo cp $PWD/nextflow /usr/local/bin/

### Nextflow 101

A working Nextflow workflow relies on several files and components:

- **Main file:** The main file is a .nf file that holds the processes and channels describing the inputs, outputs, and shell commands, plus a workflow block that acts like a recipe book telling Nextflow what to run and under which conditions. For Snakemake users, this is comparable to a Snakefile and its 'rules'.
- **Process:** Contains channels and scripts that can be executed on a Linux server, such as bash commands.
- **Channel:** Provides the way processes communicate with each other. For example, input and output are channels that point the process to where data is or should be located.
- **Config file:** The .config file contains parameters and one or more profiles. Each profile can specify a different executor or software environment (e.g., LS API, conda, docker), memory or machine type, output directory, working directory, and more!

### Run a Nextflow 'Hello World' process locally

We will first run Hello World locally using a Nextflow script named hello.nf.

It should look like this (add this code to a file named hello.nf):

```
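#!/usr/bin/env nextflow
nextflow.enable.dsl=2

// Default greeting; override it on the command line with --str
params.str = 'Hello World'

process sayHello {
  input:
  val str

  output:
  stdout

  // Write the greeting to a file, then print the file to stdout
  script:
  """
  echo $str > hello.txt
  cat hello.txt
  """
}

workflow {
  // Run the process and print its standard output
  sayHello(params.str) | view
}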
```

In [None]:
#run your nextflow script
! ./nextflow run hello.nf --str 'Hello!'

### Submitting an AWS Batch Nextflow Job

Create a bucket to store our outputs.

In [None]:
BUCKET_NAME = 'ENTER_BUCKET_NAME'

In [None]:
!aws s3 mb s3://$BUCKET_NAME
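
Bucket creation fails if the name breaks the S3 naming rules, which is easy to do by leaving the placeholder in place. The sketch below checks a subset of those rules before you run the command: 3 to 63 characters, lowercase letters, digits, dots, and hyphens only, starting and ending with a letter or digit, and no consecutive dots. The function name is hypothetical, and the check does not cover every rule or whether the name is already taken.

```python
import re

# Subset of the S3 bucket naming rules; not an exhaustive validator.
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the basic S3 naming checks."""
    return _BUCKET_PATTERN.fullmatch(name) is not None and ".." not in name


# The unedited placeholder fails because of uppercase letters and underscores.
print(looks_like_valid_bucket_name("ENTER_BUCKET_NAME"))
print(looks_like_valid_bucket_name("my-nextflow-outputs"))
```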

Create and modify your own config file to include an 'awsbatch' profile block, which tells Nextflow to submit the job to AWS Batch instead of running it locally.

The config file allows Nextflow to use executors such as AWS Batch. In this tutorial the config file is named 'nextflow.config'. Open this file and replace each <BUCKET_NAME> placeholder with your account-specific bucket name.

Make sure the region you specify is one where AWS Batch is available, and specify your working directory bucket and output directory bucket.


```
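// Load the AWS plugin so Nextflow can work with AWS Batch and S3
plugins {
    id 'nf-amazon'
}

profiles {
    awsbatch {
        process {
            executor = 'awsbatch'
            queue = 'test'    // replace with the name of your job queue
            container = 'quay.io/nf-core/ubuntu:22.04'
        }
        // Keep the working and output directories separate
        workDir = 's3://<BUCKET_NAME>/nextflow_env/'
        params.outdir = 's3://<BUCKET_NAME>/nextflow_output/'

        fusion.enabled = true
        wave.enabled = true
        aws.region = 'us-east-1'    // use the region of your Batch resources
    }
}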
```

**Note:** Make sure your working directory and output directory are different! Temporary files are created in the working directory within your bucket and take up space, so once your pipeline has completed successfully, feel free to delete them.
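
To make the placeholder substitution and the "different directories" rule concrete, here is a small sketch that fills in the bucket name and refuses identical locations. The function name is hypothetical; the two path templates are the ones used in the config above.

```python
WORKDIR_TEMPLATE = "s3://<BUCKET_NAME>/nextflow_env/"
OUTDIR_TEMPLATE = "s3://<BUCKET_NAME>/nextflow_output/"


def resolve_dirs(bucket, workdir_template=WORKDIR_TEMPLATE,
                 outdir_template=OUTDIR_TEMPLATE):
    """Fill in the bucket name and check that the two directories differ."""
    work_dir = workdir_template.replace("<BUCKET_NAME>", bucket)
    out_dir = outdir_template.replace("<BUCKET_NAME>", bucket)
    # Compare without trailing slashes so "x/" and "x" count as the same path.
    if work_dir.rstrip("/") == out_dir.rstrip("/"):
        raise ValueError("working and output directories must be different")
    return work_dir, out_dir


# "my-nextflow-outputs" is a placeholder bucket name.
work_dir, out_dir = resolve_dirs("my-nextflow-outputs")
print(work_dir)
print(out_dir)
```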

In [None]:
! ./nextflow run nf-core/methylseq -profile test,awsbatch -c nextflow.config 

### Optional: Listing nf-core tools with docker and viewing their commands

Using the command below, you can see all the bioinformatics pipelines that nf-core holds and their versions/latest releases.

In [None]:
! docker run nfcore/tools list

You can view commands for methylseq (or any other specified nf-core tool) by using the `--help` flag.

In [None]:
! ./nextflow run nf-core/methylseq --help

### Monitor Job Progress

   - After submitting the job, you can monitor its progress in the "Jobs" section of the AWS Batch console.
   - Check the job's status, logs, and any output as needed.
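
If you would rather tally job states than read them one by one, the sketch below counts statuses from a list of job summaries. It assumes each summary is a dictionary with a `status` key, as in the `jobSummaryList` returned by the AWS Batch `ListJobs` API. The function names and sample jobs are made up for illustration, and no AWS call is made here.

```python
from collections import Counter

# A job is finished once it reaches one of these states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def summarize_jobs(job_summaries):
    """Count how many jobs are in each status."""
    return Counter(job["status"] for job in job_summaries)


def all_finished(job_summaries):
    """Return True when every job has either succeeded or failed."""
    return all(job["status"] in TERMINAL_STATES for job in job_summaries)


# Made-up sample data shaped like entries of a jobSummaryList.
sample_jobs = [
    {"jobName": "example-task-1", "status": "SUCCEEDED"},
    {"jobName": "example-task-2", "status": "RUNNING"},
    {"jobName": "example-task-3", "status": "RUNNABLE"},
]
counts = summarize_jobs(sample_jobs)
print(dict(counts))
print(all_finished(sample_jobs))
```

Jobs that stay in `RUNNABLE` for a long time often mean the compute environment cannot launch instances, so the compute environment settings are a reasonable first place to look.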

## Conclusion

You've now successfully set up an AWS Batch environment, created a job queue, and submitted a job via Nextflow. This basic workflow can be expanded to handle more complex batch processing tasks, leveraging AWS Batch's scalability and management features.

## Clean Up

Delete your bucket, job queue, compute environment, and any unsuccessful jobs via the console. If you are using any Jupyter notebooks through AWS, please stop or delete the notebook to avoid accruing costs.