# Snakemake on AWS ParallelCluster 

**Skill Level: Beginner**

## Overview

### AWS ParallelCluster 
AWS ParallelCluster is an AWS-supported open source cluster management tool that simplifies creating, deploying, and managing High-Performance Computing (HPC) clusters on AWS. It automates cluster setup, and users define their cluster configuration, including instance types, storage options, install scripts, and networking settings such as the VPC, subnets, and security groups.

AWS ParallelCluster integrates with  HPC schedulers like Slurm, offering robust job scheduling and resource management. 

### Snakemake and the `pcluster-slurm` plugin

Snakemake is a workflow manager that simplifies the process of creating and executing complex data analysis pipelines. It uses a Python-based language to define workflows and automate the execution of tasks. 

Snakemake workflows can be run on AWS ParallelCluster using the `pcluster-slurm` executor plugin, which lets Snakemake submit workflow jobs to the Slurm scheduler managed by AWS ParallelCluster.

## Learning Objectives 

We hope that by the end of this tutorial, you will:
1. Understand the concepts of AWS ParallelCluster for High-Performance Computing.
2. Know how to set up and configure an AWS ParallelCluster environment.
3. Gain a basic understanding of Snakemake as a workflow manager and learn how to define and execute workflows using Snakemake's syntax.
4. Discover how to integrate Snakemake workflows with AWS ParallelCluster using the `pcluster-slurm` plugin.

## Prerequisites

* Access to a VPC network through your AWS account
* Access to the AWS ParallelCluster UI 

Please follow the installation instructions for the ParallelCluster UI (PCUI) provided [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/Install_AWSParallelCluster.md). These instructions guide you through creating the CloudFormation stack through which you access the AWS ParallelCluster UI.

Additionally, we urge you to check out the documents within the `docs/` folder of the repository for more bioinformatics and Gen AI tutorials.

Once you have created the CloudFormation stack for the PCUI, navigate to the user interface URL. It will look like this:

![alt text](../../docs/images/pcui.png)


## Pricing 

When using AWS ParallelCluster, you only pay for the services utilized when creating and operating clusters, including computing, storage, and CloudFormation costs. A full list of services that AWS ParallelCluster may use can be found [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/aws-services-v3.html).

The PCUI is built on a serverless architecture and, in most cases, can be used within the AWS Free Tier. For more information on the costs associated with the PCUI, please refer to the documentation found [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-costs-v3.html).

## Get Started

### Create a Cluster 
Let's create a cluster within the ParallelCluster environment.

![alt text](../../docs/images/create-cluster.png)

1. In the PCUI Clusters view, choose **Create cluster** > **Step by step**.
2. In Cluster, **Name**, enter a name for your cluster.
3. Choose a **VPC** from the available options and choose Next. CloudLab users will have access to pre-configured VPC networks.
4. In **Head node**, choose **Add SSM session**. This allows you to access the head node through the **Shell** button. Change the instance type of your head node to **t2.xlarge**.
5. In **Queues**, provide a name and subnet for your queue.
6. In **Compute resources**, choose 1 for **Static nodes** and select **c5n.large** as the instance type for your compute resources. 
7. In Storage, choose Next.
8. In Cluster configuration, review the cluster configuration YAML and choose **Dry run** to validate it.
9. Choose **Create** to create your cluster based on the validated configuration.
10. After a few seconds, the PCUI automatically navigates you back to Clusters, where you can monitor the cluster creation status and Stack events.
    * Choose **Details** to see cluster details, such as the version and status.
    * Choose **Instances** to see the list of Amazon EC2 instances and their status.
    * Choose **Stack events** to view cluster stack events and an AWS Management Console link to the CloudFormation stack that creates the cluster.
11. In Details, after cluster creation completes, choose **View YAML** to view or download the cluster configuration YAML file.

This is the YAML file for the cluster described above: 

```yaml
Imds:
  ImdsSupport: v2.0
HeadNode:
  InstanceType: t2.xlarge
  Imds:
    Secured: true
  LocalStorage:
    RootVolume:
      VolumeType: gp3
      Size: 50
  Networking:
    SubnetId: subnet-0be4c22d8137b8085
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  Ssh:
    KeyName: Snakemake-cluster-key-pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue-1
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: queue-1-cr-1
          Instances:
            - InstanceType: c5n.large
          MaxCount: 1
          MinCount: 1
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-0be4c22d8137b8085
  SlurmSettings: {}
Region: us-east-1
Image:
  Os: alinux2
```
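In a ParallelCluster configuration, `MinCount` is the number of static (always running) nodes in a compute resource, and any capacity between `MinCount` and `MaxCount` is launched on demand. The following is a minimal sketch of that arithmetic in plain Python; the helper name `node_counts` is hypothetical and the dictionary simply mirrors the values from the YAML above.

```python
# Values copied by hand from the ComputeResources entry in the YAML above.
compute_resource = {"Name": "queue-1-cr-1", "MinCount": 1, "MaxCount": 1}


def node_counts(resource):
    """Return (static, dynamic) node counts for one compute resource.

    Hypothetical helper for illustration; not part of any AWS tool.
    """
    static = resource["MinCount"]
    dynamic = resource["MaxCount"] - resource["MinCount"]
    if dynamic < 0:
        raise ValueError("MaxCount must be greater than or equal to MinCount")
    return static, dynamic


print(node_counts(compute_resource))  # prints (1, 0)
```

With one static node and no dynamic nodes, the cluster in this tutorial always has exactly one compute node available for jobs.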

### Set up your head node

After cluster creation completes, click the **Shell** button to access the cluster head node. We will work in the ParallelCluster shell for the rest of the tutorial, and all commands will be entered there.


1. Switch to the `ssm-user` account and create a working directory.

```bash
# Enter in the ParallelCluster shell
sudo -su ssm-user
cd ~
mkdir workdir
```

2. Install mamba. This is required as we will be executing Snakemake using mamba. 

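```bash
# Enter in the ParallelCluster shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# If you let the installer initialize conda, reload your shell settings
# (for example with `source ~/.bashrc`) so that `conda` is on your PATH.
conda install mamba -c conda-forge
mamba --version
```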

3. Install Snakemake and the Snakemake ParallelCluster plugin. 

**Note:** the `pcluster-slurm` plugin requires Snakemake > 8.0.0

```bash
# Enter in the ParallelCluster shell
pip3 install snakemake==8.25.5
pip3 install snakemake-executor-plugin-pcluster-slurm
```

Alternatively, you may install Snakemake with conda using the following command:

```bash
conda install bioconda::snakemake==8.25.5
```

### Submitting a "Hello World" job to the Slurm cluster using `sbatch`


You can submit jobs to the Slurm cluster by writing a Slurm batch script and submitting it with the `sbatch` command.

1. Create a Slurm script. This example writes "Hello World" to an output file, then appends the name of the compute node the task ran on.

To create files in the Parallel Cluster shell we are using the text editor Vim.

```bash
# Create the file
vim hello-world.slurm
```

```bash
#!/bin/bash
# hello-world.slurm
#SBATCH --job-name=hello-world
#SBATCH --output=hello-world.out
#SBATCH --error=hello-world.err
#SBATCH --ntasks=1
#SBATCH --time=00:01:00

echo "Hello, World!" > ~/workdir/hello-world.out

# Print the hostname of the node the job ran on
echo "This job ran on node: $(hostname)" >> ~/workdir/hello-world.out
```

Once you are done editing your file, press `Esc`, type `:wq`, and press Enter to save your changes and exit.

2. Submit the job using the `sbatch` command.

```bash
sbatch hello-world.slurm
```
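On success, `sbatch` prints a line of the form `Submitted batch job <id>`. If you ever want to submit jobs from a script instead of typing the command, a minimal Python sketch could capture that ID as shown here. The function names `parse_job_id` and `submit` are hypothetical, and the sketch assumes `sbatch` is on the `PATH` of the head node.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script_path):
    """Submit a Slurm script and return its job ID (requires sbatch)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# Parsing can be tried without a cluster:
print(parse_job_id("Submitted batch job 42\n"))  # prints 42
```

The returned ID can then be passed to standard Slurm commands such as `squeue -j <id>` to check whether the job is still pending or running.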

### Snakemake workflow files

When running a Snakemake workflow, it is common to organize the workflow dependencies in the following structure: 

```bash
Project Folder
│
├── Snakefile
│
├── config.yaml
│
├── environment.yml
│
├── data
│   ├── file 1
│   └── file 2...
```
**Snakefile:** A Snakefile is the main file used in Snakemake to define a workflow. The commands to be executed, the input and output files and the dependencies of each step are defined as rules in this file. This file must be present in the working directory; if named Snakefile, Snakemake will automatically recognize it as the workflow definition file. If named differently, you must use the -s flag to specify the file. 

**config.yaml:** The config.yaml file is used to store configuration parameters that can be easily accessed and utilized throughout the workflow. This file allows you to define various settings, paths, parameters, and other variables that your Snakemake rules might need.

**environment.yml:** The environment.yml file defines the software environment required to run the Snakemake workflow, including package names and versions.



### Snakefile Structure

A Snakefile consists of rule definitions, usually including a `rule all` that names the final targets.

**Rules:** 

* Rules are the building blocks of a Snakemake workflow. Each rule describes how to create one or more output files from input files. 
* The commands that must be run, any scripts, and environment parameters can be defined in the rule definition

**Rule all:**

* This rule tells Snakemake what the end products should be when the workflow is complete. It is usually the first rule that is defined. 

**Input and Output:** 

* These keywords specify the files that are inputs to and outputs from a rule. 
* The input keyword lists the files needed to run the rule, and the output keyword lists the files that will be produced by the rule. 
* The order in which rules execute is determined by matching the input of one rule to the output of another.

Example: 

```python
rule rule_1:
    input:
        "input_1.txt"
    output:
        "output_1.txt"

rule rule_2: 
    input: 
        "output_1.txt"
    output: 
        "output_2.txt" 
```
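To make the dependency idea concrete, here is a minimal sketch in plain Python of how an execution order can be derived from input and output files alone. It is a simplified illustration, not Snakemake's actual implementation: the `RULES` dictionary and the `execution_order` function are hypothetical, and the sketch assumes there are no circular dependencies.

```python
# Hypothetical, simplified model of the two rules above:
# rule name -> (list of inputs, list of outputs)
RULES = {
    "rule_1": (["input_1.txt"], ["output_1.txt"]),
    "rule_2": (["output_1.txt"], ["output_2.txt"]),
}


def execution_order(target, rules):
    """Return the rule names needed to build `target`, in run order."""
    # Map every output file to the rule that produces it.
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    order = []

    def visit(path):
        name = producer.get(path)
        if name is None or name in order:
            return  # a source file (no rule creates it) or already scheduled
        for dependency in rules[name][0]:
            visit(dependency)  # schedule upstream rules first
        order.append(name)

    visit(target)
    return order


print(execution_order("output_2.txt", RULES))  # prints ['rule_1', 'rule_2']
```

Requesting `output_2.txt` pulls in `rule_2`, whose input `output_1.txt` in turn pulls in `rule_1`, so `rule_1` runs first even though nothing states the order explicitly.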

**Shell Command:** 

The shell keyword is used to specify the shell command that will be executed to produce the output files.

**Snakefile Example** 

A Snakefile can look like this: 

```python
# Snakefile

# Rule all: specifies the final target of the workflow
rule all:
    input:
        "output_2.txt"

# Rule 1: processes input_1.txt to produce output_1.txt
rule rule_1:
    input:
        "input_1.txt"
    output:
        "output_1.txt"
    shell:
        "cat {input} > {output}"

# Rule 2: processes output_1.txt to produce output_2.txt
rule rule_2:
    input:
        "output_1.txt"
    output:
        "output_2.txt"
    shell:
        "wc -l {input} > {output}"

```
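Before a shell command runs, the `{input}` and `{output}` placeholders are replaced with the rule's actual file names, with multiple files separated by spaces. A rough plain-Python sketch of that substitution is shown below; the `render_shell` helper is hypothetical and covers only these two placeholders.

```python
def render_shell(command, inputs, outputs):
    """Fill {input} and {output} in a shell template with space-joined paths.

    Hypothetical helper for illustration; Snakemake supports many more
    placeholders (wildcards, params, threads, and so on).
    """
    return command.format(input=" ".join(inputs), output=" ".join(outputs))


print(render_shell("wc -l {input} > {output}", ["output_1.txt"], ["output_2.txt"]))
# prints: wc -l output_1.txt > output_2.txt
```

This is why the same rule body can be reused unchanged when file names differ: only the `input` and `output` sections need to be edited.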

### Submitting a "Hello World" script using Snakemake, Slurm and the `pcluster-slurm` plugin

Now that we've covered the basics of writing a script in Snakemake, let's create and run our first workflow! 

#### Create and Run the workflow
1. Create a project directory and a Snakefile inside it.

```bash
mkdir hello-world-snakemake
cd hello-world-snakemake
vim Snakefile
```

Add the following content to the file: 

```python
#Snakefile
rule all:
    input:
        "output.txt"

rule example_rule:
    output:
        "output.txt"
    shell:
        """
        echo 'Hello, World!' > {output}
        """
```



Once you are done, type `:wq` and press Enter to save your changes.

**Snakefile Breakdown** 
* This Snakemake workflow defines a `rule all` that requests a file called "output.txt" as the final target.
* In `example_rule`, the shell keyword is used to echo "Hello, World!" into the output file.

2. Execute the workflow with the `snakemake` command. To submit it to the Slurm scheduler integrated in AWS ParallelCluster, pass the `--executor` flag and specify the `pcluster-slurm` plugin.

```bash
snakemake --executor pcluster-slurm
```

### Submitting a bioinformatics Snakemake workflow to the Slurm cluster

In this example, we will use Snakemake and the `pcluster-slurm` plugin to run a bioinformatics pipeline.


#### Download the input data

The input data consists of raw FASTQ files. Use the `curl` command to download the data from a public NIGMS Google Cloud Storage bucket.

```bash
# Run in the ParallelCluster shell

# Create and enter the project directory
mkdir ~/workdir/bioinfo-workflow
cd ~/workdir/bioinfo-workflow

# Make the data directory
mkdir ./data
cd data

# Run curl commands
curl -o SRR13349122_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_1.fastq
curl -o SRR13349122_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349122_2.fastq
curl -o SRR13349123_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_1.fastq
curl -o SRR13349123_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349123_2.fastq
curl -o SRR13349128_1.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_1.fastq
curl -o SRR13349128_2.fastq https://storage.googleapis.com/nigms-sandbox/me-inbre-rnaseq-pipelinev2/data/raw_fastqSub/SRR13349128_2.fastq

# Return to the project directory
cd ..
```
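A quick way to sanity-check the downloads is to count the reads in each file. The sketch below relies on the standard FASTQ layout of four lines per read and assumes the files are uncompressed and sit in a `data` directory relative to where the script is run; the function name `count_fastq_records` is hypothetical.

```python
from pathlib import Path


def count_fastq_records(path):
    """Count reads in an uncompressed FASTQ file (four lines per read)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        # A truncated download typically leaves an incomplete final record.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Report the read count for every FASTQ file found in ./data (if any).
for fastq in sorted(Path("data").glob("*.fastq")):
    print(fastq.name, count_fastq_records(fastq))
```

For paired-end data such as the `_1`/`_2` files above, the two files of a pair are expected to report the same number of reads.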

#### Create an `environment.yml` file

```bash
# Create the file
vi environment.yml
```

Add the following contents to the environment file:

```yaml
name: bioinformatics-test
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - bwa
  - samtools
  - bcftools
  - matplotlib
  - pandas
  - pysam
```

#### Create a `config.yaml` file

```bash
vi config.yaml
```

Add the following contents to the config file:

```yaml
conda_env: "environment.yml"
```

#### Snakefile Breakdown:

In this section, we'll break down what the Snakemake workflow accomplishes by looking at the sections of the Snakefile.

**Overview:** 
The Snakefile maps a set of fastq files to a reference genome, sorts and indexes the mapped reads and finally runs variant calling on the mapped reads.

**Configuration:** 
* `configfile: "config.yaml"` specifies the configuration file you have created
* `SAMPLES = ["A", "B"]` defines the samples to be processed.

**Workflow:**
* **all:** Specifies the final output files required to complete the workflow.
* Bioinformatics rules 
  * **bwa_index:** Indexes the reference genome file (data/genome.fa) for alignment.
  * **bwa_map:** Maps the sequencing reads (data/samples/{sample}.fastq) to the indexed genome and converts the output to BAM format.
  * **samtools_sort:** Sorts the BAM files generated from the mapping step.
  * **samtools_index:** Indexes the sorted BAM files for faster access.
  * **bcftools_call:** Calls genetic variants from the sorted and indexed BAM files.
  * **plot_quals:** Generates a plot of the quality of the called variants.
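The `{sample}` placeholder in paths such as `mapped_reads/{sample}.bam` is a wildcard: when a concrete file is requested, Snakemake works out which value of `sample` makes the pattern fit. The following plain-Python sketch imitates that matching with a regular expression. It is an illustration only, with a hypothetical `match_wildcards` helper that assumes each wildcard name appears once in the pattern.

```python
import re


def match_wildcards(pattern, path):
    """Return the wildcard values if `path` fits `pattern`, otherwise None."""
    # Splitting on {name} leaves literal text at even indexes and
    # wildcard names at odd indexes.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


print(match_wildcards("mapped_reads/{sample}.bam", "mapped_reads/A.bam"))
# prints {'sample': 'A'}
```

The matched value is what a rule then sees as `{wildcards.sample}` in its shell command.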
  

#### Make the Snakefile 

Create a Snakefile and add the contents described below to it.

```bash
# Create the Snakefile
vi Snakefile
```

```python
#Snakefile
configfile: "config.yaml"

SAMPLES = ["A", "B"]

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=SAMPLES),
        expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES),
        "calls/all.vcf"

rule bwa_index:
    input:
        "data/genome.fa"
    output:
        "data/genome.fa.bwt"
    conda:
        config["conda_env"]
    shell:
        """
        bwa index {input}
        """

rule bwa_map:
    input:
        genome="data/genome.fa",
        fastq="data/samples/{sample}.fastq",
        index="data/genome.fa.bwt"
    output:
        "mapped_reads/{sample}.bam"
    conda:
        config["conda_env"]
    shell:
        """
        bwa mem {input.genome} {input.fastq} > mapped_reads/{wildcards.sample}.sam
        samtools view -Sb mapped_reads/{wildcards.sample}.sam > {output}
        rm mapped_reads/{wildcards.sample}.sam
        """

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    conda:
        config["conda_env"]
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        config["conda_env"]
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    conda:
        config["conda_env"]
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | bcftools call -mv - > {output}"

```
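The `expand()` calls in this Snakefile turn a path pattern and a list of samples into a list of concrete file names. A rough plain-Python equivalent is sketched below so you can see what `rule all` actually requests. The helper name `expand_paths` is hypothetical, and Snakemake's real `expand` supports more options than this sketch.

```python
from itertools import product


def expand_paths(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]


SAMPLES = ["A", "B"]
print(expand_paths("sorted_reads/{sample}.bam.bai", sample=SAMPLES))
# prints ['sorted_reads/A.bam.bai', 'sorted_reads/B.bam.bai']
```

Adding a sample to `SAMPLES` therefore adds one more requested file per `expand()` call without any change to the rules themselves.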

#### Execute the workflow 

Execute the workflow using the **snakemake** command, specifying **pcluster-slurm** as the executor and **conda** as the environment management system. Here we are adding new flags:

- **--use-conda --conda-frontend mamba**: These flags tell Snakemake to create the software environment for each rule from the environment file named in its `conda:` directive, and to use mamba to build those environments.

- **-j**: This flag sets the maximum number of jobs that Snakemake runs in parallel. With a cluster executor, this is the number of jobs submitted to the scheduler at the same time.

```bash
# Enter in the ParallelCluster shell
snakemake --executor pcluster-slurm --use-conda --conda-frontend mamba -j 5
```

## Conclusion 

Congratulations! You have successfully learned how to set up and manage High-Performance Computing (HPC) clusters on AWS using AWS ParallelCluster and you've run a Snakemake workflow on this cluster. 

## Clean Up: 
- To stop all compute nodes, navigate to the PCUI and select **Stop fleet**.
- To clean up, in the Clusters view, select the cluster and choose **Actions** > **Delete cluster**.

## References: 
* [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/)
* [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/)
* [Snakemake `pcluster-slurm` plugin](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/pcluster-slurm.html)
