# The WGBS data analysis tutorial 4 
# Run nf-core/methylseq using AWS Batch

## Overview
For real-world datasets, the sequence files are usually too large to process on a single virtual machine (a SageMaker notebook), or processing takes a long time. In this tutorial, we show how to run the nf-core/methylseq pipeline on WGBS data using AWS Batch.

<img src="images/notebook4.png" width="900" />

**[AWS Batch](https://www.nextflow.io/docs/latest/aws.html#aws-batch)** is a managed compute service that allows the execution of containerized workloads in the Amazon Web Services (AWS) infrastructure. It provides a simple way to execute a series of containerized tasks on AWS. The most common use case is to run an existing tool or custom script that reads and writes files, typically to and from Amazon Simple Storage Service (S3). Nextflow provides built-in support for AWS Batch, which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions to the AWS service. AWS Batch can run such tasks independently over hundreds or thousands of files.

## Learning Objectives

* **Understand and utilize AWS Batch for large-scale WGBS data analysis:** Learn how to overcome limitations of single notebook instances by leveraging AWS Batch's managed compute service for processing large datasets.

* **Set up a Nextflow IAM role and configure notebook permissions:** Learn to create and configure an IAM role with the necessary permissions to interact with AWS services (Batch, S3) from a Jupyter Notebook environment in SageMaker. This includes understanding and applying the principle of least privilege and enabling required service endpoints.

* **Configure Nextflow for AWS Batch execution:** Learn to create and modify a Nextflow configuration file to specify AWS Batch as the executor and to define the AWS region, compute environment (including instance type, spot/on-demand settings, and resource allocation), working directory, and output directory in Amazon S3. Understand the resource implications of choosing different instance types, including the cost-effectiveness of spot instances.

* **Run the nf-core/methylseq pipeline on AWS Batch:** Learn to execute the nf-core/methylseq pipeline using AWS Batch, including specifying pipeline version, profiles, and config files. Understand the differences in runtime between small test datasets and larger real-world datasets.

* **Process real-world WGBS datasets using nf-core/methylseq and AWS Batch:** Gain practical experience downloading and processing a real-world WGBS dataset using the `fasterq-dump` tool and creating a samplesheet for the pipeline.

* **Troubleshoot large-scale pipeline execution:** Learn to identify and resolve common issues such as out-of-memory errors by adjusting instance types and memory allocation within the Nextflow config file and AWS Batch settings. Understand how to optimize pipeline configurations for resource usage.

* **Interpret pipeline execution reports:** Learn to use the pipeline's execution timeline and report to understand resource utilization and identify potential bottlenecks for optimization.
* **Manage Amazon S3 buckets:** Learn to create and manage Amazon S3 buckets for storing input data and pipeline outputs. Understand how to grant appropriate permissions to the IAM role using IAM policies.

## Prerequisites

* **Mamba:** Used for package management to install `nextflow` and `sra-tools`.
* **Nextflow:**  The workflow management system used to run the nf-core/methylseq pipeline.
* **nf-core/methylseq:** The bioinformatics pipeline for processing WGBS data. This tutorial pins specific versions with `-r` (`2.4.0` for the test run and `2.6.0` for the real-world run).
* **AWS Batch Compute Environment and Job Queue:** You must have an AWS Batch compute environment and job queue configured. The CloudFormation template automates this. You can set up one manually following the instructions in the link provided in the notebook, but using the template is *highly recommended* for ease of setup.
* **Amazon S3 Bucket:** An Amazon S3 bucket is crucial for storing input data, intermediate files, and results due to the significant size of WGBS datasets. The notebook guides you through its creation. The notebook utilizes the bucket, with subdirectories like `workdir` and `output`,  for data storage when using AWS Batch or other processing services.
* **Sample Sheet:** A CSV file containing information about the samples (fastq files).  The notebook creates this from example SRA data or needs a user-provided one for larger studies.
* **Reference Genome:**  The notebook uses GRCh38 (human genome) as an example, but another reference genome can be specified.  The correct files must be available or the pipeline must know where to find them.

## Get Started

### AWS Batch Setup

AWS Batch will create the needed permissions, roles and resources to run Nextflow in a serverless manner. You can set up AWS Batch manually or deploy it **automatically** with a stack template. The Launch Stack button below takes you to the AWS CloudFormation create-stack page with the template and required resources already linked.

To skip manual deployment and deploy automatically in the cloud, click the Launch Stack button below. For a walkthrough of the screens during automatic deployment, see [this guide](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take about 5 minutes, after which the resources are ready for use.

[![Launch Stack](images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )


Before beginning this tutorial, if you do not have the required roles, policies, permissions, or compute environment and would like to set them up **manually**, follow [these instructions](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md).

### Create an Amazon S3 Bucket

You can create a customized bucket to use to store the output from the pipeline. Bucket names must be **globally unique** across all AWS accounts, including those outside of your organization. 

<b>Note:</b> We will use bucket name "nextflow-bucket-test" as an example in the following steps, but you need to create your own bucket to store the data. Please replace it with your own bucket name.


In [None]:
# Make an S3 bucket (the bucket name must be given as an s3:// URI)
! aws s3 mb s3://nextflow-bucket-test # replace it with your own bucket name
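
Before creating the bucket, it can help to check a candidate name against the S3 naming rules. The sketch below covers the main rules for general-purpose buckets (3 to 63 characters, lowercase letters, digits, dots and hyphens, no IP-address form). `is_valid_bucket_name` is a hypothetical helper, not part of the AWS CLI, and it cannot tell whether a name is already taken by another account.

```python
import ipaddress
import re


def is_valid_bucket_name(name: str) -> bool:
    """Check a name against the main S3 general-purpose bucket naming rules."""
    # 3-63 characters; lowercase letters, digits, dots and hyphens only;
    # the first and last characters must be a letter or a digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        return False
    # Two adjacent dots are not allowed.
    if ".." in name:
        return False
    # The name must not be formatted like an IPv4 address.
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True
    return False


for candidate in ["nextflow-bucket-test", "Nextflow_Bucket", "192.168.5.4"]:
    print(candidate, is_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by `aws s3 mb` if it already exists somewhere in AWS.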

### Install Nextflow

In [None]:
! mamba install -c bioconda nextflow -y

In [None]:
! nextflow self-update

### Create and modify your own config file to include an `aws` profile block

The config file allows Nextflow to use executors such as AWS Batch. Below is an example config file to run a Nextflow job using AWS Batch:  
```bash
plugins {
    id 'nf-amazon'
}

profiles {
    aws {
        process {
            executor = 'awsbatch'
            queue = 'nextflow-batch-job-queue'
            container = 'nfcore/methylseq'
            
        }
        workDir = 's3://nextflow-bucket-test/nextflow_env/'
        params.outdir = 's3://nextflow-bucket-test/nextflow_output/' 
        
        fusion.enabled = true
        wave.enabled = true
        aws.region = 'us-east-1'

    }
}
```  
There are some fields that you need to define or pay attention to:
- **Executor**. To run the job using AWS Batch, the executor must be defined here using: `process.executor = 'awsbatch'` 
- **Data storage**. For a full-scale dataset, create the bucket ahead of time, along with a directory in it to store the input, output, and intermediate files. `workDir` defines the working directory where intermediate files are stored. You can also set the input samplesheet and the output path with `params.input` and `params.outdir`.
    - If not defined, the work directory and output directory will be created in your local notebook directory as `work` and `results`. This is risky, since the intermediate and final outputs can be too large to store on the notebook instance.
- **Region**. Make sure your region is one where AWS Batch is available.

__Note:__ Best practices are to make sure your working directory (`workDir`) and output directory (`outdir`) are **different**! AWS Batch creates temporary files in the working directory within your bucket that do take up space. So once your pipeline has completed successfully, feel free to delete the temporary files.
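
One way to keep the bucket name, queue, and region consistent is to generate the config text from a small template. The following is a minimal sketch and not part of Nextflow or nf-core: `render_aws_config` and its parameter names are hypothetical, and the defaults simply mirror the example config above. It refuses to produce a config in which `workDir` and `outdir` coincide.

```python
def render_aws_config(bucket, queue, region,
                      work_prefix="nextflow_env",
                      out_prefix="nextflow_output"):
    """Return the text of an 'aws' profile like the example config above."""
    work_prefix = work_prefix.strip("/")
    out_prefix = out_prefix.strip("/")
    if work_prefix == out_prefix:
        raise ValueError("workDir and outdir must be different")
    work_dir = f"s3://{bucket}/{work_prefix}/"
    out_dir = f"s3://{bucket}/{out_prefix}/"
    # Doubled braces are literal braces inside an f-string.
    return f"""plugins {{
    id 'nf-amazon'
}}

profiles {{
    aws {{
        process {{
            executor = 'awsbatch'
            queue = '{queue}'
            container = 'nfcore/methylseq'
        }}
        workDir = '{work_dir}'
        params.outdir = '{out_dir}'

        fusion.enabled = true
        wave.enabled = true
        aws.region = '{region}'
    }}
}}
"""


config_text = render_aws_config(
    bucket="nextflow-bucket-test",       # replace with your own bucket name
    queue="nextflow-batch-job-queue",
    region="us-east-1",
)
print(config_text)
```

The returned string can be written to a file such as `docs/aws-batch.config` and passed to Nextflow with `-c`.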

### Download and **Test** nf-core/methylseq using AWS Batch

The `test` profile (`-profile test`) uses a small dataset allowing you to ensure the workflow works with your config file without long run times. Ensure you include:
- Version of the nf-core tool [-r]
- Location of the config file [-c]

In [None]:
! nextflow run nf-core/methylseq -r 2.4.0 -profile test,aws -c ./docs/aws-batch.config

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
<b>Note</b>: The <code>preseq</code> process may fail, but the pipeline ignores this failure and the output results are not affected. The preseq package predicts and estimates the complexity of a genomic sequencing library: using an initial sequencing experiment, it estimates the number of redundant reads at a given sequencing depth and how many would be expected from additional sequencing.
    </div>

This nf-core/methylseq test profile takes about 20 minutes to finish. Compared with the test profile running time in Tutorial 3 (about 3 minutes), extra time is needed for Nextflow to talk to AWS Batch, Amazon S3, and the EC2 instances. This overhead is not worthwhile for a small dataset, but it becomes negligible for large datasets that need more computational resources. In other words, AWS Batch works well for coarse-grained workloads, i.e. long-running jobs. It is not recommended for pipelines that spawn many short-lived tasks.
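
To make the trade-off concrete, the sketch below treats the difference between the two test runs (about 20 minutes versus about 3 minutes) as a fixed overhead and computes what share of the total run time it represents. This is a deliberate simplification: real overhead accrues per task rather than once per run, and the 30-hour workload is a hypothetical figure chosen for illustration.

```python
def overhead_fraction(compute_minutes: float, overhead_minutes: float) -> float:
    """Share of total wall time spent on scheduling and data-staging overhead."""
    return overhead_minutes / (compute_minutes + overhead_minutes)


local_minutes = 3.0    # test profile in Tutorial 3 (approximate)
batch_minutes = 20.0   # same test profile on AWS Batch (approximate)
overhead = batch_minutes - local_minutes

print(f"test profile: {overhead_fraction(local_minutes, overhead):.0%} overhead")
# Hypothetical 30-hour workload, assuming the same fixed overhead.
print(f"30-hour run:  {overhead_fraction(30 * 60, overhead):.1%} overhead")
```

Under these assumptions the overhead is about 85% of the test run but under 1% of a 30-hour run.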

In [None]:
# Empty the bucket for the next part
! aws s3 rm --recursive s3://nextflow-bucket-test/

## An Example of a Real World Dataset<a name="REAL"/>

1. Install SRA-tools
2. Download the Data
3. Modify the config file
4. Run the job

#### Install SRA-tools

The **[SRA Toolkit](https://github.com/ncbi/sra-tools/wiki)** and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives. Here we use `mamba` to install `sra-tools`:

In [None]:
! mamba install -c bioconda sra-tools=3.0.5 -y

#### Download the Data

The data are from [Molaro, Antoine, et al. Cell 146.6 (2011): 1029-1041](https://www.sciencedirect.com/science/article/pii/S0092867411009421) and [Laurent, Louise, et al. Genome Research 20.3 (2010): 320-331](https://genome.cshlp.org/content/20/3/320.full). During germ cell and preimplantation development, mammalian cells undergo nearly complete reprogramming of DNA methylation patterns. The studies profiled the methylomes of human and chimp sperm as a basis for comparison to methylation patterns of embryonic stem cells (ESCs).   
<img src="images/4_data_graph.jpg" width="300" />

We use one sample from human sperm and one sample from ESCs as examples to demonstrate the workflow here.

Use `fasterq-dump` to download data from SRA using accession numbers. The data will be stored in `Tutorial_4/sra_download`:

In [None]:
! fasterq-dump --threads 4 --progress SRR306435 SRR033942 -O Tutorial_4/sra_download

Remove the temporary output directory from running `fasterq-dump`:

In [None]:
! rm -rf fasterq.tmp.*

Compress the files.

In [None]:
# pigz compresses in parallel, which is faster than gzip
! pigz Tutorial_4/sra_download/SRR*
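
After compression, a quick sanity check is to confirm that both mates of a paired-end run contain the same number of reads. The sketch below assumes standard four-line FASTQ records. `count_fastq_records` and `mates_match` are hypothetical helper names, and reading a full-size WGBS file this way can take a long time.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzip-compressed FASTQ file (four lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(fastq_1, fastq_2):
    """Return True when both mate files hold the same number of reads."""
    return count_fastq_records(fastq_1) == count_fastq_records(fastq_2)


# Example call, using the file names produced by the cells above:
# mates_match("Tutorial_4/sra_download/SRR306435_1.fastq.gz",
#             "Tutorial_4/sra_download/SRR306435_2.fastq.gz")
```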

#### Create a samplesheet (saved in Tutorial_4) to provide all sample information

**Format:**    
sample,fastq_1,fastq_2    
sample1,sample1_R1.fastq,sample1_R2.fastq    
control1,control1_R1.fastq,control1  

In [None]:
# Pandas DataFrame by lists of dicts.
import pandas as pd
 
# Initialize data to lists.
samples = [{'sample': 'SRR033942', 'fastq_1': 'Tutorial_4/sra_download/SRR033942_1.fastq.gz', 'fastq_2': 'Tutorial_4/sra_download/SRR033942_2.fastq.gz'},
        {'sample': 'SRR306435', 'fastq_1': 'Tutorial_4/sra_download/SRR306435_1.fastq.gz', 'fastq_2': 'Tutorial_4/sra_download/SRR306435_2.fastq.gz'}
       ]
 
# Creates DataFrame.
df2 = pd.DataFrame(samples)
 
# Print the data
df2

Export dataframe to CSV file.

In [None]:
df2.to_csv('Tutorial_4/samplesheet.csv', index=False)
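
Before launching a long run, it is cheap to verify the samplesheet. The sketch below checks for the `sample`, `fastq_1`, and `fastq_2` columns used above and, optionally, that local files exist. `check_samplesheet` is a hypothetical helper, not the pipeline's own validation, and the `.fastq.gz`/`.fq.gz` suffix check is an assumption about what the pipeline accepts.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")  # assumed accepted suffixes


def check_samplesheet(path, check_files=True):
    """Return a list of problems found in a samplesheet (empty if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            return ["missing column(s): " + ", ".join(missing)]
        # Data rows start on line 2, after the header.
        for line_no, row in enumerate(reader, start=2):
            if not row["sample"]:
                problems.append(f"line {line_no}: empty sample name")
            if not row["fastq_1"]:
                problems.append(f"line {line_no}: fastq_1 is required")
            for col in ("fastq_1", "fastq_2"):
                value = row[col]
                if not value:
                    continue  # an empty fastq_2 is treated as single-end
                if not value.endswith(FASTQ_SUFFIXES):
                    problems.append(f"line {line_no}: {col} has an unexpected suffix")
                elif (check_files and not value.startswith("s3://")
                      and not Path(value).is_file()):
                    problems.append(f"line {line_no}: {col} not found: {value}")
    return problems


# Example call: check_samplesheet("Tutorial_4/samplesheet.csv")
```

Paths starting with `s3://` are skipped by the existence check, since they cannot be tested with the local file system.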

#### Run methylseq using AWS Batch

If not defined in the config file, you can always use command line parameters:
- `-r` pipeline version
- `-profile` profile to use ('aws' was defined in docs/aws-batch.config)
- `-c` config file
- `--fasta` the reference sequences, usually the reference genome. Here we use human assembly GRCh38 as the reference genome. 
- `--clip_r1` instructs Trim Galore to remove a certain number of bp from the 5' end of read 1
- `--tracedir` defines a local directory to save the pipeline information

Some pipeline information is saved to the default `nextflow_output` directory, so make sure the directory is empty before running the pipeline.

In [None]:
! rm -rf Tutorial_4/methyseq_sperm

! nextflow run nf-core/methylseq \
    -profile aws \
    -r 2.6.0 \
    -c ./docs/aws-batch.config  \
    --input 'Tutorial_4/samplesheet.csv' \
    --fasta s3://nigms-sandbox/references/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/genome.fa \
    --clip_r1 2 \
    --tracedir 'Tutorial_4/methyseq_sperm/pipeline_info' \
    -resume
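
When the same run is repeated with different buckets or samplesheets, assembling the command from Python values avoids hand-editing a long shell line. This is a minimal sketch: `build_methylseq_command` is a hypothetical helper that only builds and prints the argument list, using the same options as the cell above. It does not launch anything.

```python
import shlex


def build_methylseq_command(revision, profile, config, params, resume=True):
    """Assemble the `nextflow run` argument list without executing it."""
    cmd = ["nextflow", "run", "nf-core/methylseq",
           "-profile", profile, "-r", revision, "-c", config]
    for name, value in params.items():
        # Pipeline parameters take a double dash; Nextflow options a single dash.
        cmd += [f"--{name}", str(value)]
    if resume:
        cmd.append("-resume")
    return cmd


cmd = build_methylseq_command(
    revision="2.6.0",
    profile="aws",
    config="./docs/aws-batch.config",
    params={
        "input": "Tutorial_4/samplesheet.csv",
        "fasta": "s3://nigms-sandbox/references/Homo_sapiens/NCBI/GRCh38/"
                 "Sequence/BismarkIndex/genome.fa",
        "clip_r1": 2,
        "tracedir": "Tutorial_4/methyseq_sperm/pipeline_info",
    },
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or newer
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` to start the pipeline from Python.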

In [None]:
# Remove the working file directory
! aws s3 rm --recursive s3://nextflow-bucket-test/nextflow_env

#### Check to see if files are in your output directory bucket

The output files should be saved in your bucket's `nextflow_output` directory. You can list the results directory to see the file structures. You can also copy the files to your local directory to view them. For example, the MultiQC report file is located at `s3://nextflow-bucket-test/nextflow_output/multiqc/bismark/multiqc_report.html`. Let's copy and view it using the commands below:

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
<b>Note:</b> Please <b>replace</b> the "nextflow-bucket-test" in commands below with your own bucket name.
</div>

In [None]:
# List the output files/directories in the results folder
! aws s3 ls s3://nextflow-bucket-test/nextflow_output/

# Copy the multiQC output multiqc_report.html to local notebook:
! aws s3 cp s3://nextflow-bucket-test/nextflow_output/multiqc/bismark/multiqc_report.html .

# View the MultiQC output HTML file:
from IPython.display import IFrame
IFrame(src='multiqc_report.html', width=900, height=600)

Two files with information about the pipeline run (`execution_timeline.html` and `execution_report.html`, each with a timestamp in the file name) are saved locally in the notebook under `null/pipeline_info`. They give the running time and resource usage of each process, which can point to potential optimizations.

In [None]:
# Use the file names listed here in the cells below
! ls null/pipeline_info

In [None]:
from IPython.display import IFrame
IFrame(src='null/pipeline_info/execution_timeline_2025-02-04_19-39-39.html', width=800, height=600)

In [None]:
from IPython.display import IFrame
IFrame(src='null/pipeline_info/execution_report_2025-02-04_19-39-39.html', width=800, height=600)

### Configuration of a Full-scale Dataset - Troubleshooting

For a full-scale WGBS study, the sequencing data size can range from several hundred GBs to several TBs. For example, the two datasets used in this tutorial, GSE30340 and GSE19418, each contain many runs whose sizes add up to several hundred GBs. With data files this large, storage and memory can become an issue when running the pipeline as instructed in this tutorial.

#### Download the data

There are several options that we can use to download the data:
1. Download the data in a notebook. Make sure the disk size assigned to the notebook is large enough for the data you want to download. Also, `prefetch` from the SRA Toolkit has a default maximum download size of 20 GB; to fetch larger runs, raise that limit with its `--max-size` option.  
2. Cloud Data Delivery Service. SRA has created a cloud data delivery service to deliver the source files and other file types from NCBI cold storage buckets to individual data consumers' buckets in AWS and GCP. This service is provided for both public and authorized access (dbGaP) data. [More detailed information here](https://www.ncbi.nlm.nih.gov/sra/docs/data-delivery/).
3. Upload to the S3 bucket directly. You can upload the data to the Amazon S3 bucket directly from your local computer, HPC, or service server using the `aws s3` tool.

#### Troubleshooting the nf-core pipeline

If the nf-core pipeline does not complete successfully, you can refer to the [troubleshooting](https://nf-co.re/usage/troubleshooting) page that nf-core provided for more information. For our tutorial here, the most likely reasons that the pipeline fails are:
- the IAM role or its permissions are not set up correctly
- file paths are not correct
- memory or storage issues for large datasets

If you have a command exit status of 104, 134, 137, 139, 143, 247, the most probable cause is an "out of memory" issue. To solve the memory issue, you need to increase the memory limit in the configuration file for the process that fails. For example:   
```
profiles {
    aws {
        process {
            withName: qualimap {
                instanceType = 'ml.m5.2xlarge'  // Or another suitable instance type
                cpus = 16
                memory = 64.GB
            }
        }
    }
}
```
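
When scanning failed tasks, the exit codes listed above can be checked programmatically, and a larger memory request can be computed for the retry. The sketch below is illustrative only: `is_probable_oom` and `scaled_memory_gb` are hypothetical helpers, and scaling memory linearly with the attempt number up to a ceiling is one common strategy, not necessarily what the pipeline does.

```python
# Exit statuses that this tutorial associates with out-of-memory failures.
OOM_EXIT_CODES = frozenset({104, 134, 137, 139, 143, 247})


def is_probable_oom(exit_status: int) -> bool:
    """Return True when the exit status most likely means 'out of memory'."""
    return exit_status in OOM_EXIT_CODES


def scaled_memory_gb(base_gb: int, attempt: int, max_gb: int) -> int:
    """Memory to request on a given attempt: base * attempt, capped at max_gb."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(base_gb * attempt, max_gb)


for status in (0, 1, 137):
    print(status, is_probable_oom(status))

# Hypothetical example: a 32 GB base request with a 64 GB ceiling.
print([scaled_memory_gb(32, attempt, 64) for attempt in (1, 2, 3)])
```

The ceiling should not exceed what the instances in your compute environment can actually provide.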

In AWS, the memory is also limited by the [notebook instance type](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html) you select to run the process. For example, if you choose `ml.m5.2xlarge`, the memory is limited to 32 GiB. You can change the instance type to increase the memory.
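
To pick an instance type for a given request, a small lookup table is often enough. The sketch below uses the vCPU and memory figures of the `ml.m5` sizes and returns the smallest one that fits. `smallest_fitting_instance` is a hypothetical helper, the table covers only this one family, and in practice slightly less than the nominal memory is available to a job.

```python
# (vCPUs, memory in GiB) for the ml.m5 family, smallest first.
M5_SIZES = [
    ("ml.m5.xlarge", 4, 16),
    ("ml.m5.2xlarge", 8, 32),
    ("ml.m5.4xlarge", 16, 64),
    ("ml.m5.12xlarge", 48, 192),
    ("ml.m5.24xlarge", 96, 384),
]


def smallest_fitting_instance(cpus, memory_gib):
    """Return the smallest listed instance type that satisfies the request."""
    for name, vcpus, mem in M5_SIZES:
        if vcpus >= cpus and mem >= memory_gib:
            return name
    return None  # nothing in this table is large enough


# A request of 16 CPUs and 64 GiB needs more than ml.m5.2xlarge offers.
print(smallest_fitting_instance(cpus=16, memory_gib=64))
print(smallest_fitting_instance(cpus=8, memory_gib=32))
```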

#### Optimize nf-core/methylseq configuration

The nf-core/methylseq workflow contains multiple processes, and the requirements of computational and memory resources for each process vary a lot. For better performance or billing purposes, you can change the configuration for each process. You can check the default settings for each process at the pipeline's [base.config](https://github.com/nf-core/methylseq/blob/master/conf/base.config) file.  
As an example from a 12-sample WGBS dataset, [docs/optimization_example.config](docs/optimization_example.config) is the config file that was used to process the 24 FASTQ files (paired-end, averaging 325M reads per FASTQ file) using Google Batch in less than 30 hours.

## Conclusion

This Jupyter Notebook demonstrated how to use AWS Batch with Nextflow for efficient WGBS data analysis using the nf-core/methylseq pipeline. We addressed the limitations of processing large datasets on single virtual machines by outlining the steps for setting up a Nextflow IAM role, configuring an Amazon S3 bucket, and creating a Nextflow config file tailored for AWS Batch execution. The tutorial progressed from a test run with a small dataset to a more realistic example using real-world SRA data, covering data acquisition, samplesheet creation, and pipeline execution. Finally, we provided guidance on troubleshooting issues common to large-scale datasets, including memory management, and offered strategies for optimizing the pipeline configuration for performance and cost-effectiveness. After completing this notebook, you should be able to analyze sizable WGBS datasets using the scalability and resource management capabilities of AWS.

## Clean Up
**Remember to move to the next notebook or shut down your instance if you are finished.**