# The WGBS data analysis tutorial 3 - nf-core/methylseq

In this tutorial, we'll introduce the basics of Nextflow, the nf-core/methylseq pipeline and how to use them to run our example dataset in the Vertex AI notebook local environment:
- [Nextflow introduction](#Nextflow-introduction) - Introduction to Nextflow, write a simple Nextflow script
- [nf-core](#nf-core) - Introduction to nf-core
- [nf-core/methylseq](#nf-core/methylseq) - Install and run the pipeline
- [Understanding nf-core/methylseq Output](#Understanding-nf-core/methylseq-Output) - compare to the Bismark workflow

<img src="images/notebook3.png" width="700" />

__[nf-core/methylseq](https://nf-co.re/methylseq/1.6.1)__ is a bioinformatics analysis pipeline for methylation (bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results.

The pipeline is built using __[Nextflow](https://www.nextflow.io/)__, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers making installation trivial and results highly reproducible.

<img src="images/3_nextflow_logo.png" width="300" />

## Nextflow introduction
Please visit https://www.nextflow.io/docs/latest/index.html for more information about Nextflow. 

Nextflow is a reactive workflow framework and a programming DSL (Nextflow defaults to DSL 2 if no version is specified explicitly) that eases the writing of data-intensive computational pipelines. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations. Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment based on the dataflow programming model. 

#### Basic concept

In practice a Nextflow pipeline script is made by joining together different **[processes](https://www.nextflow.io/docs/latest/process.html)**. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.). Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous FIFO queues, called __[channels](https://www.nextflow.io/docs/latest/channel.html)__ in Nextflow. Any process can define one or more channels as input and output.
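
The dataflow idea described above can be illustrated with a small Python analogy. This is only a sketch of the concept, not how Nextflow is implemented: the `to_upper` "process" and the `DONE` sentinel are hypothetical names, and `queue.Queue` stands in for a channel.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel

def to_upper(inbox, outbox):
    """A 'process': read items from one channel, write results to another."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            break
        outbox.put(item.upper())

def run_pipeline(items):
    """Feed items through the process and collect what comes out."""
    ch_in, ch_out = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=to_upper, args=(ch_in, ch_out))
    worker.start()
    for item in items:
        ch_in.put(item)
    ch_in.put(DONE)
    results = []
    while True:
        item = ch_out.get()
        if item is DONE:
            break
        results.append(item)
    worker.join()
    return results

print(run_pipeline(["hello", "world"]))
```

The worker shares no state with the caller; everything it receives or produces travels through the two queues, which mirrors how processes only communicate via channels.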

A working Nextflow workflow typically relies on several files:

- __Main file__: A `.nf` file that holds the processes and channels, describing the inputs, the outputs, the shell script containing your commands, and the workflow itself.
- __Config file__: Default pipeline configuration properties are defined in a file named `nextflow.config` in the pipeline execution directory. The config file contains parameters and can define multiple profiles. Each profile can set a different __executor__ (e.g. local or the Google Cloud Life Sciences (LS) API), software environment (e.g. Conda, Docker), memory or machine type, output directory, working directory and more.
- __Docker file (optional)__: Contains the dependencies and environment needed for the Nextflow workflow to run.
- __Schema file (optional)__: A structured `.json` file that describes the usage and parameters of your workflow. You might have seen this information when running a command with the `--help` flag.

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  When using nf-core/methylseq or other nf-core pipelines, users do not need to create any of these files except the config file. Besides Docker, users can choose other software dependency management tools, such as Singularity or Conda.
</div>


#### Install Nextflow

The command below will install Nextflow using `mamba` in this notebook instance. If you haven't installed mamba, you can install it with conda using `! conda install mamba -n base -c conda-forge -y`, as described in [Tutorial 1](tutorial_1-bismark.ipynb#Mamba).

Now, let's install `Nextflow`:

In [None]:
! mamba install -c bioconda "nextflow=22.04" -y

#### Your first Nextflow script: 'Hello World'

- Create a `hello.nf` file in the terminal
- Be sure to include _#!/usr/bin/env nextflow_ and _nextflow.enable.dsl=2_ at the top of your script
- Add a process named `sayHello`. This process takes its input as a string (or uses the pre-defined string 'Hello World' if no input is provided), writes the string to a file, and then prints the content of that file. The file is created in the task's sub-directory under the work directory
- At the end write the order of your workflow
    - For our example we are running the sayHello process and the final output is printed by the `view` operator

It should look something like this:

```bash
#!/usr/bin/env nextflow
nextflow.enable.dsl=2 

params.str = 'Hello World'

process sayHello {
  input:
  val str

  output:
  stdout

  """
  echo $str > hello.txt
  cat hello.txt
  """
}

workflow {
  sayHello(params.str) | view
}
```

Execute the script by entering the following command. It will output how the process is executed and the final output.

In [None]:
! nextflow run docs/hello.nf -work-dir "Tutorial_3/work"

No output directory is created, but an intermediate output file `hello.txt` should be generated in the `Tutorial_3/work` directory. If the work directory is not set with `-work-dir` or `-w`, Nextflow automatically creates a directory called `work` for its intermediate outputs. The task's sub-directory name starts with the hash shown in square brackets in front of the process name in the run output, for example `Tutorial_3/work/18/e5f57536aff7b3193cc3b6d8175266/hello.txt`.
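
Because the task sub-directory name is a hash, it can be easier to search for the file than to type the path. A minimal sketch, assuming the two-level `work/<xx>/<hash>/` layout shown in the example path; `find_task_files` is a hypothetical helper name:

```python
from pathlib import Path

def find_task_files(work_dir, filename):
    """Return every copy of `filename` found in task sub-directories of `work_dir`."""
    root = Path(work_dir)
    if not root.is_dir():
        return []  # nothing has been run yet
    # Task directories sit two levels below the work directory.
    return sorted(root.glob(f"*/*/{filename}"))

for path in find_task_files("Tutorial_3/work", "hello.txt"):
    print(path, "->", path.read_text().strip())
```

Each run of the process creates a new task directory, so after several runs the loop may print more than one `hello.txt`.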

#### Pipeline parameters
Pipeline parameters are declared by prefixing a variable name with `params`, separated by a dot (`.`) character. Their values can be specified on the command line by prefixing the parameter name with a double dash, i.e. `--paramName`.

For example, we can run the previous script specifying a different input string parameter (`--str`), as shown below: 

In [None]:
! nextflow run docs/hello.nf --str "Good morning!" -work-dir "Tutorial_3/work"
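
The difference between double-dash pipeline parameters and single-dash Nextflow options can be sketched in Python. This is a simplified model written for illustration, not Nextflow's actual argument parser; `apply_cli_params` is a hypothetical name.

```python
def apply_cli_params(defaults, argv):
    """Overlay `--name value` pairs from argv onto the default params."""
    params = dict(defaults)
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("--"):
            name = arg[2:]
            has_value = i + 1 < len(argv) and not argv[i + 1].startswith("-")
            # A bare flag such as --help is treated as a boolean switch.
            params[name] = argv[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # single-dash options belong to Nextflow itself; ignored here
    return params

defaults = {"str": "Hello World"}  # mirrors `params.str` in hello.nf
argv = ["--str", "Good morning!", "-work-dir", "Tutorial_3/work"]
print(apply_cli_params(defaults, argv))
```

Only `--str` changes a pipeline parameter; `-work-dir` is left for Nextflow and never reaches `params`.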

#### Configuration

The configuration file can be used to define which executor to use, the process’s environment variables, pipeline parameters etc. When a pipeline script is launched, Nextflow looks for configuration files in multiple locations. Since each configuration file can contain conflicting settings, the sources are ranked to decide which settings are applied. All possible configuration sources are reported below, listed in order of **priority** (highest first):

- Parameters specified on the command line (`--something value`)
- Parameters provided using the `-params-file` option
- Config file specified using the `-c my_config` option
- The config file named `nextflow.config` in the current directory
- The config file named `nextflow.config` in the workflow project directory
- The config file `$HOME/.nextflow/config`
- Values defined within the pipeline script itself (e.g. `main.nf`)
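
The ranking above behaves like a layered lookup in which the first source that defines a setting wins. The sketch below models that with Python's `collections.ChainMap`. All values are made up for illustration, and real Nextflow configuration also merges nested scopes, which this flat model ignores.

```python
from collections import ChainMap

def resolve_settings(*sources):
    """Merge setting sources given in priority order (highest first)."""
    return dict(ChainMap(*sources))

# Hypothetical settings from four of the sources, highest priority first.
command_line = {"outdir": "Tutorial_3/methylseq_results"}
params_file = {"outdir": "from_params_file", "max_cpus": 4}
custom_config = {"max_cpus": 8, "max_memory": "12.GB"}
script_defaults = {"outdir": "./results", "max_cpus": 16, "str": "Hello World"}

resolved = resolve_settings(command_line, params_file, custom_config, script_defaults)
for key in sorted(resolved):
    print(key, "=", resolved[key])
```

Here `outdir` comes from the command line, `max_cpus` from the params file, and `str` falls through to the script default because no higher-priority source sets it.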

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b> Modify and resume</div>
<p>Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that actually changed will be re-executed. Processes that did not change are skipped and their cached results are used instead. After saving your changes to the <code>.nf</code> or <code>.config</code> file under the same name, run it again with the <code>-resume</code> option added to the command line. Remember that <code>-resume</code> only works if the previous runs have not been cleaned up, especially their work directories.</p>

This helps a lot when testing or modifying part of your pipeline without having to re-execute it from scratch.
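
The caching behaviour can be pictured as a lookup keyed on a fingerprint of each task. The following is a deliberately simplified model: it hashes only the script text and input values, whereas Nextflow's real task hash covers more than that. `task_hash` and `run_task` are hypothetical names.

```python
import hashlib

def task_hash(script, inputs):
    """Fingerprint a task from its script text and input values."""
    digest = hashlib.sha256()
    digest.update(script.encode())
    for value in inputs:
        digest.update(b"\0" + str(value).encode())
    return digest.hexdigest()

def run_task(script, inputs, cache, executed):
    """Run a task unless an identical one is already cached."""
    key = task_hash(script, inputs)
    if key not in cache:
        executed.append(script)  # record that real work happened
        cache[key] = f"output of {script!r} on {inputs}"
    return cache[key]

cache, executed = {}, []
run_task("echo $str", ["Hello World"], cache, executed)        # first run: executed
run_task("echo $str", ["Hello World"], cache, executed)        # unchanged: cache hit
run_task("echo $str | rev", ["Hello World"], cache, executed)  # modified: executed again
print(len(executed), "tasks actually ran")
```

Deleting the `cache` dictionary in this model corresponds to removing the work directory: every task would have to run again.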


## <a name="nf-core" /><img src = "images/3_nf-core-logo.png" width= "300" />

__[nf-core](https://nf-co.re/)__ is a community effort to collect a curated set of analysis pipelines built using Nextflow. As of September 2022, 69 pipelines are available as part of nf-core. We can use the following command to check the list of available pipelines:

In [None]:
! docker run nfcore/tools list

## nf-core/methylseq

#### Introduction

"**nf-core/methylseq** is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results. The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers making installation trivial and results highly reproducible." - [https://github.com/nf-core/methylseq](https://github.com/nf-core/methylseq)

The basic steps from this pipeline are similar to the Bismark pipeline introduced in tutorial 1, but with more quality control steps:  
> <img src = "images/3_methylseq_steps.png" width = "801" />

#### Quick Start

To run nf-core/methylseq, you need to:
- [X] Install Nextflow
- [X] Install any of **Docker**, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Docker comes with the instance, no need to install)
- [ ] Download the pipeline and test it on a minimal dataset with a single command
- [ ] Start running your own analysis!

A typical methylseq command looks like this: `nextflow run nf-core/methylseq --input '*_R{1,2}.fastq.gz' -profile docker`. There are many parameters that you can add or change. To see a detailed list of these options, use the `--help` flag (the output directory created here only stores some pipeline info and can be deleted afterwards):

In [None]:
! nextflow run nf-core/methylseq -r 1.6.1 --help --outdir "Tutorial_3/help_dir" -work-dir "Tutorial_3/work"
! rm -rf Tutorial_3/help_dir

#### Download the nf-core/methylseq pipeline and test it on a minimal dataset with a single command

This step downloads the pipeline and its dependencies to the local compute instance. The **test** profile then runs a small dataset so you can confirm the workflow works with your config file without a long run time.

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
<b>Note: output directory should NOT exist yet.</b> If not defined, the default output directory is <b>'results'</b>. The output directory must not exist before running the pipeline, or you will get an error message. Remove the output directory if this is not the first time you are running it.
</div>

In [None]:
# remove directory 'Tutorial_3/test_results' if it exists
! rm -rf Tutorial_3/test_results

# download and test nf-core/methylseq
! nextflow run nf-core/methylseq -r 1.6.1 -profile test,docker \
    --outdir "Tutorial_3/test_results" -work-dir "Tutorial_3/work"

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
<b>Note: failed process preseq</b>. You may notice that the process <code>preseq</code> fails in this test run. Preseq is used to predict and estimate the complexity of a genomic sequencing library. The preseq process often fails, but it can be ignored since it does not affect the final output files. 
</div>

#### Run nf-core/methylseq pipeline with your own data

In this step, we will run the example dataset we downloaded in tutorial 1 using methylseq. If you haven't downloaded the dataset and its reference, please read the instructions ["Importing the example dataset"](tutorial_1-bismark.ipynb#Importing-the-example-dataset).

The parameters that need to be provided are:
- `--fasta` - the reference sequences, usually the reference genome. In our example, we use the sequences from mouse chromosome 6 to be the reference sequence. You can use `--genome` instead to provide the iGenomes reference, such as "GRCh38" for human. For all the ready-to-use genomes, please visit https://support.illumina.com/sequencing/sequencing_software/igenome.html.
- `--input` - to specify the location of your input fastq files. Please note the following requirement:
    1. The path must be enclosed in quotes
    2. The path must have at least one * wildcard character
    3. When using the pipeline with paired-end data, the path must use {1,2} notation to specify read pairs. Note that in Jupyter notebooks you need to replace the {} with [] for the system to interpret the name correctly
- `--outdir` - where to save the pipeline output files. The default is `./results`. In our example, we change the output directory to `Tutorial_3/methylseq_results`. **Note:** this directory must not exist before the pipeline runs, or you will get an error saying it already exists. If this is not the first time you run this step, remove the directory first using `rm -rf`.
- `-w` or `-work-dir` - the working directory, where intermediate files are saved
- `--max_cpus` and `--max_memory` - these parameters act as a cap, to prevent Nextflow from requesting more than your system can provide. With the default compute node settings we selected when creating the notebook, the maximum number of CPUs we can use is 4 and the maximum memory is 15 GB; the command below sets `--max_memory 12.GB`, which stays under that limit
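
To see which files a bracket pattern such as `*_R[1,2].fastq.gz` picks up, you can try it against a list of names. The sketch below uses Python's `fnmatch` module as a stand-in; it is not Nextflow's own matcher, and the file names and the `pair_reads` helper are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def pair_reads(filenames, pattern="*_R[1,2].fastq.gz"):
    """Group file names matching `pattern` by the sample prefix before `_R1`/`_R2`."""
    pairs = defaultdict(list)
    for name in sorted(filenames):
        if fnmatchcase(name, pattern):
            sample = name.rsplit("_R", 1)[0]
            pairs[sample].append(name)
    return dict(pairs)

# Made-up file names, for illustration only.
files = [
    "sampleA_R1.fastq.gz",
    "sampleA_R2.fastq.gz",
    "sampleB_R1.fastq.gz",
    "notes.txt",
]
for sample, reads in pair_reads(files).items():
    status = "paired" if len(reads) == 2 else "missing a mate"
    print(sample, reads, status)
```

A check like this before launching the pipeline can reveal a sample whose second read file is missing or misnamed.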

Now, run nf-core/methylseq pipeline with these settings (it will take ~30 minutes):

In [None]:
! rm -rf Tutorial_3/methylseq_results

! nextflow run nf-core/methylseq -r 1.6.1 -profile docker -work-dir 'Tutorial_3/work' \
    --fasta 'Tutorial_1/ref_genome/Mus_musculus.GRCm39.dna.chromosome.6.fa' \
    --input 'Tutorial_1/fastq/*_R[1,2].fastq.gz' \
    --outdir 'Tutorial_3/methylseq_results' \
    --max_cpus 4 \
    --max_memory 12.GB
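
If you prefer to assemble the command in Python rather than with a `!` shell escape, a sketch like the following builds the same argument list and checks the output directory first. `build_methylseq_command` is a hypothetical helper; it only constructs the command and does not run anything.

```python
import shlex
from pathlib import Path

def build_methylseq_command(fasta, reads, outdir, work_dir,
                            max_cpus=4, max_memory="12.GB"):
    """Return the nextflow command as an argument list (nothing is executed)."""
    if Path(outdir).exists():
        raise FileExistsError(f"{outdir} already exists; remove it before running")
    return [
        "nextflow", "run", "nf-core/methylseq", "-r", "1.6.1",
        "-profile", "docker", "-work-dir", work_dir,
        "--fasta", fasta, "--input", reads, "--outdir", outdir,
        "--max_cpus", str(max_cpus), "--max_memory", max_memory,
    ]

cmd = build_methylseq_command(
    fasta="Tutorial_1/ref_genome/Mus_musculus.GRCm39.dna.chromosome.6.fa",
    reads="Tutorial_1/fastq/*_R[1,2].fastq.gz",
    outdir="Tutorial_3/methylseq_results",
    work_dir="Tutorial_3/work",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list could then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues because no shell is involved.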

## Understanding nf-core/methylseq Output

#### Two output directories

Unless otherwise specified, there will be two output directories from the nf-core/methylseq pipeline: `work` and `results`. `work` is the working directory, where the intermediate files will be saved. `results` is where the outputs from the pipeline will be saved and it can be changed using the `--outdir` in the command. In our example above, the work directory is `Tutorial_3/work` and the results directory is `Tutorial_3/methylseq_results`.
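
Since the work directory holds intermediate files, it can grow much larger than the results directory. A minimal sketch for comparing the two, using the paths from this tutorial; `dir_size_bytes` is a hypothetical helper, and symlinks are skipped so linked input files are not counted twice.

```python
from pathlib import Path

def dir_size_bytes(directory):
    """Total size of regular files under `directory` (0 if it does not exist)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )

for name in ("Tutorial_3/work", "Tutorial_3/methylseq_results"):
    print(f"{name}: {dir_size_bytes(name) / 1e6:.1f} MB")
```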

The results from each step will be in separate sub-directories in the results directory. In our example, they are saved to `Tutorial_3/methylseq_results`, and all the resulting directories and files are organized as shown in the figure below:

> <img src="images/3_methylseq_output.png" alt="nf-core/methylseq output" width="800"/>

A more detailed explanation can be found in `Tutorial_3/methylseq_results/pipeline_info/results_description.html`. You might need to hold `Ctrl` and click the links inside this HTML file to open the web pages in the browser.

In [None]:
from IPython.display import IFrame
IFrame(src='Tutorial_3/methylseq_results/pipeline_info/results_description.html', width=1000, height=400)

#### Compare the results from two pipelines

For example, you can find the summary of the Bismark step in the `bismark_summary` sub-directory of the results directory. It should be very similar to the Bismark summary we generated ([step 8](tutorial_1-bismark.ipynb#Step-8.-Generate-report-and-summary)) when running the Bismark workflow in **tutorial 1**:

In [None]:
# Bismark results from this tutorial - methylseq
from IPython.display import IFrame
IFrame(src='Tutorial_3/methylseq_results/bismark_summary/bismark_summary_report.html', width=1000, height=400)

In [None]:
# Bismark results from tutorial 1 - step 8
IFrame(src='Tutorial_1/bismark/bismark_summary_report.html', width=1000, height=400)

#### Files for downstream analysis

As mentioned in tutorial 1, the final outputs used for downstream analysis can be found in the output directory:
- `bismark_methylation_calls/bedGraph`   
Methylation statuses in bedGraph format, with 0-based genomic start and 1-based end coordinates.
- `bismark_methylation_calls/methylation_coverage`   
Coverage text file summarizing cytosine methylation values.
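
As a quick way to inspect these files, the sketch below reads a gzipped bedGraph file in Python. It assumes a four-column layout (chromosome, start, end, methylation percentage) with an optional `track` header line; `read_bedgraph` is a hypothetical helper, and the demo data is made up.

```python
import gzip
import tempfile
from pathlib import Path

def read_bedgraph(path):
    """Yield (chrom, start, end, percent_methylated) from a gzipped bedGraph file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith(("track", "#")):
                continue  # skip header and blank lines
            chrom, start, end, value = line.split()[:4]
            yield chrom, int(start), int(end), float(value)

# Demo on a tiny made-up file; real files are in bismark_methylation_calls/bedGraph.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bedGraph.gz"
    with gzip.open(demo, "wt") as out:
        out.write("track type=bedGraph\n6\t3000\t3001\t75.0\n6\t3050\t3051\t0\n")
    for record in read_bedgraph(demo):
        print(record)
```

To use it on real output, pass the path of one of the `.bedGraph.gz` files instead of the demo file.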

## Cleaning up

Transfer useful results back to Google Cloud Storage, and remove intermediate files. In this tutorial, the most important output files are the methylation profiles (`.bedGraph.gz`) of each sample. We will use these data to identify differentially methylated regions using metilene in the next tutorial.

For example, you can copy all the `.bedGraph.gz` files to the bucket you created in Cloud Storage by using:     
`! gsutil cp Tutorial_3/methylseq_results/bismark_methylation_calls/bedGraph/*.bedGraph.gz gs://BUCKET_NAME/methylseq_results`

You can also delete the whole directory with all the files generated in this notebook using:  
`! rm -rf Tutorial_3/methylseq_results`

## Terms & Quiz

In [None]:
!pip install jupytercards --quiet
from jupytercards import display_flashcards
display_flashcards('quiz_files/f3.json')

In [None]:
!pip install jupyterquiz --quiet
from jupyterquiz import display_quiz
display_quiz('quiz_files/q3.json')

<div class="alert alert-block alert-danger"><i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Don't forget:</b> after you finish running the notebook, stop the notebook in Vertex AI Workbench to avoid accumulating costs.
</div>