# The WGBS data analysis tutorial 2 - DMR identification

## Overview

This tutorial introduces DMR (differentially methylated region) detection using metilene. __[metilene](https://www.bioinf.uni-leipzig.de/Software/metilene/)__ is a software tool to annotate differentially methylated regions (DMRs) and differentially methylated CpG sites (DMCs) from Methyl-seq data. Please see the [metilene user guide](https://www.bioinf.uni-leipzig.de/Software/metilene/Manual/) for more details.  
> **Input**: methylation profiles (.bed, .bedgraph, .cov.gz)  
> **Output**: DMR list (.bedgraph)

> <img src="images/notebook2.png" width="700" />

### Differentially methylated regions (DMRs) 
"After an initial analysis of global trends in a DNA methylation data set, the typical next step is the identification of differentially methylated regions (DMRs) that exhibit consistently different DNA methylation levels between sample groups (for example, cases versus controls). These DMRs can be as small as a single C or as large as an entire gene locus, depending on the biological question of interest and on the bioinformatic methods used for their identification. Although a single methylated CpG may occasionally be linked to gene expression regulation and may affect disease risk, the vast majority of DMRs reported in the literature fall within a size range of a few hundred to a few thousand bases. This range coincides with typical sizes of gene-regulatory regions, and it is widely believed that DMRs can control cell-type-specific transcriptional repression of an associated gene" - [Bock, Christoph.(2012)](https://www.nature.com/articles/nrg3273)

> <img src="images/2_dmr.png" width="500" />

## Learning Objectives

* **Identify Differentially Methylated Regions (DMRs):**  The core objective is to understand the concept of DMRs and how to detect them in Whole-Genome Bisulfite Sequencing (WGBS) data.

* **Use the metilene software tool:** The notebook focuses on using `metilene` for DMR identification, covering installation, input file preparation, and execution of the tool with different modes (de novo and within known features).

* **Prepare input data for metilene:** This includes sorting `.bedGraph` files and using the `metilene_input.pl` script to create a suitable input file for `metilene`.

* **Interpret metilene output:** Learners will learn to understand the output format of `metilene` and use the `metilene_output.pl` script to filter and visualize the results.

* **Filter and visualize DMRs:** The notebook demonstrates how to filter DMRs based on criteria like q-value and the number of CpGs, and then visualize the results using basic statistics plots and IGV.

* **Understand different DMR detection approaches:** The tutorial contrasts de novo DMR detection with DMR detection within known genomic features (e.g., CpG islands).

* **Utilize IGV for visualization:**  The notebook shows how to load and visualize the DMR results, along with the original methylation data, in IGV for a more detailed examination.

* **Interpret global methylation patterns:** The notebook uses an example dataset to illustrate how to interpret global methylation differences between sample groups and how this relates to DMR identification.

## Prerequisites

**1. Software:**

* **mamba:** Used for installing `metilene` and its dependencies; the install command in this tutorial also installs `samtools`.
* **metilene:** The core software for DMR identification.
* **bedtools:** The `sortBed` command from bedtools is used to sort input files.
* **Perl:** The `metilene_input.pl` script is written in Perl.
* **IGV (Integrative Genomics Viewer):** Used for visualization of the results (though not strictly required for the analysis itself).

**2. Data:**

* **Methylation profiles:** The notebook expects input methylation profiles in `.bedGraph.gz` format, generated with Bismark in the previous tutorial.  These files contain methylation data for different samples.
* **Optional:  Annotation file:** A BED file containing known genomic features (e.g., CpG islands) is used in one analysis mode (`cpgIslandExt_mm39_chr6.bed` in the example).
* **Reference genome:** The notebook specifies `mm39` (mouse genome build 39) as the reference genome. IGV requires this information for visualization.

## Get Started
### Introduction to metilene

metilene is a software tool to annotate differentially methylated regions (DMRs) and differentially methylated CpG sites (DMCs) from Methyl-seq data. metilene accounts for intra-group variances and offers different modes: de-novo DMR detection, DMR detection within a known set of genomic features, and DMC detection. Various kinds of biological data can be used: metilene works with Whole-Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), and any other input data, as long as absolute (methylation) levels and genomic coordinates are provided. metilene uses circular binary segmentation and a 2D-KS test to call DMRs. By default, adjusted p-values are calculated using the Bonferroni correction.
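
To make the multiple-testing adjustment more concrete, here is a minimal sketch of the Bonferroni correction, alongside the Benjamini-Hochberg (FDR) procedure for comparison. It illustrates the general statistical procedures only. The function names are hypothetical, and how metilene counts its tests and implements the adjustment internally is not shown here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, for illustration only.
example = [0.001, 0.01, 0.02, 0.04]
print("Bonferroni:", bonferroni(example))
print("Benjamini-Hochberg:", benjamini_hochberg(example))
```

Bonferroni is the more conservative of the two, so it generally keeps fewer regions at the same threshold.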

### Install metilene

If you've followed the previous Bismark tutorial (tutorial_1-bismark.ipynb), metilene should already be installed in the instance. If not, please use the following command to install metilene: `! mamba install -y -c conda-forge -c bioconda samtools metilene`

### Generate input files for metilene

As a quick start, a typical de-novo annotation of DMRs uses the command `metilene -a g1 -b g2 methylation-file`, where the input **methylation-file**, containing all methylation data, is a **SORTED** tab-separated file with the following format and header:  
| chr | pos | g1_xxx | g1_xxx | \[...\] | g2_xxx | g2_xxx | \[...\] |
| --- | --- | --- | --- | --- | --- | --- | --- |

where the first column refers to the chromosome, the second column to the genomic position of the CpG and all following columns to the absolute methylation ratio. All ratio columns are dedicated to the group described by the prefix in their header, e.g., g1 or g2. Options `-a` and `-b` indicate the groups that are considered. 
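
The prefix convention described above can be illustrated with a short sketch that reads a header line and collects the column indices belonging to each group. The function name and the sample header are hypothetical. This only mirrors the convention as described here and is not metilene's own parsing code.

```python
def group_columns(header_line, group_a, group_b):
    """Return the ratio-column indices whose header starts with each prefix."""
    names = header_line.rstrip("\n").split("\t")
    cols_a, cols_b = [], []
    # Columns 0 and 1 hold chromosome and position, so start at column 2.
    for index, name in enumerate(names[2:], start=2):
        if name.startswith(group_a):
            cols_a.append(index)
        elif name.startswith(group_b):
            cols_b.append(index)
    return cols_a, cols_b


# Hypothetical header with two samples per group.
header = "chr\tpos\tg1_s1\tg1_s2\tg2_s1\tg2_s2"
print(group_columns(header, "g1", "g2"))
```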

To separate the results from the last tutorial, create a new directory called `Tutorial_2` first, and all the results generated from this tutorial will be saved in this directory:

In [None]:
# Create a directory called Tutorial_2
! mkdir -p Tutorial_2

# Copy the methylation data from tutorial 1 to the directory Tutorial_2
! cp Tutorial_1/bismark/*bedGraph.gz Tutorial_2

#### Sort .bedGraph files generated by the Bismark pipeline

Since metilene requires a sorted input file, we need to sort our methylation data using the `sortBed` command from bedtools:

In [None]:
# Sort all the methylation files
! for file_name in Tutorial_2/*bedGraph.gz; do \
    sortBed -i ${file_name} > ${file_name//_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz/.bedgraph}; \
done;
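
For readers who want to see what the sorting step amounts to, the sketch below orders bedGraph records by chromosome name and then by numeric start position. It is an illustration only and assumes plain string ordering of chromosome names. Skipping `track`, `browser`, and `#` lines is a choice made for this sketch, and the function name is hypothetical.

```python
def sort_bed_records(lines):
    """Sort BED/bedGraph lines by chromosome (as text), then by start."""
    records = []
    for line in lines:
        # Skip empty lines and header-style lines (a choice of this sketch).
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        records.append((fields[0], int(fields[1]), fields))
    records.sort(key=lambda record: (record[0], record[1]))
    return ["\t".join(record[2]) for record in records]


# Made-up records, for illustration only.
unsorted_lines = [
    "track type=bedGraph\n",
    "chr6\t300\t301\t50\n",
    "chr10\t5\t6\t10\n",
    "chr6\t20\t21\t100\n",
]
for row in sort_bed_records(unsorted_lines):
    print(row)
```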

#### Prepare input files for metilene
metilene offers an easy way to generate an appropriate input file (`methylation-file`) containing all methylation rates, using a Perl script: `metilene_input.pl`. This step will generate a sorted tab-separated input file from multiple bed-files. 

Below is the parameter list of `metilene_input.pl`:

> <img src = "images/metilene_input_param.png" width="700" />

In [None]:
! metilene_input.pl \
    --in1 Tutorial_2/SRX202087.bedgraph,Tutorial_2/SRX271141.bedgraph \
    --in2 Tutorial_2/SRX202088.bedgraph,Tutorial_2/SRX271142.bedgraph \
    --h1 serum \
    --h2 2i \
    -o Tutorial_2/metilene.input

View the first 30 lines of the metilene input file (`metilene.input`) just generated:

In [None]:
! head -30 Tutorial_2/metilene.input 
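
Conceptually, building the input file means joining the per-sample values on chromosome and position, so that each row holds one CpG and each column one sample. The sketch below shows that idea with plain dictionaries. It is not the logic of `metilene_input.pl`. The function name, the sample names, and the `-` placeholder for positions missing in a sample are all assumptions made for illustration.

```python
def merge_samples(samples, missing="-"):
    """Join per-sample {(chrom, pos): value} dicts into one table of rows."""
    names = list(samples)
    # Union of all positions seen in any sample, sorted by chromosome and position.
    positions = sorted(set().union(*(s.keys() for s in samples.values())))
    rows = [["chr", "pos"] + names]
    for chrom, pos in positions:
        values = [samples[name].get((chrom, pos), missing) for name in names]
        rows.append([chrom, str(pos)] + [str(value) for value in values])
    return rows


# Made-up values and hypothetical column names, for illustration only.
table = merge_samples({
    "serum_1": {("chr6", 100): 0.8, ("chr6", 200): 0.5},
    "2i_1": {("chr6", 100): 0.2},
})
for row in table:
    print("\t".join(row))
```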

### Run metilene

The goal for metilene is to annotate differentially methylated regions. There are three options (mode, `-f`) to define the regions: 
- **DMR de-novo annotation**. The default mode of metilene annotates DMRs de-novo without using any prior information on genomic features. 
- **DMR annotation in known features** - genomic features, such as promoter regions or CpG islands, need to be provided.
- **DMC annotation** (not used in this tutorial) - metilene offers the possibility to test each CpG for differential methylation. Statistical tests (KS-test and Mann-Whitney-U test) are calculated for each CpG site, and corresponding p-values are reported in the output.

Below is the complete parameter list when running `metilene`:
> <img src="images/metilene_param.png" width="700" />

The output for the de-novo DMR annotation mode consists of a bed-like format:
| chr | start | stop | q-value | mean methylation difference | #CpGs | p (MWU) | p (2D KS) | mean g1 | mean g2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |   

While "mean g1" and "mean g2" refer to the absolute mean methylation level for the corresponding segment in both groups, the difference is given in the 5th column. Single CpGs are not tested using the 2D KS-test. Here, q-values are (per default) Bonferroni adjusted based on MWU-test p-values (since version 0.2-8, Benjamini-Hochberg (FDR) can be chosen instead of Bonferroni adjustment). All outputs are unsorted when using multiple threads. 
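
The ten output columns can be read into named fields for downstream work. The following is a minimal sketch that assumes tab-separated lines in the column order shown in the table above. The field names and the example line are made up for illustration.

```python
from typing import NamedTuple


class DMR(NamedTuple):
    chrom: str
    start: int
    stop: int
    q_value: float
    mean_diff: float
    n_cpgs: int
    p_mwu: float
    p_ks2d: float
    mean_g1: float
    mean_g2: float


def parse_dmr(line):
    """Convert one tab-separated de-novo output line into a DMR record."""
    f = line.rstrip("\n").split("\t")
    return DMR(
        f[0], int(f[1]), int(f[2]), float(f[3]), float(f[4]),
        int(f[5]), float(f[6]), float(f[7]), float(f[8]), float(f[9]),
    )


# Made-up line, for illustration only.
example = "chr6\t4720100\t4720600\t0.0012\t0.55\t14\t0.00001\t0.00002\t0.85\t0.30"
record = parse_dmr(example)
print(record.chrom, record.start, record.stop, record.n_cpgs, record.q_value)
```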

#### **de-novo annotation** (option 1)

"The default mode of metilene annotates DMRs de-novo without using any prior information on genomic features, e.g., promoter regions. Here a fast circular binary segmentation approach on the mean difference signal of both groups is used (Siegmund, 1986; Olshen et al., 2004). After additional filter steps are passed, potential DMRs are tested using a two-dimensional Kolmogorov-Smirnov-Test (KS-test)(Fasano and Franceschini, 1987). DMRs are finally tested through the Mann-Whitney-U test." (From [metilene user guide](https://www.bioinf.uni-leipzig.de/Software/metilene/Manual/#5_dmr_de-novo_annotation))

> <img src="images/2_metilene_workflow.png" width="600">

When running metilene, use `-a` and `-b` to specify the groups to compare. The output is then piped through `sort` so that the output file is **sorted** by the 1st and 2nd fields (chromosome and start position):

In [None]:
! metilene -a serum -b 2i Tutorial_2/metilene.input | sort -V -k1,1 -k2,2n > Tutorial_2/metilene_denovo.output

#### **DMR annotation in known features**  (option 2)

"Instead of annotating de-novo DMRs, metilene can be used to find significant DMRs within a given group of genomic features. Here, the first step calling the circular binary segmentation algorithm is skipped. Instead, statistical tests are performed for each feature, and corresponding p-values are reported in the output. Use the "-B bedfile" option to define windows through a bedfile SORTED equally to the data input file." (From [metilene user guide](https://www.bioinf.uni-leipzig.de/Software/metilene/Manual/#6_dmr_annotation_in_known_features))

An example of the external annotation file (CpG islands) can be found at [Table browser - UCSC Genome Browser](https://genome.ucsc.edu/cgi-bin/hgTables), where users can retrieve and export data from the Genome Browser annotation track database. This file is included in the data provided in this tutorial, named `cpgIslandExt_mm39_chr6.bed`:
> <img src="images/external_annotation_file.png" width="800"/>

To run metilene within known features, use `-f 2` to select the mode (1: de-novo, 2: pre-defined regions, 3: DMCs) and `-B` to provide a bed file containing the regions for mode 2, SORTED in the same order as the input data: 

In [None]:
! metilene -a serum -b 2i -f 2 -B Tutorial_1/fastq/cpgIslandExt_mm39_chr6.bed Tutorial_2/metilene.input | sort -V -k1,1 -k2,2n > Tutorial_2/metilene_cgi.output
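
In the known-features mode, the regions are given up front, so the basic bookkeeping is to collect the CpGs that fall inside each window and compare the groups there. The sketch below shows only that bookkeeping: it computes the number of CpGs and the mean difference per window and performs no statistical test. The function name, the half-open `[start, end)` interval convention, and the input layout are assumptions of this sketch, not metilene's implementation.

```python
def summarize_windows(windows, cpgs):
    """Per window, count the CpGs inside and compute the mean group difference.

    windows: list of (chrom, start, end)
    cpgs:    list of (chrom, pos, mean_a, mean_b), one entry per CpG
    """
    results = []
    for chrom, start, end in windows:
        inside = [c for c in cpgs if c[0] == chrom and start <= c[1] < end]
        if not inside:
            continue  # no CpG with data in this window
        mean_a = sum(c[2] for c in inside) / len(inside)
        mean_b = sum(c[3] for c in inside) / len(inside)
        results.append((chrom, start, end, len(inside), mean_a - mean_b))
    return results


# Made-up windows and CpG values, for illustration only.
windows = [("chr6", 100, 200), ("chr6", 500, 600)]
cpgs = [("chr6", 100, 0.8, 0.2), ("chr6", 150, 0.6, 0.2), ("chr6", 200, 0.9, 0.1)]
for row in summarize_windows(windows, cpgs):
    print(row)
```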

#### Filter output file and plot basic DMR statistics
An easy way to filter the already called DMRs is offered by `metilene_output.pl`. It will create some basic statistic plots characterizing your DMRs, i.e., distribution of DMR differences, DMR length in nucleotides and #CpGs, DMR differences vs. q-values, mean methylation group 1 vs. mean methylation group 2 and DMR length in nucleotides vs. length in CpGs. DMRs can be filtered by q-value, #CpGs, length in nucleotides and mean methylation difference. 

There are 3 files produced: 
1. bedGraph file containing the methylation difference for each DMR
2. basic statistic pdf 
3. filtered bedGraph-like file, containing all information already in the metilene output. 

Please see [here](https://www.bioinf.uni-leipzig.de/Software/metilene/Manual/#filter_output_file_and_plot_basic_dmr_statistics) for detailed parameter list and their default values.

The following code filters the metilene output files from the previous steps (option 1 `metilene_denovo.output` and option 2 `metilene_cgi.output`) using the default settings (adjusted p-value < 0.05, number of CpGs >= 10):

In [None]:
! metilene_output.pl -q Tutorial_2/metilene_denovo.output -o Tutorial_2/denovo -a serum -b 2i
! metilene_output.pl -q Tutorial_2/metilene_cgi.output -o Tutorial_2/cgi -a serum -b 2i

In our test example dataset, the de-novo annotation method identifies 683 DMRs (`Tutorial_2/denovo_qval.0.05.bedgraph`), while the number of differentially methylated CpG islands, as defined by the UCSC genome annotations, is 2 (`Tutorial_2/cgi_qval.0.05.bedgraph`).
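
The two thresholds named above (adjusted p-value and number of CpGs) can be applied by hand to the raw metilene output, as a rough cross-check. This sketch assumes the tab-separated de-novo column order, with the q-value in column 4 and #CpGs in column 6. It applies only those two thresholds, so its counts may differ from `metilene_output.pl`, which can also filter on other criteria. The function name and example rows are hypothetical.

```python
def filter_dmrs(rows, max_q=0.05, min_cpgs=10):
    """Keep rows with q-value below max_q and at least min_cpgs CpGs."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        q_value = float(fields[3])  # column 4: q-value
        n_cpgs = int(fields[5])     # column 6: number of CpGs
        if q_value < max_q and n_cpgs >= min_cpgs:
            kept.append(row)
    return kept


# Made-up rows, for illustration only.
rows = [
    "chr6\t100\t500\t0.001\t0.50\t12\t0.0001\t0.0002\t0.80\t0.30",
    "chr6\t900\t950\t0.001\t0.40\t4\t0.0001\t0.0002\t0.70\t0.30",
    "chr6\t2000\t2600\t0.200\t0.15\t25\t0.0100\t0.0200\t0.55\t0.40",
]
print(len(filter_dmrs(rows)), "of", len(rows), "rows pass both thresholds")
```

To try it on the tutorial output, the rows could be read with `open("Tutorial_2/metilene_denovo.output")`.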

### Visualization using IGV

In our example, only six CpG islands are defined (see track `CGI (UCSC)`), and two of them can be considered DMRs using the default threshold (adjusted p-value < 0.05). With the de-novo annotation, however, many more DMRs are detected (683 out of 812). You can view the results using IGV:

In [None]:
import igv_notebook
igv_notebook.init()

# reference genome
b = igv_notebook.Browser(
    {
        "genome": "mm39",
        "locus": "chr6:4,300,000-5,800,000"
    }
)

# load unfiltered CGI output 
b.load_track({
    "name": "CGI - all",
    "path": "Tutorial_2/metilene_cgi.output",
    "format": "bed",
    "type": "annotation",
    "color": "black"
})

# load filtered DMRs using both options
b.load_track({
    "name": "CGI - filtered",
    "path": "Tutorial_2/cgi_qval.0.05.bedgraph",
    "format": "bedgraph",
    "type": "wig",
    "color": "red"
})
b.load_track({
    "name": "denovo_DMR - filtered",
    "path": "Tutorial_2/denovo_qval.0.05.bedgraph",
    "format": "bedgraph",
    "type": "wig",
    "color": "red"
})

# load the methylation levels from 4 samples
b.load_track({
    "name": "SRX271141_bed",
    "path": "Tutorial_2/SRX271141_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "orange"
})
b.load_track({
    "name": "SRX202087_bed",
    "path": "Tutorial_2/SRX202087_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "orange"
})

b.load_track({
    "name": "SRX271142_bed",
    "path": "Tutorial_2/SRX271142_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "blue"
})
b.load_track({
    "name": "SRX202088_bed",
    "path": "Tutorial_2/SRX202088_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "blue"
})

You can zoom in to see more details in a specified region, for example `b.search('chr6:4,720,000-4,800,000')`:  

<img src = "images/2_igv_results.png" width="1000" />

Our example is not the best illustration of DMRs, since 2i ESCs (SRX271142, SRX202088) are **globally** hypomethylated compared to conventional ESCs maintained in serum (SRX271141, SRX202087). That is why the de-novo annotation detects so many DMRs (track `denovo_DMR`). On closer examination, however, we can still see how the methylation levels differ within the CpG islands or the de-novo defined regions.
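
One rough way to check for a global shift of this kind is to average all methylation values per group in the metilene input table. The sketch below assumes a tab-separated table whose ratio columns start with the group prefix, and it skips entries that are not numeric. The function name, column names, and values are made up for illustration.

```python
def group_means(lines, group_a, group_b):
    """Average all numeric values in the columns belonging to each group."""
    header = lines[0].rstrip("\n").split("\t")
    totals = {group_a: [0.0, 0], group_b: [0.0, 0]}  # [sum, count] per group
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        for name, value in zip(header[2:], fields[2:]):
            for group in (group_a, group_b):
                if name.startswith(group):
                    try:
                        number = float(value)
                    except ValueError:
                        continue  # skip placeholders for missing values
                    totals[group][0] += number
                    totals[group][1] += 1
    return {g: (s / n if n else None) for g, (s, n) in totals.items()}


# Made-up table with hypothetical column names, for illustration only.
table = [
    "chr\tpos\tserum_1\tserum_2\t2i_1\t2i_2",
    "chr6\t100\t0.8\t0.9\t0.2\t0.1",
    "chr6\t200\t0.7\t-\t0.3\t0.2",
]
print(group_means(table, "serum", "2i"))
```

A large gap between the two averages suggests that many regions will differ simply because of the global shift.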

## Terms & Quiz

In [None]:
! pip install jupytercards --quiet
from jupytercards import display_flashcards
display_flashcards('../quiz_files/f2.json')

In [None]:
! pip install jupyterquiz --quiet
from jupyterquiz import display_quiz
display_quiz('../quiz_files/q2.json')

<div class="alert alert-block alert-danger">
    <b>Don't forget:</b> after you finish running the notebook, <b>stop</b> the notebook in Vertex AI Workbench to avoid accumulating costs.
</div>

## Conclusion

This tutorial demonstrated the use of metilene for identifying differentially methylated regions (DMRs) from WGBS data.  We explored both de novo DMR annotation and DMR annotation within known CpG islands, highlighting the differences in results obtained using these two approaches.  The analysis showcased the workflow from preparing sorted input files using `metilene_input.pl` to running metilene and subsequently filtering and visualizing the results using `metilene_output.pl` and IGV.  While our example dataset illustrated global methylation differences rather than localized DMRs, the tutorial provided a comprehensive guide to utilizing metilene for DMR identification, emphasizing the importance of data preparation and the interpretation of results in the context of the biological question.  The interactive quiz further reinforced key concepts learned throughout the tutorial. Remember to stop your Vertex AI Workbench instance to avoid unnecessary costs.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.