vignettes/vin1_Introduction_to_coMethDMR_geneBasedPipeline.Rmd

---
title: "Introduction to coMethDMR"
author: "Lissette Gomez, Gabriel Odom, Tiago Chedraoui Silva, Lanyu Zhang, and Lily Wang"
date: "March 2021"
output: 
    BiocStyle::html_document:
    toc_float: true
    toc: true
    df_print: paged
    code_download: false
    toc_depth: 3
editor_options:
  chunk_output_type: inline
vignette: >
    %\VignetteIndexEntry{"Introduction to coMethDMR"} 
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

`coMethDMR` is an R package that identifies genomic regions that are both co-methylated and differentially
methylated in Illumina array datasets. Instead of testing all CpGs within a genomic region, `coMethDMR` carries out an additional step that selects co-methylated sub-regions first without using any outcome information. Next, `coMethDMR` tests association between methylation within the sub-region and continuous phenotype using a random coefficient mixed effects model, which models both variations between CpG sites within the region and differential methylation simultaneously. 


# Quick start

## Installation 

The latest version can be installed by

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("coMethDMR")
```

After installation, the `coMethDMR` package can be loaded into R using:
```{r load_comethDMR, message=FALSE}
library(coMethDMR)
```

If you are running the `coMethDMR` package for the first time, you will also need to download supplemental data sets from the `sesameData` package. This happens automatically for the 450k and EPIC arrays when you load `coMethDMR` for the first time. If you experience errors in this process, you may have your cache in an older default location. Please follow these steps to update where these genomic data are stored: <https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update>.

## Datasets

### Example Methylation Data
The input of `coMethDMR` are methylation beta values. We assume quality control and normalization of the methylation dataset have been performed, by R packages such as `minfi` or `RnBeads`. For illustration, we use a subset of prefrontal cortex methylation data (GEO GSE59685) from a recent Alzheimer's disease epigenome-wide association study which was described in Lunnon et al. (2014). This example dataset contains beta values for 8552 CpGs on chromosome 22 for a random selection of 20 subjects. 


```{r}
data(betasChr22_df)
betasChr22_df[1:5, 1:5]
```

If you have used either the [minfi](https://bioconductor.org/packages/release/bioc/vignettes/minfi/inst/doc/minfi.html) or [RnBeads](https://bioconductor.org/packages/release/bioc/vignettes/RnBeads/inst/doc/RnBeads.pdf) packages to pre-process your methylation data, we now show a quick example of the code necessary to create a methylation data frame similar to `betasChr22_df` above (note that `dataObject_minfi` and `dataObject_RnBeads` are objects containing the pre-processed DNA methylation data as returned by the `minfi::` or `RnBeads::` packages, respectively):

```
###  minfi  ###
betas_df <- as.data.frame(
  minfi::getMethSignal(dataObject_minfi, what = "Beta")
)

###  RnBeads  ###
betas_df <- as.data.frame(
  RnBeads::meth(dataObject_RnBeads, row.names = TRUE)
)
```

### Example Response Data
The corresponding phenotype dataset included variables `stage` (Braak AD stage), `subject.id`, `slide` (batch effect), `Sex`, `Sample` and `age.brain` (age of the brain donor). Please note the phenotype file needs to have a variable called "Sample" that will be used by coMethDMR to link to the methylation dataset. 

```{r}
data(pheno_df)
head(pheno_df)
```


## A quick work through of coMethDMR 
For illustration, suppose we are interested in identifying co-methylated genomic regions associated with AD stages (`stage` treated as a linear variable). Here we demonstrate analysis of genomic regions mapped to CpG islands on chromosome 22. However the workflow can be similarly conducted for other types of genomic regions. See details in Section 2.1 below for gene based pipeline that tests genic and intergenic regions. 

There are several steps: (1) obtain CpGs located closely (see details in Section 2.1 below) in genomic regions mapped to CpG islands, (2) identify co-methylated regions, and (3) test co-methylated regions against the outcome variable AD stage. 

For the first step, we use the following commands: 

```{r}
CpGisland_ls <- readRDS(
  system.file(
    "extdata",
    "CpGislandsChr22_ex.rds",
    package = 'coMethDMR',
    mustWork = TRUE
    )
)
```

Here, `CpGisland_ls` is a list of 20 items, with each item of the list including a group of CpG probe IDs located closely within a particular CpG island region. Section 2.1 discusses how to import additional types of genomic regions. 

Next, we identify co-methylated regions based on Mvalues.  
```{r CoMethAllRegions}
system.time(
  coMeth_ls <- CoMethAllRegions(
    dnam = betasChr22_df,
    betaToM = TRUE, # converts beta to m-values
    method = "pearson",
    CpGs_ls = CpGisland_ls,
    arrayType = "450k",
    returnAllCpGs = FALSE, 
    output = "CpGs",
    nCores_int = 1
  )
)
# ~15 seconds

coMeth_ls
```

`coMeth_ls` is list with that contains groups of CpG probeIDs corresponding to co-methylated regions. Three comethylated regions were identified in this example. 

If we want to look at co-methylation within the first co-methylated region: 

```{r message=FALSE}
WriteCorrPlot <- function(beta_mat){

  require(corrplot)
  require(coMethDMR)

  CpGs_char <- row.names(beta_mat)

  CpGsOrd_df <- OrderCpGsByLocation(
    CpGs_char, arrayType = c("450k"), output = "dataframe"
  )

  betaOrdered_mat <- t(beta_mat[CpGsOrd_df$cpg ,])

  corr <- cor(
    betaOrdered_mat, method = "spearman", use = "pairwise.complete.obs"
  )

  corrplot(corr, method = "number", number.cex = 1, tl.cex = 0.7)
}

# subsetting beta values to include only co-methylated probes
betas_df <- subset(
  betasChr22_df, 
  row.names(betasChr22_df) %in% coMeth_ls[[1]]
)

WriteCorrPlot(betas_df)
```

Next, we test these co-methylated regions against `stage` using a random coefficient model (more details in section 2.3 below).  

Some messages are generated during mixed models fitting, which are saved to the file specified by `outLogFile`. The interpretations of these messages can be found in the FAQs at the end of this document (see Section 3, item (1) and (2)).  

```{r lmmTestAllRegions_1}
out_df <- lmmTestAllRegions(
  betas = betasChr22_df,
  region_ls = coMeth_ls,
  pheno_df,
  contPheno_char = "stage",
  covariates_char = NULL,
  modelType = "randCoef",
  arrayType = "450k"
  # generates a log file in the current directory
  # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt")
)

out_df

```
Here `out_df` is a data frame of genomic regions, with corresponding p-values and false discovery rate (FDRs) from the random coefficient mixed model. 

We can annotate these results by adding corresponding genes and probes mapped to the genomic regions. 

```{r AnnotateResults, message = FALSE}
system.time(
  outAnno_df <- AnnotateResults(
    lmmRes_df = out_df,
    arrayType = "450k"
  )
)
# ~12 seconds

outAnno_df
```

To further examine the significant regions, we can also extract individual CpG p-values within these significant regions. For example, for the most significant region `chr22:18268062-18268249`, 
```{r CpGsInfoOneRegion}
outCpGs_df <- CpGsInfoOneRegion(
 regionName_char = "chr22:18268062-18268249",
 betas_df = betasChr22_df,
 pheno_df = pheno_df,
 contPheno_char = "stage",
 covariates_char = NULL,
 arrayType = "450k"
)

outCpGs_df

```

These CpGs mapped to intergenic regions, so there are no gene names associated with the probes. For genic regions such as `chr22:19709548-19709755`, we would have results such as the following:

```{r message = FALSE}
# library("GenoGAM")
# NOTE 2022-03-28: what are we using this package for again?
```

```{r CpGsInfoOneRegion_2}
CpGsInfoOneRegion(
  regionName_char = "chr22:19709548-19709755",
  betas_df = betasChr22_df,
  pheno_df = pheno_df,
  contPheno_char = "stage",
  covariates_char = NULL,
  arrayType = "450k"
)
```


# Details of `coMethDMR` workflow

## Genomic regions tested in gene based pipeline

Genomic regions on the Illumina arrays can be defined based on their relations to genes or CpG Islands. To reduce redundancy in the tested genomic regions, we recommend first testing genic and intergenic regions, then add annotations to each genomic region for their relation to CpG islands. 

In `coMethDMR` package, for 450k arrays, the relevant genomic regions to be analyzed are in files `450k_Gene_3_200.RDS` and `450k_InterGene_3_200.rds`. For EPIC arrays, the relevant genomic regions are in files `EPIC_Gene_3_200.rds` and `EPIC_InterGene_3_200.rds`. These additional data sets are available at <https://github.com/TransBioInfoLab/coMethDMR_data>.

These files were created using the function `WriteCloseByAllRegions`, briefly, for genic regions, within each gene, we identified clusters of CpGs located closely (i.e. the maximum separation between any two consecutive probes is 200bp; `maxGap = 200`), and we required each cluster to have at least 3 CpGs (`minCpGs = 3`). For intergenic regions, we identified clusters CpGs similarly for each chromosome. To extract clusters of close-by CpGs from pre-defined genomic regions with different values of `maxGap` and `minCpGs`, the `WriteCloseByAllRegions` function can be used. 

The pre-computed genomic regions can be accessed using the following commands. For geneic regions in 450k arrays,  

```{r}
gene_ls <- readRDS(
  system.file(
    "extdata",
    "450k_Gene_3_200.rds",
    package = 'coMethDMR',
    mustWork = TRUE
    )
)
```

Here `gene_ls` is a list, with each item containing a character vector of CpGs IDs for a particular region in a gene.

Vignette # 2 illustrates how to leverage parallel computing via `BiocParallel` R package to make gene-based analysis fast. 

## When there are co-variate variables in dataset to consider

Before identifying co-methylated clusters, we recommend removing uninteresting technical and biological effects, so that the resulting co-methylated clusters are only driven by the biological factors we are interested in. This can be accomplished using the `GetResiduals` function. 

For example, the following script computes residuals from linear model 
`Mvalues ~ age.brain + sex + slide` (note that we use only a subset of the Chromosome 22 CpG islands for computing time consideration).

```{r reduce_data}
Cgi_ls <- readRDS(
  system.file(
    "extdata",
    "CpGislandsChr22_ex.rds",
    package = 'coMethDMR',
    mustWork = TRUE
    )
)
betasChr22_df <-
  betasChr22_df[rownames(betasChr22_df) %in% unlist(Cgi_ls)[1:20], ]
```

```{r GetResiduals}
resid_mat <- GetResiduals(
  dnam = betasChr22_df,
  # converts to Mvalues for fitting linear model 
  betaToM = TRUE, 
  pheno_df = pheno_df,
  covariates_char = c("age.brain", "sex", "slide")
)
```

## Algorithm for identifying co-methylated regions

Within each genomic region, coMethDMR identifies contiguous and co-methylated CpGs sub-regions without using any outcome information. To select these co-methylated sub-regions, we use the `rdrop` statistic, which is the correlation between each CpG with the sum of methylation levels in all other CpGs. The default is `rDropThresh_num = 0.4`. We recommend this setting based on our simulation study. Note that higher `rDropThresh_num` values lead to fewer co-methylated regions.

Again, for illustration, we use CpG islands. For example, if we are interested in identifying co-methylated sub-region within the first genomic region in `Cgi_ls`:  

```{r CoMethAllRegions_2}
Cgi_ls <- readRDS(
  system.file(
    "extdata",
    "CpGislandsChr22_ex.rds",
    package = 'coMethDMR',
    mustWork = TRUE
    )
)

coMeth_ls <- CoMethAllRegions(
  dnam = resid_mat,
  betaToM = FALSE,
  method = "pearson",
  CpGs_ls = Cgi_ls[1],
  arrayType = "450k",
  returnAllCpGs = FALSE,
  output = "CpGs"
)

coMeth_ls

```

The results (`NULL`) indicate there is no co-methylated sub-region within the first genomic region. 

What about the other 19 regions? Next we look at a region (5th region in `Cgi_ls`) where there is a co-methylated sub-region: 
```{r CoMethAllRegions_3}
coMeth_ls <- CoMethAllRegions(
  dnam = resid_mat,
  betaToM = FALSE,
  CpGs_ls = Cgi_ls[5],
  arrayType = "450k",
  returnAllCpGs = FALSE, 
  output = "CpGs"
)

coMeth_ls
```

`coMeth_ls` is a list, where each item is a list of CpG probe IDs for a co-methylated sub-region. 

If we want to see the detailed output of the coMethDMR algorithm, that is, how the co-methylated region was obtained, we can specify `output = "dataframe"`: 

```{r CoMethAllRegions_4}
coMethData_df <- CoMethAllRegions(
  dnam = resid_mat,
  betaToM = FALSE,
  CpGs_ls = Cgi_ls[5],
  arrayType = "450k",
  returnAllCpGs = FALSE, 
  output = "dataframe"
) [[1]]

coMethData_df
```

`coMethData_df` provides the details on how the co-methylated region was obtained: Here `keep = 1` if `rDropThresh_num > 0.4` (i.e. a co-methylated CpG), and `keep_contigous` indicates if the probe is in a contiguous co-methylated region. Note that only the last 3 CpGs constitutes the co-methylated cluster.    


##  Models for testing genomic regions against a continuous phenotype 

To test association between a continuous phenotype and methylation values in a contiguous co-methylated region, two mixed models have been implemented in the function `lmmTestAllRegions`: a random coefficient mixed model (`modelType = "randCoef"`) and a simple linear mixed model (`modelType = "simple"`). 

The random coefficient mixed model includes both a systematic component that models the mean for each group of CpGs, and a random component that models how each CpG varies with respect to the group mean (random probe effects). It also includes random sample effects that model correlations between multiple probes within the same sample. 

More specifically, the random coefficient model is
`methylation M value ~ contPheno_char + covariates_char + (1|Sample) + (contPheno_char|CpG).`
The last term `(contPheno_char|CpG)` specifies both random intercepts and slopes for each CpG.

The simple linear mixed model includes all the terms in the random coefficient model except random probe effects.

The simple linear mixed model is

`methylation M value ~ contPheno_char + covariates_char + (1|Sample)`

To test one genomic region against the continuous phenotype `stage`, adjusting for `age.brain`: 
```{r lmmTestAllRegions_2, message = FALSE}
lmmTestAllRegions(
  betas = betasChr22_df,
  region_ls = coMeth_ls[1],
  pheno_df,
  contPheno_char = "stage",
  covariates_char = "age.brain",
  modelType = "randCoef",
  arrayType = "450k"
)
```

If we don't want to adjust for any covariate effect, we can set `covariates_char` to `NULL`. 

##  Analyzing a specific gene

Finally, we demonstrate `coMethDMR` analysis for a particular gene, for example the `ARFGAP3` gene. 

```{r}
data(betasChr22_df)
```

We assume that the user knows the set of probes corresponding to the gene of interest. If this is not the case, we provide two data sets which contain mappings from gene symbols to probe IDs for both 450k ("450k_CpGstoGene_min3CpGs.rds") and EPIC ("EPIC_CpGstoGene_min3CpGs.rds") arrays. These two data sets are available at: <https://github.com/TransBioInfoLab/coMethDMR_data/tree/main/data>.
```{r message = FALSE}
# list probes for this gene
ARFGAP3_CpGs_char <- c(
  "cg00079563", "cg01029450", "cg02351223", "cg04527868", "cg09861871",
  "cg26529516", "cg00539564", "cg05288033", "cg09367092", "cg10648908",
  "cg14570855", "cg15656623", "cg23778094", "cg27120833"
)

# list probes located closely on this gene
gene3_200 <- CloseBySingleRegion(
  CpGs_char = ARFGAP3_CpGs_char,
  arrayType = "450k",
  maxGap = 200,
  minCpGs = 3
)
CpGsOrdered_ls <- lapply(
  gene3_200,
  OrderCpGsByLocation,
  arrayType = "450k",
  output = "dataframe"
)
names(gene3_200) <- lapply(CpGsOrdered_ls, NameRegion)

# List of regions
gene3_200
```

Now that we have a list of regions, we can find the co-methylated regions within the gene and test their statistical significance:
```{r find_and_test_regions}
# co-methlyated region within the gene
coMeth_ls <- CoMethAllRegions(
  dnam = betasChr22_df,
  betaToM = TRUE,
  method = "pearson",
  CpGs_ls = gene3_200,
  arrayType = "450k",
  returnAllCpGs = FALSE,
  output = "CpGs"
)

coMeth_ls 

# test the co-methylated regions within the gene
results <- lmmTestAllRegions(
  betas = betasChr22_df,
  region_ls = coMeth_ls,
  pheno_df,
  contPheno_char = "stage",
  covariates_char = "age.brain",
  modelType = "randCoef",
  arrayType = "450k"
  # generates a log file in the current directory
  # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt")
)
```

Before we inspect the results, we will annotate them:
```{r}
AnnotateResults(lmmRes_df = results, arrayType = "450k")
```


#  Frequently Asked Questions

(1) What happens when mixed model fails to coverge (i.e. the warning "Model failed to converge with..." is resulted for a particular genomic region)?
- In this case, the p-value for mixed model is set to 1. In our experiences with methylation datasets, genomic regions with strong signals typically converge. Convergence issues typically occurs when the amount of noise in data is high. 

(2) When fitting mixed models with `lmmTestAllRegions` function, What does the message "boundary (singular) fit" mean? 
- When mixed model is singular, at least one of the estimated variance components for intercepts or slopes random effects is 0, because there isn't enough variabilities in data to estimate the random effects. In this case, mixed model reduces to a fixed effects model. However, as our simulation studies have shown, the p-values obtained for these regions are still valid.


#  Reference

Lunnon K, Smith R, Hannon E, De Jager PL, Srivastava G, Volta M, Troakes C, Al-Sarraj S, Burrage J, Macdonald R, et al (2014) Methylomic profiling implicates cortical deregulation of ANK1 in Alzheimer's disease. Nat Neurosci 17:1164-1170.


# Session Information
```{r}
utils::sessionInfo()
```