-
Notifications
You must be signed in to change notification settings - Fork 1
/
vin1_Introduction_to_coMethDMR_geneBasedPipeline.Rmd
473 lines (356 loc) · 18 KB
/
vin1_Introduction_to_coMethDMR_geneBasedPipeline.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
---
title: "Introduction to coMethDMR"
author: "Lissette Gomez, Gabriel Odom, Tiago Chedraoui Silva, Lanyu Zhang, and Lily Wang"
date: "March 2021"
output:
BiocStyle::html_document:
toc_float: true
toc: true
df_print: paged
code_download: false
toc_depth: 3
editor_options:
chunk_output_type: inline
vignette: >
%\VignetteIndexEntry{"Introduction to coMethDMR"}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
`coMethDMR` is an R package that identifies genomic regions that are both co-methylated and differentially
methylated in Illumina array datasets. Instead of testing all CpGs within a genomic region, `coMethDMR` carries out an additional step that selects co-methylated sub-regions first without using any outcome information. Next, `coMethDMR` tests association between methylation within the sub-region and continuous phenotype using a random coefficient mixed effects model, which models both variations between CpG sites within the region and differential methylation simultaneously.
# Quick start
## Installation
The latest version can be installed by
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("coMethDMR")
```
After installation, the `coMethDMR` package can be loaded into R using:
```{r load_comethDMR, message=FALSE}
library(coMethDMR)
```
If you are running the `coMethDMR` package for the first time, you will also need to download supplemental data sets from the `sesameData` package. This happens automatically for the 450k and EPIC arrays when you load `coMethDMR` for the first time. If you experience errors in this process, you may have your cache in an older default location. Please follow these steps to update where these genomic data are stored: <https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update>.
## Datasets
### Example Methylation Data
The input of `coMethDMR` are methylation beta values. We assume quality control and normalization of the methylation dataset have been performed, by R packages such as `minfi` or `RnBeads`. For illustration, we use a subset of prefrontal cortex methylation data (GEO GSE59685) from a recent Alzheimer's disease epigenome-wide association study which was described in Lunnon et al. (2014). This example dataset contains beta values for 8552 CpGs on chromosome 22 for a random selection of 20 subjects.
```{r}
data(betasChr22_df)
betasChr22_df[1:5, 1:5]
```
If you have used either the [minfi](https://bioconductor.org/packages/release/bioc/vignettes/minfi/inst/doc/minfi.html) or [RnBeads](https://bioconductor.org/packages/release/bioc/vignettes/RnBeads/inst/doc/RnBeads.pdf) packages to pre-process your methylation data, we now show a quick example of the code necessary to create a methylation data frame similar to `betasChr22_df` above (note that `dataObject_minfi` and `dataObject_RnBeads` are objects containing the pre-processed DNA methylation data as returned by the `minfi::` or `RnBeads::` packages, respectively):
```
### minfi ###
betas_df <- as.data.frame(
minfi::getMethSignal(dataObject_minfi, what = "Beta")
)
### RnBeads ###
betas_df <- as.data.frame(
RnBeads::meth(dataObject_RnBeads, row.names = TRUE)
)
```
### Example Response Data
The corresponding phenotype dataset included variables `stage` (Braak AD stage), `subject.id`, `slide` (batch effect), `Sex`, `Sample` and `age.brain` (age of the brain donor). Please note the phenotype file needs to have a variable called "Sample" that will be used by coMethDMR to link to the methylation dataset.
```{r}
data(pheno_df)
head(pheno_df)
```
## A quick work through of coMethDMR
For illustration, suppose we are interested in identifying co-methylated genomic regions associated with AD stages (`stage` treated as a linear variable). Here we demonstrate analysis of genomic regions mapped to CpG islands on chromosome 22. However the workflow can be similarly conducted for other types of genomic regions. See details in Section 2.1 below for gene based pipeline that tests genic and intergenic regions.
There are several steps: (1) obtain CpGs located closely (see details in Section 2.1 below) in genomic regions mapped to CpG islands, (2) identify co-methylated regions, and (3) test co-methylated regions against the outcome variable AD stage.
For the first step, we use the following commands:
```{r}
CpGisland_ls <- readRDS(
system.file(
"extdata",
"CpGislandsChr22_ex.rds",
package = 'coMethDMR',
mustWork = TRUE
)
)
```
Here, `CpGisland_ls` is a list of 20 items, with each item of the list including a group of CpG probe IDs located closely within a particular CpG island region. Section 2.1 discusses how to import additional types of genomic regions.
Next, we identify co-methylated regions based on Mvalues.
```{r CoMethAllRegions}
system.time(
coMeth_ls <- CoMethAllRegions(
dnam = betasChr22_df,
betaToM = TRUE, # converts beta to m-values
method = "pearson",
CpGs_ls = CpGisland_ls,
arrayType = "450k",
returnAllCpGs = FALSE,
output = "CpGs",
nCores_int = 1
)
)
# ~15 seconds
coMeth_ls
```
`coMeth_ls` is list with that contains groups of CpG probeIDs corresponding to co-methylated regions. Three comethylated regions were identified in this example.
If we want to look at co-methylation within the first co-methylated region:
```{r message=FALSE}
WriteCorrPlot <- function(beta_mat){
require(corrplot)
require(coMethDMR)
CpGs_char <- row.names(beta_mat)
CpGsOrd_df <- OrderCpGsByLocation(
CpGs_char, arrayType = c("450k"), output = "dataframe"
)
betaOrdered_mat <- t(beta_mat[CpGsOrd_df$cpg ,])
corr <- cor(
betaOrdered_mat, method = "spearman", use = "pairwise.complete.obs"
)
corrplot(corr, method = "number", number.cex = 1, tl.cex = 0.7)
}
# subsetting beta values to include only co-methylated probes
betas_df <- subset(
betasChr22_df,
row.names(betasChr22_df) %in% coMeth_ls[[1]]
)
WriteCorrPlot(betas_df)
```
Next, we test these co-methylated regions against `stage` using a random coefficient model (more details in section 2.3 below).
Some messages are generated during mixed models fitting, which are saved to the file specified by `outLogFile`. The interpretations of these messages can be found in the FAQs at the end of this document (see Section 3, item (1) and (2)).
```{r lmmTestAllRegions_1}
out_df <- lmmTestAllRegions(
betas = betasChr22_df,
region_ls = coMeth_ls,
pheno_df,
contPheno_char = "stage",
covariates_char = NULL,
modelType = "randCoef",
arrayType = "450k"
# generates a log file in the current directory
# outLogFile = paste0("lmmLog_", Sys.Date(), ".txt")
)
out_df
```
Here `out_df` is a data frame of genomic regions, with corresponding p-values and false discovery rate (FDRs) from the random coefficient mixed model.
We can annotate these results by adding corresponding genes and probes mapped to the genomic regions.
```{r AnnotateResults, message = FALSE}
system.time(
outAnno_df <- AnnotateResults(
lmmRes_df = out_df,
arrayType = "450k"
)
)
# ~12 seconds
outAnno_df
```
To further examine the significant regions, we can also extract individual CpG p-values within these significant regions. For example, for the most significant region `chr22:18268062-18268249`,
```{r CpGsInfoOneRegion}
outCpGs_df <- CpGsInfoOneRegion(
regionName_char = "chr22:18268062-18268249",
betas_df = betasChr22_df,
pheno_df = pheno_df,
contPheno_char = "stage",
covariates_char = NULL,
arrayType = "450k"
)
outCpGs_df
```
These CpGs mapped to intergenic regions, so there are no gene names associated with the probes. For genic regions such as `chr22:19709548-19709755`, we would have results such as the following:
```{r message = FALSE}
# library("GenoGAM")
# NOTE 2022-03-28: what are we using this package for again?
```
```{r CpGsInfoOneRegion_2}
CpGsInfoOneRegion(
regionName_char = "chr22:19709548-19709755",
betas_df = betasChr22_df,
pheno_df = pheno_df,
contPheno_char = "stage",
covariates_char = NULL,
arrayType = "450k"
)
```
# Details of `coMethDMR` workflow
## Genomic regions tested in gene based pipeline
Genomic regions on the Illumina arrays can be defined based on their relations to genes or CpG Islands. To reduce redundancy in the tested genomic regions, we recommend first testing genic and intergenic regions, then add annotations to each genomic region for their relation to CpG islands.
In `coMethDMR` package, for 450k arrays, the relevant genomic regions to be analyzed are in files `450k_Gene_3_200.RDS` and `450k_InterGene_3_200.rds`. For EPIC arrays, the relevant genomic regions are in files `EPIC_Gene_3_200.rds` and `EPIC_InterGene_3_200.rds`. These additional data sets are available at <https://github.com/TransBioInfoLab/coMethDMR_data>.
These files were created using the function `WriteCloseByAllRegions`, briefly, for genic regions, within each gene, we identified clusters of CpGs located closely (i.e. the maximum separation between any two consecutive probes is 200bp; `maxGap = 200`), and we required each cluster to have at least 3 CpGs (`minCpGs = 3`). For intergenic regions, we identified clusters CpGs similarly for each chromosome. To extract clusters of close-by CpGs from pre-defined genomic regions with different values of `maxGap` and `minCpGs`, the `WriteCloseByAllRegions` function can be used.
The pre-computed genomic regions can be accessed using the following commands. For geneic regions in 450k arrays,
```{r}
gene_ls <- readRDS(
system.file(
"extdata",
"450k_Gene_3_200.rds",
package = 'coMethDMR',
mustWork = TRUE
)
)
```
Here `gene_ls` is a list, with each item containing a character vector of CpGs IDs for a particular region in a gene.
Vignette # 2 illustrates how to leverage parallel computing via `BiocParallel` R package to make gene-based analysis fast.
## When there are co-variate variables in dataset to consider
Before identifying co-methylated clusters, we recommend removing uninteresting technical and biological effects, so that the resulting co-methylated clusters are only driven by the biological factors we are interested in. This can be accomplished using the `GetResiduals` function.
For example, the following script computes residuals from linear model
`Mvalues ~ age.brain + sex + slide` (note that we use only a subset of the Chromosome 22 CpG islands for computing time consideration).
```{r reduce_data}
Cgi_ls <- readRDS(
system.file(
"extdata",
"CpGislandsChr22_ex.rds",
package = 'coMethDMR',
mustWork = TRUE
)
)
betasChr22_df <-
betasChr22_df[rownames(betasChr22_df) %in% unlist(Cgi_ls)[1:20], ]
```
```{r GetResiduals}
resid_mat <- GetResiduals(
dnam = betasChr22_df,
# converts to Mvalues for fitting linear model
betaToM = TRUE,
pheno_df = pheno_df,
covariates_char = c("age.brain", "sex", "slide")
)
```
## Algorithm for identifying co-methylated regions
Within each genomic region, coMethDMR identifies contiguous and co-methylated CpGs sub-regions without using any outcome information. To select these co-methylated sub-regions, we use the `rdrop` statistic, which is the correlation between each CpG with the sum of methylation levels in all other CpGs. The default is `rDropThresh_num = 0.4`. We recommend this setting based on our simulation study. Note that higher `rDropThresh_num` values lead to fewer co-methylated regions.
Again, for illustration, we use CpG islands. For example, if we are interested in identifying co-methylated sub-region within the first genomic region in `Cgi_ls`:
```{r CoMethAllRegions_2}
Cgi_ls <- readRDS(
system.file(
"extdata",
"CpGislandsChr22_ex.rds",
package = 'coMethDMR',
mustWork = TRUE
)
)
coMeth_ls <- CoMethAllRegions(
dnam = resid_mat,
betaToM = FALSE,
method = "pearson",
CpGs_ls = Cgi_ls[1],
arrayType = "450k",
returnAllCpGs = FALSE,
output = "CpGs"
)
coMeth_ls
```
The results (`NULL`) indicate there is no co-methylated sub-region within the first genomic region.
What about the other 19 regions? Next we look at a region (5th region in `Cgi_ls`) where there is a co-methylated sub-region:
```{r CoMethAllRegions_3}
coMeth_ls <- CoMethAllRegions(
dnam = resid_mat,
betaToM = FALSE,
CpGs_ls = Cgi_ls[5],
arrayType = "450k",
returnAllCpGs = FALSE,
output = "CpGs"
)
coMeth_ls
```
`coMeth_ls` is a list, where each item is a list of CpG probe IDs for a co-methylated sub-region.
If we want to see the detailed output of the coMethDMR algorithm, that is, how the co-methylated region was obtained, we can specify `output = "dataframe"`:
```{r CoMethAllRegions_4}
coMethData_df <- CoMethAllRegions(
dnam = resid_mat,
betaToM = FALSE,
CpGs_ls = Cgi_ls[5],
arrayType = "450k",
returnAllCpGs = FALSE,
output = "dataframe"
) [[1]]
coMethData_df
```
`coMethData_df` provides the details on how the co-methylated region was obtained: Here `keep = 1` if `rDropThresh_num > 0.4` (i.e. a co-methylated CpG), and `keep_contigous` indicates if the probe is in a contiguous co-methylated region. Note that only the last 3 CpGs constitutes the co-methylated cluster.
## Models for testing genomic regions against a continuous phenotype
To test association between a continuous phenotype and methylation values in a contiguous co-methylated region, two mixed models have been implemented in the function `lmmTestAllRegions`: a random coefficient mixed model (`modelType = "randCoef"`) and a simple linear mixed model (`modelType = "simple"`).
The random coefficient mixed model includes both a systematic component that models the mean for each group of CpGs, and a random component that models how each CpG varies with respect to the group mean (random probe effects). It also includes random sample effects that model correlations between multiple probes within the same sample.
More specifically, the random coefficient model is
`methylation M value ~ contPheno_char + covariates_char + (1|Sample) + (contPheno_char|CpG).`
The last term `(contPheno_char|CpG)` specifies both random intercepts and slopes for each CpG.
The simple linear mixed model includes all the terms in the random coefficient model except random probe effects.
The simple linear mixed model is
`methylation M value ~ contPheno_char + covariates_char + (1|Sample)`
To test one genomic region against the continuous phenotype `stage`, adjusting for `age.brain`:
```{r lmmTestAllRegions_2, message = FALSE}
lmmTestAllRegions(
betas = betasChr22_df,
region_ls = coMeth_ls[1],
pheno_df,
contPheno_char = "stage",
covariates_char = "age.brain",
modelType = "randCoef",
arrayType = "450k"
)
```
If we don't want to adjust for any covariate effect, we can set `covariates_char` to `NULL`.
## Analyzing a specific gene
Finally, we demonstrate `coMethDMR` analysis for a particular gene, for example the `ARFGAP3` gene.
```{r}
data(betasChr22_df)
```
We assume that the user knows the set of probes corresponding to the gene of interest. If this is not the case, we provide two data sets which contain mappings from gene symbols to probe IDs for both 450k ("450k_CpGstoGene_min3CpGs.rds") and EPIC ("EPIC_CpGstoGene_min3CpGs.rds") arrays. These two data sets are available at: <https://github.com/TransBioInfoLab/coMethDMR_data/tree/main/data>.
```{r message = FALSE}
# list probes for this gene
ARFGAP3_CpGs_char <- c(
"cg00079563", "cg01029450", "cg02351223", "cg04527868", "cg09861871",
"cg26529516", "cg00539564", "cg05288033", "cg09367092", "cg10648908",
"cg14570855", "cg15656623", "cg23778094", "cg27120833"
)
# list probes located closely on this gene
gene3_200 <- CloseBySingleRegion(
CpGs_char = ARFGAP3_CpGs_char,
arrayType = "450k",
maxGap = 200,
minCpGs = 3
)
CpGsOrdered_ls <- lapply(
gene3_200,
OrderCpGsByLocation,
arrayType = "450k",
output = "dataframe"
)
names(gene3_200) <- lapply(CpGsOrdered_ls, NameRegion)
# List of regions
gene3_200
```
Now that we have a list of regions, we can find the co-methylated regions within the gene and test their statistical significance:
```{r find_and_test_regions}
# co-methlyated region within the gene
coMeth_ls <- CoMethAllRegions(
dnam = betasChr22_df,
betaToM = TRUE,
method = "pearson",
CpGs_ls = gene3_200,
arrayType = "450k",
returnAllCpGs = FALSE,
output = "CpGs"
)
coMeth_ls
# test the co-methylated regions within the gene
results <- lmmTestAllRegions(
betas = betasChr22_df,
region_ls = coMeth_ls,
pheno_df,
contPheno_char = "stage",
covariates_char = "age.brain",
modelType = "randCoef",
arrayType = "450k"
# generates a log file in the current directory
# outLogFile = paste0("lmmLog_", Sys.Date(), ".txt")
)
```
Before we inspect the results, we will annotate them:
```{r}
AnnotateResults(lmmRes_df = results, arrayType = "450k")
```
# Frequently Asked Questions
(1) What happens when mixed model fails to coverge (i.e. the warning "Model failed to converge with..." is resulted for a particular genomic region)?
- In this case, the p-value for mixed model is set to 1. In our experiences with methylation datasets, genomic regions with strong signals typically converge. Convergence issues typically occurs when the amount of noise in data is high.
(2) When fitting mixed models with `lmmTestAllRegions` function, What does the message "boundary (singular) fit" mean?
- When mixed model is singular, at least one of the estimated variance components for intercepts or slopes random effects is 0, because there isn't enough variabilities in data to estimate the random effects. In this case, mixed model reduces to a fixed effects model. However, as our simulation studies have shown, the p-values obtained for these regions are still valid.
# Reference
Lunnon K, Smith R, Hannon E, De Jager PL, Srivastava G, Volta M, Troakes C, Al-Sarraj S, Burrage J, Macdonald R, et al (2014) Methylomic profiling implicates cortical deregulation of ANK1 in Alzheimer's disease. Nat Neurosci 17:1164-1170.
# Session Information
```{r}
utils::sessionInfo()
```