-
Notifications
You must be signed in to change notification settings - Fork 5
/
Clipper.Rmd
323 lines (287 loc) · 18.8 KB
/
Clipper.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
---
title: "Analyzing high-throughput data with Clipper"
author: "Xinzhou Ge, Yiling Chen"
date: "`r Sys.Date()`"
output:
html_document:
df_print: paged
toc: yes
pdf_document:
toc: yes
mainfont: Arial
header-includes: \usepackage[T1]{fontenc}
vignette: |
%\VignetteIndexEntry{Clipper: p-value-free FDR control on high-throughput data from two conditions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
---
## Introduction
The most common goal of analyzing high-throughput data is to contrast two conditions to reliably screen "interesting features", where "interesting" means "enriched" or "differential". Enriched features are defined as those that have higher expected measurements under the experimental/treatment condition than the background condition, i.e., the negative control. The detection of such enriched features is called enrichment analysis. For example, common enrichment analyses include calling protein-binding sites in a genome from chromatin immunoprecipitation sequencing (ChIP-seq) data and identifying peptides from mass spectrometry (MS) data. In contrast, differential features are defined as those that have different expected measurements (without measurement errors) between two conditions, and the detection of such differential features is called differential analysis. For example, popular differential analyses include the identification of differentially expressed genes (DEGs) from genome-wide gene expression data (e.g., microarray and RNA sequencing (RNA-seq) data) and differentially chromosomal interaction regions (DIRs) from Hi-C data.
This package provides an easy implementation of a general statistical framework Clipper, an p-value-free FDR control method for analyzing high-throughput biological data. It is applicable to both enrichment analysis and differential analysis. This vignette explains the general use of `Clipper` and then demonstrates its use in typical high-throughput analyses including DEG analysis of RNA-seq data, peak calling from ChIP-Seq data, peptide identification from mass spectrometry data, and DIR analysis from Hi-C data. The details and examples of all these analyses will be further introduced below.
**Note**: if you use `Clipper` in published research, please cite:
For questions regarding the use of `Clipper`, please post to https://github.com/JSB-UCLA/Clipper/issues.
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## General pipeline
Here we show the most basic applications of Clipper in enrichment and differential analysis. `Clipper requires a minimum of four inputs:
- `score.exp` the measurements from the experimental (treatment) condition
- `score.back` the measurements from the background (control) condition
- `analysis` the type of analysis, "enrichment" or "differential".
- `FDR` the target FDR threshold(s), set to 0.05 by default
We use `Clipper` to find interesting features by contrasting two measurement matrices, one from the experimental condition and one from the background condition. The inputs of `score.exp` and `score.back` should be numeric matrices. The rows of `score.back` and `score.exp` should match and represent the same feature, and their columns represent replicates (biological samples). For enrichment and differential analysis, set `analysis` to be `"enrichmentl"("e")` or `"differential"("d")` respectively.
We demonstrate the genral pepline of `Clipper` using the following simulated datasets: `exp_e` and `back_e` in example of enrichment analysis; and `exp_d` and `back_d` in example of differential analysis.
```{r setup}
library(Clipper)
```
### Enrichment analysis
For enrichment analysis, we use the simulated data `exp_e` and `back_e` as inputs of `score.back` and `score.exp`. In this dataset, there are 10,000 features and the first 1000 features are interesting. For the target FDR threshold(s), we can then use `Clipper` to perform enrichment analysis:
```{r}
#use three FDR thresholds: 0.01, 0.05, 0.1
exp_e[c(1:3, 1001:1003), ]
back_e[c(1:3, 1001:1003), ]
re1 <- Clipper(score.exp = exp_e, score.back = back_e, analysis = "enrichment",
FDR = c(0.01, 0.05, 0.1))
names(re1)
```
`Clipper` returns a list and its component `discovery` is a list of indices of identified interesting discoveries corresponding to the FDR threshold(s):
```{r}
#indices of identified interesting genes with the second input FDR threshold (0.05)
re1$discoveries[[2]][1:5]
```
We can then calculate the resulting false discovery proportion (FDP) and power:
```{r}
#FDP (the first 1000 genes are true positives)
trueid <- 1:1000
sum(!re1$discoveries[[1]] %in% trueid)/length(re1$discoveries[[1]])
#power
sum(re1$discoveries[[1]] %in% trueid)/length(trueid)
```
### Differential analysis
In a differential analysis, the two conditions can be treated equally thus you can assign either condition as background/experimental. We use the simulated data `exp_d` and `back_d` as inputs of `score.back` and `score.exp`. In this dataset, there are 10,000 features and the first 2000 features are interesting. For the target FDR threshold(s), we can then use `Clipper` to perform differential analysis:
```{r}
#use three FDR thresholds: 0.01, 0.05, 0.1
exp_d[c(1:3, 1001:1003), ]
back_d[c(1:3, 1001:1003), ]
re2 <- Clipper(score.exp = exp_d, score.back = back_d, analysis = "differential",
FDR = c(0.01, 0.05, 0.1))
names(re2)
```
`Clipper` still returns a list and its component `discovery` is a list of indices of identified interesting discoveries corresponding to the FDR threshold(s):
```{r}
#indices of identified interesting genes with the second input FDR threshold (0.05)
re2$discoveries[[2]][1:5]
```
We can then calculate the resulting false discovery proportion (FDP) and power:
```{r}
#FDP (the first 2000 genes are true positives)
trueid <- 1:2000
sum(!re2$discoveries[[1]] %in% trueid)/length(re2$discoveries[[1]])
#power
sum(re2$discoveries[[1]] %in% trueid)/length(trueid)
```
## DEG identification
Here we demonstrate how to use `Clipper` for DEG identification. set `analysis` to be `"differential"`. The input of `score.exp` and `score.back` should be gene expression matrices with genes as rows and replicates as columns. It is important to ensure the rows of `score.exp` and `score.back` represent the same set of genes in the same order. `Clipper` requires either `score.exp` or `score.back` to have at least two replicates to perform a differential analysis.
Users are recommended to normalize and log-tranform read count matrices before inputing them to `Clipper` (see examples below). The following example uses `edgeR` to preprocess raw read count matrices as an example. Users can also use their preferred normalization methods.
```{r, eval = F}
library(edgeR)
dataurl = 'http://www.stat.ucla.edu/~jingyi.li/data/Clipper/'
count1 = readRDS(url(paste0(dataurl, 'simcount1.rds')))
count2 = readRDS(url(paste0(dataurl, 'simcount2.rds')))
############ use edgeR to normalize ############
r1 = ncol(count1)
r2 = ncol(count2)
cond_idx = rep(2, r1 + r2)
cond_idx[1:r1] = 1
dat = cbind(count1, count2)
cond_idx = factor(cond_idx)
y <- DGEList(counts=dat,group=cond_idx)
y <- calcNormFactors(y)
count_norm = cpm(y)
```
Next we apply `Clipper` to preprocessed gene expression matrices. Log-transformation is recommended because empirical evidence indicates that log-transformed read counts are more likely to satisfy Clipper's assumption for FDR control than read counts without log-transformation.
```{r, eval = F}
##### apply clipper to log2-transformed matrices with target FDR = 0.05
re_clipper = Clipper(score.exp = log(base = 2, count_norm[,1:r1] + 1),
score.back = log(base = 2, count_norm[,-(1:r1)] + 1),
FDR = 0.05,
analysis = "differential")
head(re_clipper$discoveries[[1]]) ###### identified DEG indices at FDR = 0.05
# 676 2173 3103 3105 3484 3490
```
## DIR identification from Hi-C data
To demonstrate the use of `Clipper` to identify differential chromosomal interaction regions, we start with interaction matrices obtained from Hi-C data. Because Hi-C interaction matrices are often stored as sparse matrices, we need to format these matrices into the right form. Suppose we have two interaction matrices under the experimental condition and the another two under the background condition.
```{r, eval = F}
exp1 = readRDS(url(paste0(dataurl, 'interactionMatrix_exp1.rds')))
exp2 = readRDS(url(paste0(dataurl, 'interactionMatrix_exp2.rds')))
back1 = readRDS(url(paste0(dataurl, 'interactionMatrix_back1.rds')))
back2 = readRDS(url(paste0(dataurl, 'interactionMatrix_back2.rds')))
head(exp1)
# V1 V2
# 1:1 0e+00 0e+00 63537
# 1:2 0e+00 1e+06 48305
# 2:2 1e+06 1e+06 366193
# 1:3 0e+00 2e+06 4195
# 2:3 1e+06 2e+06 58452
# 3:3 2e+06 2e+06 344527
```
`exp1` is a interaction matrix of resolution $10^6$ base pairs. Each row represents the interaction intensity between one pair of regions: the first integer indicates the starting point of a region, the second integer indicates the starting point of another region, and the third numeric indicates the read counts between these two regions. Only pairs of regions with a non-zero read count are explicitly coded in this matrix; all other pairs have zero read counts.
`Clipper` treats pairs of chromosome regions as features to call DIRs. Therefore, it is important to make sure to match features when we generate input to `score.exp` and `score.back`.
```{r, eval = F}
### Check if features match
all(exp1[, 1:2] == exp2[, 1:2])
all(exp2[, 1:2] == back1[, 1:2])
all(back1[, 1:2] == back2[, 1:2])
### features are matched!
```
If not matched, users can use the following code to extract shared features and match them.
```{r, eval = F}
### generate a replicate
exp2_sim = exp2[sample(1:nrow(exp2), size = 0.9*nrow(exp2),replace = F),]
ls_m = list(exp1, exp2_sim, back1, back2)
features = lapply(ls_m, function(x){
feat = paste0(x[,1],':', x[,2])
# rownames(x) = feat
return(feat)
})
features_shared = Reduce("intersect",features)
ls_m_shared = mapply(function(m,f){
m[match(features_shared, f),]
}, m = ls_m, f = features, SIMPLIFY = F)
### ls_m_shared is a list that contains interaction matrices with matched features
```
Next we apply `Clipper` by collecting replicates under the experimental condition and the background condition.
```{r, eval = F}
re <- Clipper(score.exp = log(cbind(exp1[,3], exp2[,3])),
score.back = log(cbind(back1[,3], back2[,3])),
FDR = c(0.01, 0.05, 0.1),
analysis = "differential")
head(exp1[re$discoveries[[2]] ,1:2]) ## identified DIRs
# V1 V2
# 47:112 4.6e+07 1.11e+08
# 48:112 4.7e+07 1.11e+08
# 49:112 4.8e+07 1.11e+08
# 50:112 4.9e+07 1.11e+08
# 51:112 5.0e+07 1.11e+08
# 52:112 5.1e+07 1.11e+08
```
## Peak calling
Here we demonstrate how to apply `Clipper` as an add-on to MACS, a popular peak calling method, for a better FDR control. Similar workflows also apply if users prefer other peak calling method such as HOMER.
###Extract input data from existing peak calling software MACS
To implement `Clipper` for peak calling, users need to extract two outputs from MACS: a "$*$_peaks.narrowPeak" file and two "$*$_pileup.dbg" files, one for the experimental track and the other for the control track. The "$*$_peaks.narrowPeak" file contains candidate peaks identified by MACS with its second and third rows indicating the start and end points of the peaks. The "$*$_pileup.dbg" files contain the base-pair read coverages for genomic regions of interest. See the following code chunk for a glance of these files. See the end of this section for command line codes we used to generate the "$*$_peaks.narrowPeak" file and the two "$*$_pileup.dbg" files. In the following example, we use a synthetic ChIP-seq sequence as the experiemntal sample and a real control sample as background sample. The "$*$_pileup.dbg" files for these two samples are: `'experimental_treat_pileup.bdg'` and `'control_treat_pileup.bdg'`. The peaks called by MACS by contrasting these two samples are in `'twosample_peaks.narrowPeak'`.
```{r, eval = F}
experimental <- read.table(url(paste0(dataurl, 'experimental_treat_pileup.bdg')))
head(experimental)
# V1 V2 V3 V4
# 1 chr1 0 9852 0
# 2 chr1 9852 9913 1
# 3 chr1 9913 10150 2
# 4 chr1 10150 10175 1
# 5 chr1 10175 10211 2
# 6 chr1 10211 10256 1
control <- read.table(url(paste0(dataurl, 'control_treat_pileup.bdg')))
head(control)
# V1 V2 V3 V4
# 1 chr1 0 9811 0
# 2 chr1 9811 9819 1
# 3 chr1 9819 9896 2
# 4 chr1 9896 9947 3
# 5 chr1 9947 9997 4
# 6 chr1 9997 10106 5
macs2.peak <- read.table(url(paste0(dataurl, 'twosample_peaks.narrowPeak')))
head(macs2.peak)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 chr1 904144 905821 twosample_peak_1 155 . 9.98306 18.60009 15.51687 425
# 2 chr1 940135 943865 twosample_peak_2 475 . 21.38286 51.57176 47.52760 1770
# 3 chr1 1058992 1060607 twosample_peak_3 218 . 12.25136 25.12054 21.81555 402
# 4 chr1 1247855 1249009 twosample_peak_4 155 . 9.98306 18.60009 15.51687 278
# 5 chr1 1344798 1345386 twosample_peak_5 63 . 6.46653 9.08008 6.38249 250
# 6 chr1 1430427 1432027 twosample_peak_6 130 . 10.31231 16.01712 13.03906 1322
```
### Use Clipper to screen candidate peaks
We use the following codes to generate the input of `score.exp` and `score.back` from the two "$*$_pileup.dbg" files. The resulting vectors `s1` and `s2` contains the read coverages for each base pair. Then we supply `s1` to `score.exp`, `s2` to `score.back`, and use `Clipper` to perform enrichment analysis to obtain a threshold on per-base-pair contrast scores.
```{r, eval=F}
# create two vectors of read coverage, one for the experimental track and the other for the background back
s1 <- rep(experimental$V4, experimental$V3- experimental$V2 )
s2 <- rep(control$V4, control$V3- control$V2)
s1[(length(s1)+1):length(s2)] <- 0
# use Clipper
re <- Clipper(score.exp = matrix(s1, ncol = 1),
score.back = matrix(s2, ncol = 1),
analysis = "enrichment")
# the threshold on contrast scores
re$contrast.score.thre
# 6
```
We then use this threshold to screen the candidate peaks identified by MACS. We compare the median of per-base-pair contrast scores within each candidate peak with the threshold output by `Clipper`. Only peaks whose median per-base-pair contrast score is greater or equal to the threshold are identified as peaks. We don't directly use the discoveries output by `Clipper` becasuse they are base-pair features and do not form biologically meaningful peaks. Instead, we use better-defined peak regions by MACS.
```{r, eval=F}
### compute the median of per-base-pair contrast scores within each candidate peak
median.peak <- sapply(1:nrow(macs2.peak), function(i){
# start point of the candidate peak
start <- macs2.peak$V2[i]+1
# end point of the candiate peak
end <- macs2.peak$V3[i]
# the different contrast score is the difference between the two condiitons
return(median(s1[start:end] -(s2[start:end])))
})
# peaks identified by Clipper
clipper.peak <- macs2.peak[median.peak >= re$contrast.score.thre,]
```
### Generation of MACS output files
We start from two files applicable for MACS peak calling, such as two bam files: `experimental.bam` for experimental condition and `control.bam` for background/negative control condition. We use the following shell scripts to obtain the "$*$_peaks.narrowPeak" file and two "$*$_pileup.dbg" files. See links for tutorials of MACS.
```{bash, eval = F}
macs2 callpeak -t experimental.bam -c control.bam -f BAM -n twosample -B -q 1 --outdir results
macs2 callpeak -t experimental.bam -f BAM -n exp -B -q 1 --outdir results
macs2 callpeak -t control.bam -f BAM -n back -B -q 1 --outdir results
```
## Peptide identification as an add-on of a database search algorithm
The enrichment functionality of `Clipper` also allows it to control the FDR of peptide identification as an add-on to database search engines. Here we demonstrate this function using an example of Mascot, a popular database search method for peptide identification. Similar workflows also apply if users prefer other database search methods such as SEQUEST, Byonic and etc.
First we load a peptide-spectrum match (PSM) level sample output from Mascot with the target-decoy search strategy:
```{r, eval = F}
library(openxlsx)
temp = tempfile(fileext = ".xlsx")
dataURL <- paste0(dataurl, 'Archaea%20Mascot%20Targets%20100_FDR.xlsx')
download.file(dataURL, destfile=temp, mode='wb')
dat = read.xlsx(temp)
dataURL <- paste0(dataurl, 'Archaea%20Mascot%20DECOYS%20100_FDR.xlsx')
download.file(dataURL, destfile=temp, mode='wb')
dat_decoy = read.xlsx(temp)
# str(dat)
```
To apply `Clipper` to control FDR of identified PSMs, we treat spectra as features; given a spetrum, we treat its q-value from the target search as a measurement under the experimental condition and its q-value from the decoy search as a measurement under the background condition. Prior to applying Clipper, we use the following code to format search results:
```{r, eval = F}
### generate features, psms, and scores from the target search
query = paste( dat[, "Spectrum.File" ], dat[, "ScanNum"], sep = ':')
match = paste( dat[, "Spectrum.File"], dat[, "ScanNum"], dat[,"Sequence"],sep = ':')
score = -log(dat[, "Percolator.q-Value"] + 0.01)
dat = cbind.data.frame(query = query, match = match, score = score, stringsAsFactors = F)
### generate features, psms, and scores from the decoy search
query_decoy = paste(dat_decoy[, "Spectrum.File"], dat_decoy[, "ScanNum"],sep = ':')
match_decoy = paste(dat_decoy[, "Spectrum.File"], dat_decoy[, "ScanNum"],
dat_decoy[,"Sequence"],sep = ':')
score_decoy = -log(dat_decoy[, "Percolator.q-Value"] + 0.01)
### get rid of the false decoy matches
query_decoy = query_decoy[!match_decoy %in% match]
match_decoy = match_decoy[!match_decoy %in% match]
score_decoy = score_decoy[!match_decoy %in% match]
### re-order decoy queries and scores to match the target output
score_decoy = score_decoy[match(dat$query, query_decoy)]
match_decoy = match_decoy[match(dat$query, query_decoy)]
query_decoy = query_decoy[match(dat$query, query_decoy)]
### check percentage shared
mean(query_decoy %in% query)
length(score_decoy)
# 0.8422165
```
Our goal is to identify spectra whose q-value from the target search is smaller than that from the decoy search; thus we set `analysis` to be `"enrichment"`. Because a smaller q-value is considered better, we feed $-$log-transformed q-values into `Clipper`.
```{r, eval = F}
re = Clipper(score.exp = score,
score.back = score_decoy,
FDR = 0.01,
analysis = 'enrichment')
head(match[re$discoveries[[1]]]) ### identified matches in the form of filename:spectrum scan number: peptide sequence
# [1] "am190328_006.raw:1572:KYEESGKPR" "am190328_006.raw:1720:RADSMISDEKER"
# [3] "am190328_005.raw:1784:MMVEVAK" "am190328_006.raw:1807:YPESNYMHK"
# [5] "am190328_005.raw:1830:MDMVNYNQK" "am190328_006.raw:1852:MDMVNYNQK"
```