-
Notifications
You must be signed in to change notification settings - Fork 0
/
Cell_type_deconvolution_and_differential_analysis.Rmd
284 lines (219 loc) · 10.9 KB
/
Cell_type_deconvolution_and_differential_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "Cell type deconvolution and differential analysis"
output:
rmarkdown::html_vignette
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
vignette: >
%\VignetteIndexEntry{Cell type deconvolution and differential analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE,
error = FALSE,
fig.align = "center"
)
```
For this tutorial, we will firstly construct an integrated reference using single-cell RNA-seq (scRNA-seq) and single-nuclei RNA-seq (snRNA-seq) data from normal breast tissue samples.
Then, we will simulate artificial bulk RNA-seq samples using the constructed reference data. Predefined cell-type proportions will be used to introduce heterogeneity between groups for the simulated bulk samples.
We then use `OLS` method to deconvolute the bulk samples and predict cell-type proportions for each sample. Differential expression and pathway enrichment analysis will then be performed to identify perturbed genes and pathways. We will also be examining the effect of correcting for cell-type proportion differences on those downstream analysis.
# Load required libraries
```{r}
library(SCdeconR)
library(plotly)
# register backend for parallel processing
library(doFuture)
registerDoFuture()
plan("multisession", workers = 4)
```
# Construct integrated reference data
```{r}
# path to reference datasets
ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"),
paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2"))
# path to phenodata files
phenodata_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt"),
paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt"))
# construct integrated reference using harmony algorithm
refdata <- construct_ref(ref_list = ref_list,
phenodata_list = phenodata_list,
data_type = "cellranger",
method = "harmony",
group_var = "subjectid",
nfeature_rna = 50,
vars_to_regress = "percent_mt", verbose = FALSE
)
refdata
```
Here we set `nfeature_rna` to be 50 due to limited number of cells we included in the data. You might want to apply a higher cutoff for your own reference data.
<details>
<summary>**what does refdata looks like**</summary>
```{r}
# refdata is a seurat object with those metadata information.
# use ?refdata for further documentation
str(refdata@meta.data)
```
</details>
# Simulate artificial bulk RNA-seq samples based on integrated reference
Ideally, you should split the integrated reference data into training and testing, and use the training data to generate artificial bulk samples. The testing data is used in deconvolution step. However, for the sake of this tutorial, due to the small # of cells we included in the `refdata`, we will not split the data into training and testing.
```{r}
# create phenodata
phenodata <- data.frame(cellid = colnames(refdata),
celltypes = refdata$celltype,
subjectid = refdata$subjectid)
# Here is the number of cells per cell-type in refdata
table(refdata$celltype)
```
Next we generate two sets of bulk data that have different cell-type proportions.
```{r}
# equal proportions across cell-types
prop1 <- data.frame(celltypes = unique(refdata$celltype),
proportion = rep(0.125, 8))
# generate 20 artificial bulk samples with the above cell-type proportions
bulk_sim1 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata,
num_mixtures = 20,
prop = prop1,
replace = TRUE)
# high proportion of epithelial cells
prop2 <- data.frame(celltypes = unique(refdata$celltype),
proportion = c(0.8, 0.1, 0.1, rep(0, 5)))
# generate 20 artificial bulk samples with the above cell-type proportions
bulk_sim2 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata,
num_mixtures = 20,
prop = prop2,
replace = TRUE)
# combine the two datasets
bulk_sim <- list(cbind(bulk_sim1[[1]], bulk_sim2[[1]]),
cbind(bulk_sim1[[2]], bulk_sim2[[2]]))
```
<details>
<summary>**what does simulated bulk data looks like**</summary>
```{r}
# bulk_generator returns a list of two elements.
# the first element is simulated bulk RNA-seq data, with rows representing genes, columns representing samples.
# show first five samples
str(bulk_sim[[1]][,1:5])
```
```{r}
# the second element is cell type proportions used to simulate the bulk RNA-seq data, with rows representing cell types, columns representing samples.
# show first five samples
str(bulk_sim[[2]][,1:5])
```
</details>
# Perform cell-type deconvolution using OLS algorithm
```{r}
# as mentioned ealier, we use the same reference data for deconvolution.
# in practice, we recommend to split your reference data into training and testing, and use testing data for decovolution
decon_res <- scdecon(bulk = bulk_sim[[1]],
ref = GetAssayData(refdata, slot = "data", assay = "SCT"),
phenodata = phenodata,
filter_ref = TRUE,
decon_method = "OLS",
norm_method_sc = "LogNormalize",
norm_method_bulk = "TMM",
trans_method_sc = "none",
trans_method_bulk = "log2",
marker_strategy = "all")
```
<details>
<summary>**what does deconvolution result looks like**</summary>
```{r}
# scdecon returns a list of two elements
# the first element is a data.frame of predicted cell-type proportions, with rows representing cell types, columns representing samples.
str(decon_res[[1]])
```
```{r}
# the second element is a data.frame of fitting errors of the algorithm; first column represents sample names, second column represents RMSEs.
str(decon_res[[2]])
```
</details>
We can then generate a bar plot of predicted cell proportions across samples
```{r, fig.width=8, fig.height=6}
prop_barplot(prop = decon_res[[1]], interactive = FALSE)
```
# Perform differential expression analysis
```{r}
# prepare sampleinfo, group1 and group2 were the two bulk sets we simulated eariler.
sampleinfo <- data.frame(condition = rep(c("group1", "group2"), each =20))
row.names(sampleinfo) <- paste0("sample", 1:nrow(sampleinfo))
# prepare bulk samples
bulk <- bulk_sim[[1]]
# force data to be integers for DE purposes
for(i in 1:ncol(bulk)) storage.mode(bulk[,i]) <- "integer"
colnames(bulk) <- rownames(sampleinfo)
# prepare predicted cell-type proportions
prop = decon_res[[1]]
colnames(prop) <- colnames(bulk)
# perform DEA adjusting cell-type proportion differences
deres <- run_de(bulk = bulk,
prop = prop,
sampleinfo = sampleinfo,
control = "group1",
case = "group2",
de_method = "edgeR")
# perform DEA without adjusting cell-type proportion differences
deres_notadjust <- run_de(bulk = bulk,
prop = NULL,
sampleinfo = sampleinfo,
control = "group1",
case = "group2",
de_method = "edgeR")
```
Next, we can compare the effect of adjusting for cell-type proportion differences
```{r, fig.width=8, fig.height=6}
comparedeg_scatter(results1 = deres[[2]],
results2 = deres_notadjust[[2]],
result_names = c("adjust for cell proportion", "not adjust for cell proportion"),
fc_cutoff = 1.5,
pval_cutoff = 0.05,
pvalflag = TRUE,
interactive = FALSE)
```
As you can see, many false positive differential genes were detected without adjusting for cell proportion differences.
# Gene-set enrichment analysis
We also provide several visualization options for pathway enrichment results generated using [GSEA](https://www.gsea-msigdb.org/gsea/index.jsp) software. Note that those functions are not compatible with the [R implementation](https://github.com/GSEA-MSigDB/GSEA_R) of GSEA. Also remember to run `reformt_gmt()` to reformat the gene-set names before using GSEA.
```{r, eval=FALSE}
reformat_gmt(gmtfile = "/path/to/gmt/file/", outputfile = "/path/to/reformatted/gmt/file/", replace = TRUE)
## here is one example of reformatted gmt file for kegg pathway
gmt <- read_gmt(gmtfile = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/gmtfile/kegg.gmt"))
```
Prepare `.rnk` file.
```{r, eval=FALSE}
prepare_rnk(teststats = deres[[2]], outputfile = "/path/to/rnk/file/")
```
Use the reformatted `.gmt` file and `.rnk` file to run GSEA. You can download GSEA [here](https://www.gsea-msigdb.org/gsea/downloads.jsp). Do not use the R version implementation (our visualization functions are not compatible with GSEA R implementation).
We included GSEA output using differential genes w/wo adjusting for cell-type differences. We can now compare the differences between those results.
```{r, fig.height=6, fig.width=8}
comparegsea_scatter(gseares_path1 = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/ct_adjust/"),
gseares_path2 = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/ct_unadjust/"),
result_names = c("ct_adjust", "ct_unadjust"), nes_cutoff = 1.5, pval_cutoff = 0.1, pvalflag = FALSE, interactive = FALSE)
```
We can then generate summary plot of top enriched gene-sets.
```{r, fig.height=6, fig.width=8}
gsea_sumplot(gseares_path = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/ct_adjust/"),
pos_sel = c("GLYCOSPHINGOLIPID_BIOSYNTHESIS", "FRUCTOSE_AND_MANNOSE_METABOLISM", "HIF-1_SIGNALING_PATHWAY"),
neg_sel = c("P53_SIGNALING_PATHWAY", "RENIN_SECRETION", "LYSOSOME"), pvalflag = FALSE, interactive = FALSE)
```
Heatmap of two selected gene-sets for demonstration.
```{r, fig.height=6, fig.width=8}
gsea_heatmap(normdata = deres[[1]],
teststats = deres[[2]],
gmtfile = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/gmtfile/kegg.gmt"),
numgenes = 200,
gsname_up = "HIF-1_SIGNALING_PATHWAY",
gsname_down = "P53_SIGNALING_PATHWAY",
anncol = sampleinfo,
color = colorRampPalette(c("blue", "white", "red"))(100))
```
Random-walk plot of selected gene-sets
```{r, fig.height=6, fig.width=6}
gsea_rwplot(gseares_path = paste0(system.file("extdata", package = "SCdeconR"), "/gsea/results/ct_adjust/"),
gsname = "HIF-1_SIGNALING_PATHWAY",
class_name = "KEGG")
```