/
scMerge.Rmd
459 lines (316 loc) · 16.9 KB
/
scMerge.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
---
title: "An introduction to the scMerge package"
author:
- name: Yingxin Lin
affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Kevin Y.X. Wang
affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
output:
BiocStyle::html_document:
toc_float: true
package: BiocStyle
vignette: >
%\VignetteIndexEntry{scMerge}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
The scMerge algorithm allows batch effect removal and normalisation for single cell RNA-Seq data. It comprises of three key components including:
1. The identification of stably expressed genes (SEGs) as "negative controls" for estimating unwanted factors;
2. The construction of pseudo-replicates to estimate the effects of unwanted factors; and
3. The adjustment of the datasets with unwanted variation using a fastRUVIII model.
The purpose of this vignette is to illustrate some uses of `scMerge` and explain its key components.
# Loading Packages and Data
We will load the `scMerge` package. We designed our package to be consistent with the popular BioConductor's single cell analysis framework, namely the `SingleCellExperiment` and `scater` package.
```{r loading packages, warning = FALSE, message = FALSE}
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(scMerge)
library(scater)
})
```
We provided an illustrative mouse embryonic stem cell (mESC) data in our package, as well as a set of pre-computed stably expressed gene (SEG) list to be used as negative control genes.
The full curated, unnormalised mESC data can be found [here](http://www.maths.usyd.edu.au/u/yingxinl/wwwnb/scMergeData/sce_mESC.rda). The `scMerge` package comes with a sub-sampled, two-batches version of this data (named "batch2" and "batch3" to be consistent with the full data) .
```{r subsampling scMergeData, eval = FALSE, echo = FALSE}
library(genefilter)
load("~/Downloads/sce_mESC.rda")
data("segList_ensemblGeneID", package = "scMerge")
set.seed(2019)
example_sce = sce_mESC[, sce_mESC$batch %in% c("batch2", "batch3")]
example_sce$batch = droplevels(example_sce$batch)
batch2Sampled = sample(colnames(example_sce[,example_sce$batch == "batch2"]), 100)
batch3Sampled = sample(colnames(example_sce[,example_sce$batch == "batch3"]), 100)
countsMat = SingleCellExperiment::counts(example_sce)
batchTest = rowFtests(countsMat, fac = example_sce$batch)
celltypeTest = rowFtests(countsMat, fac = factor(example_sce$cellTypes))
commonSegGenes = intersect(segList_ensemblGeneID$mouse$mouse_scSEG, rownames(sce_mESC))
keepGenes = unique(c(commonSegGenes,
rownames(batchTest)[rank(batchTest$p.value) < 50],
rownames(celltypeTest)[rank(celltypeTest$p.value) < 250]
))
example_sce = example_sce[keepGenes, c(batch2Sampled, batch3Sampled)]
example_sce = example_sce[base::rowSums(counts(example_sce)) != 0, base::colSums(counts(example_sce)) != 0]
table(example_sce$batch,
example_sce$cellTypes)
dim(example_sce)
example_sce = runPCA(example_sce, exprs_values = "logcounts")
scater::plotPCA(example_sce,
colour_by = "cellTypes",
shape_by = "batch")
save(example_sce,
file = "data/example_sce.rda")
```
```{r loading data}
## Subsetted mouse ESC data
data("example_sce", package = "scMerge")
```
In this mESC data, we pooled data from 2 different batches from three different cell types. Using a PCA plot, we can see that despite strong separation of cell types, there is also a strong separation due to batch effects. This information is stored in the `colData` of `example_sce`.
```{r checking raw data}
example_sce = runPCA(example_sce, exprs_values = "logcounts")
scater::plotPCA(
example_sce,
colour_by = "cellTypes",
shape_by = "batch")
```
# Illustrating pseudo-replicates constructions
The first major component of `scMerge` is to obtain negative controls for our normalisation. In this vignette, we will be using a set of pre-computed SEGs from a single cell mouse data made available through the `segList_ensemblGeneID` data in our package. For more information about the selection of negative controls and SEGs, please see Section [select SEGs](#selectnc).
```{r load SEG}
## single-cell stably expressed gene list
data("segList_ensemblGeneID", package = "scMerge")
head(segList_ensemblGeneID$mouse$mouse_scSEG)
```
The second major component of `scMerge` is to compute pseudo-replicates for cells so we can perform normalisation. We offer three major ways of computing this pseudo-replicate information:
1. Unsupervised clustering, using k-means clustering;
2. Supervised clustering, using known cell type information; and
3. Semi-supervised clustering, using partially known cell type information.
# Unsupervised `scMerge`
In unsupervised `scMerge`, we will perform a k-means clustering to obtain pseudo-replicates. This requires the users to supply a `kmeansK` vector with each element indicating number of clusters in each of the batches. For example, we know "batch2" and "batch3" both contain three cell types. Hence, `kmeansK = c(3, 3)` in this case.
```{r t1, echo = FALSE}
t1 = Sys.time()
```
```{r unsupervised_default, results='hide',fig.show='hide'}
scMerge_unsupervised <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_unsupervised")
```
```{r t2, echo = FALSE}
t2 = Sys.time()
```
We now colour construct the PCA plot again on our normalised data. We can observe a much better separation by cell type and less separation by batches.
```{r unsupervised_default_plotting}
scMerge_unsupervised = runPCA(scMerge_unsupervised, exprs_values = "scMerge_unsupervised")
scater::plotPCA(
scMerge_unsupervised,
colour_by = "cellTypes",
shape_by = "batch")
```
<!-- ##Selecting 80% of cells -->
<!-- ```{r results='hide',fig.show='hide'} -->
<!-- system.time(sce_mESC <- scMerge(sce_mESC, -->
<!-- ctl = segList_ensemblGeneID$mouse$mouse_scSEG, -->
<!-- kmeansK = c(1,3,3,1,1), -->
<!-- assay_name = "scMerge_unsupervised_80", -->
<!-- replicate_prop = 0.8)) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- sce_mESC <- runPCA(sce_mESC, exprs_values = "scMerge_unsupervised_80") -->
<!-- scater::plotPCA(sce_mESC, colour_by="cellTypes",shape_by="batch") -->
<!-- ``` -->
# Selecting all cells
By default, `scMerge` only uses 50% of the cells to perform kmeans clustering. While this is sufficient to perform a satisfactory normalisation in most cases, users can control if they wish all cells be used in the kmeans clustering.
```{r unsupervised_prop1, results='hide',fig.show='hide'}
scMerge_unsupervised_all <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_unsupervised_all",
replicate_prop = 1)
```
```{r unsupervised_prop1_plotting}
scMerge_unsupervised_all = runPCA(scMerge_unsupervised_all,
exprs_values = "scMerge_unsupervised_all")
scater::plotPCA(
scMerge_unsupervised_all,
colour_by = "cellTypes",
shape_by = "batch")
```
<!-- ##Selecting 20% of cells -->
<!-- ```{r results='hide',fig.show='hide'} -->
<!-- system.time(sce_mESC <- scMerge(sce_mESC, -->
<!-- ctl = segList_ensemblGeneID$mouse$mouse_scSEG, -->
<!-- kmeansK = c(1,3,3,1,1), -->
<!-- assay_name = "scMerge_unsupervised_50", -->
<!-- replicate_prop = 0.2)) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- sce_mESC <- runPCA(sce_mESC, exprs_values = "scMerge_unsupervised_50") -->
<!-- scater::plotPCA(sce_mESC, colour_by="cellTypes",shape_by="batch") -->
<!-- ``` -->
# Supervised `scMerge`
If **all** cell type information is available to the user, then it is possible to use this information to create pseudo-replicates. This can be done through the `cell_type` argument in the `scMerge` function.
```{r supervised, results='hide',fig.show='hide'}
scMerge_supervised <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_supervised",
cell_type = example_sce$cellTypes)
```
```{r supervised_plotting}
scMerge_supervised = runPCA(scMerge_supervised,
exprs_values = "scMerge_supervised")
scater::plotPCA(
scMerge_supervised,
colour_by = "cellTypes",
shape_by = "batch")
```
# Semi-supervised scMerge I
If the user is only able to access **partial** cell type information, then it is still possible to use this information to create pseudo-replicates. This can be done through the `cell_type` and `cell_type_inc` arguments in the `scMerge` function. `cell_type_inc` should contain a vector of indices indicating which elements in the `cell_type` vector should be used to perform semi-supervised scMerge.
```{r semi_supervised1, results='hide',fig.show='hide'}
scMerge_semisupervised1 <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3,3),
assay_name = "scMerge_semisupervised1",
cell_type = example_sce$cellTypes,
cell_type_inc = which(example_sce$cellTypes == "2i"),
cell_type_match = FALSE)
```
```{r semi_supervised1_plotting}
scMerge_semisupervised1 = runPCA(scMerge_semisupervised1,
exprs_values = "scMerge_semisupervised1")
scater::plotPCA(
scMerge_semisupervised1,
colour_by = "cellTypes",
shape_by = "batch")
```
# Semi-supervised scMerge II
<!-- Perform scMerge using known cell type information to identify mutual nearest cluster -->
There is alternative semi-supervised method to create pseudo-replicates for `scMerge`. This uses known cell type information to identify mutual nearest clusters and it is achieved via the `cell_type` and `cell_type_match = TRUE` options in the `scMerge` function.
```{r semi_supervised2, results='hide',fig.show='hide'}
scMerge_semisupervised2 <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_semisupervised2",
cell_type = example_sce$cellTypes,
cell_type_inc = NULL,
cell_type_match = TRUE)
```
```{r semi_supervised2_plotting}
scMerge_semisupervised2 = runPCA(scMerge_semisupervised2,
exprs_values = "scMerge_semisupervised2")
scater::plotPCA(
scMerge_semisupervised2,
colour_by = "cellTypes",
shape_by = "batch")
```
# Selecting negative controls {#selectnc}
In simple terms, a negative control is a gene that has expression values relatively constant across these datasets. The concept of using these negative control genes for normalisation was most widely used in the RUV method family (e.g. [Gagnon-Bartsch & Speed (2012)](https://academic.oup.com/biostatistics/article/13/3/539/248166) and [Risso et. al. (2014)](https://www.nature.com/articles/nbt.2931)) and there exist multiple methods to find these negative controls. In our paper, we recommened the SEGs as negative controls for scRNA-Seq data and SEGs can be found using either a data-adaptive computational method or external knowledge.
+ Computation method: We provide the function `scSEGIndex` to calculate the SEG from a data matrix. The output of this function is a `data.frame` with a SEG index calculated for each gene. See [Lin et. al. (2018)](https://www.biorxiv.org/content/10.1101/229815v2) for more details.
```{r segIndex1, eval = FALSE}
exprs_mat = SummarizedExperiment::assay(example_sce, 'counts')
result = scSEGIndex(exprs_mat = exprs_mat)
```
+ External knowledge: We have applied the SEG computational methodology on multiple human and mouse scRNA-Seq data and made these available as data objects in our package. The end-users can simply use these pre-computed results. There are also additional negative controls from bulk microarray and bulkd RNA-Seq data.
```{r segIndex2}
## SEG list in ensemblGene ID
data("segList_ensemblGeneID", package = "scMerge")
## SEG list in official gene symbols
data("segList", package = "scMerge")
## SEG list for human scRNA-Seq data
head(segList$human$human_scSEG)
## SEG list for human bulk microarray data
head(segList$human$bulkMicroarrayHK)
## SEG list for human bulk RNASeq data
head(segList$human$bulkRNAseqHK)
```
# Achieving fast and memory-efficient computation
## Using approximated SVD
Under most circumstances, `scMerge` is fast enough to be used on a personal laptop for a moderately large data. However, we do recognise the difficulties associated with computation when dealing with larger data. To this end, we devised a fast version of `scMerge`. The major difference between the two versions lies on the noise estimation component, which utilised singular value decomposition (SVD). In order to speed up `scMerge`, we used `BiocSingular` package that offers several SVD speed improvements. This computational method is able to speed up `scMerge` by obtain a very accurate approximation of the noise structure in the data. This option is achieved via the option `BSPARAM = IrlbaParam()` or `BSPARAM = RandomParam()`. Additionally, `svd_k` is a parameter that controlling the degree of approximations.
We recommend using this option in the case where the number of cells is large in your single cell data. The speed advantage we obtain for large single cell data is much more dramatic than on a smaller dataset like the example mESC data. For example, a single run of normal `scMerge` on a human [pancreas data](https://sydneybiox.github.io/scMerge/articles/Pancreas4_Data/Pancreas4_Data.html) (23699 features and 4566 cells) takes about 10 minutes whereas the speed up version takes just under 4 minutes.
```{r t3, echo = FALSE}
t3 = Sys.time()
```
```{r computation_fast, results='hide',fig.show='hide'}
library(BiocSingular)
scMerge_fast <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_fast",
BSPARAM = IrlbaParam(),
svd_k = 20)
```
```{r t4, echo = FALSE}
t4 = Sys.time()
```
```{r computation_svd_plotting}
paste("Normally, scMerge takes ", round(t2 - t1, 2), " seconds")
paste("Fast version of scMerge takes ", round(t4 - t3, 2), " seconds")
scMerge_fast = runPCA(scMerge_fast, exprs_values = "scMerge_fast")
scater::plotPCA(
scMerge_fast,
colour_by = "cellTypes",
shape_by = "batch") +
labs(title = "fast scMerge yields similar results to the default version")
```
## Parallelised computing
`scMerge` is implemented with a parallelised computational option via the [BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package. You can enable this option using the `BPPARAM` argument with various `BiocParallelParam` objects that is suitable for your operating system.
Please note that any parallelisation would incur a small overhead. Hence we recommend you do not use parallelisation for small data.
```{r parallel1, eval = FALSE}
library(BiocParallel)
scMerge_parallel <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_parallel",
BPPARAM = MulticoreParam(workers = 2)
)
```
## Sparse array
`scMerge` also supports sparse array input, which could be very helpful in speeding up computations and saving RAM. `scMerge` does not perform internal matrix conversion, so you may use the following codes as an example of converting typical `matrix` class to sparse matrices before running `scMerge`.
```{r, eval = FALSE}
library(Matrix)
library(DelayedArray)
sparse_input = example_sce
assay(sparse_input, "counts") = as(counts(sparse_input), "dgeMatrix")
assay(sparse_input, "logcounts") = as(logcounts(sparse_input), "dgeMatrix")
scMerge_sparse = scMerge(
sce_combine = sparse_input,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_sparse")
```
## Out-of-memory computations (through `HDF5Array`)
Bioconductor provides an infrastructure for out-of-memory computation through `HDF5Array`. In simple terms, we can load an on-disk data into RAM, make computations and write to hard disk. This is particularly helpful when the data is too large for in-RAM computations. You may use the following codes as an example of converting typical `matrix` class to `HDF5Array` matrices before running `scMerge`.
```{r, eval = FALSE}
library(HDF5Array)
library(DelayedArray)
DelayedArray:::set_verbose_block_processing(TRUE) ## To monitor block processing
hdf5_input = example_sce
assay(hdf5_input, "counts") = as(counts(hdf5_input), "HDF5Array")
assay(hdf5_input, "logcounts") = as(logcounts(hdf5_input), "HDF5Array")
scMerge_hdf5 = scMerge(
sce_combine = sparse_input,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_hdf5")
```
# Reference
Please check out our paper for detailed analysis and results on multiple scRNA-Seq data. https://doi.org/10.1073/pnas.1820006116.
```{r reference}
citation("scMerge")
```
# Session Info
```{r session info}
sessionInfo()
```