/
SingleCellExperiment.Rmd
175 lines (132 loc) · 5.84 KB
/
SingleCellExperiment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Run Canek on SingleCellExperiment objects"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Use Canek with Bioconductor}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
In this vignette, we demonstrate how to use `Canek` to correct batch effects from `SingleCellExperiment` objects following the recommendations from [Bioconductor](https://bioconductor.org/books/release/OSCA/) multi-sample analysis.
The toy data sets used in this vignette are included in the `Canek` R package.
```{r, include = FALSE}
knitr::opts_chunk$set(
fig.width = 7,
fig.align = "center",
collapse = TRUE,
comment = "#>"
)
```
## Init
```{r setup, message=FALSE}
library(Canek)
library(SingleCellExperiment)
library(scater)
library(batchelor)
library(scran)
library(ggplot2)
library(patchwork)
```
## Create SingleCellExperiment objects
```{r}
#Create independent SingleCellExperiment objects
sce <- lapply(names(Canek::SimBatches$batches), function(batch) {
counts <- SimBatches$batches[[batch]] #get the counts
SingleCellExperiment(list(counts=counts), #create the sce object
mainExpName = "example_sce")
})
#Add batch labels
names_batch <- c("B1", "B2")
names(sce) <- names_batch
sce$B1[["batch"]] <- "B1"
sce$B2[["batch"]] <- "B2"
#Add cell-type labels (included in Canek's package)
celltypes <- Canek::SimBatches$cell_types
ct_B1_idx <- 1:ncol(sce$B1)
ct_B2_idx <- (ncol(sce$B1)+1):length(celltypes)
sce$B1[["celltype"]] <- celltypes[ct_B1_idx]
sce$B2[["celltype"]] <- celltypes[ct_B2_idx]
```
Let's check the cells distribution across batches.
```{r}
table(sce$B1$batch)
table(sce$B2$batch)
```
## Pre-processing
In the pre-processing steps, we perform the normalization of counts and the variance analysis of genes independently for each batch. Then, we find common variable genes to use in downstream analyses.
### Normalization
We perform scaling normalization within each batch using the `multiBatchNorm` function from [batchelor](https://bioconductor.org/packages/release/bioc/html/batchelor.html) Bioconductor package.
```{r}
sce <- batchelor::multiBatchNorm(sce[["B1"]], sce[["B2"]])
names(sce) <- names_batch
```
### Feature selection
We use the `combineVar` function from [scran](https://bioconductor.org/packages/release/bioc/html/scran.html) Bioconductor package. This function receives independent variance models of genes for each of the batches. This allows us to select highly variable genes while preserving within-batch differences.
We found 273 combined variable features.
```{r}
#Get independent features' variant models
sce_var <- lapply(X = sce, FUN = scran::modelGeneVar)
#Combine the independent variables
combined_features <- scran::combineVar(sce_var[["B1"]], sce_var[["B2"]])
#Select features using the `bio` statistics (see combineVar documentation for further details)
combined_features <- combined_features$bio > 0
sum(combined_features)
```
### Visualization before batch effects correction
Then, we merge the two datasets and use the combined variable features for the Principal Components Analysis.
```{r}
uncorrected <- cbind(sce[["B1"]], sce[["B2"]])
uncorrected <- scater::runPCA(x = uncorrected[combined_features,], scale = TRUE)
```
Let's check out the elbow plot of the PCs.
```{r}
pca_uncorrected <- SingleCellExperiment::reducedDim(x = uncorrected, type = "PCA")
variance <- apply(X = pca_uncorrected, MARGIN = 2, FUN = var)
plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")
```
The variance stabilizes after the first seven PCs. In this test, we will use 10 PCs to calculate the UMAP visualization, but feel free to try other numbers.
```{r}
set.seed(777)
uncorrected <- scater::runUMAP(x = uncorrected, dimred = "PCA", pca = 10)
```
We can now visualize the uncorrected data by batch and cell-type labels.
```{r}
p1 <- scater::plotUMAP(object = uncorrected, colour_by="batch")
p2 <- scater::plotUMAP(object = uncorrected, colour_by="celltype")
p1 + p2
```
We can observe that the cells labeled as *Cell Type 1* got divided into two groups that correlated with batch labels. Let's minimize this batch differences with Canek.
## Correcting batch effects with Canek
Canek accepts a `SingleCellExperiments` object with a `batch` identifier. We can pass a vector of variable features to use in the integration.
```{r}
features <- rownames(uncorrected)[combined_features]
#RunCanek in a list of SingleCellExperiment objects
#corrected <- Canek::RunCanek(x = sce, features = features)
#RunCanek in a single object with a batch identifier
corrected <- Canek::RunCanek(x = uncorrected, batches = "batch", features = features)
```
### Visualization after batch effects correction with Canek
To perform PCA it's important to use the corrected log counts. These are saved in the assay `Canek` in the `SingleCellExperiment` object and can be specified by changing the `exprs_values` parameter.
```{r}
corrected <- scater::runPCA(x = corrected, scale = TRUE, exprs_values = "Canek")
```
Let's check out the elbow plot after correction.
```{r}
pca_corrected <- SingleCellExperiment::reducedDim(x = corrected, type = "PCA")
variance <- apply(X = pca_corrected, MARGIN = 2, FUN = var)
plot(x = 1:30, y = variance[1:30], xlab = "Principal components", ylab = "Variance")
```
After correction, the elbow plot has changed and 5 PCs will be enough to calculate the UMAP visualization.
```{r}
set.seed(777)
corrected <- scater::runUMAP(x = corrected, dimred = "PCA", pca = 5)
```
We can now visualize the corrected data by batch and cell-type labels. We can observe that after batch-correction with Canek, we minimize the batch difference in cells from *Cell Type 1* while preserving the other cell types.
```{r}
p1 <- scater::plotUMAP(object = corrected, colour_by="batch")
p2 <- scater::plotUMAP(object = corrected, colour_by="celltype")
p1 + p2
```
## Session info
```{r}
sessionInfo()
```