/
03_Organoids_Integration.Rmd
213 lines (166 loc) · 6.22 KB
/
03_Organoids_Integration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
---
title: "Organoids Integration"
output:
workflowr::wflow_html:
code_folding: hide
---
```{r knitr, include = FALSE}
DOCNAME = "03_Organoids_Integration"
knitr::opts_chunk$set(autodep = TRUE,
cache = TRUE,
cache.path = paste0("cache/", DOCNAME, "/"),
cache.comments = FALSE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.width = 10,
fig.height = 8,
message = FALSE,
warning = FALSE)
```
```{r libaries, cache = FALSE}
# scRNA-seq
library("SingleCellExperiment")
library("Seurat")
# Matrices
# Plotting
# Parallel
# Paths
library("here")
# Tidyverse
library("tidyverse")
```
```{r source, cache = FALSE}
```
```{r paths}
org123_path <- here("data/processed/Organoid123_SCE_filtered.Rds")
org4_path <- here("data/processed/Organoid4_SCE_filtered.Rds")
```
Introduction
============
Here we will combine the two batches of organoid samples using Seurat's
integration method.
```{r load-org123, cache.extra = tools::md5sum(org123_path)}
if (file.exists(org123_path)) {
org123_sce <- read_rds(org123_path)
} else {
stop("Organoid123 dataset is missing. ",
"Please run '01_Organoid123_QC.Rmd' first.",
call. = FALSE)
}
```
```{r load-org4, cache.extra = tools::md5sum(org4_path)}
if (file.exists(org4_path)) {
org4_sce <- read_rds(org4_path)
} else {
stop("Organoid4 dataset is missing. ",
"Please run '02_Organoid4_QC.Rmd' first.",
call. = FALSE)
}
```
The datasets are currently in `SingleCellExperiment` format but to use Seurat we
need to convert them to `Seurat` format.
```{r convert}
org123 <- as.seurat(org123_sce)
org4 <- as.seurat(org4_sce)
```
Variable genes
==============
Before we can align the datasets we need to select a set of genes to use. We
do this by looking for highly variable genes.
```{r var-genes}
org123 <- NormalizeData(org123, display.progress = FALSE)
org4 <- NormalizeData(org4, display.progress = FALSE)
org123 <- ScaleData(org123, display.progress = FALSE)
org4 <- ScaleData(org4, display.progress = FALSE)
org123 <- FindVariableGenes(org123, do.plot = FALSE)
org4 <- FindVariableGenes(org4, do.plot = FALSE)
genes.use <- unique(c(org123@var.genes, org4@var.genes))
genes.use <- intersect(genes.use, rownames(org123@scale.data))
genes.use <- intersect(genes.use, rownames(org4@scale.data))
```
Seurat finds `r length(org123@var.genes)` highly variable genes for the first
organoid dataset and `r length(org4@var.genes)` highly variable genes for
the second organoid dataset. We take these genes and use those that are present
in both datasets. This leaves us with `r length(genes.use)` selected genes.
Merging
=======
Before we try to properly align the datasets let's see what happens if we just
naively merge them together. If this works we might not need to do anything
more sophisticated.
```{r merge}
merged <- MergeSeurat(org123, org4,
add.cell.id1 = "Org123", add.cell.id2 = "Org4")
merged <- ScaleData(merged, display.progress = FALSE)
merged <- RunPCA(merged, pc.genes = genes.use, pcs.compute = 20,
do.print = FALSE)
merged <- ProjectPCA(merged, do.print = FALSE)
merged <- RunTSNE(merged, dims.use = 1:20, do.fast = TRUE)
merged@meta.data$DatasetSample <- paste(merged@meta.data$Dataset,
merged@meta.data$Sample, sep = "_")
TSNEPlot(merged, group.by = "DatasetSample")
```
Ideally we would see an even mix of the two datasets here but there is some
pretty clear separation. Let's move on to the alignment process.
Perform CCA
===========
The first step of Seurat's alignment is to perfrom Canoical Correlation Analysis
(CCA). This combines the two datasets in the same multi-dimensional space.
```{r cca, results = "hide"}
orgs <- RunCCA(org123, org4, genes.use = genes.use, num.cc = 50,
add.cell.id1 = "Org123", add.cell.id2 = "Org4")
p1 <- DimPlot(orgs, reduction.use = "cca", group.by = "Dataset",
pt.size = 0.5, do.return = TRUE)
p2 <- VlnPlot(orgs, features.plot = "CC1", group.by = "Dataset",
do.return = TRUE)
plot_grid(p1, p2)
```
Choose CCs
==========
Now we need to choose the dimensions to use for aligning the datasets. This is
similar to choosing principal components for clustering a single dataset. Here
for look for a drop off in correlation strength for each CC.
```{r plot-CCs}
plot <- MetageneBicorPlot(orgs, grouping.var = "Dataset", dims.eval = 1:50,
display.progress = FALSE)
```
We can also look at the genes associated with each of the first nine CCs.
```{r CC-heatmaps, fig.height = 10}
DimHeatmap(object = orgs, reduction.type = "cca", cells.use = 500,
dim.use = 1:9, do.balanced = TRUE)
```
```{r n-dims}
n.dims <- 25
```
Based on these plots we will to use the first `r n.dims` CCs.
Align CCA subspaces
===================
Now we can align the CCA subspaces for the two datasets. This makes the
dimensions more comparable in a way we can use for clustering.
```{r align}
orgs <- AlignSubspace(orgs, reduction.type = "cca",
grouping.var = "Dataset", dims.align = 1:n.dims,
verbose = FALSE)
p1 <- VlnPlot(object = orgs, features.plot = "ACC1", group.by = "Dataset",
do.return = TRUE)
p2 <- VlnPlot(object = orgs, features.plot = "ACC2", group.by = "Dataset",
do.return = TRUE)
plot_grid(p1, p2)
```
```{r align-tSNE}
orgs <- RunTSNE(orgs, reduction.use = "cca.aligned", dims.use = 1:20)
orgs@meta.data$DatasetSample<- paste(orgs@meta.data$Dataset,
orgs@meta.data$Sample,
sep = "_")
TSNEPlot(orgs, pt.size = 0.5, group.by = "DatasetSample")
```
This looks to have worked well, there is no obvious separation between the two
datasets.
This combined dataset of `r nrow(orgs@scale.data)` genes and
`r ncol(orgs@scale.data)` cells is the starting point for our clustering
analysis.
```{r write-combined}
write_rds(orgs, here("data/processed/Organoids_Seurat.Rds"))
```
Metadata
========