/
05-dimension_reduction_scRNA-seq.Rmd
436 lines (317 loc) · 18.3 KB
/
05-dimension_reduction_scRNA-seq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
---
title: "scRNA-seq Dimension Reduction"
output:
html_notebook:
toc: true
toc_float: true
---
**CCDL 2020**
In this notebook, we'll try out some dimension reduction techniques on single-cell RNA-seq data.
Visualizing highly dimensional data is a common challenge in genomics, and especially with RNA-seq data.
The expression of every gene we look at is another dimension describing a sample.
When we also have hundreds or thousands of individual samples, as in the case of single-cell analysis, figuring out how to clearly display all of the data in a meaningful way is difficult.
A common practice is to common to use dimension reduction techniques so all of the data is in a more manageable form for plotting, clustering, and other downstream analyses.
## Set Up
```{r setup}
# Load libraries
library(ggplot2)
library(scater)
library(scran)
# Magrittr pipe
library(magrittr)
# Setting the seed for reproducibility
set.seed(12345)
```
For this module, we are going to use the dataset that we processed part of in
`04-tag-based_scRNA-seq_processing.Rmd`
## Directories and files
```{r filepaths}
# main data directory
data_dir <- file.path("data", "tabula-muris")
# Path to the single-sample alevin results
alevin_file <- file.path(data_dir, "alevin-quant",
"10X_P4_3", "alevin", "quants_mat.gz")
# normalized multi-sample Tabula Muris Data
tm_normalized_file <- file.path(data_dir, "normalized",
"TM_normalized.rds" )
# Path to the metadata file
metadata_file <- file.path(data_dir,
"TM_droplet_metadata.csv")
# Mitochondrial gene table
mito_file <- file.path(data_dir,
"mm_mitochondrial_genes.tsv")
```
## Single sample QC and normalization
```{r tximport_sce}
tm_txi <- tximport::tximport(alevin_file, type = "alevin")
tm_sce <- SingleCellExperiment(list(counts = tm_txi$counts))
```
```{r read_metadata}
tm_metadata <- readr::read_csv(metadata_file, guess_max = 10000) %>%
dplyr::filter(channel == "10X_P4_3")
```
As we did before with the Smart-seq2 data, we will want to filter and normalize this data with `scran/scater`.
While we did the filtering manually before, we can actually do most of the same analyses with `scran`.
First we want to get our mitochondrial genes list again.
```{r mitogenes}
mito_genes <- readr::read_tsv(mito_file) %>%
dplyr::filter(gene_id %in% rownames(tm_sce)) %>%
dplyr::pull(gene_id)
```
### Quality Control and Filtering
Now we can use `scater::addPerCellQC()` to calculate our QC statistics and add them to the metadata for each cell (the `colData` for the `SingleCellExperiment` object.)
We are adding the `subsets` argument, which allows us to specify a named list of gene subsets to calculate extra statistics for.
These might be spike-in controls, mitochondrial genes, rRNA, or whatever class of genes you would like.
The name of each subset is used in the statistic labels, so keep them short.
Here we will calculate statistics for only the mitochondrial genes, and we have given that set the name `mito`.
```{r calculateQC}
tm_sce <- scater::addPerCellQC(
tm_sce,
# a list of named gene subsets that we want stats for
# here we are using mitochondrial genes
subsets = list(mito = mito_genes))
cell_qc_df <- as.data.frame(colData(tm_sce))
```
What QC measures did `addPerCellQC()` calculate?
```{r peek_QC, live = TRUE}
# look at the column names of cell_qc_df
colnames(cell_qc_df )
```
Let's make some plots of the QC data and use it to inform our next steps.
Use the `cell_qc_df` data frame we just created, to make some plots and decide on cutoffs!
```{r qc_plots, live = TRUE}
# Plot some aspect of the QC data stored in `cell_qc_df`
ggplot(cell_qc_df,
aes(x = sum, y = detected, color = subsets_mito_percent)) +
geom_point() +
scale_color_viridis_c() +
scale_x_log10() +
scale_y_log10() +
geom_hline(yintercept = 200, color = "red")
```
```{r qc_plot2, live = TRUE}
# make another QC plot
ggplot(cell_qc_df,
aes(x = subsets_mito_percent)) +
geom_histogram() +
geom_vline(xintercept = 20, color = "red")
```
Looking at the data, it seems like a reasonable cutoff might be 500 detected genes, and a mitochondiral percentage less than 20%
Using these QC data, we can now create a subset of the `SingleCellExperiment` object, keeping only the columns (samples) that pass our QC thresholds:
```{r qcfilter, live = TRUE}
# create tm_sce_filtered object with just the cells that pass filters
qc_pass <- (tm_sce$detected > 500) & (tm_sce$subsets_mito_percent < 20)
tm_sce_filtered <- tm_sce[, qc_pass]
```
We can also filter by features (genes in our case) using `scater::addPerFeatureQC()` which will computer the number of samples where each gene is detected and the mean count across all genes.
We can then use those data (stored in `rowData`) to filter by row to only the genes that are detected in at least 5 cells, and with a mean count > 0.1.
```{r gene_qc}
tm_sce_filtered <- scater::addPerFeatureQC(tm_sce_filtered)
detected <- rowData(tm_sce_filtered)$detected > 5
expressed <- rowData(tm_sce_filtered)$mean > 0.1
# filter the genes (rows) this time
tm_sce_filtered <- tm_sce_filtered[detected & expressed, ]
```
### Normalize
Now we will perform the same normalization steps we did with the Smart-seq2 data, using `scran::computeSumFactors()` and `scater::logNormCounts()`
```{r normalize}
# Cluster similar cells
qclust <- scran::quickCluster(tm_sce_filtered)
# Compute sum factors for each cell cluster grouping.
tm_sce_filtered <- scran::computeSumFactors(tm_sce_filtered, clusters = qclust)
# Normalize and log transform.
tm_sce_filtered <- scater::logNormCounts(tm_sce_filtered)
```
![**Status of the current dataset**](diagrams/tag-based_3.png)
## Principal Components Analysis
The first dimensionality reduction algorithm we will use is PCA, which we have already used!
The idea of PCA is that it finds the axes which explain the greatest amount of variance in the data.
```{r PCA}
# Run PCA
pca_tab_mur <- calculatePCA(tm_sce_filtered)
# Make a dataframe with these PCA scores
pca_tab_mur <- data.frame(pca_tab_mur)
```
```{r Plot PCA}
# Plot this with ggplot2 and label the points with cell types
ggplot(pca_tab_mur,
aes(x = PC1, y = PC2)) +
geom_point()
```
It looks like there are a few different clusters here, which may represent different kinds of cells within the bladder tissue.
If we wanted to, we might look at marker genes within these clusters to try to determine which kinds of cells are in each cluster from this tissue.
This is exactly what the *Tabula Muris* group ended up doing, and you can see some of their results along that front at https://tabula-muris.ds.czbiohub.org/.
## Beyond PCA: UMAP
### More from the *Tabula Muris* Dataset
To explore more about dimensionality reduction, we will explore a larger subset of the data from the *Tabula Muris* study.
We have run alevin on 15 samples, representing 9 different tissues, and taken each of those samples through QC and normalization similar to what we have presented above (with slightly different cutoffs).
Following that, we selected the 1000 genes with the greatest biological variance (using the same method as shown in `02-normalizing_scRNA-seq.Rmd`).
These results, with accompanying metadata were saved into a `SingleCellExperiment` object, which is stored as RDS file at the location `tm_normalized_file` (defined in the `filepaths` chunk above).
Let's load that in and have a quick look at it.
```{r load_normalized}
tm_normalized <- readr::read_rds(tm_normalized_file)
```
How many cells are in this data? How many samples are there? How many cells from each tissue?
```{r explore_sce, live = TRUE}
# explore summaries of the SCE data
dim(tm_normalized) # column count is the number of cells
unique(tm_normalized$sample_id) # what are the samples
table(tm_normalized$tissue) # table of cells per tissue
as.data.frame(colData(tm_normalized)) %>% # dplyr summary table
dplyr::group_by(tissue) %>%
dplyr::summarize(n = dplyr::n(),
mean_detected = mean(detected),
max_detected = max(detected))
```
### PCA baseline
Before we move on to UMAP, let's calculate the principle components for this full data set, and see what it can show us.
In the chunk below, we calculate the PCs and plot the first two, colored by tissue, much as we did before.
```{r pca}
# calculate PCA
tm_pca <- scater::calculatePCA(tm_normalized)
# combine PCA results with metadata stored in sce
tm_meta_pca <- colData(tm_normalized) %>%
as.data.frame() %>%
dplyr::bind_cols(data.frame(tm_pca))
# plot 1st two PC with ggplot
ggplot(data.frame(tm_meta_pca),
aes(x = PC1, y = PC2, color = tissue)) +
geom_point(alpha = 0.1, size = 0.1) +
scale_color_brewer(palette = "Set1") + # a not terrible color scheme for colorblindness
# fix the legend to show opaque, large points
guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
```
There is definitely some structure there, but the different tissues are certainly not well separated.
Newer, fancier techniques may allow us to do better!
### UMAP
**UMAP** (Uniform Manifold Approximation and Projection) is a machine learning technique designed to provide more detail in highly dimensional data than a typical principal components analysis.
While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data.
The underlying mathematics are beyond me, but if you are more ambitious than I, you can look at the paper by [McInnes, Healy, & Melville (2018)](https://arxiv.org/abs/1802.03426).
The main advantage of this change in underlying assumptions is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others.
Another dimentionality reduction technique that you may have heard of is **t-SNE** (t-distributed Stochastic Neighbor Embedding), which has similar properties to UMAP, and often produces similar results.
There is some ongoing debate about which of these two techniques is superior, and whether the differences are due to the underlying algorithm or to implementation and parameter initialization defaults.
Regardless of why, in our experience, UMAP seems to produce slightly better results and run a bit faster, but the differences can be subtle.
### Default parameters
For ease of use with this data, we will be using the `scater::calculateUMAP()` function to apply UMAP to our single cell data, but similar functions the `umap` package (notably `umap::umap()`) can be used to apply UMAP to any numerical matrix.
UMAP can be slow for a large data set with lots of parameters.
It is worth noting that the `scater::calculateUMAP()` implementation actually does PCA first, and then runs UMAP on the top 50 PCs.
`scater::calculateUMAP()` will return a matrix of results, with one row for each sample, and a column for each of the UMAP dimensions returned.
Let's see how it looks with the default parameters:
```{r run_umap, live = TRUE}
# run calculateUMAP and save to tm_umap
tm_umap <- scater::calculateUMAP(tm_normalized)
```
Now let's convert that matrix result to a data frame and plot the results!
```{r plot_umap}
# ggplot tm_umap
ggplot(data.frame(tm_umap), aes(x = X1, y = X2)) +
geom_point(alpha = 0.1, size = 0.1)
```
There is clearly a lot of structure in there, but is it meaningful?
Does it correspond to tissues, sex of the mouse, individual samples? Or is it all just random?
Let's add in the metadata we have stored in the `SingleCellExperiment` data (colData) and replot it.
There is a bit of conversion we have to do here to start, because `SingleCellExperiment` likes to use DataFrames
```{r umap_metadata}
tm_meta_umap <- colData(tm_normalized) %>%
as.data.frame() %>%
dplyr::bind_cols(data.frame(tm_umap)) %>% # add umap columns to data frame
dplyr::rename(UMAP1 = "X1", UMAP2 = "X2") # rename for meaning!
```
```{r plot_tissue, live = TRUE}
# plot UMAP with colors by tissue
ggplot(tm_meta_umap,
aes(x = UMAP1, y = UMAP2, color = tissue)) +
geom_point(alpha = 0.1, size = 0.1) +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
```
## UMAP experiments
Now that we have an idea of what a UMAP plot with the default parameters looks like, let's try experimenting with the `n_neighbors` parameter.
First, we should see what this parameter is, and what the default value is.
In the console, run `?scater::calculateUMAP` to see what this (and other parameters) are.
For even more parameters, you can look at the underlying implementation code that `calculateUMAP()` uses, which is the function `uwot::umap()`
In order to make our experimentation easier, we will create a *function* that allows us to rerun the same code chunk easily, but create an argument that allows us to change one variable: the n_neighbors variable.
```{r UMAP-function}
UMAP_plot_wrapper <- function(sce = tm_normalized, nn_param = 15) {
# Purpose: Run UMAP and plot the output
# Args: nn_param: a single numeric argument that will change the
# n_neighbors variable in the calculateUMAP() function.
# Output: a ggplot scatterplot with the two UMAP coordinates plotted and
# cell-types labeled with data point colors.
# Run UMAP with a specified n_neighbors parameter
umap_mat <- scater::calculateUMAP(sce, n_neighbors = nn_param)
meta_umap <- colData(sce) %>%
as.data.frame() %>%
dplyr::bind_cols(data.frame(umap_mat)) %>% # add umap columns to data frame
dplyr::rename(UMAP1 = "X1", UMAP2 = "X2") # rename for meaning!
# Plot this with ggplot2
ggplot(data.frame(meta_umap),
aes(x = UMAP1, y = UMAP2, color = tissue)) +
geom_point(alpha = 0.1, size = 0.1) +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
}
```
Let's make sure that works and gives the same result as before when we use the default parameters.
```{r function-test}
UMAP_plot_wrapper(nn_param = 15)
```
*Kind of?*
This isn't your fault!
UMAP is a non-deterministic function, which means that there is a random component to the results.
We can use `set.seed()` to be sure that an individual run (or set of runs) is the same every time you run your analysis, but it is important to check your results a few times with different random starting points to be sure that the random component is not giving you anomalous results.
Setting a different random number seed with `set.seed()` is one way to do this, or you can run the analysis multiple times in the same session, as we have done here.
Fill in the next few code chunks with the function and the `n_neighbors` argument you would like to use for each.
(Feel free to add more tests!)
Then run the chunks and compare your output graphs.
```{r run-UMAP-1, live = TRUE}
# Try something low?
UMAP_plot_wrapper(nn_param = 3)
```
```{r run-UMAP-2, live = TRUE}
# Try something high?
UMAP_plot_wrapper(nn_param = 100)
```
```{r run-UMAP-3, live = TRUE}
# Try whatever you like!
UMAP_plot_wrapper(nn_param = 5)
```
#### Some 'big picture' thoughts to take from this experiment:
1. Analyses (such as UMAP) have various limitations for interpretability.
The coordinates of UMAP output for any given cell can change dramatically depending on parameters, and even run to run with the same parameters.
This probably means that you shouldn't rely too heavily on the exact values of UMAP's output.
- One particular limitation of UMAP and t-SNE is that while observed clusters have some meaning, the distance *between* clusters usually does not (nor does cluster density).
The fact that two clusters are near each other should NOT be interpreted to mean that they are more related to each other than to more distant clusters.
(There is some disagreement about whether UMAP distances have more meaning, but it is probably safer to assume they don't.)
2. Different analyses also have their strengths.
Using cell-type labeling, our experiment illustrated that UMAP does appear to give some biologically relevant output information for this dataset.
3. Playing with parameters so you can fine-tune them is a good way to give you more information about a particular analysis as well as the data itself.
In summary, if the results of an analysis can be completely changed by changing its parameters, you should be more cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose.
### t-SNE comparison
In the block below is a similar analysis and plot with t-SNE (t-distributed Stochastic Neighbor Embedding).
Note that this analysis also does PCA before moving on to the fancy machine learning.
```{r}
tm_tsne <- scater::calculateTSNE(tm_normalized)
tm_meta_tsne <- colData(tm_normalized) %>%
as.data.frame() %>%
dplyr::bind_cols(data.frame(tm_tsne)) %>% # add tsne columns to data frame
dplyr::rename(TSNE1 = "X1", TSNE2 = "X2") # rename for meaning!
ggplot(data.frame(tm_meta_tsne),
aes(x = TSNE1, y = TSNE2, color = tissue)) +
geom_point(alpha = 0.1, size = 0.1) +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
```
Different! (Slower!) Is it better or worse? Hard to say!
Different people like different things, and one plot might illustrate a particular point better than another.
### Some further reading on dimension reduction:
- This website explains [PCA visually](http://setosa.io/ev/principal-component-analysis/).
- [Becht *et al.* (2018)](https://www.nature.com/articles/nbt.4314) discusses using [UMAP](https://github.com/lmcinnes/umap) for single-cell data.
- [Wattenberg *et al.* (2016)](https://distill.pub/2016/misread-tsne/) discuss how to use t-SNE properly with great visuals.
(The lessons apply to UMAP as well, with a broad substitution of the `n_neighbors` parameter for `perplexity`.)
- [Nguyen & Holmes (2019)](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006907&type=printable) lay out guidelines on choosing dimensions reduction methods.
- [Freitag (2019)](https://rpubs.com/Saskia/520216) is a nice explanation and comparison of many different dimensionality reduction techniques that you may encounter.
## Session Info:
```{r}
sessionInfo()
```