scRNA-seq/05-dimension_reduction_scRNA-seq.Rmd

---
title: "scRNA-seq Dimension Reduction"
output:   
  html_notebook: 
    toc: true
    toc_float: true
---

**CCDL 2020**

In this notebook, we'll try out some dimension reduction techniques on single-cell RNA-seq data. 

Visualizing highly dimensional data is a common challenge in genomics, and especially with RNA-seq data.
The expression of every gene we look at is another dimension describing a sample.
When we also have hundreds or thousands of individual samples, as in the case of single-cell analysis, figuring out how to clearly display all of the data in a meaningful way is difficult. 

A common practice is to common to use dimension reduction techniques so all of the data is in a more manageable form for plotting, clustering, and other downstream analyses. 

## Set Up 

```{r setup}
# Load libraries
library(ggplot2)
library(scater)
library(scran)

# Magrittr pipe
library(magrittr)

# Setting the seed for reproducibility
set.seed(12345)
```

For this module, we are going to use the dataset that we processed part of in 
`04-tag-based_scRNA-seq_processing.Rmd`

## Directories and files

```{r filepaths}
# main data directory
data_dir <- file.path("data", "tabula-muris")

# Path to the single-sample alevin results
alevin_file <- file.path(data_dir, "alevin-quant", 
                         "10X_P4_3", "alevin", "quants_mat.gz")

# normalized multi-sample Tabula Muris Data
tm_normalized_file <- file.path(data_dir, "normalized",
                                "TM_normalized.rds" )

# Path to the metadata file 
metadata_file <- file.path(data_dir, 
                           "TM_droplet_metadata.csv")

# Mitochondrial gene table
mito_file <- file.path(data_dir,
                       "mm_mitochondrial_genes.tsv")
```

## Single sample QC and normalization

```{r tximport_sce}
tm_txi <- tximport::tximport(alevin_file, type = "alevin")
tm_sce <- SingleCellExperiment(list(counts = tm_txi$counts))
```

```{r read_metadata}
tm_metadata <- readr::read_csv(metadata_file, guess_max = 10000) %>%
  dplyr::filter(channel == "10X_P4_3")
```

As we did before with the Smart-seq2 data, we will want to filter and normalize this data with `scran/scater`.
While we did the filtering manually before, we can actually do most of the same analyses with `scran`.

First we want to get our mitochondrial genes list again.

```{r mitogenes}
mito_genes <- readr::read_tsv(mito_file) %>%
  dplyr::filter(gene_id %in% rownames(tm_sce)) %>%
  dplyr::pull(gene_id)
```

### Quality Control and Filtering

Now we can use `scater::addPerCellQC()` to calculate our QC statistics and add them to the metadata for each cell (the `colData` for the `SingleCellExperiment` object.) 
We are adding the `subsets` argument, which allows us to specify a named list of gene subsets to calculate extra statistics for. 
These might be spike-in controls, mitochondrial genes, rRNA, or whatever class of genes you would like. 
The name of each subset is used in the statistic labels, so keep them short. 
Here we will calculate statistics for only the mitochondrial genes, and we have given that set the name `mito`.

```{r calculateQC}
tm_sce <- scater::addPerCellQC(
  tm_sce, 
  # a list of named gene subsets that we want stats for
  # here we are using mitochondrial genes
  subsets = list(mito = mito_genes)) 

cell_qc_df <- as.data.frame(colData(tm_sce)) 
```

What QC measures did `addPerCellQC()` calculate?

```{r peek_QC, live = TRUE}
# look at the column names of cell_qc_df
colnames(cell_qc_df )
```

Let's make some plots of the QC data and use it to inform our next steps.
Use the `cell_qc_df` data frame we just created, to make some plots and decide on cutoffs!

```{r qc_plots, live = TRUE}
# Plot some aspect of the QC data stored in `cell_qc_df`
ggplot(cell_qc_df, 
       aes(x = sum, y = detected, color = subsets_mito_percent)) +
  geom_point() +
  scale_color_viridis_c() +
  scale_x_log10() +
  scale_y_log10() +
  geom_hline(yintercept = 200, color = "red")
```

```{r qc_plot2, live = TRUE}
# make another QC plot
ggplot(cell_qc_df, 
       aes(x = subsets_mito_percent)) +
  geom_histogram() + 
  geom_vline(xintercept = 20, color = "red")
```

Looking at the data, it seems like a reasonable cutoff might be 500 detected genes, and a mitochondiral percentage less than 20%

Using these QC data, we can now create a subset of the `SingleCellExperiment` object, keeping only the columns (samples) that pass our QC thresholds:

```{r qcfilter, live = TRUE}
# create tm_sce_filtered object with just the cells that pass filters
qc_pass <- (tm_sce$detected > 500) & (tm_sce$subsets_mito_percent < 20)
tm_sce_filtered <- tm_sce[, qc_pass]
```

We can also filter by features (genes in our case) using `scater::addPerFeatureQC()` which will computer the number of samples where each gene is detected and the mean count across all genes. 
We can then use those data (stored in `rowData`) to filter by row to only the genes that are detected in at least 5 cells, and with a mean count > 0.1.

```{r gene_qc}
tm_sce_filtered <- scater::addPerFeatureQC(tm_sce_filtered)
detected <- rowData(tm_sce_filtered)$detected > 5
expressed <- rowData(tm_sce_filtered)$mean > 0.1

# filter the genes (rows) this time
tm_sce_filtered <- tm_sce_filtered[detected & expressed, ]
```

### Normalize

Now we will perform the same normalization steps we did with the Smart-seq2 data, using `scran::computeSumFactors()` and `scater::logNormCounts()`

```{r normalize}
# Cluster similar cells
qclust <- scran::quickCluster(tm_sce_filtered)

# Compute sum factors for each cell cluster grouping.  
tm_sce_filtered <- scran::computeSumFactors(tm_sce_filtered, clusters = qclust)

# Normalize and log transform. 
tm_sce_filtered <- scater::logNormCounts(tm_sce_filtered)
```


![**Status of the current dataset**](diagrams/tag-based_3.png)

## Principal Components Analysis

The first dimensionality reduction algorithm we will use is PCA, which we have already used!
The idea of PCA is that it finds the axes which explain the greatest amount of variance in the data.


```{r PCA}
# Run PCA
pca_tab_mur <- calculatePCA(tm_sce_filtered)

# Make a dataframe with these PCA scores
pca_tab_mur <- data.frame(pca_tab_mur)
```

```{r Plot PCA}
# Plot this with ggplot2 and label the points with cell types
ggplot(pca_tab_mur, 
       aes(x = PC1, y = PC2)) + 
  geom_point() 
```

It looks like there are a few different clusters here, which may represent different kinds of cells within the bladder tissue.
If we wanted to, we might look at marker genes within these clusters to try to determine which kinds of cells are in each cluster from this tissue. 
This is exactly what the *Tabula Muris* group ended up doing, and you can see some of their results along that front at https://tabula-muris.ds.czbiohub.org/.


## Beyond PCA: UMAP

### More from the *Tabula Muris* Dataset

To explore more about dimensionality reduction, we will explore a larger subset of the data from the *Tabula Muris* study. 
We have run alevin on 15 samples, representing 9 different tissues, and taken each of those samples through QC and normalization similar to what we have presented above (with slightly different cutoffs).
Following that, we selected the 1000 genes with the greatest biological variance (using the same method as shown in `02-normalizing_scRNA-seq.Rmd`).

These results, with accompanying metadata were saved into a `SingleCellExperiment` object, which is stored as RDS file at the location `tm_normalized_file` (defined in the `filepaths` chunk above).
Let's load that in and have a quick look at it. 

```{r load_normalized}
tm_normalized <- readr::read_rds(tm_normalized_file)
```

How many cells are in this data? How many samples are there? How many cells from each tissue?

```{r explore_sce, live = TRUE}
# explore summaries of the SCE data
dim(tm_normalized) # column count is the number of cells

unique(tm_normalized$sample_id) # what are the samples

table(tm_normalized$tissue) # table of cells per tissue

as.data.frame(colData(tm_normalized)) %>% # dplyr summary table
  dplyr::group_by(tissue) %>%
  dplyr::summarize(n = dplyr::n(), 
                   mean_detected = mean(detected), 
                   max_detected = max(detected))
```
### PCA baseline

Before we move on to UMAP, let's calculate the principle components for this full data set, and see what it can show us.

In the chunk below, we calculate the PCs and plot the first two, colored by tissue, much as we did before.

```{r pca}
# calculate PCA
tm_pca <- scater::calculatePCA(tm_normalized)

# combine PCA results with metadata stored in sce
tm_meta_pca <- colData(tm_normalized) %>%
  as.data.frame() %>%
  dplyr::bind_cols(data.frame(tm_pca))

# plot 1st two PC with ggplot
ggplot(data.frame(tm_meta_pca), 
       aes(x = PC1, y = PC2, color = tissue)) + 
  geom_point(alpha = 0.1, size = 0.1) +
  scale_color_brewer(palette = "Set1") + # a not terrible color scheme for colorblindness
  # fix the legend to show opaque, large points
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))   
```

There is definitely some structure there, but the different tissues are certainly not well separated. 
Newer, fancier techniques may allow us to do better!

### UMAP

**UMAP** (Uniform Manifold Approximation and Projection) is a machine learning technique designed to provide more detail in highly dimensional data than a typical principal components analysis. 
While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. 
The underlying mathematics are beyond me, but if you are more ambitious than I, you can look at the paper by [McInnes, Healy, & Melville (2018)](https://arxiv.org/abs/1802.03426). 
The main advantage of this change in underlying assumptions is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others.  

Another dimentionality reduction technique that you may have heard of is **t-SNE** (t-distributed Stochastic Neighbor Embedding), which has similar properties to UMAP, and often produces similar results. 
There is some ongoing debate about which of these two techniques is superior, and whether the differences are due to the underlying algorithm or to implementation and parameter initialization defaults. 
Regardless of why, in our experience, UMAP seems to produce slightly better results and run a bit faster, but the differences can be subtle.


### Default parameters

For ease of use with this data, we will be using the `scater::calculateUMAP()` function to apply UMAP to our single cell data, but similar functions the `umap` package (notably `umap::umap()`) can be used to apply UMAP to any numerical matrix.

UMAP can be slow for a large data set with lots of parameters.
It is worth noting that the `scater::calculateUMAP()` implementation actually does PCA first, and then runs UMAP on the top 50 PCs. 

`scater::calculateUMAP()` will return a matrix of results, with one row for each sample, and a column for each of the UMAP dimensions returned. 
Let's see how it looks with the default parameters:

```{r run_umap, live = TRUE}
# run calculateUMAP and save to tm_umap
tm_umap <- scater::calculateUMAP(tm_normalized)
```

Now let's convert that matrix result to a data frame and plot the results!

```{r plot_umap}
# ggplot tm_umap
ggplot(data.frame(tm_umap), aes(x = X1, y = X2)) + 
  geom_point(alpha = 0.1, size = 0.1)
```

There is clearly a lot of structure in there, but is it meaningful?

Does it correspond to tissues, sex of the mouse, individual samples? Or is it all just random?

Let's add in the metadata we have stored in the `SingleCellExperiment` data (colData) and replot it.
There is a bit of conversion we have to do here to start, because `SingleCellExperiment` likes to use DataFrames

```{r umap_metadata}
tm_meta_umap <- colData(tm_normalized) %>%
  as.data.frame() %>%
  dplyr::bind_cols(data.frame(tm_umap)) %>% # add umap columns to data frame
  dplyr::rename(UMAP1 = "X1", UMAP2 = "X2") # rename for meaning!
```

```{r plot_tissue, live = TRUE}
# plot UMAP with colors by tissue
ggplot(tm_meta_umap, 
       aes(x = UMAP1, y = UMAP2, color = tissue)) + 
  geom_point(alpha = 0.1, size = 0.1) +
  scale_color_brewer(palette = "Set1") + 
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
```


## UMAP experiments

Now that we have an idea of what a UMAP plot with the default parameters looks like, let's try experimenting with the `n_neighbors` parameter. 
First, we should see what this parameter is, and what the default value is.
In the console, run `?scater::calculateUMAP` to see what this (and other parameters) are.
For even more parameters, you can look at the underlying implementation code that `calculateUMAP()` uses, which is the function `uwot::umap()`

In order to make our experimentation easier, we will create a *function* that allows us to rerun the same code chunk easily, but create an argument that allows us to change one variable: the n_neighbors variable. 

```{r UMAP-function}
UMAP_plot_wrapper <- function(sce = tm_normalized, nn_param = 15) {
  # Purpose: Run UMAP and plot the output
  # Args: nn_param: a single numeric argument that will change the 
  #                 n_neighbors variable in the calculateUMAP() function. 
  # Output: a ggplot scatterplot with the two UMAP coordinates plotted and 
  #         cell-types labeled with data point colors. 

  # Run UMAP with a specified n_neighbors parameter
  
  umap_mat <- scater::calculateUMAP(sce, n_neighbors = nn_param)

  meta_umap <- colData(sce) %>%
    as.data.frame() %>%
    dplyr::bind_cols(data.frame(umap_mat)) %>% # add umap columns to data frame
    dplyr::rename(UMAP1 = "X1", UMAP2 = "X2") # rename for meaning!
  
  # Plot this with ggplot2
  ggplot(data.frame(meta_umap), 
         aes(x = UMAP1, y = UMAP2, color = tissue)) + 
    geom_point(alpha = 0.1, size = 0.1) +
    scale_color_brewer(palette = "Set1") +
    guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))
}
```

Let's make sure that works and gives the same result as before when we use the default parameters.

```{r function-test}
UMAP_plot_wrapper(nn_param = 15)
```

*Kind of?*

This isn't your fault! 
UMAP is a non-deterministic function, which means that there is a random component to the results. 
We can use `set.seed()` to be sure that an individual run (or set of runs) is the same every time you run your analysis, but it is important to check your results a few times with different random starting points to be sure that the random component is not giving you anomalous results.
Setting a different random number seed with `set.seed()` is one way to do this, or you can run the analysis multiple times in the same session, as we have done here.

Fill in the next few code chunks with the function and the `n_neighbors` argument you would like to use for each. 
(Feel free to add more tests!)
Then run the chunks and compare your output graphs.

```{r run-UMAP-1, live = TRUE}
# Try something low?
UMAP_plot_wrapper(nn_param = 3)
```

```{r run-UMAP-2, live = TRUE}
# Try something high?
UMAP_plot_wrapper(nn_param = 100)
```

```{r run-UMAP-3, live = TRUE}
# Try whatever you like!
UMAP_plot_wrapper(nn_param = 5)
```

#### Some 'big picture' thoughts to take from this experiment: 

1. Analyses (such as UMAP) have various limitations for interpretability. 
The coordinates of UMAP output for any given cell can change dramatically depending on parameters, and even run to run with the same parameters. 
This probably means that you shouldn't rely too heavily on the exact values of UMAP's output. 

    - One particular limitation of UMAP and t-SNE is that while observed clusters have some meaning, the distance *between* clusters usually does not (nor does cluster density). 
    The fact that two clusters are near each other should NOT be interpreted to mean that they are more related to each other than to more distant clusters. 
    (There is some disagreement about whether UMAP distances have more meaning, but it is probably safer to assume they don't.)
    
2. Different analyses also have their strengths. 
Using cell-type labeling, our experiment illustrated that UMAP does appear to give some biologically relevant output information for this dataset. 

3. Playing with parameters so you can fine-tune them is a good way to give you more information about a particular analysis as well as the data itself. 
   
In summary, if the results of an analysis can be completely changed by changing its parameters, you should be more cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose. 

### t-SNE comparison

In the block below is a similar analysis and plot with t-SNE (t-distributed Stochastic Neighbor Embedding).
Note that this analysis also does PCA before moving on to the fancy machine learning.

```{r}
tm_tsne <- scater::calculateTSNE(tm_normalized)

tm_meta_tsne <- colData(tm_normalized) %>%
  as.data.frame() %>%
  dplyr::bind_cols(data.frame(tm_tsne)) %>% # add tsne columns to data frame
  dplyr::rename(TSNE1 = "X1", TSNE2 = "X2") # rename for meaning!

ggplot(data.frame(tm_meta_tsne), 
       aes(x = TSNE1, y = TSNE2, color = tissue)) + 
  geom_point(alpha = 0.1, size = 0.1) +
  scale_color_brewer(palette = "Set1") +
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))      
```

Different! (Slower!) Is it better or worse? Hard to say!
Different people like different things, and one plot might illustrate a particular point better than another. 

### Some further reading on dimension reduction:  

- This website explains [PCA visually](http://setosa.io/ev/principal-component-analysis/).  
- [Becht *et al.* (2018)](https://www.nature.com/articles/nbt.4314) discusses using [UMAP](https://github.com/lmcinnes/umap) for single-cell data.  
- [Wattenberg *et al.* (2016)](https://distill.pub/2016/misread-tsne/) discuss how to use t-SNE properly with great visuals. 
(The lessons apply to UMAP as well, with a broad substitution of the `n_neighbors` parameter for `perplexity`.)
- [Nguyen & Holmes (2019)](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006907&type=printable) lay out guidelines on choosing dimensions reduction methods.  
- [Freitag (2019)](https://rpubs.com/Saskia/520216) is a nice explanation and comparison of many different dimensionality reduction techniques that you may encounter.


## Session Info:

```{r}
sessionInfo()
```