# scRNA-seq Monocle Tutorial 1

Authors: Krithika Bhuvaneshwar, Yuriy Gusev

Affiliation: Innovation Center For Biomedical Informatics (ICBI), and Biomedical Informatics Shared Resource (BISR) at Georgetown University Medical Center (GUMC)

***More about our research work:***
* *ICBI: https://icbi.georgetown.edu*
* *BISR: https://icbi.georgetown.edu/bisr/ and https://lombardi.georgetown.edu/research/sharedresources/bbsr/*


## General Steps in analysis
Based on the workflow mentioned in the e-book "Orchestrating Single-Cell Analysis with Bioconductor"
* Removal of low-quality cells
* Normalization and log-transformation
* Modeling of the mean-variance trend across genes
* A principal components analysis on the highly variable genes
* Clustering with graph-based methods
* Dimensionality reductions (t-SNE/UMAP)
* Marker detection for each cluster
* Make custom cell selections and detect markers for this selection
* Cell type annotation for each cluster across user selected reference datasets
* Integration or batch correction using MNN correction
* Multi-modal analysis for CITE-seq data
* Perform analysis on subsets (filter based on cell annotation)





##  Before we start

If you are using Google Colab, change the runtime to R (the default is Python): go to Runtime -> Change runtime type and, under "Notebook settings", set the runtime type to "R".

## Monocle software
Monocle software: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

* Monocle 3 is designed for use with absolute transcript counts (e.g. from UMI experiments).
* Monocle 3 works "out-of-the-box" with the transcript count matrices produced by Cell Ranger, the software pipeline for analyzing experiments from the 10X Genomics Chromium instrument.
* Monocle 3 also works well with data from other RNA-Seq workflows such as sci-RNA-Seq and instruments like the Biorad ddSEQ.

### Monocle workflow
* Load scRNA-seq data matrix
* Pre-process: normalize, remove batch effects
* Non-linear dimensionality reduction: t-SNE, UMAP
* Cluster cells
* Compare clusters: identify top markers, targeted contrasts
* Trajectory analysis

## Example using Monocle software in R


### Install monocle package - this can take SEVERAL minutes
Note - if you encounter errors during installation, please refer to this page: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

In [5]:
# monocle3 is not on CRAN, so install.packages("monocle3") fails.
# Install the Bioconductor dependencies first (list taken from the installation
# page linked above; check that page for updates), then monocle3 from GitHub.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("BiocGenerics", "DelayedArray", "DelayedMatrixStats", "limma", "lme4",
                       "S4Vectors", "SingleCellExperiment", "SummarizedExperiment",
                       "batchelor", "HDF5Array", "terra", "ggrastr"))

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("cole-trapnell-lab/monocle3")
library(monocle3)

### Step: load required packages for analysis


In [4]:
library(monocle3)

# The tutorial shown below and on subsequent pages uses two additional packages:
library(ggplot2)
library(dplyr)


## Introduction
Once the raw FASTQ files have been processed, the resulting data can come in several formats:

* MatrixMarket (MTX) format: a sparse matrix format with genes on the rows and cells on the columns. This is what Cell Ranger, the 10X Genomics pipeline, outputs, along with two metadata files (feature information and cell information).
* RDS format: three files, `expression_matrix`, `cell_metadata`, and `gene_metadata` (gene_annotation).

Let us assume the input sparse matrix is in RDS format.

### Step: Load input data
* Input data format: a gene-by-cell expression matrix. Monocle 3 stores the data in an object of the `cell_data_set` class, which is derived from Bioconductor's `SingleCellExperiment` class. It requires three inputs:
* `expression_matrix`, a numeric matrix of expression values, where rows are genes, and columns are cells
* `cell_metadata`, a data frame, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
* `gene_metadata`, a data frame, where rows are features (e.g. genes), and columns are gene attributes, such as biotype, GC content, etc.

Notes - the three inputs must be consistent with one another:
* (a) The expression matrix must have the same number of columns as `cell_metadata` has rows.
* (b) The expression matrix must have the same number of rows as `gene_metadata` has rows.
* (c) Row names of `cell_metadata` should match the column names of the expression matrix.
* (d) Row names of `gene_metadata` should match the row names of the expression matrix.
* (e) One of the columns of `gene_metadata` should be named "gene_short_name"; it holds the gene symbol or simple name for each gene (generally used for plotting).
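
The requirements above can be checked before building the object. The following is a minimal sketch in base R, not part of Monocle 3: `check_cds_inputs()` is a hypothetical helper, and the toy matrix and metadata are made up purely for illustration.

```r
# Hypothetical helper (not part of monocle3): collect any inconsistencies
# between the three inputs before calling new_cell_data_set().
check_cds_inputs <- function(expression_matrix, cell_metadata, gene_metadata) {
  problems <- character(0)
  if (ncol(expression_matrix) != nrow(cell_metadata))
    problems <- c(problems, "matrix columns != cell_metadata rows")
  if (nrow(expression_matrix) != nrow(gene_metadata))
    problems <- c(problems, "matrix rows != gene_metadata rows")
  if (!identical(colnames(expression_matrix), rownames(cell_metadata)))
    problems <- c(problems, "cell names do not match")
  if (!identical(rownames(expression_matrix), rownames(gene_metadata)))
    problems <- c(problems, "gene names do not match")
  if (!"gene_short_name" %in% colnames(gene_metadata))
    problems <- c(problems, "gene_metadata lacks a gene_short_name column")
  problems
}

# Toy data (invented): 3 genes x 2 cells
toy_expr <- matrix(c(0, 2, 1, 0, 5, 3), nrow = 3,
                   dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))
toy_cells <- data.frame(plate = c("p1", "p2"), row.names = c("cellA", "cellB"))
toy_genes <- data.frame(gene_short_name = c("g1", "g2", "g3"),
                        row.names = c("g1", "g2", "g3"))

toy_problems <- check_cds_inputs(toy_expr, toy_cells, toy_genes)
print(toy_problems)  # an empty character vector means all checks passed
```

Returning a vector of messages rather than stopping at the first failure makes it easier to see every mismatch in one pass.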



In this example, we will use the *C. elegans* L2 dataset from Cao & Packer et al., hosted by the Trapnell lab:

In [None]:
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_expression.rds"))

cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_colData.rds"))

gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_rowData.rds"))

Make a new cell_data_set (CDS) object as follows:


In [None]:
# Make the CDS object
cds <- monocle3::new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)

Note - for data from 10X, this is how the data can be imported:
```{r, eval=FALSE}
# Provide the path to the Cell Ranger output.
cds <- load_cellranger_data("input_pbmc3k/filtered_gene_bc_matrices/hg19/")

#OR
cds <- load_mm_data(mat_path = "input_pbmc3k/filtered_gene_bc_matrices/hg19/matrix.mtx",
                    feature_anno_path = "input_pbmc3k/filtered_gene_bc_matrices/hg19/genes.tsv",
                    cell_anno_path = "input_pbmc3k/filtered_gene_bc_matrices/hg19/barcodes.tsv")

# To work with your data in a sparse format, simply provide it to Monocle 3 as a sparse matrix from the Matrix package:

cds <- new_cell_data_set(as(umi_matrix, "sparseMatrix"),
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_metadata)
```

VERY IMPORTANT - do NOT use `as.matrix()` on the sparse matrix object. It converts the sparse matrix into a dense matrix, which can take roughly 20 times more memory than the sparse matrix.
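
To get a feel for the difference, here is a small sketch that compares the memory footprint of a simulated sparse count matrix with its dense equivalent. The matrix size and the 5% density are arbitrary assumptions for illustration; the actual ratio depends on how sparse your data is.

```r
library(Matrix)

set.seed(1)
# Simulated count matrix: 2,000 genes x 500 cells, about 5% non-zero entries
# (sizes and sparsity are arbitrary choices for illustration).
n_genes <- 2000
n_cells <- 500
sparse_counts <- rsparsematrix(n_genes, n_cells, density = 0.05,
                               rand.x = function(n) rpois(n, 2) + 1)

# Dense conversion: acceptable on this small toy matrix, but avoid it on real data.
dense_counts <- as.matrix(sparse_counts)

sparse_bytes <- as.numeric(object.size(sparse_counts))
dense_bytes <- as.numeric(object.size(dense_counts))
cat("sparse:", sparse_bytes, "bytes\n")
cat("dense: ", dense_bytes, "bytes\n")
cat("dense / sparse ratio:", round(dense_bytes / sparse_bytes, 1), "\n")
```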

## Pre-process the data
This step is where you tell Monocle 3 how you want to normalize the data, whether to use Principal Components Analysis (the standard for RNA-seq) or Latent Semantic Indexing (common in ATAC-seq), and how to remove any batch effects. We will just use the standard PCA method in this demonstration. When using PCA, you should specify the number of principal components you want Monocle to compute.

It's a good idea to check that you're using enough PCs to capture most of the variation in gene expression across all the cells in the data set. You can look at the fraction of variation explained by each PC using `plot_pc_variance_explained()`.

For this dataset, using more than 100 PCs would capture only a small amount of additional variation, and each additional PC makes downstream steps in Monocle slower.

### IMPORTANT -
According to the Galaxy documentation, the plot shows that using more than ~100 PCs captures only a small amount of additional variation. However, plotting the cells on a 2D graph for different numbers of PCs makes it easier to see how `num_dim` actually affects the output.
In the Galaxy example, four plots were made, with 145, 165, 200, and 210 PCs.
The graph is different each time.
The Galaxy example used a value of 210, which made the most sense for their dataset.

In [None]:
cds <- preprocess_cds(cds, num_dim = 100)

plot_pc_variance_explained(cds)
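
To make "fraction of variation explained by each PC" concrete, the sketch below computes it by hand with base R's `prcomp()` on simulated data. This is not how Monocle 3 computes its PCA internally; the simulated matrix, the two artificial cell groups, and all sizes are assumptions chosen only for illustration.

```r
set.seed(42)
# Simulated log-expression matrix: 200 cells (rows) x 50 genes (columns).
# Real data would be normalized expression values, not random numbers.
n_cells <- 200
n_genes <- 50
group <- rep(c(0, 3), each = n_cells / 2)   # two artificial cell groups
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells)
expr[, 1:10] <- expr[, 1:10] + group        # first 10 genes differ between groups

pca <- prcomp(expr, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Fraction of variance per PC and the running total for the first 10 PCs
print(round(head(var_explained, 10), 3))
print(round(head(cumsum(var_explained), 10), 3))

# Scree-style plot, similar in spirit to plot_pc_variance_explained()
plot(var_explained, type = "b", xlab = "PC", ylab = "Fraction of variance explained")
```

The point where the curve flattens is a common, if informal, guide for choosing the number of PCs.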

## Reduce dimensionality and visualize the cells
Now we're ready to visualize the cells. To do so, you can use either t-SNE, which is very popular in single-cell RNA-seq, or UMAP, which is increasingly common.

### t-SNE vs UMAP
The main difference between t-SNE and UMAP is how the distance between objects or "clusters" can be interpreted.
* Both t-SNE and UMAP are non-linear reductions.
* UMAP operates on the k-nearest neighbours graph (for some small value of k), similarly to t-SNE.
* t-SNE video: https://www.youtube.com/watch?v=NEaUSP4YerM
* t-SNE preserves local structure in the data. This means that with t-SNE you cannot interpret the distance between clusters A and B at different ends of your plot: you cannot infer that these clusters are more dissimilar than A and C, where C is closer to A in the plot. Within cluster A, however, points close to each other are more similar than points at different ends of the cluster.
* UMAP captures global structure better than t-SNE. With UMAP, you should be able to interpret both the distances between and the positions of points and clusters.
* Both algorithms are highly stochastic and depend heavily on the choice of hyperparameters (t-SNE even more than UMAP), so they can yield very different results in different runs.
* While t-SNE doesn't have much use outside of visualization, UMAP is a general-purpose dimensionality reduction technique that can be used as preprocessing for machine learning.

Monocle 3 uses UMAP by default, as it is both faster and better suited for clustering and trajectory analysis in RNA-seq. To reduce the dimensionality of the data down into the X, Y plane so we can plot it easily, call `reduce_dimension()`:

If you have a relatively large dataset (more than 10,000 cells), you may want to take advantage of options that can accelerate UMAP. Passing `umap.fast_sgd=TRUE` to `reduce_dimension()` will use a fast stochastic gradient descent method inside of UMAP. If your computer has multiple cores, you can use the `cores` argument to make UMAP multithreaded. However, invoking `reduce_dimension()` with either of these options will make it produce slightly different output each time you run it. If this is acceptable to you, you could see significant reductions in the running time of `reduce_dimension()`.

In [None]:
cds2 <- reduce_dimension(cds,
                         reduction_method = "UMAP")
plot_cells(cds2)
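
For large datasets, the two speed-up options described above can be combined. The wrapper below is only a sketch: `reduce_dimension_fast()` and its default of two cores are hypothetical names and values introduced here for illustration, while `umap.fast_sgd` and `cores` are the `reduce_dimension()` arguments mentioned in the text.

```r
# Hypothetical wrapper (not part of monocle3) around reduce_dimension().
# n_cores = 2 is an arbitrary default chosen for illustration.
reduce_dimension_fast <- function(cds, n_cores = 2) {
  monocle3::reduce_dimension(cds,
                             reduction_method = "UMAP",
                             umap.fast_sgd = TRUE,  # fast stochastic gradient descent
                             cores = n_cores)       # multithreaded UMAP
}

# Usage, assuming `cds` has already been through preprocess_cds():
# cds_fast <- reduce_dimension_fast(cds, n_cores = 4)
# plot_cells(cds_fast)
```

Because both options make the output slightly different on each run, this variant is better suited to exploration than to figures that must be reproduced exactly.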

Each point in the plot above represents a different cell in the `cell_data_set` object. As you can see, the cells form many groups, some with thousands of cells, some with only a few. Cao & Packer et al. annotated each cell according to type manually by looking at which genes it expresses.
We can color the cells in the UMAP plot by the authors' original annotations using the `color_cells_by` argument to `plot_cells()`:

In [None]:
plot_cells(cds2, color_cells_by="cao_cell_type")

## Alternate option: t-SNE
If you want, you can also use t-SNE to visualize your data. First, call `reduce_dimension()` with `reduction_method="tSNE"`.
```{r}
cds3 <- reduce_dimension(cds2, reduction_method="tSNE")

#Then, when you call plot_cells(), pass reduction_method="tSNE" to it as well:
plot_cells(cds3, reduction_method="tSNE", color_cells_by="cao_cell_type")
```

## Check for and remove batch effects

If the dataset comprises data from MULTIPLE samples, there is a risk that batch effects will affect the analysis. So always check for batch effects when you perform dimensionality reduction.

You should add a column to the `colData` that encodes which batch each cell is from. Then you can simply color the cells by batch. Cao & Packer et al included a "plate" annotation in their data, which specifies which sci-RNA-seq plate each cell originated from. Coloring the UMAP by plate reveals:

In [None]:
plot_cells(cds2, color_cells_by="plate", label_cell_groups=FALSE)


You can remove batch effects by running the `align_cds()` function, which uses mutual nearest neighbor alignment:


In [None]:
cds4 <- align_cds(cds2,
                 num_dim = 50,
                 alignment_group = "plate")
cds5 <- reduce_dimension(cds4)
plot_cells(cds5, color_cells_by="plate", label_cell_groups=FALSE)
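
A numeric complement to the colored UMAP is to tabulate how the cells of each cluster are distributed across batches. The sketch below uses base R on made-up cell metadata; the data frame `toy_meta` and its `cluster` column are assumptions for illustration (in a real analysis the batch labels would come from the `plate` column of `colData` and the cluster labels from the clustering step).

```r
set.seed(7)
# Invented cell metadata: 300 cells with a batch label and a cluster label.
toy_meta <- data.frame(
  plate = sample(c("plate1", "plate2"), 300, replace = TRUE),
  cluster = sample(c("c1", "c2", "c3"), 300, replace = TRUE)
)

# Rows are clusters, columns are batches.
batch_by_cluster <- table(toy_meta$cluster, toy_meta$plate)
batch_fraction <- prop.table(batch_by_cluster, margin = 1)  # each row sums to 1

print(batch_by_cluster)
print(round(batch_fraction, 2))
```

A cluster made up almost entirely of one batch is worth a closer look, although it can also reflect genuine biology if the batches contain different samples.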

## Group cells into clusters
Grouping cells into clusters is an important step in identifying the cell types represented in your data. Monocle uses a technique called **community detection** to group cells.

This approach was introduced by Levine et al as part of the phenoGraph algorithm.

### IMPORTANT - Resolution parameter
* By default, the resolution parameter is set to NULL, which means that it is determined automatically. Although the resulting clusters are OK, it can help to get more granularity in order to identify cell types more specifically.
* The higher the resolution value, the more clusters we get.
* In the Galaxy example, the resolution value was set to 0.0002.
* In this example, the resolution value is set to 1e-5.

You can cluster your cells using the cluster_cells() function, like this:

In [None]:
cds6 <- cluster_cells(cds5,
                      resolution=1e-5)
plot_cells(cds6)
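
One way to choose a resolution is to try a few values and count the clusters each produces. The helper below is a sketch only: `count_clusters()` and the candidate resolutions are hypothetical, introduced here for illustration, and the function assumes the input has already been through `reduce_dimension()`.

```r
# Hypothetical helper (not part of monocle3): run cluster_cells() at several
# resolutions and report how many clusters each one yields.
count_clusters <- function(cds, resolutions = c(1e-6, 1e-5, 1e-4)) {
  n_clusters <- vapply(resolutions, function(res) {
    clustered <- monocle3::cluster_cells(cds, resolution = res)
    length(unique(monocle3::clusters(clustered)))
  }, integer(1))
  data.frame(resolution = resolutions, n_clusters = n_clusters)
}

# Usage, assuming cds5 from the batch-correction step above:
# count_clusters(cds5)
```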

IMPORTANT - `cluster_cells()` also divides the cells into larger, more well-separated groups called **partitions**, using a statistical test from Alex Wolf et al., introduced as part of their PAGA algorithm. You can visualize these partitions like this:

In [None]:
plot_cells(cds6, color_cells_by="partition", group_cells_by="partition")


## Find marker genes expressed by each cluster
Once cells have been clustered, we can ask which genes make them different from one another. To do that, start by calling the `top_markers()` function:

In [None]:
# Use cds6: top_markers() needs the partitions computed by cluster_cells()
marker_test_res <- top_markers(cds6,
                               group_cells_by="partition",
                               reference_cells=1000,
                               cores=8)

The data frame marker_test_res contains a number of metrics for how specifically expressed each gene is in each partition. We could group the cells according to cluster, partition, or any categorical variable in colData(cds). You can rank the table according to one or more of the specificity metrics and take the top gene for each cluster. For example, pseudo_R2 is one such measure. We can rank markers according to pseudo_R2 like this:

In [None]:
top_specific_markers <- marker_test_res %>%
                            filter(fraction_expressing >= 0.10) %>%
                            group_by(cell_group) %>%
                            top_n(1, pseudo_R2)

top_specific_marker_ids <- unique(top_specific_markers %>% pull(gene_id))
head(top_specific_marker_ids)
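
To see what the filter-then-rank step does, here is the same pattern on a tiny invented table. `toy_markers` is a made-up stand-in for `marker_test_res` with only the columns used above, and `slice_max()` is used in place of `top_n()`, which dplyr has superseded (this assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Toy stand-in for marker_test_res; all values are invented for illustration.
toy_markers <- data.frame(
  gene_id = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
  cell_group = c("1", "1", "1", "2", "2", "2"),
  fraction_expressing = c(0.50, 0.05, 0.30, 0.80, 0.20, 0.60),
  pseudo_R2 = c(0.40, 0.90, 0.10, 0.30, 0.70, 0.20)
)

toy_top <- toy_markers %>%
  filter(fraction_expressing >= 0.10) %>%   # drops geneB despite its high pseudo_R2
  group_by(cell_group) %>%
  slice_max(pseudo_R2, n = 1) %>%           # best remaining gene per group
  ungroup()

print(toy_top$gene_id)  # "geneA" "geneE"
```

The filter on `fraction_expressing` matters: without it, group 1 would be represented by a gene that only 5% of its cells express.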

Now, we can plot the expression and fraction of cells that express each marker in each group with the plot_genes_by_group function:

In [None]:
plot_genes_by_group(cds6,
                    top_specific_marker_ids,
                    group_cells_by="partition",
                    ordering_type="maximal_on_diag",
                    max.size=3)

It's often informative to look at more than one marker, which you can do just by changing the first argument to top_n():

In [None]:
top_specific_markers <- marker_test_res %>%
                            filter(fraction_expressing >= 0.10) %>%
                            group_by(cell_group) %>%
                            top_n(3, pseudo_R2)

top_specific_marker_ids <- unique(top_specific_markers %>% pull(gene_id))

plot_genes_by_group(cds6,
                    top_specific_marker_ids,
                    group_cells_by="partition",
                    ordering_type="cluster_row_col",
                    max.size=3)

This is the end of this tutorial

## References
* Monocle documentation: https://cole-trapnell-lab.github.io/monocle3/docs/getting_started/
* Galaxy training: https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.html#monocle-workflow