# scRNA-seq Monocle Tutorial 2 - Trajectories

Authors: Krithika Bhuvaneshwar, Yuriy Gusev

Affiliation: Innovation Center For Biomedical Informatics (ICBI), and Biomedical Informatics Shared Resource (BISR) at Georgetown University Medical Center (GUMC)

***More about our research work:***
* *ICBI: https://icbi.georgetown.edu*
* *BISR: https://icbi.georgetown.edu/bisr/ and https://lombardi.georgetown.edu/research/sharedresources/bbsr/*


##  Before we start

If using Google Colab, change the Colab runtime environment to R (the default is Python). Go to Runtime -> Change runtime type, and in "Notebook settings" change the environment to "R".

## Monocle software
Monocle software: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

* Monocle 3 is designed for use with absolute transcript counts (e.g. from UMI experiments).
* Monocle 3 works "out-of-the-box" with the transcript count matrices produced by Cell Ranger, the software pipeline for analyzing experiments from the 10X Genomics Chromium instrument.
* Monocle 3 also works well with data from other RNA-Seq workflows such as sci-RNA-Seq and instruments like the Biorad ddSEQ.

Reference: https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/


### Install monocle package - this can take SEVERAL minutes
Note - if you encounter errors during installation, please refer to this page: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

In [None]:
#install monocle3 through the cole-trapnell-lab GitHub
install.packages("devtools")
devtools::install_github('cole-trapnell-lab/monocle3')

#if you wish to install the development version of monocle3
#devtools::install_github('cole-trapnell-lab/monocle3', ref="develop")

### Step: load required packages for analysis


In [None]:
library(monocle3)


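In this example, we will load an example dataset: the C. elegans embryo data from Packer & Zhu et al. It comes as three files: an expression matrix, cell metadata, and gene annotation.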

In [None]:
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds"))


Make a new cell_data_set (CDS) object as follows:


In [None]:
# Make the CDS object
cds <- monocle3::new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)

VERY IMPORTANT - do NOT use `as.matrix()` on the sparse matrix object. It converts the sparse matrix into a dense matrix object. A dense matrix can take up roughly 20 times more space than a sparse matrix.
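To get a feel for the size difference, here is a minimal sketch that compares the memory footprint of a sparse matrix with its dense copy. It assumes the `Matrix` package is available (it ships with most R installations) and uses a hypothetical toy matrix, `toy_counts`, not the tutorial dataset. The exact ratio depends on how many entries are zero, so treat the numbers as illustrative only.

```r
library(Matrix)

set.seed(1)
# Hypothetical toy count matrix: 2,000 genes x 500 cells, about 5% non-zero entries
toy_counts <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05,
                            rand.x = function(n) rpois(n, lambda = 2) + 1)

# Memory used by the sparse object and by its dense copy, in bytes
sparse_bytes <- as.numeric(object.size(toy_counts))
dense_bytes  <- as.numeric(object.size(as.matrix(toy_counts)))

cat("sparse:", round(sparse_bytes / 1e6, 2), "MB\n")
cat("dense: ", round(dense_bytes / 1e6, 2), "MB\n")
cat("ratio: ", round(dense_bytes / sparse_bytes, 1), "\n")
```

Real scRNA-seq matrices have far more genes and cells, so the dense copy may not fit in memory at all.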

## Pre-process the data
This step is where you tell Monocle 3 how you want to normalize the data, whether to use Principal Components Analysis (the standard for RNA-seq) or Latent Semantic Indexing (common in ATAC-seq), and how to remove any batch effects.
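As a rough illustration of what "normalize, then run PCA" means, the sketch below does both steps in base R on a hypothetical toy matrix. The size-factor formula and the `log10` transform are simplifying assumptions for illustration. This is not Monocle's internal code, and `preprocess_cds()` may use different defaults.

```r
set.seed(7)
# Hypothetical toy counts: 50 genes x 30 cells
counts <- matrix(rpois(50 * 30, lambda = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))

# Size factors: each cell's total counts relative to the average total (one simple convention)
size_factors <- colSums(counts) / mean(colSums(counts))

# Divide each cell (column) by its size factor, then log-transform with a pseudocount of 1
norm_counts <- log10(sweep(counts, 2, size_factors, "/") + 1)

# PCA with cells as observations; keep the first 10 components
pca <- prcomp(t(norm_counts), center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:10]
dim(pcs)
```

Each row of `pcs` is one cell described by 10 components, which plays the same role as the `num_dim` components kept by `preprocess_cds()`.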

This time, we will use a different strategy for batch correction, which includes what Packer & Zhu et al. did in their original analysis:

Note: Your data will not have the background loading columns demonstrated here. If your data was collected in batches, correct for batch using your own batch information.

* `align_cds()` aligns groups of cells (i.e. batches)
* `residual_model_formula_str` is for subtracting continuous effects. You can use this to control for things like the fraction of mitochondrial reads in each cell, which is sometimes used as a QC metric for each cell.

In this experiment (as in many scRNA-seq experiments), some cells spontaneously lyse, releasing their mRNAs into the cell suspension immediately prior to loading into the single-cell library prep. This "supernatant RNA" contaminates each cell's transcriptome profile to a certain extent. Fortunately, it is fairly straightforward to estimate the level of background contamination in each batch of cells and subtract it, which is what Packer et al. did in the original study. Each of the columns bg.300.loading, bg.400.loading, and so on corresponds to a background signal that a cell might be contaminated with. Passing these columns as terms in residual_model_formula_str tells align_cds() to subtract these signals prior to dimensionality reduction, clustering, and trajectory inference. Note that you can call align_cds() with alignment_group, residual_model_formula_str, or both.

In [None]:
cds <- preprocess_cds(cds, num_dim = 50)

cds <- align_cds(cds,
                 alignment_group = "batch",
                 residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")
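The idea of subtracting a continuous effect can be illustrated with a plain linear model: fit the covariate, then keep the residuals. The sketch below is a conceptual example in base R with hypothetical simulated values (`bg_loading`, `biology`, `observed`). It is not Monocle's internal implementation of `align_cds()`.

```r
set.seed(42)
n_cells <- 200

# Hypothetical per-cell background loading (analogous to a bg.*.loading column)
bg_loading <- runif(n_cells)
# Hypothetical biological signal, independent of the background
biology <- rnorm(n_cells)
# Observed value = biology + contamination proportional to the background loading
observed <- biology + 3 * bg_loading

# Fit the continuous covariate and keep what the model cannot explain
fit <- lm(observed ~ bg_loading)
corrected <- residuals(fit)

cor(observed, bg_loading)   # clearly positive before correction
cor(corrected, bg_loading)  # essentially zero after correction
```

The residuals no longer track the background covariate, which is the effect the formula terms are meant to have on the reduced-dimension coordinates.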

## Reduce dimensionality and visualize the cells
Despite the fact that we are only looking at a small slice of this dataset, Monocle reconstructs a trajectory with numerous branches. Overlaying the manual annotations on the UMAP reveals that these branches are principally occupied by one cell type.

In [None]:
cds <- reduce_dimension(cds, reduction_method = "UMAP")

plot_cells(cds,
           label_groups_by_cluster=FALSE,
           color_cells_by = "cell.type")

## Group cells into clusters
Although cells may continuously transition from one state to the next with no discrete boundary between them, Monocle does not assume that all cells in the dataset descend from a common transcriptional "ancestor". In many experiments, there might in fact be multiple distinct trajectories. For example, in a tissue responding to an infection, tissue resident immune cells and stromal cells will have very different initial transcriptomes, and will respond to infection quite differently, so they should not be part of the same trajectory.

Monocle learns whether cells belong in the same trajectory or in separate trajectories through its clustering procedure. Recall that when we run cluster_cells(), each cell is assigned not only to a cluster but also to a partition. When you learn trajectories, each partition eventually becomes a separate trajectory. We run cluster_cells() as before.

In [None]:
cds <- cluster_cells(cds)


IMPORTANT - cluster_cells() also divides the cells into larger, more well-separated groups called **partitions**, using a statistical test from Alex Wolf et al., introduced as part of their PAGA algorithm. You can visualize these partitions like this:

In [None]:
plot_cells(cds, color_cells_by = "partition")


## Learn the trajectory graph
Next, we will fit a principal graph within each partition using the learn_graph() function:

In [None]:
cds <- learn_graph(cds)
plot_cells(cds,
           color_cells_by = "cell.type",
           label_groups_by_cluster=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE)

## Order the cells in pseudotime
Once we've learned a graph, we are ready to order the cells according to their progress through the developmental program. Monocle measures this progress in pseudotime. Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.

In order to place the cells in order, we need to tell Monocle where the "beginning" of the biological process is. We do so by choosing regions of the graph that we mark as "roots" of the trajectory. In time series experiments, this can usually be accomplished by finding spots in the UMAP space that are occupied by cells from early time points:

In [None]:
plot_cells(cds,
           color_cells_by = "embryo.time.bin",
           label_cell_groups=FALSE,
           label_leaves=TRUE,
           label_branch_points=TRUE,
           graph_label_size=1.5)

In [None]:
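# Called without root_pr_nodes or root_cells, order_cells() opens an interactive
# window for picking root nodes, so it needs an interactive R session.
cds <- order_cells(cds)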

## Choose root nodes
In the above example, we just chose one location, but you could pick as many as you want. Plotting the cells and coloring them by pseudotime shows how they were ordered:

In [None]:
plot_cells(cds,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)

Some of the cells are gray. This means they have infinite pseudotime, because they were not reachable from the root nodes that were picked. In general, any cell on a partition that lacks a root node will be assigned an infinite pseudotime, so you should choose at least one root per partition.
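To see why unreachable cells end up with infinite pseudotime, it helps to think of pseudotime as a distance measured along the graph from the root. The sketch below is a conceptual example on a hypothetical toy graph with two disconnected pieces, using the `igraph` package (assumed to be installed, since Monocle 3 depends on it). It is not how `order_cells()` is implemented internally.

```r
library(igraph)

# Hypothetical toy graph: one branched component (A..E) and one separate component (X, Y)
edges <- data.frame(from = c("A", "B", "C", "C", "X"),
                    to   = c("B", "C", "D", "E", "Y"))
g <- graph_from_data_frame(edges, directed = FALSE)

# Shortest-path distance from the chosen root "A" to every node
d <- distances(g, v = "A")[1, ]
print(d)

# Nodes in the component without a root are unreachable
names(d)[is.infinite(d)]
```

Adding a second root inside the X-Y component would give those nodes finite distances too, which mirrors the advice to pick one root per partition.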

It's often desirable to specify the root of the trajectory programmatically, rather than manually picking it. The function below does so by first finding, for each cell, the trajectory graph node it is nearest to. Then, it counts how many cells from the earliest time point are nearest to each node. Finally, it picks the node that is most heavily occupied by early cells and returns that as the root.

In [None]:
# A helper function to identify the root principal node:
get_earliest_principal_node <- function(cds, time_bin = "130-170") {
  # Cells collected in the earliest time bin
  cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)

  # For every cell, the index of the nearest principal graph node
  closest_vertex <-
    cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])

  # The node that is nearest to the largest number of early cells
  top_node <- as.numeric(names(which.max(table(closest_vertex[cell_ids, ]))))
  root_pr_nodes <- igraph::V(principal_graph(cds)[["UMAP"]])$name[top_node]

  root_pr_nodes
}
cds <- order_cells(cds, root_pr_nodes = get_earliest_principal_node(cds))

# Passing the programmatically selected root node to order_cells() via the root_pr_nodes argument yields:

plot_cells(cds,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)
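The counting step inside the helper can be hard to read at first. Here is a minimal sketch of the same `which.max(table(...))` logic on hypothetical toy vectors (`time_bin`, `closest_node`), independent of Monocle. The `"Y_"` prefix for the node name is an assumption used for illustration.

```r
# Hypothetical toy data: time bin and nearest graph node index for eight cells
time_bin <- c("130-170", "130-170", "130-170", "170-210",
              "170-210", "210-270", "130-170", "210-270")
closest_node <- c(3, 3, 5, 5, 7, 7, 3, 1)

# Keep only the cells from the earliest time bin
early <- which(time_bin == "130-170")

# Count how many early cells sit next to each node
counts <- table(closest_node[early])
print(counts)

# The node with the most early cells becomes the root
root_index <- as.numeric(names(which.max(counts)))
root_name <- paste0("Y_", root_index)  # assumed naming scheme, for illustration only
root_name
```

Note that `which.max()` returns only the first maximum, so ties between nodes are broken by table order.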

## Subset cells by branch
It is often useful to subset cells based on their branch in the trajectory. The function choose_graph_segments allows you to do so interactively.

In [None]:
cds_sub <- choose_graph_segments(cds)


## Working with 3D trajectories


In [None]:
cds_3d <- reduce_dimension(cds, max_components = 3)
cds_3d <- cluster_cells(cds_3d)
cds_3d <- learn_graph(cds_3d)
# Pick the root from the 3D object's own principal graph
cds_3d <- order_cells(cds_3d, root_pr_nodes=get_earliest_principal_node(cds_3d))

cds_3d_plot_obj <- plot_cells_3d(cds_3d, color_cells_by="partition")

This is the end of this tutorial

## References
* Monocle documentation: https://cole-trapnell-lab.github.io/monocle3/docs/getting_started/
* Galaxy training: https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.html#monocle-workflow