# scRNA-seq Monocle Tutorial 3 - Differential expression

Authors: Krithika Bhuvaneshwar, Yuriy Gusev

Affiliation: Innovation Center For Biomedical Informatics (ICBI), and Biomedical Informatics Shared Resource (BISR) at Georgetown University Medical Center (GUMC)

***More about our research work:***
* *ICBI: https://icbi.georgetown.edu*
* *BISR: https://icbi.georgetown.edu/bisr/ and https://lombardi.georgetown.edu/research/sharedresources/bbsr/*


##  Before we start

If using Google Colab, change the Colab runtime type to R (the default is Python). Go to Runtime -> Change runtime type and, under "Notebook settings", set the runtime type to "R".

## Monocle software
Monocle software: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

* Monocle 3 is designed for use with absolute transcript counts (e.g. from UMI experiments).
* Monocle 3 works "out-of-the-box" with the transcript count matrices produced by Cell Ranger, the software pipeline for analyzing experiments from the 10X Genomics Chromium instrument.
* Monocle 3 also works well with data from other RNA-Seq workflows such as sci-RNA-Seq and instruments like the Biorad ddSEQ.

Reference: https://cole-trapnell-lab.github.io/monocle3/docs/differential/

### Install monocle package - this can take several minutes
Note - if you encounter errors during installation, please refer to this page: https://cole-trapnell-lab.github.io/monocle3/docs/installation/

In [None]:
#install monocle3 through the cole-trapnell-lab GitHub
install.packages("devtools")
devtools::install_github('cole-trapnell-lab/monocle3')

#if you wish to install the development version of monocle3
#devtools::install_github('cole-trapnell-lab/monocle3', ref="develop")

### Step: load required packages for analysis


In [None]:
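library(monocle3)
library(dplyr)    # provides %>%, filter() and select() used below
library(ggplot2)  # provides theme() and element_text() used below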


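In this example, we will load an example dataset: the C. elegans embryo data (`packer_embryo`) hosted by the Trapnell lab. It comes as three files: an expression matrix, cell metadata, and gene annotation.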

In [None]:
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds"))


Make a new cell_data_set (CDS) object as follows:


In [None]:
# Make the CDS object
cds <- monocle3::new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)

Performing differential expression analysis on all genes in a cell_data_set object can take anywhere from minutes to hours, depending on how complex the analysis is.

Begin with a small set of genes that we know are important in ciliated neurons to demonstrate Monocle's capabilities:

In [None]:

ciliated_genes <- c("che-1",
                    "hlh-17",
                    "nhr-6",
                    "dmd-6",
                    "ceh-36",
                    "ham-1")

# Subset the CDS to the rows (genes) listed in ciliated_genes

cds_subset <- cds[rowData(cds)$gene_short_name %in% ciliated_genes,]

dim(cds_subset)


VERY IMPORTANT - do NOT use `as.matrix()` on the sparse matrix object. It converts the sparse matrix into a dense matrix, and a dense matrix takes up 20 times more space than a sparse matrix.
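
To see why this matters, the following minimal sketch compares the memory footprint of a sparse matrix and its dense equivalent. It uses a small simulated matrix built with the Matrix package, not the tutorial data. The matrix size, the 5% non-zero fraction, and the names `toy_sparse` and `toy_dense` are assumptions chosen for illustration; the exact ratio depends on how sparse your own data is.

```r
library(Matrix)

set.seed(1)
# Simulated count matrix: 2000 genes x 500 cells, about 5% non-zero entries
# (illustrative values only, not the tutorial dataset)
toy_sparse <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05,
                            rand.x = function(n) rpois(n, lambda = 3) + 1)

# Converting to dense stores every zero explicitly
toy_dense <- as.matrix(toy_sparse)

sparse_bytes <- as.numeric(object.size(toy_sparse))
dense_bytes  <- as.numeric(object.size(toy_dense))

cat("sparse:", sparse_bytes, "bytes\n")
cat("dense: ", dense_bytes, "bytes\n")
cat("ratio: ", round(dense_bytes / sparse_bytes, 1), "\n")
```

On a real dataset with tens of thousands of cells, the same conversion can exhaust the memory of a Colab session.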

## Regression model details
Monocle works by fitting a regression model to each gene. You can specify this model to account for various factors in your experiment (time, treatment, and so on).

The `fit_models()` function returns `gene_fits`, a tibble that contains a row for each gene. The `model` column contains generalized linear model objects, each of which aims to explain the expression of a gene across the cells using the model formula. The parameter `model_formula_str` should be a string specifying the model formula. The model formulae you use in your tests can include any term that exists as a column in the colData table. Here we model expression as a function of `embryo.time`:

In [None]:
gene_fits <- fit_models(cds_subset,
                        model_formula_str = "~embryo.time")

gene_fits
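
As a rough picture of what "one regression model per gene" means, the sketch below fits a quasi-Poisson GLM to each gene of a small simulated count table using base R's `glm()`. This is only an illustration under stated assumptions: the data are simulated, the names `toy_counts`, `toy_time`, `fit_one_gene` and `toy_coefs` are hypothetical, and Monocle's own fitting code may handle normalization and other details differently.

```r
set.seed(42)

# Simulated data (hypothetical): 3 genes x 200 cells
n_cells  <- 200
toy_time <- runif(n_cells, min = 100, max = 700)   # stand-in for embryo.time

toy_counts <- rbind(
  gene_up   = rpois(n_cells, lambda = exp(-1 + 0.004 * toy_time)),  # rises with time
  gene_down = rpois(n_cells, lambda = exp( 2 - 0.004 * toy_time)),  # falls with time
  gene_flat = rpois(n_cells, lambda = 2)                            # no time effect
)

# Fit log(E[count]) = b0 + b_time * time for a single gene
fit_one_gene <- function(counts, time) {
  glm(counts ~ time, family = quasipoisson(link = "log"))
}

# One model per gene, analogous in spirit to the `model` column of `gene_fits`
toy_fits <- lapply(rownames(toy_counts),
                   function(g) fit_one_gene(toy_counts[g, ], toy_time))
names(toy_fits) <- rownames(toy_counts)

# Collect the time coefficient of each gene with its standard error and p-value
toy_coefs <- do.call(rbind, lapply(names(toy_fits), function(g) {
  est <- summary(toy_fits[[g]])$coefficients["time", ]
  data.frame(gene     = g,
             estimate = est[["Estimate"]],
             std_err  = est[["Std. Error"]],
             p_value  = est[["Pr(>|t|)"]])
}))
print(toy_coefs)
```

The sign of `estimate` indicates whether expression rises or falls with time on the log scale.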

Find out which of these genes have time-dependent expression. First, extract a table of coefficients from each model using the `coefficient_table()` function:

In [None]:
fit_coefs <- coefficient_table(gene_fits)
fit_coefs

Extract the time terms:


In [None]:
emb_time_terms <- fit_coefs %>% filter(term == "embryo.time")


Pull out the genes that have a significant time component. coefficient_table() tests whether each coefficient differs significantly from zero under the Wald test. By default, coefficient_table() adjusts these p-values for multiple hypothesis testing using the method of Benjamini and Hochberg. These adjusted values can be found in the q_value column. We can filter the results and control the false discovery rate as follows:

In [None]:
emb_time_terms %>% filter(q_value < 0.05) %>%
         select(gene_short_name, term, q_value, estimate)

We can see that five of the six genes significantly vary as a function of time.
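
The multiple-testing adjustment itself can be reproduced with base R's `p.adjust()`. The sketch below applies the Benjamini-Hochberg method to a hypothetical vector of raw p-values and keeps the genes with an adjusted value below 0.05. The gene names and numbers are made up for illustration and are not the tutorial's results.

```r
# Hypothetical raw p-values for six genes (illustrative numbers only)
toy_p <- c(geneA = 0.0001, geneB = 0.004, geneC = 0.019,
           geneD = 0.03,   geneE = 0.041, geneF = 0.60)

# Benjamini-Hochberg adjustment: controls the false discovery rate
toy_q <- p.adjust(toy_p, method = "BH")

# Genes that pass at a 5% false discovery rate
significant <- names(toy_q)[toy_q < 0.05]

print(round(toy_q, 4))
print(significant)
```

Each adjusted value is at least as large as its raw p-value, so filtering on `q_value` is stricter than filtering on the raw p-value at the same threshold.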

Monocle also provides some easy ways to plot the expression of a small set of genes grouped by the factors you use during differential analysis. This helps you visualize the differences revealed by the tests above. One type of plot is a "violin" plot.

In [None]:
plot_genes_violin(cds_subset, group_cells_by="embryo.time.bin", ncol=2) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

## Controlling for batch effects and other factors


In [None]:
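# Add batch to the model formula so its effect is estimated separately from time
gene_fits <- fit_models(cds_subset,
                        model_formula_str = "~embryo.time + batch")
fit_coefs <- coefficient_table(gene_fits)
# Show every term except the intercept
fit_coefs %>% filter(term != "(Intercept)") %>%
      select(gene_short_name, term, q_value, estimate)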
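
To illustrate why a batch term can matter, the sketch below simulates a gene whose expression depends on batch but not on time, in a design where one batch was mostly collected at later time points. All names and numbers are hypothetical, and base R's `glm()` stands in for Monocle's fitting here.

```r
set.seed(7)
n_cells <- 400

# Hypothetical design: batch "B" cells were mostly collected at later times
toy_batch <- factor(rep(c("A", "B"), each = n_cells / 2))
toy_time  <- c(runif(n_cells / 2, 100, 400),   # batch A: early
               runif(n_cells / 2, 300, 700))   # batch B: late

# True model: expression depends on batch only, not on time
toy_counts <- rpois(n_cells, lambda = ifelse(toy_batch == "B", 8, 2))

fit_time_only  <- glm(toy_counts ~ toy_time, family = quasipoisson)
fit_with_batch <- glm(toy_counts ~ toy_time + toy_batch, family = quasipoisson)

# Without the batch term, time absorbs the batch effect
print(summary(fit_time_only)$coefficients["toy_time", ])

# With the batch term, the time estimate shrinks toward zero
print(summary(fit_with_batch)$coefficients["toy_time", ])
```

In the time-only model the gene looks time-dependent even though the simulation contains no time effect, which is the kind of false positive a batch term helps avoid.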

## Evaluating models of gene expression


In [None]:
# Summarize how well each gene's model fits the data
evaluate_fits(gene_fits)
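
As a generic illustration of model evaluation, the sketch below compares two nested models of a simulated gene with base R tools: AIC and a likelihood ratio test. It does not reproduce the output of `evaluate_fits()`. The data and all names are hypothetical, and the Poisson family is an assumption made so that a likelihood is available.

```r
set.seed(11)
n_cells   <- 300
toy_time  <- runif(n_cells, 100, 700)
toy_batch <- factor(sample(c("A", "B"), n_cells, replace = TRUE))

# Simulated gene (hypothetical) that depends on both time and batch
toy_counts <- rpois(n_cells,
                    lambda = exp(0.2 + 0.002 * toy_time + 0.8 * (toy_batch == "B")))

# Poisson family is used here because it has a proper likelihood,
# which AIC and the likelihood ratio test both need
fit_reduced <- glm(toy_counts ~ toy_time, family = poisson)
fit_full    <- glm(toy_counts ~ toy_time + toy_batch, family = poisson)

# Lower AIC suggests a better trade-off between fit and complexity
print(AIC(fit_reduced, fit_full))

# Likelihood ratio test: does adding batch improve the fit?
lrt <- anova(fit_reduced, fit_full, test = "Chisq")
print(lrt)
```

A small p-value in the second row of `lrt` indicates that the model with the batch term explains the data better than the time-only model.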

## Choosing a distribution for modeling gene expression
Monocle uses generalized linear models to capture how a gene's expression depends on each variable in the experiment. These models require you to specify a distribution that describes gene expression values. Most studies that use this approach to analyze their gene expression data use the negative binomial distribution, which is often appropriate for sequencing read or UMI count data. The negative binomial is at the core of many packages for RNA-seq analysis, such as DESeq2.

Monocle's `fit_models()` supports the following distributions:
* negative binomial
* Poisson
* binomial
* quasipoisson

The default is "quasipoisson", which is very similar to the negative binomial. Quasipoisson is a bit less accurate than the negative binomial but much faster to fit, making it well suited to datasets with thousands of cells.
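
The practical difference between a Poisson and a quasipoisson fit can be seen on simulated overdispersed counts. The sketch below uses base R's `glm()` with hypothetical data and names; it is not Monocle's implementation, and it does not fit a negative binomial model.

```r
set.seed(3)
n_cells  <- 500
toy_time <- runif(n_cells, 0, 1)

# Overdispersed counts drawn from a negative binomial (hypothetical data)
toy_counts <- rnbinom(n_cells, mu = exp(1 + 1 * toy_time), size = 1)

fit_pois  <- glm(toy_counts ~ toy_time, family = poisson)
fit_quasi <- glm(toy_counts ~ toy_time, family = quasipoisson)

# Point estimates are identical for the two families
print(rbind(poisson = coef(fit_pois), quasipoisson = coef(fit_quasi)))

# Quasipoisson estimates a dispersion parameter (fixed at 1 for Poisson)...
dispersion <- summary(fit_quasi)$dispersion
print(dispersion)

# ...and scales the standard errors by sqrt(dispersion)
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
print(se_quasi / se_pois)
```

Because the standard errors grow with the estimated dispersion, quasipoisson p-values are more conservative than plain Poisson p-values when counts vary more than a Poisson model expects.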

This is the end of this tutorial.

## References
* Monocle documentation: https://cole-trapnell-lab.github.io/monocle3/docs/getting_started/
* Galaxy training: https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.html#monocle-workflow