# Getting Started with RNA Velocity


RNA velocity bridges measurements to an underlying mechanism, mRNA splicing, with two modes indicating the current and future state [[RNA velocity—current challenges and future perspectives](https://www.embopress.org/doi/full/10.15252/msb.202110282)]. It is a method used to predict the future gene expression of a cell from measurements of both spliced and unspliced mRNA transcripts [[2](https://towardsdatascience.com/rna-velocity-the-cells-internal-compass-cf8d75bb2f89)].

RNA velocity can be used to infer the direction of gene expression changes in single-cell RNA sequencing (scRNA-seq) data. It provides insights into the future state of individual cells by using the relative abundance of unspliced and spliced RNA transcripts. This ratio can indicate the transcriptional dynamics and potential fate of a cell, such as whether it is transitioning from one cell type to another or undergoing differentiation [[RNA velocity of single cells](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6130801/)].

+ **velocyto**

velocyto is a package for the analysis of expression dynamics in single-cell RNA-seq data. In particular, it enables estimation of RNA velocities of single cells by distinguishing unspliced and spliced mRNAs in standard single-cell RNA sequencing protocols. The accompanying paper was the first to propose the concept of RNA velocity; velocyto predicts RNA velocity by solving the proposed differential equations for each gene [[RNA velocity of single cells](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6130801/)].

+ **scVelo**

scVelo is a method that solves the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to systems with transient cell states, which are common in development and in response to perturbations. The authors demonstrate the capabilities of the dynamical model by disentangling subpopulation kinetics across various cell lineages in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis [[Generalizing RNA velocity to transient cell states through dynamical modeling](https://www.nature.com/articles/s41587-020-0591-3)].

Here, I will go through the basics of scVelo. The following tutorials go straight into analysis of RNA velocity, latent time, driver identification and more.

First of all, the input data for scVelo are two count matrices of pre-mature (unspliced) and mature (spliced) abundances, which can be obtained from standard sequencing protocols using velocyto.

In [4]:
from platform import python_version

print(python_version())

3.11.5


In [None]:
#!pip install numpy==1.23.2 pandas==1.5.3 matplotlib==3.7.3 scanpy==1.9.6 igraph==0.9.8 scvelo==0.2.5 loompy==3.0.6 anndata==0.8.0

In [None]:
#!pip install tqdm 
#!pip install ipywidgets
#!pip install pandas==1.1.5 
#!pip install numpy==1.21.1

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import scanpy as sc
import igraph
import scvelo as scv
import loompy as lmp
import anndata

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set parameters for plots, including size, color, etc.
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization

In [None]:
adata = scv.read('/home/aruna/Desktop/Analysis/Project1/Data/Project1_merged.h5ad', cache=True)
adata

In [None]:
adata.obs

+ Metadata info:
    The single-cell data consists of 8 different samples: A0, A6, B0, B6, C0, C6, D0 and D6:
    
    A- A549 WT Cells with no IFNb pretreatment.
    
    B- A549 WT Cells, pretreated with 10U of IFNb for 18 hours. 
    
    C- A549 IRF3 K.O Cells with no IFNb pretreatment.

    D- A549 IRF3 K.O Cells, pretreated with 10U of IFNb for 18 hours.

    0- a sample that was not stimulated with vRNA.

    6- a sample that was stimulated with vRNA for 6 hours.

In [None]:
#adata.obs['condition'] = 'WT'

In [None]:
#Trait = adata.obs.batch=='D6'
#adata.obs.loc[Trait, 'condition'] = 'IRF3&IFNb&vRNA'
#adata.obs['condition']

In [None]:
#ldata = scv.read('/home/aruna/Desktop/Analysis/Project1/Data/Project1_loomp.loom', cache=True)
#ldata

In [None]:
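# Merge the spliced/unspliced loom counts into adata.
# Note: `ldata` is only defined if the loom-reading cell above is uncommented,
# so the merge is guarded to avoid a NameError.
scv.utils.clean_obs_names(adata)
if 'ldata' in globals():
    scv.utils.clean_obs_names(ldata)
    adata = scv.utils.merge(adata, ldata)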

In [None]:
adata.obs

In [None]:
adata.obs.info()

In [None]:
adata.layers

In [None]:
scv.pl.proportions(adata, groupby="cell_type")

In [None]:
#adata.obs['condition']=adata.obs['condition'].astype('category')

Here, the proportions of spliced/unspliced counts are displayed. Depending on the protocol used (Drop-Seq, Smart-Seq, inDrop and 10x Genomics Chromium protocols), we typically have between 10% and 25% of unspliced molecules containing intronic sequences. We also advise you to examine the variations at the cluster level to verify consistency in splicing efficiency. Here, we find variations, with slightly **lower** unspliced proportions in **Ciliated & Merkel cells** and a ***higher proportion*** in ***Mesangial cells & Keratinocytes***.
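To make the quantity behind such a plot concrete, here is a minimal sketch of an unspliced fraction per group, computed as the mean of per-cell fractions. It is not scVelo's implementation; the function name `unspliced_fraction_by_group`, the toy counts and the group labels `"A"` and `"B"` are all invented for illustration.

```python
from collections import defaultdict


def unspliced_fraction_by_group(spliced, unspliced, groups):
    """Mean per-cell unspliced fraction for each group.

    spliced, unspliced: per-cell total counts (hypothetical toy inputs).
    groups: per-cell group label, e.g. a cluster or condition name.
    """
    per_group = defaultdict(list)
    for s, u, g in zip(spliced, unspliced, groups):
        total = s + u
        if total > 0:  # skip empty cells to avoid division by zero
            per_group[g].append(u / total)
    return {g: sum(v) / len(v) for g, v in per_group.items()}


# Toy data: four cells in two hypothetical groups.
fractions = unspliced_fraction_by_group(
    spliced=[80, 90, 70, 75],
    unspliced=[20, 10, 30, 25],
    groups=["A", "A", "B", "B"],
)
print(fractions)
```

In this toy example group `"B"` has the larger unspliced fraction, which is the kind of between-group difference worth checking before interpreting velocities.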

In [None]:
scv.pl.proportions(adata, groupby="condition")

In [None]:
sc.pl.embedding(adata, basis="umap", color=["condition", "batch"], ncols=1)

## Estimate RNA velocity

RNA velocity estimation can currently be tackled with three existing approaches:

• steady-state / deterministic model (using steady-state residuals),

• stochastic model (using second-order moments),

• dynamical model (using a likelihood-based framework).

 + **The steady-state / deterministic model:** as used in velocyto, it estimates velocities as follows. Under the assumption that transcriptional phases (induction and repression) last sufficiently long to reach a steady-state equilibrium (active and inactive), **`velocities are quantified as the deviation of the observed ratio from its steady-state ratio`**. The equilibrium mRNA levels are approximated with a linear regression on the presumed steady states in the lower and upper quantiles. This simplification makes two fundamental assumptions: a common splicing rate across genes, and steady-state mRNA levels that are reflected in the data. It can lead to errors in velocity estimates and cellular states because these assumptions are often violated, in particular when a population comprises multiple heterogeneous subpopulation dynamics.


 + **The stochastic model:** aims to better capture the steady states. By treating transcription, splicing and degradation as probabilistic events, the resulting Markov process is approximated by moment equations. By including second-order moments, **`it exploits not only the balance of unspliced to spliced mRNA levels but also their covariation`**. It has been demonstrated on the endocrine pancreas that stochasticity adds valuable information, overall yielding higher consistency than the deterministic model, while remaining as efficient in computation time.


 + **The dynamical model:** (the most powerful but also the most computationally expensive) solves the full dynamics of splicing kinetics for each gene. **`It thereby adapts RNA velocity to widely varying specifications such as non-stationary populations, as it does not rely on the restrictions of a common splicing rate or steady states to be sampled`**.
 
     The splicing dynamics

     $$\frac{du(t)}{dt} = \alpha_k(t) - \beta u(t), \tag{4.1}$$

     $$\frac{ds(t)}{dt} = \beta u(t) - \gamma s(t), \tag{4.2}$$

     is solved in a likelihood-based expectation-maximization framework, **`by iteratively estimating the parameters of reaction rates and latent cell-specific variables`**, i.e. transcriptional state k and cell-internal latent time t. It thereby aims to learn the unspliced/spliced phase trajectory. Four transcriptional states are modeled to account for all possible configurations of gene activity: two dynamic transient states (induction and repression) and two steady states (active and inactive) potentially reached after each dynamic transition.
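To build intuition for equations (4.1) and (4.2), the sketch below integrates them numerically for a single gene with a simple Euler scheme. It only simulates the forward dynamics; it is not scVelo's expectation-maximization fit. The function `simulate_splicing`, the rate values and the switch time are assumptions chosen for illustration: transcription is on at rate `alpha_on` until `t_switch` (induction) and off afterwards (repression).

```python
def simulate_splicing(alpha_on, beta, gamma, t_switch, t_end, dt=0.01):
    """Euler integration of du/dt = alpha_k - beta*u and ds/dt = beta*u - gamma*s.

    alpha_k equals alpha_on during induction (t < t_switch) and 0 afterwards.
    Returns a list of (t, u, s, velocity) tuples, where velocity is ds/dt.
    """
    u, s, t = 0.0, 0.0, 0.0
    trajectory = []
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        alpha_k = alpha_on if t < t_switch else 0.0
        du = alpha_k - beta * u
        ds = beta * u - gamma * s
        trajectory.append((t, u, s, ds))
        # Advance the state by one Euler step.
        u += dt * du
        s += dt * ds
        t += dt
    return trajectory


# Hypothetical rates: induction until t = 30, then repression until t = 40.
traj = simulate_splicing(alpha_on=2.0, beta=1.0, gamma=0.5, t_switch=30.0, t_end=40.0)

for index in (100, 2900, 3500):
    t, u, s, velocity = traj[index]
    print(f"t={t:5.1f}  u={u:.3f}  s={s:.3f}  ds/dt={velocity:+.3f}")
```

Early in induction the velocity `ds/dt` is positive, late in induction the gene approaches the active steady state (`u` near `alpha_on / beta`, `s` near `alpha_on / gamma`, velocity near zero), and after the switch the velocity turns negative. These are the transient and steady states that the dynamical model distinguishes.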

In [None]:
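# Step-by-step equivalent of scv.pp.filter_and_normalize in the next cell.
# Run one or the other, not both; kept here commented out for reference.
#scv.pp.filter_genes(adata, min_shared_counts=20)
#scv.pp.normalize_per_cell(adata)
#scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
#scv.pp.log1p(adata)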

In [None]:
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

In [None]:
scv.tl.recover_dynamics(adata, n_jobs=4)

In [None]:
scv.tl.velocity(adata, mode='dynamical', n_jobs=4)
scv.tl.velocity_graph(adata)

Running the dynamical model can take a while. Hence, you may want to store the results for re-use.

In [None]:
adata.write('/home/aruna/Desktop/Analysis/Project1/Data/Project1&DynVelo.h5ad', compression='gzip')

In [None]:
#adata = scv.read('/home/aruna/Desktop/Analysis/Project1/Data/Project1&DynVelo.h5ad', cache=True)

In [None]:
adata

In [None]:
print(adata.var['velocity_genes'].sum(), adata.n_vars)
top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]]
scv.pl.scatter(adata, basis=top_genes[:10], ncols=4, color='condition', wspace=.5, hspace=.75)

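The output shows that 465 of the 1102 genes retained after filtering are flagged as velocity genes (the second number printed is `adata.n_vars`, which counts genes, not cells).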

## Project the velocities

In [None]:
scv.pl.velocity_embedding_stream(adata, basis='umap', color='condition', 
                                 legend_loc='right margin')

In [None]:
scv.pl.velocity_embedding_grid(adata, basis='umap', color='condition', 
                          legend_loc='right margin')

## Interpret the velocities

See the gif <a href="https://user-images.githubusercontent.com/31883718/80227452-eb822480-864d-11ea-9399-56886c5e2785.gif">here</a> to get an idea of how to interpret a spliced vs. unspliced phase portrait. Gene activity is orchestrated by transcriptional regulation. Transcriptional induction for a particular gene results in an **increase of (newly transcribed) precursor unspliced mRNAs** while, conversely, repression or absence of transcription results in a **decrease of unspliced mRNAs**. Spliced mRNAs are produced from unspliced mRNAs and follow the same trend with a time lag. Time is a hidden/latent variable. Thus, the dynamics need to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.
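As a rough illustration of how a phase portrait is read, the sketch below labels cells by the sign of the residual `u - gamma * s`, i.e. by whether they lie above or below a steady-state line with slope `gamma`. The helper `classify_cells`, the value of `gamma` and the toy measurements are assumptions for this example; in a real analysis the steady-state ratio is estimated from the data.

```python
def classify_cells(u, s, gamma, tol=1e-9):
    """Label each cell by the sign of the residual u - gamma * s.

    A positive residual means more unspliced mRNA than expected at steady
    state (induction); a negative residual means less (repression).
    """
    labels = []
    for ui, si in zip(u, s):
        residual = ui - gamma * si
        if residual > tol:
            labels.append("induction")
        elif residual < -tol:
            labels.append("repression")
        else:
            labels.append("steady state")
    return labels


# Toy measurements for one gene in four cells (hypothetical values).
unspliced = [0.0, 1.0, 2.0, 0.5]
spliced = [0.0, 1.0, 4.0, 3.0]
labels = classify_cells(unspliced, spliced, gamma=0.5)
print(labels)
```

Cells with an excess of unspliced mRNA sit above the steady-state line and are interpreted as up-regulating the gene, while cells below the line are interpreted as down-regulating it.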

In [None]:
adata

In [None]:
df = scv.get_df(adata, 'rank_genes_groups/names')
df.head(10)

In [None]:
scv.pl.velocity(adata, ['HNRNPH1',  'CTNNB1', 'SET', 'YWHAE', 'EIF5A'], color='condition')

## Identify important genes
 
 + **By Condition**

In [None]:
scv.tl.rank_velocity_genes(adata, groupby='condition', min_corr=.3)

df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head(10)

In [None]:
for condition in ['IRF3&IFNb&vRNA', 'WT&IFNb&vRNA', 'IRF3&vRNA']:
    scv.pl.scatter(adata, df[condition][:5], ylabel=condition, color='condition', wspace=.6)

## Speed and coherence

Two more useful stats:

+ The speed or rate of differentiation is given by the length of the velocity vector.

+ The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.
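The two statistics can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scVelo's code: speed is taken as the Euclidean norm of a cell's velocity vector, and coherence as the mean Pearson correlation between a cell's velocity and those of its neighbors. The helper names, the toy velocities and the neighbor lists are invented.

```python
import math


def velocity_length(v):
    """Speed: Euclidean norm of a cell's velocity vector."""
    return math.sqrt(sum(x * x for x in v))


def correlation(a, b):
    """Pearson correlation between two equally long vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / norm if norm > 0 else 0.0


def velocity_coherence(velocities, neighbors, cell):
    """Mean correlation of one cell's velocity with its neighbors' velocities."""
    scores = [correlation(velocities[cell], velocities[j]) for j in neighbors[cell]]
    return sum(scores) / len(scores)


# Toy velocities over three genes for three cells (hypothetical values).
velocities = [
    [1.0, 2.0, 3.0],  # cell 0
    [2.0, 4.0, 6.0],  # cell 1: same direction as cell 0, twice the speed
    [3.0, 2.0, 1.0],  # cell 2: opposite trend across genes
]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

speeds = [velocity_length(v) for v in velocities]
coherence = [velocity_coherence(velocities, neighbors, c) for c in range(3)]
print(speeds, coherence)
```

Cell 1 moves twice as fast as cell 0, but its coherence is low because its two neighbors point in opposite directions; a low value of this kind flags regions where the velocity field should be interpreted with caution.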

In [None]:
scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])

In [None]:
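# Select the columns with a list: indexing a groupby with a tuple of keys is deprecated in pandas.
df = adata.obs.groupby('condition')[list(keys)].mean().T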
df.style.background_gradient(cmap='coolwarm', axis=1)

## Velocity graph and pseudotime

We can visualize the velocity graph to portray all velocity-inferred cell-to-cell connections/transitions. It can be confined to high-probability transitions by setting a threshold. 

In [None]:
scv.pl.velocity_graph(adata, threshold=.9, color='condition', 
                          legend_loc='right margin')

In [None]:
#x, y = scv.utils.get_cell_transitions(adata, basis='umap', n_neighbors=10, starting_cell=70)
#ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
#ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)

Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells.

In [None]:
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')

## PAGA velocity graph

PAGA graph abstraction has been benchmarked as a top-performing method for trajectory inference. It provides a graph-like map of the data topology with weighted edges corresponding to the connectivity between two clusters. Here, PAGA is extended by velocity-inferred directionality.

In [None]:
adata

In [None]:
# PAGA requires igraph; install it if it is not available yet.
!pip install python-igraph --upgrade --quiet

In [None]:
# this is needed due to a current bug - bugfix is coming soon.
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']

scv.tl.paga(adata, groups='condition', use_time_prior=False)
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')

This reads from left/row to right/column, thus e.g. assigning a confident transition from Merkel cells to Basophils.
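The row-to-column reading can be made explicit with a small sketch that turns such a table into a list of directed edges. The nested-dict layout, the function `confident_transitions`, the threshold of 0.5 and the group names `"A"`, `"B"`, `"C"` are all hypothetical and only stand in for the table displayed above.

```python
def confident_transitions(table, threshold=0.5):
    """Return (source, target, confidence) for entries at or above threshold.

    table: nested dict where table[row][column] is the confidence of the
    transition row -> column.
    """
    edges = []
    for source, row in table.items():
        for target, confidence in row.items():
            if source != target and confidence >= threshold:
                edges.append((source, target, confidence))
    # Strongest transitions first.
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Hypothetical 3-group table; names and values are invented for illustration.
table = {
    "A": {"A": 0.0, "B": 0.8, "C": 0.1},
    "B": {"A": 0.0, "B": 0.0, "C": 0.6},
    "C": {"A": 0.0, "B": 0.0, "C": 0.0},
}
edges = confident_transitions(table)
for source, target, confidence in edges:
    print(f"{source} -> {target} ({confidence:.2f})")
```

Each retained entry corresponds to one arrow from the row group to the column group.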

This table can be summarized by a directed graph superimposed onto the UMAP embedding.

In [None]:
scv.pl.paga(adata, basis='umap', size=50, alpha=.2, 
            min_edge_width=2, node_size_scale=1)

In [None]:
scv.pl.velocity_embedding_stream(adata, basis='umap', color='condition', 
                                 legend_loc='right margin', ncols=1)

Here we observe **2** main velocity directions, one in **Gamma delta T cells and Monocytes**.

In [None]:
results_file = '/home/aruna/Desktop/Analysis/Project1/Data/Project1&DynVelo.h5ad'  
adata.write(results_file)

## Further analysis with the dynamical model

### Kinetic rate parameters

The rates of RNA transcription, splicing and degradation are estimated without the need of any experimental data.

They can be useful to better understand the cell identity and phenotypic heterogeneity.

In [None]:
df = adata.var
# Keep velocity genes with a reasonable fit; the boolean column is used directly as a mask.
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes']]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata, 'fit*', dropna=True).head()

The estimated gene-specific parameters comprise rates of transcription (fit_alpha), splicing (fit_beta), degradation (fit_gamma), switching time point (fit_t_), a scaling parameter to adjust for under-represented unspliced reads (fit_scaling), standard deviation of unspliced and spliced reads (fit_std_u, fit_std_s), the gene likelihood (fit_likelihood), inferred steady-state levels (fit_steady_u, fit_steady_s) with their corresponding p-values (fit_pval_steady_u, fit_pval_steady_s), the overall model variance (fit_variance), and a scaling factor to align the gene-wise latent times to a universal, gene-shared latent time (fit_alignment_scaling).



### Latent time

The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on their transcriptional dynamics.

In [None]:
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)

In [None]:
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='condition', n_convolve=100)

### Top-likelihood genes

Driver genes display pronounced dynamic behavior and are systematically detected by their high likelihoods in the dynamical model.

In [None]:
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False, color='condition')

### Cluster-specific top-likelihood genes

Moreover, partial gene likelihoods can be computed for each cluster of cells to enable cluster-specific identification of potential drivers.

In [None]:
scv.tl.rank_dynamical_genes(adata, groupby='condition')
df = scv.get_df(adata, 'rank_dynamical_genes/names')
df.head(10)

In [None]:
for condition in ['IRF3&IFNb&vRNA', 'WT&IFNb&vRNA', 'IRF3&vRNA', 'WT', 'IRF3', 'IRF3&IFNb']:
    scv.pl.scatter(adata, df[condition][:5], ylabel=condition, color='condition', wspace=.6)