<a href="https://colab.research.google.com/github/Ken-Lau-Lab/single-cell-lectures/blob/main/section06_trajectory_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## __Section 6:__ Trajectory Inference

March 01, 2022

## Phase 0: Downloading datasets and installing packages

In [None]:
!git clone https://github.com/Ken-Lau-Lab/single-cell-lectures  # for Colab users

In [None]:
!git clone https://github.com/KenLauLab/pCreode

In [None]:
# installing pcreode, our trajectory inference algorithm, Herring et al. 2018
# (https://www.sciencedirect.com/science/article/pii/S2405471217304490)
!pip install pCreode/.

In [None]:
!pip install numpy  # for Colab users
!pip install pandas  # for Colab users
!pip install scanpy  # for Colab users
!pip install leidenalg  # for Colab users
# software library for graph theoretic representations of the data
!pip install python-igraph==0.7.1.post6

## Phase 0: Loading installed python packages and datasets into the environment

In [None]:
import scanpy as sc
import pandas as pd 
import numpy as np
import pcreode
import matplotlib.pyplot as plt

In [None]:
myeloid_adata = sc.read_h5ad("pCreode/data/Myeloid_Raw_Normalized_Transformed.h5ad")
myeloid_adata

The pCreode algorithm maps the topology of the cellular coordinates in PCA-transformed gene expression space.

![alt text](https://raw.githubusercontent.com/bobchen1701/SCA_Course_SP_2020/master/Screen%20Shot%202020-03-03%20at%205.29.13%20PM.png)

## Phase 1: Preprocessing

In [None]:
# Select highly variable genes to maximize the signal-to-noise ratio
sc.pp.highly_variable_genes(myeloid_adata, min_mean=0.1, min_disp=0)
sc.pl.highly_variable_genes(myeloid_adata)

In [None]:
myeloid_adata.var.highly_variable.sum()

In [None]:
# Run PCA on the 558 highly variable genes selected above. This lets us visualize the
# variation in single-cell gene expression in a more interpretable number of dimensions.
sc.pp.pca(myeloid_adata, n_comps=20, use_highly_variable=True)
sc.pl.pca(myeloid_adata, components=['1,2','1,3','2,3'], color='CD34')

In [None]:
# Note that the vast majority of the variance is captured by roughly the first 10 principal components
sc.pl.pca_variance_ratio(myeloid_adata)
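
To make the variance-ratio plot more concrete, here is a minimal sketch of how explained variance per principal component can be computed with a plain SVD. It uses a synthetic matrix, not the myeloid dataset, and the 95% threshold and the names `X`, `variance_ratio`, and `n_keep` are illustrative assumptions rather than anything scanpy or pCreode prescribes.

```python
import numpy as np

# Synthetic stand-in for a cells x genes matrix (NOT the myeloid dataset):
# three strong latent directions plus low-amplitude noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 2.0])
loadings = rng.normal(size=(3, 30))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 30))

# PCA via SVD of the mean-centered matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Each component's share of the total variance, and the running total.
variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(variance_ratio)

# Smallest number of components explaining at least 95% of the variance
# (the 95% cutoff is an arbitrary choice for this illustration).
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(variance_ratio[:5].round(3), n_keep)
```

A cumulative curve like this is one way to judge how many components carry most of the signal before subsetting the PCA matrix.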

In [None]:
sc.pl.pca(myeloid_adata,components=['1,2','2,3','3,4','4,5'],color='CD34')

In [None]:
# Subset the PCA observation matrix to its first 3 components using standard array slicing
pca_reduced_data = myeloid_adata.obsm['X_pca'][:, :3]

## Phase 1: Density-weighted K-nearest neighbors graph parameter optimization and construction

We must first determine the radius within which edges are drawn between nodes in this graph.

In [None]:
dens = pcreode.Density(pca_reduced_data)  # input the 3 principal components
best_guess = dens.nearest_neighbor_hist()  # look at the distribution of distances between cells in that space
# best automatic guess is 0.73
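
As a rough illustration of what a nearest-neighbor distance distribution looks like, the sketch below computes it with NumPy on synthetic points. How `pcreode` derives its automatic radius guess from this distribution is not shown in this notebook, so the median printed here is only a summary statistic for orientation, not a reproduction of `nearest_neighbor_hist`.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(300, 3))  # synthetic stand-in for pca_reduced_data

# Pairwise Euclidean distances (fine for small data; memory grows as n^2).
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself

# Distance from each point to its closest other point, binned for a histogram.
nn_dist = dist.min(axis=1)
counts, bin_edges = np.histogram(nn_dist, bins=20)
print("median nearest-neighbor distance:", round(float(np.median(nn_dist)), 3))
```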

Example of a radius set too low

In [None]:
# Store each cell's density: the number of neighbors that fall within the search radius
myeloid_adata.obs['Density'] = dens.get_density(radius=0.4)
# The distribution is skewed toward low densities, so the graph cannot capture more global similarities
dens.density_hist(n_bins=50)

In [None]:
# Overlay each cell's Density value onto the first three principal components
sc.pl.pca(myeloid_adata, components=['1,2','2,3'], color='Density')
# Some regions in this overlay have very low density, which is undesirable

Example of a radius set too high

In [None]:
myeloid_adata.obs['Density'] = dens.get_density( radius=3) #set myeloid_adata 'Density' observation
dens.density_hist( n_bins=100)

In [None]:
# Overlay each cell's Density value onto the first three principal components
sc.pl.pca(myeloid_adata, components=['1,2','2,3'], color='Density')
# The density no longer reflects local structure; it mostly marks where the "center" of the graph is

Example of a "just right" radius

In [None]:
myeloid_adata.obs['Density'] = dens.get_density( radius=1) #set myeloid_adata 'Density' observation
dens.density_hist( n_bins=100)

In [None]:
# Overlay each cell's Density value onto the first three principal components
sc.pl.pca(myeloid_adata, components=['1,2','2,3'], color='Density')
# This radius works well: the distribution of densities is fairly symmetric
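
The effect of the radius can be reproduced on toy data. The following is a minimal sketch, assuming density simply means the count of other points within a Euclidean radius; `radius_density` is a hypothetical helper written for this illustration, and whether `pcreode` counts a point as its own neighbor is not stated here.

```python
import numpy as np

def radius_density(points, radius):
    """Count, for each point, the other points lying within `radius`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Subtract 1 so a point is not counted as its own neighbor (an assumption).
    return (dist <= radius).sum(axis=1) - 1

rng = np.random.default_rng(2)
points = rng.normal(size=(400, 3))  # synthetic stand-in for pca_reduced_data

# Print min / median / max density for a small, moderate, and large radius.
for radius in (0.1, 1.0, 10.0):
    d = radius_density(points, radius)
    print(radius, d.min(), int(np.median(d)), d.max())
```

With a very small radius most points have no neighbors at all, and with a very large radius every point sees nearly the whole dataset, so in both extremes the density stops distinguishing local structure.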

## Phase 1: Downsampling and noise reduction

Here we determine the values used for the noise and target parameters.

In [None]:
noise = 8 #noise cutoff based on density value
target = 20 #target downsampling proportion

In [None]:
downed, downed_ind = pcreode.Down_Sample( pca_reduced_data, myeloid_adata.obs['Density'], noise, target)

In [None]:
sc.pl.pca(myeloid_adata[downed_ind],components=['1,2','2,3'],color = 'Density')  #density downsampled
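
For intuition about what a noise cutoff and a downsampling target can do, here is a hedged sketch of one common density-dependent scheme. It is an assumption for illustration only and is not taken from `pcreode.Down_Sample`: in particular, this notebook does not say whether `pcreode` treats `noise` and `target` as raw density values or as percentiles, and the sketch uses raw values. The helper name `density_downsample` is hypothetical.

```python
import numpy as np

def density_downsample(density, noise, target, rng):
    """Return indices of points kept after noise filtering and thinning.

    Assumed scheme (not pcreode's implementation): points with density below
    `noise` are dropped; each remaining point is kept with probability
    min(1, target / density), so dense regions are thinned toward `target`.
    """
    density = np.asarray(density, dtype=float)
    candidates = np.flatnonzero(density >= noise)
    keep_prob = np.minimum(1.0, target / density[candidates])
    return candidates[rng.random(candidates.size) < keep_prob]

rng = np.random.default_rng(3)
density = rng.integers(1, 100, size=1000)  # synthetic per-cell densities
kept = density_downsample(density, noise=8, target=20, rng=rng)
print(len(kept), "of", len(density), "points kept")
```

Under this scheme sparse but non-noisy regions are preserved in full, while crowded regions lose most of their points, which evens out the sampling across the data.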

## Phase 2-3: Running pCreode using the above determined parameters and PCA

In [None]:
file_path = "Output/"

In [None]:
!mkdir -p Output

In [None]:
out_graph, out_ids = pcreode.pCreode( data=pca_reduced_data, density=np.array(myeloid_adata.obs['Density']), noise=noise, 
                                      target=target, file_path=file_path, num_runs=3)

## Phase 4: Scoring each pCreode graph

In [None]:
graph_ranks = pcreode.pCreode_Scoring( data=pca_reduced_data, file_path=file_path, num_graphs=3)

In [None]:
analysis = pcreode.Analysis( file_path=file_path, graph_id=graph_ranks[0], data=pca_reduced_data, density=np.array(myeloid_adata.obs['Density']), noise=noise)

In [None]:
seed=5656

In [None]:
analysis.plot_save_graph( seed=seed, overlay=pd.Series(myeloid_adata.obs_vector('ELANE')), file_out='ELANE', upper_range=1.25,node_label_size=25)

In [None]:
analysis.plot_save_graph( seed=seed, overlay=pd.Series(myeloid_adata.obs_vector('CD34')), file_out='CD34', upper_range=3,node_label_size=25)

In [None]:
# Node 2 is used as the root, based on prior knowledge of CD34 expression
analysis.plot_analyte_dynamics(pd.Series(myeloid_adata.obs_vector('ELANE')), 2)