# Transcription factor activity with Decoupler


## Data

<div style="padding-top: 10px; font-size: 15px;">
We'll start from the previously saved AnnData of metacells that we called <code>3_CellrankAdata.h5ad</code>


## Notebook content
<div style="padding-top: 10px; font-size: 15px;">
    <ul>
        <li>TF activity computation with Decoupler </li>
        <li>TF activity along trajectories - Combining Decoupler and CellRank</li>
</ul>

</div>

<div style="padding-top: 10px; font-size: 15px;">
Decoupler overview:

<div>
  <img src="https://decoupler-py.readthedocs.io/en/latest/_images/graphical_abstract.png" width="800">
</div>

Reference: <a href="https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613">decoupleR: ensemble of computational methods to infer biological activities from omics data</a>

From the homepage of the <a href="https://decoupler-py.readthedocs.io/en/latest/">documentation</a>:
> `decoupler` is a package containing different **statistical methods** to extract biological activities from omics data within a unified framework. It allows to flexibly test any enrichment method with any prior knowledge resource and incorporates methods that take into account the sign and weight.

It also wraps many utilities for <a href="https://decoupler-py.readthedocs.io/en/latest/notebooks/pseudobulk.html">pseudobulk analysis</a>, <a href="https://decoupler-py.readthedocs.io/en/latest/notebooks/msigdb.html">functional enrichment and database access</a>, and <a href="https://decoupler-py.readthedocs.io/en/latest/notebooks/translate.html">gene name conversion</a>.

Today we will focus on Transcription Factor (TF) activity inference. A tutorial for this can be found <a href="https://decoupler-py.readthedocs.io/en/latest/notebooks/dorothea.html">here</a>.

---

# Library loading

In [None]:
# Standard library
import itertools
import os
import random
import sys
import warnings

# Silence warnings before loading the heavier packages
warnings.filterwarnings('ignore')

# Scientific stack
import numpy as np
import pandas as pd
import scipy.sparse as sp
import statsmodels.api as sm
import yaml
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids

# Single-cell analysis
import anndata as ad
import scanpy as sc
import scanpy.external as sce
import scvelo as scv
import cellrank as cr
import decoupler as dc

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import TwoSlopeNorm
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

pio.renderers.default = "jupyterlab"
random.seed(1)


In [None]:
homeDir = os.getenv("HOME")

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()

# Make the local helper modules in ./utils/ importable
sys.path.insert(1, "./utils/")


from CleanAdata import *
from SankeyOBS import *


# Load Metacells Anndata

<div style="padding-top: 10px; font-size: 15px;">
Here we load the dataset. If you don't have this AnnData saved in the current folder, uncomment the second line and comment out the first:

In [None]:
CombinedAdata = sc.read_h5ad("./3_CellrankAdata.h5ad")
#CombinedAdata = sc.read_h5ad("/group/brainomics/InputData/3_CellrankAdata.h5ad")
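
As an alternative to commenting lines in and out, the choice between the two locations can be automated. The sketch below is only an illustration: `first_existing` and `candidate_paths` are hypothetical names introduced here, not part of the notebook's utilities.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first path in `candidates` that exists, else raise."""
    candidates = list(candidates)
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    raise FileNotFoundError(f"None of these files exist: {candidates}")


# The two locations mentioned in the cell above, in order of preference
candidate_paths = [
    "./3_CellrankAdata.h5ad",
    "/group/brainomics/InputData/3_CellrankAdata.h5ad",
]

# Usage (commented out so the sketch runs without the data):
# CombinedAdata = sc.read_h5ad(first_existing(candidate_paths))
```

The local copy is listed first, so it takes precedence whenever both files are present.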

# Compute TF activity

<div style="padding-top: 10px; font-size: 15px;">
We will compute transcription factor activity from the expression of each TF's targets as imputed by MAGIC, in order to obtain a cleaner signal. Each TF activity is estimated with a <a href="https://decoupler-py.readthedocs.io/en/latest/generated/decoupler.run_ulm.html#decoupler.run_ulm">Univariate Linear Model</a> (ULM) that regresses gene expression on the TF-target interaction weights. The targets and their weights come from an external resource, in our case <a href="https://github.com/saezlab/CollecTRI">CollecTRI</a>, but you could use other resources such as <a href="https://saezlab.github.io/dorothea/">DoRothEA</a>.

In [None]:
CombinedAdata.X = CombinedAdata.layers["MAGIC_imputed_data"].copy()

# Download database of regulons
net = dc.get_collectri(organism='human', split_complexes=False)
net

<div style="padding-top: 10px; font-size: 15px;">
We run the model, fitting it on each cell's imputed gene expression. The activity will be inferred from the t-value of the slope:

<div>
  <img src="https://decoupler-py.readthedocs.io/en/latest/_images/ulm.png" width="800">
</div>
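
To make the idea concrete, here is a minimal sketch of the statistic for a single cell and a single TF, assuming plain ordinary least squares with an intercept. The function `ulm_t_value` and the toy vectors are illustrative assumptions; this is not decoupler's implementation, which handles all cells and TFs at once.

```python
import numpy as np


def ulm_t_value(expression, weights):
    """t-value of the slope when regressing expression on regulon weights."""
    x = np.asarray(weights, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = x.size
    x_c = x - x.mean()
    y_c = y - y.mean()
    ssx = np.sum(x_c ** 2)
    slope = np.sum(x_c * y_c) / ssx
    residuals = y_c - slope * x_c
    # Residual variance with n - 2 degrees of freedom (slope and intercept)
    sigma2 = np.sum(residuals ** 2) / (n - 2)
    return slope / np.sqrt(sigma2 / ssx)


# Toy example: 6 genes; a hypothetical TF activates the first two genes,
# represses the third, and has no annotated effect (weight 0) on the rest
toy_weights = np.array([1.0, 1.0, -1.0, 0.0, 0.0, 0.0])
toy_expression = np.array([2.1, 1.8, 0.1, 1.0, 0.9, 1.2])

toy_score = ulm_t_value(toy_expression, toy_weights)
print(f"toy ULM score: {toy_score:.2f}")
```

A positive score means that activated targets are high and repressed targets are low in that cell; flipping the sign of every weight flips the sign of the score.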

In [None]:
dc.run_ulm(
    mat=CombinedAdata,
    net=net,
    use_raw=False,
    source='source',
    target='target',
    weight='weight',
    verbose=True
)

<div style="padding-top: 10px; font-size: 15px;">
In the end we'll have a score for each TF in each cell, which we extract and store in a new AnnData using the `get_acts()` function.

In [None]:
acts = dc.get_acts(CombinedAdata, obsm_key='ulm_estimate')
acts

<div style="padding-top: 10px; font-size: 15px;">
For each cell type we can compute the "marker TFs" using the <code>rank_sources_groups()</code> function, a wrapper of the <code>rank_genes_groups()</code> function from Scanpy. We can then inspect the most active and the least active TFs for each group:

In [None]:
df = dc.rank_sources_groups(acts, groupby='AggregatedClass', reference='rest', method='t-test_overestim_var')
n_markers = 7
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers
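# standard_scale='var' rescales each TF to the [0, 1] range across groups (min-max, not a z-score)
sc.pl.matrixplot(acts, source_markers, 'AggregatedClass', dendrogram=True, standard_scale='var',
                 colorbar_title='Scaled activity (0-1)', cmap='RdBu_r')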

In [None]:
dfdown = dc.rank_sources_groups(acts, groupby='AggregatedClass', reference='rest', method='t-test_overestim_var')
n_markers = 7
source_markersDOWN = dfdown.groupby('group').tail(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markersDOWN
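# standard_scale='var' rescales each TF to the [0, 1] range across groups (min-max, not a z-score)
sc.pl.matrixplot(acts, source_markersDOWN, 'AggregatedClass', dendrogram=True, standard_scale='var',
                 colorbar_title='Scaled activity (0-1)', cmap='RdBu_r')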
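
The `groupby(...).head(n)` / `tail(n)` pattern used in the two cells above can be checked on a toy table. This sketch assumes the ranking table is sorted by decreasing statistic within each group; the data frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy ranking table, sorted by decreasing statistic within each group
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "names": ["TF1", "TF2", "TF3", "TF4", "TF3", "TF1", "TF4", "TF2"],
    "statistic": [5.0, 2.0, -1.0, -4.0, 6.0, 1.0, -2.0, -3.0],
})
toy = toy.sort_values(["group", "statistic"], ascending=[True, False])

n = 2
# First n rows of each group: highest statistics
top = toy.groupby("group").head(n).groupby("group")["names"].apply(list).to_dict()
# Last n rows of each group: lowest statistics
bottom = toy.groupby("group").tail(n).groupby("group")["names"].apply(list).to_dict()

print(top)
print(bottom)
```

Because row order is preserved, the first entry of each `top` list is the strongest positive marker, while the strongest negative marker is the last entry of each `bottom` list.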

<div style="padding-top: 10px; font-size: 15px;">
Let's visualize these top markers on the force-directed graph embedding (<code>draw_graph</code>):

In [None]:
UpandDOwnMarkers = {celltype:[source_markersDOWN[celltype][0]]+[source_markers[celltype][0]] for celltype in list(source_markersDOWN.keys())}

In [None]:
for k in UpandDOwnMarkers.keys():
    print(k)
    sc.pl.draw_graph(acts, color=UpandDOwnMarkers[k], cmap='RdBu_r',  add_outline=True, ncols=2, vmin='p1', vmax='p99', 
                     title=["Top down:{} for {}".format(UpandDOwnMarkers[k][0], k), "Top up:{} for {}".format(UpandDOwnMarkers[k][1], k)])

# Combining TF activity with CellRank

<div style="padding-top: 10px; font-size: 15px;">
We load the GPCCA model that we previously fitted to infer macrostates from our combined kernel of pseudotime (Palantir output), pluripotency score (CytoTRACE output), and transcriptional similarity, and once again compute the macrostates and fate probabilities:

In [None]:
import pickle

with open('./GPCCA.pickle', 'rb') as file:
    g = pickle.load(file)

In [None]:
g.fit(n_states=4, cluster_key="AggregatedLabel")
g.plot_macrostates(which="all", basis="X_draw_graph_fa")
g.set_initial_states("CycProg")
g.set_terminal_states(["RG_late", "SubPlate","OPC_Oligo"])
g.compute_fate_probabilities()
g.plot_fate_probabilities(basis="X_draw_graph_fa", same_plot=False, add_outline=True)
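
Conceptually, a fate probability is the probability that a random walk started in a given cell is absorbed in a given terminal state. The sketch below illustrates this on a hypothetical four-state Markov chain using the standard absorbing-chain formula; the transition matrix is invented and this is not CellRank's solver.

```python
import numpy as np

# Toy transition matrix: states 0 and 1 are transient,
# states 2 and 3 are absorbing (terminal)
T = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.2, 0.4, 0.0, 0.4],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
transient = [0, 1]
terminal = [2, 3]

Q = T[np.ix_(transient, transient)]  # transient -> transient
R = T[np.ix_(transient, terminal)]   # transient -> terminal

# Absorption probabilities solve (I - Q) X = R
fate = np.linalg.solve(np.eye(len(transient)) - Q, R)
print(fate)
```

Each row of `fate` corresponds to one transient state and sums to 1 across the terminal states, the same property that per-cell fate probabilities have.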

<div style="padding-top: 10px; font-size: 15px;">
    
Again, we can determine trends by fitting a GAM. Here, however, we will determine trends in TF activity, rather than gene expression, along trajectories:

In [None]:
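# Keep the cells present in `acts` and reuse the first n_TF columns as
# placeholders that will hold TF activities instead of gene expression
g.adata = g.adata[acts.obs_names, 0:acts.shape[1]].copy()
g.adata.var_names = acts.var_names
# Assign to the AnnData: `g.X = ...` would only set an unused attribute on the estimator
g.adata.X = acts.X.copy()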

In [None]:
acts.uns = g.adata.uns.copy()
acts.obsm = g.adata.obsm.copy()
acts.obs = g.adata.obs.copy()

In [None]:
model = cr.models.GAMR(acts, n_knots=6, smoothing_penalty=10.0)


# compute putative drivers for the OPC_Oligo lineage
OPC_Oligo_drivers = g.compute_lineage_drivers(lineages="OPC_Oligo")

# plot heatmap
cr.pl.heatmap(
    acts,
    model=model,  # use the model from before
    lineages="OPC_Oligo",
    cluster_key="AggregatedLabel",
    show_fate_probabilities=True,
    genes=OPC_Oligo_drivers.head(40).index,
    time_key="palantir_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)
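
The ranking of putative drivers can be pictured as correlating each TF's activity across cells with the fate probability toward the lineage. Below is a minimal sketch of that idea, assuming a plain Pearson correlation; the arrays and TF names are toy values, and CellRank additionally reports significance statistics that are omitted here.

```python
import numpy as np

# Toy data: fate probability toward one lineage for 5 cells
fate_prob = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Toy TF activities (rows: cells, columns: TFs)
tf_names = ["TF_up", "TF_flat", "TF_down"]
activities = np.array([
    [0.0,  1.0, 4.0],
    [1.0, -1.0, 3.0],
    [2.0,  0.0, 2.0],
    [3.0, -1.0, 1.0],
    [4.0,  1.0, 0.0],
])


def pearson_with_fate(activities, fate_prob):
    """Pearson correlation of every column of `activities` with `fate_prob`."""
    a = activities - activities.mean(axis=0)
    f = fate_prob - fate_prob.mean()
    return (a * f[:, None]).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (f ** 2).sum())


corr = pearson_with_fate(activities, fate_prob)
# Rank TFs from most positively to most negatively correlated
ranking = [tf_names[i] for i in np.argsort(-corr)]
print(ranking)
```

TFs at the top of the ranking increase in activity as cells commit to the lineage, which is why the first rows of the driver table are the ones passed to the heatmap.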

In [None]:
# compute putative drivers for the RG_late lineage
RG_late_drivers = g.compute_lineage_drivers(lineages="RG_late")

# plot heatmap
cr.pl.heatmap(
    acts,
    model=model,  # use the model from before
    lineages="RG_late",
    cluster_key="AggregatedLabel",
    show_fate_probabilities=True,
    genes=RG_late_drivers.head(40).index,
    time_key="palantir_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)

In [None]:
# compute putative drivers for the SubPlate lineage
SubPlate_drivers = g.compute_lineage_drivers(lineages="SubPlate")

# plot heatmap
cr.pl.heatmap(
    acts,
    model=model,  # use the model from before
    lineages="SubPlate",
    cluster_key="AggregatedLabel",
    show_fate_probabilities=True,
    genes=SubPlate_drivers.head(40).index,
    time_key="palantir_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)