# Spatial functional analysis

Spatial transcriptomics technologies yield many molecular readouts that are hard to interpret by themselves. One way of summarizing this information is by inferring biological activities from prior knowledge.

In this notebook we showcase how to use `decoupler` for transcription factor activity, pathway activity and gene set enrichment inference on human spatial transcriptomics data. The data is a previously annotated `AnnData` object (`GSE206245_NPC_ST_Cluster_Tangram.h5ad`) whose spots are grouped into `scNiche` clusters.


<div class="alert alert-info">

**Note**
    
This tutorial assumes that you already know the basics of `decoupler`. If not, check out the [Usage](https://decoupler-py.readthedocs.io/en/latest/notebooks/usage.html) tutorial first.

</div>

## Loading packages

First, we need to load the relevant packages: `scanpy` to handle the transcriptomics data
and `decoupler` to run the statistical methods.

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import scanpy as sc
import decoupler as dc

# Plotting options, change to your liking.
# A single call is used because every call to set_figure_params resets
# the arguments that are not passed (such as dpi) to their defaults.
sc.set_figure_params(dpi=200, frameon=False, figsize=(4, 4))

import matplotlib as mpl
from matplotlib import font_manager as fm
import matplotlib.pyplot as plt

fm.fontManager.addfont("/usr/share/fonts/truetype/msttcorefonts/Arial.ttf")
mpl.rcParams.update({
    "font.family": ["Arial", "Noto Sans", "DejaVu Sans"],
    "mathtext.fontset": "dejavusans",
    "axes.unicode_minus": False,
    "pdf.fonttype": 42,
    "svg.fonttype": "none",
})

##### Set working directory for analysis

In [None]:
cwd = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(cwd)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

from pathlib import Path
saving_dir = Path('Results/10.NPC_ST_Analysis')
saving_dir.mkdir(parents=True, exist_ok=True)

## Loading the data

#### Reading in annotated AnnData object

In [None]:
adata = sc.read_h5ad("Processed Data/GSE206245_NPC_ST_Cluster_Tangram.h5ad")
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))

## Transcription factor activity inference

The first functional analysis we can perform is to infer transcription factor (TF) activities from our transcriptomics data. We will need a gene regulatory network (GRN) and a statistical method.

### CollecTRI network
[CollecTRI](https://github.com/saezlab/CollecTRI) is a comprehensive resource
containing a curated collection of TFs and their transcriptional targets
compiled from 12 different resources. This collection provides an increased
coverage of transcription factors and a superior performance in identifying
perturbed TFs compared to our previous
[DoRothEA](https://saezlab.github.io/dorothea/) network and other literature
based GRNs. Similar to DoRothEA, interactions are weighted by their mode of
regulation (activation or inhibition).

For this example we will use the human version (mouse and rat are also
available). We can use `decoupler` to retrieve it from `omnipath`. The argument
`split_complexes` controls whether complexes are kept intact or split into their
subunits; by default we recommend keeping complexes together.

<div class="alert alert-info">

**Note**
    
In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as [CellOracle](https://morris-lab.github.io/CellOracle.documentation/), [pySCENIC](https://pyscenic.readthedocs.io/en/latest/) or [SCENIC+](https://scenicplus.readthedocs.io/en/latest/). 

</div> 

In [None]:
net = dc.get_collectri(organism='human', split_complexes=False)
net

### Activity inference with Univariate Linear Model (ULM)

To infer TF enrichment scores we will run the Univariate Linear Model (`ulm`) method. For each spot in our slide (`adata`) and each TF in our network (`net`), it fits a linear model that predicts the observed gene expression based solely on the TF's TF-Gene interaction weights. Once fitted, the obtained t-value of the slope is the score. If it is positive, we interpret that the TF is active, and if it is negative, that it is inactive.
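
To make the idea concrete, here is a minimal sketch of the univariate fit for a single spot and a single TF, written with plain `numpy`. The weights and expression values are made-up toy numbers, and `ulm_t_value` is a hypothetical helper, not part of `decoupler`; genes that are not targets of the TF are assumed to carry a weight of 0.

```python
import numpy as np

def ulm_t_value(expression, weights):
    """t-value of the slope when regressing expression on one TF's weights."""
    x = np.asarray(weights, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = x.size
    # Centering both vectors is equivalent to fitting an intercept
    x_c = x - x.mean()
    y_c = y - y.mean()
    slope = (x_c @ y_c) / (x_c @ x_c)
    residuals = y_c - slope * x_c
    # Residual variance with n - 2 degrees of freedom (intercept and slope)
    sigma2 = (residuals @ residuals) / (n - 2)
    standard_error = np.sqrt(sigma2 / (x_c @ x_c))
    return slope / standard_error

# Toy spot with six genes: the hypothetical TF activates the first two genes,
# represses the third one and does not target the remaining three (weight 0)
weights = np.array([1.0, 1.0, -1.0, 0.0, 0.0, 0.0])
expression = np.array([3.1, 2.7, 0.2, 1.0, 1.4, 0.9])

t_active = ulm_t_value(expression, weights)
t_flipped = ulm_t_value(expression, -weights)
print(round(float(t_active), 3), round(float(t_flipped), 3))
```

Flipping the sign of every weight flips the sign of the score, which is why a negative value is read as an inactive TF.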


To run `decoupler` methods, we need an input matrix (`mat`), an input prior knowledge
network/resource (`net`), and the name of the columns of `net` that we want to use.

In [None]:
dc.run_ulm(
    mat=adata,
    net=net,
    source='source',
    target='target',
    weight='weight',
    verbose=True,
    use_raw=False
)

The obtained scores (`ulm_estimate`) and p-values (`ulm_pvals`) are stored as keys in `.obsm`:

In [None]:
adata.obsm['ulm_estimate']

**Note**: Each run of `run_ulm` overwrites the contents of `ulm_estimate` and `ulm_pvals`. If you want to run `ulm` with other resources and still keep the activities inside the same `AnnData` object, you can store the results under different key names in `.obsm`, for example:

In [None]:
adata.obsm['collectri_ulm_estimate'] = adata.obsm['ulm_estimate'].copy()
adata.obsm['collectri_ulm_pvals'] = adata.obsm['ulm_pvals'].copy()
adata

### Visualization

To visualize the obtained scores, we can re-use many of `scanpy`'s plotting functions.
First though, we need to extract them from the `adata` object.

In [None]:
acts = dc.get_acts(adata, obsm_key='collectri_ulm_estimate')
acts

`dc.get_acts` returns a new `AnnData` object that holds the obtained activities in its `.X` attribute, allowing us to re-use many `scanpy` functions. Displaying it again shows that its variables are now TFs rather than genes:

In [None]:
acts

### Exploration

Let's identify the top TFs per cluster. We can do it by using the function `dc.rank_sources_groups`, which identifies marker TFs using the same statistical tests available in scanpy's `scanpy.tl.rank_genes_groups`.

In [None]:
df = dc.rank_sources_groups(acts, groupby='scNiche', reference='rest', method='t-test_overestim_var')
df
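
To give an intuition for what a one-vs-rest test does, the sketch below computes a Welch t-statistic for one TF, comparing each cluster against all remaining spots. It is an illustration with made-up activities and a hypothetical helper `welch_t`; the `t-test_overestim_var` method used above is a `scanpy` variant of the t-test with a different variance estimate, so its numbers would not match this sketch exactly.

```python
import numpy as np

def welch_t(group, rest):
    """Welch t-statistic comparing the mean of `group` with the mean of `rest`."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    # Unequal-variance standard error of the difference between the two means
    standard_error = np.sqrt(
        group.var(ddof=1) / group.size + rest.var(ddof=1) / rest.size
    )
    return (group.mean() - rest.mean()) / standard_error

# Made-up activities of one TF across eight spots in three toy clusters
activity = np.array([2.1, 1.8, 2.4, -0.3, 0.1, -0.2, 0.0, 0.3])
clusters = np.array(["A", "A", "A", "B", "B", "B", "C", "C"])

# One-vs-rest comparison: each cluster against every other spot
scores = {
    str(name): welch_t(activity[clusters == name], activity[clusters != name])
    for name in np.unique(clusters)
}
for name, t_value in scores.items():
    print(name, round(float(t_value), 2))
```

A large positive statistic marks a TF whose activity is higher in that cluster than elsewhere, which is what makes it a marker candidate.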

We can then extract the top 5 markers per cluster:

In [None]:
n_markers = 5
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers

We can plot the obtained markers:

In [None]:
sc.pl.matrixplot(acts, source_markers, 'scNiche', dendrogram=False, standard_scale='var',
                 colorbar_title='Z-scaled scores', cmap='magma',
                 save="TF_Activities_Spatial_scNiche.pdf")
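
A note on the scaling: as far as we understand `scanpy`'s `standard_scale='var'` option, it rescales the per-cluster mean of each variable to the range 0 to 1 (subtract the minimum, then divide by the maximum) rather than computing z-scores, so the colorbar is best read as a relative scale. The toy sketch below reproduces that min-max scaling with `numpy`; `group_means` is made-up data.

```python
import numpy as np

# Rows are clusters, columns are TFs: made-up mean activities per cluster
group_means = np.array([
    [2.0, -1.0, 0.5],
    [4.0, 1.0, 0.5],
    [3.0, 3.0, 1.5],
])

# Min-max scaling per column: the lowest cluster maps to 0, the highest to 1
shifted = group_means - group_means.min(axis=0)
scaled = shifted / shifted.max(axis=0)
print(scaled)
```

Because every column is scaled independently, colors are comparable across clusters for one TF but not across TFs.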

## Pathway activity inference

We can also infer pathway activities from our transcriptomics data.

### PROGENy model

[PROGENy](https://saezlab.github.io/progeny/) is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction.
For this example we will use the human weights (other organisms are available) and we will use the top 100 responsive genes ranked by p-value. Here is a brief description of each pathway:

- **Androgen**: involved in the growth and development of the male reproductive organs.
- **EGFR**: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells.
- **Estrogen**: promotes the growth and development of the female reproductive organs.
- **Hypoxia**: promotes angiogenesis and metabolic reprogramming when O2 levels are low.
- **JAK-STAT**: involved in immunity, cell division, cell death, and tumor formation.
- **MAPK**: integrates external signals and promotes cell growth and proliferation.
- **NFkB**: regulates immune response, cytokine production and cell survival.
- **p53**: regulates cell cycle, apoptosis, DNA repair and tumor suppression.
- **PI3K**: promotes growth and proliferation.
- **TGFb**: involved in development, homeostasis, and repair of most tissues.
- **TNFa**: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.
- **Trail**: induces apoptosis.
- **VEGF**: mediates angiogenesis, vascular permeability, and cell migration.
- **WNT**: regulates organ morphogenesis during development and tissue repair.

To access it we can use `decoupler`.

In [None]:
progeny = dc.get_progeny(organism='human', top=100)
progeny

### Activity inference with Multivariate Linear Model (MLM)

To infer pathway enrichment scores we will run the Multivariate Linear Model (`mlm`) method. For each spot in our slide (`adata`), it fits a linear model that predicts the observed gene expression based on the Pathway-Gene interaction weights of all pathways at once.
Once fitted, the obtained t-values of the slopes are the scores. If a score is positive, we interpret that the pathway is active, and if it is negative, that it is inactive.
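
As a minimal sketch of the multivariate fit, the code below regresses the expression of one toy spot on the weights of two hypothetical pathways at once and returns one t-value per pathway. All numbers are made up, `mlm_t_values` is not a `decoupler` function, and genes outside a pathway are assumed to have weight 0.

```python
import numpy as np

def mlm_t_values(expression, weight_matrix):
    """t-values of the slopes of an ordinary least squares fit with intercept."""
    y = np.asarray(expression, dtype=float)
    w = np.asarray(weight_matrix, dtype=float)
    n_genes = w.shape[0]
    # Design matrix: one intercept column plus one column per pathway
    design = np.column_stack([np.ones(n_genes), w])
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = n_genes - design.shape[1]
    sigma2 = (residuals @ residuals) / dof
    covariance = sigma2 * np.linalg.inv(design.T @ design)
    t_values = coef / np.sqrt(np.diag(covariance))
    return t_values[1:]  # drop the intercept

# Eight toy genes and two hypothetical pathways (columns)
weight_matrix = np.array([
    [2.0, 0.0], [1.5, 0.0], [1.0, 0.0],
    [0.0, 1.0], [0.0, 2.0], [0.0, -1.0],
    [0.0, 0.0], [0.0, 0.0],
])
# Expression that rises with the first pathway's weights and falls with the second's
expression = np.array([3.1, 2.4, 2.05, 0.45, 0.1, 1.4, 1.05, 0.95])

t_values = mlm_t_values(expression, weight_matrix)
print(np.round(t_values, 2))
```

Since all pathways enter the same model, each t-value reflects the contribution of one pathway after accounting for the others.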
     
We can run `mlm` with a simple one-liner:

In [None]:
dc.run_mlm(
    mat=adata,
    net=progeny,
    source='source',
    target='target',
    weight='weight',
    verbose=True,
    use_raw=False
)

# Store in new obsm keys
adata.obsm['progeny_mlm_estimate'] = adata.obsm['mlm_estimate'].copy()
adata.obsm['progeny_mlm_pvals'] = adata.obsm['mlm_pvals'].copy()

The obtained scores (t-values, `mlm_estimate`) and p-values (`mlm_pvals`) are stored as keys in `.obsm`:

In [None]:
adata.obsm['progeny_mlm_estimate']

### Visualization

Like in the previous section, we will extract the activities from the `adata` object.

In [None]:
acts = dc.get_acts(adata, obsm_key='progeny_mlm_estimate')
acts

Once extracted we can visualize them:

In [None]:
sc.pl.violin(
    acts,
    keys='EGFR',
    groupby='scNiche',
    rotation=90
)

Here we show the activity of the EGFR pathway, which regulates growth, survival, migration, apoptosis, proliferation, and differentiation.

### Exploration

We can visualize which pathways are more active in each cluster:

In [None]:
sc.pl.matrixplot(acts, var_names=acts.var_names, groupby='scNiche', dendrogram=False, standard_scale='var',
                 colorbar_title='Z-scaled scores', cmap='magma',
                 save="PROGENy_Pathways_Spatial_scNiche.pdf")

## Functional enrichment of biological terms

Finally, we can also infer activities for general biological terms or processes.

### MSigDB gene sets

The Molecular Signatures Database ([MSigDB](http://www.gsea-msigdb.org/gsea/msigdb/)) is a resource containing a collection of gene sets annotated to different biological processes.

In [None]:
# Retrieve MSigDB resource
msigdb = dc.get_resource('MSigDB')
msigdb

As an example, we will use the hallmark gene sets (a custom stemness gene set is added further below), but we could have used any other.

<div class="alert alert-info">

**Note**
    
To see what other collections are available in MSigDB, type: `msigdb['collection'].unique()`.

</div>  

We can filter for the `hallmark` collection:

In [None]:
# Keep only the hallmark collection; .copy() avoids pandas' SettingWithCopyWarning below
msigdb = msigdb[msigdb['collection'] == 'hallmark'].copy()

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])].copy()

# Strip the 'HALLMARK_' prefix from the gene set names
def strip_hallmark_prefix(name):
    # Names without the prefix are returned unchanged
    if name.startswith('HALLMARK_'):
        return name.split('HALLMARK_', 1)[1]
    return name

# Apply the renaming to the 'geneset' column
msigdb['geneset'] = msigdb['geneset'].apply(strip_hallmark_prefix)

# Check the result
msigdb

### Enrichment with Over Representation Analysis (ORA)

To infer functional enrichment scores we will run the Over Representation Analysis (`ora`) method.
As input data it accepts an expression matrix (`decoupler.run_ora`) or the results of differential expression analysis (`decoupler.run_ora_df`).
For the former, by default the top 5% of expressed genes by sample are selected as the set of interest (S*), and for the latter a user-defined
significance filtering can be used.
Once we have S*, it builds a contingency table using set operations for each set stored in the gene set resource being used (`net`).
Using the contingency table, `ora` performs a one-sided Fisher exact test to test for significance of overlap between sets.
The final score is obtained by log-transforming the obtained p-values, meaning that higher values are more significant.
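
The contingency table and the one-sided test can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions: the gene names, the background size of 20 genes and the helper `ora_score` are hypothetical, and `decoupler`'s own implementation may differ in details.

```python
from math import comb, log10

def ora_score(top_genes, gene_set, n_background):
    """Contingency table, one-sided Fisher p-value and -log10(p) for one gene set."""
    top_genes, gene_set = set(top_genes), set(gene_set)
    a = len(top_genes & gene_set)   # in S* and in the gene set
    b = len(gene_set - top_genes)   # in the gene set only
    c = len(top_genes - gene_set)   # in S* only
    d = n_background - a - b - c    # in neither
    n_set, n_top = a + b, a + c
    # One-sided Fisher exact test = upper tail of the hypergeometric distribution:
    # probability of an overlap of at least `a` genes by chance
    tail = sum(
        comb(n_set, k) * comb(n_background - n_set, n_top - k)
        for k in range(a, min(n_set, n_top) + 1)
    )
    p_value = tail / comb(n_background, n_top)
    return (a, b, c, d), p_value, -log10(p_value)

# Toy example: S* holds four genes, three of which belong to a five-gene set
table, p_value, score = ora_score(
    top_genes=["A", "B", "C", "D"],
    gene_set=["A", "B", "C", "X", "Y"],
    n_background=20,
)
print(table, round(p_value, 4), round(score, 3))
```

When a p-value is so small that it underflows to 0, the log-transformed score becomes infinite, which is why infinite values are capped before plotting.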

     
We can run `ora` with a simple one-liner:

In [None]:
dc.run_ora(
    mat=adata,
    net=msigdb,
    source='geneset',
    target='genesymbol',
    verbose=True,
    use_raw=False
)

# Store in a different key
adata.obsm['msigdb_ora_estimate'] = adata.obsm['ora_estimate'].copy()
adata.obsm['msigdb_ora_pvals'] = adata.obsm['ora_pvals'].copy()

The obtained scores (-log10(p-value), `ora_estimate`) and p-values (`ora_pvals`) are stored as keys in `.obsm`:

In [None]:
adata.obsm['msigdb_ora_estimate'].iloc[:, 0:5]

### Visualization

Like in the previous sections, we will extract the activities from the `adata` object.

In [None]:
acts = dc.get_acts(adata, obsm_key='msigdb_ora_estimate')

# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e

acts

Once extracted we can visualize them:

In [None]:
sc.pl.matrixplot(
    acts,
    var_names=acts.var_names,      
    groupby='scNiche',      
    dendrogram=False,                
    standard_scale='var',            
    colorbar_title='Z-scaled scores', 
    cmap='magma',                    
    swap_axes=False,                   
    save="Hallmark_Enrichment_Spatial_scNiche_New.pdf"
)


### Exploration

Let's identify the top terms per cluster. We can do it by using the function `dc.rank_sources_groups`, as shown before.

In [None]:
df = dc.rank_sources_groups(acts, groupby='scNiche', reference='rest', method='t-test_overestim_var')
df

We can then extract the top 5 terms per cluster:

In [None]:
n_top = 5
term_markers = df.groupby('group').head(n_top).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
term_markers

We can plot the obtained terms:

In [None]:
# Plot the top terms per cluster; swap_axes=True swaps the rows and columns of the plot
sc.pl.matrixplot(acts, term_markers, 'scNiche', dendrogram=False, standard_scale='var',
                 colorbar_title='Z-scaled scores', cmap='magma', swap_axes=True, 
                 save="Top_Hallmark_Enrichment_Spatial_scNiche_New.pdf")

### Custom stemness gene set

In [None]:
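# Read a GMT file, dropping empty and "NA" gene entries
def read_gmt(gmt_file):
    gene_sets = {}
    with open(gmt_file, 'r') as f:
        for line in f:
            # GMT fields are tab-separated: set name, description, then the genes
            parts = line.strip().split('\t')
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            set_name = parts[0]
            genes = [gene for gene in parts[2:] if gene and gene != "NA"]
            gene_sets[set_name] = genes
    return gene_sets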
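
The GMT format stores one gene set per line as tab-separated fields: the set name, a description, and then the member genes. The sketch below writes a made-up two-line GMT file to a temporary directory and parses it with the same filtering logic, so we can see how empty fields and `NA` entries are dropped. The file contents and the helper `parse_gmt` are hypothetical.

```python
import tempfile
from pathlib import Path

# Made-up GMT content: the first set contains an "NA" entry and a trailing tab
toy_gmt = (
    "TOY_STEMNESS\thypothetical description\tSOX2\tNA\tPOU5F1\t\n"
    "TOY_OTHER\tNA\tGENE1\tGENE2\n"
)

def parse_gmt(path):
    """Parse a GMT file into a dict mapping set name to a list of genes."""
    gene_sets = {}
    for line in Path(path).read_text().splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        # Only the gene fields are filtered, so an "NA" description is harmless
        gene_sets[name] = [gene for gene in genes if gene and gene != "NA"]
    return gene_sets

with tempfile.TemporaryDirectory() as tmp_dir:
    gmt_path = Path(tmp_dir) / "toy.gmt"
    gmt_path.write_text(toy_gmt)
    parsed = parse_gmt(gmt_path)

print(parsed)
```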


In [None]:
gmt_file = 'Dataset/CURATED_STEMNESS_GENESET_PNAS_2019.gmt'
gene_sets = read_gmt(gmt_file)

gene_sets

In [None]:
# Create a new DataFrame in the same long format as msigdb
stemness_genes = gene_sets['CURATED_STEMNESS_GENESET_PNAS_2019']
new_data = pd.DataFrame({
    'genesymbol': stemness_genes,
    'collection': 'hallmark',  # scalar values are repeated for every gene
    'geneset': 'Stemness',
})
new_data

In [None]:
# Merge the new data into the existing msigdb dataset
msigdb = pd.concat([msigdb, new_data], ignore_index=True)
print(msigdb)

In [None]:
dc.run_ora(
    mat=adata,
    net=msigdb,
    source='geneset',
    target='genesymbol',
    verbose=True,
    use_raw=False
)

# Store in a different key
adata.obsm['msigdb_ora_estimate'] = adata.obsm['ora_estimate'].copy()
adata.obsm['msigdb_ora_pvals'] = adata.obsm['ora_pvals'].copy()

In [None]:
acts = dc.get_acts(adata, obsm_key='msigdb_ora_estimate')

# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e

acts

In [None]:
sc.pl.spatial(
    acts[acts.obs['sample_id']=="NPC_ST19"],
    color=['Tumor', 'SPP1+ Macro','Stemness','MYC_TARGETS_V2','E2F_TARGETS','G2M_CHECKPOINT','WNT_BETA_CATENIN_SIGNALING','TGF_BETA_SIGNALING'],
    cmap='magma',
    size=1.5,
    ncols=4,
    library_id="NPC_ST19",
    frameon=False,
    save="Top_Hallmark_Enrichment_Spatial_scNiche4_new.pdf"
)

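### Custom immune function gene sets

We can compute enrichment scores for a custom collection of immune function gene sets by running the `ora` method. The table must provide the `geneset` and `genesymbol` columns that are passed to `dc.run_ora` below.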

In [None]:
Immune_Functional_Signatures = pd.read_excel('Dataset/Immune_Functional_Genesets.xlsx', sheet_name=0)
Immune_Functional_Signatures.head()

In [None]:
dc.run_ora(
    mat=adata,
    net=Immune_Functional_Signatures,
    source='geneset',
    target='genesymbol',
    verbose=True,
    use_raw=False
)

# Store in a different key
adata.obsm['Immune_Functional_Signatures_ora_estimate'] = adata.obsm['ora_estimate'].copy()
adata.obsm['Immune_Functional_Signatures_ora_pvals'] = adata.obsm['ora_pvals'].copy()

### Visualization

Like in the previous sections, we will extract the activities from the `adata` object.

In [None]:
acts = dc.get_acts(adata, obsm_key='Immune_Functional_Signatures_ora_estimate')

# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e

acts

Once extracted we can visualize them:

In [None]:
sc.pl.matrixplot(
    acts,
    var_names=acts.var_names,         
    groupby='scNiche',               
    dendrogram=False,                 
    standard_scale='var',           
    colorbar_title='Z-scaled scores', 
    cmap='magma',                    
    swap_axes=False,       
    save="Immune_Functional_Enrichment_Spatial_scNiche.pdf"
)



**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib import metadata

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print all installed packages and their versions
# (importlib.metadata is used because pkg_resources is deprecated)
print("\nInstalled packages and their versions:")
installed = {
    (dist.metadata['Name'] or 'unknown').lower(): dist.version
    for dist in metadata.distributions()
}
for name, version in sorted(installed.items()):
    print(name, version)