## 07_2. Myeloid -- Functional enrichment of biological terms

<div>
    <p style="text-align: left;">Updated Time: 2025-02-20</p>
</div>

#### Loading packages

First, we load the packages used in this analysis: `scanpy` to handle scRNA-seq data,
`decoupler` for the enrichment statistics, and `omicverse` for count recovery and plot styling.

In [None]:
import scanpy as sc
import omicverse as ov
import decoupler as dc
import pertpy as pt

# Only needed for processing
import os
import sys
import numpy as np
import pandas as pd

# Needed for some plotting
import matplotlib.pyplot as plt
ov.plot_set()

# Plotting options, change to your liking
# (one call only: each call resets unspecified options to their defaults)
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(4, 4))

import warnings
warnings.simplefilter("ignore")

#### Loading the data



**<span style="font-size:16px;">Set working directory for analysis</span>**

In [None]:
cwd = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(cwd)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

**<span style="font-size:16px;">Load data for analysis</span>**

In [None]:
adata_myeloid = sc.read_h5ad("Processed Data/scRNA_Myeloid.h5ad")
adata_myeloid

In [None]:
for i in adata_myeloid.obs['Myeloid_subtype'].cat.categories:
  number = len(adata_myeloid.obs[adata_myeloid.obs['Myeloid_subtype']==i])
  print('the number of category {} is {}'.format(i,number))

In [None]:
print(np.min(adata_myeloid.X), np.max(adata_myeloid.X))
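
The minimum and maximum printed above give a quick hint about whether `.X` holds raw counts or normalized, log-transformed values. A slightly more explicit check is sketched below. `looks_like_counts` is a hypothetical helper (it is not part of scanpy or omicverse), and the toy matrices stand in for `adata_myeloid.X`.

```python
import numpy as np
from scipy import sparse

def looks_like_counts(X, n_check=1000):
    """Heuristic: True if the sampled values are non-negative integers."""
    # For sparse input only the stored values are inspected; implicit zeros are integers anyway.
    values = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    values = values[:n_check]
    return bool(np.all(values >= 0) and np.allclose(values, np.round(values)))

# Toy raw counts (cells x genes) and their normalized, log1p-transformed version
raw = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 5]], dtype=np.float32))
dense = raw.toarray()
logged = np.log1p(dense / dense.sum(axis=1, keepdims=True) * 1e4)

print("raw looks like counts:   ", looks_like_counts(raw))
print("logged looks like counts:", looks_like_counts(logged))
```

If the matrix does not look like counts, the recovery step in the next section is the relevant one.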


#### Preprocessing

You can use `ov.pp.recover_counts` to recover raw counts from data that has been normalized and `log1p`-transformed.

In [None]:
X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata_myeloid.X, 50*1e4, 50*1e5, log_base=None, chunk_size=50000)
adata_myeloid.layers['counts']=X_counts_recovered

In [None]:
adata_myeloid.raw = adata_myeloid
adata_myeloid.X=adata_myeloid.layers['counts']
print(np.min(adata_myeloid.X), np.max(adata_myeloid.X))
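
To make the idea behind count recovery concrete, the sketch below inverts a normalize-then-`log1p` transform on a toy matrix where the per-cell totals are known. This is a simplified illustration under that assumption, not the algorithm used by `ov.pp.recover_counts`, which has to work without the original totals. The names `target_sum` and `totals` and the toy values are illustrative.

```python
import numpy as np

counts = np.array([[0, 3, 1], [2, 0, 5]], dtype=float)  # toy raw counts (cells x genes)
target_sum = 1e4                                        # assumed normalization target

# Forward: library-size normalization followed by log1p
totals = counts.sum(axis=1, keepdims=True)
logged = np.log1p(counts / totals * target_sum)

# Inverse: expm1 undoes log1p, then rescale by the per-cell size factor
recovered = np.rint(np.expm1(logged) * totals / target_sum)

print("counts recovered exactly:", np.array_equal(recovered, counts))
```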

In [None]:
# Select myeloid cells for downstream analysis
adata_myeloid = adata_myeloid[adata_myeloid.obs['Myeloid_subtype'].isin(['C1QC+ Macro','SPP1+ Macro','IL1B+ Macro'])].copy()
adata_myeloid

#### Prepare gene set
The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes.


In [None]:
msigdb = dc.get_resource('MSigDB')
msigdb

#### Build a custom collection of gene sets

In [None]:
# Process Myeloid Immunity Signature
# 1) Load the Excel file
df = pd.read_excel("Dataset/Myeloid.Immunity.Signature.xlsx", header=0, dtype=str)

# 2) Remove empty rows/columns and strip whitespace
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# 3) Reshape from wide to long format
long_df = df.melt(var_name="geneset", value_name="genesymbol")

# 4) Remove NA/empty values and drop duplicates
long_df = long_df.dropna(subset=["genesymbol"])
long_df = long_df[long_df["genesymbol"].str.len() > 0]
Myeloid_Immunity_Signature = (
    long_df.drop_duplicates(subset=["geneset", "genesymbol"])
           .reset_index(drop=True)
)

Myeloid_Immunity_Signature

In [None]:
# Assume net is Myeloid_Immunity_Signature
genes_in_data = set(adata_myeloid.var_names)
genes_in_net  = set(Myeloid_Immunity_Signature['genesymbol'])

# Intersection
present = genes_in_net & genes_in_data
# Missing genes
absent  = genes_in_net - genes_in_data

print(f"✅ Total genes in net: {len(genes_in_net)}")
print(f"✅ Genes found in adata: {len(present)}")
print(f"❌ Genes not found in adata: {len(absent)}")

# Show a subset of missing genes (first 100)
print("Example of missing genes:", list(absent)[:100])
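
The overall overlap can hide individual gene sets that are poorly covered. A minimal sketch of a per-gene-set coverage table follows. The toy `net` and `genes_in_data` stand in for `Myeloid_Immunity_Signature` and `adata_myeloid.var_names`, and the column names `n_genes`, `n_present`, and `coverage` are illustrative.

```python
import pandas as pd

# Toy long-format gene set table with the same two columns as the signature table
net = pd.DataFrame({
    "geneset": ["Hypoxia", "Hypoxia", "Hypoxia", "Angiogenesis", "Angiogenesis"],
    "genesymbol": ["HIF1A", "VEGFA", "GENE_X", "VEGFA", "KDR"],
})
genes_in_data = {"HIF1A", "VEGFA", "KDR"}  # stand-in for the genes in the dataset

# Flag each gene as present or absent, then summarize per gene set
coverage = (
    net.assign(present=net["genesymbol"].isin(genes_in_data))
       .groupby("geneset")["present"]
       .agg(n_genes="count", n_present="sum", coverage="mean")
       .sort_values("coverage")
)
print(coverage)
```

Gene sets near the top of this table are the ones to inspect first, for example for outdated gene symbols. As far as we are aware, `dc.run_ora` also has a `min_n` argument that discards gene sets with too few matching genes, so low coverage can silently remove a signature.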


#### Enrichment with Over Representation Analysis (ORA)
To infer functional enrichment scores, we run Over Representation Analysis (ORA). As input it accepts either an expression matrix (`decoupler.run_ora`) or the results of a differential expression analysis (`decoupler.run_ora_df`). For the former, the top 5% of expressed genes in each sample are selected by default as the set of interest (S); for the latter, a user-defined significance filter can be used. Given S, ORA builds a contingency table using set operations for each gene set in the resource being used (`net`) and applies a one-sided Fisher exact test to assess the significance of the overlap between the sets. The final score is the -log10 of the resulting p-value, so higher values are more significant.
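
To make the contingency-table step concrete, here is a minimal sketch of the test for a single cell and a single gene set, using `scipy.stats.fisher_exact`. The gene identifiers and the background size `n_background` are made up for illustration, and decoupler's own implementation may differ in details, so the numbers are not meant to reproduce `run_ora` output.

```python
import numpy as np
from scipy.stats import fisher_exact

n_background = 20000  # assumed size of the gene universe (illustrative)

# Set of interest S: pretend these are the top 5% expressed genes of one cell
S = {f"g{i}" for i in range(1000)}
# One gene set from the resource: 10 genes inside S, 10 outside
gene_set = {f"g{i}" for i in range(10)} | {f"g{i}" for i in range(5000, 5010)}

# 2x2 contingency table built with set operations
a = len(S & gene_set)          # in S and in the gene set
b = len(gene_set - S)          # in the gene set only
c = len(S - gene_set)          # in S only
d = n_background - a - b - c   # in neither

# One-sided test: is the overlap larger than expected by chance?
_, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
score = -np.log10(p_value)

print(f"overlap={a}, p-value={p_value:.3g}, score={score:.2f}")
```

With 20 genes in the set and 5% of all genes in S, about one overlapping gene is expected by chance, so an overlap of 10 gives a small p-value and a high score.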


We can run ORA with a single function call:

In [None]:
dc.run_ora(
    mat=adata_myeloid,
    net=Myeloid_Immunity_Signature,
    source='geneset',
    target='genesymbol',
    verbose=True
)

The resulting scores, -log10(p-value), and the p-values are stored in `.obsm` under the keys `ora_estimate` and `ora_pvals`, respectively:

In [None]:
adata_myeloid.obsm['ora_estimate']

#### Visualization
To visualize the obtained scores, we can re-use many of scanpy's plotting functions. First, though, we need to extract them from the AnnData object.

In [None]:
acts = dc.get_acts(adata_myeloid, obsm_key='ora_estimate')

# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e

acts
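
Infinite scores can appear when a p-value is reported as exactly 0, because -log10(0) is infinite. The toy example below shows the same capping logic on a small array without modifying it in place. The values are invented for illustration.

```python
import numpy as np

# Toy score matrix (cells x gene sets) containing one infinite score
scores = np.array([[1.2, np.inf], [0.0, 7.5]])

# Largest finite score observed anywhere in the matrix
finite_max = scores[np.isfinite(scores)].max()

# Replace every non-finite entry (inf, -inf, NaN) with that maximum
capped = np.where(np.isfinite(scores), scores, finite_max)
print(capped)
```

As in the cell above, this treats NaN and negative infinity the same way as positive infinity, which is worth keeping in mind if those values can occur in your data.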

`dc.get_acts` returns a new AnnData object which holds the obtained activities in its `.X` attribute, allowing us to re-use many scanpy functions, for example:

Angiogenesis
Antigen Presentation
Cell Cycle and Apoptosis
Cell Migration and Adhesion
Chemokine Signaling
Complement Activation
Cytokine Signaling
Differentiation and Maintenance of Myeloid Cells
ECM remodeling
Fc Receptor Signaling
Growth Factor Signaling
Interferon Signaling
Lymphocyte Activation
Metabolism
Pathogen Response
T-cell Activation and Checkpoint Signaling
TH1 Activation
TH2 Activation
TLR Signaling

In [None]:
sc.pl.violin(acts, keys=['Anti-inflammatory', 'Pro inflammatory', 'M1 Macrophage Polarization', 'M2 Macrophage Polarization',
                         'Type I Interferon response', 'Type II Interferon Response', 'Hypoxia', 'Angiogenesis'],
             groupby='Myeloid_subtype', rotation=90)

In [None]:
# -*- coding: utf-8 -*-
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# ==== User config ====
group_col = "Myeloid_subtype"  # grouping column in acts.obs
signatures = [
    "Anti-inflammatory",
    "Pro inflammatory",
    "M1 Macrophage Polarization",
    "M2 Macrophage Polarization",
    "Type I Interferon response",
    "Type II Interferon Response",
    "Hypoxia",
    "Angiogenesis",
]

# Abbreviations for cleaner x-axis labels
abbr_map = {
    "Anti-inflammatory": "Anti-inflammatory",
    "Pro inflammatory": "Pro-inflammatory",
    "M1 Macrophage Polarization": "M1 Polarization",
    "M2 Macrophage Polarization": "M2 Polarization",
    "Type I Interferon response": "IFN-I Response",
    "Type II Interferon Response": "IFN-II Response",
    "Hypoxia": "Hypoxia",
    "Angiogenesis": "Angiogenesis",
}

outdir = "Results/07.Myeloid/"
os.makedirs(outdir, exist_ok=True)
# =====================

# 1) Pull scores from obsm
ora = acts.obsm["ora_estimate"]
if not isinstance(ora, pd.DataFrame):
    ora = pd.DataFrame(ora, index=acts.obs_names)

# 2) Keep only available signature columns
present = [s for s in signatures if s in ora.columns]
missing = [s for s in signatures if s not in ora.columns]
if missing:
    print("⚠ Missing signatures in obsm['ora_estimate'] (ignored):", missing)
if not present:
    raise ValueError("No target signature columns found in obsm['ora_estimate'].")

# 3) Assemble long-format dataframe
if group_col not in acts.obs.columns:
    raise KeyError(f"Grouping column not found in acts.obs: {group_col}")

score_df = ora[present].copy()
score_df[group_col] = acts.obs[group_col].values
long_df = score_df.melt(id_vars=group_col, var_name="Signature", value_name="Score")

# 4) Labels and order (use abbreviations; keep default group order/levels)
long_df["Signature_lbl"] = long_df["Signature"].map(lambda s: abbr_map.get(s, s))
sig_order = [abbr_map.get(s, s) for s in present]

# Respect existing categorical order if present
if str(acts.obs[group_col].dtype) == "category":
    hue_order = list(acts.obs[group_col].cat.categories)
else:
    hue_order = None  # seaborn default

# 5) Plot
sns.set_theme(context="talk", style="white")  # white background
plt.figure(figsize=(18, 5), facecolor="white")

ax = sns.violinplot(
    data=long_df,
    x="Signature_lbl", y="Score", hue=group_col,
    order=sig_order, hue_order=hue_order, palette='Set3',
    cut=0, scale="width", inner="box", linewidth=0.8, saturation=0.9
)

# Keep a single legend (no title), inside top-left
handles, labels = ax.get_legend_handles_labels()
if hue_order is not None:
    n = len(hue_order)
else:
    n = len(pd.unique(labels))
handles, labels = handles[:n], labels[:n]
plt.legend(handles, labels, title=None, loc="upper left", frameon=False)

# Axes styling
plt.xlabel("")
plt.ylabel("Cell score")
plt.title("")
plt.xticks(rotation=0)
ax.set_facecolor("white")
sns.despine(trim=False)  # keep full spines
ax.tick_params(axis="x", which="both", bottom=True, top=False, length=5)
ax.tick_params(axis="y", which="both", left=True, right=False, length=5)

plt.tight_layout()

# 6) Save before show
pdf_path = os.path.join(outdir, "ORA_violin_Myeloid_Immunity_Signature.pdf")
png_path = os.path.join(outdir, "ORA_violin_Myeloid_Immunity_Signature.png")
plt.savefig(pdf_path, bbox_inches="tight", dpi=300)
plt.savefig(png_path, bbox_inches="tight", dpi=300)
plt.show()
plt.close()

print("✅ Saved:")
print(" -", pdf_path)
print(" -", png_path)
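
A numeric summary can complement the violin plot. The sketch below computes the median score per group and signature from a long-format table with the same columns as `long_df`. The table name `toy_long` and its values are invented for illustration.

```python
import pandas as pd

# Toy long-format table: one row per cell and signature
toy_long = pd.DataFrame({
    "Myeloid_subtype": ["C1QC+ Macro", "C1QC+ Macro", "SPP1+ Macro", "SPP1+ Macro"] * 2,
    "Signature": ["Hypoxia"] * 4 + ["Angiogenesis"] * 4,
    "Score": [0.5, 1.5, 3.0, 5.0, 0.2, 0.4, 2.0, 4.0],
})

# Median score for each subtype (rows) and signature (columns)
summary = toy_long.pivot_table(
    index="Myeloid_subtype", columns="Signature", values="Score", aggfunc="median"
)
print(summary.round(2))
```

The median is used here because ORA scores are often skewed; swapping `aggfunc` for `"mean"` is a one-word change.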



**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib import metadata

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print all installed packages and their versions
# (importlib.metadata replaces the deprecated pkg_resources API)
print("\nInstalled packages and their versions:")
for dist in metadata.distributions():
    print(dist.metadata["Name"], dist.version)