## 10_1. Preprocessing, integration, and clustering of NPC Spatial Data

<div style="text-align: left;">
    <p style="text-align: left;">Updated Time: 2025-04-09</p>
</div>

STAligner is designed for alignment and integration of spatially resolved transcriptomics data.

STAligner first normalizes the expression profiles of all spots and constructs a spatial neighbor network from the spatial coordinates. It then uses a graph attention auto-encoder neural network to extract spatially aware embeddings, and constructs spot triplets from the current embeddings to guide the alignment by attracting similar spots and separating dissimilar spots across slices. A triplet loss updates the spot embeddings so that the distance from each anchor to its positive spot decreases and the distance from the anchor to its negative spot increases. Triplet construction and auto-encoder training are optimized iteratively until batch-corrected embeddings are generated.

STAligner can be applied to integrate ST datasets, aligning them and simultaneously identifying spatial domains, across different biological samples, technological platforms, developmental (embryonic) stages, disease conditions, and consecutive slices of a tissue for 3D slice alignment.

Zhou, X., Dong, K. & Zhang, S. Integrating spatial transcriptomics data across different conditions, technologies and developmental stages. Nat Comput Sci 3, 894–906 (2023). https://doi.org/10.1038/s43588-023-00528-w
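
To make the triplet idea above concrete, here is a minimal NumPy sketch of a standard margin-based triplet loss. It is an illustration only: the function name `triplet_margin_loss`, the margin of 1.0, and the toy embeddings are assumptions, and STAligner's own implementation may differ in its details.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    Hypothetical helper for illustration; the margin value is an assumption.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-to-positive distances
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-to-negative distances
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Two toy triplets in a 2-D embedding space
anchor = np.array([[0.0, 0.0], [1.0, 1.0]])
positive = np.array([[0.0, 1.0], [1.0, 2.0]])
negative = np.array([[3.0, 0.0], [1.0, 1.5]])

loss = triplet_margin_loss(anchor, positive, negative)
print(loss)
```

The first triplet contributes nothing because its negative spot is already farther from the anchor than the positive spot by more than the margin. The second triplet is penalized because its negative spot is closer to the anchor than its positive spot.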

#### Loading packages

In [None]:
import os
import numpy as np
import omicverse as ov
import scanpy as sc
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from scipy.sparse import csr_matrix
from matplotlib.patches import Patch

ov.utils.ov_plot_set()

import warnings
warnings.simplefilter("ignore") 

##### Set working directory for analysis

In [None]:
cwd = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(cwd)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

# Preprocess data

The original STAligner tutorial uses mouse olfactory bulb data generated by Stereo-seq and Slide-seqV2. The processed Stereo-seq and Slide-seqV2 data can be downloaded from https://drive.google.com/drive/folders/1Omte1adVFzyRDw7VloOAQYwtv_NjdWcG?usp=share_link, and the original tutorials can be found at https://staligner.readthedocs.io/en/latest. In this notebook, the same workflow is applied to the NPC 10x Visium samples stored in `Dataset/GSE206245`.

One point is critical: in this workflow, highly variable genes are calculated for each sample before the AnnData objects are concatenated, and concatenation keeps only the genes shared across samples. The number of highly variable genes should therefore not be set too low. Otherwise, with a large number of samples, too few features remain for STAligner training, which degrades the model's performance.
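
The effect can be illustrated with a toy simulation in plain Python. This is a deliberately pessimistic sketch, not a model of real data: each sample draws its highly variable genes at random from a hypothetical pool of 20,000 genes, whereas real samples share far more of their variable genes. The helper name `shared_hvgs` and all numbers are assumptions chosen for illustration.

```python
import random

random.seed(0)
all_genes = [f"gene{i}" for i in range(20000)]  # hypothetical gene pool

def shared_hvgs(n_samples, n_top_genes):
    """Number of genes present in every sample's (randomly drawn) HVG set."""
    hvg_sets = [set(random.sample(all_genes, n_top_genes)) for _ in range(n_samples)]
    return len(set.intersection(*hvg_sets))

few_samples = shared_hvgs(n_samples=2, n_top_genes=13000)
many_samples = shared_hvgs(n_samples=12, n_top_genes=13000)
low_hvg = shared_hvgs(n_samples=12, n_top_genes=5000)

print("2 samples, 13000 HVGs each:", few_samples)
print("12 samples, 13000 HVGs each:", many_samples)
print("12 samples, 5000 HVGs each:", low_hvg)
```

The shared gene set shrinks as samples are added and shrinks faster when fewer genes are selected per sample, which is why a generous `n_top_genes` is used before concatenation.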

When using STAligner, adjust the **rad_cutoff** parameter for each dataset so that each spot is connected to **5-10 adjacent spots on average**. The spatial-network step reports this value in a message such as: "11.3356 neighbors per cell on average."
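
One way to get a feel for a suitable radius before building the network is to count neighbors directly from the coordinates. The sketch below does this with NumPy on a hypothetical square grid with a spacing of 100 units. The helper `mean_neighbors` is not part of omicverse or STAligner, and real Visium spots follow a hexagonal layout, so the numbers are illustrative only.

```python
import numpy as np

def mean_neighbors(coords, rad_cutoff):
    """Average number of other spots within rad_cutoff of each spot."""
    coords = np.asarray(coords, dtype=float)
    # Dense pairwise distances: fine for a toy grid, too large for full slides
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    within = (dist <= rad_cutoff) & (dist > 0)  # dist > 0 excludes the spot itself
    return within.sum(axis=1).mean()

# Hypothetical 10 x 10 grid of spots, 100 units apart
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xs.ravel(), ys.ravel()]) * 100.0

for rad_cutoff in (100, 150, 250):
    print(rad_cutoff, mean_neighbors(coords, rad_cutoff))
```

On this toy grid, a radius of 150 lands inside the recommended range of 5-10 neighbors, while a radius of 100 falls below it.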


#### Preprocess NPC Spatial Data

In [None]:
# **Root directory for 10X Visium data**
pathway = "Dataset/GSE206245"

# **Batch_list to store all AnnData objects**
Batch_list = []
adj_list = []

sample_ids = ['NPC_ST05', 'NPC_ST06', 'NPC_ST07', 'NPC_ST08', 'NPC_ST09', 'NPC_ST10', 'NPC_ST11', 'NPC_ST12', 'NPC_ST16', 'NPC_ST17', 'NPC_ST18', 'NPC_ST19']
print(sample_ids)

# **Iterate over all sample directories**
for sample_id in sample_ids:
    print(f"\n🟢 Processing sample: {sample_id}")

    try:
        # **Read 10X Visium data**
        sample_path = os.path.join(pathway,sample_id)
        adata = sc.read_visium(sample_path, library_id=sample_id)

        # Ensure adata.X is stored as a sparse CSR matrix
        if not isinstance(adata.X, csr_matrix):
            adata.X = csr_matrix(adata.X)

        # add batch name
        adata.obs["sample_id"] = sample_id
              
        # make var name unique
        adata.var_names_make_unique(join="++")

        # make spot name unique
        adata.obs_names = [x+'_'+sample_id for x in adata.obs_names]

        # Spot filtering
        sc.pp.calculate_qc_metrics(adata, inplace=True)
        adata = adata[adata.obs["total_counts"] > 500, :] # Remove spots with low UMI counts
        adata = adata[adata.obs["total_counts"] < 60000, :] # Remove spots with extremely high UMI counts 
        adata = adata[adata.obs["n_genes_by_counts"] > 100, :] # Remove spots with a low number of detected genes
        adata = adata[adata.obs["n_genes_by_counts"] < 10000, :] # Remove spots with an extremely high number of detected genes
        adata = adata[adata.obs["in_tissue"] == 1, :] # Retain only spots within the tissue (specific to Visium)
        
        # save adata.raw
        adata.raw = adata.copy()

        # Constructing the spatial network
        ov.space.Cal_Spatial_Net(adata, rad_cutoff=200) # rad_cutoff should be tuned per dataset; the spatial network is saved in adata.uns['adj']

        # Normalization
        sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=13000)
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        adata = adata[:, adata.var['highly_variable']]

        # **Store in Batch_list**
        adj_list.append(adata.uns['adj'])
        Batch_list.append(adata)

        print(f"✅ Successfully loaded: {sample_id} → Dimensions: {adata.shape}")

    except Exception as e:
        print(f"❌ Failed to load: {sample_id}, Error: {e}")

In [None]:
adata_concat = ad.concat(Batch_list, label="library_id", keys=sample_ids, uns_merge="unique")
adata_concat.obs["batch"] = adata_concat.obs["sample_id"].astype('category')
adata_concat

In [None]:
print(adata_concat.X.shape)
print(np.min(adata_concat.X), np.max(adata_concat.X))

# Training STAligner model

Here, we use `ov.space.pySTAligner` to construct a STAligner object and train the model.

To reduce GPU memory usage, we use the `train_STAligner_subgraph` function from STAligner, which treats each slice as a subgraph during training.
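
The `iter_comb` argument used in the training cell lists the slice pairs to align, where a pair such as `(0, 1)` means that slice 0 is aligned with slice 1 as the reference. A small sketch of the chained layout used in this notebook, with hypothetical slice names:

```python
# Hypothetical slice names, used only to show how the pairs are built
slices = ["slice_A", "slice_B", "slice_C", "slice_D"]

# Chain each slice to the next one: (0, 1), (1, 2), (2, 3)
iter_comb = [(i, i + 1) for i in range(len(slices) - 1)]

for target, reference in iter_comb:
    print(f"{slices[target]} is aligned with {slices[reference]} as reference")
```

With N slices this layout yields N - 1 pairs, so every slice is linked to the others through its neighbors in the list.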

In [None]:
import random
import numpy as np
import torch

# Set random seed for Python's built-in random module
random.seed(666)

# Set random seed for NumPy
np.random.seed(666)

# Set random seed for PyTorch (both CPU and GPU)
torch.manual_seed(666)
torch.cuda.manual_seed_all(666)

# Ensure deterministic behavior for CUDA operations (may reduce performance)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
%%time
# iter_comb is used to specify the order of integration. For example, (0, 1) means slice 0 will be aligned with slice 1 as reference.
iter_comb = [(i, i + 1) for i in range(len(sample_ids) - 1)]

# Here, to reduce GPU memory usage, each slice is considered as a subgraph for training.
STAligner_obj = ov.space.pySTAligner(adata_concat, verbose=True, knn_neigh = 100, n_epochs = 600, iter_comb = iter_comb,
                                     batch_key = 'batch',  key_added='STAligner', Batch_list = Batch_list)

In [None]:
STAligner_obj.train()

The latent embedding is stored in `adata.obsm['STAligner']`.

In [None]:
adata = STAligner_obj.predicted()
adata

In [None]:
print(adata.X.shape)
print(np.min(adata.X), np.max(adata.X))

# Clustering the space

We can use `GMM`, `leiden` or `louvain` to cluster the space.

For example, with GMM:  
`ov.utils.cluster(adata,use_rep='STAligner',method='GMM',n_components=7,covariance_type='full', tol=1e-9, max_iter=1000, random_state=3607)`

or, with Leiden:  
`sc.pp.neighbors(adata, use_rep='STAligner', random_state=666)`  
`ov.utils.cluster(adata,use_rep='STAligner',method='leiden',resolution=1)`

In [None]:
sc.pp.neighbors(adata, use_rep='STAligner', random_state=666)
sc.tl.umap(adata, random_state=666)  # UMAP must run on the object that holds the neighbor graph

In [None]:
# Run leiden clustering for different resolutions
for resolution in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]:
    ov.pp.leiden(
        adata,
        resolution=resolution,
        key_added=f"leiden_{str(resolution).replace('.', '_')}",
    )

#### Plot the clustree

In [None]:
from pyclustree import clustree

In [None]:
# Plot the clustree
fig = clustree(
    adata,
    [f"leiden_{str(resolution).replace('.', '_')}" for resolution in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]],
    title="Clustree of Spatial Niches",
    edge_weight_threshold=0.00,  # the minimum fraction of the parent cluster assigned to the child cluster to plot
    show_fraction=True,  # show the fraction of cells in each cluster
)
fig.set_size_inches(10, 8)
fig.set_dpi(100)

#### Adding cluster scoring
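
For reference, the silhouette score compares, for each spot, the mean distance to its own cluster with the mean distance to the nearest other cluster. The NumPy sketch below shows that computation on a toy embedding. The helper `mean_silhouette` and the simulated data are assumptions for illustration, not the implementation used by `pyclustree`.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score; assumes at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy embedding: two well-separated groups of 30 points each
rng = np.random.default_rng(0)
embedding = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])

coarse = np.repeat([0, 1], 30)       # matches the true structure
fine = np.repeat([0, 1, 2, 3], 15)   # over-splits each group in two

score_coarse = mean_silhouette(embedding, coarse)
score_fine = mean_silhouette(embedding, fine)
print(score_coarse, score_fine)
```

Over-splitting lowers the score because the extra clusters are no better separated from each other than points within a cluster, which is the kind of signal used to compare resolutions.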

In [None]:
# Expose the STAligner embedding under 'X_pca' so that score_basis="pca" scores the clusters on it
adata.obsm['X_pca']=adata.obsm['STAligner']

In [None]:
# Supported scores: silhouette, Calinski-Harabasz, and Davies-Bouldin.
fig = clustree(
    adata,
    [f"leiden_{str(resolution).replace('.', '_')}" for resolution in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]],
    title="Clustree of Spatial Niches with Silhouette Score",
    score_clustering="silhouette",
    score_basis="pca",
)
fig.set_size_inches(10, 8)
fig.set_dpi(100)

# Save and Show plot
plt.tight_layout()
plt.savefig("Results/10.NPC_ST_Analysis/Silhouette_Score_Spatial_scNiche.pdf", bbox_inches='tight')
plt.show()

In [None]:
# Supported scores: silhouette, Calinski-Harabasz, and Davies-Bouldin.
fig = clustree(
    adata,
    [f"leiden_{str(resolution).replace('.', '_')}" for resolution in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]],
    title="Clustree of Spatial Niches with Calinski Harabasz Score",
    score_clustering="calinski_harabasz",
    score_basis="pca",
)
fig.set_size_inches(10, 8)
fig.set_dpi(100)

# Save and Show plot
plt.tight_layout()
plt.savefig("Results/10.NPC_ST_Analysis/Calinski_Harabasz_Score_Spatial_scNiche.pdf", bbox_inches='tight')
plt.show()

In [None]:
# Supported scores: silhouette, Calinski-Harabasz, and Davies-Bouldin.

fig = clustree(
    adata,
    [f"leiden_{str(resolution).replace('.', '_')}" for resolution in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]],
    title="Clustree of Spatial Niches with Davies Bouldin Score",
    score_clustering="davies_bouldin",
    score_basis="pca",
)
fig.set_size_inches(10, 8)
fig.set_dpi(100)

# Save and Show plot
plt.tight_layout()
plt.savefig("Results/10.NPC_ST_Analysis/Davies_Bouldin_Score_Spatial_scNiche.pdf", bbox_inches='tight')
plt.show()

Based on the cluster scoring, a resolution of 0.2 appears to be optimal. Here we visualize this clustering using the UMAP representation:

In [None]:
original_clusters = sorted(adata_concat.obs['leiden_0_2'].unique(), key=lambda x: int(x))
new_labels = [f"Niche{i+1}" for i in range(len(original_clusters))]
label_map = dict(zip(original_clusters, new_labels))
adata_concat.obs['scNiche'] = adata_concat.obs['leiden_0_2'].map(label_map)

In [None]:
from matplotlib import patheffects
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4,4))
ov.pl.embedding(adata_concat,
                  basis='X_umap',
                  color=['scNiche'], 
                  palette='Paired',
                  show=False, legend_loc=None, add_outline=False, 
                  frameon='small',legend_fontoutline=2,ax=ax
                 )

ov.utils.gen_mpl_labels(
    adata_concat,
    'scNiche',
    exclude=("None",),  
    basis='X_umap',
    ax=ax,
    adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
    text_kwargs=dict(fontsize= 9,weight='bold',
                     path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)

plt.savefig("Results/10.NPC_ST_Analysis/Umap_Spatial_scNiche.pdf", format='pdf')
plt.show()


In [None]:
from matplotlib import patheffects
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4,4))
ov.pl.embedding(adata_concat,
                  basis='X_umap',
                  color=['batch'], 
                  palette='Paired',
                  show=False, 
                  legend_fontoutline=2,
                  title='',
                  ax=ax
                 )

plt.savefig("Results/10.NPC_ST_Analysis/Umap_Batch_Spatial.pdf", format='pdf')
plt.show()

In [None]:
fig,ax=plt.subplots(figsize = (6,4))
ov.pl.cellproportion(adata=adata_concat, celltype_clusters='scNiche', groupby='batch', legend=True, ax=ax)
# Save and Show plot
plt.tight_layout()
plt.savefig("Results/10.NPC_ST_Analysis/Composition_Spatial_scNiche.pdf", bbox_inches='tight')
plt.show()

We can also map the clustering results back to the original spatial coordinates to obtain spatially specific clustering results.

In [None]:
# Get all batch names
sample_ids = adata_concat.obs['sample_id'].unique()

# Define grid layout for subplots
n_cols = 4
n_rows = -(-len(sample_ids) // n_cols)  # Ceiling division

# Create figure and subplots
fig, axes = plt.subplots(
    n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows),
    gridspec_kw={'wspace': 0.1, 'hspace': 0.1}
)
axes = axes.flatten()

title_size = 12

# Extract global Niche categories and corresponding colors
# Sort Niche labels by numeric order
Niche_labels = sorted(adata_concat.obs['scNiche'].unique(), key=lambda x: int(x.replace("Niche", "")))
Niche_colors = adata_concat.uns['scNiche_colors']
legend_data = dict(zip(Niche_labels, Niche_colors))

# Plot spatial clusters for each sample_id
for i, sample_id in enumerate(sample_ids):
    adata_subset = adata_concat[adata_concat.obs['sample_id'] == sample_id]

    # Plot spatial cluster map
    sc.pl.embedding(
        adata_subset,
        color='scNiche',
        title=sample_id,
        basis="spatial",
        alpha=0.75,
        legend_fontsize=12,
        legend_loc=None,  # Disable individual subplot legends
        show=False,
        ax=axes[i]
    )

    # Set title, invert Y axis, and hide axis ticks
    ax = axes[i]
    ax.set_title(sample_id, size=title_size)
    ax.invert_yaxis()
    ax.set_xticks([])  # Hide x-axis ticks
    ax.set_yticks([])  # Hide y-axis ticks
    ax.set_xlabel("")
    ax.set_ylabel("")


# Remove empty subplots (if any)
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Rebuild unified legend (ensure sorted)
legend_data = dict(zip(Niche_labels, Niche_colors))
handles = [Patch(facecolor=color, label=label) for label, color in legend_data.items()]

fig.legend(
    handles=handles,
    loc='center left',
    bbox_to_anchor=(0.98, 0.5),
    fontsize=14,
    title='scNiche',
    frameon=False,
    ncol=1
)

# Adjust layout to make space for the right legend
plt.tight_layout()

# Save and Show plot
plt.savefig("Results/10.NPC_ST_Analysis/Combined_Spatial_scNiche.pdf", bbox_inches='tight')
plt.show()

#### Save Spatial AnnData object with clustering

In [None]:
adata_concat

In [None]:
adata_concat = adata_concat.raw.to_adata()
print(adata_concat.X.shape)
print(np.min(adata_concat.X), np.max(adata_concat.X))

In [None]:
adata_concat

In [None]:
adata_concat.obs.to_csv("Processed Data/scNiche_metadata.csv", index=True)
adata_concat.write('Processed Data/GSE206245_NPC_ST_Cluster.h5ad',compression='gzip')


**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
import pkg_resources

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print Session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print installed packages and their versions (the working set lists every installed distribution, not only the imported ones)
print("\nInstalled packages and their versions:")
for package in pkg_resources.working_set:
    print(package.key, package.version)