## 11_2. Mapping single-cell profile onto spatial profile

<div style="text-align: left;">
    <p style="text-align: left;">Updated Time: 2025-04-09</p>
</div>


Tangram is a method for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. Tangram takes as input a single-cell dataset and a spatial dataset collected from the same anatomical region/tissue type. Via integration, Tangram creates new spatial data by aligning the scRNA-seq profiles in space. This makes it possible to project any annotation from the scRNA-seq data (e.g. cell types, program usage) onto space.

The most common application of Tangram is to resolve cell types in space. Another use is to correct gene expression from spatial data: as scRNA-seq data are less prone to dropout than (e.g.) Visium or Slide-seq, the “new” spatial data generated by Tangram resolve many more genes. As a result, we can visualize program usage in space, which can be used for ligand-receptor pair discovery or, more generally, for studying cell-cell communication mechanisms. If cell segmentation is available, Tangram can also be used for deconvolution of spatial data. If your single-cell data are multimodal, Tangram can be used to spatially resolve other modalities, such as chromatin accessibility.
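
The projection idea can be illustrated with a small NumPy sketch. It assumes a mapping matrix `M` of shape cells x spots whose rows are probability distributions over spots; Tangram learns such a matrix, whereas here it is filled with random values, and all names and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_spots, n_genes = 6, 4, 5

# Toy single-cell expression matrix (cells x genes).
S = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)

# Hypothetical mapping matrix (cells x spots). Each row sums to 1, i.e. it is
# a probability distribution over spots for one cell. Random here, not learned.
logits = rng.normal(size=(n_cells, n_spots))
M = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Projected gene expression in space (spots x genes).
G_pred = M.T @ S

# Any per-cell annotation can be projected the same way, e.g. one-hot
# cell-type labels (cells x types) become per-spot cell-type weights.
cell_types = np.array([0, 0, 1, 1, 2, 2])
onehot = np.eye(3)[cell_types]
type_per_spot = M.T @ onehot  # spots x types

print(G_pred.shape, type_per_spot.shape)
```

Because every row of `M` sums to 1, the projection conserves totals: the summed expression of each gene and the total number of cells are the same before and after mapping.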

Biancalani, T., Scalia, G., Buffoni, L. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods 18, 1352–1362 (2021). https://doi.org/10.1038/s41592-021-01264-7

##### Load libraries

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import infercnvpy as cnv
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import omicverse as ov
ov.plot_set()

import warnings
warnings.simplefilter("ignore")

##### Set working directory for analysis

In [None]:
cwd = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(cwd)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

from pathlib import Path
saving_dir = Path('Results/11.TCell')
saving_dir.mkdir(parents=True, exist_ok=True)

### Prepared scRNA-seq

Published scRNA-seq datasets of lymph nodes have typically lacked an adequate representation of germinal centre-associated immune cell populations due to the age of patient donors. We therefore include scRNA-seq datasets spanning lymph nodes, spleen and tonsils in our single-cell reference to ensure that we capture the full diversity of immune cell states likely to exist in the spatial transcriptomic dataset.

Here we download this dataset, import it into AnnData and change the variable names to ENSEMBL gene identifiers.

Link: https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad

#### Load & Prep Data

In [None]:
adata = sc.read_h5ad("Processed Data/scRNA_Annotation.h5ad")
adata

In [None]:
for i in adata.obs['Cell_type'].cat.categories:
  number = len(adata.obs[adata.obs['Cell_type']==i])
  print('the number of category {} is {}'.format(i,number))

In [None]:
adata_Epi = sc.read_h5ad("Processed Data/scRNA_Epi_CNV.h5ad")
adata_Epi

In [None]:
adata.obs['Cell_type'] = adata.obs['Cell_type'].astype(str) 
adata.obs.loc[adata_Epi.obs.index, 'Cell_type'] = adata_Epi.obs['cnv_status']


In [None]:
adata_myeloid = sc.read_h5ad("Processed Data/scRNA_Myeloid.h5ad")
adata_myeloid

In [None]:
adata.obs['Cell_type'] = adata.obs['Cell_type'].astype(str)
adata.obs.loc[adata_myeloid.obs.index, 'Cell_type'] = adata_myeloid.obs['Myeloid_subtype']

In [None]:
adata_TCell = sc.read_h5ad("Processed Data/scRNA_TCell.h5ad")
adata_TCell

In [None]:
adata.obs['Cell_type'] = adata.obs['Cell_type'].astype(str)
adata.obs.loc[adata_TCell.obs.index, 'Cell_type'] = adata_TCell.obs['T_subtype']

In [None]:
# Drop cells that still carry the coarse 'T' label (no refined T-cell subtype was assigned)
adata = adata[adata.obs['Cell_type'] != 'T'].copy()
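
The label-refinement pattern used in the cells above can be sketched on a toy table. The barcodes and labels below are hypothetical; the sketch only shows why the column is cast to `str` before assignment and how index alignment overwrites the coarse labels.

```python
import pandas as pd

# Toy coarse annotation for six cells (hypothetical barcodes and labels).
obs = pd.DataFrame(
    {"Cell_type": pd.Categorical(["T", "T", "T", "B", "Epithelial", "Epithelial"])},
    index=[f"cell{i}" for i in range(6)],
)

# Refined labels exist only for a subset of the cells.
t_sub = pd.Series(["CD4 Tcm", "CD8 Tex"], index=["cell0", "cell2"])
epi_sub = pd.Series(["Tumor", "Normal"], index=["cell4", "cell5"])

# Cast to str first: writing a value that is not an existing category
# into a categorical column raises an error.
obs["Cell_type"] = obs["Cell_type"].astype(str)

# Assignment through .loc aligns on the index, so only matching cells change.
obs.loc[t_sub.index, "Cell_type"] = t_sub
obs.loc[epi_sub.index, "Cell_type"] = epi_sub

# cell1 kept the coarse label "T", so it is removed.
obs = obs[obs["Cell_type"] != "T"].copy()
print(obs["Cell_type"].tolist())
```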

In [None]:
adata.obs['Cell_type'] = adata.obs['Cell_type'].astype('category')
adata.obs['Cell_type'] = adata.obs['Cell_type'].cat.reorder_categories([
    'Normal', 'Tumor', 'Fibroblasts', 
    'CD8⁺ GZMB⁺ early Tem','CD8⁺ GZMB⁺ Tem', 'CD8⁺ GZMK⁺ Tpex', 'CD8⁺ GZMB⁺ Tex', 'CD8⁺ activated-stress Tem', 'CD8⁺ ZNF683⁺ Trm', 'CD8⁺ ISG⁺ T',
    'CD4⁺ Tcm', 'CD4⁺ ISG⁺ T', 'CD4⁺ IL21⁺ Tfh', 'TNFRSF9⁺ Treg', 'TNFRSF9⁻ Treg', 'CD4⁻CD8⁻ T', 'Proliferating T',
    'NK', 'B', 'Plasma',
    'C1QC+ Macro','SPP1+ Macro','IL1B+ Macro','CD14+ Mono','CD16+ Mono',
    'Mast','Neutrophil','cDC','pDC','IgM+ plasma-like'
])

In [None]:
for i in adata.obs['Cell_type'].cat.categories:
  number = len(adata.obs[adata.obs['Cell_type']==i])
  print('the number of category {} is {}'.format(i,number))

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
ov.utils.embedding(
    adata,
    basis="X_umap",
    color=['Cell_type'],
    title='Cell_type',
    frameon='small',
    #ncols=1,
    wspace=0.65,
    #palette=ov.utils.pyomic_palette()[11:],
    show=False,
    ax=ax
)

In [None]:
def subsample_per_class(
    adata,
    label_col: str = "Cell_type",
    per_class_max: int = 2000,
    seed: int = 0,
    include_na: bool = False,
    verbose: bool = True,
):
    """
    Stratified random subsampling by class.

    For each category in `adata.obs[label_col]`, keep at most `per_class_max` cells.
    If the category size <= per_class_max, keep all cells from that category.

    Parameters
    ----------
    adata : anndata.AnnData
        Input AnnData object.
    label_col : str
        Column in `adata.obs` containing class labels.
    per_class_max : int
        Maximum number of cells to keep per class.
    seed : int
        Random seed for reproducibility.
    include_na : bool
        If True, treat NA/None as a separate class and subsample it as well.
        If False, drop NA rows before grouping.
    verbose : bool
        If True, print a summary of the subsample.

    Returns
    -------
    anndata.AnnData
        Subsampled AnnData with original row order preserved.
    """
    if label_col not in adata.obs.columns:
        raise KeyError(f"Column {label_col!r} not found in adata.obs.")

    rng = np.random.default_rng(seed)

    # Prepare labels (optionally keep NA as a separate class)
    labels = adata.obs[label_col]
    if include_na:
        lab = labels.copy()
        if lab.dtype.name == "category" and "__NA__" not in lab.cat.categories:
            lab = lab.cat.add_categories(["__NA__"])
        lab = lab.fillna("__NA__")
        idx_base = adata.obs.index
    else:
        # Drop NA rows from grouping; keep their indices out of selection
        lab = labels.dropna()
        idx_base = lab.index

    # Group indices by label and sample up to per_class_max from each group
    keep_indices = []
    for cat, g in adata.obs.loc[idx_base].groupby(lab, observed=True):
        idx = g.index.to_numpy()
        k = min(len(idx), per_class_max)
        if len(idx) > k:
            chosen = rng.choice(idx, size=k, replace=False)
        else:
            chosen = idx
        keep_indices.append(chosen)

    # Concatenate and preserve original order on subset
    keep_indices = np.concatenate(keep_indices) if keep_indices else np.array([], dtype=object)
    mask = adata.obs.index.isin(keep_indices)
    adata_sub = adata[mask].copy()

    if verbose:
        print(f"Subsampled: {adata.shape} -> {adata_sub.shape}")
        print(adata_sub.obs[label_col].value_counts().sort_index())

    return adata_sub

# Example usage:
adata = subsample_per_class(adata, label_col="Cell_type", per_class_max=3000, seed=123)


In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
ov.utils.embedding(
    adata,
    basis="X_umap",
    color=['Cell_type'],
    title='Cell_type',
    frameon='small',
    #ncols=1,
    wspace=0.65,
    #palette=ov.utils.pyomic_palette()[11:],
    show=False,
    ax=ax
)


#### Preprocessing scRNA-seq

You can use `recover_counts` to recover the raw counts from data that have already been normalized and log1p-transformed.
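
To see why counts can be recovered at all, here is a minimal sketch of the shift-log transform and its inverse on toy data. It assumes the per-cell library sizes are known, which is usually not the case for an already normalized object, and it does not reproduce how `ov.pp.recover_counts` works internally. The names `counts`, `lib_size` and `target_sum` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw count matrix (cells x genes); the +1 keeps every library size > 0.
counts = rng.poisson(3.0, size=(4, 6)).astype(float)
counts[:, 0] += 1
lib_size = counts.sum(axis=1, keepdims=True)
target_sum = 1e4

# Forward transform: library-size normalization followed by log1p.
X = np.log1p(counts / lib_size * target_sum)

# Inverse transform: undo log1p, rescale by the library size,
# then round away floating-point noise.
recovered = np.rint(np.expm1(X) * lib_size / target_sum)

print(np.array_equal(recovered, counts))
```

When the library sizes are not stored, they have to be estimated from the normalized values, which is the harder part of the problem.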

In [None]:
adata_sc = adata.copy()
print("RAW",adata_sc.X.max())

In [None]:
X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata_sc.X, 50*1e4, 50*1e5, log_base=None, chunk_size=50000)
adata_sc.layers['counts']=X_counts_recovered

In [None]:
adata_sc.raw = adata_sc
adata_sc.X=adata_sc.layers['counts']
print(np.min(adata_sc.X), np.max(adata_sc.X))

In [None]:
adata.write_h5ad("Processed Data/scRNA_Annotation_Refined_3000_123.h5ad")

For quality control and preprocessing, we can use omicverse's built-in preprocessing functions.

In [None]:
adata_sc=ov.pp.preprocess(adata_sc,mode='shiftlog|pearson',n_HVGs=2000,target_sum=1e4)
adata_sc.raw = adata_sc
adata_sc = adata_sc[:, adata_sc.var.highly_variable_features]
print("Normalize",adata_sc.X.max())

### Prepared stRNA-seq

First let’s read spatial Visium data from 10X Space Ranger output. Here we use lymph node data generated by 10X and presented in [Kleshchevnikov et al (section 4, Fig 4)](https://www.biorxiv.org/content/10.1101/2020.11.15.378125v1). This dataset can be conveniently downloaded and imported using scanpy. See [this tutorial](https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_short_demo.html) for a more extensive and practical example of data loading (multiple visium samples).

In [None]:
adata = sc.read_h5ad("Processed Data/GSE206245_NPC_ST_Cluster.h5ad")
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))

In [None]:
for i in adata.obs['sample_id'].cat.categories:
  number = len(adata.obs[adata.obs['sample_id']==i])
  print('the number of category {} is {}'.format(i,number))

We used the same pre-processing steps as for scRNA-seq

<div class="admonition warning">
  <p class="admonition-title">Note</p>
  <p>
    In omicverse versions greater than `1.6.0`, we introduced `prost`, a module for computing spatially variable genes (SVGs), to replace scanpy's HVGs. If you want to use scanpy's HVGs instead, you can set mode=`scanpy` in `ov.space.svg` or use the following code.
  </p>
</div>

```python
#adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
#adata.raw = adata
#adata = adata[:, adata.var.highly_variable_features]
```

In [None]:
# sc.pp.calculate_qc_metrics(adata, inplace=True)
# adata = adata[:,adata.var['total_counts']>100]
# Densify X only if it is stored as a sparse matrix
if hasattr(adata.X, "toarray"):
    adata.X = adata.X.toarray()
adata.raw = adata
adata=ov.space.svg(adata,mode='prost',n_svgs=2000,target_sum=1e4,platform="visium",)
adata = adata[:, adata.var.space_variable_features]
adata_sp=adata.copy()
adata_sp

## Tangram model

Tangram is a Python package, written in PyTorch and based on scanpy, for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. The single-cell dataset and the spatial dataset should be collected from the same anatomical region/tissue type, ideally from a biological replicate, and need to share a set of genes. 

We can use `omicverse.space.Tangram` to apply the Tangram model.

In [None]:
tg=ov.space.Tangram(adata_sc,adata_sp,clusters='Cell_type')

The function maps iteratively for the number of epochs given by `num_epochs`. We typically interrupt mapping once the score plateaus.
- The score measures the similarity between the gene expression of the mapped cells and the spatial data on the training genes.
- The default mapping mode is `mode='cells'`, which is best run on a GPU.
- Alternatively, one can specify `mode='clusters'`, which averages the single cells belonging to the same cluster (pass annotations via `cluster_label`). This is faster, and is our choice when the scRNA-seq and spatial data come from different specimens.
- To run Tangram on a GPU, set `device='cuda:0'`; otherwise set `device='cpu'`.
- `density_prior` specifies the cell density within each spatial voxel. Use `uniform` if the spatial voxels are at single-cell resolution (e.g. MERFISH). The default value, `rna_count_based`, assumes that cell density is proportional to the number of RNA molecules.
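
The score and the two density priors described above can be sketched in NumPy. This is a simplified illustration, assuming the score is the mean per-gene cosine similarity between the projected and the measured spatial expression; the full Tangram objective contains further terms, and the mapping matrix `M` here is random rather than trained.

```python
import numpy as np

def cosine_per_gene(G_pred, G):
    """Cosine similarity between predicted and measured expression, per gene."""
    num = (G_pred * G).sum(axis=0)
    den = np.linalg.norm(G_pred, axis=0) * np.linalg.norm(G, axis=0)
    return num / den

rng = np.random.default_rng(1)
n_cells, n_spots, n_genes = 8, 5, 4

# Toy expression matrices; the +1 avoids all-zero genes.
S = rng.poisson(3.0, size=(n_cells, n_genes)) + 1.0  # cells x genes
G = rng.poisson(3.0, size=(n_spots, n_genes)) + 1.0  # spots x genes

# Random stand-in for the learned mapping matrix (cells x spots, rows sum to 1).
logits = rng.normal(size=(n_cells, n_spots))
M = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Project the single cells into space and score the agreement.
G_pred = M.T @ S
score = cosine_per_gene(G_pred, G).mean()

# Density priors over spots: proportional to RNA counts, or uniform.
rna_count_based = G.sum(axis=1) / G.sum()
uniform = np.full(n_spots, 1.0 / n_spots)

print(round(float(score), 3), rna_count_based.round(3), uniform)
```

During training, the mapping is adjusted so that a score of this kind increases, which is why a plateau is a natural stopping point.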

In [None]:
tg.train(mode="cells",num_epochs=500,device="cpu")

We can use `tg.cell2location()` to get the cell location in spatial spots.

In [None]:
adata_plot=tg.cell2location()
adata_plot.obs.columns

In [None]:
adata_plot.obs.head()
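
A common follow-up is to turn per-spot cell-type scores into proportions and a dominant label. The sketch below uses a small hypothetical table; it assumes the real predictions are available as a spots x cell-types `DataFrame` (for example in `adata_plot.obsm['tangram_ct_pred']`), which should be checked on your own object.

```python
import pandas as pd

# Hypothetical spot x cell-type score table standing in for the real predictions.
pred = pd.DataFrame(
    [[4.0, 1.0, 0.0],
     [0.5, 0.5, 3.0],
     [1.0, 2.0, 1.0]],
    index=["spot1", "spot2", "spot3"],
    columns=["Tumor", "B", "Fibroblasts"],
)

# Row-normalize so that the scores of each spot sum to 1.
prop = pred.div(pred.sum(axis=1), axis=0)

# Dominant cell type per spot.
dominant = prop.idxmax(axis=1)
print(dominant.to_dict())
```

The dominant label discards information about mixed spots, so the proportions are usually the better quantity to plot for Visium-scale data.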

#### Save Spatial AnnData Object with Tangram Mapping

In [None]:
adata_plot

In [None]:
adata_plot = adata_plot.raw.to_adata()
adata_plot

In [None]:
print(adata_plot.X.shape)
print(np.min(adata_plot.X), np.max(adata_plot.X))

In [None]:
adata_plot.write('Processed Data/GSE206245_NPC_ST_Cluster_Tangram_Refined_3000_123.h5ad',compression='gzip')
adata_plot.obsm['tangram_ct_pred'].to_csv("/media/bio/Disk/Research Data/EBV/omicverse/Processed Data/tangram_ct_pred_Refined_3000_123.csv")
adata_plot.obs[['sample_id', 'scNiche']].to_csv("/media/bio/Disk/Research Data/EBV/omicverse/Processed Data/GSE206245_NPC_ST_scNiche_Refined_3000_123.csv")


**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib import metadata  # standard library; pkg_resources is deprecated

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print every installed package and its version
# (all distributions in the environment, not only the imported ones)
print("\nInstalled packages and their versions:")
for dist in metadata.distributions():
    print(dist.metadata["Name"], dist.version)