Step 2. Cluster cell neighborhoods into approximately independent blocks

Huwenbo Shi edited this page Jun 11, 2026 · 2 revisions

Overview

scEPS uses a statistical block bootstrap to estimate standard errors for scEPS statistics aggregated across a group of cell neighborhoods. Here, the blocks represent approximately independent groups of cell neighborhoods, and are obtained by applying mini-batch k-means clustering to the neighborhood abundance matrix. The intuition behind this approach is that cell neighborhoods with similar donor representations tend to have more correlated scEPS statistics.
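As a rough illustration of the clustering idea, the sketch below applies scikit-learn's MiniBatchKMeans to a toy abundance matrix. The layout of the matrix (rows are cell neighborhoods, columns are donors, entries are the fraction of each neighborhood's cells contributed by each donor) is an assumption made for illustration, as are the toy sizes; the matrix actually built by sceps_cluster_neighborhood.py may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Toy abundance matrix (assumed layout): one row per cell neighborhood,
# one column per donor, each row summing to 1.
n_neighborhoods, n_donors = 300, 8
abundance = rng.dirichlet(np.ones(n_donors), size=n_neighborhoods)

# Group neighborhoods with similar donor representation into blocks.
# n_clusters plays the role of --num-kmeans-cluster, random_state of --seed.
kmeans = MiniBatchKMeans(n_clusters=10, random_state=0, n_init=3)
block_labels = kmeans.fit_predict(abundance)

# Number of neighborhoods assigned to each block.
print(np.bincount(block_labels, minlength=10))
```

Neighborhoods that share a label have similar donor composition and would be resampled together as one bootstrap block.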

Please note that this step of defining approximately independent cell neighborhood blocks should be applied across all cells in the single-cell data, even if independent analyses are performed for each cell type separately in Step 1.

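We provide a Python command line tool to cluster cell neighborhoods into approximately independent blocks. Below, we will go over the input used for cell neighborhood clustering, example ways to run the tool, an explanation of the output, and an explanation of all the available options.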

Input used for cell neighborhood clustering

The clustering tool (sceps_cluster_neighborhood.py) should be applied to the same single-cell RNA-seq data analyzed by scEPS in Step 1 (Estimate scEPS statistics at each individual cell neighborhood). However, the clustering tool only uses the batch-harmonized k-NN graph for the cells, so the user may reduce memory usage in this step by removing adata.X from the single-cell data.

Example ways to run the cell neighborhood clustering tool

The following code snippet is a typical example shell script to run the neighborhood clustering tool under default settings.

python <path to the scEPS package>/sceps_cluster_neighborhood.py \
    --adata <path to single-cell data> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --donor-id-col <column in adata.obs representing donor ID> \
    --neighbors-use-rep <cell embedding used to construct k-NN graph> \
    --num-kmeans-cluster <number of k-means clusters, 100 by default> \
    --out <output file name>

Explanation of the output

After sceps_cluster_neighborhood.py runs successfully, it saves a data frame as a text file under the file name specified by the --out flag. Each row of the data frame maps a cell neighborhood (indexed by its cell) to an approximately independent cluster. The data frame has the following columns:

<cell ID>      sceps.neighborhood_cluster

Here, <cell ID> is the name of cell ID column in the single-cell data, as specified by the --cell-id-col flag; sceps.neighborhood_cluster is the approximately independent cluster assigned to the cell neighborhood.
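To show how these cluster assignments could serve as bootstrap blocks, here is a minimal sketch using only the Python standard library. It assumes the output file is tab-delimited, uses a hypothetical cell ID column named cell_id and made-up per-neighborhood statistics, and takes the mean as the aggregated statistic; none of this is confirmed to match how scEPS aggregates or resamples internally.

```python
import csv
import io
import random
import statistics
from collections import defaultdict


def block_bootstrap_se(values, blocks, n_boot=500, seed=0):
    """Standard error of the mean of `values`, resampling whole blocks."""
    by_block = defaultdict(list)
    for value, block in zip(values, blocks):
        by_block[block].append(value)
    block_ids = sorted(by_block)
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Draw as many blocks as there are, with replacement, and pool them.
        sampled = rng.choices(block_ids, k=len(block_ids))
        pooled = [v for b in sampled for v in by_block[b]]
        boot_means.append(statistics.fmean(pooled))
    return statistics.stdev(boot_means)


# Toy stand-in for the file written by --out (tab-delimited is an assumption).
toy_output = (
    "cell_id\tsceps.neighborhood_cluster\n"
    "c1\t0\nc2\t0\nc3\t1\nc4\t1\nc5\t2\nc6\t2\n"
)
rows = list(csv.DictReader(io.StringIO(toy_output), delimiter="\t"))
blocks = [row["sceps.neighborhood_cluster"] for row in rows]

# Hypothetical per-neighborhood statistics, in the same row order.
toy_stats = [0.9, 1.1, 2.0, 2.2, 3.1, 2.9]
se = block_bootstrap_se(toy_stats, blocks)
print(se)
```

Resampling whole blocks, rather than individual neighborhoods, keeps correlated neighborhoods together so that their dependence is reflected in the standard error.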

Explanation of all the options in the cell neighborhood clustering tool

Below, we provide detailed explanations for all the available options implemented in the cell neighborhood clustering tool.

| Option | Usage | Description |
| --- | --- | --- |
| --adata | string, required | This is used to specify the input single-cell RNA-seq data in h5ad format. This should be the same single-cell data as analyzed by scEPS. However, the user may remove adata.X to reduce memory usage, as the clustering tool only requires the k-NN graph for the cells. |
| --cell-id-col | string, optional, empty string by default | This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use what's in adata.obs.index as cell IDs. |
| --donor-id-col | string, required | This is used to specify the name of the column that represents donor IDs in the adata.obs data frame of the single-cell data. |
| --neighbors-use-rep | string | This is used to specify the cell embedding (e.g., PCA, scVI embeddings, etc.) used to construct the k-NN graph. |
| --num-kmeans-cluster | integer, optional, 100 by default | This is used to specify the desired number of clusters (i.e., approximately independent blocks of cell neighborhoods). |
| --seed | integer, optional, 0 by default | This is used to specify the seed for the random number generator. |
| --out | string, required | This is used to specify the output file name. |