Step 4. Calculate the correlation between scEPS statistics and other attributes

Huwenbo Shi edited this page Jun 11, 2026 · 1 revision

Overview

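We provide a tool, sceps_corr.py, for calculating the correlation between scEPS statistics and gene expression across cells, where each cell indexes a cell neighborhood. Below, we go over the required input, example scripts for running the tool, the output format, and all available options.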

Input required for running the tool

This tool requires the following data as input:

  • A single-cell dataset in h5ad format, containing a log-normalized gene expression matrix

  • A text file containing a data frame of estimated scEPS statistics across all cell neighborhoods

  • (optional) A text file (with header) containing a list of cell IDs, across which the correlation is calculated
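
The optional cell list can be written with a few lines of Python. This is a minimal sketch, assuming the file holds one cell ID per line under a single header line; the helper name `write_focal_cells`, the header `cell_id`, and the file name are placeholders chosen for illustration. The header should match the value passed to --cell-id-col.

```python
import os
import tempfile


def write_focal_cells(cell_ids, path, header="cell_id"):
    """Write one cell ID per line under a single header line.

    `header` is a placeholder; it should match the value given to --cell-id-col.
    """
    with open(path, "w") as handle:
        handle.write(header + "\n")
        for cell_id in cell_ids:
            handle.write(str(cell_id) + "\n")


# Toy usage with made-up cell IDs, written to a temporary directory.
focal_path = os.path.join(tempfile.mkdtemp(), "focal_cells.txt")
write_focal_cells(["cell_1", "cell_2", "cell_3"], focal_path)
```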

Example script to run the tool

The following shell script calculates the correlation (Spearman's $\rho$) between the expression of each gene and the scEPS statistics across all cells.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| --adata | string, required | This is used to specify the path to the single-cell data containing the log-normalized gene expression matrix. |
| --sceps-result | string, required | This is used to specify a text file containing the data frame of estimated scEPS statistics across all cell neighborhoods. |
| --cell-id-col | string, optional, empty string by default | This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use adata.obs.index as cell IDs. |
| --gene-id-col | string, optional, empty string by default | This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use adata.var.index as gene symbols/IDs. |
| --out | string, required | This is used to specify the prefix of the output file name. |
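
To make concrete what a single per-gene value represents, the sketch below computes Spearman's $\rho$ as the Pearson correlation of ranks, with tied values given their average rank. It is a pure-Python illustration on toy data, not the tool's own implementation; the function names and the numbers are made up for this example.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: expression of one gene and one scEPS statistic over five cells.
expression = [0.0, 1.2, 0.0, 2.5, 3.1]
statistic = [0.10, 0.30, 0.20, 0.40, 0.90]
rho = spearman(expression, statistic)
```

The zeros in `expression` illustrate why tie handling matters for sparse single-cell data: many cells share the same (zero) value and therefore the same rank.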

The following shell script calculates the correlation (Spearman's $\rho$) between the expression of each gene and the scEPS statistics across cells of a particular cell type.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --cell-type-col <column in adata.obs representing cell types> \
    --focal-cell-type <focal cell type to analyze> \
    --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| --cell-type-col | string, optional, empty string by default | This is used to specify the name of the column that represents cell types in the adata.obs data frame of the single-cell data. |
| --focal-cell-type | string, optional, empty string by default | This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. |
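
When the same analysis is repeated for several cell types, the command line can be assembled programmatically. The sketch below uses only the flags documented above; the helper `build_command`, all paths, the column name `cell_type`, and the cell type labels are placeholders. It builds and prints the commands without running them.

```python
import shlex


def build_command(script, adata, sceps_result, cell_type_col, focal_cell_type, out_prefix):
    """Assemble the argument list for one focal cell type."""
    return [
        "python", script,
        "--adata", adata,
        "--sceps-result", sceps_result,
        "--cell-type-col", cell_type_col,
        "--focal-cell-type", focal_cell_type,
        "--out", out_prefix,
    ]


# Placeholder inputs; replace with real paths and labels.
cell_types = ["T cell", "B cell"]
commands = [
    build_command(
        script="sceps_corr.py",
        adata="data.h5ad",
        sceps_result="sceps_result.txt",
        cell_type_col="cell_type",
        focal_cell_type=cell_type,
        out_prefix="corr." + cell_type.replace(" ", "_"),
    )
    for cell_type in cell_types
]

for command in commands:
    # shlex.join quotes arguments that contain spaces.
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to execute it; keeping the arguments as a list avoids shell quoting problems with labels that contain spaces.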

By default, we do not provide test statistics for the significance of the correlations, because it is difficult to derive a test that accounts for the correlations in both the scEPS statistics and the gene expression across cells. Nevertheless, the user can obtain test statistics by bootstrapping across approximately independent neighborhood blocks (see the example script below). The user should be cautious when interpreting these test statistics.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --neighborhood-clusters <a text file mapping cell neighborhoods to blocks> \
    --num-bootstrap 1000 \
    --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| --neighborhood-clusters | string, optional, empty string by default | This is used to specify the text file from step 2, representing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool will use regular bootstrap instead of block bootstrap. |
| --num-bootstrap | integer, optional, 0 by default | This is used to specify the number of bootstrap samples (0 by default, not calculating test statistics). |
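
The idea behind a block bootstrap is to resample whole blocks with replacement, so that correlated neighborhoods stay together in every replicate. The sketch below illustrates this on toy data. It rests on several assumptions that the tool may not share: the standard error is taken as the standard deviation of the bootstrap replicates, the Z-score as the estimate divided by that standard error, and the p-value as two-sided under a normal approximation. Pearson correlation is used for brevity, and all function and variable names are hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def block_bootstrap(x, y, blocks, num_bootstrap=1000, seed=0):
    """Return (estimate, SE, Z, two-sided p) by resampling whole blocks."""
    rng = random.Random(seed)
    members = {}
    for index, block in enumerate(blocks):
        members.setdefault(block, []).append(index)
    block_ids = sorted(members)

    estimate = pearson(x, y)
    replicates = []
    for _ in range(num_bootstrap):
        # Draw as many blocks as exist, with replacement.
        drawn = rng.choices(block_ids, k=len(block_ids))
        idx = [i for block in drawn for i in members[block]]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined for a constant sample
        replicates.append(pearson(xs, ys))

    se = statistics.stdev(replicates)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value (assumption)
    return estimate, se, z, p


# Toy data: 60 neighborhoods grouped into 10 blocks of 6.
toy_rng = random.Random(1)
x = [toy_rng.gauss(0, 1) for _ in range(60)]
y = [a + toy_rng.gauss(0, 1) for a in x]
blocks = [i // 6 for i in range(60)]
estimate, se, z, p = block_bootstrap(x, y, blocks, num_bootstrap=200)
```

Resampling individual neighborhoods instead of blocks corresponds to giving every neighborhood its own block ID, which tends to understate the standard error when neighboring cells are correlated.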

Explanation of the output

A typical run of sceps_corr.py creates a text file with the following columns:

| Column name | Description |
|---|---|
| GENE | Gene name. |
| R_OMEGA_GWAS | Correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells. |
| R_OMEGA_CONTROL | Correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells. |
| R_OMEGA_REST | Correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells. |
| R_OMEGA_OVERALL | Correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells. |
| R_OMEGA_DIFF | Correlation between gene expression and scEPS $d$ statistics across cells. |

If the user chooses to obtain test statistics for each correlation, the following additional columns will be included in the output text file:

| Column name | Description |
|---|---|
| SE_R_OMEGA_GWAS | Standard error of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells. |
| Z_R_OMEGA_GWAS | Z-score of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells. |
| P_R_OMEGA_GWAS | p-value of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells. |
| SE_R_OMEGA_CONTROL | Standard error of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells. |
| Z_R_OMEGA_CONTROL | Z-score of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells. |
| P_R_OMEGA_CONTROL | p-value of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells. |
| SE_R_OMEGA_REST | Standard error of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells. |
| Z_R_OMEGA_REST | Z-score of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells. |
| P_R_OMEGA_REST | p-value of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells. |
| SE_R_OMEGA_OVERALL | Standard error of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells. |
| Z_R_OMEGA_OVERALL | Z-score of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells. |
| P_R_OMEGA_OVERALL | p-value of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells. |
| SE_R_OMEGA_DIFF | Standard error of the correlation between gene expression and scEPS $d$ statistics across cells. |
| Z_R_OMEGA_DIFF | Z-score of the correlation between gene expression and scEPS $d$ statistics across cells. |
| P_R_OMEGA_DIFF | p-value of the correlation between gene expression and scEPS $d$ statistics across cells. |
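
Once the output file exists, it can be parsed and ranked with the standard library. The sketch below assumes the file is tab-delimited, which is not stated on this page; the rows are made-up values laid out in the documented column order, and the helper names are hypothetical.

```python
import csv
import io

# Made-up rows in the documented column layout; the tab delimiter is an assumption.
toy_output = (
    "GENE\tR_OMEGA_GWAS\tR_OMEGA_CONTROL\tR_OMEGA_REST\tR_OMEGA_OVERALL\tR_OMEGA_DIFF\n"
    "GENE_A\t0.12\t0.02\t0.01\t0.05\t0.10\n"
    "GENE_B\t-0.30\t0.01\t0.00\t-0.08\t-0.31\n"
    "GENE_C\t0.04\t0.03\t0.02\t0.03\t0.01\n"
)


def read_correlations(handle, delimiter="\t"):
    """Parse the table into dicts, converting every column except GENE to float."""
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        row = {"GENE": record["GENE"]}
        for name, value in record.items():
            if name != "GENE":
                row[name] = float(value)
        rows.append(row)
    return rows


def top_genes(rows, column, n=10):
    """Return the n genes with the largest absolute correlation in `column`."""
    ranked = sorted(rows, key=lambda row: abs(row[column]), reverse=True)
    return [row["GENE"] for row in ranked[:n]]


rows = read_correlations(io.StringIO(toy_output))
strongest = top_genes(rows, "R_OMEGA_DIFF", n=2)
```

For a real run, replace the `io.StringIO` object with an open file handle on the output file.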

Explanation of all the options in the tool

The following options are used to specify the input for calculating the correlations.

| Option | Usage | Description |
|---|---|---|
| --sceps-result | string, required | This is used to specify a text file containing the data frame of estimated scEPS statistics across all cell neighborhoods. |
| --adata | string, required | This is used to specify the path to the single-cell data containing the log-normalized gene expression matrix. |
| --cell-id-col | string, optional, empty string by default | This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use adata.obs.index as cell IDs. |
| --gene-id-col | string, optional, empty string by default | This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use adata.var.index as gene symbols/IDs. |
| --lognorm-count | bool, optional, False by default | If specified, the tool will apply log-normalization to the single-cell data. This flag should only be specified if the data contains raw read counts. |
| --cell-type-col | string, optional, empty string by default | This is used to specify the name of the column that represents cell types in the adata.obs data frame of the single-cell data. |
| --focal-cell-type | string, optional, empty string by default | This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. |
| --focal-cells | string, optional, empty string by default | This is used to specify a text file containing a list of focal cell neighborhoods to analyze. If specified, scEPS will only analyze cells in this file. The header of this file should be the same as specified by --cell-id-col. |

The following options are used for job parallelization purposes.

| Option | Usage | Description |
|---|---|---|
| --total-num-job | integer, optional, None by default | This is used to specify the total number of parallel jobs used for the scEPS analysis. |
| --job-idx | integer, optional, None by default | This is used to specify the index of the parallel job for analyzing a subset of the single-cell data. |
| --start-idx | integer, optional, None by default | This is used to specify the starting index (inclusive) of the cell neighborhood to analyze. |
| --stop-idx | integer, optional, None by default | This is used to specify the stopping index (exclusive) of the cell neighborhood to analyze. |
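
This page does not state how --total-num-job and --job-idx map to neighborhood indices. One common scheme, shown below purely as an illustration, splits the neighborhoods into contiguous, nearly equal half-open ranges that match the inclusive start and exclusive stop convention of --start-idx and --stop-idx. The 0-based job index and the helper name `job_range` are assumptions.

```python
def job_range(num_items, total_num_job, job_idx):
    """Return the half-open range [start, stop) handled by one job.

    Assumes a 0-based job_idx; the first `num_items % total_num_job`
    jobs receive one extra item.
    """
    if not 0 <= job_idx < total_num_job:
        raise ValueError("job_idx must be in [0, total_num_job)")
    base, extra = divmod(num_items, total_num_job)
    start = job_idx * base + min(job_idx, extra)
    stop = start + base + (1 if job_idx < extra else 0)
    return start, stop


# Toy example: 10 neighborhoods split across 3 jobs.
ranges = [job_range(10, 3, job) for job in range(3)]
```

The ranges are disjoint and together cover every index exactly once, so the per-job outputs can be concatenated afterwards.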

The following options are used to specify how to calculate the correlation.

| Option | Usage | Description |
|---|---|---|
| --corr-type | string from {"spearman", "pearson"}, optional, "spearman" by default | This is used to specify the correlation type. The user can choose from "spearman" and "pearson" with "spearman" as the default. |
| --min-num-nonzero | integer, optional, 100 by default | This is used to specify the minimum number of cells the gene needs to be expressed in (100 by default). |
| --neighborhood-clusters | string, optional, empty string by default | This is used to specify the text file from step 2, representing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool will use regular bootstrap instead of block bootstrap. |
| --num-bootstrap | integer, optional, 0 by default | This is used to specify the number of bootstrap samples (0 by default, not calculating test statistics). |
| --block-bootstrap | string, optional, sceps.neighborhood_cluster by default | This is used to specify the column that represents approximately independent cell neighborhood blocks (sceps.neighborhood_cluster by default). If left empty, the tool will fall back to regular bootstrap across individual (instead of blocks of) cell neighborhoods. |
| --seed | integer, optional, 0 by default | This is used to specify the seed for the random number generator (0 by default). |
| --out | string, required | This is used to specify the prefix of the output file name. |
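
The effect of --min-num-nonzero can be pictured as a per-gene count of cells with nonzero expression. The sketch below assumes that "expressed" means a nonzero entry in the expression matrix, and it uses a small dense list of rows for readability; the helper name and toy values are made up.

```python
def genes_passing_filter(expression, gene_ids, min_num_nonzero=100):
    """Return gene IDs with a nonzero value in at least `min_num_nonzero` cells.

    `expression` is a list of rows (one per cell), each with one value per gene.
    """
    counts = [0] * len(gene_ids)
    for cell in expression:
        for j, value in enumerate(cell):
            if value != 0:
                counts[j] += 1
    return [gene for gene, count in zip(gene_ids, counts) if count >= min_num_nonzero]


# Toy matrix: 4 cells x 3 genes, filtered with a threshold of 2 cells.
toy_expression = [
    [0.0, 1.1, 0.0],
    [0.5, 0.9, 0.0],
    [0.0, 2.0, 0.0],
    [0.7, 0.0, 1.3],
]
kept = genes_passing_filter(toy_expression, ["G1", "G2", "G3"], min_num_nonzero=2)
```

Genes expressed in very few cells have mostly tied ranks, so their correlations are noisy; a minimum count removes them before any correlation is computed.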