Skip to content

Step 1. Estimate scEPS statistics at each individual cell neighborhood

Huwenbo Shi edited this page Jun 11, 2026 · 2 revisions

Overview

The first step in a scEPS analysis is to estimate the scEPS $d$, $\omega^2_{gwas}$, $\omega^2_{ctrl}$, $\omega^2_{rest}$, $\omega^2_{overall}$ statistics for each cell neighborhood, using sceps.py. Here, we will go over:

Input required to run scEPS

scEPS requires the following as input:

  • (required) A text file containing the MAGMA association statistics for each gene. The file should contain a GENE column, which contains gene symbols or IDs that match those used in the single-cell data. The file should also contain a ZSTAT column, which represent the MAGMA association statistics from step 0a.

  • (required) A single-cell data in h5ad format. The single-cell data should contain at least 20 donors showing variations in their disease phenotype.

  • (optional) A text file containing disease phenotypes for the donors in the single-cell data. scEPS can analyze phenotypes that is stored in a file external to the single-cell data.

  • (optional) A list of cell IDs to apply scEPS analysis. If provided, scEPS will only estimate the scEPS statistics for cell neighborhoods indexed by these cells.

Example ways to run scEPS

Typical shell script to run scEPS

The following code snippet is a typical example shell script to run scEPS under default settings.

python <path to scEPS package>/sceps.py \
    --adata <path to single-cell data> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --donor-id-col <column in adata.obs representing donor ID> \
    --gene-list <path to the file containing MAGMA gene-level association statistics> \
    --auto-gene-selection \
    --pheno <column in adata.obs representing the phenotype> \
    --scale-pheno \
    --scale-pheno-neighborhood \
    --covar <a list of covariates to adjust, separated by commas> \
    --total-num-job <total number of parallel jobs used to analyze the single-cell data> \
    --job-idx <index for the current parallel job> \
    --out <prefix for the output file name>
Option Usage Description
--adata string, required This is used to specify the input single-cell RNA-seq data in h5ad format. The single-cell data should consist of multiple donors, showing variations in their disease phenotypes.
--cell-id-col string, optional, empty string by default This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use what's in adata.obs.index as cell IDs.
--gene-id-col string, optional, empty string by default This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use what's in adata.var.index as gene symbols/IDs.
--donor-id-col string, required This is used to specify the name of the column that represents donor IDs in the adata.obs data frame of the single-cell data.
--gene-list string, required This is used to specify a text file containing the gene-level MAGMA associaiton statistics. The text file should contain a GENE column and a ZSTAT column.
--auto-gene-selection bool, optional, False by default If specified (recommended), scEPS will automatically select GWAS genes based on the gene-level MAGMA association statistics. The user can also bypass auto gene selection, using the --num-gwas-genes flag to specify the number of top GWAS genes.
--pheno string, required This is used to specify the column that represents phenotype in the adata.obs data frame of the single-cell data or in the text file external to the single-cell data.
--scale-pheno bool, optional, , False by default If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at the global level (i.e., before variance decomposition at individual cell neighborhood level). Otherwise, scEPS will only mean-center the phenotype.
--scale-pheno-neighborhood bool, optional, , False by default If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at each cell neighborhood, on top of the global level standardization. Otherwise, scEPS will only mean-center the phenotype.
--covar string, optional This is used to specify a list of columns in adata.obs that represent covariates to adjust. Multiple covariates names should be separated by commas. We recommend to include sceps.prop_cells_local in this flag, to adjust for the impact of neighborhood abundance on the phenotype. An example specification for --covar could be --covar "age,gender,sceps.prop_cells_local".
--total-num-job integer, optional, None by default This is used to specify the total number of parallel jobs used for the scEPS analysis.
--job-idx integer, optional, None by default This is used to specify the index of the parallel job for analyzing a subset of the single-cell data.
--out string, required This is used to specify the prefix for the output file names.

Analyzing phenotypes saved in a text file

If the phenotype is not stored in the single-cell h5ad file, the user can provide a text file with the phenotype, using the --pheno-file flag, and specify the phenotype to analyze using the --pheno flag.

The following is an example shell script to analyze phenotype saved in an external text file.

python <path to scEPS package>/sceps.py \
    --adata <path to single-cell data> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --donor-id-col <column in adata.obs representing donor ID> \
    --gene-list <path to the file containing MAGMA gene-level association statistics> \
    --auto-gene-selection \
    --pheno-file <an external text file containing the phenotype> \
    --pheno <column in the phenotype file above representing the phenotype> \
    --scale-pheno \
    --scale-pheno-neighborhood \
    --covar <a list of covariates to adjust, separated by commas> \
    --total-num-job <total number of parallel jobs used to analyze the single-cell data> \
    --job-idx <index for the current parallel job> \
    --out <prefix for the output file name>

Here, the --pheno-file flag is used to specify the external phenotype file, which should have a column representing the donor IDs and one or multiple columns representing the phenotypes. The name of the column for donor IDs should be the same as the one used in the single-cell data (i.e., specified by the --donor-id-col flag). And at least one of the names of the columns for phenotypes should be the one specified by the --pheno flag.

Calling scEPS functions within Python

We also allow the user to call scEPS functions from within Python, instead of from command line. The following code snippet provides an example of how to do this. Briefly, we provide a function, create_sceps_default_args, to specify the default/recommended settings to run scEPS. The users can provide additional settings, including input files, cell IDs, and gene symbols, etc., by modifying the args object.

import sys
sys.path.append('/path/to/sceps_tool')

from src.sceps_core import *

# Create args with required arguments and default settings to run scEPS
args = create_sceps_default_args(adata='/path/to/the/single/cell/data',
                                gene_list='/path/to/the/gene/level/MAGMA/association/statistics')

# Specify additional settings as needed
args.cell_id_col = '<cell ID column in adata.obs>'
args.gene_id_col = '<gene ID column in adata.var>'
args.donor_id_col = '<donor ID column in adata.obs>'
args.pheno_file = '/path/to/an/external/phenotype/file'
args.pheno = '<phenotype column in adata.obs>'
args.covar = '<a list of covariates to adjust, separated by commas>'
args.out = '<output file name>'

# Set seed for random number generator
np.random.seed(args.seed)
random.seed(args.seed)

# Load the single-cell data
adata = load_single_cell_data(args)

# Load the list of genes with MAGMA Z-stat
gene_list = load_gene_stats(args)

# Intersect the genes in the adata with genes with MAGMA Z-stat
adata = subset_adata_genes(args, adata, gene_list)

# Determine the list of GWAS genes
gwas_genes = select_disease_gwas_genes(args, gene_list)

# Run scEPS for the first 10 cells in the single-cell data
start_idx, stop_idx = 0, 10
sceps_out = get_sceps_stats(args, adata, gwas_genes, start_idx=start_idx, stop_idx=stop_idx)

# Save results
out_suffix = '{}-{}'.format(start_idx, stop_idx)
save_sceps_stats(args, sceps_out, out_suffix)

Explanation of the columns of the scEPS ouput

The output of scEPS is saved in a compressed text file with suffix .sceps.omega.txt.gz. Each row of the output of scEPS represents analysis results for a cell neighborhood, anchored at individual cells. Below, we provide interpretations for each column of the output of scEPS. Although there are many columns in the scEPS output, the most important ones include CELL, which indexes the cell neighborhood, and OMEGA_DIFF, SE_OMEGA_DIFF, Z_OMEGA_DIFF, and P_Z_OMEGA_DIFF, which represent the estimated scEPS $d$ statistic, its standard error, and test statistics, respectively.

Column name Description
CELL Cell ID that indexes the cell neighborhood.
DONOR_ID Donor ID for the cell that indexes the neighborhood.
STEP_SIZE Size of the random walk on the k-NN graph used to define the cell neighborhood.
NEIGHBORHOOD_SIZE Number of cells in the neighborhood.
NUM_DONOR Number of donors represented in the neighborhood.
ELAPSED_TIME Time (in seconds) used to analyze the cell neighborhood.
MEAN_MEAN_EXPR_GWAS Mean expression of each GWAS gene across donors (prior to mean centering), averaged across all GWAS genes.
MEAN_MEAN_EXPR_CONTROL Mean expression of each control gene across donors (prior to mean centering), averaged across all control genes.
MEAN_MEAN_EXPR_REST Mean expression of each remaining (non-GWAS/control) gene across donors (prior to mean centering), averaged across all remaining genes.
MEAN_MEAN_EXPR_ALL Mean expression of each gene (GWAS, control, and rest) across donors (prior to mean centering), averaged across all genes.
MEAN_VAR_EXPR_GWAS Variance of the expression of each GWAS gene across donors, averaged across all GWAS genes.
MEAN_VAR_EXPR_CONTROL Variance of the expression of each control gene across donors, averaged across all control genes.
MEAN_VAR_EXPR_REST Variance of the expression of each remaining (non-GWAS/control) gene across donors, averaged across all remaining genes.
MEAN_VAR_EXPR_ALL Variance of the expression of each gene (GWAS, control, and rest) across donors, averaged across all genes.
NUM_GENE_GWAS Number of GWAS genes.
NUM_GENE_CONTROL Number of control genes.
NUM_GENE_REST Number of remaining (non-GWAS/control) genes.
OMEGA_GWAS Estimated $\omega^2_{gwas}$ statistic, representing the variance in disease phenotype explained by each GWAS gene.
SE_OMEGA_GWAS Standard error for the estimated $\omega^2_{gwas}$ statistic.
Z_OMEGA_GWAS Z-score for the estimated $\omega^2_{gwas}$ statistic.
P_Z_OMEGA_GWAS p-value testing whether the estimated $\omega^2_{gwas}$ is different from 0.
OMEGA_CONTROL Estimated $\omega^2_{ctrl}$ statistic, representing the variance in disease phenotype explained by each control gene.
SE_OMEGA_CONTROL Standard error for the estimated $\omega^2_{ctrl}$ statistic.
Z_OMEGA_CONTROL Z-score for the estimated $\omega^2_{ctrl}$ statistic.
P_Z_OMEGA_CONTROL p-value testing whether the estimated $\omega^2_{ctrl}$ is different from 0.
OMEGA_REST Estimated $\omega^2_{rest}$ statistic, representing the variance in disease phenotype explained by each remaining (non-GWAS/control) gene.
SE_OMEGA_REST Standard error for the estimated $\omega^2_{rest}$ statistic.
Z_OMEGA_REST Z-score for the estimated $\omega^2_{rest}$ statistic.
P_Z_OMEGA_REST p-value testing whether the estimated $\omega^2_{rest}$ is different from 0.
OMEGA_OVERALL Estimated $\omega^2_{overall}$ statistic, representing the variance in disease phenotype explained by each gene (GWAS, control, and rest).
SE_OMEGA_OVERALL Standard error for the estimated $\omega^2_{overall}$ statistic.
Z_OMEGA_OVERALL Z-score for the estimated $\omega^2_{overall}$ statistic.
P_Z_OMEGA_OVERALL p-value testing whether the estimated $\omega^2_{overall}$ is different from 0.
OMEGA_DIFF Estimated $d$ statistic, representing the difference between the estimated $\omega^2_{gwas}$ and $\omega^2_{ctrl}$.
SE_OMEGA_DIFF Standard error for the estimated $d$ statistic.
Z_OMEGA_DIFF Z-score for the estimated $d$ statistic.
P_Z_OMEGA_DIFF p-value testing whether the estimated $d$ is different from 0.

Explanation of all the options in the scEPS analysis

Below, we provide detailed explanations for all the available options implemented in the scEPS analysis.

Input related options

The following options are used to specify the inputs for the scEPS analysis.

Option Usage Description
--adata string, required This is used to specify the input single-cell RNA-seq data in h5ad format. The single-cell data should consist of multiple donors, showing variations in their disease phenotypes.
--cell-type-col string, optional, empty string by default This is used to specify the name of the column that represents cell types in the adata.obs data frame of the single-cell data. This is an emtpy string by default.
--focal-cell-type string, optional, empty string by default This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. This is an emtpy string by default.
--gene-id-col string, optional, empty string by default This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use what's in adata.var.index as gene symbols/IDs.
--cell-id-col string, optional, empty string by default This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use what's in adata.obs.index as cell IDs.
--donor-id-col string, required This is used to specify the name of the column that represents donor IDs in the adata.obs data frame of the single-cell data.
--pheno-file ostring, ptional, empty string by default This is used to specify an external text file containing donor phenotypes. The text file should contain a donor ID column, same as the one used in the single-cell data, and one or more columns representing donor phenotypes.
- --batch-key string, optional, empty string by default This is used to specify batch information for the donors. If specified, scEPS will permute the donors within each batch, if using permutation-based testing procedure.
--gene-list string, required This is used to specify a text file containing the gene-level MAGMA associaiton statistics. The text file should contain a GENE column and a ZSTAT column.
--control-gene-list string, optional, empty string by default This is used to specify a text file containing a list of pre-determined control gene list. By default, scEPS will automatically randomly select a list of control genes.
--focal-cells string, optional, empty string by default This is used to specify a text file containing a list of focal cell neighborhoods to analyze. If specified, scEPS will only analyze cells in this file.

Parallelization related options

scEPS offers 2 different ways to parallelize the analysis.

Option 1. The user can specify the total number of parallel jobs spawned for the analysis. scEPS will automatically evenly split the work load across these parallel jobs.

Option 2. The user can provde a starting and stopping index of the cell neighborhooods to analyze.

Option Usage Description
--total-num-job integer, optional, None by default This is used to specify the total number of parallel jobs used for the scEPS analysis.
--job-idx integer, optional, None by default This is used to specify the index of the parallel job for analyzing a subset of the single-cell data.
--start-idx integer, optional, None by default This is used to specify the starting index (inclusive) of the cell neighborhood to analyze.
--stop-idx integer, optional, None by default This is used to specify the stopping index (exclusive) of the cell neighborhood to analyze.

QC related options

The following options is used to specify QC parameters used in the scEPS analysis.

Option Usage Description
--min-expr-var-thres float, optional, 1e-4 by default This is used to specify the minimum threshold on the variance of gene expression across donors. By default, genes with variance across donors less than 1e-4 are filtered.
--max-expr-var-thres float, optional, 100.0 by default This is used to specify the maximum threshold on the variance of gene expression across donors. By default, genes with variance across donors greater than 100.0 are filtered.

Neighborhood definition related options

The following options are used to specify how scEPS define cell neighborhoods based on the k-NN graph.

Option Usage Description
--min-num-donor integer, optional, 8 by default This is used to specify the minimum number of donors required in each cell neighborhood. By default, we set this number to 8.
--min-frac-donor float, optional, 0.333333 by default This is used the specify the fraction of total number of donors in the single-cell data required to be represented in each cell neighborhood. By default, we set this to 0.333333 (i.e., approximately 1/3).
--min-num-cell integer, optional, 5 by default is used to specify the minimum number of cells required in each cell neighborhood. By default, we set this to 5.
--neighborhood-definition-method string from {"nam", "soft threshold", "hard threshold"}, optional, "nam" by default This is used to specify the neighborhood definition method. The user can choose from "nam", "soft threshold", and "hard threshold". By default scEPS uses the "nam" approach, defining cell neighborhoods based on the neighborhood abundance matrix.
--check-pheno-in-neighborhood bool, optional, False by default If specified, scEPS will ensure that the variance of the phenotype across donors in the cell neighborhood is non-zero.
--max-step-size integer, optional, 15 by default This is used to specify the maximum number of random walk step size to test. By default, this is set to 15.
--prob-thres float, optional, 0.0001 by default This is used to specify the threshold for the transition probability, if --neighborhood-definition-method is set to "hard threshold".

GWAS gene selction related options

The following options are used to specify how scEPS determines the set of GWAS genes to analyze.

Option Usage Description
--auto-gene-selection bool, optional, , False by default If specified (recommended), scEPS will automatically select GWAS genes based on the gene-level MAGMA association statistics. The user can also bypass auto gene selection, using the --num-gwas-genes flag to specify the number of top GWAS genes.
--num-gwas-genes integer, optional, 500 by default This is used to specify the number of top GWAS genes, if the user chooses to manually select the top GWAS genes to analyze.
--magma-fdr-thres float, optional, 0.05 by default This is used to specify the FDR threshold to select GWAS genes, if the user chooses to use the automatically approach for selecting GWAS genes.
--min-num-gwas-genes integer, optional, 500 by default This is used to specify the minimum number of GWAS genes included in the scEPS analysis. By default, this is set to 500.
--max-num-gwas-genes integer, optional, 2,000 by default This is used to specify the maximum number of GWAS genes included in the scEPS analysis. By default, this is set to 2,000.

Control gene selection related options

The following options are used to specify how scEPS determines the list of control genes for the GWAS genes.

Option Usage Description
--enrich-low-score-control-genes bool, optional, False by default If specified, the control gene selection procedure will preferentially select control genes starting from the lowest gene-level MAGMA association statistics.
--include-gwas-genes-in-control-genes bool, optional, False by default If specified, the set of GWAS genes will also be included in the sampling of control genes.
--num-expr-bins integer, optional, 10 by default This is used to specify the number of bins for binning all the genes based on their mean expression. The default is 10.
--num-control-gene-set integer, optional, 1 by default This is used to specify the number of control gene sets. This should be set to 1 in general.
--num-control-gene-set-testing integer, optional, 0 by default This is used to specify the number of control gene sets use for permutation-based testing. By default, this number is set to 0.
If --match-expr-std bool, optional, False by default is specified, scEPS will also match the standard deviation of the gene expression of the control genes with that of GWAS genes.

Estimation related options

The following options are used to specify how scEPS estimates the variance components of the scEPS model using methods of moment.

Option Usage Description
--pheno string, required This is used to specify the column that represents phenotype in the adata.obs data frame of the single-cell data or in the text file external to the single-cell data.
--covar string, optional, empty string by default This is used to specify a list of columns in adata.obs that represent covariates to adjust. Multiple covariates names should be separated by commas. We recommend to include sceps.prop_cells_local in this flag, to adjust for the impact of neighborhood abundance on the phenotype. An example specification for --covar could be --covar "age,gender,sceps.prop_cells_local".
--num-var-comp integer from {2, 3}, optional, 3 by default This is used to specify the number of variance components. By default, this is set to 3 (recommended). And scEPS will decompose the phenotypic variance into GWAS genes, control genes, and remaining (non-GWAS/control) genes. If this is set to 2 (not recommended), scEPS will decompose the phenotypic variance into GWAS genes and control genes.
--num-bootstrap-disattenuation integer, optional, 100 by default This is used to specify the number of bootstap samples used to estimate the disattenuation factor. By default, this is set to 100.
--num-bootstrap-regression integer, optional, 1,000 by default This is used to specify the number of bootstrap samples used to estimate the standard errors for the scEPS statistics. By default, this is set to 1,000.
--use-ols bool, optional, False by default If specified, scEPS will use ordinary least square regression to estimate the variance components. By default, scEPS uses weighted least square regression.
--use-analytical-stderr bool, optional, False by default If specified, scEPS will use analytical approach to calculate standard errors. This is a lot faster than using the bootstrap approach, and can be used if computational resource is limited.
--scale-pheno bool, optional, False by default If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at the global level (i.e., before variance decomposition at individual cell neighborhood level). Otherwise, scEPS will only mean-center the phenotype.
--scale-pheno-neighborhood bool, optional, False by default If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at each cell neighborhood, on top of the global level standardization. Otherwise, scEPS will only mean-center the phenotype.
--exclude-diag bool, optional, False by default If specified, scEPS will only use cross products of the phenotypes between different individuals to estimate the variance components. By default, scEPS includes squared phenotypes of the same individuals.
--exclude-intercept bool, optional, False by default If specified, scEPS will not include an intercept term in the regression. By default, an intercept term is included in the regression.
--no-disattenuation bool, optional, False by default If specified (not recommended), scEPS will not apply disattenuation factor on the estimates.
--skip-regression bool, optional, False by default If specified, scEPS will skip the estimation procedure, but will record the summary statistics (e.g., number of cells/donors) for each neighborhood. This is typically used for diagnosis purposes only.
--permute-within-batch bool, optional, False by default If specified, scEPS will run permutation within each batch of donors. This is only used if the user chooses to use permutation to obtain p-values.

Output related options

The following options specify what scEPS output to save.

Option Usage Description
--save-neighborhood-cells bool, optional, False by default If specified, scEPS will save the cells in each neighborhood as a text file.
--save-neighborhood-donors bool, optional, False by default If specified, scEPS will save the donors in each neighborhood as a text file.
--save-target-nam bool, optional, False by default If specififed, scEPS will save the target neighborhood abundance matrix as a text file.
--save-realized-nam bool, optional, False by default If specified, scEPS will save the realized neighborhood abundance matrix as a text file.
--save-sceps-raw bool, optional, False by default If specified, scEPS will save the intermediate raw results as a text file.
--save-sceps-sigma bool, optional, False by default If specified, scEPS will save the $\sigma^2_{gwas}$, $\sigma^2_{ctrl}$, $\sigma^2_{rest}$, $\sigma^2_{overall}$, as well as the difference between $\sigma^2_{gwas}$ and $\sigma^2_{ctrl}$ as a text file.
--out string, required This is used to specify the prefix for the output file.

Miscellaneous options

The following are some miscellaneous options.

Option Usage Description
--seed integer, optional, 0 by default This is used to specify the seed for the random number generator. By default, this is set to 0.
--sample-frac float, optional, 1.0 by default This is used to down-sample the number of cell neighborhoods to analyze. By default, this is set to 1.0.
--force-rerun bool, optional, False by default This is used to force rerun the analysis. By default, if scEPS detects the output exists, it will not perform the analysis.
If --verbose bool, optional, False by default This is specified, scEPS will print intermediate output to the screen.

Simulation related options

The following option is used to evaluate scEPS in simulations, and should not be used in analysis of real traits.

Option Usage Description
--sample-frac-cell-in-neighborhood bool, optional, False by default This is used to down-sample cells within a neighborhood. By default, this is set to 1.0. This flag is used to assess the robustness of scEPS in simulations, and should not be used in analysis of real traits.