-
Notifications
You must be signed in to change notification settings - Fork 0
Step 1. Estimate scEPS statistics at each individual cell neighborhood
The first step in a scEPS analysis is to estimate the scEPS sceps.py. Here, we will go over:
scEPS requires the following as input:
-
(required) A text file containing the MAGMA association statistics for each gene. The file should contain a
GENEcolumn, which contains gene symbols or IDs that match those used in the single-cell data. The file should also contain aZSTATcolumn, which represent the MAGMA association statistics from step 0a. -
(required) A single-cell data in h5ad format. The single-cell data should contain at least 20 donors showing variations in their disease phenotype.
-
(optional) A text file containing disease phenotypes for the donors in the single-cell data. scEPS can analyze phenotypes that is stored in a file external to the single-cell data.
-
(optional) A list of cell IDs to apply scEPS analysis. If provided, scEPS will only estimate the scEPS statistics for cell neighborhoods indexed by these cells.
The following code snippet is a typical example shell script to run scEPS under default settings.
python <path to scEPS package>/sceps.py \
--adata <path to single-cell data> \
--cell-id-col <column in adata.obs representing cell ID> \
--gene-id-col <column in adata.var representing gene ID/symbol> \
--donor-id-col <column in adata.obs representing donor ID> \
--gene-list <path to the file containing MAGMA gene-level association statistics> \
--auto-gene-selection \
--pheno <column in adata.obs representing the phenotype> \
--scale-pheno \
--scale-pheno-neighborhood \
--covar <a list of covariates to adjust, separated by commas> \
--total-num-job <total number of parallel jobs used to analyze the single-cell data> \
--job-idx <index for the current parallel job> \
--out <prefix for the output file name>| Option | Usage | Description |
|---|---|---|
--adata |
string, required | This is used to specify the input single-cell RNA-seq data in h5ad format. The single-cell data should consist of multiple donors, showing variations in their disease phenotypes. |
--cell-id-col |
string, optional, empty string by default | This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use what's in adata.obs.index as cell IDs. |
--gene-id-col |
string, optional, empty string by default | This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use what's in adata.var.index as gene symbols/IDs. |
--donor-id-col |
string, required | This is used to specify the name of the column that represents donor IDs in the adata.obs data frame of the single-cell data. |
--gene-list |
string, required | This is used to specify a text file containing the gene-level MAGMA associaiton statistics. The text file should contain a GENE column and a ZSTAT column. |
--auto-gene-selection |
bool, optional, False by default |
If specified (recommended), scEPS will automatically select GWAS genes based on the gene-level MAGMA association statistics. The user can also bypass auto gene selection, using the --num-gwas-genes flag to specify the number of top GWAS genes. |
--pheno |
string, required | This is used to specify the column that represents phenotype in the adata.obs data frame of the single-cell data or in the text file external to the single-cell data. |
--scale-pheno |
bool, optional, , False by default |
If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at the global level (i.e., before variance decomposition at individual cell neighborhood level). Otherwise, scEPS will only mean-center the phenotype. |
--scale-pheno-neighborhood |
bool, optional, , False by default |
If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at each cell neighborhood, on top of the global level standardization. Otherwise, scEPS will only mean-center the phenotype. |
--covar |
string, optional | This is used to specify a list of columns in adata.obs that represent covariates to adjust. Multiple covariates names should be separated by commas. We recommend to include sceps.prop_cells_local in this flag, to adjust for the impact of neighborhood abundance on the phenotype. An example specification for --covar could be --covar "age,gender,sceps.prop_cells_local". |
--total-num-job |
integer, optional, None by default |
This is used to specify the total number of parallel jobs used for the scEPS analysis. |
--job-idx |
integer, optional, None by default |
This is used to specify the index of the parallel job for analyzing a subset of the single-cell data. |
--out |
string, required | This is used to specify the prefix for the output file names. |
If the phenotype is not stored in the single-cell h5ad file, the user can provide a text file with the phenotype, using the --pheno-file flag, and specify the phenotype to analyze using the --pheno flag.
The following is an example shell script to analyze phenotype saved in an external text file.
python <path to scEPS package>/sceps.py \
--adata <path to single-cell data> \
--cell-id-col <column in adata.obs representing cell ID> \
--gene-id-col <column in adata.var representing gene ID/symbol> \
--donor-id-col <column in adata.obs representing donor ID> \
--gene-list <path to the file containing MAGMA gene-level association statistics> \
--auto-gene-selection \
--pheno-file <an external text file containing the phenotype> \
--pheno <column in the phenotype file above representing the phenotype> \
--scale-pheno \
--scale-pheno-neighborhood \
--covar <a list of covariates to adjust, separated by commas> \
--total-num-job <total number of parallel jobs used to analyze the single-cell data> \
--job-idx <index for the current parallel job> \
--out <prefix for the output file name>Here, the --pheno-file flag is used to specify the external phenotype file, which should have a column representing the donor IDs and one or multiple columns representing the phenotypes. The name of the column for donor IDs should be the same as the one used in the single-cell data (i.e., specified by the --donor-id-col flag). And at least one of the names of the columns for phenotypes should be the one specified by the --pheno flag.
We also allow the user to call scEPS functions from within Python, instead of from command line. The following code snippet provides an example of how to do this. Briefly, we provide a function, create_sceps_default_args, to specify the default/recommended settings to run scEPS. The users can provide additional settings, including input files, cell IDs, and gene symbols, etc., by modifying the args object.
import sys
sys.path.append('/path/to/sceps_tool')
from src.sceps_core import *
# Create args with required arguments and default settings to run scEPS
args = create_sceps_default_args(adata='/path/to/the/single/cell/data',
gene_list='/path/to/the/gene/level/MAGMA/association/statistics')
# Specify additional settings as needed
args.cell_id_col = '<cell ID column in adata.obs>'
args.gene_id_col = '<gene ID column in adata.var>'
args.donor_id_col = '<donor ID column in adata.obs>'
args.pheno_file = '/path/to/an/external/phenotype/file'
args.pheno = '<phenotype column in adata.obs>'
args.covar = '<a list of covariates to adjust, separated by commas>'
args.out = '<output file name>'
# Set seed for random number generator
np.random.seed(args.seed)
random.seed(args.seed)
# Load the single-cell data
adata = load_single_cell_data(args)
# Load the list of genes with MAGMA Z-stat
gene_list = load_gene_stats(args)
# Intersect the genes in the adata with genes with MAGMA Z-stat
adata = subset_adata_genes(args, adata, gene_list)
# Determine the list of GWAS genes
gwas_genes = select_disease_gwas_genes(args, gene_list)
# Run scEPS for the first 10 cells in the single-cell data
start_idx, stop_idx = 0, 10
sceps_out = get_sceps_stats(args, adata, gwas_genes, start_idx=start_idx, stop_idx=stop_idx)
# Save results
out_suffix = '{}-{}'.format(start_idx, stop_idx)
save_sceps_stats(args, sceps_out, out_suffix)The output of scEPS is saved in a compressed text file with suffix .sceps.omega.txt.gz. Each row of the output of scEPS represents analysis results for a cell neighborhood, anchored at individual cells. Below, we provide interpretations for each column of the output of scEPS. Although there are many columns in the scEPS output, the most important ones include CELL, which indexes the cell neighborhood, and OMEGA_DIFF, SE_OMEGA_DIFF, Z_OMEGA_DIFF, and P_Z_OMEGA_DIFF, which represent the estimated scEPS
| Column name | Description |
|---|---|
CELL |
Cell ID that indexes the cell neighborhood. |
DONOR_ID |
Donor ID for the cell that indexes the neighborhood. |
STEP_SIZE |
Size of the random walk on the k-NN graph used to define the cell neighborhood. |
NEIGHBORHOOD_SIZE |
Number of cells in the neighborhood. |
NUM_DONOR |
Number of donors represented in the neighborhood. |
ELAPSED_TIME |
Time (in seconds) used to analyze the cell neighborhood. |
MEAN_MEAN_EXPR_GWAS |
Mean expression of each GWAS gene across donors (prior to mean centering), averaged across all GWAS genes. |
MEAN_MEAN_EXPR_CONTROL |
Mean expression of each control gene across donors (prior to mean centering), averaged across all control genes. |
MEAN_MEAN_EXPR_REST |
Mean expression of each remaining (non-GWAS/control) gene across donors (prior to mean centering), averaged across all remaining genes. |
MEAN_MEAN_EXPR_ALL |
Mean expression of each gene (GWAS, control, and rest) across donors (prior to mean centering), averaged across all genes. |
MEAN_VAR_EXPR_GWAS |
Variance of the expression of each GWAS gene across donors, averaged across all GWAS genes. |
MEAN_VAR_EXPR_CONTROL |
Variance of the expression of each control gene across donors, averaged across all control genes. |
MEAN_VAR_EXPR_REST |
Variance of the expression of each remaining (non-GWAS/control) gene across donors, averaged across all remaining genes. |
MEAN_VAR_EXPR_ALL |
Variance of the expression of each gene (GWAS, control, and rest) across donors, averaged across all genes. |
NUM_GENE_GWAS |
Number of GWAS genes. |
NUM_GENE_CONTROL |
Number of control genes. |
NUM_GENE_REST |
Number of remaining (non-GWAS/control) genes. |
OMEGA_GWAS |
Estimated |
SE_OMEGA_GWAS |
Standard error for the estimated |
Z_OMEGA_GWAS |
Z-score for the estimated |
P_Z_OMEGA_GWAS |
p-value testing whether the estimated |
OMEGA_CONTROL |
Estimated |
SE_OMEGA_CONTROL |
Standard error for the estimated |
Z_OMEGA_CONTROL |
Z-score for the estimated |
P_Z_OMEGA_CONTROL |
p-value testing whether the estimated |
OMEGA_REST |
Estimated |
SE_OMEGA_REST |
Standard error for the estimated |
Z_OMEGA_REST |
Z-score for the estimated |
P_Z_OMEGA_REST |
p-value testing whether the estimated |
OMEGA_OVERALL |
Estimated |
SE_OMEGA_OVERALL |
Standard error for the estimated |
Z_OMEGA_OVERALL |
Z-score for the estimated |
P_Z_OMEGA_OVERALL |
p-value testing whether the estimated |
OMEGA_DIFF |
Estimated |
SE_OMEGA_DIFF |
Standard error for the estimated |
Z_OMEGA_DIFF |
Z-score for the estimated |
P_Z_OMEGA_DIFF |
p-value testing whether the estimated |
Below, we provide detailed explanations for all the available options implemented in the scEPS analysis.
The following options are used to specify the inputs for the scEPS analysis.
| Option | Usage | Description |
|---|---|---|
--adata |
string, required | This is used to specify the input single-cell RNA-seq data in h5ad format. The single-cell data should consist of multiple donors, showing variations in their disease phenotypes. |
--cell-type-col |
string, optional, empty string by default | This is used to specify the name of the column that represents cell types in the adata.obs data frame of the single-cell data. This is an emtpy string by default. |
--focal-cell-type |
string, optional, empty string by default | This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. This is an emtpy string by default. |
--gene-id-col |
string, optional, empty string by default | This is used to specify the name of the column that represents gene symbols/IDs in the adata.var data frame of the single-cell data. If left empty, scEPS will use what's in adata.var.index as gene symbols/IDs. |
--cell-id-col |
string, optional, empty string by default | This is used to specify the name of the column that represents cell IDs in the adata.obs data frame of the single-cell data. If left empty, scEPS will use what's in adata.obs.index as cell IDs. |
--donor-id-col |
string, required | This is used to specify the name of the column that represents donor IDs in the adata.obs data frame of the single-cell data. |
--pheno-file |
ostring, ptional, empty string by default | This is used to specify an external text file containing donor phenotypes. The text file should contain a donor ID column, same as the one used in the single-cell data, and one or more columns representing donor phenotypes. |
- --batch-key
|
string, optional, empty string by default | This is used to specify batch information for the donors. If specified, scEPS will permute the donors within each batch, if using permutation-based testing procedure. |
--gene-list |
string, required | This is used to specify a text file containing the gene-level MAGMA associaiton statistics. The text file should contain a GENE column and a ZSTAT column. |
--control-gene-list |
string, optional, empty string by default | This is used to specify a text file containing a list of pre-determined control gene list. By default, scEPS will automatically randomly select a list of control genes. |
--focal-cells |
string, optional, empty string by default | This is used to specify a text file containing a list of focal cell neighborhoods to analyze. If specified, scEPS will only analyze cells in this file. |
scEPS offers 2 different ways to parallelize the analysis.
Option 1. The user can specify the total number of parallel jobs spawned for the analysis. scEPS will automatically evenly split the work load across these parallel jobs.
Option 2. The user can provde a starting and stopping index of the cell neighborhooods to analyze.
| Option | Usage | Description |
|---|---|---|
--total-num-job |
integer, optional, None by default |
This is used to specify the total number of parallel jobs used for the scEPS analysis. |
--job-idx |
integer, optional, None by default |
This is used to specify the index of the parallel job for analyzing a subset of the single-cell data. |
--start-idx |
integer, optional, None by default |
This is used to specify the starting index (inclusive) of the cell neighborhood to analyze. |
--stop-idx |
integer, optional, None by default |
This is used to specify the stopping index (exclusive) of the cell neighborhood to analyze. |
The following options is used to specify QC parameters used in the scEPS analysis.
| Option | Usage | Description |
|---|---|---|
--min-expr-var-thres |
float, optional, 1e-4 by default | This is used to specify the minimum threshold on the variance of gene expression across donors. By default, genes with variance across donors less than 1e-4 are filtered. |
--max-expr-var-thres |
float, optional, 100.0 by default | This is used to specify the maximum threshold on the variance of gene expression across donors. By default, genes with variance across donors greater than 100.0 are filtered. |
The following options are used to specify how scEPS define cell neighborhoods based on the k-NN graph.
| Option | Usage | Description |
|---|---|---|
--min-num-donor |
integer, optional, 8 by default | This is used to specify the minimum number of donors required in each cell neighborhood. By default, we set this number to 8. |
--min-frac-donor |
float, optional, 0.333333 by default | This is used the specify the fraction of total number of donors in the single-cell data required to be represented in each cell neighborhood. By default, we set this to 0.333333 (i.e., approximately 1/3). |
--min-num-cell |
integer, optional, 5 by default | is used to specify the minimum number of cells required in each cell neighborhood. By default, we set this to 5. |
--neighborhood-definition-method |
string from {"nam", "soft threshold", "hard threshold"}, optional, "nam" by default | This is used to specify the neighborhood definition method. The user can choose from "nam", "soft threshold", and "hard threshold". By default scEPS uses the "nam" approach, defining cell neighborhoods based on the neighborhood abundance matrix. |
--check-pheno-in-neighborhood |
bool, optional, False by default |
If specified, scEPS will ensure that the variance of the phenotype across donors in the cell neighborhood is non-zero. |
--max-step-size |
integer, optional, 15 by default | This is used to specify the maximum number of random walk step size to test. By default, this is set to 15. |
--prob-thres |
float, optional, 0.0001 by default | This is used to specify the threshold for the transition probability, if --neighborhood-definition-method is set to "hard threshold". |
The following options are used to specify how scEPS determines the set of GWAS genes to analyze.
| Option | Usage | Description |
|---|---|---|
--auto-gene-selection |
bool, optional, , False by default |
If specified (recommended), scEPS will automatically select GWAS genes based on the gene-level MAGMA association statistics. The user can also bypass auto gene selection, using the --num-gwas-genes flag to specify the number of top GWAS genes. |
--num-gwas-genes |
integer, optional, 500 by default | This is used to specify the number of top GWAS genes, if the user chooses to manually select the top GWAS genes to analyze. |
--magma-fdr-thres |
float, optional, 0.05 by default | This is used to specify the FDR threshold to select GWAS genes, if the user chooses to use the automatically approach for selecting GWAS genes. |
--min-num-gwas-genes |
integer, optional, 500 by default | This is used to specify the minimum number of GWAS genes included in the scEPS analysis. By default, this is set to 500. |
--max-num-gwas-genes |
integer, optional, 2,000 by default | This is used to specify the maximum number of GWAS genes included in the scEPS analysis. By default, this is set to 2,000. |
The following options are used to specify how scEPS determines the list of control genes for the GWAS genes.
| Option | Usage | Description |
|---|---|---|
--enrich-low-score-control-genes |
bool, optional, False by default |
If specified, the control gene selection procedure will preferentially select control genes starting from the lowest gene-level MAGMA association statistics. |
--include-gwas-genes-in-control-genes |
bool, optional, False by default |
If specified, the set of GWAS genes will also be included in the sampling of control genes. |
--num-expr-bins |
integer, optional, 10 by default | This is used to specify the number of bins for binning all the genes based on their mean expression. The default is 10. |
--num-control-gene-set |
integer, optional, 1 by default | This is used to specify the number of control gene sets. This should be set to 1 in general. |
--num-control-gene-set-testing |
integer, optional, 0 by default | This is used to specify the number of control gene sets use for permutation-based testing. By default, this number is set to 0. |
If --match-expr-std
|
bool, optional, False by default |
is specified, scEPS will also match the standard deviation of the gene expression of the control genes with that of GWAS genes. |
The following options are used to specify how scEPS estimates the variance components of the scEPS model using methods of moment.
| Option | Usage | Description |
|---|---|---|
--pheno |
string, required | This is used to specify the column that represents phenotype in the adata.obs data frame of the single-cell data or in the text file external to the single-cell data. |
--covar |
string, optional, empty string by default | This is used to specify a list of columns in adata.obs that represent covariates to adjust. Multiple covariates names should be separated by commas. We recommend to include sceps.prop_cells_local in this flag, to adjust for the impact of neighborhood abundance on the phenotype. An example specification for --covar could be --covar "age,gender,sceps.prop_cells_local". |
--num-var-comp |
integer from {2, 3}, optional, 3 by default | This is used to specify the number of variance components. By default, this is set to 3 (recommended). And scEPS will decompose the phenotypic variance into GWAS genes, control genes, and remaining (non-GWAS/control) genes. If this is set to 2 (not recommended), scEPS will decompose the phenotypic variance into GWAS genes and control genes. |
--num-bootstrap-disattenuation |
integer, optional, 100 by default | This is used to specify the number of bootstap samples used to estimate the disattenuation factor. By default, this is set to 100. |
--num-bootstrap-regression |
integer, optional, 1,000 by default | This is used to specify the number of bootstrap samples used to estimate the standard errors for the scEPS statistics. By default, this is set to 1,000. |
--use-ols |
bool, optional, False by default |
If specified, scEPS will use ordinary least square regression to estimate the variance components. By default, scEPS uses weighted least square regression. |
--use-analytical-stderr |
bool, optional, False by default |
If specified, scEPS will use analytical approach to calculate standard errors. This is a lot faster than using the bootstrap approach, and can be used if computational resource is limited. |
--scale-pheno |
bool, optional, False by default |
If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at the global level (i.e., before variance decomposition at individual cell neighborhood level). Otherwise, scEPS will only mean-center the phenotype. |
--scale-pheno-neighborhood |
bool, optional, False by default |
If specified (recommended), scEPS will mean-center and scale the phenotype to have unit variance, at each cell neighborhood, on top of the global level standardization. Otherwise, scEPS will only mean-center the phenotype. |
--exclude-diag |
bool, optional, False by default |
If specified, scEPS will only use cross products of the phenotypes between different individuals to estimate the variance components. By default, scEPS includes squared phenotypes of the same individuals. |
--exclude-intercept |
bool, optional, False by default |
If specified, scEPS will not include an intercept term in the regression. By default, an intercept term is included in the regression. |
--no-disattenuation |
bool, optional, False by default |
If specified (not recommended), scEPS will not apply disattenuation factor on the estimates. |
--skip-regression |
bool, optional, False by default |
If specified, scEPS will skip the estimation procedure, but will record the summary statistics (e.g., number of cells/donors) for each neighborhood. This is typically used for diagnosis purposes only. |
--permute-within-batch |
bool, optional, False by default |
If specified, scEPS will run permutation within each batch of donors. This is only used if the user chooses to use permutation to obtain p-values. |
The following options specify what scEPS output to save.
| Option | Usage | Description |
|---|---|---|
--save-neighborhood-cells |
bool, optional, False by default |
If specified, scEPS will save the cells in each neighborhood as a text file. |
--save-neighborhood-donors |
bool, optional, False by default |
If specified, scEPS will save the donors in each neighborhood as a text file. |
--save-target-nam |
bool, optional, False by default |
If specififed, scEPS will save the target neighborhood abundance matrix as a text file. |
--save-realized-nam |
bool, optional, False by default |
If specified, scEPS will save the realized neighborhood abundance matrix as a text file. |
--save-sceps-raw |
bool, optional, False by default |
If specified, scEPS will save the intermediate raw results as a text file. |
--save-sceps-sigma |
bool, optional, False by default |
If specified, scEPS will save the |
--out |
string, required | This is used to specify the prefix for the output file. |
The following are some miscellaneous options.
| Option | Usage | Description |
|---|---|---|
--seed |
integer, optional, 0 by default | This is used to specify the seed for the random number generator. By default, this is set to 0. |
--sample-frac |
float, optional, 1.0 by default | This is used to down-sample the number of cell neighborhoods to analyze. By default, this is set to 1.0. |
--force-rerun |
bool, optional, False by default |
This is used to force rerun the analysis. By default, if scEPS detects the output exists, it will not perform the analysis. |
If --verbose
|
bool, optional, False by default |
This is specified, scEPS will print intermediate output to the screen. |
The following option is used to evaluate scEPS in simulations, and should not be used in analysis of real traits.
| Option | Usage | Description |
|---|---|---|
--sample-frac-cell-in-neighborhood |
bool, optional, False by default |
This is used to down-sample cells within a neighborhood. By default, this is set to 1.0. This flag is used to assess the robustness of scEPS in simulations, and should not be used in analysis of real traits. |