Step 4. Calculate the correlation between scEPS statistics and other attributes

Overview

We provide a tool, sceps_corr.py, for calculating the correlation between scEPS statistics and gene expression across cells (each cell indexes a cell neighborhood). Below, we will go over:

Input required for running the tool
Example script to run the tool
Explanation of the output
Explanation of all the options in the tool

Input required for running the tool

This tool requires the following data as input:

A single-cell dataset in h5ad format, containing a log-normalized gene expression matrix
A text file containing a data frame of estimated scEPS statistics across all cell neighborhoods
(optional) A text file (with header) containing a list of cell IDs, across which the correlation is calculated
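
As a minimal sketch of preparing the optional cell ID file (passed through `--focal-cells`, described in the options section), the snippet below writes a one-column text file with a header line. The helper `write_focal_cells`, the file name `focal_cells.txt`, the header `cell_id`, and the cell IDs are all hypothetical; the header should match whatever is passed to `--cell-id-col`.

```python
import tempfile
from pathlib import Path


def write_focal_cells(path, cell_ids, header):
    """Write a header line followed by one cell ID per line."""
    lines = [header, *cell_ids]
    Path(path).write_text("\n".join(lines) + "\n")


# Hypothetical output location and cell IDs, for illustration only.
out_dir = Path(tempfile.mkdtemp())
focal_path = out_dir / "focal_cells.txt"
write_focal_cells(
    focal_path,
    ["cell_001", "cell_002", "cell_003"],
    header="cell_id",  # assumed to equal the value given to --cell-id-col
)
```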

Example script to run the tool

The following shell script calculates the correlation (Spearman's $\rho$) between the expression of each gene and the scEPS statistics across all cells.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --out <prefix for the output file name>

Option	Usage	Description
`--adata`	string, required	This is used to specify the path to the single-cell data containing the log-normalized gene expression matrix.
`--sceps-result`	string, required	This is used to specify a text file containing the data frame for estimates of scEPS statistics across all cell neighborhoods.
`--cell-id-col`	string, optional, empty string by default	This is used to specify the name of the column that represents cell IDs in the `adata.obs` data frame of the single-cell data. If left empty, scEPS will use what's in `adata.obs.index` as cell IDs.
`--gene-id-col`	string, optional, empty string by default	This is used to specify the name of the column that represents gene symbols/IDs in the `adata.var` data frame of the single-cell data. If left empty, scEPS will use what's in `adata.var.index` as gene symbols/IDs.
`--out`	string, required	This is used to specify the prefix of the output file name.
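
To make the quantity being computed concrete, here is a small, self-contained sketch of Spearman's $\rho$ between each gene's expression and one scEPS statistic across cells. It is an illustration of the statistic only, not the code inside sceps_corr.py; the gene names, expression values, and scEPS values are toy numbers, and the helper functions are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: log-normalized expression per gene, one value per cell.
expression = {
    "GENE_A": [0.0, 0.5, 1.2, 2.0, 3.1],
    "GENE_B": [2.2, 1.9, 1.0, 0.4, 0.0],
}
# Toy scEPS statistic for the neighborhoods indexed by the same cells.
sceps_stat = [0.01, 0.02, 0.05, 0.04, 0.09]

corr = {gene: spearman(values, sceps_stat) for gene, values in expression.items()}
```

In this toy example `GENE_A` tracks the statistic and `GENE_B` runs against it, so the two correlations have opposite signs.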

The following shell script calculates the correlation (Spearman's $\rho$) between the expression of each gene and the scEPS statistics across cells of a particular cell type.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --cell-type-col <column in adata.obs representing cell types> \
    --focal-cell-type <focal cell type to analyze> \
    --out <prefix for the output file name>

Option	Usage	Description
`--cell-type-col`	string, optional, empty string by default	This is used to specify the name of the column that represents cell types in the `adata.obs` data frame of the single-cell data. This is an empty string by default.
`--focal-cell-type`	string, optional, empty string by default	This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. This is an empty string by default.

By default, we do not provide test statistics for the significance of the correlations, because it is difficult to derive a test that accounts for the correlations in both the scEPS statistics and the gene expression across cells. Nevertheless, the user can obtain test statistics by bootstrapping across approximately independent neighborhood blocks (see the example script below). The user should be cautious when interpreting these test statistics.

python <path to the scEPS package>/sceps_corr.py \
    --adata <single-cell data containing log-normalized gene expression matrix> \
    --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
    --gene-id-col <column in adata.var representing gene ID/symbol> \
    --cell-id-col <column in adata.obs representing cell ID> \
    --neighborhood-clusters <a text file mapping cell neighborhoods to blocks> \
    --num-bootstrap 1000 \
    --out <prefix for the output file name>

Option	Usage	Description
`--neighborhood-clusters`	string, optional, empty string by default	This is used to specify the text file from step 2, representing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool will use regular bootstrap instead of block bootstrap.
`--num-bootstrap`	integer, optional, 0 by default	This is used to specify the number of bootstrap samples (0 by default, not calculating test statistics).
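
The idea of bootstrapping across blocks can be sketched as follows: whole blocks of cell neighborhoods are resampled with replacement, the correlation is recomputed on each resample, and the spread of the replicates gives a standard error. This is a generic block bootstrap under stated assumptions, not the tool's actual implementation. In particular, using Pearson correlation, defining the Z-score as the estimate divided by the bootstrap standard error, and taking a two-sided normal p-value are assumptions made here for illustration; the function name and toy data are hypothetical.

```python
import random
from math import erfc, sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def block_bootstrap_test(x, y, blocks, num_bootstrap=1000, seed=0):
    """Return (r, se, z, p) using a bootstrap over blocks of observations."""
    rng = random.Random(seed)
    # Group observation indices by their block label.
    members = {}
    for i, b in enumerate(blocks):
        members.setdefault(b, []).append(i)
    block_ids = sorted(members)

    replicates = []
    for _ in range(num_bootstrap):
        # Draw as many blocks as there are, with replacement.
        drawn = [rng.choice(block_ids) for _ in block_ids]
        idx = [i for b in drawn for i in members[b]]
        try:
            replicates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero variance

    r = pearson(x, y)
    se = stdev(replicates)
    z = r / se                   # assumed definition of the Z-score
    p = erfc(abs(z) / sqrt(2))   # two-sided p-value under a normal approximation
    return r, se, z, p


# Toy data: 12 cell neighborhoods grouped into 4 blocks of 3.
expr = [0.1, 0.4, 0.3, 0.9, 1.2, 1.0, 1.8, 2.1, 1.7, 2.5, 3.0, 2.8]
stat = [0.02, 0.03, 0.05, 0.04, 0.08, 0.06, 0.10, 0.09, 0.12, 0.15, 0.14, 0.18]
blocks = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

r, se, z, p = block_bootstrap_test(expr, stat, blocks, num_bootstrap=200, seed=0)
```

Resampling whole blocks, rather than individual neighborhoods, keeps correlated neighborhoods together, which is why a block bootstrap is the more conservative choice when neighborhoods overlap.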

Explanation of the output

The typical shell script to run sceps_corr.py would create a text file with the following columns:

Column name	Description
`GENE`	Gene name.
`R_OMEGA_GWAS`	Correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells.
`R_OMEGA_CONTROL`	Correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells.
`R_OMEGA_REST`	Correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells.
`R_OMEGA_OVERALL`	Correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells.
`R_OMEGA_DIFF`	Correlation between gene expression and scEPS $d$ statistics across cells.

If the user chooses to obtain test statistics for each of the correlations, the following additional columns will be included in the output text file:

Column name	Description
`SE_R_OMEGA_GWAS`	Standard error of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells.
`Z_R_OMEGA_GWAS`	Z-score of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells.
`P_R_OMEGA_GWAS`	p-value of the correlation between gene expression and scEPS $\omega^2_{gwas}$ statistics across cells.
`SE_R_OMEGA_CONTROL`	Standard error of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells.
`Z_R_OMEGA_CONTROL`	Z-score of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells.
`P_R_OMEGA_CONTROL`	p-value of the correlation between gene expression and scEPS $\omega^2_{ctrl}$ statistics across cells.
`SE_R_OMEGA_REST`	Standard error of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells.
`Z_R_OMEGA_REST`	Z-score of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells.
`P_R_OMEGA_REST`	p-value of the correlation between gene expression and scEPS $\omega^2_{rest}$ statistics across cells.
`SE_R_OMEGA_OVERALL`	Standard error of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells.
`Z_R_OMEGA_OVERALL`	Z-score of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells.
`P_R_OMEGA_OVERALL`	p-value of the correlation between gene expression and scEPS $\omega^2_{overall}$ statistics across cells.
`SE_R_OMEGA_DIFF`	Standard error of the correlation between gene expression and scEPS $d$ statistics across cells.
`Z_R_OMEGA_DIFF`	Z-score of the correlation between gene expression and scEPS $d$ statistics across cells.
`P_R_OMEGA_DIFF`	p-value of the correlation between gene expression and scEPS $d$ statistics across cells.
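
As a sketch of working with the output, the snippet below parses a table with the columns listed above and ranks genes by `R_OMEGA_DIFF`. It assumes the output text file is tab-delimited, which is not stated above; adjust the delimiter if the file differs. The rows are toy values for illustration, not real results, and in practice the text would be read from the file written under the `--out` prefix.

```python
import csv
import io

# Toy stand-in for the contents of the output file (assumed tab-delimited).
toy_output = (
    "GENE\tR_OMEGA_GWAS\tR_OMEGA_CONTROL\tR_OMEGA_REST\tR_OMEGA_OVERALL\tR_OMEGA_DIFF\n"
    "GENE_A\t0.31\t0.05\t0.02\t0.12\t0.27\n"
    "GENE_B\t-0.10\t0.01\t0.03\t-0.02\t-0.12\n"
    "GENE_C\t0.22\t0.20\t0.01\t0.15\t0.04\n"
)


def load_correlations(handle):
    """Parse the table into a list of dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {"GENE": row["GENE"]}
        for key, value in row.items():
            if key != "GENE":
                parsed[key] = float(value)
        rows.append(parsed)
    return rows


rows = load_correlations(io.StringIO(toy_output))

# Rank genes from the most positive to the most negative correlation with d.
ranked = sorted(rows, key=lambda row: row["R_OMEGA_DIFF"], reverse=True)
top_genes = [row["GENE"] for row in ranked]
```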

Explanation of all the options in the tool

The following options are used to specify the input for calculating the correlations.

Option	Usage	Description
`--sceps-result`	string, required	This is used to specify a text file containing the data frame for estimates of scEPS statistics across all cell neighborhoods.
`--adata`	string, required	This is used to specify the path to the single-cell data containing the log-normalized gene expression matrix.
`--cell-id-col`	string, optional, empty string by default	This is used to specify the name of the column that represents cell IDs in the `adata.obs` data frame of the single-cell data. If left empty, scEPS will use what's in `adata.obs.index` as cell IDs.
`--gene-id-col`	string, optional, empty string by default	This is used to specify the name of the column that represents gene symbols/IDs in the `adata.var` data frame of the single-cell data. If left empty, scEPS will use what's in `adata.var.index` as gene symbols/IDs.
`--lognorm-count`	bool, optional, `False` by default	If specified, the tool will apply log-normalization on the single-cell data. This flag should only be specified if the data contains raw read count.
`--cell-type-col`	string, optional, empty string by default	This is used to specify the name of the column that represents cell types in the `adata.obs` data frame of the single-cell data. This is an empty string by default.
`--focal-cell-type`	string, optional, empty string by default	This is used to specify the focal cell type to analyze. If specified, scEPS will only analyze cell neighborhoods in the specified cell type. This is an empty string by default.
`--focal-cells`	string, optional, empty string by default	This is used to specify a text file containing a list of focal cell neighborhoods to analyze. If specified, scEPS will only analyze cells in this file. The header of this file should be the same as specified by `--cell-id-col`.

The following options are used for job parallelization purposes.

Option	Usage	Description
`--total-num-job`	integer, optional, `None` by default	This is used to specify the total number of parallel jobs used for the scEPS analysis.
`--job-idx`	integer, optional, `None` by default	This is used to specify the index of the parallel job for analyzing a subset of the single-cell data.
`--start-idx`	integer, optional, `None` by default	This is used to specify the starting index (inclusive) of the cell neighborhood to analyze.
`--stop-idx`	integer, optional, `None` by default	This is used to specify the stopping index (exclusive) of the cell neighborhood to analyze.
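
One way to prepare values for `--start-idx` and `--stop-idx` when splitting work by hand is to cut the range of cell neighborhoods into contiguous chunks with an inclusive start and an exclusive stop, matching the convention in the table above. The sketch below is a hypothetical helper, not part of scEPS; it assumes indices are 0-based, which is not stated above, and it does not describe how `--total-num-job` and `--job-idx` are handled internally.

```python
def chunk_ranges(num_items, num_chunks):
    """Split range(num_items) into contiguous (start, stop) pairs.

    start is inclusive and stop is exclusive; chunk sizes differ by at most one.
    """
    base, extra = divmod(num_items, num_chunks)
    ranges = []
    start = 0
    for k in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if k < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: 10 cell neighborhoods split across 3 jobs.
ranges = chunk_ranges(10, 3)
```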

The following options are used to specify how to calculate the correlation.

Option	Usage	Description
`--corr-type`	string from {"spearman", "pearson"}, optional, "spearman" by default	This is used to specify the correlation type. The user can choose from "spearman" and "pearson" with "spearman" as the default.
`--min-num-nonzero`	integer, optional, 100 by default	This is used to specify the minimum number of cells the gene needs to be expressed in (100 by default).
`--neighborhood-clusters`	string, optional, empty string by default	This is used to specify the text file from step 2, representing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool will use regular bootstrap instead of block bootstrap.
`--num-bootstrap`	integer, optional, 0 by default	This is used to specify the number of bootstrap samples (0 by default, not calculating test statistics).
`--block-bootstrap`	string, optional, `sceps.neighborhood_cluster` by default	This is used to specify the column that represents approximately independent cell neighborhood blocks (`sceps.neighborhood_cluster` by default). If left empty, the tool will fall back to regular bootstrap across individual (instead of blocks of) cell neighborhoods.
`--seed`	integer, optional, 0 by default	This is used to specify the seed for the random number generator (0 by default).
`--out`	string, required	This is used to specify the prefix of the output file name.
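
The effect of `--min-num-nonzero` can be pictured as a filter applied before any correlation is computed: genes expressed (nonzero) in too few cells are dropped. The sketch below illustrates that idea on toy data with a hypothetical helper; whether the tool counts cells in exactly this way is an assumption.

```python
def filter_genes(expression, min_num_nonzero=100):
    """Keep genes whose expression is nonzero in at least min_num_nonzero cells."""
    kept = {}
    for gene, values in expression.items():
        num_nonzero = sum(1 for value in values if value != 0)
        if num_nonzero >= min_num_nonzero:
            kept[gene] = values
    return kept


# Toy data: expression per gene across five cells.
expression = {
    "GENE_A": [0.0, 0.5, 1.2, 2.0, 3.1],  # nonzero in 4 cells
    "GENE_B": [0.0, 0.0, 0.0, 0.4, 0.0],  # nonzero in 1 cell
}

# A small threshold is used here only because the toy data has five cells.
kept = filter_genes(expression, min_num_nonzero=3)
```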
