Step 4. Calculate the correlation between scEPS statistics and other attributes
We provide a tool, `sceps_corr.py`, for calculating the correlation between scEPS statistics and gene expression across cells, which index cell neighborhoods. Below, we go over the required input, example commands, the output format, and the full list of options.
This tool requires the following data as input:
- A single-cell dataset in h5ad format, containing a log-normalized gene expression matrix
- A text file containing a data frame of estimated scEPS statistics across all cell neighborhoods
- (optional) A text file (with header) containing a list of cell IDs across which the correlation is calculated
The following shell script calculates the correlation (Spearman's rank correlation by default) between gene expression and the scEPS statistics across all cell neighborhoods:

    python <path to the scEPS package>/sceps_corr.py \
        --adata <single-cell data containing log-normalized gene expression matrix> \
        --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
        --gene-id-col <column in adata.var representing gene ID/symbol> \
        --cell-id-col <column in adata.obs representing cell ID> \
        --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| `--adata` | string, required | Path to the single-cell data containing the log-normalized gene expression matrix. |
| `--sceps-result` | string, required | Text file containing the data frame of estimated scEPS statistics across all cell neighborhoods. |
| `--cell-id-col` | string, optional, empty string by default | Name of the column that represents cell IDs in the `adata.obs` data frame of the single-cell data. If left empty, scEPS uses `adata.obs.index` as cell IDs. |
| `--gene-id-col` | string, optional, empty string by default | Name of the column that represents gene symbols/IDs in the `adata.var` data frame of the single-cell data. If left empty, scEPS uses `adata.var.index` as gene symbols/IDs. |
| `--out` | string, required | Prefix of the output file name. |

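
To make the computed quantity concrete, here is a minimal sketch of a Spearman correlation between one gene's expression and one scEPS statistic across cell neighborhoods. It uses only the Python standard library and is an illustration, not the code in `sceps_corr.py`; the function names and toy values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        stop = start
        # Extend the run while the next sorted value ties with the current one.
        while stop + 1 < len(order) and values[order[stop + 1]] == values[order[start]]:
            stop += 1
        tied_rank = (start + stop) / 2 + 1
        for k in range(start, stop + 1):
            ranks[order[k]] = tied_rank
        start = stop + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: one value per cell neighborhood (hypothetical numbers).
expression = [0.0, 1.2, 0.4, 2.5, 0.0]
sceps_stat = [0.1, 0.6, 0.3, 0.9, 0.2]
rho = spearman(expression, sceps_stat)
```

Ranking before correlating makes the measure insensitive to monotone transformations of the expression values, which is one reason a rank correlation is a natural default for log-normalized counts with many zeros.
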
The following shell script calculates the correlation (Spearman's rank correlation by default) restricted to cell neighborhoods in a focal cell type:

    python <path to the scEPS package>/sceps_corr.py \
        --adata <single-cell data containing log-normalized gene expression matrix> \
        --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
        --gene-id-col <column in adata.var representing gene ID/symbol> \
        --cell-id-col <column in adata.obs representing cell ID> \
        --cell-type-col <column in adata.obs representing cell types> \
        --focal-cell-type <focal cell type to analyze> \
        --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| `--cell-type-col` | string, optional, empty string by default | Name of the column that represents cell types in the `adata.obs` data frame of the single-cell data. |
| `--focal-cell-type` | string, optional, empty string by default | Focal cell type to analyze. If specified, scEPS only analyzes cell neighborhoods in the specified cell type. |

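
The effect of `--cell-type-col` and `--focal-cell-type` can be pictured as a filter on cell IDs before any correlation is computed. The sketch below assumes a plain dictionary standing in for the cell-type column of `adata.obs`; the helper `select_focal_cells` and the toy IDs are hypothetical and not part of the scEPS package.

```python
def select_focal_cells(cell_types, focal_cell_type):
    """Return the cell IDs annotated with the focal cell type, in input order.

    cell_types maps cell ID -> cell type (a stand-in for one adata.obs column).
    """
    if not focal_cell_type:
        # An empty string mirrors the documented default: no restriction.
        return list(cell_types)
    return [cid for cid, ctype in cell_types.items() if ctype == focal_cell_type]


# Toy annotations and scEPS statistics keyed by cell ID (hypothetical values).
cell_types = {"cell_1": "T cell", "cell_2": "B cell", "cell_3": "T cell"}
sceps_stats = {"cell_1": 0.8, "cell_2": 0.1, "cell_3": 0.5}

focal_ids = select_focal_cells(cell_types, "T cell")
# Keep only the statistics of neighborhoods indexed by focal cells.
focal_stats = {cid: sceps_stats[cid] for cid in focal_ids if cid in sceps_stats}
```
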
By default, we do not provide test statistics for the significance of the correlations, because it is difficult to derive a test that accounts for the correlations in both the scEPS statistics and the gene expression across cells. Nevertheless, the user can obtain test statistics by bootstrapping across approximately independent neighborhood blocks, as in the example script below. These test statistics should be interpreted with caution.

    python <path to the scEPS package>/sceps_corr.py \
        --adata <single-cell data containing log-normalized gene expression matrix> \
        --sceps-result <estimated scEPS statistics across all the cell neighborhoods> \
        --gene-id-col <column in adata.var representing gene ID/symbol> \
        --cell-id-col <column in adata.obs representing cell ID> \
        --neighborhood-clusters <a text file mapping cell neighborhoods to blocks> \
        --num-bootstrap 1000 \
        --out <prefix for the output file name>

| Option | Usage | Description |
|---|---|---|
| `--neighborhood-clusters` | string, optional, empty string by default | Text file from step 2, containing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool uses a regular bootstrap instead of a block bootstrap. |
| `--num-bootstrap` | integer, optional, 0 by default | Number of bootstrap samples. With the default of 0, no test statistics are calculated. |

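
The idea of bootstrapping across blocks can be sketched as follows: whole blocks of neighborhoods are resampled with replacement, the correlation is recomputed on each resample, and the spread of those estimates serves as a standard error. This is a generic block bootstrap under stated assumptions, not the implementation in `sceps_corr.py`; it uses a Pearson correlation for brevity, and all names and toy values are hypothetical.

```python
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation, or None when either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def block_bootstrap_se(x, y, blocks, num_bootstrap=1000, seed=0):
    """Bootstrap standard error of corr(x, y), resampling whole blocks."""
    members = {}
    for idx, block in enumerate(blocks):
        members.setdefault(block, []).append(idx)
    block_ids = sorted(members)
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_bootstrap):
        # Draw as many blocks as exist, with replacement; a drawn block
        # contributes all of its neighborhoods to the resample.
        drawn = [rng.choice(block_ids) for _ in block_ids]
        idx = [i for b in drawn for i in members[b]]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:
            estimates.append(r)
    return stdev(estimates)


# Toy data: 12 neighborhoods grouped into 4 blocks (hypothetical values).
x = [0.1, 0.4, 0.3, 1.0, 1.2, 0.9, 2.1, 1.8, 2.4, 3.0, 3.3, 2.9]
y = [0.2, 0.1, 0.5, 0.9, 1.1, 1.4, 1.9, 2.2, 2.0, 3.1, 2.7, 3.2]
blocks = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

r_hat = pearson(x, y)
se = block_bootstrap_se(x, y, blocks, num_bootstrap=200)
```

Passing a distinct block label for every neighborhood, for example `blocks = list(range(len(x)))`, reduces this sketch to a regular bootstrap across individual neighborhoods.
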
A typical run of `sceps_corr.py` creates a text file with the following columns:

| Column name | Description |
|---|---|
| `GENE` | Gene name. |
| `R_OMEGA_GWAS` | Correlation between gene expression and scEPS |
| `R_OMEGA_CONTROL` | Correlation between gene expression and scEPS |
| `R_OMEGA_REST` | Correlation between gene expression and scEPS |
| `R_OMEGA_OVERALL` | Correlation between gene expression and scEPS |
| `R_OMEGA_DIFF` | Correlation between gene expression and scEPS |

If the user chooses to obtain test statistics for each correlation, the output text file includes the following additional columns:

| Column name | Description |
|---|---|
| `SE_R_OMEGA_GWAS` | Standard error of the correlation between gene expression and scEPS |
| `Z_R_OMEGA_GWAS` | Z-score of the correlation between gene expression and scEPS |
| `P_R_OMEGA_GWAS` | p-value of the correlation between gene expression and scEPS |
| `SE_R_OMEGA_CONTROL` | Standard error of the correlation between gene expression and scEPS |
| `Z_R_OMEGA_CONTROL` | Z-score of the correlation between gene expression and scEPS |
| `P_R_OMEGA_CONTROL` | p-value of the correlation between gene expression and scEPS |
| `SE_R_OMEGA_REST` | Standard error of the correlation between gene expression and scEPS |
| `Z_R_OMEGA_REST` | Z-score of the correlation between gene expression and scEPS |
| `P_R_OMEGA_REST` | p-value of the correlation between gene expression and scEPS |
| `SE_R_OMEGA_OVERALL` | Standard error of the correlation between gene expression and scEPS |
| `Z_R_OMEGA_OVERALL` | Z-score of the correlation between gene expression and scEPS |
| `P_R_OMEGA_OVERALL` | p-value of the correlation between gene expression and scEPS |
| `SE_R_OMEGA_DIFF` | Standard error of the correlation between gene expression and scEPS |
| `Z_R_OMEGA_DIFF` | Z-score of the correlation between gene expression and scEPS |
| `P_R_OMEGA_DIFF` | p-value of the correlation between gene expression and scEPS |

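
The text above does not state how the `Z_*` and `P_*` columns are derived from the `R_*` and `SE_*` columns. A common convention, assumed here for illustration only, is a Z-score equal to the estimate divided by its standard error, with a two-sided p-value from the standard normal distribution. The helper name `z_and_p` is hypothetical.

```python
import math


def z_and_p(r, se):
    """Z-score and two-sided normal p-value (assumed convention: z = r / se)."""
    z = r / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of N(0, 1).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Toy values: a correlation of 0.30 with a bootstrap standard error of 0.10.
z, p = z_and_p(0.30, 0.10)
```

If the tool uses a different rule, for example an empirical bootstrap p-value, the columns in the output file take precedence over this sketch.
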
The following options are used to specify the input for calculating the correlations.

| Option | Usage | Description |
|---|---|---|
| `--sceps-result` | string, required | Text file containing the data frame of estimated scEPS statistics across all cell neighborhoods. |
| `--adata` | string, required | Path to the single-cell data containing the log-normalized gene expression matrix. |
| `--cell-id-col` | string, optional, empty string by default | Name of the column that represents cell IDs in the `adata.obs` data frame of the single-cell data. If left empty, scEPS uses `adata.obs.index` as cell IDs. |
| `--gene-id-col` | string, optional, empty string by default | Name of the column that represents gene symbols/IDs in the `adata.var` data frame of the single-cell data. If left empty, scEPS uses `adata.var.index` as gene symbols/IDs. |
| `--lognorm-count` | bool, optional, False by default | If specified, the tool applies log-normalization to the single-cell data. This flag should only be specified if the data contains raw read counts. |
| `--cell-type-col` | string, optional, empty string by default | Name of the column that represents cell types in the `adata.obs` data frame of the single-cell data. |
| `--focal-cell-type` | string, optional, empty string by default | Focal cell type to analyze. If specified, scEPS only analyzes cell neighborhoods in the specified cell type. |
| `--focal-cells` | string, optional, empty string by default | Text file containing a list of focal cell neighborhoods to analyze. If specified, scEPS only analyzes cells in this file. The header of this file should be the same as specified by `--cell-id-col`. |

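
For readers unsure whether their data needs `--lognorm-count`, the sketch below shows one widely used form of log-normalization: scale each cell to a fixed total count, then apply `log(1 + x)`. The target of 10,000 counts per cell is an assumption borrowed from common single-cell practice; the exact procedure used by `sceps_corr.py` is not specified above, and the function name is hypothetical.

```python
import math


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts stay all zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


# Toy raw count matrix: rows are cells, columns are genes (hypothetical).
raw = [[1, 0, 3], [0, 0, 0], [10, 10, 20]]
lognorm = log_normalize(raw)
```

Applying this transformation to data that is already log-normalized would distort the expression values, which is why the flag should be reserved for raw counts.
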
The following options are used for job parallelization.

| Option | Usage | Description |
|---|---|---|
| `--total-num-job` | integer, optional, None by default | Total number of parallel jobs used for the scEPS analysis. |
| `--job-idx` | integer, optional, None by default | Index of the parallel job for analyzing a subset of the single-cell data. |
| `--start-idx` | integer, optional, None by default | Starting index (inclusive) of the cell neighborhoods to analyze. |
| `--stop-idx` | integer, optional, None by default | Stopping index (exclusive) of the cell neighborhoods to analyze. |

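
One way to think about these options is that a total job count and a job index determine a half-open range of neighborhood indices, which could equally be passed by hand through `--start-idx` and `--stop-idx`. The sketch below shows an even split under two assumptions that the text above does not confirm: job indices start at 0, and neighborhoods are divided into contiguous, near-equal chunks. The function `job_range` is hypothetical.

```python
def job_range(num_neighborhoods, total_num_job, job_idx):
    """Return (start, stop) for one job; start is inclusive, stop is exclusive."""
    if not 0 <= job_idx < total_num_job:
        raise ValueError("job_idx must satisfy 0 <= job_idx < total_num_job")
    base, extra = divmod(num_neighborhoods, total_num_job)
    # The first `extra` jobs each take one additional neighborhood.
    start = job_idx * base + min(job_idx, extra)
    stop = start + base + (1 if job_idx < extra else 0)
    return start, stop


# Split 10 neighborhoods across 3 jobs.
ranges = [job_range(10, 3, j) for j in range(3)]
```

Because each stop index equals the next start index, the ranges cover every neighborhood exactly once.
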
The following options are used to specify how to calculate the correlation.

| Option | Usage | Description |
|---|---|---|
| `--corr-type` | string from {"spearman", "pearson"}, optional, "spearman" by default | Correlation type, either "spearman" or "pearson". |
| `--min-num-nonzero` | integer, optional, 100 by default | Minimum number of cells in which a gene needs to be expressed. |
| `--neighborhood-clusters` | string, optional, empty string by default | Text file from step 2, containing a pre-computed mapping of cell neighborhoods to approximately independent blocks of cell neighborhoods. If left empty, the tool uses a regular bootstrap instead of a block bootstrap. |
| `--num-bootstrap` | integer, optional, 0 by default | Number of bootstrap samples. With the default of 0, no test statistics are calculated. |
| `--block-bootstrap` | string, optional, `sceps.neighborhood_cluster` by default | Column that represents approximately independent cell neighborhood blocks. If left empty, the tool falls back to a regular bootstrap across individual (instead of blocks of) cell neighborhoods. |
| `--seed` | integer, optional, 0 by default | Seed for the random number generator. |
| `--out` | string, required | Prefix of the output file name. |

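
As an illustration of what `--min-num-nonzero` controls, the sketch below keeps only genes with a nonzero expression value in at least a given number of cells. It assumes that "expressed" means a nonzero value and uses a plain dictionary in place of the expression matrix; the function name and toy data are hypothetical.

```python
def genes_passing_filter(expression, min_num_nonzero=100):
    """Return genes with nonzero expression in at least min_num_nonzero cells.

    expression maps gene name -> list of per-cell expression values.
    """
    kept = []
    for gene, values in expression.items():
        num_nonzero = sum(1 for v in values if v != 0)
        if num_nonzero >= min_num_nonzero:
            kept.append(gene)
    return kept


# Toy data with 4 cells; a threshold of 2 is used instead of the default 100.
expression = {
    "GENE_A": [0.0, 1.2, 0.5, 0.0],
    "GENE_B": [0.0, 0.0, 0.0, 0.3],
    "GENE_C": [1.0, 1.0, 1.0, 1.0],
}
kept_genes = genes_passing_filter(expression, min_num_nonzero=2)
```

Genes expressed in very few cells yield correlations dominated by ties at zero, so a threshold of this kind trades coverage for more stable estimates.
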