QingrunZhangLab/SCOPE

SCOPE

Stabilized COre gene and Pathway Election

Analyzing gene expression data collected from paired disease-control tissues plays a critical role in characterizing disease mechanisms. Although differential expression analysis can prioritize individual genes as candidates for further investigation, joint analysis of multiple genes remains an open problem. Currently, state-of-the-art methods fall into two streams: (1) co-expression network analyses focusing on correlations between genes; (2) multiple linear regressions (usually regularized) to select genes jointly. Both suffer from instability: a slight change of parameterization or dataset can lead to a dramatic alteration of outcomes, leaving uncertainty in biological interpretation. In this work, we propose Stabilized COre gene and Pathway Election, or SCOPE, a stabilized model integrating bootstrapped LASSO and co-expression analysis, leading to robust outcomes insensitive to variations in data. As a proof of concept, we applied SCOPE to six cancer expression datasets in The Cancer Genome Atlas (TCGA). Our analysis identified core genes that are critical to cancers and revealed pathways shared across cancers (in contrast to the distinctive outcomes generated by standard analyses alone). As an example, we highlighted the pivotal role of CD63 as an oncogenic driver and a potential therapeutic target in kidney cancer.

The R code for the SCOPE pipeline and instructions on how to run it are included in this repository.


Example Script and Data

The SampleData directory in this repo contains two files:

  • SampleExpressionData.csv
    Gene 1   Gene 2   ...   Gene k
    ...      ...      ...   ...

  • SamplePhenotypes.csv
    phen
    0
    1
    ...

SampleScript.R uses this data to demonstrate running SCOPE, with some basic QC, on a gene expression dataset.
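To make the expected input layout and the idea of variance-based QC concrete, here is a minimal base-R sketch. It is not the contents of SampleScript.R: the data are simulated stand-ins for the two sample files, `filter_by_variance` is a hypothetical helper, and expressing the percentile on a 0-100 scale is an assumption.

```r
# Simulated stand-ins for the two sample files (all values are made up).
# In practice the inputs would be read from the CSV files in SampleData.
set.seed(1)
n_samples <- 20
exprData <- as.data.frame(matrix(rnorm(n_samples * 5), nrow = n_samples))
colnames(exprData) <- paste0("Gene", 1:5)       # genes as columns, samples as rows
exprData$Gene3 <- 7                             # a constant (zero-variance) gene
exprData$Gene5 <- exprData$Gene5 * 0.01         # a low-variance gene
phenotype <- rep(c(0, 1), each = n_samples / 2) # same sample order as exprData

# Hypothetical helper (not part of SCOPE): drop zero-variance columns, then
# optionally drop columns at or below a percentile of the remaining variances.
filter_by_variance <- function(expr, low_var_percentile = NULL) {
  v <- vapply(expr, var, numeric(1))
  expr <- expr[, v > 0, drop = FALSE]
  if (!is.null(low_var_percentile)) {
    v <- vapply(expr, var, numeric(1))
    cutoff <- quantile(v, probs = low_var_percentile / 100)  # assumed 0-100 scale
    expr <- expr[, v > cutoff, drop = FALSE]
  }
  expr
}

filtered <- filter_by_variance(exprData, low_var_percentile = 25)
colnames(filtered)  # Gene3 (constant) and Gene5 (low variance) are gone
```

Filtering only removes columns, so the rows of the filtered matrix still line up with the phenotype vector.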


SCOPE Functions

The SCOPE_Functions.R script contains all the functions needed to run SCOPE on an expression dataset. The primary function, scope, is a wrapper that runs all the components of SCOPE in a single command. However, should the user require customization at any step beyond the arguments provided by scope, the individual functions can be modified or chained together to produce the required results. Since the individual functions take the same arguments as scope, we detail all the arguments of scope below:

  • exprData
    Expression data for SCOPE to be run on. Must contain genes/probes as columns with rows indicating samples. Phenotypes and Sample IDs must not be included. (i.e. this must be a pure expression matrix)

  • phenotype
    Phenotypes provided as a vector in the same order of samples as in exprData. Phenotypes can be of any type accepted by glmnetUtils as a valid dependent variable.

  • removeZeroVar
    Boolean (TRUE/FALSE) value indicating whether to remove any genes/probes (columns) that have zero variance. Please note that including such data could result in errors, so setting this to TRUE is recommended.

  • removeLowVar
    Boolean (TRUE/FALSE) value indicating whether to remove any genes/probes (columns) that have low variance. The threshold can be specified in lowVarPercentile.

  • lowVarPercentile
    The nth percentile of variance used as the cutoff for removing genes/probes from the expression data. This can help speed up the process by removing genes/probes that are most likely background noise.

  • formula
    Formula to be specified to glmnetUtils in running the LASSO model. Must be of the form "phenotype ~ ." since the function internally renames the phenotype to "phenotype". Refer to glmnetUtils or glmnet package for further details in specifying the formula.

  • seed
    Random seed used to initialize stabilized LASSO. With a large number of iterations, the starting seed should not matter for the end result. However, setting it is encouraged for exact reproducibility.

  • iterations
    Number of iterations to run stabilized LASSO. Recommended number is >100. For particularly noisy/correlated data, higher numbers may be ideal at the cost of time.

  • splitRatio
    Proportion of data to be considered in each subsample for each LASSO run.

  • cvFolds
    Number of folds for cross-validation in determining the optimal $\lambda$ for each LASSO model. Refer to glmnetUtils or glmnet package for further details.

  • parallel
    Boolean (TRUE/FALSE) indicating whether to use a parallel back end for glmnetUtils. Refer to glmnetUtils or glmnet package for further details.

  • family
    Either a character string representing one of the built-in families, or else a glm() family object to be specified to the glmnetUtils function. Refer to glmnetUtils or glmnet package for further details.

  • propCutoff
    Proportion of models $\theta$ in which a gene must be selected to be considered a core gene. Refer to the manuscript for further details.

  • corrIterations
    Number of iterations to randomly select genes to build the null co-expression distribution.

  • corrProbesPerIter
    Number of genes to select in each iteration to build the null co-expression distribution.

  • corrQuantileCutoff
    Percentile threshold above which genes are considered co-expressed to be included in a Core Gene Network. Refer to the manuscript for further details.

  • diffCorrIterations
    Number of iterations to randomly select genes to build the null differential co-expression distribution.

  • diffCorrProbesPerIter
    Number of genes to select in each iteration to build the null differential co-expression distribution.

  • diffCorrQuantileCutoff
    Percentile threshold above which genes are considered differentially co-expressed to be included in a Core Gene Network. Refer to the manuscript for further details.

  • corrGenesCGN
    Boolean (TRUE/FALSE) indicating whether co-expressed genes are to be included in Core Gene Networks.

  • diffCorrGenesCGN
    Boolean (TRUE/FALSE) indicating whether differentially co-expressed genes are to be included in Core Gene Networks.

  • enrichmentOrganism
    Parameter for pathway enrichment in WebGestaltR package. Refer to "organism" argument in WebGestaltR.

  • enrichmentEnrichDatabase
    Parameter for pathway enrichment in WebGestaltR package. Refer to "enrichDatabase" argument in WebGestaltR.

  • enrichmentInterestGeneType
    Parameter for pathway enrichment in WebGestaltR package. Refer to "interestGeneType" argument in WebGestaltR.

  • enrichmentReferenceGeneType
    Parameter for pathway enrichment in WebGestaltR package. Refer to "referenceGeneType" argument in WebGestaltR.

  • enrichmentReferenceSet
    Parameter for pathway enrichment in WebGestaltR package. Refer to "referenceSet" argument in WebGestaltR.

  • enrichmentIsOutput
    Parameter for pathway enrichment in WebGestaltR package. Refer to "isOutput" argument in WebGestaltR.
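To illustrate how propCutoff and the co-expression sampling arguments (corrIterations, corrProbesPerIter, corrQuantileCutoff) could fit together, here is a minimal base-R sketch. It is one plausible reading of the descriptions above, not the code in SCOPE_Functions.R: `elect_core_genes` and `null_corr_cutoff` are hypothetical helpers, the use of absolute Pearson correlation is an assumption, and so is expressing the quantile cutoff on a 0-1 scale. The toy data are simulated.

```r
# Hypothetical helper: genes selected in at least `prop_cutoff` of the LASSO runs.
# `selected` is a logical matrix with one row per run and one column per gene.
elect_core_genes <- function(selected, prop_cutoff) {
  prop <- colMeans(selected)
  names(prop)[prop >= prop_cutoff]
}

# Hypothetical helper: pool |Pearson r| over random gene subsets and return
# the chosen quantile of that null distribution as a co-expression cutoff.
null_corr_cutoff <- function(expr, iterations, genes_per_iter, quantile_cutoff) {
  null_vals <- unlist(lapply(seq_len(iterations), function(i) {
    idx <- sample(ncol(expr), genes_per_iter)
    r <- cor(expr[, idx])
    abs(r[upper.tri(r)])          # each gene pair counted once
  }))
  unname(quantile(null_vals, probs = quantile_cutoff))
}

# Toy selection record: 4 runs over 3 genes (selection proportions 1, 0.5, 0.25).
selected <- rbind(c(TRUE, TRUE,  FALSE),
                  c(TRUE, FALSE, FALSE),
                  c(TRUE, TRUE,  FALSE),
                  c(TRUE, FALSE, TRUE))
colnames(selected) <- c("Gene1", "Gene2", "Gene3")
core <- elect_core_genes(selected, prop_cutoff = 0.75)   # only Gene1 qualifies

# Simulated expression matrix with one planted co-expressed pair.
set.seed(42)
expr <- matrix(rnorm(50 * 30), nrow = 50,
               dimnames = list(NULL, paste0("Gene", 1:30)))
expr[, "Gene2"] <- expr[, "Gene1"] + rnorm(50, sd = 0.1)

cutoff <- null_corr_cutoff(expr, iterations = 100, genes_per_iter = 10,
                           quantile_cutoff = 0.95)

# Genes whose |r| with the core gene exceeds the null cutoff.
r_core <- abs(cor(expr[, "Gene1"], expr))[1, ]
coexpressed <- setdiff(names(r_core)[r_core > cutoff], "Gene1")
```

Under the same reading, the differential co-expression arguments would repeat the second step on the difference in correlation between phenotype groups; the manuscript is the authority on the exact procedure.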


References

  1. P. Kossinna, W. Cai, X. Lu, C. S. Shemanko, Q. Zhang, Stabilized core gene and pathway election uncovers pan-cancer shared pathways and a cancer specific driver (2021), Under revision for Science Advances. doi:10.1101/2021.12.21.473727.

  2. H. Ooi, glmnetUtils: Utilities for "Glmnet" (2021), (available at https://CRAN.R-project.org/package=glmnetUtils).

  3. J. Wang, Y. Liao, WebGestaltR: Gene Set Analysis Toolkit WebGestaltR (2020), (available at https://CRAN.R-project.org/package=WebGestaltR).
