This repository contains R code for various feature selection methods. The code is organized as outlined below.
- Model Fits & GOF: Includes functions for fitting the Cox proportional hazards (PH), proportional odds (PO) and Yang-Prentice (YP) models, and for testing goodness-of-fit (GOF). The Cox and PO fits can optionally be adjusted for age and stage.
  - Required: R packages (survival, timereg, YPmodel)
  - Inputs:
    - predictor, x (gene expression)
    - data frame with n rows (number of subjects) and columns with survival information (time = survival time, censor = censoring indicator) and p genes
  - Output: model coefficient (beta), coefficient standard error, significance p-value, GOF p-value
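As a rough illustration of the inputs and outputs listed above, the sketch below fits a univariate Cox model per gene with the survival package and collects beta, its standard error, the significance p-value and a GOF p-value. The function name `fit_cox_gene`, the toy data and the use of `cox.zph` as the GOF test are assumptions made for this example; the repository's own functions may use a different test, and the PO and YP fits (timereg, YPmodel) are not shown.

```r
library(survival)

# Hypothetical toy data: time = survival time, censor = censoring indicator
# (assumed here to be 1 = event, 0 = censored), then one column per gene.
set.seed(1)
n <- 200
dat <- data.frame(
  time   = rexp(n, rate = 0.1),
  censor = rbinom(n, size = 1, prob = 0.7),
  gene1  = rnorm(n),
  gene2  = rnorm(n)
)

# fit_cox_gene is an illustrative name, not a function from this repository.
fit_cox_gene <- function(gene, data) {
  d   <- data.frame(time = data$time, censor = data$censor, x = data[[gene]])
  fit <- coxph(Surv(time, censor) ~ x, data = d)
  est <- summary(fit)$coefficients
  gof <- cox.zph(fit)$table   # Schoenfeld-residual test of proportional hazards
  c(beta  = unname(est["x", "coef"]),
    se    = unname(est["x", "se(coef)"]),
    p     = unname(est["x", "Pr(>|z|)"]),
    gof_p = unname(gof["x", "p"]))
}

# One row per gene, one column per output quantity.
res <- t(sapply(c("gene1", "gene2"), fit_cox_gene, data = dat))
print(res)
```

Looping over gene columns this way gives one row of results per gene, which is a convenient shape for ranking genes afterwards.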
- Pseudo-R2 Measures: Includes functions for calculating each pseudo-R2 measure (PO, CO, CH, ModCH and PH). Each measure currently has its own function; in the future, we plan to combine the code into a single R2 function with the option `type = c("PO", "CO", "CH", "ModCH", "PH")`.
  - Required: R package (survival)
  - Inputs:
    - predictor, x (gene expression)
    - survival time
    - censoring indicator
  - Output: R2 measure
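One possible shape for the planned combined function is a thin dispatcher around the existing per-measure functions. This is only a sketch: the name `r2_measure` and the `measures` argument (a named list of functions taking `x`, `time` and `censor`) are hypothetical, and the stub used in the usage lines stands in for a real measure.

```r
# Hypothetical wrapper; 'measures' is a named list holding the existing
# per-measure functions, each with arguments (x, time, censor).
r2_measure <- function(x, time, censor,
                       type = c("PO", "CO", "CH", "ModCH", "PH"),
                       measures) {
  type <- match.arg(type)          # rejects anything outside the five types
  f <- measures[[type]]
  if (!is.function(f)) {
    stop("no function supplied for type = '", type, "'")
  }
  f(x, time, censor)
}

# Usage with a placeholder implementation (not a real R2 measure).
stub <- list(PO = function(x, time, censor) 0.5)
r2_measure(x = rnorm(5), time = rexp(5), censor = c(1, 0, 1, 1, 0),
           type = "PO", measures = stub)
```

`match.arg` keeps the accepted values in one place, so adding a measure later only means extending the `type` vector and the list.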
- R2LR & R2I Measures: This code contains functions for R2LR, R2IPO & R2IPH. There are options for adjusting for age and stage, if desired.
  - Required: R packages (survival, timereg)
  - Inputs:
    - predictor, x (gene expression)
    - survival time
    - censoring indicator
  - Output: R2 measures
- I Measures: This code contains functions for IPO and IYP. There are options for adjusting for age and stage for IPO, if desired.
  - Required: R packages (survival, timereg, YPmodel)
  - This code assumes that your data is in the following form:
    - Column 1 = survival time (time)
    - Column 2 = censoring indicator (censor)
    - Columns 3+ = genes
  - Output:
    - IPO, outPO (I, I test statistic, I p-value)
    - IYP, outYP (I, I test statistic, I p-value)
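The expected column layout can be illustrated with a small made-up data set. The values, the gene names and the helper `check_layout` are assumptions for this example only; the check simply confirms that the first two columns are `time` and `censor` and that gene columns follow.

```r
set.seed(42)
n <- 50   # subjects
p <- 5    # genes

# Made-up expression matrix with one column per gene.
expr <- matrix(rnorm(n * p), nrow = n,
               dimnames = list(NULL, paste0("gene", seq_len(p))))

# Column 1 = time, column 2 = censor, columns 3+ = genes.
dat <- data.frame(time   = rexp(n, rate = 0.1),
                  censor = rbinom(n, size = 1, prob = 0.7),
                  expr)

# Hypothetical helper that verifies the layout before running the I measures.
check_layout <- function(d) {
  stopifnot(ncol(d) >= 3,
            identical(names(d)[1:2], c("time", "censor")),
            all(d$time > 0),
            all(d$censor %in% c(0, 1)),
            all(vapply(d[-(1:2)], is.numeric, logical(1))))
  TRUE
}

check_layout(dat)
```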
- Youden & AUC: Computes Youden & AUC values based on the gene ranking produced by a specified feature selection method.
  - Required: R package (MESS)
  - Inputs:
    - data frame with n rows (number of genes) and 1 column (feature selection measure)
    - effectGenes: number of significant genes
  - Output option: Specificity, Sensitivity, Youden & AUC
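To make the calculation concrete, here is a minimal sketch of how sensitivity, specificity, Youden and AUC can be derived from a gene ranking. It rests on several assumptions that are not stated above: larger values of the measure rank a gene higher, the first `effectGenes` rows are the truly significant genes (as in a simulation), and ties are ignored. The function name `rank_performance` is hypothetical, and the AUC uses a base-R trapezoid sum rather than the MESS package.

```r
# Hypothetical helper: score = feature selection measure, one value per gene.
rank_performance <- function(score, effectGenes) {
  n     <- length(score)
  truth <- seq_len(n) <= effectGenes          # assumed: true genes come first
  hits  <- truth[order(score, decreasing = TRUE)]

  # Selecting the top k genes for k = 0, 1, ..., n.
  tp   <- c(0, cumsum(hits))
  fp   <- c(0, cumsum(!hits))
  sens <- tp / effectGenes
  spec <- 1 - fp / (n - effectGenes)

  # Area under the ROC curve by the trapezoid rule.
  fpr <- 1 - spec
  auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)

  # Report the cutoff that maximizes Youden = sensitivity + specificity - 1.
  youden <- sens + spec - 1
  best   <- which.max(youden)
  list(sensitivity = sens[best], specificity = spec[best],
       youden = youden[best], auc = auc)
}

# A ranking that puts all three significant genes on top.
perfect <- rank_performance(score = c(10, 9, 8, 1, 2, 3), effectGenes = 3)
str(perfect)
```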
- Venn Diagrams: This code creates Venn diagrams showing the intersections between different feature selection measures.
  - Required: R packages (gplots, VennDiagram, latex2exp)
  - Inputs:
    - a data frame named "out" with n rows (number of genes) and 1 column for each of the measures listed above (except IYP). All measures here are based on continuous gene expression.
    - a data frame named "out2" with n rows (number of genes) and 1 column for each of the measures based on dichotomized expression (IYP, IPO, concreg).
  - Venn diagrams created:
    - IPO, IYP, & concreg (dichotomized case is coded separately)
    - R2PO, R2IPO & R2LR
    - R2PO, R2ModCH & R2CO
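The overlaps that a Venn diagram displays can be computed directly from a data frame shaped like "out". The sketch below uses random placeholder values and assumes that the top k genes per measure (largest values first) are the ones being compared; the cutoff `k` and the helper `top_genes` are hypothetical.

```r
set.seed(7)
n_genes <- 100

# Placeholder for "out": one row per gene, one column per measure.
out <- data.frame(R2PO  = runif(n_genes),
                  R2IPO = runif(n_genes),
                  R2LR  = runif(n_genes),
                  row.names = paste0("gene", seq_len(n_genes)))

# Names of the k highest-ranked genes for each measure.
top_genes <- function(out, k) {
  lapply(out, function(m) rownames(out)[order(m, decreasing = TRUE)[seq_len(k)]])
}

sets <- top_genes(out, k = 20)

# Pairwise overlap counts and the genes selected by all three measures.
pairwise  <- sapply(sets, function(a) sapply(sets, function(b) length(intersect(a, b))))
all_three <- Reduce(intersect, sets)

print(pairwise)
print(length(all_three))
```

A named list of gene sets like `sets` is also the form that `VennDiagram::venn.diagram()` takes as its `x` argument.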
- Other Existing Measures: This code contains functions for computing some existing measures in the literature.
  - Required: R packages (concreg, survAUC, survival, pec, timereg)
  - Measures computed:
    - Concreg (Dunkler et al. 2010)
    - Uno's C (Uno et al. 2011)
    - R2G - PH & PO cases (Graf et al. 1999; Gerds & Schumacher 2006)
    - R2SH (Schemper & Henderson 2000)
  - Inputs:
    - predictor, x (gene expression)
    - survival time
    - censoring indicator
  - Output: Concreg (absolute effect size), Uno's C, R2G (PH), R2G (PO) and R2SH.
- Simulations: Contains R code for simulating data.
  - Scheme 1: Univariate approach; genes linked to survival one at a time
  - Scheme 2: Multivariate approach; incorporates correlation between features
  - For both schemes, there are options to simulate from the following models: LN, LL1, LL2, W1, W2
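For orientation, the sketch below simulates censored survival times for a single gene from a Weibull proportional hazards model by inverse-transform sampling. It is a generic illustration, not the repository's Scheme 1 code: the function name `simulate_weibull`, the parameterization, the default values and the exponential censoring mechanism are all assumptions, and it presumes that "W" in the model list denotes a Weibull model.

```r
# Hypothetical simulator for one gene under a Weibull PH model:
#   S(t | x) = exp(-(t / scale)^shape * exp(beta * x))
simulate_weibull <- function(n, beta = 1, shape = 1.5, scale = 1,
                             cens_rate = 0.3) {
  x <- rnorm(n)                      # gene expression
  u <- runif(n)
  # Solve S(t | x) = u for t.
  t_event <- scale * (-log(u) / exp(beta * x))^(1 / shape)
  t_cens  <- rexp(n, rate = cens_rate)   # assumed independent censoring
  data.frame(time   = pmin(t_event, t_cens),
             censor = as.integer(t_event <= t_cens),  # 1 = event observed
             x      = x)
}

set.seed(123)
sim <- simulate_weibull(n = 500)
mean(sim$censor == 0)   # observed censoring proportion
```

The censoring proportion is not fixed by this sketch; `cens_rate` would need tuning to reach a target such as 33%.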
- Complete Example: This example does the following:
  - Creates a simulated data set (Scheme 1 - W, 33% censoring)
  - Obtains PH, PO & YP model fits and GOF
  - Computes all proposed measures for feature selection:
    - I measures - IPO, IYP
    - R2 measures - R2LR, R2IPO, R2IPH, R2PO, R2CO, R2ModCH
    - R2 measures (by Rouam et al. 2010, 2011) - R2CH, R2PH
  - Computes other existing measures (concreg, Uno's C, R2G & R2SH)
  - Computes Sensitivity, Specificity, Youden & AUC for each measure.
  - Creates Venn diagrams showing overlaps between measures.
  - Note: Before running this example, some functions from the other R code in this repository need to be run first. All required functions are noted throughout the example.
  - Required: R packages (survival, timereg, YPmodel, concreg, survAUC, pec, MESS, gplots, VennDiagram, latex2exp)
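Before running the complete example, it can help to confirm that the packages listed above are installed. The following base-R check is a small convenience sketch and not part of the repository's code.

```r
# Packages listed as required for the complete example.
required <- c("survival", "timereg", "YPmodel", "concreg", "survAUC",
              "pec", "MESS", "gplots", "VennDiagram", "latex2exp")

# requireNamespace() returns FALSE (quietly) for packages that are not installed.
installed    <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```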
Copyright © Lauren Spirko-Burns and Karthik Devarajan
Spirko, L.N., Devarajan, K. Unified methods for variable selection in large-scale genomic studies with censored survival outcomes. Under review. COBRA pre-print series, Article 120 (June 2019). http://biostats.bepress.com/cobra/art120.
Spirko, L. (2017). Variable Selection and Supervised Dimension Reduction for Large-Scale Genomic Data with Censored Survival Outcomes. Ph.D. Dissertation. Department of Statistical Science, Temple University, Philadelphia.
This work by Lauren Spirko-Burns and Karthik Devarajan is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.