lburns27/Feature-Selection

Feature-Selection

This repository contains R code for various feature selection methods. The code is organized as outlined below.

  • Model Fits & GOF: Includes functions for fitting the Cox proportional hazards (PH), proportional odds (PO) and Yang-Prentice (YP) models, and for testing goodness-of-fit (GOF). The Cox and PO fits can optionally be adjusted for age and stage.

    • Required: R packages (survival, timereg, YPmodel)
    • Inputs:
      • predictor, x (gene expression)
      • data frame with n rows (number of subjects) and columns with survival information (time = survival time, censor = censoring information) and p genes
    • Output: model coefficient (beta), coefficient standard error, significance p-value, GOF p-value
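
As an illustration of the inputs and outputs listed above, here is a minimal sketch of a single-gene Cox fit using only the `survival` package. It is not the repository's own function: the simulated data, the column name `gene1`, the coding `censor = 1` for an observed event, and the use of `cox.zph` (a Schoenfeld-residual test) as the GOF check are all assumptions made for the example.

```r
library(survival)

set.seed(1)
n <- 200
x <- rnorm(n)                                  # hypothetical expression values for one gene
event_time  <- rexp(n, rate = exp(0.5 * x))    # hazard increases with expression
censor_time <- rexp(n, rate = 0.3)

# Data frame in the documented layout: time, censor, then gene columns
dat <- data.frame(
  time   = pmin(event_time, censor_time),
  censor = as.integer(event_time <= censor_time),  # assumed coding: 1 = event observed
  gene1  = x
)

fit <- coxph(Surv(time, censor) ~ gene1, data = dat)
est <- summary(fit)$coefficients   # columns: coef, exp(coef), se(coef), z, Pr(>|z|)
gof <- cox.zph(fit)                # stand-in GOF test for the PH assumption

result <- c(
  beta  = unname(est["gene1", "coef"]),
  se    = unname(est["gene1", "se(coef)"]),
  p     = unname(est["gene1", "Pr(>|z|)"]),
  gof_p = unname(gof$table["gene1", "p"])
)
print(result)
```

Adjusting for age and stage would amount to adding those columns to the right-hand side of the formula, assuming they are present in the data frame.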
  • Pseudo-R2 Measures: Includes functions for calculating each pseudo-R2 measure (PO, CO, CH, ModCH and PH). Each measure currently has its own function. In the future, we plan to combine the code into a single R2 function with the option type = c("PO", "CO", "CH", "ModCH", "PH").

    • Required: R package (survival)
    • Inputs:
      • predictor, x (gene expression)
      • survival time
      • censoring indicator
    • Output: R2 measure
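
The planned single-function interface could look roughly like the sketch below. Everything here is hypothetical: the name `r2_measure`, the `impl` argument, and the placeholder function in the usage lines are inventions for illustration, and no formula for any of the measures is implemented. The sketch only shows how `match.arg` can validate the `type` option and dispatch to the existing per-measure functions.

```r
# Hypothetical dispatcher: `impl` is a named list mapping each type to the
# function that computes that measure (the repository's existing functions
# would be plugged in here).
r2_measure <- function(x, time, censor,
                       type = c("PO", "CO", "CH", "ModCH", "PH"),
                       impl = list()) {
  type <- match.arg(type)          # rejects anything outside the five options
  f <- impl[[type]]
  if (!is.function(f)) {
    stop("no implementation registered for type = ", type)
  }
  f(x, time, censor)
}

# Usage with a placeholder that returns a constant (NOT a real R2 formula)
placeholder <- function(x, time, censor) 0.5
val <- r2_measure(x = rnorm(5), time = 1:5, censor = c(1, 0, 1, 1, 0),
                  type = "PO", impl = list(PO = placeholder))
```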
  • R2LR & R2I Measures: This code contains functions for R2LR, R2IPO & R2IPH. There are options for adjusting for age and stage, if desired.

    • Required: R packages (survival, timereg)
    • Inputs:
      • predictor, x (gene expression)
      • survival time
      • censoring indicator
    • Output: R2 measures
  • I Measures: This code contains functions for IPO and IYP. There are options for adjusting for age and stage for IPO, if desired.

    • Required: R packages (survival, timereg, YPmodel)
    • This code assumes that your data is in the following form:
      • Column 1 = survival time (time)
      • Column 2 = censoring indicator (censor)
      • Columns 3+ = genes
    • Output:
      • IPO, outPO (I, I test statistic, I p-value)
      • IYP, outYP (I, I test statistic, I p-value)
  • Youden & AUC: Computes Youden & AUC values based on gene ranking by a specified feature selection method

    • Required: R package (MESS)
    • Inputs:
      • data frame with n rows (number of genes) and 1 column (feature selection measure)
      • effectGenes: number of significant genes
    • Output option: Specificity, Sensitivity, Youden & AUC
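
The idea behind this step can be sketched in base R as follows. This is not the repository's code: the function name `rank_performance` is made up, larger scores are assumed to indicate stronger association, and the truly associated genes are assumed to be the first `effectGenes` rows (as would be natural for simulated data). The AUC uses a plain trapezoidal rule in place of the `MESS` package.

```r
# Hypothetical helper: evaluates a gene ranking against a known set of true signals.
rank_performance <- function(score, effectGenes) {
  p <- length(score)
  truth <- seq_len(p) <= effectGenes       # assumption: first effectGenes rows are true signals
  hits <- truth[order(score, decreasing = TRUE)]

  # Counts when the top k genes are selected, for k = 0, 1, ..., p
  tp <- c(0, cumsum(hits))
  fp <- c(0, cumsum(!hits))

  sens <- tp / effectGenes
  spec <- 1 - fp / (p - effectGenes)
  fpr  <- 1 - spec

  # Trapezoidal area under the ROC curve (false positive rate vs sensitivity)
  auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)

  list(sensitivity = sens, specificity = spec,
       youden = max(sens + spec - 1), auc = auc)
}

# A ranking that puts all three true genes first
perfect <- rank_performance(c(10, 9, 8, 1, 2, 3, 0), effectGenes = 3)
# A ranking that puts all three true genes last
worst <- rank_performance(c(0, 1, 2, 10, 9, 8, 11), effectGenes = 3)
```

For measures where smaller values are better (p-values, for example), the score would need to be negated first. Tied scores are ordered arbitrarily by `order`, which can affect the curve.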
  • Venn Diagrams: This code creates Venn diagrams showing the intersections between different feature selection measures.

    • Required: R packages (gplots, VennDiagram, latex2exp)
    • Inputs:
      • Obtain a data frame named "out" with n rows (number of genes) and 1 column for each of the measures listed above (except IYP). All measures here are based on continuous gene expression.
      • Obtain a data frame named "out2" with n rows (number of genes) and 1 column for each of the measures based on dichotomized expression (IYP, IPO, concreg).
    • Venn Diagrams created:
      • IPO, IYP, & concreg (dichotomized case is coded separately)
      • R2PO, R2IPO & R2LR
      • R2PO, R2ModCH & R2CO
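
The overlaps that such a diagram displays can be computed directly, as in the base R sketch below. The values in `out`, the gene names, the cutoff `k`, and the helper `top_genes` are all hypothetical, and larger values are assumed to rank a gene higher.

```r
# Toy "out" data frame: one row per gene, one column per measure (made-up values)
out <- data.frame(
  R2PO  = c(0.9, 0.8, 0.1, 0.2, 0.7),
  R2IPO = c(0.8, 0.9, 0.2, 0.1, 0.3),
  R2LR  = c(0.7, 0.9, 0.8, 0.1, 0.2),
  row.names = paste0("g", 1:5)
)

# Hypothetical helper: names of the k top-ranked genes for every measure
top_genes <- function(out, k) {
  lapply(out, function(m) rownames(out)[order(m, decreasing = TRUE)[seq_len(k)]])
}

sets <- top_genes(out, k = 3)        # named list with one gene set per measure
shared <- Reduce(intersect, sets)    # genes selected by all three measures
print(shared)
```

A named list of sets like `sets` is the kind of input that `VennDiagram::venn.diagram` accepts through its `x` argument.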
  • Other Existing Measures: This code contains functions for computing some existing measures in the literature.

    • Required: R packages (concreg, survAUC, survival, pec, timereg)
    • Measures computed:
      • Concreg (Dunkler et al. 2010)
      • Uno's C (Uno et al. 2011)
      • R2G - PH & PO cases (Graf et al. 1999; Gerds & Schumacher 2006)
      • R2SH (Schemper & Henderson 2000)
    • Inputs:
      • predictor, x (gene expression)
      • survival time
      • censoring indicator
    • Output: Concreg (absolute effect size), Uno's C, R2G (PH), R2G (PO) and R2SH.
  • Simulations: Contains R code for simulating data

    • Scheme 1: Univariate approach; genes linked to survival one at a time
    • Scheme 2: Multivariate approach; incorporates correlation between features
    • For both schemes, there are options to simulate from the following models: LN, LL1, LL2, W1, W2
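
To give a feel for the univariate scheme, the sketch below simulates censored survival times for a single gene from a Weibull proportional hazards model by inverse-transform sampling. It is an assumption-laden illustration, not the repository's simulation code: the function name, the parameter values, the standard normal expression values, and the exponential censoring mechanism are all invented here, and the parameterizations behind W1 and W2 are not reproduced.

```r
# Hypothetical single-gene simulator. Survival function assumed:
#   S(t | x) = exp(-scale * t^shape * exp(beta * x))
simulate_weibull_gene <- function(n, beta, shape = 1.5, scale = 1, cens_rate = 0.4) {
  x <- rnorm(n)                       # expression values for one gene
  u <- runif(n)
  # Solve S(t | x) = u for t to obtain the event time
  event_time  <- (-log(u) / (scale * exp(beta * x)))^(1 / shape)
  censor_time <- rexp(n, rate = cens_rate)   # independent censoring (assumed)
  data.frame(
    time   = pmin(event_time, censor_time),
    censor = as.integer(event_time <= censor_time),  # assumed coding: 1 = event observed
    gene   = x
  )
}

set.seed(42)
sim <- simulate_weibull_gene(n = 500, beta = 0.8)
cens_prop <- mean(sim$censor == 0)   # observed censoring proportion
```

Reaching a target censoring proportion would require tuning `cens_rate`; checking `cens_prop` after each run is one simple way to do that.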
  • Complete Example: This example does the following

    1. Creates a simulated data set (Scheme 1 - W, 33% censoring)
    2. Obtains PH, PO & YP model fits and GOF
    3. Computes all proposed measures for feature selection:
      • I measures - IPO, IYP
      • R2_measures - R2LR, R2IPO, R2IPH, R2PO, R2CO, R2ModCH
      • R2_measures (by Rouam et al. 2010, 2011) - R2CH, R2PH
    4. Computes other existing measures (concreg, Uno's C, R2G & R2SH)
    5. Computes Sensitivity, Specificity, Youden & AUC for each measure.
    6. Creates Venn diagrams showing overlaps between measures.
    • Note: Before running this example, some functions from the other R code in this repository need to be loaded. All required functions are noted throughout the example.
    • Required: R packages (survival, timereg, YPmodel, concreg, survAUC, pec, MESS, gplots, VennDiagram, latex2exp)

Copyright & Citations

Copyright © Lauren Spirko-Burns and Karthik Devarajan

Spirko, L.N., Devarajan, K. Unified methods for variable selection in large-scale genomic studies with censored survival outcomes. Under review. COBRA pre-print series, Article 120 (June 2019). http://biostats.bepress.com/cobra/art120.

Spirko, L. (2017). Variable Selection and Supervised Dimension Reduction for Large-Scale Genomic Data with Censored Survival Outcomes. Ph.D. Dissertation. Department of Statistical Science, Temple University, Philadelphia.

License

Creative Commons License
This work by Lauren Spirko-Burns and Karthik Devarajan is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
