An implementation of a post-hoc test for subset selection that uses a Neyman-Pearson hypothesis test to identify important variables.


Neyman-Pearson Based Feature Selection (NPFS) post-hoc test

Many feature subset selection problems require that the size of the subset be specified before the selection algorithm is run. NPFS works with the decisions of a base subset selection algorithm to determine an appropriate number of features to select, given an initial starting point. NPFS uses the FEAST feature selection toolbox; however, the approach is not limited to this toolbox. The Matlab script npfs_post.m is provided for those who have already run a base feature selection algorithm and would like to apply the NPFS routine.

Installation

The scripts can be copied into the default Matlab path (usually something like ~/Matlab/), or you can add the directory where you placed the scripts to Matlab's search path (e.g., addpath('/path/to/NPFS/scripts/')).

Example

This tutorial assumes that the FEAST feature selection toolbox for Matlab has been compiled and installed to the Matlab path. The current implementation of NPFS uses FEAST; however, you are not limited to FEAST's feature selection implementations. The get_features.m script can be modified to include a different base variable selection algorithm: just replace the feast function call with another function that returns n_select of the n_features indices that are deemed relevant.
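As a sketch of what such a swap might look like, the FEAST call in get_features.m could be replaced along the following lines. Here my_score is a hypothetical stand-in for your own feature-scoring routine, and the actual variable names inside get_features.m may differ:

```matlab
% Original call (FEAST): returns the indices of n_select features
% chosen by the given criterion (e.g., 'mim').
% sel_idx = feast(method, n_select, data, labels);

% Hypothetical replacement: any function that returns n_select of the
% n_features column indices will do. Here we rank features by a score
% and keep the top n_select.
scores     = my_score(data, labels);   % my_score: your own scoring routine
[~, order] = sort(scores, 'descend');  % highest-scoring features first
sel_idx    = order(1:n_select);        % indices handed back to NPFS
```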

Generating the Data

First, let us generate some data, and just as importantly, data that we can easily interpret when it comes time to apply feature selection. We generate n_features integer-valued features in the range 0 to 10, and the labels are chosen such that if the sum of the first n_relevant features is greater than some threshold, the example is labeled 1; otherwise it receives the label 2. The code to do this is shown below.

  n_features = 50;
  n_observations = 1000; 
  n_relevant = 10;

  % FEAST expects discrete feature values, hence the rounding to integers
  data = round(10*rand(n_observations, n_features));
  label_sum = sum(data(:, 1:n_relevant), 2);
  delta = mean(label_sum);
  labels = zeros(n_observations, 1);
  labels(label_sum > delta) = 1;
  labels(label_sum <= delta) = 2;

Running NPFS

The npfs.m function assumes that FEAST has already been compiled and added to the Matlab path. Once this is done, NPFS can be called as follows.

  n_select = 5;
  n_boots = 100;
  beta = 0;  % haven't published on this term
  alpha = 0.01;
  method = 'mim';
  idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);
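On this synthetic data we know the ground truth, so the result can be sanity-checked. This sketch assumes idx holds the indices of the selected features; on the data generated above, the informative features are columns 1 through n_relevant:

```matlab
% Sanity check (assumes idx contains the selected feature indices):
% most selected indices should fall in the truly relevant range.
hits = sum(idx <= n_relevant);
fprintf('%d of %d selected features are truly relevant\n', hits, numel(idx));
```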

However, the above code could take a long time to run, depending on how many bootstrap iterations you choose. As discussed in the manuscript, NPFS can easily be parallelized, and this parallelism is implemented in npfs.m. To take advantage of it, run:

  matlabpool open local 12
  idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);
  matlabpool close force

Note that you are limited in the number of parallel workers you can open, so the above code, which requests 12 workers, may not work on a laptop.
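Rather than hard-coding 12 workers, the pool can be sized to the machine at run time. This sketch uses the older matlabpool interface shown above; feature('numCores') is an undocumented but commonly used Matlab call, so treat it as an assumption and substitute your own core count if it is unavailable:

```matlab
% Size the pool to the machine rather than hard-coding 12 workers.
n_workers = min(12, feature('numCores'));  % undocumented call; adjust as needed
matlabpool('open', 'local', n_workers);
idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);
matlabpool('close');
```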

The beta variable specified in the above code is used to bias the hypothesis test. For the results presented in the original NPFS manuscript, beta = 0.

Citing NPFS

  • Gregory Ditzler, Robi Polikar, and Gail Rosen, "A Bootstrap Based Neyman–Pearson Test for Identifying Variable Importance," 2014, in press.

Word Cloud

(word cloud image)