Neyman-Pearson Based Feature Selection (NPFS) post-hoc test
Some feature subset selection problems require that the size of the subset be specified before the selection algorithm is run. NPFS works with the decisions of a
base subset selection algorithm to determine an appropriate number of features to select given an initial starting point. NPFS uses the FEAST feature selection toolbox; however, the approach is not limited to this toolbox. The Matlab script
npfs_post.m is provided for those who have already run the base feature selection algorithm and would like to apply the NPFS routine.
The scripts can be copied into the default Matlab path (usually something like
~/Matlab/), or you can add the directory where you placed the scripts to Matlab's search path (e.g.,
This tutorial assumes that the FEAST feature selection toolbox for Matlab has been compiled and installed to the Matlab path. The current implementation of NPFS uses FEAST; however, you are not limited to using FEAST's feature selection implementations. The
get_features.m script can be modified to include different base variable selection algorithms. Just replace the
feast function call with another function that returns
n_features indices that are relevant.
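As an illustration of such a replacement, here is a minimal sketch of a base selector that ranks features by their absolute Pearson correlation with the labels. The function name corr_select and its argument list are hypothetical and are not part of NPFS or FEAST. The actual call inside get_features.m may pass different arguments, so adapt the signature to match it.

```matlab
% corr_select.m (hypothetical file name, not shipped with NPFS or FEAST)
function idx = corr_select(data, labels, n_select)
% Return the indices of the n_select features (columns of data) with the
% largest absolute Pearson correlation with the labels.
n_feat = size(data, 2);
score = zeros(1, n_feat);
for j = 1:n_feat
    c = corrcoef(data(:, j), labels);   % 2-by-2 correlation matrix
    score(j) = abs(c(1, 2));
end
score(isnan(score)) = 0;                % constant features carry no signal
[~, order] = sort(score, 'descend');
idx = order(1:n_select);
end
```

Any function with this shape (data and labels in, a vector of selected column indices out) could stand in for the feast call under the same assumption.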
Generating the Data
First, let us generate some data that we can easily interpret when it comes time to apply feature selection. We are going to generate
n_features features that are integers in the range 0 to 10, and the labels are chosen such that if the sum of the first
n_relevant features is greater than a threshold (here, the mean of those sums), the example is labeled 1; otherwise it receives the label 2. The code to do this is shown below.
    n_features = 50;
    n_observations = 1000;
    n_relevant = 10;
    % feast wants
    data = round(10*rand(n_observations, n_features));
    % only the first n_relevant features determine the label
    label_sum = sum(data(:, 1:n_relevant), 2);
    delta = mean(label_sum);   % threshold on the sum
    labels = zeros(n_observations, 1);
    labels(label_sum > delta) = 1;
    labels(label_sum <= delta) = 2;
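Before running feature selection, it can help to confirm that the generated data behaves as intended. The sketch below regenerates the data so that it runs on its own, then compares how strongly the relevant and irrelevant features correlate with the labels. The variable names frac_class1 and r are introduced here for illustration only.

```matlab
% Regenerate the synthetic data described above
n_features = 50;
n_observations = 1000;
n_relevant = 10;
data = round(10*rand(n_observations, n_features));
label_sum = sum(data(:, 1:n_relevant), 2);
delta = mean(label_sum);
labels = zeros(n_observations, 1);
labels(label_sum > delta) = 1;
labels(label_sum <= delta) = 2;

% Fraction of examples in class 1 (should be roughly one half)
frac_class1 = mean(labels == 1);

% Absolute correlation of every feature with the labels
r = zeros(1, n_features);
for j = 1:n_features
    c = corrcoef(data(:, j), labels);
    r(j) = abs(c(1, 2));
end

fprintf('class 1 fraction: %.2f\n', frac_class1);
fprintf('mean |corr|, relevant features:   %.3f\n', mean(r(1:n_relevant)));
fprintf('mean |corr|, irrelevant features: %.3f\n', mean(r(n_relevant+1:end)));
```

The first n_relevant features should show a clearly larger average correlation than the rest, which is the ground truth a good selection should recover.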
The npfs.m function assumes that FEAST has already been compiled and added to the Matlab path. Once this is done,
NPFS can be called as follows.
    n_select = 5;
    n_boots = 100;
    beta = 0;   % haven't published on this term
    alpha = 0.01;
    method = 'mim';
    idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);
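To give some intuition for the roles of n_boots, n_select and alpha, the following is a rough sketch of the kind of test described in the cited manuscript. It assumes that, if a feature were irrelevant, the number of bootstraps in which it is selected would follow a binomial distribution with success probability n_select/n_features, and that a feature is kept when its count exceeds the 1 - alpha quantile. This is not the code in npfs.m, it ignores beta, and the counts vector is made up for illustration.

```matlab
n_features = 50;
n_select = 5;
n_boots = 100;
alpha = 0.01;

% Chance that a given feature is picked in one bootstrap if the base
% algorithm were selecting features at random (assumed null hypothesis)
p0 = n_select / n_features;

% Binomial CDF computed with base Matlab (no Statistics Toolbox needed)
k = 0:n_boots;
log_pmf = gammaln(n_boots + 1) - gammaln(k + 1) - gammaln(n_boots - k + 1) ...
          + k * log(p0) + (n_boots - k) * log(1 - p0);
cdf = cumsum(exp(log_pmf));

% Smallest count whose CDF value reaches 1 - alpha
zeta_crit = k(find(cdf >= 1 - alpha, 1, 'first'));

% Hypothetical selection counts for four features over n_boots bootstraps
counts = [3 25 12 40];
selected = find(counts > zeta_crit);
```

Under these assumptions, a smaller alpha raises the threshold and therefore keeps fewer features.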
However, the call to npfs above can take a long time depending on how many bootstraps you choose to use. As discussed in the manuscript,
NPFS can easily be parallelized, and parallelism is implemented in
npfs.m. To take advantage of this, run:
    matlabpool open local 12
    idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);
    matlabpool close force
Note that the number of parallel workers you can open is limited by your machine and Matlab configuration, so the 12 workers requested above may not be available on a laptop.
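The matlabpool command was removed in Matlab R2015a in favor of parpool. On newer releases, a rough equivalent could look like the sketch below. It assumes the Parallel Computing Toolbox is installed, that npfs.m works unchanged with a pool opened this way, and that data, labels, and the other arguments are already defined as above. The worker count n_workers is an illustrative value.

```matlab
% Assumes the Parallel Computing Toolbox (parpool, R2013b or later)
n_workers = 4;                 % illustrative; choose a count your machine supports
pool = parpool(n_workers);     % open a pool using the default cluster profile

idx = npfs(data, labels, method, n_select, n_boots, alpha, beta);

delete(pool);                  % shut the pool down when finished
```

Requesting fewer workers than in the example above is a simple way to stay within what a laptop can provide.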
The beta variable specified in the above code is used to bias the hypothesis test. For the results presented in the original NPFS manuscript
- Gregory Ditzler, Robi Polikar, Gail Rosen, "A Bootstrap Based Neyman–Pearson Test for Identifying Variable Importance," 2014, In press.