# Feature Selection

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

## Removing Features with Low Variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

Notice that fit and fit_transform show up again here. Sklearn provides a lot of functionality beyond prediction, and part of it is preprocessing the data. Preprocessing objects behave like models in that they learn only from the training data, but they don't predict anything. They therefore have a fit method and a transform method, but no predict method. We will see two examples of this paradigm below.
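To make the fit/transform split concrete, here is a minimal sketch. The tiny X_train and X_test arrays are made up for illustration. The selector learns which columns to drop from the training data only, then applies that same decision to new data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: column 0 is constant in the training set
X_train = np.array([[0, 2, 1], [0, 1, 4], [0, 3, 2]])
X_test = np.array([[5, 2, 0], [7, 1, 3]])

selector = VarianceThreshold()      # default: drop zero-variance features
selector.fit(X_train)               # learns from the training data only

# The decision made on X_train is reused on X_test, even though
# column 0 is not constant there
X_test_reduced = selector.transform(X_test)

print(X_test_reduced)
print(hasattr(selector, "predict"))  # transformers have no predict method
```

Column 0 is removed from X_test because it was constant in X_train, which is exactly the "rely only on the training data" behaviour described above.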

In [6]:
from sklearn.feature_selection import VarianceThreshold

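X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# For a boolean feature, Var[X] = p(1 - p). This threshold drops features
# that are either 0 or 1 in more than 80% of the samples.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))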

sel.fit(X)

sel.transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

In [8]:
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
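To see why the first column was dropped, it can help to inspect the fitted selector. The sketch below reuses the same X and reads the variances_ attribute and the get_support() mask. The variable names are only illustrative.

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

threshold = .8 * (1 - .8)            # 0.16
sel = VarianceThreshold(threshold=threshold).fit(X)

# Per-feature variance computed during fit
print(sel.variances_)

# Boolean mask of the features that are kept (variance above the threshold)
kept = sel.get_support()
print(kept)
```

The first feature is 0 in five of the six samples, so its variance is (1/6)(5/6), roughly 0.139, which is below the 0.16 threshold.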

## Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:
* SelectKBest removes all but the k highest scoring features
* SelectPercentile removes all but a user-specified highest scoring percentage of features
* SelectFpr, SelectFdr and SelectFwe select features using common univariate statistical tests for each feature, based on the false positive rate, the false discovery rate, or the family-wise error, respectively
* GenericUnivariateSelect performs univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

* For regression: f_regression, mutual_info_regression
* For classification: chi2, f_classif, mutual_info_classif
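As a rough sketch of how a selector and a scoring function fit together, the snippet below runs GenericUnivariateSelect in its k_best mode and checks that it picks the same features as SelectKBest. The choice of the iris data and of chi2 is just for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# mode picks the strategy, param is that strategy's parameter (here k=2)
generic = GenericUnivariateSelect(chi2, mode="k_best", param=2).fit(X, y)
kbest = SelectKBest(chi2, k=2).fit(X, y)

same_features = np.array_equal(generic.get_support(), kbest.get_support())
print(generic.get_support())
print(same_features)
```

Because mode and param are ordinary constructor parameters, they can be tuned like any other hyper-parameter.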

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
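The difference between the two families of scores can be illustrated on synthetic data. In this sketch (all data is generated, nothing comes from a real dataset) the target depends linearly on the first feature, non-linearly on the second, and not at all on the third.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 3)
# Linear in feature 0, sinusoidal in feature 1, independent of feature 2
y_demo = X_demo[:, 0] + np.sin(6 * np.pi * X_demo[:, 1]) + 0.1 * rng.randn(1000)

f_scores, p_values = f_regression(X_demo, y_demo)
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

print("F-test scores:", f_scores)
print("Mutual information:", mi_scores)
print("Top feature by F-test:", int(np.argmax(f_scores)))
print("Top feature by mutual information:", int(np.argmax(mi_scores)))
```

The F-test favours the linearly related feature, while mutual information gives the highest score to the feature with the stronger but non-linear relationship.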

In [10]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

SelectKBest?

In [4]:
X, y = load_iris(return_X_y=True)

sel = SelectKBest(chi2, k=2)

sel.fit(X, y)

SelectKBest(k=2, score_func=<function chi2 at 0x00000260FF6C9048>)

In [5]:
sel.transform(X).shape

(150, 2)

In [6]:
sel.scores_

array([ 10.81782088,   3.7107283 , 116.31261309,  67.0483602 ])
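The scores are easier to read next to the feature names. A small sketch, assuming the same iris data and chi2 scoring as above, that lists which features were kept and also prints the p-values:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

sel = SelectKBest(chi2, k=2).fit(X, y)

# get_support() returns a boolean mask over the original features
mask = sel.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(selected)
print(sel.pvalues_)   # chi2 returns p-values alongside the scores
```

The two features with the highest chi2 scores are the petal measurements, so those are the two that survive.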

## Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and a weight is assigned to each of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached.

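Because features are ranked by the magnitude of their weights, it is very important to put the features on a comparable scale (e.g., by standardizing them) when using a linear model. Otherwise the weights reflect the units of each feature rather than its importance.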

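In [11]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

RFECV?

In [6]:
# Load the iris data so that X and y are defined in this session
X, y = load_iris(return_X_y=True)

# RFECV chooses the number of features by cross-validation; a random
# forest exposes feature_importances_, which is used to rank features
m = RFECV(RandomForestClassifier(), scoring='accuracy')

In [7]:
m.fit(X, y)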

In [8]:
m.score?
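The elimination procedure described in this section can be sketched with plain RFE and a linear model. The choice of LogisticRegression, of StandardScaler, and of keeping two features are assumptions made only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Put features on the same scale so coefficient magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X)

# Drop one feature per round (step=1) until two remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(X_scaled, y)

print(rfe.support_)   # True for the features that were kept
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

With four features and one removed per round, the ranking holds two 1s for the kept features, a 2 for the last feature eliminated, and a 3 for the first one eliminated.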

## Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting. A feature is considered unimportant and removed if its corresponding coef_ or feature_importances_ value is below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.

The examples below show how to use it.
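As a quick sketch of the string heuristics, the snippet below compares “mean” with “0.1*mean” using a random forest, which provides feature_importances_. The estimator settings are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# SelectFromModel fits its own clone of the estimator
strict = SelectFromModel(forest, threshold="mean").fit(X, y)
loose = SelectFromModel(forest, threshold="0.1*mean").fit(X, y)

# threshold_ holds the numeric value the string was resolved to
print(strict.threshold_, strict.transform(X).shape)
print(loose.threshold_, loose.transform(X).shape)
```

Random forest importances sum to 1, so with four features the “mean” threshold resolves to 0.25 and “0.1*mean” to 0.025. The lower threshold can only keep the same features or more.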

In [9]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

SelectFromModel?

In [14]:
m = SelectFromModel(LinearSVC(C=0.01, penalty='l1', dual=False))

m.fit(X, y)

SelectFromModel(estimator=LinearSVC(C=0.01, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, threshold=None)

In [15]:
m.transform(X).shape

(150, 3)

Here is a slightly more complex example, using LassoCV on a regression dataset:

In [16]:
from sklearn.linear_model import LassoCV
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

print(X.shape)

m = SelectFromModel(LassoCV())

m.fit(X, y)

m.transform(X).shape

(506, 13)


(506, 10)
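Since load_boston is not available in recent scikit-learn releases, here is a sketch of the same idea on load_diabetes. That dataset is a stand-in chosen for this example, not the one used above, so the shapes differ. For L1-penalized estimators such as LassoCV, the default threshold is a small value close to zero, so in effect the features whose coefficients were shrunk to zero are dropped.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X_reg, y_reg = load_diabetes(return_X_y=True)

# LassoCV picks the regularization strength by cross-validation
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X_reg, y_reg)
X_reduced = selector.transform(X_reg)

print(X_reg.shape, X_reduced.shape)
print(selector.estimator_.coef_)   # coefficients of the fitted LassoCV
print(selector.get_support())      # mask of the features that were kept
```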