Selecting the right features (variables) can improve the learning process in data science by reducing the amount of noise (useless information) that can influence the learner’s estimates. Variable selection, therefore, can effectively reduce the variance of predictions. To involve just the useful variables in training and leave out the redundant ones, you can use these techniques:

**Univariate Approach**: Select the variables most related to the target outcome.

**Greedy or Backward Approach**: Remove the variables that you can drop from the learning process without damaging its performance, and keep the rest.

### Selecting by univariate measures

If you decide to select variables by their level of association with the target, <font color='blue'> **sklearn.feature_selection.SelectPercentile** </font> provides an automatic procedure for keeping only a certain percentage of the best-associated features. The available metrics for association are:

**f_regression**: Used only for numeric targets and based on linear regression performance.

**f_classif**: Used only for categorical targets and based on the Analysis of Variance (ANOVA) statistical test.

**chi2**: Computes the chi-square statistic for categorical targets, which is less sensitive to the nonlinear relationship between the predictive variable and its target.

When evaluating candidates for a classification problem, **f_classif** and **chi2** tend to provide the same set of top variables. It’s still a good practice to test the selections from both the association metrics.
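As a minimal sketch of how you might compare the two metrics, the following cell fits **SelectKBest** once with each of them and reports where the selections agree. The breast-cancer dataset, the value of `k`, and all variable names are assumptions chosen for illustration; the dataset is convenient here because chi2 requires non-negative feature values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Illustrative data: 30 non-negative numeric features, binary target.
data = load_breast_cancer()
X_clf, y_clf = data.data, data.target

k = 10  # arbitrary number of features to keep, for illustration only

# Boolean masks marking the k best features under each metric.
top_f = SelectKBest(f_classif, k=k).fit(X_clf, y_clf).get_support()
top_chi2 = SelectKBest(chi2, k=k).fit(X_clf, y_clf).get_support()

names_f = set(data.feature_names[top_f])
names_chi2 = set(data.feature_names[top_chi2])

print("Selected by both:  ", sorted(names_f & names_chi2))
print("Only by f_classif: ", sorted(names_f - names_chi2))
print("Only by chi2:      ", sorted(names_chi2 - names_f))
```

Features that appear under only one metric are the ones worth a closer look before you settle on a final set.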

Apart from applying a direct selection of the top percentile associations, SelectPercentile can also rank the best variables to make it easier to decide at what percentile to exclude a feature from participating in the learning process. The <font color='blue'> **sklearn.feature_selection.SelectKBest** </font> is analogous in its functionality, but it selects the top **k** variables, where k is a number, not a percentile.

- For regression: f_regression
- For classification: chi2 or f_classif
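The difference between the two selectors can be sketched on synthetic data. The dataset from `make_regression`, its parameters, and the variable names below are assumptions made for illustration, not the Boston data used in the rest of this section.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic data: 20 features, only 5 of which drive the target.
X_syn, y_syn = make_regression(n_samples=200, n_features=20,
                               n_informative=5, noise=10.0, random_state=0)

# Keep the top 25 percent of features by F-score...
by_percentile = SelectPercentile(f_regression, percentile=25).fit(X_syn, y_syn)
# ...or keep a fixed number of features.
by_count = SelectKBest(f_regression, k=5).fit(X_syn, y_syn)

# Rank all feature indices from highest to lowest F-score.
ranking = np.argsort(by_percentile.scores_)[::-1]
print("Features ranked by F-score:", ranking)
print("Kept by SelectPercentile:  ", by_percentile.get_support(indices=True))
print("Kept by SelectKBest:       ", by_count.get_support(indices=True))

# transform() drops every column that was not selected.
X_reduced = by_count.transform(X_syn)
print("Reduced shape:", X_reduced.shape)
```

Inspecting the ranking before fixing `percentile` or `k` makes it easier to see where the scores drop off.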

In [25]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an earlier release.
from sklearn.datasets import load_boston
boston = load_boston()

X = boston.data
y = boston.target
# print(boston.DESCR)

In [29]:
from sklearn.feature_selection import f_regression

F_score, p_value = f_regression(X, y)

# Print the F-score and p-value of each feature (12 significant digits).
for f, p in zip(F_score, p_value):
    print("%.12g %.12g" % (f, p))

88.1512417809 2.08355011081e-19
75.257642299 5.71358415308e-17
153.954883136 4.90025998175e-31
15.9715124204 7.39062317052e-05
112.59148028 7.06504158626e-24
471.846739876 2.48722887101e-74
83.4774592192 1.56998220919e-18
33.5795703259 1.20661172734e-08
85.9142776698 5.46593256965e-19
141.761356577 5.63773362769e-29
175.105542876 1.60950947847e-34
63.0542291125 1.31811273408e-14
601.61787111 5.08110339439e-88


In [4]:
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression

Selector_f = SelectPercentile(f_regression, percentile=25)
Selector_f.fit(X,y)

SelectPercentile(percentile=25,
         score_func=<function f_regression at 0x7f812fdc0848>)

In [5]:
for n, s in zip(boston.feature_names, Selector_f.scores_):
    print("F-score: %3.2f\t for feature %s" % (s, n))

F-score: 88.15	 for feature CRIM
F-score: 75.26	 for feature ZN
F-score: 153.95	 for feature INDUS
F-score: 15.97	 for feature CHAS
F-score: 112.59	 for feature NOX
F-score: 471.85	 for feature RM
F-score: 83.48	 for feature AGE
F-score: 33.58	 for feature DIS
F-score: 85.91	 for feature RAD
F-score: 141.76	 for feature TAX
F-score: 175.11	 for feature PTRATIO
F-score: 63.05	 for feature B
F-score: 601.62	 for feature LSTAT


In [6]:
for n, s in zip(boston.feature_names, Selector_f.pvalues_):
    print("p-value: %13.12f\t for feature %s" % (s, n))

p-value: 0.000000000000	 for feature CRIM
p-value: 0.000000000000	 for feature ZN
p-value: 0.000000000000	 for feature INDUS
p-value: 0.000073906232	 for feature CHAS
p-value: 0.000000000000	 for feature NOX
p-value: 0.000000000000	 for feature RM
p-value: 0.000000000000	 for feature AGE
p-value: 0.000000012066	 for feature DIS
p-value: 0.000000000000	 for feature RAD
p-value: 0.000000000000	 for feature TAX
p-value: 0.000000000000	 for feature PTRATIO
p-value: 0.000000000000	 for feature B
p-value: 0.000000000000	 for feature LSTAT


Using the level of association output helps you to choose the most important variables for your machine-learning model, but you should watch out for these possible problems:

- Some variables with high association could also be highly correlated, introducing duplicated information, which acts as noise in the learning process.

- Some variables may be penalized, especially binary ones (variables that indicate a status or characteristic with the value 1 when it is present and 0 when it is not). For example, the output shows the binary variable CHAS as the least associated with the target variable, even though the cross-validation phase in previous examples showed that it’s influential.
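The first problem can be illustrated with a small sketch. The data below is synthetic and every name in it is hypothetical: `x2` is built as a near-duplicate of `x1`, so both receive a high F-score even though the second one adds almost no new information. A correlation matrix of the candidate features is one simple way to spot such pairs.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)                   # independent, weaker signal
y_demo = 3.0 * x1 + 1.0 * x3 + rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])

# Univariate F-scores: x1 and x2 both look strong.
F, _ = f_regression(X_demo, y_demo)

# Pairwise correlations between the features (columns are variables).
corr = np.corrcoef(X_demo, rowvar=False)

print("F-scores:", np.round(F, 1))
print("Correlation between x1 and x2: %.3f" % corr[0, 1])
```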

The univariate selection process can give you a **real advantage** when you have **a huge number of variables** to select from and all other methods become computationally infeasible. The best procedure is to use **SelectPercentile** to cut out half or more of the available variables, bringing their number down to a manageable size, and then apply a more sophisticated and more precise method such as a greedy search.
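The two-stage procedure described above might look like the following sketch. The synthetic dataset, the 50 percent cut, the five folds, and the variable names are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV, SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: 40 features, only 6 of which drive the target.
X_wide, y_wide = make_regression(n_samples=300, n_features=40,
                                 n_informative=6, noise=5.0, random_state=1)

# Stage 1: cheap univariate screen that discards half of the features.
screen = SelectPercentile(f_regression, percentile=50).fit(X_wide, y_wide)
X_screened = screen.transform(X_wide)

# Stage 2: greedy, cross-validated search on the survivors only.
search = RFECV(estimator=LinearRegression(), cv=5,
               scoring="neg_mean_squared_error").fit(X_screened, y_wide)

# Map the final mask back to column indices of the original matrix.
kept = screen.get_support(indices=True)[search.support_]
print("Features after screening:", X_screened.shape[1])
print("Features after RFECV:    ", search.n_features_)
print("Original column indices: ", kept)
```

The expensive greedy search runs on half as many columns, which is where the time saving comes from.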

### Using a greedy search

With univariate selection, you have to decide for yourself how many variables to keep. Greedy selection instead reduces the number of features in a learning model automatically, based on their actual contribution to performance as gauged by the error measure.

After you fit it to the data, the RFECV class can tell you the number of useful features, point them out to you, and automatically transform the X data, through the method transform, into a reduced variable set. The example at the end of this section fits the selector on the Boston data.
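A minimal sketch of the fit-then-transform flow, using synthetic data so that it runs on its own. The dataset, the five folds, and the `toy_` names are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, only 4 of which drive the target.
X_toy, y_toy = make_regression(n_samples=200, n_features=10,
                               n_informative=4, noise=5.0, random_state=0)

toy_selector = RFECV(estimator=LinearRegression(), cv=5,
                     scoring="neg_mean_squared_error")
toy_selector.fit(X_toy, y_toy)

# transform() keeps only the columns that RFECV selected.
X_toy_reduced = toy_selector.transform(X_toy)
print("Original shape:", X_toy.shape)
print("Reduced shape: ", X_toy_reduced.shape)
print("Kept columns:  ", toy_selector.get_support(indices=True))
```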

It’s possible to obtain an index to the optimum variable set by calling the attribute **support_** from the RFECV class after you fit it.

In [60]:
# selector is the fitted RFECV instance from the cell at the end of this section
print(boston.feature_names[selector.support_])

['CHAS' 'NOX' 'RM' 'DIS' 'PTRATIO' 'LSTAT']


Notice that CHAS is now included among the most predictive features, which contrasts with the result from the univariate search. The RFECV method can detect whether a variable is important, no matter whether it is binary, categorical, or numeric, because it directly evaluates the role played by the feature in the prediction.

The RFECV method is certainly more effective than the univariate approach because it accounts for highly correlated features and is tuned to optimize the evaluation measure (which usually is not chi-square or F-score). Being a greedy process, however, it’s computationally demanding and may only approximate the best set of predictors.

Because RFECV learns the best set of variables from data, the selection may overfit, as can happen with any machine-learning algorithm. Running RFECV on different samples of the training data can confirm which variables are best to use.
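One way to run such a check is sketched below: refit RFECV on several random subsamples and count how often each feature is selected. The synthetic data, the 70 percent subsample size, and the ten rounds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data: 12 features, only 4 of which drive the target.
X_s, y_s = make_regression(n_samples=300, n_features=12,
                           n_informative=4, noise=10.0, random_state=0)

rng = np.random.RandomState(42)
n_rounds = 10
counts = np.zeros(X_s.shape[1], dtype=int)

for _ in range(n_rounds):
    # Draw a random 70 percent subsample without replacement.
    idx = rng.choice(len(X_s), size=int(0.7 * len(X_s)), replace=False)
    sel = RFECV(estimator=LinearRegression(), cv=5,
                scoring="neg_mean_squared_error")
    sel.fit(X_s[idx], y_s[idx])
    # support_ is a boolean mask; adding it increments the selected features.
    counts += sel.support_

for j, c in enumerate(counts):
    print("feature %2d selected in %2d of %d rounds" % (j, c, n_rounds))
```

Features that are selected in nearly every round are the safer choices; those that come and go are more likely artifacts of a particular sample.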

In [77]:
from sklearn import linear_model
from sklearn.feature_selection import RFECV

regression = linear_model.LinearRegression()
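selector = RFECV(estimator=regression,
                 cv=10,
                 # older scikit-learn releases spelled this "mean_squared_error"
                 scoring="neg_mean_squared_error")
selector.fit(X, y)

print(selector.estimator_)
print(selector.n_features_)
print(selector.support_)
print(selector.ranking_)
# grid_scores_ was removed in scikit-learn 1.2 in favor of cv_results_
print(selector.grid_scores_)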

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
6
[False False False  True  True  True False  True False False  True False
  True]
[3 5 4 1 1 1 8 1 2 6 1 7 1]
[-74.15075364 -58.91200179 -46.98911866 -39.70772115 -38.45325241
 -31.60974549 -33.08806978 -35.29131256 -34.99567387 -34.85887256
 -36.40257637 -35.57595111 -34.76309151]
