Selecting the right features (variables) can improve the learning process in data science by reducing the amount of noise (useless information) that can influence the learner’s estimates. Variable selection, therefore, can effectively reduce the variance of predictions. To involve just the useful variables in training and leave out the redundant ones, you can use these techniques:

**Univariate Approach**: Select the variables most related to the target outcome.

**Greedy or Backward Approach**: Keep only the variables that you cannot remove from the learning process without damaging its performance.

### Selecting by univariate measures

If you decide to select a variable by its level of association with the target, <font color='blue'> **sklearn.feature_selection.SelectPercentile** </font> provides an automatic procedure for keeping only a certain percentage of the best associated features. The available metrics for association are:

**f_regression**: Used only for numeric targets and based on linear regression performance.

**f_classif**: Used only for categorical targets and based on the Analysis of Variance (ANOVA) statistical test.

**chi2**: Computes the chi-square statistic for categorical targets, which is less sensitive to a nonlinear relationship between the predictive variable and its target. In scikit-learn it requires non-negative feature values.

When evaluating candidates for a classification problem, **f_classif** and **chi2** tend to provide the same set of top variables. It’s still good practice to test the selections from both association metrics.
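
As a rough illustration of comparing the two metrics, the sketch below selects the top four features with each one and prints the overlap. The synthetic data from `make_classification`, the names `X_demo` and `y_demo`, and the choice of `k=4` are assumptions made for the example; they are not the data used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption made for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# chi2 requires non-negative inputs, so rescale every feature to [0, 1]
X_pos = MinMaxScaler().fit_transform(X_demo)

# Indices of the four best features according to each metric
top_f = set(
    SelectKBest(f_classif, k=4).fit(X_pos, y_demo).get_support(indices=True).tolist()
)
top_chi = set(
    SelectKBest(chi2, k=4).fit(X_pos, y_demo).get_support(indices=True).tolist()
)

print("f_classif top 4:", sorted(top_f))
print("chi2 top 4:     ", sorted(top_chi))
print("shared:         ", sorted(top_f & top_chi))
```

Features that appear in only one of the two sets are the ones worth a closer look before you discard them.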

Apart from applying a direct selection of the top percentile associations, SelectPercentile can also rank the best variables to make it easier to decide at what percentile to exclude a feature from participating in the learning process. <font color='blue'> **sklearn.feature_selection.SelectKBest** </font> is analogous in its functionality, but it selects the top **k** variables, where k is an absolute count, not a percentage.
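
A minimal sketch of both selectors follows, assuming synthetic data from `make_classification` (the names `X_demo` and `y_demo`, the 30th percentile, and `k=3` are illustrative choices). It ranks the features by their F-score and then applies the percentile-based and the count-based selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic stand-in data (an assumption made for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectPercentile(f_classif, percentile=30).fit(X_demo, y_demo)
kept = selector.get_support()

# Rank the features from the highest to the lowest F-score
ranking = np.argsort(selector.scores_)[::-1]
for idx in ranking:
    print(f"feature {idx}: F={selector.scores_[idx]:.2f}, kept={kept[idx]}")

# Percentile-based selection: keep the top 30% of the features
X_top30 = selector.transform(X_demo)

# Count-based selection: keep the top k features
X_top3 = SelectKBest(f_classif, k=3).fit_transform(X_demo, y_demo)

print(X_top30.shape, X_top3.shape)
```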

Using the level of association output helps you to choose the most important variables for your machine-learning model, but you should watch out for these possible problems:

- Some variables with high association could also be highly correlated, introducing duplicated information, which acts as noise in the learning process.

- Some variables may be penalized, especially binary ones (variables indicating a status or characteristic using the value 1 when it is present, 0 when it is not). For example, notice that the output shows the binary variable CHAS as the least associated with the target variable (although previous examples showed, during the cross-validation phase, that it’s influential).
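
One simple way to spot the first problem is to inspect the pairwise correlations among the candidate features. The sketch below is only an illustration: the three-column array `X_demo` is made up, and the 0.9 cutoff is an arbitrary assumption rather than a recommended value.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Column 2 is nearly a copy of column 0, so it carries duplicated information
X_demo = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

# Correlation matrix between the columns (features) of X_demo
corr = np.corrcoef(X_demo, rowvar=False)

threshold = 0.9  # illustrative cutoff, an assumption
redundant_pairs = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]
print(redundant_pairs)
```

Each pair reported this way is a candidate for dropping one of its two members before or after the univariate ranking.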

The univariate selection process can give you a **real advantage** when you have **a huge number of variables** to select from and all other methods turn computationally infeasible. The best procedure is to use **SelectPercentile** to cut the available variables by half or more, bringing them down to a manageable number, and consequently allow the use of a more sophisticated and more precise method such as a greedy search.
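
The two-stage procedure can be sketched as a pipeline: a univariate filter that halves the variables, followed by a greedy search on the survivors. Everything specific here is an assumption for illustration: the synthetic 40-feature data, the `LogisticRegression` estimator, and the 3-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with many candidate features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=40, n_informative=5, n_redundant=0, random_state=0
)

two_stage = make_pipeline(
    # Stage 1: cheap univariate filter that keeps the best half of the features
    SelectPercentile(f_classif, percentile=50),
    # Stage 2: greedy recursive elimination on the remaining features
    RFECV(LogisticRegression(max_iter=1000), cv=3),
)
two_stage.fit(X_demo, y_demo)

n_after_filter = int(two_stage.named_steps["selectpercentile"].get_support().sum())
n_after_rfecv = int(two_stage.named_steps["rfecv"].n_features_)
print(n_after_filter, "->", n_after_rfecv)
```

The greedy search only ever sees the features that pass the filter, which is what keeps its cost manageable.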

### Using a greedy search

When using a univariate selection, you have to decide for yourself how many variables to keep. Greedy selection instead automatically reduces the number of features involved in a learning model on the basis of their effective contribution to performance, as gauged by the error measure.

The RFECV class, once fitted to the data, can tell you how many features are useful, point them out to you, and automatically transform the X data, through its transform method, into a reduced variable set.

It’s possible to obtain an index to the optimum variable set by calling the attribute **support_** from the RFECV class after you fit it.
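
A minimal sketch of this workflow is shown here, assuming synthetic regression data from `make_regression` in place of the original dataset (the names `X_demo` and `y_demo` and the 5-fold setting are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption made for this sketch)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

selector = RFECV(
    estimator=LinearRegression(), cv=5, scoring="neg_mean_squared_error"
)
selector.fit(X_demo, y_demo)

print("Optimal number of features:", selector.n_features_)
print("Selected mask:", selector.support_)        # True marks a kept feature
print("Ranking (1 = selected):", selector.ranking_)

# Reduce X to the selected columns only
X_reduced = selector.transform(X_demo)
print(X_reduced.shape)
```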

Notice that CHAS is now included among the most predictive features, which contrasts with the result from the univariate search. The RFECV method can detect whether a variable is important, no matter whether it is binary, categorical, or numeric, because it directly evaluates the role played by the feature in the prediction.

The RFECV method is certainly more effective than the univariate approach because it takes highly correlated features into account and is tuned to optimize the evaluation measure (which usually is not chi-square or F-score). Being a greedy process, however, it’s computationally demanding and may only approximate the best set of predictors.

As RFECV learns the best set of variables from data, the selection may overfit, which is what happens with all other machine-learning algorithms. Trying RFECV on different samples of the training data can confirm the best variables to use.
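
One way to run such a check is to refit RFECV on several random subsamples and count how often each feature is selected. The sketch below assumes synthetic data, ten rounds, and 70% subsamples, all of which are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption made for this sketch)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

rng = np.random.default_rng(0)
n_rounds = 10
subsample_size = len(X_demo) * 7 // 10  # 70% of the rows
selection_counts = np.zeros(X_demo.shape[1], dtype=int)

for _ in range(n_rounds):
    # Draw a random subsample of rows without replacement
    rows = rng.choice(len(X_demo), size=subsample_size, replace=False)
    selector = RFECV(LinearRegression(), cv=5).fit(X_demo[rows], y_demo[rows])
    # Add 1 for every feature selected in this round
    selection_counts += selector.support_

for idx, count in enumerate(selection_counts):
    print(f"feature {idx}: selected in {count}/{n_rounds} subsamples")
```

Features selected in nearly every round are the safer choices; those that appear only occasionally may reflect overfitting to a particular sample.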

In [3]:
import pandas as pd
import numpy as np

DFr = pd.read_csv("winequality-red.csv")
DFw = pd.read_csv("winequality-white.csv")

In [4]:
# .iloc/.to_numpy() replace the removed .ix/.as_matrix(); columns are selected by position
Xr = DFr.iloc[:, 0:10].to_numpy()
yr = DFr.iloc[:, 11].to_numpy()

Xw = DFw.iloc[:, 0:10].to_numpy()
yw = DFw.iloc[:, 11].to_numpy()
len(yr), len(yw)

(1599, 4898)

In [5]:
from sklearn.feature_selection import chi2

# Chi-square statistic and p-value of each feature against the target
chi, pvalue1 = chi2(Xr, yr)

for i in range(len(Xr[0])):
    print(chi[i], pvalue1[i])

#selector = SelectPercentile(f_classif, percentile=25)
#selector.fit(Xr, yr)
#print selector.pvalues_
#print selector.scores_ 

11.2606523696 0.0464500416471
15.5802890515 0.00815035154205
13.0256651036 0.0231394417254
4.12329473592 0.531804674961
0.75242557946 0.979968039781
161.936036048 3.82728810062e-33
2755.55798423 0.0
0.000230432045444 0.999999999957
0.154654735634 0.999526491006
4.5584877468 0.47209632134


In [6]:
from sklearn.feature_selection import f_classif

# ANOVA F-value and p-value of each feature against the target
fvalue, pvalue = f_classif(Xr, yr)

for i in range(len(Xr[0])):
    print(fvalue[i], pvalue[i])

6.28308115822 8.79396662384e-06
60.9139928316 3.32646506052e-58
19.6906644662 4.4210915745e-19
1.05337357785 0.384618775429
6.03563859236 1.52653902486e-05
4.75423310399 0.000257082723402
25.4785095183 8.53359844527e-25
13.3963569138 8.12439442344e-13
4.34176430319 0.000628438870133
22.2733760896 1.22589009185e-21


In [7]:
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile=25)
selector.fit(Xr, yr)

for i in range(len(selector.pvalues_)):
    print(selector.pvalues_[i], selector.scores_[i])

8.79396662384e-06 6.28308115822
3.32646506052e-58 60.9139928316
4.4210915745e-19 19.6906644662
0.384618775429 1.05337357785
1.52653902486e-05 6.03563859236
0.000257082723402 4.75423310399
8.53359844527e-25 25.4785095183
8.12439442344e-13 13.3963569138
0.000628438870133 4.34176430319
1.22589009185e-21 22.2733760896


In [8]:
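from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# SVC() defaults to an RBF kernel, which exposes neither coef_ nor
# feature_importances_, so RFECV cannot rank the features and fit() fails
clf = SVC()
selector = RFECV(estimator=clf,
                 cv=StratifiedKFold(n_splits=3),
                 scoring='roc_auc')

selector.fit(Xr, yr)

print(selector.estimator_)
print(selector.n_features_)
print(selector.support_)
print(selector.ranking_)
print(selector.cv_results_['mean_test_score'])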

RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
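
The error above is raised because RFECV ranks features through the estimator's `coef_` or `feature_importances_` attribute, and an SVC with its default RBF kernel exposes neither. A possible workaround, sketched here on synthetic data rather than on the wine data, is to use a linear kernel. The sketch also switches to `accuracy` scoring as a simplifying assumption, since the plain `roc_auc` scorer targets binary problems.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption made for this sketch)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# A linear kernel exposes coef_, which RFECV uses to rank the features
linear_clf = SVC(kernel="linear")
selector = RFECV(
    estimator=linear_clf, cv=StratifiedKFold(n_splits=3), scoring="accuracy"
)
selector.fit(X_demo, y_demo)

print(selector.n_features_)
print(selector.support_)
print(selector.ranking_)
```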

In [9]:
from sklearn.decomposition import PCA

pcaf = PCA(copy=True, n_components="mle", whiten=False)
pcat = PCA(copy=True, n_components="mle", whiten=True)

Xrf = pcaf.fit_transform(X)
Xrt = pcat.fit_transform(X)

NameError: name 'X' is not defined

In [10]:
from sklearn import linear_model
from sklearn.feature_selection import RFECV

regression = linear_model.LinearRegression()
# scikit-learn scorers follow "higher is better", so the mean squared error
# is exposed as "neg_mean_squared_error"
selector = RFECV(estimator=regression,
                 cv=10,
                 scoring="neg_mean_squared_error")
selector.fit(X, y)

print(selector.estimator_)
print(selector.n_features_)
print(selector.support_)
print(selector.ranking_)
print(selector.cv_results_["mean_test_score"])

NameError: name 'X' is not defined

In [6]:
from sklearn.model_selection import KFold, cross_val_score

In [7]:
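from sklearn.svm import LinearSVC
clf = LinearSVC(loss='squared_hinge')

# 5-fold cross-validation over the rows of X
kfold = KFold(n_splits=5)
scores = cross_val_score(clf, X, y, cv=kfold, n_jobs=-1)
# Note: the "+/-" figure is the square root of the standard deviation of the fold scores
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))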

Accuracy: 0.7002 (+/- 0.4422)


In [8]:
scores= cross_val_score(clf, Xrf, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.9384 (+/- 0.2027)


In [9]:
scores= cross_val_score(clf, Xrt, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.9495 (+/- 0.1814)


In [10]:
from sklearn.svm import SVC
clf = SVC()

kfold = KFold(n_splits=5)
scores = cross_val_score(clf, X, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.1060 (+/- 0.3811)


In [11]:
scores= cross_val_score(clf, Xrf, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.1005 (+/- 0.3841)


In [14]:
scores= cross_val_score(clf, Xrt, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.8646 (+/- 0.3082)


In [35]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=5)

In [36]:
scores= cross_val_score(kNN, X, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.6100 (+/- 0.5566)


In [37]:
scores= cross_val_score(kNN, Xrf, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.6100 (+/- 0.5566)


In [38]:
scores= cross_val_score(kNN, Xrt, y, cv=kfold, n_jobs=-1)
print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))

Accuracy: 0.6106 (+/- 0.4637)


In [None]:
# fit(X[, y])           Fit the model with X.
# fit_transform(X[, y]) Fit the model with X and apply the dimensionality reduction on X.
# get_covariance()      Compute data covariance with the generative model.
# get_params([deep])    Get parameters for this estimator.
# get_precision()       Compute data precision matrix with the generative model.
# inverse_transform(X)  Transform data back to its original space, i.e.,
# score(X[, y])         Return the average log-likelihood of all samples
# score_samples(X)      Return the log-likelihood of each sample
# set_params(**params)  Set the parameters of this estimator.
# transform(X)          Apply the dimensionality reduction on X.

In [45]:
# Evaluate the classifier on an increasing number of whitened principal components
for i in range(1, 13):
    pcai = PCA(n_components=i, whiten=True)
    print(pcai)
    Xri = pcai.fit_transform(X)
    scores = cross_val_score(clf, Xri, y, cv=kfold, n_jobs=-1)
    print("Accuracy: %0.4f (+/- %0.4f)" %(np.mean(scores), np.sqrt(np.std(scores))))
    print(sum(pcai.explained_variance_ratio_))
    print(pcai.noise_variance_)
    print()

PCA(copy=True, n_components=1, whiten=True)
Accuracy: 0.5487 (+/- 0.5617)
0.998091230492
15.7208047352

PCA(copy=True, n_components=2, whiten=True)
Accuracy: 0.5483 (+/- 0.5372)
0.999827146117
1.55306269038

PCA(copy=True, n_components=3, whiten=True)
Accuracy: 0.6208 (+/- 0.5672)
0.999922105074
0.769859900136

PCA(copy=True, n_components=4, whiten=True)
Accuracy: 0.8986 (+/- 0.1979)
0.99997232243
0.303940080331

PCA(copy=True, n_components=5, whiten=True)
Accuracy: 0.8986 (+/- 0.2309)
0.999984686115
0.189189889935

PCA(copy=True, n_components=6, whiten=True)
Accuracy: 0.8979 (+/- 0.2905)
0.999993148245
0.0967400468496

PCA(copy=True, n_components=7, whiten=True)
Accuracy: 0.8983 (+/- 0.2800)
0.99999595506
0.0666290119774

PCA(copy=True, n_components=8, whiten=True)
Accuracy: 0.8978 (+/- 0.3478)
0.99999747814
0.0498486524068

PCA(copy=True, n_components=9, whiten=True)
Accuracy: 0.8644 (+/- 0.3264)
0.999998605971
0.0344440636005

PCA(copy=True, n_components=10, whiten=True)
Accuracy: 0