# Demo of feature selection methods in Python

Three methods for feature selection are exemplified in this notebook.
* SelectKBest (filter)
* RFE with logistic regression (wrapper)
* Ridge (embedded)

The code is adapted from https://www.datacamp.com/community/tutorials/feature-selection-python, with improved notation.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load data: from URL or local file
# uri = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
uri = "pima-indians-diabetes-no-header.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(uri, names=names)

In [3]:
df.head(6)

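|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| 5 | 5    | 116  | 74   | 0    | 0    | 25.6 | 0.201 | 30  | 0     |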


In [4]:
df.shape

(768, 9)

There are 8 different features, and the outcomes (last column) are labeled 1 and 0:
* 1 denotes that the observed person has diabetes
* 0 denotes that the observed person *does not* have diabetes 

Note. The original dataset is known to have missing values: in some columns, a missing observation is encoded as zero. For this demo, we use a preprocessed version of the dataset. It can be obtained from the URL above.
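A quick way to see how many zero entries a given copy of the data contains is to count them per column. The sketch below re-types the six rows shown by `df.head(6)` so that it runs on its own. The list `suspect_cols` is an assumption about which columns cannot plausibly be zero, not something stated by the dataset itself.

```python
import pandas as pd

# The six rows shown by df.head(6) above, re-typed so the sketch is self-contained.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
rows = [
    [6, 148, 72, 35,   0, 33.6, 0.627, 50, 1],
    [1,  85, 66, 29,   0, 26.6, 0.351, 31, 0],
    [8, 183, 64,  0,   0, 23.3, 0.672, 32, 1],
    [1,  89, 66, 23,  94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
    [5, 116, 74,  0,   0, 25.6, 0.201, 30, 0],
]
sample = pd.DataFrame(rows, columns=names)

# Assumption: a zero is only suspicious in these measurement columns;
# a zero in 'preg' or 'class' is a legitimate value.
suspect_cols = ['plas', 'pres', 'skin', 'test', 'mass']

# Count the zero entries in each suspicious column.
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```

On the full data, the same two lines can be applied to `df` instead of `sample`.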

In [5]:
# convert the DataFrame object to a NumPy array to achieve faster computation
pima_array = df.values
# separate features from labels
X = pima_array[:,0:8]
Y = pima_array[:,8]

## 1. Select best features with Chi-squared

We use the Chi-Squared statistical test, which requires *non-negative* features, to select the 4 best features from the dataset. The Chi-Squared test belongs to the class of filter methods.

Scikit-learn provides the `SelectKBest` class that can be used with a suite of different statistical tests (in our case, Chi-Squared) to select a specific number of features.
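To make the mechanics concrete, the following sketch calls `chi2` directly, keeps the indices of the `k` largest scores by hand, and checks that `SelectKBest` picks the same columns. It uses only the six rows displayed earlier (the names `X_demo` and `y_demo` are introduced here for illustration), so the scores are far too noisy to be meaningful and will not match the full-data results.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# First six rows of the dataset (features only) and their labels, from df.head(6).
X_demo = np.array([
    [6, 148, 72, 35,   0, 33.6, 0.627, 50],
    [1,  85, 66, 29,   0, 26.6, 0.351, 31],
    [8, 183, 64,  0,   0, 23.3, 0.672, 32],
    [1,  89, 66, 23,  94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
    [5, 116, 74,  0,   0, 25.6, 0.201, 30],
])
y_demo = np.array([1, 0, 1, 0, 1, 0])

# chi2 returns one score and one p-value per feature.
scores, p_values = chi2(X_demo, y_demo)

# Keep the indices of the k largest scores, reported in column order.
k = 4
top_k = np.sort(np.argsort(scores)[-k:])

# SelectKBest should arrive at the same columns.
selector = SelectKBest(score_func=chi2, k=k).fit(X_demo, y_demo)
print(top_k, selector.get_support(indices=True))
```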

In [6]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [7]:
# Feature selection -- without transformation
sel = SelectKBest(score_func = chi2, k = 4)
sel.fit(X, Y)

SelectKBest(k=4, score_func=<function chi2 at 0x00000154D650DCA0>)

In [8]:
# Summarize scores
np.set_printoptions(precision=3)
print(sel.scores_)
print(sel.get_support())
print(sel.get_support(indices=True))

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[False  True False False  True  True False  True]
[1 4 5 7]


`scores_` holds the Chi-Squared score of each attribute. The four attributes chosen are those with the highest scores: plas, test, mass, and age. `get_support()` returns the selection as a boolean mask, or as column indices when called with `indices=True`.
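The boolean mask can be turned into feature names with plain NumPy indexing. This small sketch copies the mask printed above rather than refitting the selector; `feature_names` and `support_mask` are names introduced here for illustration.

```python
import numpy as np

feature_names = np.array(['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'])

# Boolean mask as printed by sel.get_support() above.
support_mask = np.array([False, True, False, False, True, True, False, True])

# Boolean indexing keeps only the names where the mask is True.
selected_names = feature_names[support_mask].tolist()
print(selected_names)
```

In the notebook itself, `support_mask` would simply be `sel.get_support()`.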

In [9]:
X_new = sel.transform(X)

In [10]:
# Display the first 4 lines with the 4 selected features
print(X_new[0:4])

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]]


In [11]:
# Let's try to do it on the df, to see the column names
# https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
sel.fit(df[names[0:8]], df[names[8]])  # pass the labels as a 1-D Series
print(sel.scores_)
cols = sel.get_support(indices=True)
print(cols)
df_features_new = df[names[0:8]].iloc[:,cols]
df_features_new.head(4) # we get the df and names, but haven't added the class (Y)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[1 4 5 7]


|   | plas | test | mass | age |
|---|------|------|------|-----|
| 0 | 148  | 0    | 33.6 | 50  |
| 1 | 85   | 0    | 26.6 | 31  |
| 2 | 183  | 0    | 23.3 | 32  |
| 3 | 89   | 94   | 28.1 | 21  |


## 2. Select best features with Recursive Feature Elimination

RFE is a type of wrapper feature selection method. It works by repeatedly fitting a model, removing the least important attribute(s), and refitting on the attributes that remain. It uses the fitted model's coefficients or feature importances to identify which attributes contribute the most to predicting the target attribute.
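The elimination loop can be sketched by hand. The helper `manual_rfe` below is a hypothetical illustration, not scikit-learn's implementation: it refits a logistic regression on the remaining columns and drops the one with the smallest squared coefficient, one feature per round. It runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained) and is compared with `RFE` on the same data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration, not the Pima dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest squared coefficient until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the columns that are still in play.
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        importances = model.coef_.ravel() ** 2
        # Remove the weakest feature of this round.
        weakest = int(np.argmin(importances))
        del remaining[weakest]
    return remaining

kept = manual_rfe(X_syn, y_syn, n_keep=3)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_syn, y_syn)
print(kept, rfe.get_support(indices=True).tolist())
```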

In [12]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# We will use RFE with the Logistic Regression classifier to select the top 3 features

In [13]:
# Feature extraction
estimator = LogisticRegression(solver='lbfgs', max_iter=200) # increased 'max_iter' to avoid an error message
sel2 = RFE(estimator, n_features_to_select=3)
sel2.fit(X, Y)
print("Number of features:", sel2.n_features_)
print("Selected features: ", sel2.support_)
print("Feature ranking:   ", sel2.ranking_)
# Reminder: names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

Number of features: 3
Selected features:  [ True False False False False  True  True False]
Feature ranking:    [1 2 4 6 5 1 1 3]


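RFE chose the following top 3 features: preg, mass, and pedi. These are marked `True` in the support array and have rank 1 in the ranking array. The ranks of the remaining features reflect the order of elimination: the higher the rank, the earlier the feature was removed, so plas (rank 2) was the last one dropped and skin (rank 6) the first.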
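To read the ranking more easily, it can be paired with the feature names and sorted. This sketch copies the ranking printed above instead of refitting the model; the variable names are introduced here for illustration.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
ranking = [1, 2, 4, 6, 5, 1, 1, 3]  # sel2.ranking_ as printed above

# Sort by rank only; Python's sort is stable, so ties keep their column order.
ordered = sorted(zip(ranking, names), key=lambda pair: pair[0])
for rank, name in ordered:
    print(rank, name)
```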

## 3. Select best features with Ridge regression

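Ridge regression is linear least squares with an L2 penalty on the coefficients. Because the fitted coefficients indicate how much each feature contributes to the prediction, it can also be used as an embedded feature selection technique.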
See [this article](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/#three) for an excellent explanation on Ridge regression. You can also check scikit-learn's official documentation on Ridge regression.
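As a minimal sketch of what the L2 penalty does, the ridge coefficients for a model without intercept have the closed form $w = (X^T X + \alpha I)^{-1} X^T y$. The code below evaluates that formula with NumPy on synthetic data (an assumption, chosen so the example is self-contained) and compares it with `Ridge(fit_intercept=False)`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 4))
y_syn = X_syn @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0
# Closed-form ridge solution without an intercept: w = (X^T X + alpha I)^(-1) X^T y
w_closed = np.linalg.solve(X_syn.T @ X_syn + alpha * np.eye(4), X_syn.T @ y_syn)

# fit_intercept=False so that scikit-learn solves the same problem as the formula.
model_syn = Ridge(alpha=alpha, fit_intercept=False).fit(X_syn, y_syn)
print(np.round(w_closed, 3))
print(np.round(model_syn.coef_, 3))
```

Larger values of `alpha` shrink all coefficients toward zero, but in general none of them becomes exactly zero.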

In [14]:
from sklearn.linear_model import Ridge
# Linear least squares with l2 regularization.

In [15]:
model = Ridge(alpha=1.0)
model.fit(X,Y)
# Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)

Ridge()

In order to better understand the results of Ridge regression, we use a helper function that prints the results so that we can interpret them easily.

In [16]:
def pretty_print_coefs(coefs, names=None, sort=False):
    """Format coefficients as a readable linear expression."""
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # largest absolute coefficient first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

In [17]:
# pass Ridge model's coefficient terms to this function

print("Ridge model:", pretty_print_coefs(model.coef_))

Ridge model: 0.021 * X0 + 0.006 * X1 + -0.002 * X2 + 0.0 * X3 + -0.0 * X4 + 0.013 * X5 + 0.145 * X6 + 0.003 * X7


This prints each coefficient next to its feature. The coefficients with the largest absolute values point to the most influential features. Here, feature 6 (pedi) stands out, followed by 0 (preg) and 5 (mass). This matches the RFE selection (0, 5, 6) but differs from the Chi-Squared selection (1, 4, 5, 7).
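One caveat when ranking features by coefficient size is that a coefficient's magnitude depends on the unit of its feature. The sketch below uses synthetic data (an assumption, not the Pima dataset) in which three features matter equally, then rescales one of them by a factor of 1000. Standardizing with `StandardScaler` before fitting is one common way to make the coefficients comparable; it is shown here as an option, not as what this notebook does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: all three features contribute equally to the target.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn.sum(axis=1) + rng.normal(scale=0.1, size=200)

# Express feature 0 in a unit that is 1000 times smaller (e.g. grams instead of kilograms).
X_units = X_syn.copy()
X_units[:, 0] *= 1000

# Fit once on the raw columns and once on standardized columns.
raw_coef = Ridge(alpha=1.0).fit(X_units, y_syn).coef_
std_coef = Ridge(alpha=1.0).fit(StandardScaler().fit_transform(X_units), y_syn).coef_
print(np.round(raw_coef, 4))
print(np.round(std_coef, 4))
```

On the raw columns, feature 0 gets a much smaller coefficient than the others although it is equally informative; after standardization the three coefficients are of similar size.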

In [18]:
from sklearn.feature_selection import SelectFromModel

In [19]:
sel3 = SelectFromModel(estimator=Ridge(), threshold="0.5*mean")
sel3.fit(X, Y)

SelectFromModel(estimator=Ridge(), threshold='0.5*mean')

In [20]:
print(sel3.estimator_.coef_)
print(sel3.threshold_)
print(sel3.get_support())
print(sel3.get_support(indices=True))
# not available : sel3.n_features_, sel3.support_, sel3.ranking_

[ 0.021  0.006 -0.002  0.    -0.     0.013  0.145  0.003]
0.01190245478065201
[ True False False False False  True  True False]
[0 5 6]
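The threshold string `"0.5*mean"` means half the mean feature importance, where the importance of a feature in a single-target linear model is the absolute value of its coefficient. Features whose importance is at least the threshold are kept. The sketch below reproduces this by hand from the rounded coefficients printed above, so the threshold agrees with `sel3.threshold_` only up to rounding.

```python
import numpy as np

# Rounded coefficients as printed by sel3.estimator_.coef_ above.
coef = np.array([0.021, 0.006, -0.002, 0.0, -0.0, 0.013, 0.145, 0.003])

# Importance is the absolute coefficient; the threshold is half its mean.
importance = np.abs(coef)
threshold = 0.5 * importance.mean()

# Keep the features whose importance reaches the threshold.
selected = np.flatnonzero(importance >= threshold)
print(threshold, selected)
```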


### Let's try with Lasso too.

In [21]:
from sklearn.linear_model import Lasso

In [22]:
sel4 = SelectFromModel(estimator=Lasso(alpha=1.0), threshold="mean")

In [23]:
sel4.fit(X, Y)

SelectFromModel(estimator=Lasso(), threshold='mean')

In [24]:
print(sel4.estimator_.coef_)
print(sel4.get_support(indices=True))

[ 0.     0.006  0.     0.    -0.     0.     0.     0.   ]
[1]
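With `alpha=1.0`, Lasso sets all but one coefficient to exactly zero here, so the choice of `alpha` largely decides how many features survive. The sketch below illustrates this on synthetic data (an assumption, not the Pima dataset) with one strong and one weak feature: a large `alpha` keeps only the strong one, while a small `alpha` also keeps the weak one. The dictionary `coefs` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 5))
# Only features 0 (strong) and 1 (weak) influence the target.
y_syn = 3.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=500)

# Fit Lasso with a strong and a weak penalty and list the non-zero coefficients.
coefs = {}
for alpha in (1.0, 0.01):
    coefs[alpha] = Lasso(alpha=alpha).fit(X_syn, y_syn).coef_
    print(alpha, np.flatnonzero(coefs[alpha]))
```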
