In [1]:
import pandas as pd

In [2]:
url = r"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names = names)

In [3]:
data.head()

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1


In [4]:
data.values

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

# 1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi-squared (chi^2) statistical test, which requires non-negative feature values, to select the 4 best features from the Pima Indians onset of diabetes dataset.

In [5]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

array = data.values
x = array[:,:8]
y = array[:,8]

In [6]:
# feature selection

model = SelectKBest(score_func=chi2, k=4)  # k=4: keep the 4 highest-scoring features
model.fit(x, y)

SelectKBest(k=4, score_func=<function chi2 at 0x00000207EF127D90>)

In [7]:
# summarizing the scores
model.scores_        # features with high scores are considered to be best features

array([  111.51969064,  1411.88704064,    17.60537322,    53.10803984,
        2175.56527292,   127.66934333,     5.39268155,   181.30368904])

In [8]:
print(model.transform(x[:5]))

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
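The score array above is positional, so it helps to pair each score with its column name. The sketch below does this in plain Python, using the feature names from the loading step and the scores printed above (copied by hand). The variable names `ranked`, `top4`, and `selected` are illustrative and not part of the notebook.

```python
# Column names for the 8 input features (the 'class' target is excluded).
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Chi-squared scores copied from the model.scores_ output above.
scores = [111.51969064, 1411.88704064, 17.60537322, 53.10803984,
          2175.56527292, 127.66934333, 5.39268155, 181.30368904]

# Sort (name, score) pairs from highest to lowest score.
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<5} {:10.2f}".format(name, score))

# The k=4 highest-scoring features, listed in original column order.
top4 = {name for name, _ in ranked[:4]}
selected = [name for name in names if name in top4]
print(selected)
```

The four names it prints (plas, test, mass, age) line up with the columns of the transformed rows shown above. With the fitted selector at hand, `model.get_support()` gives the same information as a boolean mask.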


# 2. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain.

At each step, scikit-learn's RFE uses the fitted model's coefficients or feature importances to identify which attributes contribute the least to predicting the target attribute, and removes them.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
lr = LogisticRegression()
model = RFE(lr, n_features_to_select=3)  # passed by keyword, as newer scikit-learn versions require
model.fit(x, y)
print("Num features : {}".format(model.n_features_))
print("Selected Features : {}".format(model.support_))
# RFE marks the selected features with True in support_ and rank 1 in ranking_, here preg, mass and pedi
print("Features ranking: {}".format(model.ranking_))

Num features : 3
Selected Features : [ True False False False False  True  True False]
Features ranking: [1 2 3 5 6 1 1 4]
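To make the elimination loop concrete, here is a minimal sketch of the idea in NumPy. It is not scikit-learn's implementation. As assumptions of the sketch, it swaps in ordinary least squares for logistic regression, ranks features by the absolute size of their fitted coefficients, and runs on synthetic data whose columns share the same scale. The function name `rfe_sketch` is hypothetical.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # column indices in the order they were dropped
    while len(remaining) > n_keep:
        # Refit the model using only the remaining features.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Treat the feature with the smallest absolute coefficient as weakest.
        weakest = int(np.argmin(np.abs(coef)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

kept, dropped = rfe_sketch(X, y, n_keep=2)
print("kept:", kept)
print("dropped (first to last):", dropped)
```

Comparing raw coefficient sizes only makes sense when the features are on comparable scales, which holds for this synthetic data but not for the unscaled Pima columns.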


# 3. PCA

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the example below, we use PCA and select 3 principal components.

In [27]:
from sklearn.decomposition import PCA
model = PCA(n_components = 3)
model.fit(x)  # PCA is unsupervised, so the labels y are not needed
print("Explained Variance : {}".format(model.explained_variance_ratio_))
print("Components : {}".format(model.components_))

Explained Variance : [ 0.88854663  0.06159078  0.02579012]
Components : [[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
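The linear algebra behind these numbers can be sketched in a few lines of NumPy: center the data, take a singular value decomposition, and read the components and variance ratios off the result. This is an illustration under stated assumptions, not scikit-learn's code. The function name `pca_sketch` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Return (components, explained_variance_ratio) for the top components."""
    # Center each column; PCA works on deviations from the mean.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Squared singular values are proportional to the variance along each direction.
    variance = s ** 2 / (X.shape[0] - 1)
    ratio = variance / variance.sum()
    return Vt[:n_components], ratio[:n_components]

# Synthetic data (an assumption for illustration): columns with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

components, ratio = pca_sketch(X, n_components=3)
print("Explained variance ratio:", ratio)
print("Components shape:", components.shape)
```

In this synthetic case the first component is dominated by the column with the largest spread. The output above, where the first component loads almost entirely on `test` (0.993), suggests a similar scale effect. Standardizing the features before PCA is a common remedy.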


# 4. Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier and print the importance it assigns to each feature.

In [29]:
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier()  # no random_state is set, so the importances vary between runs
et.fit(x, y)
print(et.feature_importances_)

[ 0.11436446  0.22786127  0.09549275  0.07803735  0.07690532  0.13947117
  0.11851198  0.1493557 ]
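As with the chi-squared scores, the importances are easier to read next to their column names. The sketch below uses the values printed above (copied by hand) and then applies one simple cut-off, keeping features whose importance exceeds the mean. That cut-off is an assumption chosen for illustration, not something the notebook prescribes.

```python
# Column names for the 8 input features (the 'class' target is excluded).
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Importances copied from the et.feature_importances_ output above.
importances = [0.11436446, 0.22786127, 0.09549275, 0.07803735,
               0.07690532, 0.13947117, 0.11851198, 0.1493557]

# Sort (name, importance) pairs from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("{:<5} {:.3f}".format(name, value))

# Illustrative rule (an assumption): keep features above the mean importance.
threshold = sum(importances) / len(importances)
above_mean = [name for name, value in zip(names, importances) if value > threshold]
print(above_mean)
```

Because the classifier is randomized, rerunning the notebook cell gives slightly different importances, so a feature close to the threshold may move in or out of the selection.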
