# Feature Selection
Feature selection is the process of automatically selecting the features in your data that contribute most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

### Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
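
To make the idea of a univariate test concrete, here is a minimal pure-Python sketch of a chi-squared score per feature. It assumes the statistic compares the per-class total of each feature with the total expected if the feature were independent of the class, which is, to the best of my understanding, how scikit-learn's chi2 treats its input. The function name `chi2_scores` and the toy data are made up for illustration, and the features are assumed to be non-negative with a non-zero column total.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature column (illustrative helper, not a library API).

    Assumes non-negative feature values and a non-zero total for every column.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_rows = [row for row, label in zip(X, y) if label == c]
            # Observed: how much of this feature falls in class c.
            observed = sum(row[j] for row in class_rows)
            # Expected: the share class c would get if feature and class were independent.
            expected = feature_total * len(class_rows) / n_samples
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy data: feature 0 separates the two classes, feature 1 is constant.
X_toy = [[1, 5], [2, 5], [8, 5], [9, 5]]
y_toy = [0, 0, 1, 1]
scores = chi2_scores(X_toy, y_toy)
print(scores)
```

The first feature gets a high score because its values concentrate in one class, while the constant feature scores zero. SelectKBest keeps the k features with the highest scores.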

In [1]:
#Feature Extraction with Univariate Statistical Test (Chi-squared for classification)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)
#summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features[0:5,:])

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]


You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass, and age. The names are mapped manually from the column index to the attribute name.
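
The manual mapping from scores to names can also be done in a few lines of code. The sketch below copies the attribute names and the printed scores from the cell above. The helper `top_k_features` is a hypothetical name introduced here for illustration.

```python
# Attribute names and chi-squared scores copied from the cell above.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]


def top_k_features(names, scores, k):
    """Return the k names with the highest scores, kept in original column order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [names[i] for i in sorted(best)]


selected = top_k_features(names, scores, k=4)
print(selected)
```

With a fitted selector, the boolean mask returned by its `get_support()` method can be used to pick names in the same way.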

### Recursive Feature Elimination
Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain. It uses the fitted model's coefficients or feature importances to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

In [5]:
#Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
#load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
model = LogisticRegression()
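#n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)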
fit = rfe.fit(X, Y)
print("Num Features: %d" %fit.n_features_)
print("Selected Features: %s" %fit.support_)
print("Feature Ranking %s" %fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking [1 2 3 5 6 1 1 4]


You can see that RFE chose the top 3 features as preg, mass, and pedi. These are marked True in the support_ array and given a rank of 1 in the ranking_ array.
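
The elimination loop itself is simple enough to sketch without a library. The version below is a simplified illustration, not scikit-learn's implementation: it drops one feature per round, and the `importance_fn` callback is a hypothetical stand-in for "refit the model on the remaining features and report an importance for each". The toy weights are invented for the demonstration.

```python
def recursive_feature_elimination(names, importance_fn, n_to_select):
    """Drop the least important feature one at a time until n_to_select remain.

    importance_fn(remaining) is a hypothetical callback that returns a dict
    mapping each remaining feature name to its importance.
    """
    remaining = list(names)
    ranking = {}
    while len(remaining) > n_to_select:
        importances = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        # Features dropped earlier receive a larger (worse) rank.
        ranking[weakest] = len(remaining) - n_to_select + 1
        remaining.remove(weakest)
    for name in remaining:
        ranking[name] = 1  # every surviving feature shares rank 1
    return remaining, ranking


# Toy stand-in for a model: fixed weights, renormalised over the remaining features.
toy_weights = {'a': 0.5, 'b': 0.1, 'c': 0.3, 'd': 0.05}


def toy_importance(remaining):
    total = sum(toy_weights[name] for name in remaining)
    return {name: toy_weights[name] / total for name in remaining}


selected, ranking = recursive_feature_elimination(list(toy_weights), toy_importance, 2)
print(selected)
print(ranking)
```

The ranking follows the same convention as the ranking_ array above: selected features get rank 1, and the first feature eliminated gets the largest rank.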

### Principal Component Analysis (PCA)
Principal Component Analysis (PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. Each component is a linear combination of the original attributes, so the transformed data bears little resemblance to the source columns.

In [6]:
#Feature Extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
#Feature Extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize Components
print("Explained Variance : %s" %fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance : [ 0.889  0.062  0.026]
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]
 [ -2.246e-02   1.434e-01  -9.225e-01  -3.070e-01   2.098e-02  -1.324e-01
   -6.400e-04  -1.255e-01]]
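
One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below uses only the three ratios printed above, so it cannot say anything about components beyond the third. The helper `components_needed` and the 95% target are assumptions made for illustration.

```python
from itertools import accumulate

# Explained variance ratios copied from the PCA output above (first 3 components only).
explained_variance_ratio = [0.889, 0.062, 0.026]


def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target.

    Returns None if the listed ratios never reach the target.
    """
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= target:
            return count
    return None


cumulative = [round(c, 3) for c in accumulate(explained_variance_ratio)]
print(cumulative)
print(components_needed(explained_variance_ratio, 0.95))
```

Here the first component alone explains most of the variance, and the three components together account for roughly 97.7% of it. scikit-learn's PCA can also accept a fractional n_components to choose the count by explained variance, though it is worth checking the documentation for the version you use.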


### Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In [8]:
#Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
#Load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
#Feature Extraction
model = ExtraTreesClassifier()
model.fit(X,Y)
print(model.feature_importances_)

[ 0.108  0.253  0.093  0.071  0.069  0.141  0.116  0.149]


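You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. Here the scores suggest that plas, age, and mass are the most important. Because the trees are built with randomness and no seed is set, the exact scores may differ between runs.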
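
Importance scores can be turned into a feature selection by applying a threshold. The sketch below uses the scores printed above and keeps attributes whose importance is at least the mean. The mean threshold is an assumption chosen for illustration. scikit-learn's SelectFromModel class applies a similar thresholding idea to a fitted estimator.

```python
# Attribute names and importance scores copied from the cell above.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
importances = [0.108, 0.253, 0.093, 0.071, 0.069, 0.141, 0.116, 0.149]

# Assumed cut-off: the mean importance across all attributes.
threshold = sum(importances) / len(importances)

# Rank attributes from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%-5s %.3f" % (name, score))

# Keep attributes at or above the threshold, in original column order.
selected = [name for name, score in zip(names, importances) if score >= threshold]
print(selected)
```

Since these scores vary between runs, attributes near the threshold (such as pedi here) may move in or out of the selection.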