# Feature Selection For Machine Learning
- The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.
- Irrelevant or partially relevant features can negatively impact model performance.


This notebook covers four techniques:

1. Univariate Selection.
2. Recursive Feature Elimination.
3. Principal Component Analysis.
4. Feature Importance.

# Why feature selection
Feature selection is the process of automatically selecting the features in your data that contribute most to the prediction variable or output you are interested in.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

See the scikit-learn feature selection guide: http://scikit-learn.org/stable/modules/feature_selection.html

## Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi-squared (chi2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

For background on the chi-squared test, see this Khan Academy video: https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests/chi-square-goodness-of-fit-tests/v/pearson-s-chi-square-test-goodness-of-fit
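
To make the chi-squared score more concrete, here is a minimal hand-rolled sketch. It assumes the score for a feature compares the per-class sum of that feature (observed) with the sum you would expect if the feature were independent of the class (expected). The function name `chi2_scores` and the toy data are hypothetical, made up for illustration; this is not scikit-learn's code.

```python
def chi2_scores(X, y):
    """Chi-squared score of each column of X against the class labels y.

    Assumes non-negative feature values and a positive total per column.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        column_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Fraction of samples that belong to class c
            class_fraction = sum(1 for label in y if label == c) / n_samples
            # Observed: sum of the feature over samples of class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: class share of the column total
            expected = class_fraction * column_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy data (hypothetical): column 0 differs between classes, column 1 does not
X_toy = [[1, 5], [2, 5], [8, 5], [9, 5]]
y_toy = [0, 0, 1, 1]
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)  # [9.8, 0.0]
```

Because the statistic is built from feature sums, multiplying a feature by a constant multiplies its score by the same constant. Features measured on larger scales can therefore receive larger scores, which is worth keeping in mind when comparing scores across columns.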


In [3]:
# Pima Indians Diabetes Dataset
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [4]:
# Load the dataset; the file has no header row, so column names are supplied
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.data', names=names)

In [5]:
# separate array into input and output components
X = df.drop('class',axis='columns')
y = df['class']

In [4]:
# feature selection: score every feature with chi2 and keep the 4 best
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

In [8]:
scr = fit.scores_

In [11]:
colss = X.columns

In [10]:
new_dict = dict(zip(colss, scr))

In [11]:
new_dict

{'age': 181.30368904430023,
 'mass': 127.66934333103606,
 'pedi': 5.39268154697145,
 'plas': 1411.8870406441411,
 'preg': 111.51969063588255,
 'pres': 17.605373215320718,
 'skin': 53.108039836324338,
 'test': 2175.5652729220137}

In [6]:
features = fit.transform(X)
features = pd.DataFrame(features)
features.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 148.0 | 0.0 | 33.6 | 50.0 |
| 1 | 85.0 | 0.0 | 26.6 | 31.0 |
| 2 | 183.0 | 0.0 | 23.3 | 32.0 |
| 3 | 89.0 | 94.0 | 28.1 | 21.0 |
| 4 | 137.0 | 168.0 | 43.1 | 33.0 |


#### You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.
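
Rather than reading the top scores by eye, the ranking can be done in code. The sketch below uses only the score values printed above; the variable names are hypothetical.

```python
# chi2 scores copied from the notebook output above
chi2_by_name = {
    'age': 181.30368904430023,
    'mass': 127.66934333103606,
    'pedi': 5.39268154697145,
    'plas': 1411.8870406441411,
    'preg': 111.51969063588255,
    'pres': 17.605373215320718,
    'skin': 53.108039836324338,
    'test': 2175.5652729220137,
}

# Sort feature names from highest to lowest score and keep the first 4
ranked = sorted(chi2_by_name, key=chi2_by_name.get, reverse=True)
top_4 = ranked[:4]
print(top_4)  # ['test', 'plas', 'age', 'mass']
```

With the fitted selector itself, `fit.get_support()` returns a boolean mask over the input columns that marks the same four features.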

## Recursive Feature Elimination
- Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain.
- It uses the fitted model (in scikit-learn, its coefficients or feature importances) to identify which attributes contribute the most to predicting the target attribute, and drops the weakest ones at each step.
- The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.
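
The elimination loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not scikit-learn's implementation: it assumes a plain least-squares linear model as the estimator, uses the absolute coefficient as the importance measure, and runs on synthetic data. The function name `recursive_elimination` is hypothetical.

```python
import numpy as np


def recursive_elimination(X, y, n_keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)  # kept features stay at rank 1
    while len(remaining) > n_keep:
        # Fit a least-squares linear model (with intercept) on what is left
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[:-1])))  # ignore the intercept
        # Features removed earlier receive a larger (worse) rank
        ranking[remaining[weakest]] = len(remaining) - n_keep + 1
        del remaining[weakest]
    return remaining, ranking


# Synthetic data (assumption): only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 3] + 0.1 * rng.normal(size=200)

kept, ranking = recursive_elimination(X_demo, y_demo, n_keep=2)
print(kept)     # indices of the surviving columns
print(ranking)  # 1 = kept, larger = eliminated earlier
```

Coefficient magnitudes are only comparable when the features share a scale, as they do in this synthetic data. On real data the columns would typically be standardized first.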

In [12]:
# Feature Extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [13]:
model = LogisticRegression()
rfe = RFE(model,n_features_to_select=3)
fit = rfe.fit(X,y)

In [15]:
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)

3
[ True False False False False  True  True False]
[1 2 3 5 6 1 1 4]


In [16]:
feat_dict = dict(zip(colss,fit.ranking_))
feat_dict

{'age': 4,
 'mass': 1,
 'pedi': 1,
 'plas': 2,
 'preg': 1,
 'pres': 3,
 'skin': 5,
 'test': 6}

#### You can see that RFE chose the top 3 features as preg, mass and pedi.

## Principal Component Analysis
- Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique.
- A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

See the scikit-learn PCA reference: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
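
As a rough sketch of the linear algebra involved, the explained variance ratio and the projected data can be computed with a singular value decomposition of the centered data. The synthetic data and variable names below are assumptions for illustration. Library implementations may differ in details such as sign conventions and solver choice.

```python
import numpy as np

# Synthetic data (assumption): four columns with very different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * np.array([10.0, 3.0, 1.0, 0.1])

# PCA works on centered data
centered = X_demo - X_demo.mean(axis=0)

# Rows of `components` are the principal directions, strongest first
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance captured by each component, and its share of the total
explained_variance = singular_values ** 2 / (len(X_demo) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Keep the first 2 components: project the data onto them
projected = centered @ components[:2].T

print(explained_variance_ratio)
print(projected.shape)  # (100, 2)
```

The column with the largest spread dominates the first component here. PCA is sensitive to feature scale, so standardizing the columns beforehand changes the result.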


In [1]:
from sklearn.decomposition import PCA

In [6]:
pca = PCA(n_components=3)
fit = pca.fit(X)

In [7]:
print(fit.explained_variance_ratio_)
fit.components_

[ 0.88854663  0.06159078  0.02579012]


array([[ -2.02176587e-03,   9.78115765e-02,   1.60930503e-02,
          6.07566861e-02,   9.93110844e-01,   1.40108085e-02,
          5.37167919e-04,  -3.56474430e-03],
       [ -2.26488861e-02,  -9.72210040e-01,  -1.41909330e-01,
          5.78614699e-02,   9.46266913e-02,  -4.69729766e-02,
         -8.16804621e-04,  -1.40168181e-01],
       [ -2.24649003e-02,   1.43428710e-01,  -9.22467192e-01,
         -3.07013055e-01,   2.09773019e-02,  -1.32444542e-01,
         -6.39983017e-04,  -1.25454310e-01]])

## Feature Importance
- Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
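
Tree ensembles in scikit-learn derive these importances from how much each feature's splits reduce impurity. The sketch below shows that idea for a single split only, using Gini impurity on toy data. The function names and data are hypothetical, and a real forest aggregates such reductions over many splits and trees.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_gain(values, labels):
    """Largest Gini decrease achievable with one threshold on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Try every threshold that leaves at least one sample on each side
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


# Toy data (hypothetical): 'signal' separates the classes, 'noise' does not
labels = [0, 0, 0, 1, 1, 1]
signal = [1, 2, 3, 7, 8, 9]
noise = [5, 1, 9, 5, 1, 9]

gains = {
    'signal': best_split_gain(signal, labels),
    'noise': best_split_gain(noise, labels),
}
print(gains)  # {'signal': 0.5, 'noise': 0.0}
```

A feature that separates the classes cleanly earns a large impurity decrease, while an uninformative one earns none.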

In [8]:
from sklearn.ensemble import RandomForestClassifier

In [9]:
rf = RandomForestClassifier()
rf.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [10]:
rf.feature_importances_

array([ 0.08306331,  0.26764388,  0.09741018,  0.08595467,  0.06843436,
        0.13552816,  0.11346537,  0.14850008])

In [13]:
imp_feats = dict(zip(colss,rf.feature_importances_))
feats_df = pd.DataFrame.from_dict(imp_feats,orient='index')
# https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe

In [17]:
feats_df.sort_values(0,ascending=False)

|      | 0 |
|------|---|
| plas | 0.267644 |
| age | 0.1485 |
| mass | 0.135528 |
| pedi | 0.113465 |
| pres | 0.09741 |
| skin | 0.085955 |
| preg | 0.083063 |
| test | 0.068434 |


### The scores suggest that plas, age and mass are the most important features.