Source: https://machinelearningmastery.com/feature-selection-machine-learning-python/

# Feature Selection 
- The features used to train an ML model have a large influence on the performance it achieves: irrelevant or partially relevant features can hurt the model, especially linear algorithms such as linear or logistic regression
- DEF: feature selection is the process of automatically selecting the features in the data that contribute most to the prediction variable or output you are interested in
- Three benefits:
1. Reduces overfitting: less redundant data, fewer decisions based on noise
2. Improves accuracy: less misleading data, modeling accuracy improves
3. Reduces training time: less data, faster training 

# 1. Univariate selection
- statistical tests can be used to select the features that have the strongest relationship with the output variable
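
To make the idea of a per-feature statistical test concrete, here is a minimal pure-Python sketch of a chi-squared score computed one feature at a time. The function name `chi2_scores` and the toy data are hypothetical, and the formula is a simplified reading of the statistic (observed versus expected per-class feature totals). scikit-learn's `chi2` may differ in details, and it requires non-negative feature values.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature: sum over classes of (observed - expected)^2 / expected."""
    n_samples = len(y)
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        # Total of feature j over all samples
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Share of samples belonging to class c
            class_prob = sum(1 for label in y if label == c) / n_samples
            # Feature total actually seen in class c vs. the total expected if
            # the feature were unrelated to the class
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy data (made up): feature 0 tracks the class, feature 1 is constant
X_toy = [[1, 5], [2, 5], [8, 5], [9, 5]]
y_toy = [0, 0, 1, 1]
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)
```

On this toy data the feature that tracks the class gets a high score and the constant feature scores zero, which is the ordering a selector such as `SelectKBest` relies on.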

In [2]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])



[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]


- The 4 chosen are the ones with the highest scores: plas, test, mass and age 
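
The printed score array has no labels, so a quick way to check which columns were kept is to pair the scores with the column names. This is a small sketch that copies the scores from the output above instead of refitting the selector; the variable names `ranked` and `top4` are just illustrative.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Scores copied from the printed SelectKBest output
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Pair each name with its score and sort from highest to lowest
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
top4 = [name for name, _ in ranked[:4]]
print(top4)
```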

# 2. Recursive Feature Elimination
- RFE: recursively removes attributes and builds a model on the attributes that remain
- uses the fitted model (in scikit-learn, its coefficients or feature importances) to ID which attributes (and combos of attributes) contribute the most to predicting the target attribute

In [3]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression 
model = LogisticRegression(solver='lbfgs')
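# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)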


Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]




In [6]:
dataframe.head()

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1


- Top 3 Features: preg, mass, pedi 
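
The boolean mask and the ranking array can be turned into feature names with plain Python. The sketch below copies both arrays from the RFE output above. It assumes the usual reading of `ranking_`: rank 1 marks a selected feature, and among the rest a larger rank means the feature was eliminated earlier.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Values copied from the printed RFE output
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 4, 5, 6, 1, 1, 3]

# Keep the names whose mask entry is True
selected = [name for name, keep in zip(names, support) if keep]

# Features with rank > 1 were dropped; sort so the earliest-dropped comes first
dropped = [(rank, name) for name, rank in zip(names, ranking) if rank > 1]
elimination_order = [name for _, name in sorted(dropped, reverse=True)]

print(selected)
print(elimination_order)
```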

# 3. Principal Component Analysis
- uses linear algebra to transform the data into a compressed form 
- data reduction technique
- you select the # of principal components to keep in the transformed result

In [8]:
# feature extraction with PCA
from sklearn.decomposition import PCA
# feature extraction 
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
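
Two quick readings of this output can be sketched in plain Python: how much variance the three components retain together, and which original column dominates the first component. The numbers are copied from the output above; the names `cumulative` and `dominant` are only illustrative.

```python
from itertools import accumulate

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Values copied from the printed PCA output
explained = [0.889, 0.062, 0.026]
first_component = [-2.022e-03, 9.781e-02, 1.609e-02, 6.076e-02,
                   9.931e-01, 1.401e-02, 5.372e-04, -3.565e-03]

# Running total of the explained variance ratio
cumulative = list(accumulate(explained))
print([round(value, 3) for value in cumulative])

# Column with the largest absolute loading in the first component
dominant = max(zip(names, first_component), key=lambda pair: abs(pair[1]))
print(dominant)
```

With these numbers, the three components together retain about 97.7% of the variance, and the first component is almost entirely the `test` column.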


# 4. Feature Importance 
- uses bagged decision trees such as Random Forest and Extra Trees to estimate the importance of features


In [10]:
# Feature Importance with Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier
# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X,Y)
print(model.feature_importances_)

[0.104 0.227 0.096 0.074 0.07  0.14  0.127 0.161]


- The larger the score the more important the attribute 
- Scores suggest the importance of plas, age, and mass (exact values vary between runs because no random_state is set)
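
Since three of the methods in these notes each name a handful of features, one simple way to compare them is to count how often each feature is picked. This is only a rough tally over the results listed above, not a technique from the source article. PCA is left out because it produces new components, not a subset of the original columns.

```python
from collections import Counter

# Feature picks as reported in the sections above
picks = {
    'univariate (chi2)': ['plas', 'test', 'mass', 'age'],
    'RFE': ['preg', 'mass', 'pedi'],
    'extra trees': ['plas', 'age', 'mass'],
}

# Count how many methods chose each feature
votes = Counter(name for chosen in picks.values() for name in chosen)
for name, count in votes.most_common():
    print(name, count)
```

In this tally `mass` is the only feature chosen by all three methods, while `plas` and `age` are chosen by two.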