## Feature Selection For Machine Learning

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [24]:
data = pd.read_csv('dataset/diabetes.csv')
X = data.iloc[:, 0:8].values
y = data.iloc[:, 8].values

<b>Feature Selection: </b> <br>
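A process that automatically selects the features in your data that contribute most to the variable you want to predict. Working with fewer, more relevant features can help in three ways:
- Reduces overfitting: less redundant data gives the model fewer chances to fit noise
- Can improve accuracy: less misleading data lets the model focus on the real signal
- Reduces training time: fewer features mean less data to process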

### Univariate Selection
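Select the features that have the strongest statistical relationship with the output variable. Each feature is scored on its own with a univariate test, and the `k` highest-scoring features are kept.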

In [25]:
from sklearn.feature_selection import SelectKBest, chi2

# chi2 scoring requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=4)
X_new = selector.fit_transform(X, y)
np.set_printoptions(precision=3)

X_new

array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       ...,
       [121. , 112. ,  26.2,  30. ],
       [126. ,   0. ,  30.1,  47. ],
       [ 93. ,   0. ,  30.4,  23. ]])
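
The transformed array alone does not say which input columns survived. The sketch below shows one way to recover that from the fitted selector with `scores_` and `get_support()`. It runs on synthetic data from `make_classification` instead of `diabetes.csv`, and the column names `f0` to `f7` are hypothetical placeholders, so the printed scores will not match the notebook's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
# Shift every column so its minimum is 0, because chi2 rejects negative values
X_demo = X_demo - X_demo.min(axis=0)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X_demo, y_demo)

# One chi2 score per input feature; a higher score means a stronger dependence on y
scores = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)
print(scores)

# get_support() returns a boolean mask over the input columns
selected_names = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected:", selected_names)
```

With the real dataset, `data.columns[:8]` could take the place of `feature_names` so the mask maps back to the actual column labels.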

### Recursive Feature Elimination
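RFE works by recursively removing attributes: it fits the model, drops the least important feature, and repeats on the remaining features until the requested number is left.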

In [26]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

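# Default settings; on the unscaled data the solver stops before converging (see the warning in the output)
model = LogisticRegression()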
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X, y)

print(f'Num Features: {fit.n_features_}')
print(f'Selected Features: {fit.support_}')
print(f'Feature Ranking: {fit.ranking_}')

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
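
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, assuming synthetic data from `make_classification` as a stand-in for `diabetes.csv`, could look like the following. Scaling can also change which features RFE keeps, since the logistic regression coefficients it ranks by depend on each feature's scale.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Standardize each feature to zero mean and unit variance so the solver converges more easily
X_scaled = StandardScaler().fit_transform(X_demo)

# Give the solver more iterations than the default of 100
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

print(f"Num Features: {rfe.n_features_}")
print(f"Selected Features: {rfe.support_}")
print(f"Feature Ranking: {rfe.ranking_}")
```

Features with rank 1 are the ones kept; higher ranks were eliminated earlier in the process.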


### Principal Component Analysis
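Transforms the dataset into a compressed form. PCA is a dimensionality reduction technique: unlike the methods above, it does not pick a subset of the original features but builds new ones (principal components) as linear combinations of them.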

In [30]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
fit = pca.fit(X)

# fraction of the total variance explained by each principal component
# PCA keeps the components with the largest variance, which capture the dominant directions of variation in the data
print(f'Explained Variance: {fit.explained_variance_ratio_}')

# how much each feature contributes to each principal component
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [ 2.265e-02  9.722e-01  1.419e-01 -5.786e-02 -9.463e-02  4.697e-02
   8.168e-04  1.402e-01]
 [ 2.246e-02 -1.434e-01  9.225e-01  3.070e-01 -2.098e-02  1.324e-01
   6.400e-04  1.255e-01]]
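
In the output above, the first component explains 0.889 of the variance and its largest loading (9.931e-01) sits on the fifth feature. Because PCA is sensitive to feature scale, a result like this often means one large-valued column dominates. The sketch below illustrates the effect on synthetic data (an assumption, not the diabetes dataset) by comparing PCA before and after standardizing with `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 independent columns, with the fifth on a much larger scale
X_demo = rng.normal(size=(300, 8))
X_demo[:, 4] *= 100.0

# PCA on the raw values: the large-scale column dominates the first component
pca_raw = PCA(n_components=3).fit(X_demo)
raw_ratio = pca_raw.explained_variance_ratio_

# PCA after standardizing: every feature contributes on an equal footing
X_scaled = StandardScaler().fit_transform(X_demo)
pca_scaled = PCA(n_components=3).fit(X_scaled)
scaled_ratio = pca_scaled.explained_variance_ratio_

print(f"Explained variance (raw):    {raw_ratio}")
print(f"Explained variance (scaled): {scaled_ratio}")
```

Whether to standardize depends on whether the original units are meaningful for the analysis; when features use unrelated units, scaling first is a common choice.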


### Feature Importance
Assigns an importance score to each attribute: the larger the score, the more important the attribute. Here the scores come from an ensemble of randomized decision trees (extra trees).

In [28]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)

[0.11  0.236 0.1   0.076 0.076 0.144 0.118 0.142]
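
Since `ExtraTreesClassifier` builds its trees randomly, the scores above will differ slightly from run to run unless a seed is fixed. A minimal sketch, assuming synthetic data and hypothetical column names `f0` to `f7`, that fixes `random_state` and labels the scores so they are easier to read:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# A fixed random_state makes the importances reproducible across runs
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```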
