## Chapter 8 Feature Selection For Machine Learning

#### 1. Univariate Selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the `SelectKBest` class, which can be used with a suite of different statistical tests to select a specific number of features. The example below uses the chi-squared (`chi2`) statistical test for non-negative features to select the 4 best-scoring features (`k=4`).
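Before running the selector, it can help to see what the chi-squared score measures: for each feature, the per-class sum of its values is compared with the sum that would be expected if the feature were independent of the class. The following is a minimal sketch on a tiny made-up dataset. The arrays `X_toy` and `y_toy` and the helper `chi2_scores` are hypothetical and not part of the original example; the helper is meant to mirror the computation behind scikit-learn's `chi2`, not to replace it.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic per feature for non-negative X and class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                 # per-class sum of each feature
    feature_total = X.sum(axis=0)      # overall sum of each feature
    class_prob = Y.mean(axis=0)        # fraction of samples in each class
    # Sum each class would receive if the feature were independent of the class.
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data (assumed): feature 0 is concentrated in class 1,
# feature 1 is spread evenly across both classes.
X_toy = np.array([[0, 2], [1, 2], [9, 2], [10, 2]])
y_toy = np.array([0, 0, 1, 1])
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)
```

In this toy case feature 0 scores 16.2 and feature 1 scores 0, so a selector with `k=1` would keep feature 0. A higher score means the feature's values are distributed less evenly across the classes.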

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# load data
filename = 'data/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(filename, names=names)
array = df.values
X = array[:,:-1]
Y = array[:,-1]
# feature selection: keep the 4 features with the highest chi2 scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[:5, :])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
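The printed scores follow the column order of `names`, so pairing the two shows which four features `SelectKBest` kept. The sketch below reuses the scores printed above; the variable names `ranked`, `top4` and `selected` are illustrative only.

```python
# Feature names in column order (the last column, 'class', is the target).
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# chi-squared scores copied from the output above.
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Rank features from highest to lowest score.
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name:>5}: {score:10.3f}')

# The k=4 best features, listed in original column order,
# which is the order of the columns in the transformed array.
top4 = {name for name, _ in ranked[:4]}
selected = [name for name in names if name in top4]
print(selected)
```

The four retained columns are therefore plas, test, mass and age. On the fitted selector itself, `fit.get_support()` returns the equivalent boolean mask over the input columns.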


#### 2. Recursive Feature Elimination

Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those that remain. After each fit it uses the weights the model assigns to the attributes (its coefficients or feature importances) to decide which attribute contributes the least to predicting the target, and removes it.
You can learn more about the `RFE` class in the scikit-learn documentation. The example below uses `RFE` with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [10]:
# feature extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# feature extraction
model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print(f'Num Features: {fit.n_features_}')
print(f'Selected Features: {fit.support_}')
print(f'Feature Ranking: {fit.ranking_}')

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 6 5 1 1 3]
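The elimination loop itself is simple to sketch. The version below is a hand-rolled illustration, not scikit-learn's implementation: it assumes standardized synthetic features and swaps in ordinary least squares (via `numpy.linalg.lstsq`) as the model, dropping the feature with the smallest absolute coefficient in each round. The helper `rfe_sketch` and the data are hypothetical.

```python
import numpy as np

def rfe_sketch(X, y, n_features_to_select):
    """Return (support mask, ranking) by repeatedly dropping the weakest feature."""
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)   # selected features keep rank 1
    while len(remaining) > n_features_to_select:
        # Fit a linear model on the features still in play.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        # Features eliminated earlier receive a larger (worse) rank.
        ranking[weakest] = len(remaining) - n_features_to_select + 1
        remaining.remove(weakest)
    support = np.zeros(n_features, dtype=bool)
    support[remaining] = True
    return support, ranking

# Synthetic data (assumed): only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.standard_normal(200)

support, ranking = rfe_sketch(X_demo, y_demo, n_features_to_select=2)
print(support)
print(ranking)
```

The returned arrays follow the same convention as `support_` and `ranking_` in the output above: selected features have rank 1, and the first feature to be eliminated has the largest rank.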


#### 3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the example below, we use PCA and select 3 principal components.

In [11]:
# feature extraction with PCA
from sklearn.decomposition import PCA

# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print(f'Explained Variance: {fit.explained_variance_ratio_}')
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [ 2.265e-02  9.722e-01  1.419e-01 -5.786e-02 -9.463e-02  4.697e-02
   8.168e-04  1.402e-01]
 [ 2.246e-02 -1.434e-01  9.225e-01  3.070e-01 -2.098e-02  1.324e-01
   6.400e-04  1.255e-01]]


The explained variance ratio tells you how much of the total variance in the data is captured by each of the 3 principal components.
- PC1 (first component): 0.889 → captures 88.9% of the total variance
- PC2: 0.062 → captures 6.2%
- PC3: 0.026 → captures 2.6%
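
Adding the ratios gives the share of variance that the components retain together, which is a common way to judge how many components to keep. A small sketch using the values printed above (the variable names are illustrative):

```python
from itertools import accumulate

# Explained variance ratios copied from the output above.
ratios = [0.889, 0.062, 0.026]

# Running total of the ratios, rounded to hide floating-point noise.
cumulative = [round(total, 3) for total in accumulate(ratios)]
for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f'PC{i}: {ratio:.1%} of variance, {total:.1%} cumulative')
```

Together the three components retain about 97.7% of the total variance.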

Each row in `fit.components_` corresponds to one principal component, and each column corresponds to an original feature (here: preg, plas, pres, skin, test, mass, pedi, age).

These are the weights (or loadings) that define how each principal component is constructed from the original variables.
- In the first principal component (PC1), the largest weight is for the `test` feature (≈ 0.9931). PC1 is therefore primarily driven by the `test` variable, which contributes most to the variance it captures.
- In the second component (PC2), the largest weight is for `plas` (≈ 0.9722), meaning PC2 mainly represents variation in the `plas` variable.
- In the third component (PC3), the largest weight is for `pres` (≈ 0.9225), so PC3 captures variation related to blood pressure.
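
These loadings come from PCA fitted on unscaled data, so a feature measured on a large numeric scale can dominate a component simply because its variance is large. The sketch below illustrates this with a hand-written PCA based on the singular value decomposition. The synthetic data and the helper `pca_sketch` are assumptions for illustration, not the chapter's code.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Return (components, explained variance ratio) from an SVD of centered data."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Rows of vt are the principal directions (one row per component).
    return vt[:n_components], ratio[:n_components]

rng = np.random.default_rng(0)
# Three independent features whose standard deviations differ by factors of ten.
X_raw = rng.standard_normal((500, 3)) * np.array([100.0, 10.0, 1.0])

components, ratio = pca_sketch(X_raw, n_components=2)
dominant = np.abs(components).argmax(axis=1)   # dominant feature index per component
print(dominant)
print(ratio)

# After standardizing each column, no single feature dominates by scale alone.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
_, ratio_std = pca_sketch(X_std, n_components=2)
print(ratio_std)
```

On the raw data the first component is almost entirely the large-scale feature, while on the standardized data the variance is spread far more evenly. With scikit-learn, the usual way to get the second behaviour is to apply `StandardScaler` before `PCA`.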

#### 4. Feature Importance

Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features. In the example below we construct an `ExtraTreesClassifier` for the Pima Indians onset of diabetes dataset.

In [13]:
# feature importance with Extra Tree Classifier
from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.108 0.238 0.101 0.078 0.077 0.135 0.121 0.142]


The output gives one importance score per attribute: the larger the score, the more important the attribute. The scores suggest that plas, age and mass are the most important.
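
Tree-ensemble importances in scikit-learn are normalized to sum to 1, so each score can be read as a share of the total. The short sketch below pairs the printed scores with their names and ranks them. The values are copied from the output above; because `ExtraTreesClassifier()` was created without a fixed `random_state`, a rerun will give slightly different numbers.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Importance scores copied from the output above.
importances = [0.108, 0.238, 0.101, 0.078, 0.077, 0.135, 0.121, 0.142]

# Rank features from most to least important and show each as a percentage.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name:>5}: {score:.1%}')

top3 = [name for name, _ in ranked[:3]]
total = round(sum(importances), 3)
print(top3)
print(total)
```

Passing a fixed seed, for example `ExtraTreesClassifier(random_state=0)`, makes the importances reproducible between runs.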