# Notes on Feature Selection

- [Machine Learning Mastery: Feature Selection for Machine Learning](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
- [Machine Learning Mastery: How to Choose a Feature Selection Method for Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
- [Sklearn Feature Selection article](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
- [TowardsDataScience: Finding the best features](https://towardsdatascience.com/the-art-of-finding-the-best-features-for-machine-learning-a9074e2ca60d)

## Introduction on Feature Selection

Feature selection automatically picks the features that contribute most to the prediction of interest. Irrelevant features can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression.

Three benefits of performing feature selection before modeling the data are:
1. **Reduction of Overfitting**: Less redundant data means less opportunity to make decisions based on noise.
2. **Improvement of Accuracy**: Less misleading data means modeling accuracy improves.
3. **Reduction of Training Time**: Less data means that algorithms train faster.

In this tutorial, we consider 4 feature selection recipes for ML.

In [1]:
import pandas as pd

filename = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = pd.read_csv(filename, names=colnames)

In [9]:
data.head()

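|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |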


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   test    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   class   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


### 1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable. scikit-learn provides the `SelectKBest` class, which can be used with a suite of different statistical tests to select a specific number of features.

Different statistical tests can be used with this selection method. For example, the ANOVA F-value is appropriate for numerical inputs and a categorical output (as in the dataset in our example). It is available via the `f_classif()` function.

In the example below, we select the 4 attributes with the highest score.

In [17]:
# Feature Selection with Univariate Statistical Tests

from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
X = data.iloc[:, :8]
y = data.iloc[:, 8]

# Feature Extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, y)

# Summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

# Summarize selected features
print(features[0:5, :])

[ 39.67  213.162   3.257   4.304  13.281  71.772  23.871  46.141]
[[  6.  148.   33.6  50. ]
 [  1.   85.   26.6  31. ]
 [  8.  183.   23.3  32. ]
 [  1.   89.   28.1  21. ]
 [  0.  137.   43.1  33. ]]
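
The printed scores and the transformed array above carry no column names, so it can be hard to see which features were kept. The sketch below shows one way to map scores and the selection mask back to names. It uses a synthetic dataset from `make_classification` with hypothetical column names `f0` to `f7` as a stand-in for the diabetes data, so the numbers will not match the output above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in (assumption): 8 numeric inputs, binary target.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=4).fit(X_demo, y_demo)

# Pair each ANOVA F-score with its column name, highest score first.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)

# get_support() returns a boolean mask over the input columns.
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)
```

Applied to the notebook's own `X`, the same two lines after fitting would list the selected column names instead of positions.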


### 2. Recursive Feature Elimination

Recursive Feature Elimination (RFE) repeatedly fits a model, removes the weakest attribute (or attributes), and refits on those that remain. In scikit-learn, attributes are ranked by the fitted model's coefficients or feature importances rather than by its accuracy, and elimination continues until the requested number of features is left.

In the next example, we use RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [6]:
# Feature extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load data
X = data.iloc[:, :8]
y = data.iloc[:, 8]

# Feature extraction
model = LogisticRegression(solver='lbfgs', max_iter=3000)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)

print(f'Num features: {fit.n_features_}')
print(f'Selected features: {fit.support_}')
print(f'Feature ranking: {fit.ranking_}')

Num features: 3
Selected features: [ True False False False False  True  True False]
Feature ranking: [1 2 4 6 5 1 1 3]
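
Because RFE with a linear model ranks features by coefficient size, the scale of each input can influence which features are dropped. The following is a minimal sketch, not the approach used above: it standardizes the inputs first and reports the result by name. The synthetic data and the column names `f0` to `f7` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 numeric inputs, binary target.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=1
)
cols = [f"f{i}" for i in range(8)]

# Put all inputs on a comparable scale so coefficients are comparable too.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=cols)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

# Rank 1 marks the kept features; larger ranks were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=cols).sort_values()
kept = [c for c, keep in zip(cols, rfe.support_) if keep]
print(ranking)
print(kept)
```

With the default `step=1`, one feature is removed per iteration, which is why the ranking runs from 1 (kept) up to 6 for eight inputs and three survivors.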


### 3. Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. PCA is a data reduction technique.

A property of PCA is that you can choose the number of dimensions or principal components in the transformed data.

In [10]:
# Feature extraction with PCA
from sklearn.decomposition import PCA

# Load data
X = data.iloc[:, :8]
y = data.iloc[:, 8]

# Feature extraction
pca = PCA(n_components=5)
fit = pca.fit(X)

# Summarize components
print(f'Explained Variance {fit.explained_variance_ratio_}')
print(fit.components_)

Explained Variance [0.88854663 0.06159078 0.02579012 0.01308614 0.00744094]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
   9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
   2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
  -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
  -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]]
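
In the output above, the first component loads almost entirely on the fifth column (`test`, about 0.99), which suggests the result is driven by that column's large numeric range. The sketch below illustrates the effect on synthetic data by comparing PCA on raw and standardized inputs, then projects the data and accumulates the explained variance. The dataset and the deliberately inflated column are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 numeric inputs.
X_arr, _ = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2, random_state=0
)
X_arr[:, 4] *= 100  # exaggerate one column's scale on purpose

# On raw data the inflated column dominates the first component.
raw_ratio = PCA(n_components=5).fit(X_arr).explained_variance_ratio_

# Standardizing first gives every column unit variance.
X_std = StandardScaler().fit_transform(X_arr)
pca = PCA(n_components=5).fit(X_std)

# transform() produces the new, lower-dimensional inputs.
X_proj = pca.transform(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(raw_ratio[0], pca.explained_variance_ratio_[0])
print(X_proj.shape)
print(cumulative)
```

The cumulative ratio is a common way to decide how many components to keep. Note that the projected columns are combinations of all original features, so PCA does not tell you which original features to drop.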


### 4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In [11]:
# Feature importance with Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier

# Load data
X = data.iloc[:, :8]
y = data.iloc[:, 8]

# Feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, y)
print(model.feature_importances_)

[0.10781201 0.2363128  0.09790084 0.07782284 0.0726687  0.13750356
 0.12681048 0.14316875]
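
The importances above are only scores; turning them into a selection needs a threshold. One option in scikit-learn is `SelectFromModel`, sketched below on synthetic data (an assumption, as are the larger `n_estimators` and the fixed `random_state`). The example above sets no `random_state`, so its importances will vary from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in (assumption): 8 numeric inputs, binary target.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Fix the seed so the importances are reproducible.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_arr, y_demo)
importances = forest.feature_importances_

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
mask = selector.get_support()
X_reduced = selector.transform(X_arr)

print(importances)
print(mask)
print(X_reduced.shape)
```

The `"mean"` threshold is one arbitrary choice; a numeric cutoff or `"median"` would work the same way.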


## How to Choose a Feature Selection Method For Machine Learning

Feature selection is the process of reducing the number of input variables when developing a predictive model. It reduces the computational cost of modeling and, in some cases, improves the performance of the model.


There are two main types of feature selection techniques:
1. Supervised, and
2. Unsupervised.
 
Supervised methods may be divided into *wrapper, filter and intrinsic*.

*Filter-based feature selection methods* use statistical measures to score the correlation or dependence between each input variable and the target; the scores are then used to filter out all but the most relevant features. The statistical measure must be chosen carefully based on the data types of the input and response variables.

*Statistical-based feature selection methods* involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and the output variables.
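
As an illustration of matching the statistic to the data types, the sketch below handles the numerical-input, numerical-output case with scikit-learn's `f_regression` score function inside `SelectKBest`. The regression dataset is synthetic and the choice of `k=3` is an assumption for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem (assumption): 10 numeric inputs, numeric target.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# f_regression scores each input by a univariate linear F-test against the target.
selector = SelectKBest(score_func=f_regression, k=3).fit(X_reg, y_reg)
X_top = selector.transform(X_reg)

# Column positions of the inputs that were kept.
kept_idx = selector.get_support(indices=True)
print(kept_idx)
print(X_top.shape)
```

For a categorical target with numerical inputs, swapping in `f_classif` gives the ANOVA-based variant used earlier in these notes.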

Unsupervised methods do not use the target variable. Feature selection is related to dimensionality reduction; the main difference is that feature selection chooses features to keep or remove from the dataset, whereas dimensionality reduction creates a projection of the data, resulting in entirely new input features. As such, dimensionality reduction is an alternative to feature selection.

This tutorial is divided into 4 parts:
1. Feature selection methods
2. Statistics for filter feature selection methods
    1. Numerical input, numerical output
    2. Numerical input, categorical output
    3. Categorical input, numerical output
    4. Categorical input, categorical output
3. Tips and tricks for feature selection
    1. Correlation statistics
    2. Selection method
    3. Transform variables
    4. What is the best method?
4. Worked examples
    1. Regression feature selection
    2. Classification feature selection

### 1. Feature Selection Methods

- Feature Selection: Select a subset of input features from the dataset
    - Unsupervised: Do not use the target variable (e.g., remove redundant variables)
        - Correlation
    - Supervised: Use the target variable (e.g., remove irrelevant variables)
        - Wrapper: Search for well-performing subsets of features (fit models on different subsets of input features and keep the subset that yields the best-performing model according to a chosen metric)
            - RFE
        - Filter: Select subsets of features based on their relationship with the target
            - Statistical methods
            - Feature importance methods
        - Intrinsic: Algorithms that perform automatic feature selection during training
            - Decision trees
- Dimensionality Reduction: Project input data into a lower-dimensional feature space.      
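
The unsupervised, correlation-based branch of the outline above can be sketched with pandas alone. This is a minimal illustration under stated assumptions: the toy columns `a`, `a_copy`, and `b` and the 0.95 cutoff are invented for the example, and no target variable is involved.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": b,
})

# Absolute pairwise Pearson correlations between inputs.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair correlated above the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
print(df_reduced.columns.tolist())
```

Which member of a correlated pair is dropped depends only on column order here; a real analysis might prefer to keep the column that is cheaper to collect or easier to interpret.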