<a href="https://colab.research.google.com/github/AbrahamOtero/MLiB/blob/main/3_FeatureSelection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature selection

## Set up

We import the libraries that we are going to need:

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

We first load the iris dataset:

In [2]:
url = 'https://raw.githubusercontent.com/AbrahamOtero/MLiB/main/datasets/iris.csv'

iris = pd.read_csv(url)

## Filter feature selection methods

To implement filter-based feature selection we can use **SelectKBest**, which keeps the number of attributes indicated in its constructor (through the parameter `k`) according to a score function. In this case the chi-square statistic (`chi2`) is used; note that it requires non-negative feature values. If the function **mutual_info_classif** is passed as `score_func`, the information gain (mutual information) criterion is used instead. When the target is numeric rather than categorical, the **f_regression** criterion can be used.

In [3]:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# X will be the matrix with the features that we are going to evaluate and y the class
X = iris.drop('class', axis=1)
y = iris['class']

# Apply SelectKBest with chi2 to select the 2 best attributes
best_features = SelectKBest(score_func=chi2, k=2)
fit = best_features.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = fit.get_support(indices=True)

# Print the names of the selected attributes
print(X.columns[feature_indices])


Index(['petal.length', 'petal.width'], dtype='object')


From the fitted SelectKBest estimator we can also see the scores obtained for each attribute:

In [4]:
# Get the chi2 scores for each feature
feature_scores = fit.scores_

# Create a DataFrame to store the scores and feature names
df_scores = pd.DataFrame({'Feature': X.columns, 'Chi2 Score': feature_scores})

# Sort the DataFrame by Chi2 Score in descending order
df_scores = df_scores.sort_values(by='Chi2 Score', ascending=False)

# Print the sorted scores
print(df_scores)


        Feature  Chi2 Score
2  petal.length  116.312613
3   petal.width   67.048360
0  sepal.length   10.817821
1   sepal.width    3.710728


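Let's now see what scores we would get using information gain. The estimate involves some randomness, so the values may change slightly between runs unless `random_state` is set: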

In [5]:
from sklearn.feature_selection import mutual_info_classif

# Apply mutual_info_classif to get the scores
feature_scores = mutual_info_classif(X, y)

# Create a DataFrame to store the scores and feature names
df_scores = pd.DataFrame({'Feature': X.columns, 'Mutual Info Score': feature_scores})

# Sort the DataFrame by Mutual Info Score in descending order
df_scores = df_scores.sort_values(by='Mutual Info Score', ascending=False)

# Print the sorted scores
print(df_scores)


        Feature  Mutual Info Score
2  petal.length           0.984524
3   petal.width           0.976850
0  sepal.length           0.493449
1   sepal.width           0.245123
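
The examples above use a categorical class. For a numeric target, the `f_regression` criterion mentioned earlier can be plugged into SelectKBest in the same way. The sketch below is only an illustration under stated assumptions: the data are synthetic (generated with `make_regression`, where by construction only two of the five features influence the target), and the column names `f0` to `f4` are made up.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Assumption: synthetic data with a numeric target; 2 of the 5 features are informative
X_arr, y_num = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=1.0, random_state=0
)
X_num = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# Score every feature with the univariate F-test and keep the 2 best
selector = SelectKBest(score_func=f_regression, k=2).fit(X_num, y_num)

# f_regression also provides a p-value for each feature
df_scores = pd.DataFrame({
    'Feature': X_num.columns,
    'F Score': selector.scores_,
    'p-value': selector.pvalues_,
})
print(df_scores.sort_values(by='F Score', ascending=False))
```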


## Wrapper feature selection methods

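To carry out feature selection based on model wrappers, we can use the **RFE** (recursive feature elimination) class, to which we must pass the model we want to use for the selection. RFE relies on the feature importances (or coefficients) of the fitted model to decide which attribute to discard at each step. In the example below, the model will be a decision tree.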

In [6]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# The model we will use will be a decision tree
estimator = DecisionTreeClassifier()

# Create RFE object to select 2 attributes based on decision tree
selector = RFE(estimator, n_features_to_select=2)

# Fitting the RFE object to the data
selector = selector.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = selector.get_support(indices=True)

# Print the names of the selected attributes
print(X.columns[feature_indices])
print(selector.support_)
print(selector.ranking_)


Index(['petal.length', 'petal.width'], dtype='object')
[False False  True  True]
[3 2 1 1]
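
The `support_` and `ranking_` arrays are easier to read when paired with the column names. In `ranking_`, every selected attribute has rank 1 and larger values correspond to attributes that were eliminated earlier. The following sketch shows one way to tabulate them. It assumes the iris data bundled with scikit-learn (so the column names differ from the CSV) and fixes `random_state` for reproducibility; both choices, as well as the name `df_ranking`, are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV used above
data = load_iris(as_frame=True)
X_demo, y_demo = data.data, data.target

# random_state is fixed so that the tree, and hence the ranking, is reproducible
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=2)
selector = selector.fit(X_demo, y_demo)

# Rank 1 marks the selected attributes; higher ranks were discarded earlier
df_ranking = pd.DataFrame({
    'Feature': X_demo.columns,
    'Selected': selector.support_,
    'Rank': selector.ranking_,
}).sort_values(by='Rank')
print(df_ranking)
```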


RFE starts from all attributes and eliminates them one at a time. If we want to use the opposite strategy (start from an empty set and add attributes one by one) we can use **SequentialFeatureSelector**, whose default search direction is forward. The following example applies this strategy, also using a decision tree.

In [7]:
from sklearn.feature_selection import SequentialFeatureSelector

# The model we will use will be a decision tree
estimator = DecisionTreeClassifier()

# Create SFS object to select 2 attributes based on decision tree
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2)

# Fitting the SFS object to the data
sfs = sfs.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = sfs.get_support(indices=True)

# Print the names of the selected attributes
print(X.columns[feature_indices])


Index(['petal.length', 'petal.width'], dtype='object')
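
Unlike RFE, SequentialFeatureSelector decides which attribute to add or remove by comparing cross-validated scores, and its `direction` parameter switches between forward and backward search. The sketch below runs both directions side by side. It assumes the iris data bundled with scikit-learn and a fixed `random_state`; the dictionary `selected` is an illustrative name, and `cv=5` simply spells out the default.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV used above
data = load_iris(as_frame=True)
X_demo, y_demo = data.data, data.target

selected = {}
for direction in ('forward', 'backward'):
    sfs = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=2,
        direction=direction,  # 'forward' adds attributes, 'backward' removes them
        cv=5,                 # candidates are compared by 5-fold cross-validated score
    )
    sfs.fit(X_demo, y_demo)
    selected[direction] = list(X_demo.columns[sfs.get_support()])
    print(direction, selected[direction])
```

The two directions do not have to agree, since each one explores a different sequence of candidate subsets.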


## A full example

Let's now look at a complete example where we will use several techniques to evaluate the attributes of the diabetes dataset. As is often the case, the results are not completely consistent across techniques; remember that these techniques are just heuristics.

In [8]:
url = 'https://raw.githubusercontent.com/AbrahamOtero/MLiB/main/datasets/diabetes.csv'

diabetes = pd.read_csv(url)

In [9]:
# X will be the matrix with the features that we are going to evaluate and y the class
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Apply SelectKBest with mutual_info_classif to get the scores
best_features = SelectKBest(score_func=mutual_info_classif, k=5)
fit = best_features.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = fit.get_support(indices=True)
print(X.columns[feature_indices])

# Get the mutual_info_classif scores for each feature and print them
feature_scores = fit.scores_
df_scores = pd.DataFrame({'Feature': X.columns, 'Mutual Info Score': feature_scores})
df_scores = df_scores.sort_values(by='Mutual Info Score', ascending=False)
print(df_scores)

# Now let's move to wrapper methods
# The model we will use will be a decision tree
estimator = DecisionTreeClassifier()

# Create RFE object to select 4 attributes based on decision tree
selector = RFE(estimator, n_features_to_select=4)

# Fitting the RFE object to the data
selector = selector.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = selector.get_support(indices=True)
print(X.columns[feature_indices])

# Create SequentialFeatureSelector object to select 4 attributes based on decision tree
sfs = SequentialFeatureSelector(estimator, n_features_to_select=4)
sfs = sfs.fit(X, y)

# Get the indexes of the selected attributes
feature_indices = sfs.get_support(indices=True)
print(X.columns[feature_indices])


Index(['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'Age'], dtype='object')
                    Feature  Mutual Info Score
1                   Glucose           0.107223
5                       BMI           0.076873
0               Pregnancies           0.073308
7                       Age           0.059826
4                   Insulin           0.023931
3             SkinThickness           0.023025
6  DiabetesPedigreeFunction           0.020021
2             BloodPressure           0.000000
Index(['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age'], dtype='object')
Index(['Glucose', 'BloodPressure', 'SkinThickness',
       'DiabetesPedigreeFunction'],
      dtype='object')


Based on the results, Glucose is clearly the most relevant attribute, as it has been selected by all three methods. Insulin has not been selected by either of the two wrapper methods, and it did not score very well with the information gain method either. BMI and Age have been selected by one of the wrapper methods (RFE) and have scored well with the filter method. For the rest of the attributes, their value is not so clear.
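
The agreement between methods can also be tallied programmatically. The following sketch counts how many of the three methods selected each attribute. The subsets are copied by hand from the output printed above, and the dictionary keys are just illustrative labels. Because the decision trees are not seeded and the mutual information estimate is randomised, rerunning the notebook may yield different subsets.

```python
from collections import Counter

# Feature subsets copied from the output printed above
selections = {
    'SelectKBest mutual_info (k=5)': ['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'Age'],
    'RFE (4 features)': ['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age'],
    'SFS (4 features)': ['Glucose', 'BloodPressure', 'SkinThickness',
                         'DiabetesPedigreeFunction'],
}

# Count in how many subsets each attribute appears
votes = Counter(feature for subset in selections.values() for feature in subset)

for feature, count in votes.most_common():
    print(f"{feature}: selected by {count} of {len(selections)} methods")
```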

# Exercises

Load the diabetes dataset.

In [None]:
# Your code goes here

Select the 4 attributes that provide the most information based on the chi-square statistic:

In [None]:
# Your code goes here

Select the 4 attributes that provide the most information based on the information gain criterion:

In [None]:
# Your code goes here

Display the scores obtained by all attributes for both the chi-square statistic and the information gain.


In [None]:
# Your code goes here

Use feature selection based on a model wrapper using a decision tree. Use the backward search strategy.

In [None]:
# Your code goes here

Repeat the previous exercise but using the forward search strategy.

In [None]:
# Your code goes here

What general conclusions do you draw from the importance of the different attributes to predict the class in this dataset?