# Sequential Feature Selector

Sequential feature selection is a family of greedy search algorithms that reduce the dimensionality of the input feature space for machine learning models. The goal is to find a subset of features that gives the best model performance according to a specified criterion (usually predictive accuracy).

Sequential feature selection algorithms fall into two categories:

- **Sequential Forward Selection (SFS):** This method starts with an empty set of features and adds them one at a time. At each step, it adds the feature that most improves model performance, stopping when a desired number of features is reached or when the improvement falls below a chosen tolerance.

- **Sequential Backward Selection (SBS):** This method starts with the full set of features and sequentially removes the least important feature at each step. It eliminates the feature whose removal causes the least degradation in the model performance until the desired number of features is left or further removal of features degrades the model performance beyond a certain threshold.

Both methods are considered greedy because they make the locally optimal choice at each step in the hope of reaching the global optimum. Because they do not evaluate every possible subset of features, they are not guaranteed to find the best one, but they often yield a good subset that can improve model performance and reduce overfitting.
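To make the trade-off concrete, the sketch below counts how many candidate subsets each approach has to score. The helper names greedy_evaluations and exhaustive_evaluations are made up for this illustration, and the exhaustive count assumes the target subset size is fixed in advance.

```python
from math import comb


def greedy_evaluations(n_features: int, n_select: int) -> int:
    """Candidate subsets scored by forward selection to reach n_select features."""
    # Step k (0-based) scores each of the n_features - k features not yet selected.
    return sum(n_features - k for k in range(n_select))


def exhaustive_evaluations(n_features: int, n_select: int) -> int:
    """Subsets of exactly n_select features that an exhaustive search would score."""
    return comb(n_features, n_select)


for d, k in [(4, 2), (30, 10), (100, 10)]:
    print(f"d={d}, k={k}: greedy={greedy_evaluations(d, k)}, "
          f"exhaustive={exhaustive_evaluations(d, k)}")
```

For a tiny problem such as 4 features the two counts are similar, but the exhaustive count grows combinatorially with the number of features while the greedy count grows roughly linearly per step.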

The main steps in sequential feature selection are:

1. Initialization: Start with an empty set of features (for SFS) or all features (for SBS).

1. Iteration: For each step:

   - In SFS, evaluate all possible additions of a single feature to the current set of features and choose the one that maximizes the performance criterion.
   - In SBS, evaluate all possible removals of a single feature from the current set of features and choose the one that has the least impact on the performance criterion.

1. Termination: Stop when a predetermined stopping criterion is met, which could be a set number of features, a performance threshold, or if there is no improvement in performance.

1. Final Evaluation: The selected subset of features is used to train the final model, and the performance is evaluated.
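The steps above can be sketched from scratch for the forward direction. This is a minimal illustration, not scikit-learn's implementation: the function name forward_select is hypothetical, the performance criterion is assumed to be mean cross-validated accuracy, and the stopping rule is assumed to be a fixed number of features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(estimator, X, y, n_features_to_select, cv=5):
    """Greedy forward selection scored by mean cross-validated accuracy."""
    selected = []                        # Initialization: start with no features
    remaining = list(range(X.shape[1]))
    history = []                         # best score reached after each step
    while len(selected) < n_features_to_select:  # Termination: fixed subset size
        # Iteration: score every single-feature addition to the current set
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append(scores[best])
    return selected, history


X, y = load_iris(return_X_y=True)
selected, history = forward_select(LogisticRegression(max_iter=1000), X, y, 2)
print("Selected column indices:", selected)
print("Cross-validated score after each step:", history)
```

The final evaluation step would then fit the model on X[:, selected] and score it on data that played no part in the selection.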

In Python, sequential feature selection can be performed using libraries like scikit-learn, which provides a SequentialFeatureSelector transformer. It works with any estimator that has a fit method (such as classifiers or regressors) and scores each candidate subset by cross-validation. Sequential feature selection can be a powerful method when dealing with high-dimensional data, where selecting features is necessary to improve the interpretability of the model, reduce overfitting, and possibly enhance the model's predictive performance.

In [1]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector

# Load the dataset
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a logistic regression classifier
logreg = LogisticRegression(max_iter=1000)

# Create the SequentialFeatureSelector object
sfs = SequentialFeatureSelector(logreg, n_features_to_select=2, direction='forward')

# Fit the SequentialFeatureSelector
sfs.fit(X_train, y_train)

# Transform the dataset to only include the selected features
X_train_selected = sfs.transform(X_train)
X_test_selected = sfs.transform(X_test)

# Fit the logistic regression classifier to the training set with the selected features
logreg.fit(X_train_selected, y_train)

# Print which features were selected (get_support() returns a boolean mask)
selected_features = sfs.get_support()
print("Selected features:", [name for name, keep in zip(iris.feature_names, selected_features) if keep])

# Evaluate the model on the test set with the selected features
print(f"Model score with selected features: {logreg.score(X_test_selected, y_test):.3f}")

Selected features: ['petal length (cm)', 'petal width (cm)']
Model score with selected features: 0.974
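For comparison, here is a sketch of the backward direction on the same data. Wrapping the selector and the classifier in a pipeline is a design choice of this example rather than something the cell above does, and the variable names sbs, model, backward_features, and test_score are introduced only for illustration. Backward selection is not guaranteed to pick the same features as forward selection.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Backward selection: start from all four features and remove two of them
sbs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)

# Chain the selector with a fresh classifier so both are fit on training data only
model = make_pipeline(sbs, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# model[0] is the fitted selector; its boolean mask picks the retained columns
backward_features = list(X_train.columns[model[0].get_support()])
test_score = model.score(X_test, y_test)
print("Selected features (backward):", backward_features)
print(f"Model score with selected features: {test_score:.3f}")
```

Because the pipeline applies the fitted selector automatically, the test set can be passed to score without a separate transform call.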


In [2]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [3]:
iris.frame

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
