<P> <B> <font color=red size="6"> Stepwise Regression with Forward Selection Using mlxtend </Font></B> </P>

<b>The mlxtend library implements sequential feature selection in its SequentialFeatureSelector class, which can be used for stepwise regression with forward selection. The class supports both forward and backward selection and scores candidate feature subsets by cross-validation using a chosen metric, such as R² or negative mean squared error.</b>

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import pandas as pd
import numpy as np

In [2]:
# Load the wine dataset
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
# Build a continuous target for regression: class label plus 0.1 * alcohol (column 0)
y = wine.target + 0.1 * wine.data[:, 0]

In [3]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [5]:
# Initialize the Linear Regression model
lr = LinearRegression()

<b>Sequential Feature Selector:</b>

    The SequentialFeatureSelector class from mlxtend performs stepwise selection:
        forward=True: Enables forward selection.
        floating=False: Disables the conditional backward steps; use floating=True for stepwise regression with both forward and backward steps.
        scoring='r2': Uses R² as the evaluation metric. Other scikit-learn scoring strings, such as 'neg_mean_squared_error', also work.

<b>Automatic Feature Selection:</b>

    k_features='best': Selects the feature subset, of any size, with the best cross-validation score.

<b>Cross-Validation:</b>

    The feature selection process uses 5-fold cross-validation to evaluate the performance of feature subsets.
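
The greedy search described above can be sketched by hand with scikit-learn's cross_val_score. This is an illustrative sketch only, not mlxtend's actual implementation; forward_select, X_demo, and y_demo are hypothetical names introduced here. Scaling is skipped because it does not change the fit of an ordinary least-squares model.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split


def forward_select(model, X, y, cv=5, scoring="r2"):
    """Greedy forward selection (hypothetical helper, for illustration only)."""
    remaining = list(range(X.shape[1]))
    selected = []
    history = []  # (feature indices, mean CV score) after each step
    while remaining:
        # Score every remaining feature when added to the current subset
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y,
                               cv=cv, scoring=scoring).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)  # candidate with the best mean CV score
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((tuple(selected), scores[best_j]))
    # Keep the best-scoring subset over all sizes, similar in spirit to k_features='best'
    return max(history, key=lambda item: item[1]), history


wine = load_wine()
y_all = wine.target + 0.1 * wine.data[:, 0]  # same synthetic target as in the notebook
X_demo, _, y_demo, _ = train_test_split(wine.data, y_all, test_size=0.2, random_state=42)

(best_subset, best_score), history = forward_select(LinearRegression(), X_demo, y_demo)
print("Best subset:", [wine.feature_names[i] for i in best_subset])
print(f"Mean CV R²: {best_score:.4f}")
```

Each pass adds exactly one feature, so the history holds one entry per subset size, and the final choice is simply the entry with the highest mean cross-validation score.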

In [6]:
# Perform forward selection using mlxtend
sfs = SFS(
    lr,
    k_features='best',  # Automatically determine the best number of features
    forward=True,       # Perform forward selection
    floating=False,     # Do not allow backward steps
    scoring='r2',       # Use R² as the evaluation metric
    cv=5,               # 5-fold cross-validation
    n_jobs=-1           # Use all available CPU cores
)

In [7]:
# Fit the feature selector
sfs = sfs.fit(X_train_scaled, y_train)

In [8]:
# Get the selected feature indices and names
selected_indices = list(sfs.k_feature_idx_)  # k_feature_idx_ is a tuple of column indices
selected_features = [X.columns[i] for i in selected_indices]

In [9]:
# Train a model using the selected features
X_train_selected = X_train_scaled[:, selected_indices]
X_test_selected = X_test_scaled[:, selected_indices]
lr.fit(X_train_selected, y_train)
y_pred_test = lr.predict(X_test_selected)

In [10]:
# Evaluate the model
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print("Selected Features:", selected_features)
print(f"Test Set Mean Squared Error: {mse_test:.4f}")
print(f"Test Set R² Score: {r2_test:.4f}")

Selected Features: ['alcohol', 'malic_acid', 'alcalinity_of_ash', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'color_intensity', 'od280/od315_of_diluted_wines', 'proline']
Test Set Mean Squared Error: 0.0689
Test Set R² Score: 0.8730


<b>Customization:</B>

    Use floating=True for stepwise regression with both forward and backward steps.
    Adjust cv to change the number of cross-validation folds; more folds give a steadier score estimate at a higher computational cost.