After feature engineering has produced many new features, or when the initial dataset already has a large number of them, you may want to keep only the most relevant ones. Feature selection chooses the subset of the original features that is most useful for predicting the target variable.

**Goals of Feature Selection:**

* **Improve Model Performance:** Reduce overfitting by removing irrelevant or redundant features (noise).
* **Reduce Training Time:** Fewer features mean models train faster.
* **Enhance Interpretability:** Simpler models with fewer features are often easier to understand.
* **Reduce Dimensionality:** Mitigate the "curse of dimensionality".

`Scikit-learn` provides several methods, broadly categorized as Filter, Wrapper, and Embedded methods.
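As a rough illustration of the three categories, the sketch below fits one selector of each kind on the Iris data and reads the chosen columns through the shared `get_support()` method. The choice of estimators, `k=2`, and the `selectors` and `selected` names are assumptions made for this example, not a recommended configuration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Scale using statistics from the training split only
X_train_scaled = StandardScaler().fit_transform(X_train)

# One illustrative selector per category (assumed settings, each keeping 2 features)
selectors = {
    "filter": SelectKBest(score_func=f_classif, k=2),
    "wrapper": RFE(LogisticRegression(max_iter=1000), n_features_to_select=2),
    # threshold=-np.inf makes max_features the only selection criterion
    "embedded": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold=-np.inf, max_features=2),
}

selected = {}
for name, selector in selectors.items():
    selector.fit(X_train_scaled, y_train)
    # get_support() returns a boolean mask over the input columns
    selected[name] = list(X.columns[selector.get_support()])
    print(f"{name}: {selected[name]}")
```

All three objects expose the same `fit` / `transform` / `get_support` interface, so they can be swapped in and out of the same workflow.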

## Feature Selection Techniques

This document covers:

* **Goal:** Explains why feature selection is important (improving performance, reducing complexity, etc.).
* **Filter Methods:** Demonstrates `VarianceThreshold` (removing low/zero variance features) and `SelectKBest` (using univariate statistical tests like `f_classif` or `f_regression`).
* **Wrapper Methods:** Shows `RFE` (Recursive Feature Elimination) using a model's coefficients or importances to iteratively select features. Mentions `RFECV` for automatic selection of the number of features.
* **Embedded Methods:** Illustrates how L1 regularization (`Lasso` for regression) inherently performs feature selection by shrinking some coefficients to zero, and how tree-based models (`RandomForestClassifier`) provide `feature_importances_`. Mentions `SelectFromModel`.
* **Considerations:** Emphasizes fitting selectors only on training data and the pros/cons of different method categories.
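One way to guarantee that a selector is fitted only on training data is to place it inside a `Pipeline`, so that each cross-validation fold refits the scaler and selector on that fold's training portion. The following is a minimal sketch under assumed settings (Iris data, `k=2`, five folds); the step names are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling, selection, and the classifier are refit together on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Because the held-out fold never influences the feature scores, the reported accuracy is not inflated by selection leakage.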

---

Choosing the right features can significantly impact your model's success.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler # For scaling before some methods
# Filter Methods
from sklearn.feature_selection import VarianceThreshold, SelectKBest, SelectPercentile, f_classif, f_regression, mutual_info_regression, mutual_info_classif
# Wrapper Methods
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression, Lasso # Lasso is also embedded
# Embedded Methods
from sklearn.ensemble import RandomForestClassifier
# Other estimators and pipeline utilities
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline


# --- 1. Load and Prepare Data ---
print("--- Loading Data ---")
# Classification Example: Iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = iris.target
print("Iris Dataset Features:\n", X_iris.head())

# Regression Example: California Housing
housing = fetch_california_housing(as_frame=True)
X_housing = housing.data
y_housing = housing.target
# Add a noisy feature for demonstration
np.random.seed(42)
X_housing['NoisyFeature'] = np.random.randn(X_housing.shape[0]) * 0.1
print("\nHousing Dataset Features (with added noise):\n", X_housing.head())

# Split data (important to fit selectors only on training data)
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris)
X_housing_train, X_housing_test, y_housing_train, y_housing_test = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42)

# Scale data (needed for some models used in selection, like Lasso, RFE with linear models)
scaler_housing = StandardScaler()
X_housing_train_scaled = scaler_housing.fit_transform(X_housing_train)
X_housing_test_scaled = scaler_housing.transform(X_housing_test)
# Convert back to DataFrame for clarity
X_housing_train_scaled = pd.DataFrame(X_housing_train_scaled, columns=X_housing.columns, index=X_housing_train.index)

scaler_iris = StandardScaler()
X_iris_train_scaled = scaler_iris.fit_transform(X_iris_train)
X_iris_test_scaled = scaler_iris.transform(X_iris_test)
X_iris_train_scaled = pd.DataFrame(X_iris_train_scaled, columns=X_iris.columns, index=X_iris_train.index)

print("-" * 30)


# --- 2. Filter Methods ---
# Select features based on statistical properties, independent of any specific model.
# Fast and computationally inexpensive.

print("--- Filter Methods ---")

# a) Variance Threshold: Remove features with low variance.
print("\n--- a) Variance Threshold ---")
# Remove features with zero variance (constant features) - default threshold=0
# Add a constant feature for demonstration
X_housing_train_const = X_housing_train_scaled.copy()
X_housing_train_const['ConstantFeat'] = 0
print(f"Shape before VarianceThreshold: {X_housing_train_const.shape}")

selector_var = VarianceThreshold(threshold=0.0) # Removes constant features
selector_var.fit(X_housing_train_const) # Fit on training data

# Get boolean mask of features to keep
features_to_keep_mask = selector_var.get_support()
kept_features = X_housing_train_const.columns[features_to_keep_mask]
print(f"Features kept after VarianceThreshold(0): {list(kept_features)}")

# Transform data (selects the columns)
X_train_high_variance = selector_var.transform(X_housing_train_const)
print(f"Shape after VarianceThreshold(0): {X_train_high_variance.shape}")
# A higher threshold removes quasi-constant features. Apply it to unscaled data:
# after StandardScaler, every non-constant feature has unit variance.
print("-" * 20)


# b) Univariate Selection: Select features based on statistical tests against the target variable.
print("\n--- b) Univariate Selection (SelectKBest) ---")
# SelectKBest: Selects the top 'k' features.
# SelectPercentile: Selects the top 'percentile' features.
# Common scoring functions:
# - For Regression: f_regression, mutual_info_regression
# - For Classification: f_classif (ANOVA F-value), chi2 (for non-negative features), mutual_info_classif

# Example: Select top 2 features for Iris classification using f_classif
k_best = 2
selector_kbest_iris = SelectKBest(score_func=f_classif, k=k_best)
selector_kbest_iris.fit(X_iris_train_scaled, y_iris_train) # Fit on training data

# Get scores and selected features
scores_iris = selector_kbest_iris.scores_
selected_indices_iris = selector_kbest_iris.get_support(indices=True)
selected_features_iris = X_iris_train.columns[selected_indices_iris]

print(f"\nIris feature scores (f_classif): {scores_iris.round(2)}")
print(f"Top {k_best} Iris features selected by SelectKBest: {list(selected_features_iris)}")

# Transform data to keep only selected features
X_train_iris_kbest = selector_kbest_iris.transform(X_iris_train_scaled)
X_test_iris_kbest = selector_kbest_iris.transform(X_iris_test_scaled) # Use same fitted selector
print(f"Shape after SelectKBest(k=2): {X_train_iris_kbest.shape}")

# Example: Select top 5 features for Housing regression using f_regression
k_best_reg = 5
selector_kbest_housing = SelectKBest(score_func=f_regression, k=k_best_reg)
selector_kbest_housing.fit(X_housing_train_scaled, y_housing_train)
selected_features_housing = X_housing_train.columns[selector_kbest_housing.get_support(indices=True)]
print(f"\nTop {k_best_reg} Housing features selected by SelectKBest (f_regression): {list(selected_features_housing)}")
print("-" * 30)


# --- 3. Wrapper Methods ---
# Use a specific machine learning model to evaluate the usefulness of feature subsets.
# More computationally expensive than filter methods but can lead to better performance.

print("--- Wrapper Methods ---")

# a) Recursive Feature Elimination (RFE)
# Iteratively trains a model, removes the least important feature(s), and repeats.
# Requires an estimator with `coef_` or `feature_importances_`.
print("\n--- a) Recursive Feature Elimination (RFE) ---")

# Example: Use RFE with Logistic Regression to select 2 features for Iris
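# Note: recent scikit-learn releases deprecate 'liblinear' for multiclass targets such as Iris;
# the default 'lbfgs' solver is an alternative if a warning or error appears.
model_for_rfe = LogisticRegression(solver='liblinear', random_state=42)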
# n_features_to_select: Number of features to keep.
# step: Number of features to remove at each iteration.
rfe_selector = RFE(estimator=model_for_rfe, n_features_to_select=2, step=1)
rfe_selector.fit(X_iris_train_scaled, y_iris_train) # Fit on training data

selected_features_rfe = X_iris_train.columns[rfe_selector.support_]
print(f"Features selected by RFE (Logistic Regression, k=2): {list(selected_features_rfe)}")
print(f"Feature ranking (lower is better): {rfe_selector.ranking_}")

# Transform data
X_train_iris_rfe = rfe_selector.transform(X_iris_train_scaled)
print(f"Shape after RFE: {X_train_iris_rfe.shape}")

# b) RFECV: RFE with cross-validation to automatically find the optimal number of features.
# print("\n--- b) RFECV (RFE with Cross-Validation) ---")
# from sklearn.model_selection import StratifiedKFold  # required by the cv argument below
# rfecv_selector = RFECV(estimator=model_for_rfe, step=1, cv=StratifiedKFold(3), scoring='accuracy')
# rfecv_selector.fit(X_iris_train_scaled, y_iris_train)
# print(f"Optimal number of features found by RFECV: {rfecv_selector.n_features_}")
# selected_features_rfecv = X_iris_train.columns[rfecv_selector.support_]
# print(f"Features selected by RFECV: {list(selected_features_rfecv)}")
# plt.figure()
# plt.plot(range(1, len(rfecv_selector.cv_results_['mean_test_score']) + 1), rfecv_selector.cv_results_['mean_test_score'])
# plt.xlabel("Number of features selected")
# plt.ylabel("Cross validation score (accuracy)")
# plt.title("RFECV Performance")
# plt.show()
print("\n(RFECV automatically finds the best number of features using CV - see commented code)")
print("-" * 30)


# --- 4. Embedded Methods ---
# Feature selection is an intrinsic part of the model training process.

print("--- Embedded Methods ---")

# a) L1 Regularization (Lasso)
# The L1 penalty forces some feature coefficients to become exactly zero.
print("\n--- a) L1 Regularization (Lasso) ---")
# Use Lasso for regression (housing data)
lasso = Lasso(alpha=0.05, random_state=42) # Alpha controls regularization strength
lasso.fit(X_housing_train_scaled, y_housing_train)

lasso_coefs = pd.Series(lasso.coef_, index=X_housing_train.columns)
print("Lasso Coefficients:\n", lasso_coefs)
selected_features_lasso = lasso_coefs[lasso_coefs != 0].index
print(f"\nFeatures selected by Lasso (non-zero coefs): {list(selected_features_lasso)}")
# Note: The noisy feature might be eliminated depending on alpha.

# Can use LogisticRegression with penalty='l1' for classification.

# b) Tree-based Feature Importance
# Models like RandomForest calculate importance based on how much each feature
# contributes to reducing impurity (e.g., Gini impurity) across all trees.
print("\n--- b) Tree-based Feature Importance ---")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris_train_scaled, y_iris_train) # Fit on training data

importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X_iris_train.columns, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Feature Importances from RandomForest:\n", feature_importance_df)

# Select features based on importance threshold (e.g., keep features contributing > X%)
# Or use SelectFromModel which does this automatically
# from sklearn.feature_selection import SelectFromModel
# sfm = SelectFromModel(rf, threshold=0.1, prefit=True) # Use prefit=True as rf is already fitted
# X_train_iris_sfm = sfm.transform(X_iris_train_scaled)
# print(f"\nShape after SelectFromModel (threshold=0.1): {X_train_iris_sfm.shape}")
# selected_features_sfm = X_iris_train.columns[sfm.get_support()]
# print(f"Features selected by SelectFromModel: {list(selected_features_sfm)}")
print("\n(SelectFromModel can automatically select based on importance - see commented code)")
print("-" * 30)


# --- 5. Final Considerations ---
print("--- Final Considerations ---")
print("- Feature selection should generally be done *after* train-test split.")
print("- Fit selectors/models ONLY on the training data.")
print("- Transform both training and test sets using the *same* fitted selector.")
print("- Filter methods are fast but don't consider feature interactions.")
print("- Wrapper methods are more thorough but computationally expensive.")
print("- Embedded methods offer a balance, integrating selection into training.")
print("- The best method depends on the dataset, model, and goals.")
print("- Often beneficial to include feature selection within a Pipeline, especially when using cross-validation.")
print("-" * 30)

--- Loading Data ---
Iris Dataset Features:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Housing Dataset Features (with added noise):
    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85



Lasso Coefficients:
 MedInc          0.738636
HouseAge        0.138543
AveRooms       -0.000000
AveBedrms       0.000000
Population      0.000000
AveOccup       -0.000000
Latitude       -0.263678
Longitude      -0.222547
NoisyFeature   -0.000000
dtype: float64

Features selected by Lasso (non-zero coefs): ['MedInc', 'HouseAge', 'Latitude', 'Longitude']

--- b) Tree-based Feature Importance ---
Feature Importances from RandomForest:
              Feature  Importance
3   petal width (cm)    0.454892
2  petal length (cm)    0.400227
0  sepal length (cm)    0.120608
1   sepal width (cm)    0.024273

(SelectFromModel can automatically select based on importance - see commented code)
------------------------------
--- Final Considerations ---
- Feature selection should generally be done *after* train-test split.
- Fit selectors/models ONLY on the training data.
- Transform both training and test sets using the *same* fitted selector.
- Filter methods are fast but don't consider feature inter