Feature selection is the process of identifying and selecting a subset of relevant features for use in model construction. The goal is to enhance the model's performance by reducing overfitting, improving accuracy, and reducing training time.

Improved Model Performance: By removing irrelevant or redundant features, we can improve the accuracy of the model.

Reduced Overfitting: With fewer features, the model is less likely to learn noise from the training data.

Faster Computation: Reducing the number of features decreases the computational cost and training time.



Types of Feature Selection Methods

Filter Methods: Filter methods use statistical techniques to evaluate the relevance of features independently of the model. Common techniques include correlation coefficients, chi-square tests, and mutual information.
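
As a rough illustration of a filter method, the sketch below scores each iris feature with mutual information, without training any predictive model. The variable names ('iris', 'mi_scores') and the choice of 'random_state=0' are assumptions made for this example only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Load the iris data as a DataFrame so scores can be labeled by feature name
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Estimate the mutual information between each feature and the target.
# No predictive model is fitted; each feature is scored on its own.
mi_scores = pd.Series(
    mutual_info_classif(X, y, random_state=0),  # random_state is an assumed seed
    index=X.columns,
)

# Higher scores suggest a stronger dependency between feature and target
print(mi_scores.sort_values(ascending=False))
```

Because the scores are computed once per feature, this kind of filter is cheap, but it cannot account for interactions between features.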

Wrapper Methods: Wrapper methods use a predictive model to evaluate feature subsets and select the best-performing combination. Techniques include recursive feature elimination (RFE) and forward/backward feature selection.
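
Forward selection can be sketched with Scikit-Learn's 'SequentialFeatureSelector' (available from version 0.24). The estimator, the 5-fold cross-validation, and the target of 2 features are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

# Forward selection: start with no features and, at each step, add the one
# that gives the best cross-validated score for the wrapped model.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),  # assumed estimator for this sketch
    n_features_to_select=2,
    direction="forward",  # use "backward" to start from all features instead
    cv=5,
)
sfs.fit(X, y)

print("Selected features:", list(X.columns[sfs.get_support()]))
```

Each candidate subset requires refitting the model, so wrapper methods tend to be much more expensive than filter methods as the number of features grows.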

Embedded Methods: Embedded methods perform feature selection during the model training process. Examples include Lasso (L1 regularization) and feature importance from tree-based models.
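
A minimal sketch of an embedded method follows. Since iris is a classification problem, it swaps Lasso for an L1-penalized linear SVM ('LinearSVC') and lets 'SelectFromModel' keep the features whose coefficients are not driven to zero. The value 'C=0.01' is an assumed regularization strength.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True, as_frame=True)

# The L1 penalty pushes some coefficients to exactly zero during training.
# C=0.01 is an assumption; a smaller C gives a stronger penalty and fewer features.
l1_svc = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0)

# SelectFromModel fits the model and keeps features with non-negligible weights
embedded_selector = SelectFromModel(l1_svc).fit(X, y)
kept = list(X.columns[embedded_selector.get_support()])

print("Kept features:", kept)
```

Here selection is a by-product of fitting a single model, which is why embedded methods usually cost less than wrapper methods.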



Feature Selection Techniques with Scikit-Learn
Scikit-Learn provides several tools for feature selection, including:

Univariate Selection: Univariate selection evaluates each feature individually to determine its importance. Techniques like 'SelectKBest' and 'SelectPercentile' can be used to select the top features based on statistical tests.

Recursive Feature Elimination (RFE): RFE is a wrapper method that recursively removes the least important features, ranking them by the fitted model's coefficients or feature importances. It repeatedly fits the model and eliminates the weakest features until the desired number of features is reached.

Feature Importance from Tree-based Models: Tree-based models like 'decision trees' and 'random forests' can provide feature importance scores, indicating the importance of each feature in making predictions.
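
'SelectPercentile', mentioned above alongside 'SelectKBest', keeps a fraction of the features rather than a fixed count. The sketch below assumes the ANOVA F-test ('f_classif') as the scoring function and a cutoff of 50 percent; both are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# Keep the top 50% of features ranked by their ANOVA F-statistic
percentile_selector = SelectPercentile(score_func=f_classif, percentile=50)
X_top = percentile_selector.fit_transform(X, y)

print("Selected features:", list(X.columns[percentile_selector.get_support()]))
print("Reduced shape:", X_top.shape)
```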


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
print(data.feature_names)
print(data.target_names)
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print(X.head(5))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Shapes: X_train:{X_train.shape}, X_test:{X_test.shape}, y_train:{y_train.shape}, y_test:{y_test.shape}")

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
Shapes: X_train:(105, 4), X_test:(45, 4), y_train:(105,), y_test:(45,)


We'll use 'SelectKBest' with the chi-square test to select the top 2 features. The chi-square test requires non-negative feature values, which the iris measurements satisfy.

from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with chi2
select_k_best = SelectKBest(score_func=chi2, k=2)
X_train_k_best = select_k_best.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[select_k_best.get_support()])


Selected features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
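
To see why those two features were chosen, and to reuse the selection on held-out data, something like the following could be used. It repeats the data loading so that it runs on its own; the names 'scores' and 'X_test_k_best' are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.3, random_state=42
)

# Fit the selector on the training split only
select_k_best = SelectKBest(score_func=chi2, k=2).fit(X_train, y_train)

# Inspect the chi-square statistic and p-value behind each feature's rank
scores = pd.DataFrame(
    {"chi2": select_k_best.scores_, "p_value": select_k_best.pvalues_},
    index=X_train.columns,
)
print(scores.sort_values("chi2", ascending=False))

# Apply the already-fitted selector to the test split (no refitting)
X_test_k_best = select_k_best.transform(X_test)
print("Test shape after selection:", X_test_k_best.shape)
```

Fitting on the training split and only transforming the test split keeps information from the test data out of the selection step.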


Next, we'll use RFE with a logistic regression model to select the top 2 features.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Apply RFE with logistic regression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
X_train_rfe = rfe.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[rfe.get_support()])


Selected features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
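
The order in which RFE discards features can be read from its 'ranking_' attribute, as in this self-contained sketch. Setting 'max_iter=1000' is an assumption added here to give the solver room to converge on unscaled data; it is not part of the cell above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.3, random_state=42
)

# step=1 removes one feature per iteration until 2 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(X_train, y_train)

# Rank 1 marks the selected features; larger ranks were eliminated earlier
ranking = pd.Series(rfe.ranking_, index=X_train.columns).sort_values()
print(ranking)
```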


Finally, we'll use a random forest classifier to determine feature importance.

from sklearn.ensemble import RandomForestClassifier

# Train random forest and get feature importances
# (no random_state is set, so the exact values vary slightly between runs)
model = RandomForestClassifier()
model.fit(X_train, y_train)  # train model
importances = model.feature_importances_  # extract feature importances

# Display feature importances
feature_importances = pd.Series(importances, index=X_train.columns)
print(feature_importances.sort_values(ascending=False))


petal length (cm)    0.449597
petal width (cm)     0.422919
sepal length (cm)    0.094227
sepal width (cm)     0.033257
dtype: float64
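
Importance scores alone do not drop any columns. One way to turn them into a selection is 'SelectFromModel', sketched below with an assumed threshold of "median" and an assumed 'random_state=42' for repeatability. It also compares test accuracy with and without the selection.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.3, random_state=42
)

# Keep features whose importance is at or above the median importance
forest = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selector = SelectFromModel(forest, threshold="median").fit(X_train, y_train)
X_train_sel = rf_selector.transform(X_train)
X_test_sel = rf_selector.transform(X_test)
print("Selected features:", list(X_train.columns[rf_selector.get_support()]))

# Compare test accuracy using all features versus the selected subset
full_acc = (
    RandomForestClassifier(n_estimators=100, random_state=42)
    .fit(X_train, y_train)
    .score(X_test, y_test)
)
sel_acc = (
    RandomForestClassifier(n_estimators=100, random_state=42)
    .fit(X_train_sel, y_train)
    .score(X_test_sel, y_test)
)
print(f"Accuracy with all features: {full_acc:.3f}")
print(f"Accuracy with selected features: {sel_acc:.3f}")
```

On a dataset this small the two accuracies may be very close; the comparison is meant to show the workflow rather than a guaranteed gain.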


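Scikit-Learn provides a variety of tools for feature selection, including univariate selection, recursive feature elimination, and feature importance from tree-based models. On the iris data, all three approaches point to petal length and petal width as the most informative features. Applying these techniques can improve a model's accuracy and reduce its training cost, particularly on datasets with many irrelevant or redundant features.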