# Feature selection methods: Backward elimination, forward selection, and LASSO

Feature selection is an essential part of building efficient machine learning models. By selecting the most relevant features, you can improve model performance, reduce overfitting, and enhance interpretability.

## Backward elimination

The goal is to eliminate features that contribute little to the predictive power of a given model.

Steps of backward elimination:
1. Fit the model (e.g., linear regression) with all the features in the dataset
2. Calculate p-values to determine how statistically significant each feature is
3. Remove the least significant feature, i.e., the feature with the highest p-value
4. Repeat steps 1-3 until every remaining feature is statistically significant

In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [2]:
# Sample dataset
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

In [3]:
# Add a constant to the model
X = sm.add_constant(X)

Fit the initial model

In [5]:
# Fit the model using Ordinary Least Squares (OLS)
model = sm.OLS(y, X).fit()

# Display the summary of the model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Pass   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.688
Method:                 Least Squares   F-statistic:                     10.94
Date:                Sat, 19 Jul 2025   Prob (F-statistic):            0.00701
Time:                        15:41:24   Log-Likelihood:               -0.17258
No. Observations:                  10   AIC:                             6.345
Df Residuals:                       7   BIC:                             7.253
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.3333      1.464     -0.228



Backward elimination starts with all features and progressively removes the least significant ones. The model summary reports a p-value for each feature in the `P>|t|` column, which measures how statistically significant that feature's coefficient is. Features with a p-value above the chosen significance level (0.05 is a common choice) are considered not significant and are candidates for removal.

Steps:
1. Fit the model with all features
2. Identify the feature with the highest p-value
3. Remove that feature if its p-value exceeds the significance level
4. Refit the model and repeat until all remaining features are statistically significant

In [6]:
# Define the significance level
significance_level = 0.05

# Perform backward elimination.
# Note: the intercept column ('const') is treated like any other column here,
# so it can be removed too if its p-value exceeds the threshold.
while True:
    # Fit the model
    model = sm.OLS(y, X).fit()
    # Get the highest p-value
    max_p_value = model.pvalues.max()

    # Check if the highest p-value is greater than the significance level
    # (stop at one column so the model is never left without predictors)
    if max_p_value > significance_level and X.shape[1] > 1:
        # Identify the feature with the highest p-value
        feature_to_remove = model.pvalues.idxmax()
        print(f"Removing feature: {feature_to_remove} with p-value: {max_p_value}")

        # Drop the feature
        X = X.drop(columns=[feature_to_remove])
    else:
        break

# Display the final model summary
print(model.summary())

Removing feature: PrevExamScore with p-value: 0.9999999999999964
Removing feature: const with p-value: 0.11419580126842216
                                 OLS Regression Results                                
Dep. Variable:                   Pass   R-squared (uncentered):                   0.831
Model:                            OLS   Adj. R-squared (uncentered):              0.812
Method:                 Least Squares   F-statistic:                              44.31
Date:                Sat, 19 Jul 2025   Prob (F-statistic):                    9.31e-05
Time:                        16:39:49   Log-Likelihood:                         -1.8294
No. Observations:                  10   AIC:                                      5.659
Df Residuals:                       9   BIC:                                      5.961
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                

