Two predictive models enter the testing phase. One is bogged down with irrelevant features, performs poorly, and misses key insights. The other uses only the most critical features, runs efficiently, and delivers sharp predictions. What made the difference? In this video, we'll explore feature selection techniques like backward elimination, forward selection, and lasso to ensure your model is on the winning side.

We'll start with backward elimination, a step-by-step process where we begin with all features and gradually remove those that are not statistically significant based on their p-values. A p-value is the probability of seeing an effect at least as large as the one observed if the feature's true coefficient were zero. A small p-value (conventionally below 0.05) lets us reject that null hypothesis and keep the feature; a large one suggests the feature adds little to the model and is a candidate for removal.
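
To make the p-value concrete, here is a minimal sketch that recomputes one by hand. It assumes SciPy is installed (it is not imported elsewhere in this notebook) and uses the t-statistic (3.083) and residual degrees of freedom (47) that the OLS summary in this notebook reports for the intercept. The variable names are illustrative only.

```python
from scipy import stats

t_statistic = 3.083   # t value reported for the intercept in the OLS summary
residual_dof = 47     # "Df Residuals" reported in the same summary

# Two-sided p-value: the probability of a |t| at least this large
# if the true coefficient were zero
p_value = 2 * stats.t.sf(abs(t_statistic), df=residual_dof)

# Compare against the conventional 0.05 threshold
is_significant = p_value < 0.05
print(f"p-value: {p_value:.4f}, significant at 0.05: {is_significant}")
```

statsmodels performs the same kind of calculation for every coefficient and reports it in the `P>|t|` column of the summary.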

In [5]:
%pip install statsmodels


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/anaconda/envs/azureml_py310_sdkv2/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

# Build a synthetic dataset of 50 students.
# All values are random, so 'Pass' has no real relationship to the features.
data = {
    'StudyHours': np.random.randint(1, 10, 50),
    'PrevExamScore': np.random.randint(30, 100, 50),
    'Pass': np.random.randint(0, 2, 50)
}
df = pd.DataFrame(data)

# 50 rows is enough to avoid the small-sample warning in the model summary

In [3]:

import statsmodels.api as sm

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

# Add constant (intercept) to the model
X = sm.add_constant(X)

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Display model summary with p-values
print(model.summary())

# Find the least significant feature (highest p-value), ignoring the intercept
feature_pvalues = model.pvalues.drop('const')
worst_feature = feature_pvalues.idxmax()

# Remove it and refit only if it is not significant at the 0.05 level
if feature_pvalues[worst_feature] > 0.05:
    X = X.drop(columns=worst_feature)
    model = sm.OLS(y, X).fit()

print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                   Pass   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.053
Method:                 Least Squares   F-statistic:                     2.380
Date:                Sat, 06 Dec 2025   Prob (F-statistic):              0.104
Time:                        13:55:47   Log-Likelihood:                -32.857
No. Observations:                  50   AIC:                             71.71
Df Residuals:                      47   BIC:                             77.45
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.9131      0.296      3.083

Forward selection

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def forward_selection(X, y):
    remaining_features = set(X.columns)
    selected_features = []
    # Baseline R-squared of 0.0: a feature is added only if it beats this score
    current_score = 0.0
    
    while remaining_features:
        scores_with_candidates = []
        
        for feature in remaining_features:
            features_to_test = selected_features + [feature]
            X_train, X_test, y_train, y_test = train_test_split(X[features_to_test], y, test_size=0.2, random_state=42)
            
            # Train the model
            model = LinearRegression()
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            score = r2_score(y_test, y_pred)
            
            scores_with_candidates.append((score, feature))
        
        # Select the feature with the best score
        scores_with_candidates.sort(reverse=True)
        best_score, best_feature = scores_with_candidates[0]
        
        if current_score < best_score:
            remaining_features.remove(best_feature)
            selected_features.append(best_feature)
            current_score = best_score
        else:
            break
    
    return selected_features

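# Pass the original feature columns; X also carries the 'const' column added for statsmodels
best_features = forward_selection(df[['StudyHours', 'PrevExamScore']], y)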
print(f"Selected features: {best_features}")



Selected features: []
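
The hand-written loop above scores each candidate on a single train/test split. As an alternative sketch, scikit-learn provides `SequentialFeatureSelector`, which runs the same forward search with cross-validation. The synthetic `toy_X` / `toy_y` data and the choice of `n_features_to_select=1` are assumptions made for illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): the target depends on "signal" only
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
toy_y = 3.0 * toy_X["signal"] + rng.normal(scale=0.5, size=200)

# Forward search: start from no features and add the one that most
# improves the 5-fold cross-validated score
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=1,
    direction="forward",
    cv=5,
)
selector.fit(toy_X, toy_y)

# get_support() returns a boolean mask over the input columns
chosen = list(toy_X.columns[selector.get_support()])
print(f"Selected features: {chosen}")
```

Cross-validation averages the score over several splits, so the selection is less sensitive to how one particular split happens to fall.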


Lasso

In [5]:
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

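# Rebuild X from the original feature columns so that the 'const' column
# added for statsmodels is not treated as a feature
X = df[['StudyHours', 'PrevExamScore']]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)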

# Initialize the LASSO model with a regularization strength (alpha)
lasso_model = Lasso(alpha=0.1)

# Train the LASSO model
lasso_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lasso_model.predict(X_test)

# Evaluate the model's performance
r2 = r2_score(y_test, y_pred)
print(f'R-squared score: {r2}')

# Show the coefficients
print(f'LASSO Coefficients: {lasso_model.coef_}')




R-squared score: -0.00841181729560092
LASSO Coefficients: [ 0.         -0.00727535]
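
The regularization strength `alpha` controls how many coefficients LASSO drives to zero. The sketch below sweeps a few values of `alpha` on synthetic data to show the effect. The data, the list of alphas, and the names `toy_features`, `toy_target`, and `nonzero_counts` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only the first column drives the target
rng = np.random.default_rng(0)
n_samples = 200
signal = rng.normal(size=n_samples)
toy_features = np.column_stack([
    signal,
    rng.normal(size=n_samples),
    rng.normal(size=n_samples),
])
toy_target = 3.0 * signal + rng.normal(scale=0.5, size=n_samples)

# Fit one model per alpha and count how many coefficients stay non-zero
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 5.0]:
    sweep_model = Lasso(alpha=alpha).fit(toy_features, toy_target)
    nonzero_counts[alpha] = int(np.count_nonzero(sweep_model.coef_))
    print(f"alpha={alpha}: coefficients={np.round(sweep_model.coef_, 3)}")
```

A small `alpha` keeps the model close to ordinary least squares, while a large enough `alpha` removes every feature, so in practice the value is usually tuned, for example with cross-validation.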


This way, LASSO selects important features while maintaining model simplicity: its L1 penalty shrinks the coefficients of less useful features to exactly zero, as the zero coefficient in the output above shows. To wrap up, remember these key points. Feature selection techniques like backward elimination, forward selection, and LASSO help you focus on the most important features, which can improve your model's performance. Streamlining your dataset reduces complexity and lowers the risk of overfitting, making your model more efficient and often more accurate.