Consider a dataset of 500 individuals containing their gender, height, weight and Body Mass Index (BMI). Here, Index is the dependent variable, and Gender, Height and Weight are the independent variables. We will use the pandas library to load the data and statsmodels to compute the Variance Inflation Factor (VIF) of each independent variable.

In [23]:
import pandas as pd

data=pd.read_csv('BMI.csv')
print(data.head())

   Gender  Height  Weight  Index
0    Male     174      96      4
1    Male     189      87      2
2  Female     185     110      4
3  Female     195     104      3
4    Male     149      61      3


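The VIF quantifies how much the variance of a coefficient estimate is inflated by correlation among the predictors. Common rules of thumb:

VIF = 1: No multicollinearity.

VIF between 1 and 5: Moderate correlation, usually acceptable.

VIF > 5 (or > 10, depending on the convention): High multicollinearity; consider removing or combining the affected variables.

For each feature $X_i$, statsmodels' variance_inflation_factor does the following:

Takes one column ($X_i$) as the dependent variable (target).

Uses all the other columns in x as independent variables (predictors).

Fits a linear regression:
$$X_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon$$
(excluding $X_i$ itself from the predictors). Note that the function does not add the intercept $\beta_0$ on its own; it is estimated only if x contains a constant column.

Calculates $R_i^2$ (coefficient of determination) from that regression. This measures how much of the variance of $X_i$ is explained by the other predictors.

Returns $\mathrm{VIF}_i = \dfrac{1}{1 - R_i^2}$.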


In [24]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Convert categorical variable to numeric
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Select features
x = data[['Gender', 'Height', 'Weight']]

# Drop missing and infinite values properly
#x = x.dropna()
#x = x.replace([float('inf'), float('-inf')], pd.NA).dropna()

# Reset index (optional but safe)
#x = x.reset_index(drop=True)

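# Calculate the VIF of each column.
# Note: variance_inflation_factor does not add an intercept column, so x is used as-is here.
vif = pd.DataFrame()
vif['features'] = x.columns
vif['VIF'] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]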

print(vif)


  features        VIF
0   Gender   2.028864
1   Height  11.623103
2   Weight  10.688377
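
The per-feature regression behind these numbers can be reproduced by hand. The following is a minimal sketch using NumPy on synthetic data (not the BMI dataset); the helper name `manual_vif`, its `add_intercept` flag and the generated columns `a`, `b`, `c` are illustrative assumptions. It also shows how leaving out the intercept changes the result for columns whose mean is far from zero.

```python
import numpy as np

def manual_vif(X, add_intercept=True):
    """Hypothetical helper: VIF of every column via one auxiliary regression each."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]                      # column i plays the role of the target
        predictors = np.delete(X, i, axis=1)  # all remaining columns are predictors
        if add_intercept:
            predictors = np.column_stack([np.ones(n_rows), predictors])
        coef, *_ = np.linalg.lstsq(predictors, target, rcond=None)
        residual = target - predictors @ coef
        ss_res = residual @ residual
        if add_intercept:
            ss_tot = np.sum((target - target.mean()) ** 2)  # centered R^2
        else:
            ss_tot = target @ target                        # uncentered R^2
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (assumption): a and b are strongly correlated, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(170, 10, size=500)
b = 0.5 * a + rng.normal(0, 1, size=500)
c = rng.normal(60, 15, size=500)
X = np.column_stack([a, b, c])

centered = manual_vif(X, add_intercept=True)
uncentered = manual_vif(X, add_intercept=False)
print("with intercept:   ", centered.round(2))
print("without intercept:", uncentered.round(2))
```

On data like this, the no-intercept variant tends to report larger values for columns with large means, which may be part of the reason Height and Weight score above 10 in the output above. Adding a constant column to x (for example with `statsmodels.api.add_constant`) before calling `variance_inflation_factor`, and ignoring the VIF reported for the constant, is a common way to obtain the intercept-based values.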


REGULARIZATION

In [33]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# noise=0.1 adds Gaussian noise (mean 0, standard deviation 0.1) to the target
x, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

lasso=Lasso(alpha=0.1)
lasso.fit(x_train,y_train)

y_prediction=lasso.predict(x_test)

mse=mean_squared_error(y_test,y_prediction)
print(f"Mean Squared Error:{mse}")
print("Coefficients:",lasso.coef_)

Mean Squared Error:0.06362439921332456
Coefficients: [60.50305581 98.52475354 64.3929265  56.96061238 35.52928502]
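
In the Lasso cell above, none of the coefficients is zero. That is expected: `make_regression` caps `n_informative` at `n_features`, so all five features carry signal. The sketch below is an illustration under different assumptions (10 features, only 3 of them informative, `alpha=1.0`); it compares Lasso with ordinary least squares to show the L1 penalty setting coefficients exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): only 3 of the 10 features influence y.
X, y, true_coef = make_regression(
    n_samples=100, n_features=10, n_informative=3,
    noise=0.1, coef=True, random_state=42,
)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("True coefficients: ", np.round(true_coef, 2))
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Exact zeros (OLS, Lasso):", n_zero_ols, n_zero_lasso)
```

Least squares typically assigns small non-zero weights to the uninformative features, whereas Lasso can drop them entirely, which is why it is often used for feature selection.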


In [34]:
from sklearn.linear_model import Ridge
#other libraries...

ridge=Ridge(alpha=1.0)
ridge.fit(x_train,y_train)
y_pred2=ridge.predict(x_test)

mse1=mean_squared_error(y_test,y_pred2)
print("Mean Squared Error:",mse1)
print("Coefficients:",ridge.coef_)

Mean Squared Error: 4.114050771972589
Coefficients: [59.87954432 97.15091098 63.24364738 56.31999433 35.34591136]
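
Ridge regression minimizes the squared error plus `alpha` times the squared L2 norm of the coefficients, leaving the intercept unpenalized. This has a closed-form solution. The sketch below implements it with NumPy on synthetic data; the function name `ridge_closed_form`, the chosen `true_w` and the `alpha` grid are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: solve (Xc^T Xc + alpha*I) w = Xc^T yc on centered data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering keeps the intercept unpenalized
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data (assumption): known weights plus a little Gaussian noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.5, 1.0, 4.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = []
for alpha in alphas:
    coef, intercept = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>6}: coef={np.round(coef, 3)}")

coef_ols, _ = ridge_closed_form(X, y, 0.0)   # alpha=0 reduces to least squares
```

As `alpha` grows, every coefficient is pulled toward zero. On nearly noise-free data this shrinkage mostly adds bias, which is one plausible reason the Ridge model above has a higher test error than a weakly penalized fit.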


In [35]:
from sklearn.linear_model import ElasticNet

model=ElasticNet(alpha=1.0,l1_ratio=0.5)
model.fit(x_train,y_train)

y_pred3=model.predict(x_test)
mse2=mean_squared_error(y_test,y_pred3)

print("Mean Squared Error",mse2)
print("Coefficients:",model.coef_)

Mean Squared Error 2662.3292683761697
Coefficients: [41.2685658  60.6166494  34.45391474 37.4873701  26.29561474]
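
The much larger error here is consistent with how scikit-learn scales its penalties: Lasso and ElasticNet divide the squared error by `2 * n_samples`, while Ridge does not, so the same `alpha` value means a far stronger penalty for ElasticNet than for Ridge. The sketch below re-creates the same data and sweeps `alpha` for ElasticNet to show the effect; the `alpha` grid is an assumption chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cells above.
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = [0.01, 0.1, 1.0]   # illustrative grid (assumption)
mses = []
for alpha in alphas:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mses.append(mse)
    print(f"alpha={alpha:<5} test MSE={mse:.3f}")
```

In practice `alpha` (and `l1_ratio`) would be chosen by cross-validation, for example with `ElasticNetCV`, rather than fixed by hand.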
