# Phase 4 — Ridge and Lasso Regression

This notebook extends the baseline OLS analysis by applying Ridge and Lasso
regularization techniques to assess coefficient stability and robustness.
Penalized regression methods are used to examine whether the substantive
conclusions persist under shrinkage.

## Data Source

This analysis uses the cleaned and constructed dataset produced in `01_data_cleaning.ipynb`.
All composite variables and preprocessing steps are documented there. The same predictor set (X1, X2, X3) and outcome variable (Y) used in the baseline OLS analysis are retained to ensure comparability across methods.


In [1]:
# Import
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV

In [2]:
# Load cleaned data
df = pd.read_csv("E:/kaust_fellowship_bootcamp/projects/undergraduate_thesis_python/data/processed/ta_christian_constructed.csv")

X = df[["X1", "X2", "X3"]].values
y = df["Y"].values

In [3]:
# Scale predictors for penalized regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### Standardization

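Ridge and Lasso penalties are scale-sensitive: a predictor measured in smaller
units needs a larger coefficient and is therefore penalized more heavily.
Standardizing the predictors to zero mean and unit variance ensures that the
penalty term is applied uniformly across coefficients.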
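
The scale sensitivity described above can be illustrated with a minimal sketch. It uses synthetic stand-in data rather than the thesis dataset, and the names `X_demo`, `y_demo`, `X_rescaled`, and `ridge_fitted` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the thesis dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] + X_demo[:, 1] + rng.normal(size=200)

# Same information, but the second predictor is recorded in different units
# (its values are divided by 100).
X_rescaled = X_demo.copy()
X_rescaled[:, 1] = X_rescaled[:, 1] / 100.0


def ridge_fitted(X, y, alpha=10.0):
    """Fitted values from a Ridge model with a fixed penalty strength."""
    return Ridge(alpha=alpha).fit(X, y).predict(X)


# Without standardization, the fitted values depend on the choice of units.
raw_gap = np.max(np.abs(
    ridge_fitted(X_demo, y_demo) - ridge_fitted(X_rescaled, y_demo)
))

# After standardization, both versions give the same fitted values.
std_gap = np.max(np.abs(
    ridge_fitted(StandardScaler().fit_transform(X_demo), y_demo)
    - ridge_fitted(StandardScaler().fit_transform(X_rescaled), y_demo)
))

print(f"max difference without scaling: {raw_gap:.4f}")
print(f"max difference with scaling:    {std_gap:.2e}")
```

In this sketch the unscaled fits disagree noticeably, while the standardized fits agree up to floating-point error.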

In [4]:
# Re-fit OLS for coefficient comparison
# (fit on the unscaled predictors, so coefficients are in original units)
X_ols = sm.add_constant(X)
ols_model = sm.OLS(y, X_ols).fit()

In [5]:
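# Candidate penalty strengths: 100 values, log-spaced from 1e-3 to 1e3
alphas = np.logspace(-3, 3, 100)

# Ridge with the penalty chosen by 5-fold cross-validation
ridge = RidgeCV(alphas=alphas, cv=5)
ridge.fit(X_scaled, y)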

ridge_coef = pd.Series(
    ridge.coef_,
    index=["X1", "X2", "X3"]
)

ridge_coef

X1   -0.001854
X2    0.064729
X3   -0.004889
dtype: float64

### Ridge Regression Results

Ridge regression shrinks all coefficient magnitudes toward zero while retaining
all predictors in the model. The coefficient on knowledge of the US–China trade
war (X2) remains the largest in magnitude (0.065), roughly 13 times the next
largest (X3, -0.005), so X2 is still the leading predictor of investment
decision behavior after shrinkage.

In contrast, the coefficients on investment knowledge (X1) and risk perception
(X3) are close to zero, suggesting limited explanatory contribution once
regularization is applied.
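
The shrinkage behaviour described above can be sketched with the Ridge closed-form solution, $\hat\beta = (X_c^\top X_c + \alpha I)^{-1} X_c^\top y_c$ on centered data. The example below uses synthetic stand-in data; `X_demo`, `y_demo`, `demo_alphas`, and `ridge_closed_form` are hypothetical names, and the penalty values are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the thesis dataset.
rng = np.random.default_rng(1)
X_demo = StandardScaler().fit_transform(rng.normal(size=(150, 3)))
y_demo = 0.3 * X_demo[:, 1] + rng.normal(size=150)


def ridge_closed_form(X, y, alpha):
    """Ridge coefficients on centered data (the intercept is not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)


demo_alphas = [0.1, 10.0, 1000.0]  # illustrative penalty strengths
closed = [ridge_closed_form(X_demo, y_demo, a) for a in demo_alphas]
fitted = [Ridge(alpha=a).fit(X_demo, y_demo).coef_ for a in demo_alphas]

# The coefficient norm falls as the penalty grows, but no coefficient
# is set exactly to zero.
norms = [float(np.linalg.norm(b)) for b in closed]
for a, b, nrm in zip(demo_alphas, closed, norms):
    print(f"alpha={a:>7}: coef={np.round(b, 4)}, L2 norm={nrm:.4f}")
```

In the notebook itself, the selected penalty is available as `ridge.alpha_`. If it sits at either end of the `alphas` grid, the amount of shrinkage may be bounded by the grid rather than by the data.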


In [6]:
# Lasso with the penalty chosen by 5-fold cross-validation
# over the same alpha grid as Ridge
lasso = LassoCV(
    alphas=alphas,
    cv=5,
    max_iter=10000
)
lasso.fit(X_scaled, y)

lasso_coef = pd.Series(
    lasso.coef_,
    index=["X1", "X2", "X3"]
)

lasso_coef


X1    0.0
X2    0.0
X3    0.0
dtype: float64

### Lasso Regression Results

Lasso regression with 5-fold cross-validation shrinks all coefficients to zero,
selecting an intercept-only model. Under an L1 penalty chosen to minimize
cross-validated prediction error, none of the predictors improves out-of-sample
fit enough to justify inclusion in the model.

This result suggests that while knowledge of the US–China trade war (X2) appears
statistically significant under classical OLS inference, the overall signal is
weak when assessed through sparse predictive modeling.
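
One way to read an all-zero Lasso fit is through the smallest penalty that zeroes every coefficient. Under scikit-learn's Lasso objective, with centered predictors, this threshold is $\alpha_{\max} = \max_j |x_j^\top (y - \bar{y})| / n$. The sketch below checks this on synthetic stand-in data; `X_demo`, `y_demo`, `alpha_max`, `coef_above`, and `coef_below` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the thesis dataset.
rng = np.random.default_rng(2)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 3)))
y_demo = 0.2 * X_demo[:, 1] + rng.normal(size=200)

# Smallest penalty at which every Lasso coefficient is exactly zero
# (predictors are already centered by the scaler).
n_samples = X_demo.shape[0]
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / n_samples

# Slightly above the threshold: intercept-only model.
coef_above = Lasso(alpha=1.01 * alpha_max).fit(X_demo, y_demo).coef_
# Below the threshold: at least one predictor enters.
coef_below = Lasso(alpha=0.5 * alpha_max).fit(X_demo, y_demo).coef_

print(f"alpha_max = {alpha_max:.4f}")
print("coefficients just above alpha_max:", coef_above)
print("coefficients at half of alpha_max:", coef_below)
```

Applied to the notebook's data, comparing `lasso.alpha_` with the same quantity computed from `X_scaled` and `y` would show how far the cross-validated penalty lies above the point where the first predictor would enter.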

As an exploratory sensitivity analysis, a weaker ℓ1 penalty (a smaller alpha than the cross-validated choice) recovers X2 as the dominant coefficient, consistent with the OLS and Ridge regression results. The signal is therefore not strong enough to survive the cross-validated sparsity penalty, but X2 remains the leading predictor across the modeling frameworks considered.
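
A sensitivity check of this kind could look like the sketch below, which refits Lasso at a few fixed penalties and tabulates the coefficients. It is not the analysis behind the statement above: the data are synthetic stand-ins with a deliberately built-in X2 effect, and `X_demo`, `y_demo`, `demo_alphas`, and `sensitivity` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the thesis dataset.
# Only the second column (labelled X2) carries signal, by construction.
rng = np.random.default_rng(4)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 3)))
y_demo = 0.5 * X_demo[:, 1] + rng.normal(size=300)

# Illustrative penalties, from strong to weak.
demo_alphas = [2.0, 0.3, 0.05]

# One row of coefficients per penalty strength.
coefs = np.vstack([
    Lasso(alpha=a, max_iter=10000).fit(X_demo, y_demo).coef_
    for a in demo_alphas
])
sensitivity = pd.DataFrame(
    coefs,
    index=pd.Index(demo_alphas, name="alpha"),
    columns=["X1", "X2", "X3"],
)
print(sensitivity)
```

In the notebook, the same loop could be run on `X_scaled` and `y` with penalties below `lasso.alpha_` to see the order in which predictors enter the model.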

In [7]:
coef_comparison = pd.DataFrame({
    "OLS": ols_model.params[1:],   # exclude intercept
    "Ridge": ridge_coef,
    "Lasso": lasso_coef
})

coef_comparison


| Predictor | OLS | Ridge | Lasso |
|---|---:|---:|---:|
| X1 | -0.068826 | -0.001854 | 0.0 |
| X2 | 0.347111 | 0.064729 | 0.0 |
| X3 | -0.061708 | -0.004889 | 0.0 |


### Coefficient Stability Across Models

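The coefficient comparison shows that the substantive conclusion of the baseline OLS model is largely preserved under Ridge regularization: knowledge of the US–China trade war (X2) remains the dominant predictor, and all three coefficients keep their OLS signs. Because OLS was fit on the unscaled predictors and Ridge on the standardized ones, the two columns are on different scales and should be compared on sign and on which predictor dominates, not on raw magnitude.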
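
Since the OLS model in this notebook is fit on `X` and the penalized models on `X_scaled`, a like-for-like magnitude comparison needs the OLS coefficients on the standardized scale. A minimal sketch of the conversion follows, on synthetic stand-in data. `LinearRegression` is used here only for its point estimates, which match those of `sm.OLS` with a constant, and `X_demo`, `y_demo`, `scaler_demo`, `raw_coef`, `std_coef`, and `converted` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the thesis dataset.
# The three columns are given deliberately different spreads.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(250, 3)) * np.array([0.2, 1.0, 5.0])
y_demo = 0.5 * X_demo[:, 1] + rng.normal(size=250)

scaler_demo = StandardScaler().fit(X_demo)

# OLS slopes in original units and on standardized predictors.
raw_coef = LinearRegression().fit(X_demo, y_demo).coef_
std_coef = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo).coef_

# A standardized slope equals the raw slope times the predictor's
# standard deviation, which StandardScaler stores in `scale_`.
converted = raw_coef * scaler_demo.scale_

print("raw OLS slopes:         ", np.round(raw_coef, 4))
print("standardized OLS slopes:", np.round(std_coef, 4))
print("raw slopes * std dev:   ", np.round(converted, 4))
```

In the notebook, the analogous step would be to multiply `ols_model.params[1:]` by `scaler.scale_`, or to refit OLS on `X_scaled`, before placing the OLS column next to the Ridge and Lasso columns.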

Under Lasso regularization with cross-validated penalty selection, all coefficients are shrunk to zero, indicating that the overall predictive signal is weak when sparse modeling is enforced. However, sensitivity analysis using a weaker ℓ1 penalty recovers X2 as the primary coefficient, consistent with OLS and Ridge estimates.

These results suggest that X2 is consistently the leading predictor wherever any predictor is retained, while the model as a whole has limited predictive strength under cross-validated regularization.

**Status:** Phase 4 complete. Ridge regression preserves the signs of the OLS coefficients and the dominance of X2, while Lasso highlights weak overall predictive signal under sparsity constraints.