# 05 Regularization: Ridge and Lasso

Shrinkage, coefficient paths, feature selection.


## Table of Contents
- [Build feature matrix](#build-feature-matrix)
- [Fit ridge/lasso](#fit-ridge-lasso)
- [Coefficient paths](#coefficient-paths)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Regression is the bridge between statistics and ML. You will learn:
- single-factor vs multi-factor interpretation,
- robust standard errors,
- coefficient stability and multicollinearity.


## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating coefficients as causal without a causal design.
- Ignoring multicollinearity (unstable coefficients).

## Matching Guide
- `docs/guides/02_regression/05_regularization_ridge_lasso.md`



## How To Use This Notebook
- This notebook is hands-on. Most code cells are incomplete on purpose.
- Complete each TODO, then run the cell.
- Use the matching guide (`docs/guides/02_regression/05_regularization_ridge_lasso.md`) for deep explanations and alternative examples.
- Write short interpretation notes as you go (what changed, why it matters).



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Use ridge and lasso regression to handle correlated macro predictors.

Why this notebook exists:
- OLS coefficients can be unstable when predictors are correlated.
- Ridge shrinks coefficients smoothly.
- Lasso can set some coefficients exactly to 0, which acts as a rough form of feature selection.
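
To make these three points concrete, here is a minimal sketch on synthetic data (not the macro table). The data-generating process, the variable names, and the alpha values are all illustrative assumptions: `x2` is nearly a copy of `x1`, and `x3` is unrelated noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # almost perfectly correlated with x1
x3 = rng.normal(size=n)               # irrelevant predictor
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))
y = 2.0 * x1 + 0.5 * rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.2, max_iter=20000).fit(X, y)

# Compare coefficient vectors side by side.
for name, model in [("OLS", ols), ("ridge", ridge), ("lasso", lasso)]:
    print(f"{name:>5}:", np.round(model.coef_, 3))

# Count how many lasso coefficients are exactly zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
print("lasso coefficients set exactly to 0:", n_zero_lasso)
```

When you run it, compare how each model divides the effect between `x1` and `x2`, and check which lasso coefficients are exactly zero. The ridge coefficients are all nonzero but smaller in overall size than the OLS ones.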



## Primer: sklearn Pipelines (How To Avoid Preprocessing Leakage)

### Why pipelines exist
A common ML mistake is fitting preprocessing (scalers, imputers) on the full dataset.
That leaks information from the test set into training.

A `Pipeline` enforces the correct order:
- fit preprocessing on training only
- apply preprocessing to test
- fit model on training only

### Key API concepts
- `fit(X, y)`: learn parameters from data (e.g., scaler means/standard deviations, model weights).
- `transform(X)`: apply learned parameters to new data (e.g., scale).
- `fit_transform(X, y)`: convenience that does both on the same data (use it on training data only).

If you do `scaler.fit(X_all)` before splitting, you leaked test-set information.
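
As a rough check of the `fit`/`transform` split, the sketch below fits a small pipeline on toy data and confirms that the scaler's learned mean comes from the training rows only. The arrays and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the "test" block is deliberately shifted so leakage would be visible.
rng = np.random.default_rng(1)
X_tr_demo = rng.normal(loc=0.0, size=(80, 2))
X_te_demo = rng.normal(loc=3.0, size=(20, 2))
y_tr_demo = X_tr_demo[:, 0] + 0.1 * rng.normal(size=80)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])
pipe_demo.fit(X_tr_demo, y_tr_demo)      # scaler is fit on the training rows, then the model
y_hat_demo = pipe_demo.predict(X_te_demo)  # scaler only transforms the test rows

scaler_demo = pipe_demo.named_steps["scaler"]
print("mean learned by scaler:", scaler_demo.mean_)
print("training mean:         ", X_tr_demo.mean(axis=0))
```

The two printed vectors should agree, because `predict` never refits the scaler.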

### Example pattern
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000)),
])

# clf.fit(X_train, y_train)
# y_prob = clf.predict_proba(X_test)[:, 1]
```

### Mini demo: the leakage you're avoiding (toy example)
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Pretend the last 20% of data comes from a different era with a different mean
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
X_test  = rng.normal(loc=2.0, scale=1.0, size=(25, 1))

# WRONG: fit scaler on train+test (leaks the future)
sc_wrong = StandardScaler().fit(np.vstack([X_train, X_test]))
X_test_wrong = sc_wrong.transform(X_test)

# RIGHT: fit scaler on train only
sc_right = StandardScaler().fit(X_train)
X_test_right = sc_right.transform(X_test)

print("test mean after wrong scaling:", float(X_test_wrong.mean()))
print("test mean after right scaling:", float(X_test_right.mean()))
```

### What to remember
- Always split by time first.
- Then fit the pipeline on train.
- Then evaluate on test.

If you need different preprocessing for different columns, look into:
- `sklearn.compose.ColumnTransformer`
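
A minimal sketch of that idea, assuming a toy frame with two continuous columns and one 0/1 flag. The column names are hypothetical and do not come from the macro table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical column names.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "rate_a": rng.normal(5.0, 2.0, size=60),
    "rate_b": rng.normal(100.0, 15.0, size=60),
    "dummy_flag": rng.integers(0, 2, size=60),
})
toy_y = 0.3 * toy["rate_a"] + rng.normal(size=60)

# Scale only the continuous columns; pass the 0/1 flag through unchanged.
pre = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["rate_a", "rate_b"])],
    remainder="passthrough",
)
toy_model = Pipeline([("pre", pre), ("model", Ridge(alpha=1.0))])

# Chronological split: first 48 rows for training, last 12 for testing.
toy_model.fit(toy.iloc[:48], toy_y.iloc[:48])
toy_pred = toy_model.predict(toy.iloc[48:])
print(toy_pred.shape)
```

Because the `ColumnTransformer` sits inside the `Pipeline`, it is fit on the training rows only, just like a single scaler would be.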


<a id="build-feature-matrix"></a>
## Build feature matrix

### Goal
Choose a target and feature set from the macro quarterly table.
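
The feature names used below end in `_lag1`. This notebook does not show how those columns are built, so the following is only a sketch under the assumption that `_lag1` means "the value from the previous quarter". The column `x` and its values are made up.

```python
import pandas as pd

# Hypothetical quarterly series with made-up values.
toy_lag = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(
        ["2020-03-31", "2020-06-30", "2020-09-30", "2020-12-31", "2021-03-31"]
    ),
)

# shift(1) moves each value down one row, so row t holds the value from t-1.
toy_lag["x_lag1"] = toy_lag["x"].shift(1)

# The first row has no previous quarter, so it becomes NaN and is dropped.
toy_lag_m = toy_lag.dropna()
print(toy_lag)
```

Using lagged predictors means every feature in a row was already known before the quarter whose growth you are trying to explain.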



### Your Turn (1): Load data and pick columns


In [None]:
import pandas as pd

path = PROCESSED_DIR / 'macro_quarterly.csv'
if path.exists():
    df = pd.read_csv(path, index_col=0, parse_dates=True)
else:
    df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)

y_col = 'gdp_growth_qoq'

# TODO: Choose a feature list.
# Tip: start with lagged features to avoid timing ambiguity.
x_cols = [
    'T10Y2Y_lag1',
    'UNRATE_lag1',
    'FEDFUNDS_lag1',
    'INDPRO_lag1',
    'RSAFS_lag1',
    # TODO: add more lags/features if you want
]

df_m = df[[y_col] + x_cols].dropna().copy()
df_m.tail()



### Your Turn (2): Time split


In [None]:
from src.evaluation import time_train_test_split_index

split = time_train_test_split_index(len(df_m), test_size=0.2)
train = df_m.iloc[split.train_slice]
test = df_m.iloc[split.test_slice]

X_train = train[x_cols]
y_train = train[y_col]
X_test = test[x_cols]
y_test = test[y_col]

X_train.shape, X_test.shape



<a id="fit-ridge-lasso"></a>
## Fit ridge/lasso

### Goal
Fit ridge and lasso over a range of regularization strengths and compare out-of-sample error.
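
For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` document, written here with $n$ as the number of training rows, $X$ as the (standardized) feature matrix, and $w$ as the coefficient vector. The intercept is not penalized.

```latex
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Note that the squared-error term is scaled differently in the two objectives, so the same numeric `alpha` does not mean the same amount of shrinkage for ridge and lasso. Compare each model across its own alpha grid rather than alpha-for-alpha.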



### Your Turn (1): Fit ridge and lasso across alpha grid


In [None]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

alphas = np.logspace(-3, 2, 20)

ridge_rmse = []
lasso_rmse = []

for a in alphas:
    ridge = Pipeline([
        ('scaler', StandardScaler()),
        ('model', Ridge(alpha=float(a))),
    ])
    lasso = Pipeline([
        ('scaler', StandardScaler()),
        ('model', Lasso(alpha=float(a), max_iter=20000)),
    ])

    ridge.fit(X_train, y_train)
    lasso.fit(X_train, y_train)

    ridge_pred = ridge.predict(X_test)
    lasso_pred = lasso.predict(X_test)

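    # RMSE = sqrt(MSE). Computed explicitly because the `squared=False` argument
    # was deprecated and later removed in newer scikit-learn releases.
    ridge_rmse.append(float(np.sqrt(mean_squared_error(y_test, ridge_pred))))
    lasso_rmse.append(float(np.sqrt(mean_squared_error(y_test, lasso_pred))))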

best_ridge = float(alphas[int(np.argmin(ridge_rmse))])
best_lasso = float(alphas[int(np.argmin(lasso_rmse))])
best_ridge, best_lasso
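
One caveat about the cell above: `best_ridge` and `best_lasso` are chosen by test-set RMSE, so the winning RMSE is an optimistic estimate. A common alternative is to pick alpha with time-ordered cross-validation on the training rows and touch the test set once at the end. The sketch below shows that pattern on synthetic data. The `_demo` arrays, the grid, and the number of folds are assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered data (assumption for illustration).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)
X_tr_demo, y_tr_demo = X_demo[:96], y_demo[:96]
X_te_demo, y_te_demo = X_demo[96:], y_demo[96:]

# Each CV fold trains on earlier rows and validates on later rows.
search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("model", Ridge())]),
    param_grid={"model__alpha": np.logspace(-3, 2, 20)},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr_demo, y_tr_demo)

best_alpha_demo = search.best_params_["model__alpha"]
# The held-out test rows are used only once, after alpha is fixed.
test_rmse_demo = float(np.sqrt(np.mean((y_te_demo - search.predict(X_te_demo)) ** 2)))
print("alpha chosen on training folds:", best_alpha_demo)
print("test RMSE:", test_rmse_demo)
```

The same pattern works for `Lasso` by swapping the model step.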



### Your Turn (2): Plot RMSE vs alpha


In [None]:
import matplotlib.pyplot as plt

# TODO: Plot ridge_rmse and lasso_rmse vs alphas (log scale).
...



<a id="coefficient-paths"></a>
## Coefficient paths

### Goal
Visualize how coefficients shrink as regularization increases.

This is one of the best ways to build intuition for what ridge/lasso are doing.



### Your Turn (1): Fit models and record coefficients across alphas


In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

ridge_coefs = []
lasso_coefs = []

for a in alphas:
    ridge = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=float(a)))])
    lasso = Pipeline([('scaler', StandardScaler()), ('model', Lasso(alpha=float(a), max_iter=20000))])
    ridge.fit(X_train, y_train)
    lasso.fit(X_train, y_train)

    ridge_coefs.append(ridge.named_steps['model'].coef_)
    lasso_coefs.append(lasso.named_steps['model'].coef_)

ridge_coefs = pd.DataFrame(ridge_coefs, columns=x_cols, index=alphas)
lasso_coefs = pd.DataFrame(lasso_coefs, columns=x_cols, index=alphas)

ridge_coefs.head()
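
A quick numeric way to read a lasso path is to count the nonzero coefficients at each alpha. The sketch below does this on synthetic data where only the first two of five features matter. All names ending in `_demo` and the data itself are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 standardized features, only the first two drive y.
rng = np.random.default_rng(4)
X_path_demo = StandardScaler().fit_transform(rng.normal(size=(150, 5)))
y_path_demo = (
    1.5 * X_path_demo[:, 0]
    - 1.0 * X_path_demo[:, 1]
    + rng.normal(scale=0.5, size=150)
)

alphas_demo = np.logspace(-3, 2, 20)
rows = [
    Lasso(alpha=float(a), max_iter=20000).fit(X_path_demo, y_path_demo).coef_
    for a in alphas_demo
]
coefs_demo = pd.DataFrame(rows, index=alphas_demo, columns=[f"x{i}" for i in range(5)])

# Number of features still in the model at each alpha.
n_nonzero_demo = (coefs_demo != 0).sum(axis=1)
print(n_nonzero_demo)
```

At a large enough alpha every lasso coefficient is exactly zero. The same count on a ridge path would stay at the full number of features, since ridge shrinks coefficients without zeroing them.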



### Your Turn (2): Plot coefficient paths


In [None]:
import matplotlib.pyplot as plt

# TODO: Plot coefficient paths for ridge and lasso.
# Hint: loop over columns and plot series on same axes.
...



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
# TODO: After you build X/y and split by time, validate the split.
# Example (adjust variable names):
# assert X_train.index.max() < X_test.index.min()
# assert y_train.index.equals(X_train.index)
# assert y_test.index.equals(X_test.index)
# assert not X_train.isna().any().any()
# assert not X_test.isna().any().any()
...



## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Build feature matrix</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 05_regularization_ridge_lasso — Build feature matrix
import pandas as pd
import numpy as np

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True).dropna()
target = 'gdp_growth_qoq'
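# Exclude GDP-derived and recession-label columns from the features.
drop_cols = [c for c in df.columns if c.startswith('gdp_') or c in {'GDPC1', 'recession', 'target_recession_next_q'}]
X = df.drop(columns=drop_cols)
y = df[target]
# Chronological 80/20 split (no shuffling): earlier rows train, later rows test.
split = int(len(df) * 0.8)
X_tr, X_te = X.iloc[:split], X.iloc[split:]
y_tr, y_te = y.iloc[:split], y.iloc[split:]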
```

</details>

<details><summary>Solution: Fit ridge/lasso</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 05_regularization_ridge_lasso — Fit ridge/lasso
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

alphas = [0.01, 0.1, 1.0, 10.0]
ridge_coefs = {}
lasso_coefs = {}
for a in alphas:
    ridge_pipe = Pipeline([('scaler', StandardScaler()), ('m', Ridge(alpha=a))]).fit(X_tr, y_tr)
    lasso_pipe = Pipeline([('scaler', StandardScaler()), ('m', Lasso(alpha=a, max_iter=5000))]).fit(X_tr, y_tr)
    # Coefficients are on the standardized feature scale.
    ridge_coefs[a] = ridge_pipe.named_steps['m'].coef_
    lasso_coefs[a] = lasso_pipe.named_steps['m'].coef_
ridge_coefs.keys(), lasso_coefs.keys()
```

</details>

<details><summary>Solution: Coefficient paths</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 05_regularization_ridge_lasso — Coefficient paths
import matplotlib.pyplot as plt

# Plot a few coefficient paths (first 5 features)
feat_names = list(X.columns)
for i in range(min(5, len(feat_names))):
    plt.plot(alphas, [ridge_coefs[a][i] for a in alphas], label=f'Ridge {feat_names[i]}')
plt.xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('coefficient (standardized features)')
plt.legend()
plt.title('Ridge coefficient paths (subset)')
plt.show()
```

</details>

