In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
import statsmodels.api as sm, statsmodels.formula.api as smf
from sklearn.linear_model import LassoCV, RidgeCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
from sklearn.cluster import KMeans
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)
warnings.filterwarnings('ignore', category=FutureWarning)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

note("Environment initialized for Advanced Machine Learning.")

# An Economist's Guide to Machine Learning
## Prediction, Regularization, and Causality

---

### On this page

1.  [**The Predictive Modeling Framework**](#framework)
    - [The Bias-Variance Trade-off](#bias-variance)
    - [A Formal Derivation of the Bias-Variance Decomposition](#derivation)
    - [Cross-Validation](#cv)
2.  [**Regularization for High-Dimensional Data**](#regularization)
    - [The Objective Functions of Ridge ($L_2$) and Lasso ($L_1$)](#objectives)
    - [Geometric Intuition: Why Lasso Performs Feature Selection](#geometry)
    - [Code Example: Lasso vs. Ridge for Prediction](#code-lasso)
3.  [**Causal Machine Learning**](#causal-ml)
    - [The Challenge: Why Naive ML Fails for Causal Inference](#challenge)
    - [The Solution: Double/Debiased ML and Neyman-Orthogonality](#dml)
    - [Code Example: DML Simulation](#code-dml)
4.  [**Case Study: Forecasting the Equity Premium**](#casestudy)
5.  [**Unsupervised Learning: Discovering Latent Structure**](#unsupervised)
6.  [**Test Your Knowledge**](#exercises)
7.  [**Key Takeaways**](#summary)

<a id='intro'></a>
### Introduction: A Tale of Two Cultures

For decades, the worlds of econometrics and machine learning evolved in parallel, largely distinct traditions. Econometrics, rooted in statistical theory and economics, focused primarily on **causal inference**. The goal was to build simple, interpretable models (often linear) to test economic theories and estimate the causal effect of one variable on another, controlling for a small, carefully chosen set of confounders. The emphasis was on unbiasedness, consistency, and the statistical significance of coefficients.

Machine learning, born from computer science and artificial intelligence, pursued a different objective: **prediction**. The goal was to create complex, often non-linear "black box" models that could accurately predict an outcome, with less emphasis on the interpretability of individual parameters. The primary measure of success was not the p-value of a coefficient, but the model's performance on unseen data.

In recent years, these two cultures have begun to converge, creating a powerful new toolkit for empirical economists. This chapter introduces how the predictive power of machine learning can be harnessed to serve the inferential goals of economics. We will explore three main ways these tools are applied:

1.  **High-Dimensional Prediction:** In a world of "big data," economists are increasingly faced with situations where the number of potential predictors ($p$) is large relative to the number of observations ($n$). In these settings, classical methods like OLS break down: the estimates become highly variable, and when $p > n$ they are not even uniquely defined. ML techniques like **regularization** (Lasso, Ridge) are designed to handle this dimensionality, producing stable and accurate forecasts for tasks like predicting inflation, firm defaults, or asset returns.

2.  **Causal Inference in Complex Environments:** This is the most exciting frontier. Naively applying predictive models to causal questions is dangerous due to issues like regularization bias and confounding. However, a new field of **Causal Machine Learning** has developed methods (like **Double/Debiased ML**) that use ML's predictive power to flexibly control for high-dimensional confounders, allowing for robust estimation of causal effects in settings that were previously intractable.

3.  **Discovering Latent Patterns:** Sometimes, the goal is not to predict a specific outcome, but to discover underlying structure in the data. **Unsupervised learning** methods like clustering can be used to segment consumers into distinct groups, classify firms based on their characteristics, or identify hidden economic regimes.

This chapter provides a graduate-level introduction to these three domains, with the theoretical intuition and practical code needed to apply these modern methods to economic problems.

<a id='framework'></a>
### 1. The Predictive Modeling Framework
<a id='bias-variance'></a>
#### 1.1 The Bias-Variance Trade-off
A central challenge in building predictive models is **overfitting**. A highly flexible model might fit the training data perfectly but fail to generalize to new, unseen data because it has memorized the noise, not the underlying signal. To formalize this, we can decompose the expected Mean Squared Error (MSE) of a model at a specific point $x_0$:
$$ E[(y_0 - \hat{f}(x_0))^2] = \underbrace{(\text{Bias}[\hat{f}(x_0)])^2}_{\text{Squared Bias}} + \underbrace{\text{Var}[\hat{f}(x_0)]}_{\text{Variance}} + \underbrace{\sigma^2_{\epsilon}}_{\text{Irreducible Error}} $$
where the expectation $E$ is taken over many different training sets and over the noise in the new observation $y_0$. Let's break this down:
- **Bias:** The bias of an estimator $\hat{f}$ is the difference between its average prediction and the true value, $f(x_0)$. A model with high bias (e.g., linear regression for a non-linear problem) makes strong assumptions and is systematically wrong. Flexible models have low bias.
$$ \text{Bias}[\hat{f}(x_0)] = E[\hat{f}(x_0)] - f(x_0) $$
- **Variance:** The variance of an estimator measures how much our estimate $\hat{f}$ would change if we trained it on a different random dataset. A model with high variance (e.g., a deep decision tree) is highly sensitive to the specific training data it sees. Flexible models have high variance.
$$ \text{Var}[\hat{f}(x_0)] = E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] $$
- **Irreducible Error:** This is the variance $\sigma^2_\epsilon$ of the noise term $\epsilon$ in the true relationship; no model can eliminate it.

The **trade-off** is the key insight: as we increase model flexibility (e.g., increase the degree of a polynomial), bias typically falls while variance rises. The goal of good predictive modeling is to find the sweet spot of flexibility that minimizes the total error on unseen data.

The plot below shows how, as model complexity (in this case, the degree of a polynomial regression) increases, squared bias falls while variance rises. The total test error follows a U-shape, first decreasing as the model becomes flexible enough to capture the signal, and then increasing as it starts to overfit the noise in the training data. The optimal model balances this trade-off, minimizing the total error.

![Bias-Variance Tradeoff](../images/png/figure1_bias_variance_tradeoff.png)
*Figure 1: The Bias-Variance Trade-off. As model complexity increases, bias falls and variance rises. The optimal model finds the sweet spot that minimizes the total error on unseen data.*
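
A small simulation can make the U-shape concrete. The sketch below is illustrative only: the sine-wave data-generating process, the noise level, the sample sizes, and the three polynomial degrees are assumptions chosen for this example and are not taken from the figure above.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed "true" signal for this illustration
    return np.sin(2 * np.pi * x)

def simulate(n):
    # Draw n observations of y = f(x) + noise
    x = rng.uniform(0, 1, n)
    return x, true_f(x) + rng.normal(0, 0.3, n)

x_train, y_train = simulate(30)     # small training sample
x_test, y_test = simulate(2000)     # large sample standing in for "unseen data"

results = {}
for degree in [1, 3, 12]:
    coefs = P.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((y_train - P.polyval(x_train, coefs)) ** 2)
    test_mse = np.mean((y_test - P.polyval(x_test, coefs)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only fall as the degree rises, because each polynomial nests the previous one. Test error typically falls from the underfit linear model to the cubic and then rises again for the degree-12 fit, although the exact numbers depend on the random draw.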

<a id='derivation'></a>
#### 1.2 A Formal Derivation of the Bias-Variance Decomposition
For the interested reader, we can derive the decomposition. Let the true model be $y = f(x) + \epsilon$, where $E[\epsilon]=0$ and $Var(\epsilon)=\sigma^2_\epsilon$. We want to decompose the expected prediction error of our estimator $\hat{f}(x)$ at a point $x_0$. The expected squared error is $E[(y_0 - \hat{f}(x_0))^2]$.

Let's expand the term inside the expectation:
$$ (y_0 - \hat{f}(x_0))^2 = (f(x_0) + \epsilon_0 - \hat{f}(x_0))^2 = (f(x_0) - \hat{f}(x_0) + \epsilon_0)^2 $$
$$ = (f(x_0) - \hat{f}(x_0))^2 + 2\epsilon_0(f(x_0) - \hat{f}(x_0)) + \epsilon_0^2 $$

Now, take the expectation over repeated training sets and over the test-point noise $\epsilon_0$. The estimator $\hat{f}$ is random (it depends on the training data), and so is $\epsilon_0$, which is independent of the training data; only $f(x_0)$ is fixed.
$$ E[(y_0 - \hat{f}(x_0))^2] = E[(f(x_0) - \hat{f}(x_0))^2] + 2E[\epsilon_0(f(x_0) - \hat{f}(x_0))] + E[\epsilon_0^2] $$

Let's analyze each term:
1.  $E[\epsilon_0^2] = \sigma^2_\epsilon$, the irreducible error.
2.  The middle term: $2E[\epsilon_0(f(x_0) - \hat{f}(x_0))] = 2(f(x_0)E[\epsilon_0] - E[\epsilon_0 \hat{f}(x_0)])$. Since $\epsilon_0$ is independent of the training data used to form $\hat{f}$, $E[\epsilon_0 \hat{f}(x_0)] = E[\epsilon_0]E[\hat{f}(x_0)] = 0$. So the entire middle term is zero.
3.  The first term: $E[(f(x_0) - \hat{f}(x_0))^2]$. Let's add and subtract $E[\hat{f}(x_0)]$ inside the square:
    $$ E[(f(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - \hat{f}(x_0))^2] $$
    Let $A = f(x_0) - E[\hat{f}(x_0)]$ (the negative of the bias, so $A^2$ is the squared bias) and $B = E[\hat{f}(x_0)] - \hat{f}(x_0)$. The term is $E[(A+B)^2] = E[A^2 + 2AB + B^2] = E[A^2] + 2E[AB] + E[B^2]$.
    - $E[A^2] = A^2 = (f(x_0) - E[\hat{f}(x_0)])^2 = (\text{Bias}[\hat{f}(x_0)])^2$, since A is a constant with respect to the expectation.
    - $E[B^2] = E[(E[\hat{f}(x_0)] - \hat{f}(x_0))^2] = \text{Var}[\hat{f}(x_0)]$, by definition of variance.
    - $2E[AB] = 2A \cdot E[B] = 2A \cdot E[E[\hat{f}(x_0)] - \hat{f}(x_0)] = 2A \cdot (E[\hat{f}(x_0)] - E[\hat{f}(x_0)]) = 0$.
    So, the first term equals $(\text{Bias}[\hat{f}(x_0)])^2 + \text{Var}[\hat{f}(x_0)]$.

Putting it all together gives the final decomposition:
$$ E[(y_0 - \hat{f}(x_0))^2] = (\text{Bias}[\hat{f}(x_0)])^2 + \text{Var}[\hat{f}(x_0)] + \sigma^2_{\epsilon} $$
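
The identity can also be checked by Monte Carlo. The sketch below assumes a quadratic truth $f(x) = x^2$, a deliberately misspecified linear estimator, and arbitrary values for $\sigma_\epsilon$, $x_0$, and the sample sizes; these are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5            # assumed noise standard deviation
x0 = 0.8               # evaluation point
n_train, n_reps = 20, 5000

def f(x):
    # Assumed true regression function
    return x ** 2

preds = np.empty(n_reps)
sq_err = np.empty(n_reps)
for r in range(n_reps):
    # Fresh training set for every replication
    x = rng.uniform(-1, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, 1)      # linear fit: biased for a quadratic truth
    preds[r] = intercept + slope * x0
    y0 = f(x0) + rng.normal(0, sigma)           # fresh test observation at x0
    sq_err[r] = (y0 - preds[r]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
lhs = sq_err.mean()                              # expected squared prediction error
rhs = bias_sq + variance + sigma ** 2            # bias^2 + variance + irreducible error
print(f"E[(y0 - fhat)^2] = {lhs:.4f},  bias^2 + var + sigma^2 = {rhs:.4f}")
```

The two sides agree up to simulation noise, which shrinks as `n_reps` grows.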

<a id='cv'></a>
#### 1.3 Cross-Validation
We cannot use the training error to select the best model, because it keeps falling as flexibility increases even after the model has started fitting noise. A robust method for estimating out-of-sample performance is **k-fold cross-validation**. The training data is split into *k* folds of roughly equal size. The model is trained *k* times, each time holding out one fold for validation. The average of the *k* validation scores provides a stable estimate of the model's performance and is the standard technique for **hyperparameter tuning**.
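
The procedure can be written out by hand in a few lines. The sketch below tunes a Ridge penalty on simulated data; the data-generating process and the grid of candidate penalties are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                                   # only five features matter (assumed)
y = X @ beta + rng.normal(size=n)

alphas = [0.01, 1.0, 10.0, 100.0, 1000.0]        # illustrative penalty grid
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for alpha in alphas:
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        # Fit on k-1 folds, score on the held-out fold
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        resid = y[val_idx] - model.predict(X[val_idx])
        fold_errors.append(np.mean(resid ** 2))
    cv_mse[alpha] = np.mean(fold_errors)         # average validation MSE across folds

best_alpha = min(cv_mse, key=cv_mse.get)
print({a: round(m, 3) for a, m in cv_mse.items()}, "best:", best_alpha)
```

Helpers such as `RidgeCV` and `LassoCV` wrap this same loop.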

<a id='regularization'></a>
## 2. Regularization for High-Dimensional Data

When the number of features $p$ is large relative to the number of observations $n$, OLS estimates are known to have very high variance, leading to poor predictive performance. **Regularization** is a technique for combatting this by adding a penalty term to the loss function, which constrains the size of the model's coefficients.

<a id='objectives'></a>
### The Objective Functions of Ridge ($L_2$) and Lasso ($L_1$)

Two of the most popular regularization techniques are Ridge and Lasso regression. They modify the standard OLS objective function (minimizing the Residual Sum of Squares, or RSS) by adding a penalty on the size of the coefficient vector $\beta$.

- **Ridge Regression ($L_2$ Penalty):** The Ridge objective function is:
$$ \hat{\beta}_{Ridge} = \arg\min_{\beta} \left( \sum_{i=1}^n (y_i - \mathbf{x}_i'\beta)^2 + \alpha \sum_{j=1}^p \beta_j^2 \right) = \arg\min_{\beta} (||\mathbf{y} - \mathbf{X}\beta||_2^2 + \alpha ||\beta||_2^2) $$
The $L_2$ penalty ($\sum \beta_j^2$) shrinks coefficients towards zero, but because its gradient $2\alpha\beta_j$ vanishes as $\beta_j$ approaches zero, it almost never sets them *exactly* to zero.

- **Lasso ($L_1$ Penalty):** The Lasso (Least Absolute Shrinkage and Selection Operator) objective function is:
$$ \hat{\beta}_{Lasso} = \arg\min_{\beta} \left( \sum_{i=1}^n (y_i - \mathbf{x}_i'\beta)^2 + \alpha \sum_{j=1}^p |\beta_j| \right) = \arg\min_{\beta} (||\mathbf{y} - \mathbf{X}\beta||_2^2 + \alpha ||\beta||_1) $$
The $L_1$ penalty ($\sum |\beta_j|$) is crucial: because the absolute value function has a "sharp corner" at zero, the optimization is very likely to find solutions where some coefficients are *exactly* zero. This means Lasso performs automatic **feature selection**, making it extremely useful in high-dimensional settings.

The parameter $\alpha$ is a **hyperparameter** that controls the strength of the penalty. It is chosen via cross-validation.
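
When the columns of $\mathbf{X}$ are orthonormal ($\mathbf{X}'\mathbf{X} = I$), both objectives above have closed-form solutions in terms of the OLS coefficients, which makes the contrast easy to see. The sketch below assumes that special case; the helper names and the example coefficient vector are hypothetical. scikit-learn's `Lasso` divides the RSS by $2n$, so its `alpha` is on a different scale from the $\alpha$ used here.

```python
import numpy as np

def ridge_orthonormal(b_ols, alpha):
    # Minimizes sum_j (beta_j - b_j)^2 + alpha * beta_j^2: proportional shrinkage
    return b_ols / (1.0 + alpha)

def lasso_orthonormal(b_ols, alpha):
    # Minimizes sum_j (beta_j - b_j)^2 + alpha * |beta_j|: soft-thresholding at alpha / 2
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - alpha / 2.0, 0.0)

b_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical OLS coefficients
alpha = 1.0

ridge = ridge_orthonormal(b_ols, alpha)
lasso = lasso_orthonormal(b_ols, alpha)
print("Ridge:", ridge)
print("Lasso:", lasso)
print("Non-zero Ridge coefficients:", np.count_nonzero(ridge))
print("Non-zero Lasso coefficients:", np.count_nonzero(lasso))
```

Ridge scales every coefficient by the same factor and leaves all four non-zero. Lasso subtracts a fixed amount and sets the two small coefficients to exactly zero.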

<a id='geometry'></a>
### Geometric Intuition: Why Lasso Performs Feature Selection

The difference between the $L_1$ and $L_2$ penalties can be understood geometrically. The regularization problem is equivalent to minimizing the OLS loss function subject to a constraint on the size of the coefficients: $\sum \beta_j^2 \le c$ for Ridge and $\sum |\beta_j| \le c$ for Lasso.

- The **Ridge constraint** ($L_2$) corresponds to a **circle** (in two dimensions).
- The **Lasso constraint** ($L_1$) corresponds to a **diamond**.

The elliptical contours in the diagram below represent the OLS loss function, with the unconstrained OLS solution at the center. The regularized solution is the point where these contours first touch the constraint region (the red circle for Ridge, the blue diamond for Lasso). Because the diamond has sharp, non-differentiable corners that lie on the axes, the first point of contact is often a corner, which sets one of the coefficients to exactly zero. The circle has no corners, so the Ridge contact point generally leaves both coefficients non-zero.

![Lasso vs. Ridge Geometry](../images/png/figure2_lasso_ridge_geometry.png)
*Figure 2: Geometric interpretation of Lasso (L1) and Ridge (L2) regularization. The sharp corners of the L1 constraint region make it likely that the solution will be on an axis, leading to sparsity.*
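
The same picture can be reproduced numerically by brute force in two dimensions. The sketch below assumes circular loss contours (a special case of the ellipses in the figure), a hypothetical unconstrained solution at $(2, 0.5)$, and a budget $c = 1$. It searches the boundary of each constraint region for the point with the lowest loss.

```python
import numpy as np

b_ols = np.array([2.0, 0.5])    # hypothetical unconstrained OLS solution
c = 1.0                         # constraint budget (assumed)

# Unit directions covering the full circle
theta = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
directions = np.column_stack([np.cos(theta), np.sin(theta)])

# Boundary of the L2 ball (circle) and of the L1 ball (diamond)
l2_boundary = c * directions
l1_boundary = c * directions / np.abs(directions).sum(axis=1, keepdims=True)

def loss(beta):
    # Squared distance to the OLS solution, i.e. circular loss contours
    return ((beta - b_ols) ** 2).sum(axis=1)

ridge_solution = l2_boundary[np.argmin(loss(l2_boundary))]
lasso_solution = l1_boundary[np.argmin(loss(l1_boundary))]
print("Ridge-constrained solution:", ridge_solution.round(3))
print("Lasso-constrained solution:", lasso_solution.round(3))
```

In this configuration the diamond is hit at its corner on the $\beta_1$ axis, so $\beta_2$ is exactly zero. The circle is hit at a point where both coordinates are non-zero.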

<a id='code-lasso'></a>

In [None]:
sec("Lasso vs. Ridge vs. OLS for High-Dimensional Prediction")

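# Load the California Housing dataset (copy so the loaded frame is not mutated in place)
housing = fetch_california_housing(as_frame=True)
X, y = housing.data.copy(), housing.target

# Add many noise variables to make the problem high-dimensional
note(f"Original number of features: {X.shape[1]}. We will add 90 irrelevant noise features.")
rng = np.random.default_rng(42)
# Build all noise columns at once; inserting them one by one fragments the DataFrame
noise = pd.DataFrame(rng.standard_normal((len(X), 90)),
                     columns=[f'noise_{i}' for i in range(90)], index=X.index)
X = pd.concat([X, noise], axis=1)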

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features based on the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train and Evaluate Models ---
models = {}
models['OLS'] = LinearRegression().fit(X_train_scaled, y_train)
models['Ridge'] = RidgeCV(alphas=np.logspace(-2, 4, 100), cv=5).fit(X_train_scaled, y_train)
models['Lasso'] = LassoCV(cv=5, random_state=42, max_iter=5000).fit(X_train_scaled, y_train)

test_mse = {}
for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    test_mse[name] = mean_squared_error(y_test, y_pred)

note("Comparing Test Set Mean Squared Error (MSE) for the three models:")
for name, mse in test_mse.items():
    print(f"- {name}: {mse:.4f}")

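note(r"With 98 features (8 real, 90 pure noise), OLS spends degrees of freedom fitting noise, while Ridge and Lasso shrink those coefficients and Lasso sets many of them to exactly zero. Because the training set here (about 14,000 rows) is large relative to $p \approx 100$, the gap in test MSE may be small; it widens as $p$ approaches $n$. Lasso also provides a more parsimonious model.")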

# --- Visualize Lasso Coefficients ---
coefs = pd.Series(models['Lasso'].coef_, index=X.columns)
n_selected = np.sum(coefs != 0)
print(f"\nLasso selected {n_selected} out of {X.shape[1]} features.")

fig, ax = plt.subplots(figsize=(10, 8))
coefs[coefs != 0].sort_values().plot(kind='barh', ax=ax)
ax.set_title('Coefficients Selected by Lasso')
plt.show()

<a id='causal-ml'></a>
## 3. Causal Machine Learning

Perhaps the most exciting frontier is the application of ML to causal inference. The goal is to estimate the causal effect of a treatment $D$ on an outcome $Y$ while controlling for a high-dimensional set of confounders $X$. 

<a id='challenge'></a>
### The Challenge: Why Naive ML Fails for Causal Inference

One might naively try to estimate the treatment effect by running a regression of $Y$ on $D$ and a flexible, non-linear transformation of $X$ produced by an ML model (e.g., a Random Forest). This approach fails for two main reasons:

1.  **Regularization Bias:** ML predictors trade bias for lower variance. Lasso shrinks the coefficients on the control variables $X$ towards zero, and tree ensembles such as Random Forests smooth over the confounding relationship by averaging within leaves. Either way, part of the effect of $X$ is left unexplained, and because $D$ is correlated with $X$, that leftover "spills over" into the estimated effect of $D$, biasing the causal estimate.
2.  **Overfitting Bias:** Flexible learners also fit noise in the sample they are trained on. If the same observations are used both to fit the ML model and to estimate the treatment effect, the model's estimation errors are correlated with the noise in those observations. This biases the estimate even when the model class is flexible enough to capture the true relationship between $X$ and $Y$. The DML approach is designed to be robust to both issues.

<a id='dml'></a>
### The Solution: Double/Debiased ML and Neyman-Orthogonality

**Double/Debiased Machine Learning (DML)**, developed by Chernozhukov et al. (2018), provides a robust solution. The key insight is to construct an estimate based on a **Neyman-orthogonal moment condition**. In simple terms, this means structuring the problem so that the final causal estimate is insensitive to small errors in the first-stage "nuisance" models (the models used to predict $Y$ from $X$ and $D$ from $X$).

The DML process achieves this via a non-parametric version of the Frisch-Waugh-Lovell theorem:

1.  **Model the Nuisance Parameters:** Use a flexible ML model (e.g., Random Forest) to get an estimate of the conditional mean of the outcome, $E[Y|X] = m_Y(X)$, and the conditional mean of the treatment, $E[D|X] = m_D(X)$.
2.  **Partial Out Confounders:** Calculate the residuals based on these models: 
    - $\tilde{Y} = Y - \hat{m}_Y(X)$
    - $\tilde{D} = D - \hat{m}_D(X)$
3.  **Final Stage Regression:** Run a simple OLS regression of the outcome residual on the treatment residual:
    $$ \tilde{Y} = \theta \tilde{D} + \epsilon $$

The resulting estimate of $\theta$ is a robust, "debiased" estimate of the causal treatment effect. By partialling out the effect of $X$ from *both* $Y$ and $D$, the first-order impact of small errors in the nuisance models cancels out, so $\hat{\theta}$ can remain $\sqrt{n}$-consistent even when the nuisance models themselves converge more slowly. To prevent bias from overfitting in the first stage, this procedure is combined with **cross-fitting** (or sample splitting), where the nuisance models are trained on a different subset of the data than the one used to calculate the residuals and run the final regression.
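
In the purely linear case, the partialling-out logic is exactly the Frisch-Waugh-Lovell theorem, which can be checked numerically. The sketch below uses simulated data (all coefficients and sample sizes are assumptions for illustration) and plain least squares in place of an ML learner. It demonstrates the residual-on-residual identity, not cross-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # controls, with intercept
d = X @ rng.normal(size=p + 1) + rng.normal(size=n)          # treatment depends on X
y = 0.5 * d + X @ rng.normal(size=p + 1) + rng.normal(size=n)

def ols(A, b):
    # Least-squares coefficients of b on the columns of A
    return np.linalg.lstsq(A, b, rcond=None)[0]

# (a) Full regression of y on [d, X]; keep the coefficient on d
theta_full = ols(np.column_stack([d, X]), y)[0]

# (b) Partial X out of both y and d, then regress residual on residual
y_tilde = y - X @ ols(X, y)
d_tilde = d - X @ ols(X, d)
theta_fwl = (d_tilde @ y_tilde) / (d_tilde @ d_tilde)

print(f"full regression: {theta_full:.6f}, residual-on-residual: {theta_fwl:.6f}")
```

The two estimates coincide up to floating-point error. DML keeps step (b) but replaces the two linear projections with flexible ML predictions of $E[Y|X]$ and $E[D|X]$.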

In [None]:
sec("Causal ML vs. Naive OLS: A Simulation")

def run_dml_simulation(n=2000, p=20):
    rng = np.random.default_rng(123)
    X = rng.normal(size=(n, p))
    true_causal_effect = 0.5
    # Complex, non-linear relationships for D and Y
    d = 2 * np.sin(X[:, 0]) + np.cos(X[:, 1]) + rng.standard_normal(n)
    y = true_causal_effect * d + 2 * X[:, 0]**2 + np.exp(X[:, 1]/2) + rng.standard_normal(n)
    
    # Naive OLS (biased due to non-linear confounding)
    naive_ols_estimate = sm.OLS(y, sm.add_constant(np.c_[d, X])).fit().params[1]

    # Double/Debiased ML with 2-fold cross-fitting
    kf = KFold(n_splits=2, shuffle=True, random_state=42)
    y_tilde, d_tilde = np.array([]), np.array([])
    for train_idx, test_idx in kf.split(X):
        X_train, y_train, d_train = X[train_idx], y[train_idx], d[train_idx]
        X_test, y_test, d_test = X[test_idx], y[test_idx], d[test_idx]
        
        # Step 1: Partial out X from Y
        model_y = RandomForestRegressor(random_state=42).fit(X_train, y_train)
        y_tilde = np.append(y_tilde, y_test - model_y.predict(X_test))
        
        # Step 2: Partial out X from D
        model_d = RandomForestRegressor(random_state=42).fit(X_train, d_train)
        d_tilde = np.append(d_tilde, d_test - model_d.predict(X_test))
        
    # Step 3: Final Stage Regression
    dml_estimate = sm.OLS(y_tilde, d_tilde).fit().params[0]
    
    note(f"True Causal Effect: {true_causal_effect:.4f}")
    note(f"Naive OLS Estimate: {naive_ols_estimate:.4f} (linear controls miss the non-linear confounding)")
    note(f"DML with Cross-Fitting: {dml_estimate:.4f} (cross-fitted residual-on-residual estimate)")

run_dml_simulation()

<a id='casestudy'></a>
## 4. Case Study: Forecasting the Equity Premium

A classic, notoriously difficult problem in financial economics is forecasting the **equity premium**—the expected return on the stock market in excess of the risk-free rate. While the historical average has been positive, there is a long-standing debate about whether this premium is predictable using past information.

This provides an excellent case study for applying ML to a real-world economic prediction problem. We will use the well-known Goyal-Welch (2008) dataset, which contains historical data on common financial predictors. Our goal is to see if ML models can outperform a simple historical mean benchmark in forecasting the equity premium out-of-sample.

A crucial detail for any time-series forecasting task is the validation procedure. Standard k-fold cross-validation is invalid because it uses future data to train a model that predicts the past (a phenomenon known as **lookahead bias**). Instead, we must use a procedure that respects the temporal ordering of the data. A common approach is an **expanding window forecast**, where we only use data up to time $t$ to train a model to predict the outcome at $t+1$.
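
scikit-learn's `TimeSeriesSplit` generates exactly this kind of split. The sketch below uses a toy series of 12 periods (an assumption for illustration) to show that every training window ends before its test observation and grows by one period each step.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs = 12                                        # toy series length (assumed)
tscv = TimeSeriesSplit(n_splits=4, test_size=1)   # one-step-ahead forecasts

splits = list(tscv.split(np.arange(n_obs)))
for train_idx, test_idx in splits:
    # Training data always precedes the test observation
    print(f"train: periods 0-{train_idx.max()}  ->  test: period {test_idx[0]}")
```

Each model sees only information available at the time of the forecast, which is what rules out lookahead bias.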

In [None]:
sec("Case Study: Forecasting the US Equity Premium")

# 1. Load and prepare the Goyal-Welch dataset
try:
    # URL for the dataset from Amit Goyal's website
    url = 'http://www.hec.unil.ch/agoyal/docs/PredictorData2021.xlsx'
    # The data is in the 'Monthly' sheet
    gw_df = pd.read_excel(url, sheet_name='Monthly', index_col=0)
    gw_df.index.name = 'Date'
    
    # --- Data Cleaning and Feature Engineering ---
    # Calculate equity premium (log return on S&P 500 minus log risk-free rate)
    gw_df['Rfree'] = np.log(gw_df['Rfree'] + 1)
    gw_df['CRSP_SPvw'] = np.log(gw_df['CRSP_SPvw'] + 1)
    gw_df['EqP'] = gw_df['CRSP_SPvw'] - gw_df['Rfree']
    
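    # Build the valuation ratios. Assumption: the raw sheet stores the index level ('Index'),
    # 12-month dividends ('D12'), 12-month earnings ('E12'), and book-to-market ('b/m')
    # rather than ready-made 'dp', 'ep', and 'bm' columns.
    gw_df['dp'] = np.log(gw_df['D12']) - np.log(gw_df['Index'])
    gw_df['ep'] = np.log(gw_df['E12']) - np.log(gw_df['Index'])
    gw_df['bm'] = gw_df['b/m']

    # Create predictors (lagged by one period)
    predictors = ['dp', 'ep', 'bm', 'ntis', 'tbl', 'svar']
    X = gw_df[predictors].shift(1)
    y = gw_df['EqP']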
    
    # Align data by dropping NaNs
    full_data = pd.concat([y, X], axis=1).dropna()
    y = full_data['EqP']
    X = full_data.drop('EqP', axis=1)
    
    # 2. Expanding Window Forecast
    note("Running an expanding window forecast to compare models. This is a long process...")
    n_total = len(X)
    train_size = int(n_total * 0.3) # Start with first 30% of data
    
    y_preds = { 'Hist. Mean': [], 'OLS': [], 'Random Forest': [] }
    y_true_list = []

    for t in range(train_size, n_total - 1):
        X_train, y_train = X.iloc[:t], y.iloc[:t]
        # Double brackets keep X_test a one-row DataFrame; y.iloc[t] is already a scalar
        X_test, y_test = X.iloc[[t]], y.iloc[t]
        y_true_list.append(y_test)
        
        # Models
        y_preds['Hist. Mean'].append(y_train.mean())
        
        ols = LinearRegression().fit(X_train, y_train)
        y_preds['OLS'].append(ols.predict(X_test)[0])
        
        rf = RandomForestRegressor(n_estimators=100, random_state=42, min_samples_leaf=10).fit(X_train, y_train)
        y_preds['Random Forest'].append(rf.predict(X_test)[0])
        
    # 3. Evaluate Performance
    y_true_arr = np.array(y_true_list)
    mse_benchmark = mean_squared_error(y_true_arr, y_preds['Hist. Mean'])
    
    print("Out-of-Sample R-squared (vs. Historical Mean Benchmark):\n")
    for name, preds in y_preds.items():
        if name == 'Hist. Mean': continue
        mse_model = mean_squared_error(y_true_arr, preds)
        oos_r2 = 1 - (mse_model / mse_benchmark)
        print(f"- {name}: {oos_r2:.4f}")
        
    note("The Out-of-Sample R-squared measures the proportional improvement in MSE over the simple historical mean. A negative value means the model performed *worse* than the benchmark. In this literature, predictive models often struggle to consistently beat the simple benchmark, which is consistent with a low signal-to-noise ratio in returns.")

except Exception as e:
    note(f"Could not download or process the Goyal-Welch dataset. Skipping case study. Error: {e}")

<a id='unsupervised'></a>
## 5. Unsupervised Learning: Discovering Latent Structure

The final category of ML tools we consider is **unsupervised learning**, which is used when we do not have a specific outcome variable $y$ to predict. Instead, the goal is to learn directly from the features $X$ to discover hidden patterns or latent structure in the data. These methods are invaluable for exploratory data analysis and can be used to generate new insights or create features for subsequent supervised learning tasks.

Common applications in economics include:
- **Clustering:** Grouping observations into distinct clusters based on their similarity. For example, segmenting consumers based on purchasing behavior, classifying firms into strategic groups, or identifying different states of the business cycle.
- **Dimensionality Reduction:** Reducing a large number of correlated features into a smaller set of uncorrelated principal components. This is often used to visualize high-dimensional data or as a pre-processing step to combat the curse of dimensionality.

We will focus on the most well-known clustering algorithm, **K-Means**.
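
K-Means alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The NumPy sketch below implements that loop (Lloyd's algorithm) on two simulated blobs. The helper name `kmeans_lloyd`, the random initialization, and the toy data are assumptions for illustration; scikit-learn's `KMeans` adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                                  # assignments have stabilized
        centers = new_centers
    return labels, centers

# Two well-separated toy clusters
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])

labels, centers = kmeans_lloyd(X_toy, k=2)
print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", centers.round(2))
```

Because distances drive the assignment step, features on very different scales should be standardized first.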

In [None]:
sec("Unsupervised Learning: Clustering Countries by Development Indicators")
# Load World Development Indicators data
try:
    wdi_df = pd.read_csv('../data/indicators.csv')
    # Create a smaller, more manageable dataset for clustering
    df_2014 = wdi_df[wdi_df['Year'] == 2014]
    indicators = {
        'SP.DYN.LE00.IN': 'LifeExpectancy',
        'NY.GDP.PCAP.KD': 'GDP_per_Capita',
        'SE.PRM.ENRR': 'PrimarySchoolEnrollment'
    }
    df_cluster = df_2014[df_2014['Indicator Code'].isin(indicators.keys())]
    # Pivot on the indicator code, then rename the columns to the short labels defined above
    df_pivot = (df_cluster
                .pivot_table(index='Country Name', columns='Indicator Code', values='Value')
                .rename(columns=indicators)
                .dropna())
    df_pivot['GDP_per_Capita_log'] = np.log(df_pivot['GDP_per_Capita'])
    df_final = df_pivot[['GDP_per_Capita_log', 'LifeExpectancy', 'PrimarySchoolEnrollment']].copy()
    
    # Standardize data before clustering
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_final)
    
    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    df_final['Cluster'] = kmeans.fit_predict(X_scaled)
    
    # Plot the results
    plt.figure(figsize=(12, 8))
    sns.scatterplot(data=df_final, x='GDP_per_Capita_log', y='LifeExpectancy', hue='Cluster', palette='viridis', s=100, alpha=0.8)
    plt.title('Clustering Countries by Development Indicators (2014)')
    plt.xlabel('Log GDP per Capita'); plt.ylabel('Life Expectancy')
    plt.show()
    note("K-Means identifies distinct clusters corresponding to different tiers of economic development.")
except Exception as e:
    note(f"Could not load or process WDI dataset. Skipping example. Error: {e}")

<a id='exercises'></a>
## 6. Test Your Knowledge

1.  **Bias-Variance Trade-off:** Explain why a simple linear regression typically has high bias but low variance, while a very deep decision tree has low bias but high variance.

2.  **Regularization:** In the Lasso regression example, how would you expect the number of selected features to change if you manually set the penalty parameter `alpha` to be very large? What if you set it to be very small? Modify the code to verify your intuition.

3.  **DML for Causal Inference:** The key to DML is the "partialling out" process. Why is it important to use cross-fitting (i.e., train the nuisance models on one part of the data and make predictions on another)? What problem does this solve?

4.  **Interpreting Clusters:** In the K-Means clustering example, calculate the mean of the original (non-standardized) indicators for each of the 4 clusters. What are the defining characteristics of each cluster? Do they correspond to intuitive categories of countries?

5.  **Causal Forests:** The DML procedure estimates the average treatment effect. Causal Forests, developed by Wager and Athey (building on the causal trees of Athey and Imbens), aim to estimate heterogeneous treatment effects. In your own words, what is the key difference in the objective of these two methods?

<a id='summary'></a>
## 7. Key Takeaways

This chapter provided an introduction to the intersection of machine learning and economics, highlighting how ML tools can be used for prediction, causal inference, and pattern discovery.

**Key Concepts**:
- **The Bias-Variance Trade-off**: This is the central challenge in predictive modeling. Simple models tend to have high bias and low variance, while complex models have low bias and high variance. The goal is to find a model that optimally balances this trade-off to minimize error on unseen data.
- **Cross-Validation**: This is the standard technique for estimating a model's out-of-sample performance and for tuning hyperparameters (like the penalty parameter in Lasso/Ridge) without data leakage.
- **Regularization**: Methods like Lasso ($L_1$) and Ridge ($L_2$) are essential for building predictive models in high-dimensional settings. They add a penalty term to the loss function to shrink coefficients, with Lasso being able to perform automatic feature selection by forcing some coefficients to exactly zero.
- **Causal Machine Learning**: Naively using ML for causal inference is problematic. Double/Debiased Machine Learning (DML) provides a robust framework by using ML to partial out the effect of confounders from both the treatment and the outcome variable, combined with cross-fitting to avoid overfitting bias. This allows for the estimation of causal effects in complex, non-linear environments.
- **Unsupervised Learning**: Techniques like K-Means clustering can reveal hidden structures in data without a pre-defined outcome variable, useful for tasks like market segmentation or classifying firms.

### Solutions to Exercises

---

**1. Bias-Variance Trade-off:**
- **Linear Regression:** It assumes the true relationship is linear. If the true relationship is non-linear, the model will be systematically wrong on average, leading to **high bias**. However, since the model is simple, it will not change much if it sees a different training dataset, leading to **low variance**.
- **Deep Decision Tree:** A very deep tree can, in principle, approximate any function by creating many small, specific splits. It can therefore capture complex non-linearities, leading to **low bias**. However, it is highly sensitive to the specific training data. A small change in the data can lead to a completely different set of splits, meaning the predictions will vary wildly from one training set to another, leading to **high variance**.

---

**2. Regularization:**
- **Large `alpha`**: A very large penalty will make the cost of having non-zero coefficients extremely high. To minimize the objective function, Lasso will be forced to shrink almost all coefficients to exactly zero. The number of selected features will be very small, possibly zero (if alpha is large enough).
- **Small `alpha`**: A very small (or zero) penalty will make the regularization term negligible. The Lasso objective function becomes almost identical to the OLS objective function. Therefore, it will select most or all of the features, and the coefficient values will be very close to the OLS estimates.
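
A quick way to check this intuition is to refit Lasso at a few fixed penalties and count the non-zero coefficients. The sketch below uses simulated data rather than the housing example, and the three `alpha` values are arbitrary illustrative choices on scikit-learn's penalty scale.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]                  # only three features matter (assumed)
y = X @ beta + rng.normal(size=n)

n_selected = {}
for alpha in [1e-4, 0.1, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_selected[alpha] = int(np.sum(model.coef_ != 0))   # count surviving features
    print(f"alpha = {alpha:<7g} -> {n_selected[alpha]} of {p} features selected")
```

With the tiny penalty nearly every feature survives, and with the largest penalty every coefficient is set to zero.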

---

**3. DML and Cross-fitting:**
Cross-fitting (or sample splitting) is crucial to avoid a specific form of overfitting bias. In Step 1, we use an ML model to estimate $E[Y|X]$. If we used the *same data* to construct the residual $\tilde{Y}$ as we did to train the model, our estimate of the residual would be artificially small (due to in-sample overfitting). This mechanical correlation would bias the final-stage regression. By training the nuisance models on one 'fold' of the data and constructing the residuals on another 'fold' (the one that wasn't used for training), we ensure that the residuals are constructed using an 'out-of-sample' prediction. This breaks the mechanical correlation and removes the overfitting bias, which is essential for the debiasing theory to work.

---

**4. Interpreting Clusters:**
You would run `df_final.groupby('Cluster').mean()`. K-Means assigns cluster numbers arbitrarily, so the numbering below is illustrative; the results would likely show four tiers:
- **Cluster 0 (e.g., High-Income):** High GDP per capita, high life expectancy, high school enrollment.
- **Cluster 1 (e.g., Upper-Middle Income):** Medium-high GDP and life expectancy, high school enrollment.
- **Cluster 2 (e.g., Lower-Middle Income):** Lower GDP and life expectancy, medium-high school enrollment.
- **Cluster 3 (e.g., Low-Income):** Very low GDP and life expectancy, and lower/more variable school enrollment.
The clusters map well to common-sense development tiers.

---

**5. Causal Forests vs. DML:**
The key difference is the target of estimation.
- **DML** is designed to estimate the **Average Treatment Effect (ATE)**, which is a single, average causal effect for the entire population (e.g., "what is the average effect of the treatment on everyone?").
- **Causal Forests** are designed to estimate **Conditional Average Treatment Effects (CATEs)**, or heterogeneous treatment effects. The goal is to estimate how the causal effect itself varies with an individual's characteristics, $CATE(x) = E[Y(1) - Y(0) | X=x]$. It seeks to answer the question: "For which types of individuals is the treatment most or least effective?"