In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import RidgeCV, LassoCV
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 6), 'figure.dpi': 130,
                     'axes.titlesize': 'x-large', 'axes.labelsize': 'large',
                     'xtick.labelsize': 'medium', 'ytick.labelsize': 'medium'})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs):
    display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note("Environment initialized for Advanced OLS analysis.")

# Part 6: Econometrics
## Chapter 6.01: The Linear Model: Theory, Diagnostics, and Extensions

### Table of Contents
1.  [The Theory of Ordinary Least Squares (OLS)](#1.-The-Theory-of-Ordinary-Least-Squares)
    *   [1.1 Geometric Interpretation](#1.1-The-Geometric-Interpretation-of-OLS)
    *   [1.2 Finite-Sample Properties (Gauss-Markov)](#1.2-Finite-Sample-Properties-(Gauss-Markov))
    *   [1.3 Asymptotic Properties](#1.3-Asymptotic-Properties)
2.  [The Frisch-Waugh-Lovell Theorem](#2.-The-Frisch-Waugh-Lovell-Theorem)
3.  [Model Diagnostics](#3.-Model-Diagnostics)
    *   [3.1 Heteroskedasticity](#3.1-Heteroskedasticity)
    *   [3.2 Multicollinearity](#3.2-Multicollinearity)
4.  [Advanced Topics](#4.-Advanced-Topics)
    *   [4.1 Generalized Least Squares (GLS)](#4.1-Generalized-Least-Squares-(GLS))
    *   [4.2 Regularization for High-Dimensional Models](#4.2-Regularization-for-High-Dimensional-Models)
    *   [4.3 Bayesian Linear Regression with Gibbs Sampling](#4.3-Bayesian-Linear-Regression-with-Gibbs-Sampling)
5.  [Chapter Summary](#5.-Chapter-Summary)
6.  [Exercises](#6.-Exercises)

### Introduction: The Cornerstone of Econometrics
The linear regression model, estimated via **Ordinary Least Squares (OLS)**, is the cornerstone of econometrics. Its power stems from its computational simplicity, intuitive geometric interpretation, and deep statistical theory. This chapter provides a PhD-level treatment of the linear model, covering its theoretical foundations, diagnostic tools, and key extensions for modern empirical work.

### 1. The Theory of Ordinary Least Squares
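The linear model is $\mathbf{y} = \mathbf{X}\beta + \mathbf{u}$, where $\mathbf{y}$ is the $n \times 1$ vector of outcomes, $\mathbf{X}$ is the $n \times k$ regressor matrix (assumed to have full column rank), and $\mathbf{u}$ is the $n \times 1$ error vector. The OLS estimator $\hat{\beta}$ minimizes the Sum of Squared Residuals (SSR) and is given by: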
$$ \hat{\beta}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y} $$
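
As a quick numerical check of this formula, the sketch below computes $\hat{\beta}_{OLS}$ on simulated data. The sample size, seed, and `beta_true` values are illustrative assumptions, not taken from the chapter's datasets. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and uses `np.linalg.lstsq` as an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Simulated design: a constant plus two standard-normal regressors
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])      # hypothetical parameter values
y = X @ beta_true + rng.standard_normal(n)

# Solve the normal equations (X'X) b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq           :", beta_lstsq)
```

Both routes should agree to numerical precision. In applied work a QR- or SVD-based solver such as `lstsq` is usually preferred when $\mathbf{X}'\mathbf{X}$ is poorly conditioned.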

#### 1.1 The Geometric Interpretation of OLS
OLS has a powerful geometric interpretation. The vector of fitted values, $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$, is the **orthogonal projection** of the actual data vector $\mathbf{y}$ onto the **column space of X**. The vector of residuals, $\hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}}$, is the component of $\mathbf{y}$ that is orthogonal to this space. The normal equations, $\mathbf{X}'\hat{\mathbf{u}} = 0$, are the formal statement of this orthogonality.
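
The projection view can be verified numerically. The following minimal sketch, on arbitrary simulated data (seed and dimensions are assumptions), builds the projection matrix $\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and the residual-maker $\mathbf{M} = \mathbf{I} - \mathbf{P}$, then checks idempotency, the normal equations, and the Pythagorean decomposition of $\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = rng.standard_normal(n)   # any y works: the geometry does not depend on the model

# Projection matrix onto col(X) and the residual-maker matrix
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

y_hat = P @ y    # fitted values: projection of y onto col(X)
u_hat = M @ y    # residuals: component of y orthogonal to col(X)

print("P idempotent      :", np.allclose(P @ P, P))
print("X'u_hat = 0       :", np.allclose(X.T @ u_hat, 0))
print("|y|^2 decomposes  :", np.isclose(y @ y, y_hat @ y_hat + u_hat @ u_hat))
```

Forming the full $n \times n$ matrix $\mathbf{P}$ is only sensible for small illustrations like this one; for real data the fitted values are computed as $\mathbf{X}\hat{\beta}$.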

#### 1.2 Finite-Sample Properties (Gauss-Markov)
Under the **Classical Linear Model (CLM) Assumptions** (linearity, no perfect collinearity, strict exogeneity, and spherical errors, i.e., homoskedasticity and no serial correlation), the **Gauss-Markov Theorem** states that the OLS estimator is the **Best Linear Unbiased Estimator (BLUE)**: it has the smallest variance among all linear unbiased estimators. Normality of the errors is not needed for this result; it is only required for exact finite-sample t and F distributions.

#### 1.3 Asymptotic Properties
Often, the strict CLM assumptions do not hold. However, OLS still has desirable **asymptotic properties** (properties in large samples) under weaker assumptions.
- **Consistency:** If $E[\mathbf{x}_i u_i] = 0$ (a weaker exogeneity assumption), then as the sample size $n \to \infty$, the OLS estimator $\hat{\beta}$ converges in probability to the true parameter $\beta$. This relies on the Law of Large Numbers.
- **Asymptotic Normality:** Under the same weak exogeneity assumption and finite fourth moments of the errors, the scaled estimation error $\sqrt{n}(\hat{\beta} - \beta)$ converges in distribution to a mean-zero normal as the sample size grows. This relies on the Central Limit Theorem and justifies the use of t-tests and F-tests in large samples, even if the errors are not normally distributed.
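
A small Monte Carlo experiment can illustrate both properties. The sketch below assumes a simple regression with a standard-normal regressor and deliberately skewed errors (a centered exponential with variance 1); the true slope, seed, sample sizes, and number of replications are all illustrative choices. Under this design the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is $\sigma^2 / Var(x) = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 2.0     # hypothetical true slope
reps = 500     # Monte Carlo replications

def slope_draws(n):
    """Return `reps` OLS slope estimates, each from a fresh sample of size n."""
    x = rng.standard_normal((reps, n))
    u = rng.exponential(1.0, size=(reps, n)) - 1.0   # skewed, mean 0, variance 1
    y = 1.0 + beta * x + u
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc ** 2).sum(axis=1)

results = {n: slope_draws(n) for n in (25, 400, 1600)}
for n, b in results.items():
    z = np.sqrt(n) * (b - beta)   # scaled estimation error
    print(f"n={n:5d}  mean(b)={b.mean():.4f}  sd(b)={b.std():.4f}  sd(sqrt(n)(b-beta))={z.std():.3f}")
```

The spread of the estimates should shrink toward zero as $n$ grows (consistency), while the standard deviation of the scaled error should settle near 1 (asymptotic normality with the variance given above), despite the non-normal errors.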

### 2. The Frisch-Waugh-Lovell Theorem
The FWL theorem provides a deep understanding of what a multiple regression coefficient represents. It states that the OLS estimate $\hat{\beta}_1$ from a regression of $y$ on $X_1$ and $X_2$ is numerically identical to the coefficient from a regression of the residuals from regressing $y$ on $X_2$ on the residuals from regressing $X_1$ on $X_2$. In essence, $\hat{\beta}_1$ captures the effect of the part of $X_1$ that is orthogonal to (i.e., uncorrelated in sample with) $X_2$.
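
The theorem is an exact in-sample identity, so it can be checked directly. The sketch below uses simulated data in which `x1` is correlated with the controls in `X2`; all coefficient values and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X2 = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])   # controls, incl. constant
x1 = 0.6 * X2[:, 1] + rng.standard_normal(n)                      # correlated with X2
y = 1.5 * x1 + X2 @ np.array([0.5, -1.0, 2.0]) + rng.standard_normal(n)

def ols(X, y):
    """Least-squares coefficients of y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Long regression: coefficient on x1 when X2 is included
b_long = ols(np.column_stack([x1, X2]), y)[0]

# FWL: partial X2 out of both y and x1, then regress residual on residual
y_res = y - X2 @ ols(X2, y)
x1_res = x1 - X2 @ ols(X2, x1)
b_fwl = (x1_res @ y_res) / (x1_res @ x1_res)

print(f"long regression: {b_long:.6f}")
print(f"FWL two-step   : {b_fwl:.6f}")
```

The two numbers agree up to floating-point error regardless of the data-generating process, because the result is algebraic rather than statistical.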

### 3. Model Diagnostics

#### 3.1 Heteroskedasticity
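**Heteroskedasticity** occurs when the variance of the error term is not constant across observations ($Var(u_i | \mathbf{x}_i) = \sigma_i^2$). OLS remains unbiased and consistent, but it is no longer efficient and its conventional standard errors are invalid, so the usual t and F statistics cannot be trusted. We can test for it using the **Breusch-Pagan test** and correct the standard errors using **heteroskedasticity-robust** (White/HC) estimators.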
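
The mechanics of both tools can be sketched with NumPy alone. The example below assumes a simulated design in which the error standard deviation grows with the regressor; the seed, coefficients, and variance function are illustrative. It computes the Breusch-Pagan statistic in its Koenker $n R^2$ form (from an auxiliary regression of squared residuals on the regressors) and compares conventional standard errors with the HC1 sandwich estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
u = x**2 * rng.standard_normal(n)          # assumed design: sd(u | x) = x^2
y = X @ np.array([1.0, 0.5]) + u

k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Breusch-Pagan (Koenker n*R^2 form): regress squared residuals on X
e2 = resid**2
g = np.linalg.lstsq(X, e2, rcond=None)[0]
r2_aux = 1.0 - np.sum((e2 - X @ g)**2) / np.sum((e2 - e2.mean())**2)
lm_stat = n * r2_aux
CHI2_1_CRIT_5PCT = 3.841   # 5% critical value, chi-square with 1 degree of freedom

# Conventional standard errors: sigma^2 (X'X)^{-1}
sigma2_hat = resid @ resid / (n - k)
se_ols = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# HC1 robust standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1} * n/(n-k)
meat = (X * e2[:, None]).T @ X
se_hc1 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - k))

print(f"BP LM statistic = {lm_stat:.1f} (reject homoskedasticity if > {CHI2_1_CRIT_5PCT})")
print("conventional SE:", se_ols)
print("HC1 robust SE  :", se_hc1)
```

In practice the same quantities are available from `statsmodels`: `het_breuschpagan` in `statsmodels.stats.diagnostic` returns the LM statistic with its p-value, and passing `cov_type='HC1'` to `.fit()` yields the robust covariance matrix.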

#### 3.2 Multicollinearity
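**Multicollinearity** occurs when regressors are highly correlated. It does not bias the coefficients, but it inflates their variance, making estimates imprecise. We can detect it using the **Variance Inflation Factor (VIF)**, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing regressor $j$ on all the other regressors. A VIF above 5 (or, more leniently, 10) is a common rule-of-thumb red flag.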
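
A hand-rolled VIF makes the auxiliary-regression logic explicit. In the sketch below, the helper `vif` is a hypothetical function written for this illustration, and the simulated regressors (one near-duplicate pair and one unrelated variable) are assumptions chosen to produce a clear contrast.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x1 = rng.standard_normal(n)
x2 = x1 + 0.1 * rng.standard_normal(n)    # nearly a copy of x1
x3 = rng.standard_normal(n)               # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on a constant and all remaining columns of X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        r2 = 1.0 - np.sum((target - fitted)**2) / np.sum((target - target.mean())**2)
        out[j] = 1.0 / (1.0 - r2)
    return out

vifs = vif(X)
for name, v in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {v:.1f}")
```

The `variance_inflation_factor` function imported in the setup cell computes the same quantity one column at a time; to the best of our understanding it does not add an intercept itself, so the design matrix passed to it should already contain a constant column.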

### 4. Advanced Topics

#### 4.1 Generalized Least Squares (GLS)
When the error covariance matrix is not spherical ($E[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \mathbf{\Omega} \ne \sigma^2 \mathbf{I}$), OLS remains unbiased under strict exogeneity but is no longer BLUE. The efficient estimator is **Generalized Least Squares (GLS)**:
$$ \hat{\beta}_{GLS} = (\mathbf{X}'\mathbf{\Omega}^{-1}\mathbf{X})^{-1} \mathbf{X}'\mathbf{\Omega}^{-1}\mathbf{y} $$
In practice, the true error covariance matrix $\mathbf{\Omega}$ is unknown and must be estimated. This leads to **Feasible GLS (FGLS)**, a two-step procedure where we use the residuals from a first-stage OLS regression to estimate $\mathbf{\Omega}$.
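
Before turning to the feasible version, it helps to see that GLS with a known $\mathbf{\Omega}$ is simply OLS on "whitened" data. The sketch below assumes an AR(1)-type covariance $\Omega_{ij} = \rho^{|i-j|}$ with $\rho = 0.7$ (an illustrative choice, as are the seed and coefficients). It computes the GLS formula directly and then reproduces it by premultiplying $\mathbf{y}$ and $\mathbf{X}$ by $\mathbf{L}^{-1}$, where $\mathbf{\Omega} = \mathbf{L}\mathbf{L}'$ is the Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 200, 0.7                                    # rho: assumed AR(1) parameter
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])   # Omega_ij = rho^|i-j|
L = np.linalg.cholesky(Omega)                        # Omega = L L'

X = np.column_stack([np.ones(n), rng.standard_normal(n)])
u = L @ rng.standard_normal(n)                       # errors with covariance Omega
y = X @ np.array([1.0, 2.0]) + u

# Direct GLS formula: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Oinv_X = np.linalg.solve(Omega, X)
beta_gls = np.linalg.solve(X.T @ Oinv_X, Oinv_X.T @ y)

# Equivalent route: whiten with L^{-1}, then run plain OLS
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)
beta_white = np.linalg.lstsq(X_star, y_star, rcond=None)[0]

print("GLS formula   :", beta_gls)
print("OLS, whitened :", beta_white)
```

The transformed errors $\mathbf{L}^{-1}\mathbf{u}$ have covariance $\mathbf{I}$, so Gauss-Markov applies to the whitened regression. FGLS follows the same recipe with an estimate $\hat{\mathbf{\Omega}}$ in place of $\mathbf{\Omega}$.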

#### 4.2 Regularization for High-Dimensional Models
In high-dimensional settings (where the number of predictors $p$ is close to or larger than the number of observations $n$), OLS performs poorly or fails completely. **Regularization** methods solve this by adding a penalty term to the minimization problem, which introduces some bias to dramatically reduce variance.
- **Ridge Regression (L2 Penalty):** Minimizes $SSR + \lambda \sum \beta_j^2$. It shrinks all coefficients towards zero but, in general, sets none of them exactly to zero.
- **LASSO (L1 Penalty):** Minimizes $SSR + \lambda \sum |\beta_j|$. The L1 penalty has the crucial property that it can shrink some coefficients to be *exactly* zero, thus performing automatic variable selection.
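
The difference between the two penalties is easiest to see in the special case of an orthonormal design ($\mathbf{X}'\mathbf{X} = \mathbf{I}$), where both problems have closed-form solutions in terms of the OLS estimate: Ridge rescales each coefficient by $1/(1+\lambda)$, while LASSO soft-thresholds it at $\lambda/2$ (given the objectives as written above). The sketch below is an illustration under that assumption; `lam`, `beta_true`, and the noise level are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 100, 5, 1.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns: X'X = I
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])   # hypothetical sparse truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b_ols = X.T @ y                                    # OLS estimate when X'X = I

# Ridge: minimize SSR + lam * sum(b^2)  ->  proportional shrinkage
b_ridge = b_ols / (1.0 + lam)

# LASSO: minimize SSR + lam * sum(|b|)  ->  soft-thresholding at lam / 2
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

# Cross-check the ridge shortcut against the general closed form
b_ridge_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS  :", b_ols)
print("Ridge:", b_ridge)
print("LASSO:", b_lasso)
```

Small OLS coefficients fall inside the LASSO threshold and are set exactly to zero, whereas Ridge only scales them down. With correlated regressors no closed form exists for LASSO, which is why iterative solvers such as the one behind `LassoCV` are used.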

In [None]:
sec("LASSO for Variable Selection in a High-Dimensional Setting")
rng = np.random.default_rng(42)
N, P_total, P_true = 100, 200, 10 # 100 obs, 200 potential predictors, only 10 are non-zero

X = rng.standard_normal((N, P_total))
# Draw the non-zero coefficients from the seeded generator so the example is reproducible
true_beta = np.zeros(P_total); true_beta[:P_true] = rng.uniform(-2, 2, P_true)
y = X @ true_beta + rng.normal(0, 2, size=N)

# Use LassoCV to find the best penalty parameter alpha via cross-validation
lasso_cv = LassoCV(cv=10, random_state=42).fit(X, y)

note(f"LASSO with cross-validation selected alpha = {lasso_cv.alpha_:.4f}")
estimated_beta = lasso_cv.coef_

fig, ax = plt.subplots(figsize=(12, 7))
ax.stem(np.arange(P_total), true_beta, linefmt='k-', markerfmt='ko', basefmt='k-', label='True Coefficients')
ax.stem(np.arange(P_total), estimated_beta, linefmt='r--', markerfmt='rx', basefmt='r--', label='LASSO Estimates')
ax.set_title('LASSO for Variable Selection'); ax.set_xlabel('Coefficient Index'); ax.legend()
plt.show()
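# Report what LASSO actually selected instead of asserting it
selected = np.flatnonzero(estimated_beta)
n_true_found = int(np.sum(selected < P_true))
note(f"LASSO kept {selected.size} of {P_total} predictors, including {n_true_found} of the {P_true} truly non-zero ones; "
     "every other coefficient is shrunk to exactly zero. OLS has no unique solution in this p > n setting because X'X is singular.")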

#### 4.3 Bayesian Linear Regression with Gibbs Sampling
The Bayesian approach treats the parameters $\beta$ and $\sigma^2$ as random variables. We place priors on them and use the data to update these to a posterior distribution. For the linear model with standard conjugate priors, we can use a highly efficient MCMC method called a **Gibbs Sampler**.

A Gibbs sampler works by iteratively sampling from the **full conditional posterior distribution** of each parameter, holding the others fixed. For the linear model, these distributions have known analytical forms:
- $\beta \mid \mathbf{y}, \mathbf{X}, \sigma^2 \sim \text{Normal}(\dots)$
- $\sigma^2 \mid \mathbf{y}, \mathbf{X}, \beta \sim \text{Inverse-Gamma}(\dots)$
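
To make these conditionals concrete, one common specification (an assumption here, since the chapter does not state its priors) is the independent prior $\beta \sim N(\beta_0, \mathbf{V}_0)$ and $\sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0)$ in the shape-scale parameterization. Under that prior, with $n$ observations, the full conditionals would be:

$$ \beta \mid \mathbf{y}, \mathbf{X}, \sigma^2 \sim N(\beta_n, \mathbf{V}_n), \qquad \mathbf{V}_n = \left( \mathbf{V}_0^{-1} + \sigma^{-2} \mathbf{X}'\mathbf{X} \right)^{-1}, \qquad \beta_n = \mathbf{V}_n \left( \mathbf{V}_0^{-1} \beta_0 + \sigma^{-2} \mathbf{X}'\mathbf{y} \right) $$

$$ \sigma^2 \mid \mathbf{y}, \mathbf{X}, \beta \sim \text{Inverse-Gamma}\left( a_0 + \frac{n}{2}, \; b_0 + \frac{1}{2} (\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta) \right) $$

The conditional mean $\beta_n$ is a precision-weighted average of the prior mean and the OLS estimate, so a diffuse prior (large $\mathbf{V}_0$) pulls it toward $\hat{\beta}_{OLS}$. Other prior choices lead to different formulas.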

### 5. Chapter Summary
- **OLS Theory:** The OLS fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. Under the CLM assumptions, the OLS estimator is BLUE. Under weaker assumptions, it is consistent and asymptotically normal.
- **FWL Theorem:** Provides a deep interpretation of multiple regression coefficients.
- **Diagnostics:** It is crucial to test for violations of the CLM assumptions, particularly heteroskedasticity and multicollinearity, and use robust methods when necessary.
- **GLS:** When the error structure is non-spherical, GLS is the efficient estimator.
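- **High-Dimensional Data:** When $p > n$, OLS has no unique solution, and it is already unreliable when $p$ is close to $n$. Regularization methods like Ridge and LASSO accept some bias in exchange for a large reduction in variance. LASSO is particularly useful for variable selection.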
- **Bayesian Approach:** Provides a powerful alternative that yields a full posterior distribution of the parameters, allowing for a richer characterization of uncertainty.

### 6. Exercises

1.  **Omitted Variable Bias Formula:** The FWL theorem provides the formula for omitted variable bias. Using the `Guerry` dataset, run the short regression `Literacy ~ Crime`, the long regression `Literacy ~ Crime + Clergy`, and the auxiliary regression `Clergy ~ Crime`. Manually calculate the OVB for the `Crime` coefficient when `Clergy` is omitted, as the product of the `Clergy` coefficient from the long regression and the `Crime` coefficient from the auxiliary regression. Verify your result by comparing it to the difference in the `Crime` coefficient between the short and long regressions.

2.  **Implementing FGLS:** Suppose you suspect the errors in the `Guerry` model are heteroskedastic, with a variance that depends on `Population`. Implement the following FGLS procedure: 1) Run OLS and get the residuals $\hat{u}$. 2) Regress $\ln(\hat{u}^2)$ on $\ln(Population)$ and exponentiate the fitted values to obtain predicted variances $\hat{\sigma}_i^2$. 3) Run Weighted Least Squares (WLS) with weights $1/\hat{\sigma}_i^2$. Compare the FGLS standard errors to the OLS and robust standard errors.

3.  **LASSO vs. Ridge:** In the high-dimensional example, replace the `LassoCV` with `RidgeCV`. How do the estimated coefficients differ? Why does Ridge not perform variable selection?

4.  **Gibbs Sampler:** Implement a Gibbs sampler for the Bayesian linear model from scratch. You will need to code the loop that iteratively draws from the conditional posterior for $\beta$ (a multivariate normal) and $\sigma^2$ (an inverse-gamma). Compare the posterior means from your sampler to the results from the `MCMCSampler` class.

5.  **Horseshoe Prior:** The Horseshoe prior is a popular choice for sparse Bayesian models. Research its properties. How does it differ from a standard Laplace (Bayesian LASSO) prior? Why is it often considered to have better theoretical properties?