In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import arviz as az
from scipy.stats import beta as beta_dist, norm
from IPython.display import display, Markdown  # needed by the note() helper below
from econometrics_utils import MCMCSampler
# pip install pymc
try:
    import pymc as pm
    PYMC_AVAILABLE = True
except ImportError:
    PYMC_AVAILABLE = False

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 12, 'figure.figsize': (11, 7), 'figure.dpi': 130})
np.set_printoptions(suppress=True, linewidth=120, precision=4)
warnings.filterwarnings('ignore', category=FutureWarning)
az.style.use('arviz-darkgrid')

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

if not PYMC_AVAILABLE:
    note("The 'pymc' library is not installed (`pip install pymc`). Some sections will be skipped.")
note(f"Environment initialized. PyMC available: {PYMC_AVAILABLE}")

# Part 6: Econometrics
## Chapter 6.6: An Introduction to Bayesian Econometrics

### Introduction: A Different Philosophy of Inference

This chapter introduces a fundamentally different philosophy of statistical inference: **Bayesian Econometrics**. In contrast to the **Frequentist** approach, which underpins most of the methods seen so far (like OLS and MLE), the Bayesian paradigm treats model parameters not as fixed, unknown constants, but as random variables about which we can have beliefs and uncertainty.

The philosophical divide between the two schools of thought can be summarized by what each considers to be random:

- The **Frequentist** approach views the **data as random** and the **parameters as fixed**. A parameter like a regression coefficient, $\beta$, is a single, unknown constant of nature. We cannot make probability statements about $\beta$ itself (e.g., "there is a 95% probability that $\beta$ is between 0.5 and 1.0"). It either is or it isn't in that interval. Instead, we construct **confidence intervals**. A 95% confidence interval is a statement about the *procedure*: if we could repeat our experiment many times, drawing new random datasets each time, 95% of the intervals we construct would contain the true, fixed $\beta$.

- The **Bayesian** approach views the **data as fixed** (it's what we observed) and the **parameters as random**. Probability is defined as a **degree of belief** or confidence in a proposition. It is perfectly natural to have a probability distribution over a parameter, representing our uncertainty about its true value. We start with an initial belief (the **prior**), and as we collect data, we update our beliefs in a principled way using **Bayes' Theorem**.
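
The repeated-sampling reading of a confidence interval can be checked by simulation. The sketch below is only an illustration under assumed conditions: a normal population with a known standard deviation, and hypothetical values for the true mean and sample size that do not come from this chapter. It builds a 95% interval in each simulated experiment and records how often the interval covers the fixed true mean.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mu, sigma, n = 1.0, 2.0, 50       # hypothetical population and sample size
n_experiments = 10_000
z = 1.959964                            # 97.5th percentile of the standard normal

# Draw all datasets at once: one row per repeated experiment
samples = rng.normal(true_mu, sigma, size=(n_experiments, n))
means = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)     # half-width of the known-sigma interval

# For each experiment, check whether its interval contains the fixed true mean
covered = (means - half_width <= true_mu) & (true_mu <= means + half_width)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should be close to 0.95, which is a property of the procedure. Any single interval either contains the true mean or does not.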

This approach provides a powerful and intuitive framework for thinking about and communicating uncertainty. Its modern application has been unlocked by computational breakthroughs, particularly in **Markov Chain Monte Carlo (MCMC)** algorithms. These methods allow us to solve complex and high-dimensional problems that were computationally intractable for most of the 20th century, moving Bayesian analysis from a theoretical curiosity to a practical tool for empirical research.

## 1. The Core of the Bayesian Paradigm: Bayes' Theorem

The engine of Bayesian inference is Bayes' Theorem, which provides a formal rule for updating our beliefs in light of new evidence.
$$ \underbrace{P(\theta|D)}_{\text{Posterior}} = \frac{\overbrace{P(D|\theta)}^{\text{Likelihood}} \times \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(D)}_{\text{Marginal Likelihood}}} $$ 

Let's break down the components in the context of econometric modeling:

- **Prior Distribution, $P(\theta)$:** This distribution represents our beliefs about the parameter $\theta$ *before* seeing the data. This is a powerful feature, as it allows us to formally incorporate existing knowledge or theory into our model. If we are largely ignorant or want the data to speak for itself, we can use a "weak," "diffuse," or "uninformative" prior that assigns roughly equal belief across a wide range of parameter values.

- **Likelihood, $P(D|\theta)$:** This is the same likelihood function used in Maximum Likelihood Estimation. It specifies the data-generating process. It answers the question: "If the true parameter value were $\theta$, what is the probability (or, for continuous data, the probability density) of observing the data $D$ that we actually collected?"

- **Posterior Distribution, $P(\theta|D)$:** This is the primary output of a Bayesian analysis. It represents our updated beliefs about $\theta$ *after* observing the data. The posterior is a principled compromise between the information contained in the prior and the information contained in the data (via the likelihood). In simple conjugate models, the posterior mean is literally a weighted average of the prior mean and the data-based estimate.

- **Marginal Likelihood, $P(D)$:** This term represents the probability of the data, averaged over all possible values of the parameters, weighted by our prior beliefs: $P(D) = \int P(D|\theta)P(\theta) d\theta$. This denominator acts as a normalizing constant, ensuring the posterior distribution integrates to 1. For parameter estimation, we can often ignore it, as the posterior is proportional to the numerator: **Posterior $\propto$ Likelihood $\times$ Prior**. However, the marginal likelihood (also called **model evidence**) is crucial for Bayesian model comparison.
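
To make the roles of the four components concrete, the following sketch approximates a posterior on a grid of parameter values. It assumes a single parameter $\theta$ in $(0, 1)$, a flat prior, and hypothetical data of 15 successes in 20 Bernoulli trials. The grid size and variable names are illustrative choices.

```python
import numpy as np

# Hypothetical data: 15 successes in 20 Bernoulli trials
n, k = 20, 15

# Midpoints of 1000 equal-width bins on (0, 1)
theta = np.linspace(0.0005, 0.9995, 1000)
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)                      # flat prior density on (0, 1)
likelihood = theta**k * (1 - theta)**(n - k)     # binomial kernel (constant dropped)

unnormalized = likelihood * prior                # numerator of Bayes' Theorem
# Riemann-sum approximation to the integral in the denominator; because the
# binomial coefficient was dropped, this is P(D) up to that constant.
evidence = np.sum(unnormalized) * d_theta
posterior = unnormalized / evidence              # normalized posterior density

post_mean = np.sum(theta * posterior) * d_theta
print(f"Posterior integrates to {np.sum(posterior) * d_theta:.4f}")
print(f"Posterior mean of theta: {post_mean:.4f}")
```

Dividing by `evidence` changes only the scale of the curve, not its shape, which is why parameter estimation can proceed from the numerator alone. With a flat prior the exact posterior is Beta(16, 6), so the posterior mean should be close to $16/22 \approx 0.727$.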

## 2. Conjugate Priors: An Analytical Shortcut

Before modern computational methods became widespread, Bayesian analysis relied heavily on finding analytical solutions to the updating problem. This is possible in special cases where the **prior distribution is conjugate to the likelihood function**. A prior is conjugate if the resulting posterior distribution belongs to the same family of distributions as the prior.

The classic example is the **Beta-Binomial model**. Suppose we are interested in the probability of success, $p$, in a series of Bernoulli trials. The likelihood of observing $k$ successes in $n$ trials is given by the Binomial distribution. If we choose a Beta distribution as our prior for the parameter $p$, the resulting posterior for $p$ is also a Beta distribution. This provides a simple, closed-form way to see how the data updates our beliefs.

- **Prior:** $p \sim \text{Beta}(\alpha_0, \beta_0)$
- **Likelihood:** $k | p \sim \text{Binomial}(n, p)$
- **Posterior:** $p | k \sim \text{Beta}(\alpha_0 + k, \beta_0 + n - k)$

The updating rule is beautifully simple: we just add the number of observed successes to the prior's $\alpha$ parameter and the number of observed failures to the prior's $\beta$ parameter.
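
The rule follows from multiplying the Binomial likelihood by the Beta prior density and dropping every factor that does not depend on $p$:

$$
\begin{aligned}
P(p \mid k) &\propto \underbrace{p^{k}(1-p)^{n-k}}_{\text{Binomial kernel}} \times \underbrace{p^{\alpha_0 - 1}(1-p)^{\beta_0 - 1}}_{\text{Beta kernel}} \\
&= p^{(\alpha_0 + k) - 1}\,(1-p)^{(\beta_0 + n - k) - 1},
\end{aligned}
$$

which is the kernel of a $\text{Beta}(\alpha_0 + k, \beta_0 + n - k)$ density. The posterior mean can also be written as a weighted average of the prior mean and the sample proportion:

$$
E[p \mid k] = \frac{\alpha_0 + k}{\alpha_0 + \beta_0 + n} = w \, \frac{\alpha_0}{\alpha_0 + \beta_0} + (1 - w) \, \frac{k}{n}, \qquad w = \frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + n}.
$$

The weight $w$ on the prior shrinks toward zero as $n$ grows, so $\alpha_0 + \beta_0$ can be read as the prior's "equivalent sample size."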

In [None]:
sec("Bayesian Updating with a Conjugate Prior (Beta-Binomial)")

# === Setup ===
# 1. Prior Beliefs: We start with a Beta(2, 2) prior. This is a weak prior centered 
#    at 0.5, suggesting we think the probability is likely around 50/50 but we are not very certain.
alpha_prior, beta_prior = 2, 2

# 2. Observed Data: We conduct an experiment and observe 15 successes in 20 trials.
n_trials, n_successes = 20, 15

# === Bayesian Updating ===
# 3. Posterior Calculation: We update our beliefs using the conjugate rule.
alpha_post = alpha_prior + n_successes
beta_post = beta_prior + (n_trials - n_successes)

# === Visualization ===
p_grid = np.linspace(0, 1, 200)
prior_pdf = beta_dist.pdf(p_grid, alpha_prior, beta_prior)
posterior_pdf = beta_dist.pdf(p_grid, alpha_post, beta_post)
likelihood_unscaled = (p_grid**n_successes * (1-p_grid)**(n_trials-n_successes))
likelihood_scaled = likelihood_unscaled / np.max(likelihood_unscaled) * np.max(posterior_pdf) # Scale for visualization

fig, ax = plt.subplots()
ax.plot(p_grid, prior_pdf, label=f'Prior: Beta({alpha_prior}, {beta_prior})', color='gray', ls='--')
ax.plot(p_grid, posterior_pdf, label=f'Posterior: Beta({alpha_post}, {beta_post})', color='blue', lw=2.5)
ax.fill_between(p_grid, likelihood_scaled, color='green', alpha=0.3, label='Likelihood (scaled for viz)')
ax.axvline(n_successes/n_trials, color='green', ls=':', label=f'Data MLE = {n_successes/n_trials:.2f}')

ax.set_title('Figure 1: Bayesian Updating in the Beta-Binomial Model', fontsize=16)
ax.set_xlabel('Probability of Success (p)')
ax.set_ylabel('Probability Density')
ax.legend()
plt.show()

note("The posterior distribution is a compromise between the broad prior belief centered at 0.5 and the evidence from the data (the likelihood, which peaks at the MLE of 0.75). The posterior is much narrower than the prior, reflecting our increased certainty after observing the data.")

## 3. The Computational Revolution: MCMC from Scratch

For most realistic models, conjugate priors are not available. The integral required to calculate the marginal likelihood is often high-dimensional and analytically intractable. The modern Bayesian revolution was sparked by **Markov Chain Monte Carlo (MCMC)** methods, which allow us to draw samples from a posterior distribution without having to calculate it directly.

The key insight is that we only need to calculate the posterior *up to a constant of proportionality*, since **Posterior $\propto$ Likelihood $\times$ Prior**. MCMC algorithms use this fact to construct a "random walk" (a Markov Chain) that explores the parameter space, spending more time in regions of high posterior probability. After an initial "burn-in" period is discarded, the remaining samples can be treated as (autocorrelated) draws from the posterior distribution.

#### 3.1 The Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is one of the simplest and most foundational MCMC algorithms. Here's the intuition:
1.  **Start** at some initial parameter value, $\theta_{current}$.
2.  **Propose** a new parameter value, $\theta_{proposal}$, by taking a small random step away from the current value (e.g., drawing from a normal distribution centered at $\theta_{current}$).
3.  **Compare** the posterior probability at the proposed point to the current point by calculating the ratio $ \alpha = \frac{P(\theta_{proposal}|D)}{P(\theta_{current}|D)} $. Because the normalizing constant $P(D)$ cancels, this is just the ratio of the (Likelihood $\times$ Prior) at the two points. This simple ratio assumes a symmetric proposal, as in step 2; an asymmetric proposal requires an extra correction factor (the Hastings ratio).
4.  **Decide** whether to move to the proposed point:
    - If the proposal is in a region of higher probability ($\alpha > 1$), we **always accept** the move. $\theta_{new} = \theta_{proposal}$.
    - If the proposal is in a region of lower probability ($\alpha < 1$), we **accept it with probability $\alpha$**. We draw a random number from a uniform(0,1) distribution; if it's less than $\alpha$, we move. Otherwise, we **reject** the proposal and stay where we are ($\theta_{new} = \theta_{current}$).
5.  **Repeat** this process thousands of times.

Under mild regularity conditions, this simple "propose-accept-reject" rule ensures that the chain's stationary distribution is the true posterior, so a sufficiently long run yields samples from it.
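
The five steps can be sketched in a few lines of NumPy. This is a minimal random-walk Metropolis routine for a scalar parameter, written for illustration only. The function names `metropolis_sketch` and `log_kernel` are hypothetical, and the code is not the `MCMCSampler` class from `econometrics_utils`, whose implementation is not shown in this chapter. The target is the Beta-Binomial posterior from Section 2 (a Beta(2, 2) prior with 15 successes in 20 trials), evaluated only up to its normalizing constant.

```python
import numpy as np

def metropolis_sketch(log_target, start, n_steps=20_000, step_size=0.1, seed=0):
    """Random-walk Metropolis for a scalar parameter (illustrative only)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    current = start
    current_lp = log_target(current)
    n_accepted = 0
    for i in range(n_steps):
        # Step 2: symmetric normal proposal centered at the current value
        proposal = current + rng.normal(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Steps 3-4: accept with probability min(1, alpha), compared in logs
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            n_accepted += 1
        chain[i] = current              # a rejection repeats the current value
    return chain, n_accepted / n_steps

def log_kernel(p, k=15, n=20, a0=2.0, b0=2.0):
    """Unnormalized log-posterior: Binomial likelihood times Beta(a0, b0) prior."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf                  # zero posterior mass outside (0, 1)
    return (a0 + k - 1) * np.log(p) + (b0 + n - k - 1) * np.log(1 - p)

chain, accept_rate = metropolis_sketch(log_kernel, start=0.5)
kept = chain[2_000:]                    # discard the burn-in period
print(f"acceptance rate: {accept_rate:.2f}, posterior mean: {kept.mean():.3f}")
```

The posterior mean of the retained draws should land near $17/24 \approx 0.708$, the mean of the analytical Beta(17, 7) posterior. Note that a rejected proposal still adds a draw to the chain (a repeat of the current value); dropping those repeats would bias the sample.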

#### 3.2 Implementing MCMC for the Beta-Binomial Model
Let's use our new `MCMCSampler` class, which implements the MH algorithm, to solve the Beta-Binomial model. This will validate our sampler by showing that it can reproduce the analytical posterior we found earlier.

First, we need to define a function for the **log-posterior**, which is just the sum of the log-likelihood and the log-prior.
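
Working on the log scale is a numerical choice rather than a statistical one: a likelihood is a product of many numbers below one, and that product quickly underflows in floating-point arithmetic. The short sketch below illustrates this with made-up per-observation contributions (the values are hypothetical and serve only to show the effect).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-observation likelihood contributions, each well below 1
probs = rng.uniform(0.1, 0.5, size=2_000)

direct_product = np.prod(probs)       # underflows to 0.0 in double precision
log_sum = np.sum(np.log(probs))       # the same quantity on the log scale stays finite

print(f"Direct product: {direct_product}")
print(f"Sum of logs:    {log_sum:.1f}")
```

Because the acceptance step only needs a ratio of posteriors, the sampler can compare a difference of log-posteriors and never has to exponentiate these very negative numbers.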

In [None]:
sec("Estimating the Beta-Binomial Model with a From-Scratch MCMC Sampler")

def log_posterior_beta_binomial(p, data):
    """ Log-posterior for the Beta-Binomial model. """
    # Unpack data and parameters
    n_trials, n_successes = data
    alpha_prior, beta_prior = 2, 2
    
    # Parameter must be in (0, 1)
    if p <= 0 or p >= 1:
        return -np.inf
    
    # Log-prior: Beta(alpha, beta)
    log_prior = beta_dist.logpdf(p, alpha_prior, beta_prior)
    
    # Log-likelihood: Binomial(n, p)
    log_likelihood = n_successes * np.log(p) + (n_trials - n_successes) * np.log(1 - p)
    
    return log_prior + log_likelihood

# Instantiate and run the sampler
mcmc_sampler = MCMCSampler(log_posterior_beta_binomial, data=(n_trials, n_successes))
mcmc_sampler.sample(start_params=[0.5], num_samples=20000, burn_in=2000, step_size=0.05)

# Get the posterior samples
posterior_samples = mcmc_sampler.samples.flatten()

note("Summary of the MCMC posterior samples:")
mcmc_sampler.summary()

In [None]:
sec("Validating the MCMC Sampler Against the Analytical Solution")

fig, ax = plt.subplots()
# Plot the histogram of our MCMC samples
sns.histplot(posterior_samples, bins=50, stat='density', ax=ax, 
             label='MCMC Posterior Samples', color='steelblue', alpha=0.7)

# Overlay the true analytical posterior PDF
ax.plot(p_grid, posterior_pdf, label=f'True Posterior: Beta({alpha_post}, {beta_post})', 
        color='red', ls='--', lw=2.5)

ax.set_title('Figure 2: MCMC Simulation vs. Analytical Posterior', fontsize=16)
ax.set_xlabel('Probability of Success (p)')
ax.set_ylabel('Probability Density')
ax.legend()
plt.show()

note("The histogram of the samples generated by our from-scratch Metropolis-Hastings sampler closely matches the true posterior distribution, up to Monte Carlo error. This validates our implementation and demonstrates the power of MCMC.")

#### 3.3 MCMC Diagnostics: Has the Chain Converged?

A crucial step in any MCMC analysis is to diagnose whether the algorithm has worked correctly. The primary goal is to assess if the Markov chain has converged to its stationary distribution (the posterior) and is exploring it effectively.

1.  **Trace Plots:** A trace plot shows the value of a parameter at each iteration of the MCMC chain. A healthy trace plot should look like a "fat, hairy caterpillar," indicating that the chain is rapidly exploring the full posterior distribution. You should not see long-term trends or flat stretches where the chain stays in one place.

2.  **Autocorrelation Plots:** By design, MCMC samples are correlated—each new sample depends on the previous one. An autocorrelation plot shows the correlation of the samples with lagged versions of themselves. We want this correlation to die down quickly. High autocorrelation means the chain is moving inefficiently and we will need more samples to get a good representation of the posterior.
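
The cost of autocorrelation can be quantified. The sketch below computes sample autocorrelations and a deliberately crude effective sample size, $n / (1 + 2\sum_k \rho_k)$, truncating the sum at the first non-positive autocorrelation. The helper names and the simulated chains are hypothetical, and this is not the rank-normalized estimator that ArviZ reports. The only aim is to show how a "sticky" chain carries far less information than independent draws.

```python
import numpy as np

def autocorr(x, max_lag=100):
    """Sample autocorrelation of a 1-D chain at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([np.dot(centered[:len(x) - lag], centered[lag:]) / denom
                     for lag in range(max_lag + 1)])

def crude_ess(x, max_lag=100):
    """Rough effective sample size: n / (1 + 2 * sum of autocorrelations),
    with the sum truncated at the first non-positive autocorrelation."""
    rho = autocorr(x, max_lag)[1:]
    nonpositive = np.where(rho <= 0)[0]
    if nonpositive.size:
        rho = rho[:nonpositive[0]]
    return len(x) / (1.0 + 2.0 * rho.sum())

# Hypothetical chains: an AR(1) process stands in for a slowly mixing sampler
rng = np.random.default_rng(1)
n = 5_000
independent = rng.normal(size=n)
sticky = np.empty(n)
sticky[0] = 0.0
for t in range(1, n):
    sticky[t] = 0.9 * sticky[t - 1] + rng.normal()

print(f"ESS, independent draws:     {crude_ess(independent):.0f}")
print(f"ESS, AR(1) with coef. 0.9:  {crude_ess(sticky):.0f}")
```

Both chains have 5,000 draws, but the autocorrelated one should be worth only a few hundred independent draws, which is why a slowly decaying autocorrelation plot calls for a longer run or a better-tuned step size.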

In [None]:
sec("Visual MCMC Diagnostics")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Trace Plot
ax1.plot(posterior_samples)
ax1.set_title('Trace Plot for Parameter p')
ax1.set_xlabel('MCMC Iteration')
ax1.set_ylabel('Parameter Value')

# Autocorrelation Plot
pd.plotting.autocorrelation_plot(posterior_samples, ax=ax2)
ax2.set_title('Autocorrelation Plot for Parameter p')

plt.tight_layout()
plt.show()

note("The trace plot looks good: it's centered around the posterior mean and explores the space well. The autocorrelation plot shows that the correlation between samples drops quickly, suggesting our sampler is efficient.")

## 4. Bayesian Linear Regression with `PyMC`

`PyMC` is a powerful Python library for probabilistic programming. It allows us to specify Bayesian models by defining priors for our parameters and a likelihood for our data. `PyMC` then automatically assembles the model and uses an MCMC sampler (typically NUTS) to draw samples from the posterior distribution.

Let's implement a Bayesian version of the simple linear regression model:
$$ y_i = \alpha + \beta x_i + \epsilon_i, \quad \text{where} \quad \epsilon_i \sim N(0, \sigma^2) $$ 
In a Bayesian framework, we must place priors on all unknown parameters: $\alpha$, $\beta$, and $\sigma$.

In [None]:
sec("Bayesian Regression with PyMC and ArviZ")
if not PYMC_AVAILABLE:
    note("PyMC is not installed. Skipping this section.")
else:
    # 1. Generate some sample data
    rng = np.random.default_rng(42)
    N = 100
    true_alpha, true_beta, true_sigma = 1.0, 2.5, 0.5
    X = rng.uniform(0, 1, N)
    y = true_alpha + true_beta * X + rng.normal(0, true_sigma, size=N)

    # 2. Define the Bayesian model using PyMC's context manager
    with pm.Model() as linear_model:
        # --- Priors for unknown model parameters ---
        # We use weakly informative priors. Normal distributions for the coefficients,
        # centered at 0. A HalfNormal for sigma, as it must be positive.
        alpha = pm.Normal('alpha', mu=0, sigma=10)
        beta = pm.Normal('beta', mu=0, sigma=10)
        sigma = pm.HalfNormal('sigma', sigma=1)
        
        # --- Model for the expected value of the outcome ---
        mu = alpha + beta * X
        
        # --- Likelihood (the data-generating distribution) ---
        # The observed=y argument tells PyMC that this is the data we want to condition on.
        Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y)
        
        # 3. Use MCMC to draw samples from the posterior
        # 'draws' are the number of samples per chain after tuning.
        # 'tune' are the initial adaptation steps, during which the sampler tunes itself;
        # they are discarded by default (similar in spirit to burn-in).
        # 'chains' are the number of independent MCMC chains to run.
        idata = pm.sample(draws=2000, tune=1000, chains=4, progressbar=False, random_seed=42)

    # 4. Analyze and interpret the results using ArviZ
    note("The ArviZ library provides excellent tools for summarizing MCMC results and diagnosing convergence issues.")
    print("--- Posterior Summary ---")
    summary = az.summary(idata, var_names=['alpha', 'beta', 'sigma'], hdi_prob=0.95)
    display(summary.round(3))
    
    # --- MCMC Diagnostics: The Trace Plot ---
    # This plot shows the posterior distribution (left) and the sampling path (right) for each parameter.
    az.plot_trace(idata, var_names=['alpha', 'beta', 'sigma'])
    plt.suptitle('Figure 3: MCMC Trace Plots for Model Parameters', fontsize=16, y=1.02)
    plt.tight_layout()
    plt.show()

### Interpreting the Bayesian Output

The output of an MCMC estimation is the entire **posterior distribution** for each parameter, which represents our complete, updated knowledge. We summarize this distribution with a few key statistics:

- **mean:** The posterior mean, often used as the primary point estimate for the parameter.
- **sd:** The posterior standard deviation, a measure of our uncertainty about the parameter after seeing the data.
- **hdi_2.5% / hdi_97.5%:** The bounds of the 95% **Highest Density Interval (HDI)**, matching the `hdi_prob=0.95` argument passed to `az.summary` (ArviZ's default is 94%, which would be labeled `hdi_3%` / `hdi_97%`). This is the Bayesian **credible interval**. We can state that, conditional on our model and data, there is a 95% probability that the true parameter value lies within this interval. This is the direct, intuitive probabilistic statement that a Frequentist confidence interval is often mistakenly assumed to provide.
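
To see what a highest density interval is, it helps to compute one by hand from a set of draws. The sketch below finds the narrowest window containing a given share of sorted samples and compares it with the equal-tailed interval built from quantiles. The function name `hdi_from_samples` and the skewed Gamma draws are hypothetical stand-ins for posterior samples, and the approach assumes a unimodal posterior. In practice ArviZ computes these bounds for you.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.95):
    """Narrowest interval containing `prob` of the draws (assumes unimodality)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    n_in = int(np.ceil(prob * n))                # draws that must fall inside
    widths = x[n_in - 1:] - x[:n - n_in + 1]     # width of every candidate window
    start = int(np.argmin(widths))
    return x[start], x[start + n_in - 1]

# Hypothetical right-skewed "posterior" draws
rng = np.random.default_rng(2)
draws = rng.gamma(shape=2.0, scale=1.0, size=50_000)

hdi_low, hdi_high = hdi_from_samples(draws, prob=0.95)
eti_low, eti_high = np.quantile(draws, [0.025, 0.975])
print(f"95% HDI:                   [{hdi_low:.2f}, {hdi_high:.2f}]")
print(f"95% equal-tailed interval: [{eti_low:.2f}, {eti_high:.2f}]")
```

For a symmetric posterior the two intervals nearly coincide. For a skewed one, the HDI is shorter and shifts toward the region where the density is highest.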

The summary also provides crucial **diagnostic statistics**:

- **r_hat ($\hat{R}$):** The Gelman-Rubin diagnostic. It checks for convergence by comparing the variance between the MCMC chains to the variance within each chain. It should be very close to 1.0 (ideally < 1.01). A value greater than 1.1 indicates the chains have not converged to the same distribution, and the results should not be trusted.
- **ess_bulk / ess_tail:** Bulk and Tail Effective Sample Size. MCMC draws are autocorrelated, so they contain less information than the same number of independent draws. The ESS estimates how many independent samples our correlated MCMC draws are worth. Higher numbers are better, as low ESS can lead to noisy estimates of the posterior.
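
The logic behind $\hat{R}$ can be illustrated with the classic Gelman-Rubin formula. The sketch below is a simplified, non-split version applied to simulated chains. The function name `gelman_rubin` and the data are hypothetical, and ArviZ reports a more robust rank-normalized split variant, so the numbers will not match `az.summary` exactly.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)              # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # average within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)
# Hypothetical chains: four that agree, and a copy in which one chain sits elsewhere
mixed = rng.normal(0.0, 1.0, size=(4, 2_000))
stuck = mixed.copy()
stuck[0] += 3.0                                  # shift one chain away from the rest

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"R-hat, well-mixed chains: {rhat_mixed:.3f}")
print(f"R-hat, one chain shifted: {rhat_stuck:.3f}")
```

When all chains sample the same distribution, the between-chain variance adds almost nothing and the ratio stays near 1. A single chain exploring a different region inflates it well above the 1.1 warning threshold.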

## 5. Exercises

1.  **Prior Sensitivity Analysis:** The choice of prior can influence the posterior, especially with small datasets. Re-run the linear regression model from Section 4, but change the priors to be much more "informative" (narrower) and incorrect. For example, set `alpha = pm.Normal('alpha', mu=5, sigma=0.1)` and `beta = pm.Normal('beta', mu=-5, sigma=0.1)`. How do the posterior mean and credible intervals change compared to the original model? Now, increase the dataset size to `N=5000` and repeat the experiment with these incorrect, informative priors. What happens to the influence of the prior as the sample size grows? This demonstrates the concept of the data "overwhelming" the prior.

2.  **Posterior Predictive Checks:** A key part of the Bayesian workflow is checking if your fitted model generates data that looks like the data you actually observed. This is a **posterior predictive check**. Using the original fitted model (`idata`), use `pm.sample_posterior_predictive` to generate simulated datasets from your fitted model's posterior distribution. Then, use `arviz.plot_ppc` to plot the distribution of the observed `y` overlaid with the distributions of several of your simulated datasets (by default it draws kernel density estimates rather than histograms). Does the model capture the central tendency and spread of the data well?

3.  **Bayesian Logistic Regression:** Use `PyMC` to build a Bayesian logistic regression model. First, generate synthetic data for a binary outcome. For example, create an `X` variable, define a probability `p = pm.math.sigmoid(alpha + beta * X)`, and then use `pm.Bernoulli` as the likelihood function to model a binary `y`. Fit the model and interpret the posterior for `beta`. How would you explain the meaning of the credible interval for `beta` in this context?

4.  **Hierarchical Models:** One of the most powerful applications of the Bayesian framework is in building **hierarchical (or multilevel) models**. These are models where parameters are themselves drawn from a distribution, which is perfect for modeling data with a group structure (e.g., students within schools, or firms within industries). Research the concept of **partial pooling** in hierarchical models. Explain why it is often superior to either complete pooling (ignoring the group structure and fitting one model) or no pooling (running a separate regression for each group).