# 01 Probability Distributions

The building blocks of statistical inference: understanding, visualizing, and simulating the distributions you will encounter throughout this project.

## Table of Contents
- [What is a probability distribution?](#what-is-a-probability-distribution)
- [The Normal distribution](#the-normal-distribution)
- [The t-distribution](#the-t-distribution)
- [The Chi-Squared distribution](#the-chi-squared-distribution)
- [The F-distribution](#the-f-distribution)
- [Discrete distributions: Binomial and Poisson](#discrete-distributions)
- [Comparing distributions with real data](#comparing-distributions-with-real-data)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Every hypothesis test, confidence interval, and p-value in this project relies on an assumed
probability distribution. If you don't understand where these distributions come from and
what they look like, statistical inference becomes a black box. This notebook makes the
distributions concrete through simulation and visualization.

## Prerequisites (Quick Self-Check)
- Completed notebook 00 (descriptive statistics).
- Comfort with histograms and basic plotting.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can describe the shape and parameters of the normal, t, chi-squared, F, binomial, and Poisson distributions.
- You can simulate data from each and visualize PDFs/PMFs and CDFs.
- You can explain when each distribution arises in econometric practice.

## Common Pitfalls
- Assuming all data is normally distributed without checking.
- Confusing PDF (density, for continuous) with PMF (probability, for discrete).
- Forgetting that the t-distribution has heavier tails than the normal.
- Treating distribution parameters as fixed truths rather than estimated quantities.

## Quick Fixes (When You Get Stuck)
- If `scipy.stats` functions confuse you, remember: `.pdf(x)` for density, `.cdf(x)` for cumulative probability, `.rvs(size=n)` to simulate.
- If plots look empty, check your x-axis range.
- If you see `ModuleNotFoundError`, re-run the bootstrap cell.
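
As a quick reference for the first tip, here is a minimal sketch of the three calls on a standard normal. The choice of distribution and the variable names are illustrative assumptions, not part of the notebook's exercises.

```python
from scipy import stats

# A "frozen" distribution object: Normal with mean 0 and standard deviation 1.
dist = stats.norm(loc=0, scale=1)

density_at_zero = dist.pdf(0.0)            # height of the density curve at x = 0
prob_below_zero = dist.cdf(0.0)            # cumulative probability P(X <= 0)
draws = dist.rvs(size=5, random_state=0)   # five simulated values (seeded)

print(f'pdf(0) = {density_at_zero:.4f}')
print(f'cdf(0) = {prob_below_zero:.4f}')
print('draws  =', draws.round(3))
```

The same three methods exist on the other distributions used here (`stats.t`, `stats.chi2`, `stats.f`); discrete ones such as `stats.binom` use `.pmf(k)` in place of `.pdf(x)`.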

## Matching Guide
- `docs/guides/00_statistics_primer/01_probability_distributions.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00_statistics_primer/01_probability_distributions.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

### Common imports for this notebook

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 4)
plt.rcParams['figure.dpi'] = 100

---
<a id="what-is-a-probability-distribution"></a>
## What is a probability distribution?

### Goal
Build a mental model for what a distribution is, and distinguish the key vocabulary: PMF vs PDF, CDF, discrete vs continuous, and parameters.

### Why this matters in economics
Every econometric test implicitly assumes a probability distribution for the data or for a test statistic. When you run an OLS regression and look at p-values, you are assuming that the test statistic follows a t-distribution (or approximately a normal). If the assumption is wrong, the p-values lie.

### Primer

A **probability distribution** is a recipe for generating random data. Think of it as a machine: you turn the crank, and it spits out a number. The distribution tells you *which numbers are likely* and *which are rare*.

**Discrete vs Continuous**
- **Discrete**: the random variable can only take specific values (e.g., counts: 0, 1, 2, ...). Described by a **PMF** (probability mass function): $P(X = k)$.
- **Continuous**: the random variable can take any value in an interval (e.g., GDP growth: 2.31%, -0.47%). Described by a **PDF** (probability density function): $f(x)$. Note that $P(X = x) = 0$ for any single point; probabilities are areas under the curve.

**CDF (Cumulative Distribution Function)**
- Works for both discrete and continuous: $F(x) = P(X \le x)$.
- Always non-decreasing, from 0 to 1.

**Parameters**
- Most distributions are families indexed by parameters. The normal has $\mu$ (mean) and $\sigma$ (standard deviation). The t-distribution has $\nu$ (degrees of freedom). Changing parameters changes the shape.

**Intuition**: A distribution is not a single dataset. It is the *process* that could have generated many datasets. When we fit a distribution to data, we are guessing which machine produced our observations.
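
The vocabulary above can be checked numerically. The sketch below is one way to do it, assuming a Binomial(10, 0.3) and a standard normal as examples: PMF values sum to 1, a PDF integrates to 1, and the probability of an interval shrinks toward 0 as the interval collapses to a point.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete case: PMF values over every possible outcome sum to 1.
k = np.arange(0, 11)
pmf_total = stats.binom.pmf(k, n=10, p=0.3).sum()

# Continuous case: the PDF integrates to 1 over the whole real line.
pdf_area, _ = quad(stats.norm.pdf, -np.inf, np.inf)

# Interval probabilities are CDF differences; shrinking the interval
# toward a single point drives the probability toward 0.
widths = [0.5, 0.05, 0.005]
interval_probs = [stats.norm.cdf(w) - stats.norm.cdf(0.0) for w in widths]

print(f'Sum of Binomial PMF:   {pmf_total:.6f}')
print(f'Area under Normal PDF: {pdf_area:.6f}')
for w, pr in zip(widths, interval_probs):
    print(f'P(0 <= X <= {w}): {pr:.5f}')
```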

### Quick visual: PMF vs PDF vs CDF

The following cell plots a discrete PMF (Binomial), a continuous PDF (Normal), and the CDFs of both, side by side, so you can see the differences.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# --- PMF: Binomial(n=10, p=0.3) ---
k = np.arange(0, 11)
pmf_vals = stats.binom.pmf(k, n=10, p=0.3)
axes[0].bar(k, pmf_vals, color='steelblue', edgecolor='black', alpha=0.8)
axes[0].set_title('PMF: Binomial(n=10, p=0.3)')
axes[0].set_xlabel('k')
axes[0].set_ylabel('P(X = k)')

# --- PDF: Normal(mu=0, sigma=1) ---
x = np.linspace(-4, 4, 200)
axes[1].plot(x, stats.norm.pdf(x), color='darkred', lw=2)
axes[1].fill_between(x, stats.norm.pdf(x), alpha=0.15, color='darkred')
axes[1].set_title('PDF: Normal(0, 1)')
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')

# --- CDF comparison ---
# where='post': a discrete CDF jumps at each integer k and stays flat until k + 1
axes[2].step(k, stats.binom.cdf(k, n=10, p=0.3), where='post',
             label='Binomial CDF', color='steelblue', lw=2)
axes[2].plot(x, stats.norm.cdf(x), label='Normal CDF', color='darkred', lw=2)
axes[2].set_title('CDF comparison')
axes[2].set_xlabel('x')
axes[2].set_ylabel('F(x)')
axes[2].legend()

plt.tight_layout()
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. What is the key visual difference between a PMF (left) and a PDF (center)?
2. For the CDF plot (right), how do you read off $P(X \le 2)$ for the Binomial? For the Normal?

*Your answer:*

(write here)

---
<a id="the-normal-distribution"></a>
## The Normal (Gaussian) Distribution

### Goal
Simulate normal data, visualize the PDF, and learn to check whether real economic data is approximately normal.

### Why this matters in economics
The Central Limit Theorem (CLT) says that, under mild conditions such as finite variance, sample means become approximately normal as $n$ grows, which justifies many inference procedures. But the underlying data itself is often *not* normal (e.g., income distributions are right-skewed, asset returns have fat tails). Knowing when normality holds and when it breaks is essential.

### Primer

The **Normal distribution** $N(\mu, \sigma^2)$ has two parameters:
- $\mu$: the mean (center of the bell curve)
- $\sigma$: the standard deviation (spread)

Its PDF is:
$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

Key properties:
- Symmetric around $\mu$.
- About 68% of the probability mass lies within $\mu \pm \sigma$, and about 95% within $\mu \pm 2\sigma$.
- Fully characterized by its first two moments (mean and variance).
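
The coverage figures above follow directly from the CDF. Here is a small sketch, assuming the same $N(\mu = 2, \sigma = 1.5)$ used in the exercise below; the `coverage` dictionary is an illustrative name.

```python
from scipy import stats

mu, sigma = 2.0, 1.5
dist = stats.norm(loc=mu, scale=sigma)

# Probability mass within m standard deviations of the mean.
coverage = {}
for m in (1, 2, 3):
    coverage[m] = dist.cdf(mu + m * sigma) - dist.cdf(mu - m * sigma)
    print(f'P(|X - mu| <= {m} sigma) = {coverage[m]:.4f}')

# The exact multiplier that gives 95% coverage is slightly below 2.
z_95 = stats.norm.ppf(0.975)
print(f'Exact 95% multiplier: {z_95:.3f}')
```

The result does not depend on the particular $\mu$ and $\sigma$: any normal distribution gives the same coverage when distances are measured in standard deviations.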

### Your Turn (1): Simulate and visualize Normal data

In [None]:
rng = np.random.default_rng(42)

# TODO: Simulate 5000 draws from N(mu=2, sigma=1.5)
mu, sigma = 2, 1.5
normal_data = ...  # rng.normal(...)

# TODO: Plot a histogram (density=True) and overlay the theoretical PDF
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)

fig, ax = plt.subplots()
# Histogram
ax.hist(...)  # TODO: fill in arguments (bins=50, density=True, alpha=0.5, label='Simulated data')
# PDF overlay
ax.plot(...)  # TODO: use stats.norm.pdf(x, loc=mu, scale=sigma), with label='Theoretical PDF'
ax.set_title('Normal Distribution: Simulated vs Theoretical')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()  # picks up the label= arguments, so labels cannot be assigned to the wrong artist
plt.show()

### Your Turn (2): Is GDP growth normally distributed?

Load the sample macro data and check whether quarterly GDP growth looks normal.

In [None]:
import pandas as pd
df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)

gdp_growth = df['gdp_growth_qoq'].dropna()

# TODO: Plot a histogram of gdp_growth with a fitted normal overlay
mu_hat = ...  # sample mean
sigma_hat = ...  # sample std

fig, ax = plt.subplots()
ax.hist(gdp_growth, bins=20, density=True, alpha=0.5, color='steelblue',
        edgecolor='black', label='GDP growth (QoQ)')

x_range = np.linspace(gdp_growth.min() - 1, gdp_growth.max() + 1, 200)
ax.plot(...)  # TODO: overlay fitted normal PDF

ax.set_title('GDP Growth (QoQ) vs Fitted Normal')
ax.set_xlabel('Growth rate')
ax.set_ylabel('Density')
ax.legend()
plt.show()

print(f'Mean: {mu_hat:.4f}, Std: {sigma_hat:.4f}')
print(f'Skewness: {gdp_growth.skew():.4f}')
print(f'Excess kurtosis: {gdp_growth.kurtosis():.4f}')  # pandas reports excess kurtosis (normal = 0)

### Your Turn (3): QQ Plot

A **QQ (quantile-quantile) plot** compares the quantiles of your data to the quantiles of a theoretical distribution. If the data is normal, the points fall approximately on a straight line; systematic curvature at the ends signals skewness or heavy tails.

In [None]:
# TODO: Create a QQ plot of GDP growth against the normal distribution
fig, ax = plt.subplots()
stats.probplot(...)  # TODO: pass gdp_growth and plot=ax
ax.set_title('QQ Plot: GDP Growth vs Normal')
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. Does GDP growth appear normally distributed? What does the histogram tell you? What does the QQ plot tell you?
2. If the data has excess kurtosis (fat tails), what does that mean for inference based on the normal assumption?

*Your answer:*

(write here)

---
<a id="the-t-distribution"></a>
## The t-Distribution

### Goal
Understand why we use the t-distribution instead of the normal for small-sample inference, and see how it converges to the normal as degrees of freedom increase.

### Why this matters in economics
Every coefficient t-test in a regression uses the t-distribution. With macroeconomic data you often have small samples (e.g., 80 quarters of GDP data). Using the normal when you should use the t-distribution understates uncertainty and produces overconfident p-values.

### Primer

The **t-distribution** with $\nu$ degrees of freedom arises when you estimate the mean of a normal population but also have to estimate the variance from the same sample:

$$
t = \frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1}
$$

Key properties:
- Looks like a normal but with **heavier tails** (more probability in the extremes).
- One parameter: $\nu$ (degrees of freedom).
- As $\nu \to \infty$, the t-distribution converges to the standard normal.
- For $\nu > 30$, the difference is small but still matters for precise inference.
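
To put numbers on "heavier tails", the sketch below compares the two-sided 95% critical value of the t-distribution with the normal's 1.96 for a few illustrative degrees of freedom. The chosen values of `nu` are assumptions for demonstration.

```python
from scipy import stats

# Two-sided 95% interval uses the 97.5th percentile.
z_crit = stats.norm.ppf(0.975)
print(f'Normal:      {z_crit:.4f}')

t_crits = {}
for nu in (5, 30, 1000):
    t_crits[nu] = stats.t.ppf(0.975, df=nu)
    ratio = t_crits[nu] / z_crit  # how much wider the t-based interval is
    print(f't(df={nu:4d}): {t_crits[nu]:.4f}  (ratio to normal: {ratio:.3f})')
```

The ratio is the factor by which a t-based confidence interval is wider than a normal-based one with the same standard error.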

### Your Turn (1): Compare t-distributions with different df to the Normal

In [None]:
x = np.linspace(-5, 5, 300)

fig, ax = plt.subplots(figsize=(9, 5))

# TODO: Plot the standard normal PDF
ax.plot(x, stats.norm.pdf(x), 'k-', lw=2, label='Normal(0,1)')

# TODO: Overlay t-distributions with df = 2, 5, 10, 30
for df in [2, 5, 10, 30]:
    ax.plot(...)  # TODO: stats.t.pdf(x, df=df), with label

ax.set_title('t-Distributions vs Standard Normal')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
plt.show()

### Your Turn (2): Tail probabilities

How much more probability mass is in the tails of a t-distribution compared to the normal?

In [None]:
# TODO: Compute P(|X| > 2) for the normal and for t with df = 5, 10, 30
# Hint: P(|X| > 2) = 2 * (1 - CDF(2))

print('P(|X| > 2):')
print(f'  Normal:    {2 * (1 - stats.norm.cdf(2)):.4f}')

for df in [5, 10, 30]:
    p = ...  # TODO: compute for t-distribution
    print(f'  t(df={df:2d}):  {p:.4f}')

### Your Turn (3): Simulate convergence to Normal

Draw many samples from a t-distribution with increasing df and compare histograms.

In [None]:
rng = np.random.default_rng(42)
n_samples = 10_000

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
x_range = np.linspace(-5, 5, 200)

for ax, df in zip(axes, [3, 10, 30, 100]):
    # TODO: Simulate n_samples draws from t(df) and plot histogram
    samples = ...  # stats.t.rvs(df=df, size=n_samples, random_state=rng)
    ax.hist(samples, bins=60, density=True, alpha=0.5, color='steelblue')
    # Overlay normal PDF for comparison
    ax.plot(x_range, stats.norm.pdf(x_range), 'r-', lw=1.5, label='Normal')
    ax.set_title(f't(df={df})')
    ax.set_xlim(-5, 5)
    ax.legend(fontsize=8)

plt.suptitle('t-Distribution converges to Normal as df increases', y=1.02)
plt.tight_layout()
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. At what df does the t-distribution become nearly indistinguishable from the normal?
2. If you have 20 quarterly observations and estimate 3 slope coefficients plus an intercept (df = 20 − 4 = 16), how much wider would a 95% confidence interval be using the t-distribution compared to the normal?

*Your answer:*

(write here)

---
<a id="the-chi-squared-distribution"></a>
## The Chi-Squared Distribution

### Goal
Build intuition for the $\chi^2$ distribution by constructing it from squared standard normals, and connect it to econometric tests.

### Why this matters in economics
The $\chi^2$ distribution appears in:
- **Breusch-Pagan test** for heteroskedasticity (non-constant error variance).
- **Likelihood ratio tests** for nested model comparisons.
- **Goodness-of-fit tests** for distributional assumptions.

If you don't know what a $\chi^2$ distribution looks like, these test statistics are just numbers without intuition.

### Primer

If $Z_1, Z_2, \ldots, Z_k$ are independent standard normals ($Z_i \sim N(0,1)$), then:

$$
Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2 \sim \chi^2_k
$$

Key properties:
- One parameter: $k$ (degrees of freedom).
- Always non-negative (it is a sum of squares).
- Mean = $k$, Variance = $2k$.
- Right-skewed for small $k$; becomes more symmetric as $k$ grows.
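
The moment formulas above can be confirmed without simulation. This sketch, with illustrative values of $k$, queries the theoretical mean, variance, and skewness from `scipy.stats` and also reports the 95th percentile, the value a test statistic must exceed to reject at the 5% level.

```python
from scipy import stats

summary = {}
for k in (1, 5, 10):
    mean, var, skew = stats.chi2.stats(df=k, moments='mvs')
    crit_95 = stats.chi2.ppf(0.95, df=k)  # right-tail 5% critical value
    summary[k] = (float(mean), float(var), float(skew), float(crit_95))
    print(f'k={k:2d}: mean={float(mean):.1f}, var={float(var):.1f}, '
          f'skew={float(skew):.2f}, 95th pct={float(crit_95):.3f}')
```

The skewness falls as $k$ grows, which is the numerical counterpart of "becomes more symmetric".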

### Your Turn (1): Build $\chi^2$ from squared normals

In [None]:
rng = np.random.default_rng(42)
n_sims = 50_000
k = 5  # degrees of freedom

# TODO: Simulate k independent standard normals and sum their squares
# Step 1: Generate a (n_sims, k) matrix of standard normals
Z = ...  # rng.normal(size=(n_sims, k))

# Step 2: Square each element and sum across columns
Q = ...  # (Z ** 2).sum(axis=1)

# Plot the simulated distribution against the theoretical chi-squared PDF
fig, ax = plt.subplots()
ax.hist(Q, bins=80, density=True, alpha=0.5, color='seagreen',
        edgecolor='black', label=f'Simulated sum of {k} squared normals')

x = np.linspace(0, 25, 200)
ax.plot(x, stats.chi2.pdf(x, df=k), 'k-', lw=2,
        label=f'$\\chi^2$({k}) PDF')

ax.set_title(f'Chi-Squared from Squared Normals (k={k})')
ax.set_xlabel('Q')
ax.set_ylabel('Density')
ax.legend()
plt.show()

print(f'Simulated mean: {Q.mean():.2f} (theoretical: {k})')
print(f'Simulated var:  {Q.var():.2f} (theoretical: {2 * k})')

### Your Turn (2): Different degrees of freedom

In [None]:
x = np.linspace(0, 30, 300)

fig, ax = plt.subplots(figsize=(9, 5))

# TODO: Plot chi-squared PDFs for df = 1, 2, 5, 10, 15
for df in [1, 2, 5, 10, 15]:
    ax.plot(...)  # TODO: stats.chi2.pdf(x, df=df), with label

ax.set_title('Chi-Squared Distribution for Different df')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_ylim(0, 0.5)
ax.legend()
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. How does the shape of the $\chi^2$ distribution change as df increases?
2. In a Breusch-Pagan test with 3 regressors, the test statistic is asymptotically $\chi^2(3)$ under the null of homoskedasticity. Looking at the plot, what values would you consider "extreme" (in the right tail)?

*Your answer:*

(write here)

---
<a id="the-f-distribution"></a>
## The F-Distribution

### Goal
Understand the F-distribution as a ratio of chi-squared variables and connect it to joint significance tests in regression.

### Why this matters in economics
The F-distribution appears whenever you test whether a *group* of coefficients are jointly zero:
- The overall F-test in `statsmodels` output: "Are all slope coefficients jointly zero?"
- Comparing nested models: "Does adding these 3 variables significantly improve fit?"
- ANOVA tables.

Understanding the F-distribution helps you interpret these tests rather than blindly trusting a p-value.

### Primer

If $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent, then:

$$
F = \frac{U / d_1}{V / d_2} \sim F_{d_1, d_2}
$$

Key properties:
- Two parameters: $d_1$ (numerator df) and $d_2$ (denominator df).
- Always non-negative.
- Right-skewed; becomes more symmetric as both df grow.
- When $d_1 = 1$, the F-test is equivalent to a two-sided t-test (since $F = t^2$).
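
The last property can be verified through critical values: the 5% critical value of $F_{1, d_2}$ should equal the square of the two-sided 5% critical value of $t_{d_2}$. A short sketch, assuming $d_2 = 20$ as an example:

```python
from scipy import stats

d2 = 20  # denominator degrees of freedom (illustrative choice)

f_crit = stats.f.ppf(0.95, dfn=1, dfd=d2)   # one-tailed: F rejects only in the right tail
t_crit = stats.t.ppf(0.975, df=d2)          # two-tailed: 2.5% in each tail

print(f'F(1, {d2}) 95th percentile: {f_crit:.4f}')
print(f't({d2}) 97.5th pct squared: {t_crit ** 2:.4f}')
```

Squaring folds both tails of the t-distribution into the single right tail of the F, which is why the 0.975 quantile of t pairs with the 0.95 quantile of F.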

### Your Turn (1): Simulate the F-distribution from chi-squared ratios

In [None]:
rng = np.random.default_rng(42)
n_sims = 50_000
d1, d2 = 3, 20  # numerator df, denominator df

# TODO: Simulate chi-squared variables and compute their ratio
U = ...  # stats.chi2.rvs(df=d1, size=n_sims, random_state=rng)
V = ...  # stats.chi2.rvs(df=d2, size=n_sims, random_state=...)

F_sim = ...  # (U / d1) / (V / d2)

# Plot
fig, ax = plt.subplots()
ax.hist(F_sim, bins=100, density=True, alpha=0.5, color='goldenrod',
        edgecolor='black', label='Simulated F', range=(0, 8))

x = np.linspace(0, 8, 300)
ax.plot(x, stats.f.pdf(x, dfn=d1, dfd=d2), 'k-', lw=2,
        label=f'F({d1},{d2}) PDF')

ax.set_title(f'F-Distribution from Chi-Squared Ratio (d1={d1}, d2={d2})')
ax.set_xlabel('F')
ax.set_ylabel('Density')
ax.legend()
plt.show()

### Your Turn (2): F-distributions with different parameters

In [None]:
x = np.linspace(0.01, 5, 300)

fig, ax = plt.subplots(figsize=(9, 5))

# TODO: Plot F-distribution PDFs for different (d1, d2) combinations
params = [(1, 10), (3, 20), (5, 50), (10, 100)]
for d1, d2 in params:
    ax.plot(...)  # TODO: stats.f.pdf(x, dfn=d1, dfd=d2), with label

ax.set_title('F-Distributions with Different Parameters')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
plt.show()

### Your Turn (3): Critical values

In regression output, you reject the null if the F-statistic exceeds a critical value. Find the 95th percentile (critical value at $\alpha = 0.05$) for different F-distributions.

In [None]:
# TODO: Compute the 95th percentile of F(d1, d2) for several (d1, d2) pairs
print('F critical values (alpha = 0.05):')
for d1, d2 in [(1, 50), (3, 50), (5, 50), (3, 20), (3, 100)]:
    cv = ...  # stats.f.ppf(0.95, dfn=d1, dfd=d2)
    print(f'  F({d1:2d}, {d2:3d}): {cv:.3f}')

### Interpretation

**Write 2–4 sentences:**
1. How does increasing the denominator df (more observations) affect the critical value?
2. Does the critical value rise or fall when you test more restrictions (larger $d_1$, with $d_2$ held fixed)? Explain the pattern using the fact that the numerator $U / d_1$ averages $d_1$ squared normals.

*Your answer:*

(write here)

---
<a id="discrete-distributions"></a>
## Discrete Distributions: Binomial and Poisson

### Goal
Understand two important discrete distributions and connect them to economic counting problems.

### Why this matters in economics
- **Binomial**: How many quarters in a decade are recessions? If each quarter has probability $p$ of being a recession, the count follows $\text{Binomial}(n, p)$.
- **Poisson**: How many Federal Reserve rate changes occur per year? How many financial crises per decade? Counts of events in a fixed interval are often modeled with a Poisson distribution.

These distributions underpin the binary-outcome and count-data regression models (logistic regression, Poisson regression) that appear later in the project.

### Primer

**Binomial$(n, p)$**:
- $n$ independent trials, each with success probability $p$.
- $X$ = number of successes.
- PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
- Mean = $np$, Variance = $np(1-p)$.

**Poisson$(\lambda)$**:
- Counts events in a fixed interval when events occur independently at rate $\lambda$.
- PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
- Mean = $\lambda$, Variance = $\lambda$ (mean equals variance).
- The Poisson approximates $\text{Binomial}(n, p)$ when $n$ is large and $p$ is small, with $\lambda = np$.
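
To connect the PMF formulas to the library calls, the sketch below evaluates both formulas by hand and compares them with `scipy.stats`. The parameter values are illustrative and match the exercises that follow.

```python
import math
from scipy import stats

# Binomial: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
n, p, k = 40, 0.15, 6
binom_manual = math.comb(n, k) * p**k * (1 - p) ** (n - k)
binom_scipy = stats.binom.pmf(k, n=n, p=p)

# Poisson: P(X = k) = lam^k * exp(-lam) / k!
lam, j = 4, 2
pois_manual = lam**j * math.exp(-lam) / math.factorial(j)
pois_scipy = stats.poisson.pmf(j, mu=lam)

print(f'Binomial P(X={k}): manual={binom_manual:.6f}, scipy={binom_scipy:.6f}')
print(f'Poisson  P(X={j}): manual={pois_manual:.6f}, scipy={pois_scipy:.6f}')
```

Note that `scipy.stats.poisson` calls its rate parameter `mu` rather than `lambda`, since `lambda` is a reserved word in Python.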

### Your Turn (1): Binomial -- recession quarters in a decade

In [None]:
# A decade has 40 quarters. Historically, about 15% of quarters are recessions.
n_quarters = 40
p_recession = 0.15

rng = np.random.default_rng(42)

# TODO: Simulate 10,000 decades and count recession quarters in each
sim_counts = ...  # rng.binomial(n=n_quarters, p=p_recession, size=10_000)

# TODO: Plot the PMF (theoretical) and the simulated histogram
k = np.arange(0, 20)
pmf_vals = ...  # stats.binom.pmf(k, n=n_quarters, p=p_recession)

fig, ax = plt.subplots()
ax.bar(k - 0.2, pmf_vals, width=0.4, color='steelblue', alpha=0.8,
       edgecolor='black', label='Theoretical PMF')

# Simulated (normalized to compare with PMF)
counts, edges = np.histogram(sim_counts, bins=np.arange(-0.5, 20.5, 1), density=True)
ax.bar(np.arange(0, 20) + 0.2, counts, width=0.4, color='coral', alpha=0.8,
       edgecolor='black', label='Simulated')

ax.set_title(f'Recession Quarters per Decade: Binomial({n_quarters}, {p_recession})')
ax.set_xlabel('Number of recession quarters')
ax.set_ylabel('Probability')
ax.legend()
plt.show()

print(f'Expected: {n_quarters * p_recession:.1f} recession quarters per decade')
print(f'Simulated mean: {sim_counts.mean():.2f}')

### Your Turn (2): Poisson -- rate changes per year

In [None]:
# Suppose the Fed changes rates on average 4 times per year.
lam = 4

rng = np.random.default_rng(42)

# TODO: Simulate 10,000 years of rate-change counts
sim_poisson = ...  # rng.poisson(lam=lam, size=10_000)

# TODO: Plot PMF and simulated histogram
k = np.arange(0, 15)
pmf_vals = ...  # stats.poisson.pmf(k, mu=lam)

fig, ax = plt.subplots()
ax.bar(k - 0.2, pmf_vals, width=0.4, color='steelblue', alpha=0.8,
       edgecolor='black', label='Theoretical PMF')

counts, _ = np.histogram(sim_poisson, bins=np.arange(-0.5, 15.5, 1), density=True)
ax.bar(np.arange(0, 15) + 0.2, counts[:15], width=0.4, color='coral', alpha=0.8,
       edgecolor='black', label='Simulated')

ax.set_title(f'Fed Rate Changes per Year: Poisson($\\lambda$={lam})')
ax.set_xlabel('Number of rate changes')
ax.set_ylabel('Probability')
ax.legend()
plt.show()

print(f'Mean = Variance = {lam}')
print(f'Simulated mean: {sim_poisson.mean():.2f}, variance: {sim_poisson.var():.2f}')

### Your Turn (3): Poisson as limit of Binomial

Show that $\text{Binomial}(n, p)$ with $n$ large and $p$ small approaches $\text{Poisson}(\lambda = np)$.

In [None]:
lam = 3
k = np.arange(0, 15)

fig, ax = plt.subplots()

# TODO: Plot Poisson(3) PMF
ax.plot(k, stats.poisson.pmf(k, mu=lam), 'ko-', ms=6, lw=2,
        label=f'Poisson({lam})', zorder=5)

# TODO: Overlay Binomial(n, p) with n = 10, 30, 100, 500 (keeping np = 3)
for n_val in [10, 30, 100, 500]:
    p_val = ...  # lam / n_val
    ax.plot(...)  # TODO: stats.binom.pmf(k, n=n_val, p=p_val)

ax.set_title('Binomial approaches Poisson as n grows (np = 3)')
ax.set_xlabel('k')
ax.set_ylabel('P(X = k)')
ax.legend(fontsize=9)
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. In the Binomial recession example, what is the probability of having zero recession quarters in a decade? Is that surprising?
2. For the Poisson, the mean equals the variance. If you observed that the variance of rate changes was much larger than the mean, what would that suggest about the Poisson assumption?

*Your answer:*

(write here)

---
<a id="comparing-distributions-with-real-data"></a>
## Comparing Distributions with Real Data

### Goal
Practice fitting distributions to real economic data and evaluating goodness of fit.

### Why this matters in economics
In practice, you rarely know the true distribution. You observe data and ask: "Which distribution does this most resemble?" This skill is essential for choosing the right model and understanding when standard assumptions (like normality) are violated.
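
Beyond visual comparison, one numerical way to rank candidate fits is the maximized log-likelihood. The sketch below is an assumption-laden illustration: it uses simulated heavy-tailed data (`fake_series`, a hypothetical stand-in for a real column) rather than the project dataset, and it ignores the extra parameter of the t-distribution, which a criterion such as AIC would penalize.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for a real series: heavy-tailed draws from t(3).
fake_series = stats.t.rvs(df=3, size=2000, random_state=rng)

# Maximum-likelihood fits of both candidate families.
mu_fit, sigma_fit = stats.norm.fit(fake_series)
df_fit, loc_fit, scale_fit = stats.t.fit(fake_series)

# Total log-likelihood of the data under each fitted distribution.
ll_norm = stats.norm.logpdf(fake_series, loc=mu_fit, scale=sigma_fit).sum()
ll_t = stats.t.logpdf(fake_series, df=df_fit, loc=loc_fit, scale=scale_fit).sum()

print(f'Normal log-likelihood: {ll_norm:.1f}')
print(f't log-likelihood:      {ll_t:.1f}  (fitted df = {df_fit:.1f})')
```

A higher log-likelihood means the fitted density assigns more probability to the observed values; for heavy-tailed data the t fit should come out ahead.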

### Your Turn (1): Load data and explore candidate columns

In [None]:
import pandas as pd
df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)

# Preview the data
print('Columns:', list(df.columns))
print(f'Shape: {df.shape}')
df.head()

### Your Turn (2): Fit and compare distributions on a chosen variable

Pick a column (e.g., `UNRATE`, `FEDFUNDS`, `gdp_growth_qoq`, or `gdp_growth_yoy`) and overlay fitted normal and t-distribution PDFs on its histogram.

In [None]:
# TODO: Choose a column to analyze
col = ...  # e.g., 'gdp_growth_yoy'
series = df[col].dropna()

# TODO: Fit a normal distribution to the data
mu_fit, sigma_fit = ...  # stats.norm.fit(series)

# TODO: Fit a t-distribution to the data
df_fit, loc_fit, scale_fit = ...  # stats.t.fit(series)

# Plot
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(series, bins=25, density=True, alpha=0.5, color='steelblue',
        edgecolor='black', label=f'{col} (data)')

x = np.linspace(series.min() - 1, series.max() + 1, 200)

# TODO: Overlay normal fit
ax.plot(...)  # stats.norm.pdf(x, loc=mu_fit, scale=sigma_fit)

# TODO: Overlay t-distribution fit
ax.plot(...)  # stats.t.pdf(x, df=df_fit, loc=loc_fit, scale=scale_fit)

ax.set_title(f'Distribution Fits: {col}')
ax.set_xlabel(col)
ax.set_ylabel('Density')
ax.legend()
plt.show()

print(f'Normal fit: mu={mu_fit:.3f}, sigma={sigma_fit:.3f}')
print(f't fit: df={df_fit:.1f}, loc={loc_fit:.3f}, scale={scale_fit:.3f}')

### Your Turn (3): QQ plots for both fits

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# TODO: QQ plot against normal
stats.probplot(...)  # series, dist='norm', plot=axes[0]
axes[0].set_title(f'{col}: QQ vs Normal')

# TODO: QQ plot against t-distribution (use fitted df)
stats.probplot(...)  # series, dist=stats.t, sparams=(df_fit,), plot=axes[1]
axes[1].set_title(f'{col}: QQ vs t(df={df_fit:.0f})')

plt.tight_layout()
plt.show()

### Interpretation

**Write 2–4 sentences:**
1. Which distribution (normal or t) fits the data better? How can you tell from the histogram and QQ plots?
2. If neither fits well, what features of the data (skewness, multiple modes, outliers) might explain the poor fit?

*Your answer:*

(write here)

---
## Where This Shows Up Later

The distributions you studied in this notebook are not abstract -- they appear throughout the rest of the project:

| Distribution | Where it appears | Notebook |
|---|---|---|
| **Normal** | CLT justification for large-sample inference; assumed error distribution in OLS | Throughout `02_regression/` |
| **t-distribution** | Coefficient t-tests, confidence intervals for individual parameters | `02_regression/04_inference_time_series_hac` |
| **Chi-squared** | Breusch-Pagan test for heteroskedasticity, White test, likelihood ratio tests | `02_regression/04a_residual_diagnostics` |
| **F-distribution** | Overall regression F-test, nested model comparisons, joint significance | F-tests throughout `02_regression/` |
| **Binomial** | Recession classification (binary outcomes) | `03_classification/01_logistic_recession_classifier` |
| **Poisson** | Count-data modeling (if extended) | Optional extensions |

---
<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)

Run the cell below to verify you completed the key exercises. Then answer the questions.

In [None]:
# Quick sanity checks -- these should pass if you completed the notebook.
# Uncomment and adapt as needed.

# assert 'normal_data' in dir(), 'Did you simulate normal data in the Normal section?'
# assert len(normal_data) == 5000, 'Expected 5000 draws.'
# assert 'Q' in dir(), 'Did you simulate chi-squared in the Chi-Squared section?'
# assert 3 < Q.mean() < 7, 'Chi-squared mean should be near k=5.'
# assert 'sim_counts' in dir(), 'Did you simulate Binomial in the discrete distributions section?'
# assert 'sim_poisson' in dir(), 'Did you simulate Poisson in the discrete distributions section?'

print('Checkpoint cell ran. Uncomment the assertions above to actually test your work.')

# TODO: Write 2–3 sentences:
# - Can you name all six distributions covered and one econometric context for each?
# - Which distribution surprised you the most and why?

---
## Extensions (Optional)
- **Shapiro-Wilk test**: Use `stats.shapiro()` to formally test normality of GDP growth. Compare the p-value with your visual assessment from the QQ plot.
- **Log-normal distribution**: Many economic variables (income, firm size, stock prices) follow a log-normal rather than a normal. Try `stats.lognorm` on unemployment or CPI levels.
- **Kolmogorov-Smirnov test**: Use `stats.kstest()` to compare your data against a fitted distribution. This generalizes beyond normality.
- **Mixture distributions**: Real macro data often looks like it comes from two regimes (expansion vs recession). Simulate a mixture of two normals and see how the histogram looks.
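
As a starting point for the first and third extensions, here is a minimal sketch of both tests on simulated data. The two samples are illustrative assumptions, not the GDP series. One caveat to keep in mind: when the normal's parameters are estimated from the same sample, the standard Kolmogorov-Smirnov p-value is no longer exact and tends to be too large, so treat it as a rough guide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(loc=0.5, scale=1.0, size=500),   # should look normal
    'skewed': rng.exponential(scale=1.0, size=500),        # clearly non-normal
}

results = {}
for name, sample in samples.items():
    # Shapiro-Wilk: null hypothesis is that the sample is normal.
    sw_stat, sw_p = stats.shapiro(sample)
    # KS test against a normal with parameters estimated from the sample.
    mu_hat, sigma_hat = sample.mean(), sample.std(ddof=1)
    ks_stat, ks_p = stats.kstest(sample, 'norm', args=(mu_hat, sigma_hat))
    results[name] = {'shapiro_p': sw_p, 'ks_p': ks_p}
    print(f'{name:7s} Shapiro-Wilk p={sw_p:.4f}   KS p={ks_p:.4f}')
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it at this sample size.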

---
## Reflection

**Write 2–4 sentences for each:**

1. Which distribution assumption do you think is most commonly violated in macroeconomic data, and what are the consequences for inference?

2. If you had to explain to a non-technical colleague why we use the t-distribution instead of the normal for small samples, how would you phrase it?

3. What did you learn in this notebook that changes how you will interpret regression output going forward?

*Your answers:*

(write here)

---
<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: The Normal distribution</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# --- Your Turn (1): Simulate and visualize Normal data ---
rng = np.random.default_rng(42)
mu, sigma = 2, 1.5
normal_data = rng.normal(loc=mu, scale=sigma, size=5000)

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)

fig, ax = plt.subplots()
ax.hist(normal_data, bins=50, density=True, alpha=0.5, color='steelblue',
        edgecolor='black', label='Simulated data')
ax.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma), 'darkred', lw=2,
        label='Theoretical PDF')
ax.set_title('Normal Distribution: Simulated vs Theoretical')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (2): GDP growth ---
import pandas as pd
df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
gdp_growth = df['gdp_growth_qoq'].dropna()

mu_hat = gdp_growth.mean()
sigma_hat = gdp_growth.std()

fig, ax = plt.subplots()
ax.hist(gdp_growth, bins=20, density=True, alpha=0.5, color='steelblue',
        edgecolor='black', label='GDP growth (QoQ)')
x_range = np.linspace(gdp_growth.min() - 1, gdp_growth.max() + 1, 200)
ax.plot(x_range, stats.norm.pdf(x_range, loc=mu_hat, scale=sigma_hat),
        'darkred', lw=2, label='Fitted Normal')
ax.set_title('GDP Growth (QoQ) vs Fitted Normal')
ax.set_xlabel('Growth rate')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (3): QQ plot ---
fig, ax = plt.subplots()
stats.probplot(gdp_growth, dist='norm', plot=ax)
ax.set_title('QQ Plot: GDP Growth vs Normal')
plt.show()
```

</details>

<details><summary>Solution: The t-distribution</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# --- Your Turn (1): Compare t-distributions to Normal ---
x = np.linspace(-5, 5, 300)
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, stats.norm.pdf(x), 'k-', lw=2, label='Normal(0,1)')
for dof in [2, 5, 10, 30]:  # 'dof' keeps the loop from overwriting the DataFrame df
    ax.plot(x, stats.t.pdf(x, df=dof), '--', lw=1.5, label=f't(df={dof})')
ax.set_title('t-Distributions vs Standard Normal')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (2): Tail probabilities ---
print('P(|X| > 2):')
print(f'  Normal:    {2 * (1 - stats.norm.cdf(2)):.4f}')
for dof in [5, 10, 30]:
    p = 2 * (1 - stats.t.cdf(2, df=dof))
    print(f'  t(df={dof:2d}):  {p:.4f}')

# --- Your Turn (3): Simulate convergence ---
rng = np.random.default_rng(42)
n_samples = 10_000
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
x_range = np.linspace(-5, 5, 200)
for ax, dof in zip(axes, [3, 10, 30, 100]):
    samples = stats.t.rvs(df=dof, size=n_samples, random_state=rng)
    ax.hist(samples, bins=60, density=True, alpha=0.5, color='steelblue')
    ax.plot(x_range, stats.norm.pdf(x_range), 'r-', lw=1.5, label='Normal')
    ax.set_title(f't(df={dof})')
    ax.set_xlim(-5, 5)
    ax.legend(fontsize=8)
plt.suptitle('t-Distribution converges to Normal as df increases', y=1.02)
plt.tight_layout()
plt.show()
```
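
The heavier tails also show up in critical values, which is where they matter for regression output. The sketch below compares two-sided 5% critical values from `stats.t.ppf` with the Normal value; the degrees of freedom are chosen for illustration and the names `z_crit` and `t_crits` are assumptions.

```python
from scipy import stats

# Two-sided 5% test: look up the 97.5th percentile
z_crit = stats.norm.ppf(0.975)
print(f'Normal:      {z_crit:.3f}')

t_crits = {}
for dof in [5, 10, 30, 100]:
    t_crits[dof] = stats.t.ppf(0.975, df=dof)
    print(f't(df={dof:3d}):   {t_crits[dof]:.3f}')
```

Every t critical value exceeds the Normal one, and the gap shrinks as the degrees of freedom grow.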

</details>

<details><summary>Solution: The Chi-Squared distribution</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# --- Your Turn (1): Build chi-squared from squared normals ---
rng = np.random.default_rng(42)
n_sims = 50_000
k = 5
Z = rng.normal(size=(n_sims, k))
Q = (Z ** 2).sum(axis=1)

fig, ax = plt.subplots()
ax.hist(Q, bins=80, density=True, alpha=0.5, color='seagreen',
        edgecolor='black', label=f'Simulated sum of {k} squared normals')
x = np.linspace(0, 25, 200)
ax.plot(x, stats.chi2.pdf(x, df=k), 'k-', lw=2, label=f'chi2({k}) PDF')
ax.set_title(f'Chi-Squared from Squared Normals (k={k})')
ax.set_xlabel('Q')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (2): Different degrees of freedom ---
x = np.linspace(0, 30, 300)
fig, ax = plt.subplots(figsize=(9, 5))
for dof in [1, 2, 5, 10, 15]:
    ax.plot(x, stats.chi2.pdf(x, df=dof), lw=2, label=f'chi2(df={dof})')
ax.set_title('Chi-Squared Distribution for Different df')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_ylim(0, 0.5)
ax.legend()
plt.show()
```
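
A chi-squared variable with k degrees of freedom has mean k and variance 2k, which gives a quick numerical check on the simulation that does not depend on reading a plot. A minimal sketch, reusing the same seed and sizes as Your Turn (1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, k = 50_000, 5

# Sum of k squared standard normals, one sum per simulated row
Z = rng.normal(size=(n_sims, k))
Q = (Z ** 2).sum(axis=1)

print(f'Simulated mean:     {Q.mean():.3f}  (theory: {stats.chi2.mean(df=k):.1f})')
print(f'Simulated variance: {Q.var():.3f}  (theory: {stats.chi2.var(df=k):.1f})')
```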

</details>

<details><summary>Solution: The F-distribution</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# --- Your Turn (1): Simulate F from chi-squared ratios ---
rng = np.random.default_rng(42)
n_sims = 50_000
d1, d2 = 3, 20
# U and V must be independent; successive draws from one generator are
U = stats.chi2.rvs(df=d1, size=n_sims, random_state=rng)
V = stats.chi2.rvs(df=d2, size=n_sims, random_state=rng)
F_sim = (U / d1) / (V / d2)

fig, ax = plt.subplots()
ax.hist(F_sim, bins=100, density=True, alpha=0.5, color='goldenrod',
        edgecolor='black', label='Simulated F', range=(0, 8))
x = np.linspace(0, 8, 300)
ax.plot(x, stats.f.pdf(x, dfn=d1, dfd=d2), 'k-', lw=2,
        label=f'F({d1},{d2}) PDF')
ax.set_title(f'F-Distribution from Chi-Squared Ratio (d1={d1}, d2={d2})')
ax.set_xlabel('F')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (2): Different parameters ---
x = np.linspace(0.01, 5, 300)
fig, ax = plt.subplots(figsize=(9, 5))
for d1, d2 in [(1, 10), (3, 20), (5, 50), (10, 100)]:
    ax.plot(x, stats.f.pdf(x, dfn=d1, dfd=d2), lw=2,
            label=f'F({d1},{d2})')
ax.set_title('F-Distributions with Different Parameters')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
plt.show()

# --- Your Turn (3): Critical values ---
print('F critical values (alpha = 0.05):')
for d1, d2 in [(1, 50), (3, 50), (5, 50), (3, 20), (3, 100)]:
    cv = stats.f.ppf(0.95, dfn=d1, dfd=d2)
    print(f'  F({d1:2d}, {d2:3d}): {cv:.3f}')
```
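
One relationship worth verifying alongside the critical values: the square of a t variable with d degrees of freedom follows F(1, d), so a two-sided t-test at 5% and an F-test of a single restriction at 5% use matching cutoffs. The sketch below checks this numerically; the denominator degrees of freedom and the name `gaps` are illustrative assumptions.

```python
from scipy import stats

gaps = []
for d2 in [20, 50, 100]:
    t_crit = stats.t.ppf(0.975, df=d2)          # two-sided 5% t cutoff
    f_crit = stats.f.ppf(0.95, dfn=1, dfd=d2)   # upper 5% F(1, d2) cutoff
    gaps.append(abs(t_crit ** 2 - f_crit))
    print(f'd2 = {d2:3d}: t^2 = {t_crit ** 2:.4f}, F(1, {d2}) = {f_crit:.4f}')
```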

</details>

<details><summary>Solution: Discrete distributions (Binomial and Poisson)</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# --- Your Turn (1): Binomial ---
n_quarters = 40
p_recession = 0.15
rng = np.random.default_rng(42)
sim_counts = rng.binomial(n=n_quarters, p=p_recession, size=10_000)

k = np.arange(0, 20)
pmf_vals = stats.binom.pmf(k, n=n_quarters, p=p_recession)

fig, ax = plt.subplots()
ax.bar(k - 0.2, pmf_vals, width=0.4, color='steelblue', alpha=0.8,
       edgecolor='black', label='Theoretical PMF')
# Unit-width bins centered on the integers: each density is the share of in-range draws at that count
counts, _ = np.histogram(sim_counts, bins=np.arange(-0.5, 20.5, 1), density=True)
ax.bar(k + 0.2, counts, width=0.4, color='coral', alpha=0.8,
       edgecolor='black', label='Simulated')
ax.set_title(f'Recession Quarters per Decade: Binomial({n_quarters}, {p_recession})')
ax.set_xlabel('Number of recession quarters')
ax.set_ylabel('Probability')
ax.legend()
plt.show()

# --- Your Turn (2): Poisson ---
lam = 4
rng = np.random.default_rng(42)
sim_poisson = rng.poisson(lam=lam, size=10_000)

k = np.arange(0, 15)
pmf_vals = stats.poisson.pmf(k, mu=lam)

fig, ax = plt.subplots()
ax.bar(k - 0.2, pmf_vals, width=0.4, color='steelblue', alpha=0.8,
       edgecolor='black', label='Theoretical PMF')
counts, _ = np.histogram(sim_poisson, bins=np.arange(-0.5, 15.5, 1), density=True)
ax.bar(k + 0.2, counts, width=0.4, color='coral', alpha=0.8,
       edgecolor='black', label='Simulated')
ax.set_title(f'Fed Rate Changes per Year: Poisson(lambda={lam})')
ax.set_xlabel('Number of rate changes')
ax.set_ylabel('Probability')
ax.legend()
plt.show()

# --- Your Turn (3): Poisson as limit of Binomial ---
lam = 3
k = np.arange(0, 15)
fig, ax = plt.subplots()
ax.plot(k, stats.poisson.pmf(k, mu=lam), 'ko-', ms=6, lw=2,
        label=f'Poisson({lam})', zorder=5)
for n_val in [10, 30, 100, 500]:
    p_val = lam / n_val
    ax.plot(k, stats.binom.pmf(k, n=n_val, p=p_val), 'o--', ms=4,
            label=f'Binom({n_val}, {p_val:.3f})')
ax.set_title('Binomial approaches Poisson as n grows (np = 3)')
ax.set_xlabel('k')
ax.set_ylabel('P(X = k)')
ax.legend(fontsize=9)
plt.show()
```
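
The convergence in Your Turn (3) can also be measured rather than judged by eye. The sketch below records the largest pointwise gap between each Binomial PMF and the Poisson PMF over the same range of k; the name `max_gaps` is an assumption for illustration.

```python
import numpy as np
from scipy import stats

lam = 3
k = np.arange(0, 15)
poisson_pmf = stats.poisson.pmf(k, mu=lam)

max_gaps = {}
for n_val in [10, 30, 100, 500]:
    # Hold n * p fixed at lam while n grows
    binom_pmf = stats.binom.pmf(k, n=n_val, p=lam / n_val)
    max_gaps[n_val] = np.max(np.abs(binom_pmf - poisson_pmf))
    print(f'n = {n_val:3d}: max |Binom - Poisson| = {max_gaps[n_val]:.5f}')
```

The gap should shrink steadily as n increases with np held at 3.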

</details>

<details><summary>Solution: Comparing distributions with real data</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
import pandas as pd
df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)

col = 'gdp_growth_yoy'
series = df[col].dropna()

# Fit normal
mu_fit, sigma_fit = stats.norm.fit(series)

# Fit t-distribution
df_fit, loc_fit, scale_fit = stats.t.fit(series)

# Plot
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(series, bins=25, density=True, alpha=0.5, color='steelblue',
        edgecolor='black', label=f'{col} (data)')
x = np.linspace(series.min() - 1, series.max() + 1, 200)
ax.plot(x, stats.norm.pdf(x, loc=mu_fit, scale=sigma_fit),
        'darkred', lw=2, label=f'Normal(mu={mu_fit:.2f}, sigma={sigma_fit:.2f})')
ax.plot(x, stats.t.pdf(x, df=df_fit, loc=loc_fit, scale=scale_fit),
        'darkblue', lw=2, ls='--',
        label=f't(df={df_fit:.1f}, loc={loc_fit:.2f}, scale={scale_fit:.2f})')
ax.set_title(f'Distribution Fits: {col}')
ax.set_xlabel(col)
ax.set_ylabel('Density')
ax.legend()
plt.show()

# QQ plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
stats.probplot(series, dist='norm', plot=axes[0])
axes[0].set_title(f'{col}: QQ vs Normal')
stats.probplot(series, dist=stats.t, sparams=(df_fit,), plot=axes[1])
axes[1].set_title(f'{col}: QQ vs t(df={df_fit:.1f})')
plt.tight_layout()
plt.show()
```
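
The plots compare the two fits visually. One way to put a number on the comparison is the maximized log-likelihood of each fit. The sketch below is an illustration only: it uses synthetic heavy-tailed draws (an assumption, standing in for the real series so the block runs without the sample file), and the names `synthetic`, `ll_norm`, and `ll_t` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a heavy-tailed series (not the notebook's data)
rng = np.random.default_rng(42)
synthetic = stats.t.rvs(df=3, loc=2.0, scale=1.0, size=2000, random_state=rng)

# Maximum-likelihood fits, as in the solution above
mu_fit, sigma_fit = stats.norm.fit(synthetic)
df_fit, loc_fit, scale_fit = stats.t.fit(synthetic)

# Sum of log densities at the fitted parameters
ll_norm = stats.norm.logpdf(synthetic, loc=mu_fit, scale=sigma_fit).sum()
ll_t = stats.t.logpdf(synthetic, df=df_fit, loc=loc_fit, scale=scale_fit).sum()

print(f'Normal log-likelihood: {ll_norm:.1f}')
print(f't log-likelihood:      {ll_t:.1f}  (fitted df = {df_fit:.1f})')
```

To apply this to the notebook's data, replace `synthetic` with `series`. The t fit has one extra parameter, so a higher log-likelihood on its own is expected; a large gap is what points to heavy tails.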

</details>