# 02 Sampling and the Central Limit Theorem

Why sample statistics work, how sampling distributions behave, and the theorem that makes inference possible.


## Table of Contents
- [Population vs sample](#population-vs-sample)
- [Sampling variability](#sampling-variability)
- [The Law of Large Numbers](#the-law-of-large-numbers)
- [The Central Limit Theorem](#the-central-limit-theorem)
- [When does n=30 suffice?](#when-does-n30-suffice)
- [Standard error of the mean](#standard-error-of-the-mean)
- [CLT with real economic data](#clt-with-real-economic-data)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
The Central Limit Theorem is arguably the most important result in applied statistics.
It explains why confidence intervals work, why t-tests are approximately valid, and why
regression inference is possible even when your data is not normally distributed. Without
the CLT, most of the statistical tools in this project would lack justification.

## Prerequisites (Quick Self-Check)
- Completed notebooks 00-01 (descriptive statistics and distributions).
- Familiarity with the normal distribution.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain the difference between a population parameter and a sample statistic.
- You can demonstrate the CLT through simulation and interpret the result.
- You can compute and interpret the standard error of the mean.
- You know when the "n=30" rule of thumb breaks down.

## Common Pitfalls
- Confusing the distribution of the data with the distribution of the sample mean.
- Thinking the CLT says individual observations become normal (it does not).
- Assuming the CLT works for any n regardless of how skewed the data is.
- Forgetting that the CLT requires independent observations (violated in time series).

## Quick Fixes (When You Get Stuck)
- If your simulation is slow, reduce the number of repetitions to 500 first, then increase.
- If histograms look choppy, increase the number of repetitions, adjust `bins`, or overlay a KDE.
- If you see `ModuleNotFoundError`, re-run the Environment Bootstrap cell.

## Matching Guide
- `docs/guides/00_statistics_primer/02_sampling_and_central_limit_theorem.md`


## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00_statistics_primer/02_sampling_and_central_limit_theorem.md`) for the math, assumptions, and deeper context.


<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.


In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT


## Concept
This notebook builds your intuition for the mechanics of statistical inference.

You will see:
- why a sample mean is a useful estimate of a population mean,
- how sample-to-sample variability shrinks as $n$ grows,
- why the Law of Large Numbers guarantees convergence,
- why the Central Limit Theorem makes the normal distribution show up everywhere,
- when the CLT approximation is good enough and when it is not.

Everything in this notebook is simulation-based: you will *see* the theorems work
before you rely on them for regression inference later in the project.


<a id="population-vs-sample"></a>
## Population vs sample

### Goal
Understand the distinction between a population parameter and a sample statistic.

### Why this matters in economics
In economics, we almost never observe the full population. The true mean quarterly
GDP growth rate across *all possible* quarters (the population parameter $\mu$) is
unknown. What we have is a finite sample—say, 80 quarters of data—from which we
compute a sample mean $\bar{x}$. The entire goal of inference is to say something
reliable about $\mu$ using only $\bar{x}$ and its uncertainty.

### Key definitions
- **Population parameter** ($\mu$, $\sigma$): a fixed but unknown quantity describing
  the full data-generating process.
- **Sample statistic** ($\bar{x}$, $s$): a quantity computed from observed data.
  It varies from sample to sample.
- **Sampling distribution**: the distribution of a sample statistic across many
  hypothetical repeated samples of the same size.


### Your Turn (1): Population vs sample in action


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Imagine a "population" of quarterly GDP growth rates (in %).
# True mean = 0.8%, true std = 1.5%
pop_mean = 0.8
pop_std = 1.5
population = rng.normal(loc=pop_mean, scale=pop_std, size=100_000)

# TODO: Draw a sample of n=80 quarters from the population
sample = ...

# TODO: Compute the sample mean and sample std
sample_mean = ...
sample_std = ...

print(f'Population mean (true):  {pop_mean}')
print(f'Sample mean (n=80):      {sample_mean:.4f}')
print(f'Population std (true):   {pop_std}')
print(f'Sample std (n=80):       {sample_std:.4f}')


**Interpretation prompt:**
- Is the sample mean exactly equal to the population mean? Why or why not?
- If you drew a *different* sample of 80 quarters, would you get the same sample mean?
- Write 2–3 sentences.


<a id="sampling-variability"></a>
## Sampling variability

### Goal
See empirically that sample means vary from sample to sample, and that larger samples
produce less variable estimates.

### Why this matters in economics
When you read that "average GDP growth was 2.1% over the last 40 quarters," that number
carries uncertainty. A different 40-quarter window would give a different number.
Understanding sampling variability is the first step toward quantifying that uncertainty.


### Your Turn (1): Draw many samples, compute many means


In [None]:
rng = np.random.default_rng(42)

# Population: 100,000 values from a normal distribution
pop_mean = 0.8
pop_std = 1.5
population = rng.normal(loc=pop_mean, scale=pop_std, size=100_000)

n_reps = 1_000  # number of repeated samples
sample_sizes = [10, 50, 200]

# TODO: For each sample size, draw n_reps samples and compute the mean of each.
# Store results in a dict: {n: array_of_means}
means_by_n = {}
for n in sample_sizes:
    means = ...
    means_by_n[n] = means


### Your Turn (2): Plot histograms of sample means


In [None]:
# TODO: Create a figure with 3 subplots (one per sample size).
# Plot a histogram of sample means for each n.
# Add a vertical line at the true population mean.
# Hint: use plt.subplots(1, 3, figsize=(14, 4))

fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)

for ax, n in zip(axes, sample_sizes):
    ...
    ax.set_title(f'n = {n} (std = {means_by_n[n].std():.4f})')
    ax.set_xlabel('Sample mean')

axes[0].set_ylabel('Frequency')
fig.suptitle('Sampling Distribution of the Mean', fontsize=13)
plt.tight_layout()
plt.show()


**Interpretation prompt:**
- How does the spread of sample means change as n increases?
- Which sample size gives the most "precise" estimate of the population mean?
- Write 2–3 sentences relating this to the idea of using more data in economics.


<a id="the-law-of-large-numbers"></a>
## The Law of Large Numbers

### Goal
Demonstrate that the sample mean converges to the population mean as $n$ grows.

### Why this matters in economics
The LLN is why we trust averages. If you compute the average inflation rate over
more and more months, it gets closer and closer to the true long-run average.
Without LLN, collecting more data would not help.

### Key statement
If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$, then

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mu \quad \text{as } n \to \infty$$
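
A short sketch of why this holds, under the extra assumption (not needed for the LLN itself) that the variance $\sigma^2$ is finite. Because $\bar{X}_n$ has mean $\mu$ and variance $\sigma^2 / n$, Chebyshev's inequality gives, for any tolerance $\varepsilon > 0$:

$$P\left(\left|\bar{X}_n - \mu\right| \ge \varepsilon\right) \le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\,\varepsilon^2} \to 0 \quad \text{as } n \to \infty$$

The bound shrinks in proportion to $1/n$, so the probability of missing $\mu$ by more than $\varepsilon$ vanishes, which is what $\xrightarrow{p}$ means.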


### Your Turn (1): Running mean with normal data


In [None]:
rng = np.random.default_rng(42)

true_mean = 2.5
n_obs = 5_000

# TODO: Generate n_obs draws from a normal distribution with mean=true_mean, std=3.0
data_normal = ...

# TODO: Compute the cumulative (running) mean
# Hint: np.cumsum(data_normal) / np.arange(1, n_obs + 1)
running_mean_normal = ...

plt.figure(figsize=(10, 4))
plt.plot(running_mean_normal, label='Running mean (normal data)')
plt.axhline(true_mean, color='red', linestyle='--', label=f'True mean = {true_mean}')
plt.xlabel('Number of observations')
plt.ylabel('Running mean')
plt.title('Law of Large Numbers: Convergence of the Running Mean')
plt.legend()
plt.tight_layout()
plt.show()


### Your Turn (2): Running mean with skewed data


In [None]:
# The LLN does not require normality. It works for any distribution with a finite mean.
# TODO: Generate n_obs draws from an exponential distribution (rate=1.0, so mean=1.0)
data_skewed = ...

# TODO: Compute the running mean
running_mean_skewed = ...

plt.figure(figsize=(10, 4))
plt.plot(running_mean_skewed, label='Running mean (exponential data)')
plt.axhline(1.0, color='red', linestyle='--', label='True mean = 1.0')
plt.xlabel('Number of observations')
plt.ylabel('Running mean')
plt.title('LLN with Highly Skewed (Exponential) Data')
plt.legend()
plt.tight_layout()
plt.show()


**Interpretation prompt:**
- Does the running mean converge for the exponential data even though the data itself is skewed?
- How does the convergence path differ between normal and exponential data?
- Write 2–3 sentences.


<a id="the-central-limit-theorem"></a>
## The Central Limit Theorem

### Goal
Show that the sampling distribution of the mean becomes approximately normal
regardless of the shape of the underlying data distribution.

### Why this matters in economics
Economic data is rarely normally distributed. Income is right-skewed. Unemployment
durations are often modeled as roughly exponential. Housing prices can be multimodal.
Yet we use t-tests, confidence intervals, and regression inference that assume
normality of *estimators*.
The CLT is the justification: even if the data is non-normal, the sample mean
(and by extension, OLS coefficients) is approximately normal for large enough $n$.

### Key statement
If $X_1, \ldots, X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

In plain language: the distribution of the sample mean gets closer and closer to
a normal distribution, centered at $\mu$, with standard deviation $\sigma / \sqrt{n}$.
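
To make the standardization in the key statement concrete, here is a minimal sketch. The source distribution (exponential), the sample size `n = 200`, and the seed are illustrative assumptions, not values the notebook prescribes. It standardizes simulated sample means exactly as in the formula and checks how often they land within $\pm 1.96$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(scale=1) source: mean = 1, std = 1 (strongly right-skewed).
mu, sigma = 1.0, 1.0
n, n_reps = 200, 10_000

# Draw n_reps samples of size n at once and take the mean of each row.
sample_means = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)

# Standardize as in the theorem: (x_bar - mu) / (sigma / sqrt(n)).
z = (sample_means - mu) / (sigma / np.sqrt(n))

# If z is approximately N(0, 1), about 95% of values fall within +/-1.96.
share_within = np.mean(np.abs(z) < 1.96)
print(f'mean(z) = {z.mean():.3f}, std(z) = {z.std():.3f}')
print(f'share with |z| < 1.96 = {share_within:.3f}')
```

If the approximation is good, `mean(z)` should sit near 0, `std(z)` near 1, and the share near 0.95. Try a much smaller `n` to see the approximation degrade.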


### Your Turn (1): CLT from a uniform distribution


In [None]:
rng = np.random.default_rng(42)

n_reps = 2_000
sample_sizes = [5, 30, 100, 500]

# Source distribution: Uniform(0, 1) -- flat, not bell-shaped at all
# True mean = 0.5, true std = 1/sqrt(12) ~ 0.2887

# TODO: For each sample size, draw n_reps samples from Uniform(0,1),
# compute the mean of each sample, and store the array of means.
uniform_means = {}
for n in sample_sizes:
    ...


In [None]:
# TODO: Plot density histograms of sample means for each n in a 1x4 grid (a KDE overlay is optional).
# Overlay a normal PDF with the theoretical mean and SE.
from scipy import stats

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
true_mean = 0.5
true_std = 1.0 / np.sqrt(12)

for ax, n in zip(axes, sample_sizes):
    se = true_std / np.sqrt(n)
    ax.hist(uniform_means[n], bins=40, density=True, alpha=0.6, label='Simulated')
    # TODO: Overlay a normal PDF with mean=true_mean, std=se
    x_grid = np.linspace(true_mean - 4 * se, true_mean + 4 * se, 200)
    ...
    ax.set_title(f'Uniform, n={n}')
    ax.legend(fontsize=8)

fig.suptitle('CLT: Sampling Distribution of the Mean (Uniform Source)', fontsize=13)
plt.tight_layout()
plt.show()


### Your Turn (2): CLT from an exponential distribution


In [None]:
rng = np.random.default_rng(42)

# Source distribution: Exponential(rate=1) -- heavily right-skewed
# True mean = 1.0, true std = 1.0

# TODO: For each sample size, draw n_reps samples from Exponential(scale=1.0),
# compute the mean of each, and store.
exp_means = {}
for n in sample_sizes:
    ...


In [None]:
# TODO: Plot the sampling distributions for exponential source (same layout as above).
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
true_mean_exp = 1.0
true_std_exp = 1.0

for ax, n in zip(axes, sample_sizes):
    se = true_std_exp / np.sqrt(n)
    ...
    ax.set_title(f'Exponential, n={n}')

fig.suptitle('CLT: Sampling Distribution of the Mean (Exponential Source)', fontsize=13)
plt.tight_layout()
plt.show()


### Your Turn (3): CLT from a bimodal mixture


In [None]:
rng = np.random.default_rng(42)

# Bimodal mixture: 50% from N(-2, 0.5^2) and 50% from N(2, 0.5^2)
# True mean = 0.0, true variance = 0.5^2 + 2^2 = 4.25, true std ~ 2.062

def draw_bimodal(rng, size):
    """Draw from a 50/50 mixture of N(-2, std=0.5) and N(2, std=0.5)."""
    mask = rng.random(size) < 0.5
    vals = np.where(mask,
                    rng.normal(-2, 0.5, size),
                    rng.normal(2, 0.5, size))
    return vals

# TODO: For each sample size, draw n_reps samples from the bimodal mixture,
# compute the mean of each, and store.
bimodal_means = {}
for n in sample_sizes:
    ...


In [None]:
# TODO: Plot the sampling distributions for bimodal source.
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
true_mean_bi = 0.0
true_std_bi = np.sqrt(0.5**2 + 2.0**2)  # ~ 2.062

for ax, n in zip(axes, sample_sizes):
    se = true_std_bi / np.sqrt(n)
    ...
    ax.set_title(f'Bimodal, n={n}')

fig.suptitle('CLT: Sampling Distribution of the Mean (Bimodal Source)', fontsize=13)
plt.tight_layout()
plt.show()


**Interpretation prompt:**
- For which source distribution does the normal approximation kick in fastest?
- For which source distribution does it take the largest n?
- Does the sampling distribution of the mean from the bimodal source look normal at n=5? At n=100?
- Write 3–4 sentences connecting this to the types of data you encounter in economics.


<a id="when-does-n30-suffice"></a>
## When does n=30 suffice?

### Goal
Test the common textbook rule of thumb that "n=30 is enough for the CLT" and
show where it breaks down.

### Why this matters in economics
Many applied researchers assume their sample is "large enough" without checking.
For mildly skewed data, n=30 is often fine. But for heavily skewed distributions
(e.g., income, wealth, firm size), you may need n=100 or more before the sampling
distribution of the mean is approximately normal.
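
One way to see why skewness matters: for i.i.d. data, the skewness of the sample mean equals the population skewness divided by $\sqrt{n}$. The sketch below tabulates that quantity at n=30. The helper names and the target value 0.2 are hypothetical choices for illustration, not an established cutoff, and skewness is only a rough guide (heavy tails also slow convergence).

```python
import numpy as np

# Population skewness of each source distribution.
source_skewness = {
    'Normal(0, 1)': 0.0,
    'Exponential(1)': 2.0,
    # Lognormal(0, 1): (e + 2) * sqrt(e - 1), about 6.18
    'Lognormal(0, 1)': (np.e + 2) * np.sqrt(np.e - 1),
}


def skewness_of_mean(skew, n):
    """Skewness of the mean of n i.i.d. draws: population skewness / sqrt(n)."""
    return skew / np.sqrt(n)


def n_for_target(skew, target):
    """Smallest n for which the skewness of the mean is at most `target`."""
    return max(1, int(np.ceil((skew / target) ** 2)))


target = 0.2  # illustrative threshold, an assumption rather than a standard
for name, skew in source_skewness.items():
    print(f'{name:16s} skew of mean at n=30: {skewness_of_mean(skew, 30):.3f}   '
          f'n for skew <= {target}: {n_for_target(skew, target)}')
```

Because the required n grows with the square of the population skewness, a source that is three times as skewed needs roughly nine times as many observations to reach the same residual skewness.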


### Your Turn (1): Shapiro-Wilk test on sampling distributions


In [None]:
from scipy.stats import shapiro

rng = np.random.default_rng(42)

n_reps = 1_000
test_ns = [10, 30, 50, 100, 200, 500]

# We test three distributions with increasing skewness:
# (a) Normal(0, 1)      -- skewness = 0
# (b) Exponential(1)    -- skewness = 2
# (c) Lognormal(0, 1)   -- skewness ~ 6.18 (very heavy right tail)

distributions = {
    'Normal': lambda rng, size: rng.normal(0, 1, size),
    'Exponential': lambda rng, size: rng.exponential(1.0, size),
    'Lognormal': lambda rng, size: rng.lognormal(0, 1, size),
}

# TODO: For each distribution and each n, generate n_reps sample means,
# then run the Shapiro-Wilk test on those means.
# Store p-values in a DataFrame with rows=distributions, columns=sample sizes.
results = {}
for dist_name, draw_fn in distributions.items():
    row = {}
    for n in test_ns:
        means = ...
        _, p_val = shapiro(means)
        row[n] = p_val
    results[dist_name] = row

sw_df = pd.DataFrame(results).T
sw_df.columns = [f'n={n}' for n in test_ns]
sw_df


**Interpretation prompt:**
- For Normal source data, does the Shapiro-Wilk test fail to reject normality (p-value above 0.05) at every n? Why would you expect that?
- For Lognormal source data, at what n (if any) does the p-value first exceed 0.05?
- Does the "n=30 rule" work for the lognormal? What would you recommend instead?
- Write 3–4 sentences.


<a id="standard-error-of-the-mean"></a>
## Standard error of the mean

### Goal
Verify empirically that the standard deviation of sample means is approximately $\sigma / \sqrt{n}$.

### Why this matters in economics
The standard error (SE) is the bridge between a point estimate and a confidence interval.
When you see $\hat{\beta} \pm 1.96 \times SE$ in a regression table, the SE is doing the
heavy lifting. Understanding where it comes from, and that it shrinks in proportion to
$1/\sqrt{n}$, is essential for interpreting any empirical result.

### Key formula
$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

In practice, $\sigma$ is unknown and estimated by the sample standard deviation $s$,
so $\widehat{SE} = s / \sqrt{n}$.
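
As a minimal sketch of the estimated version, the snippet below computes $s / \sqrt{n}$ for one sample and uses it in an approximate 95% interval. The data are hypothetical (simulated growth rates with mean 0.8 and standard deviation 1.5), and the 1.96 critical value assumes the normal approximation is adequate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 80 quarters of growth (%) with true mean 0.8, true std 1.5.
sample = rng.normal(loc=0.8, scale=1.5, size=80)

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)       # ddof=1 gives the usual sample standard deviation
se_hat = s / np.sqrt(n)      # estimated standard error of the mean

# Approximate 95% interval using the normal critical value 1.96.
ci_low, ci_high = x_bar - 1.96 * se_hat, x_bar + 1.96 * se_hat

print(f'x_bar = {x_bar:.3f}, s = {s:.3f}, SE_hat = {se_hat:.3f}')
print(f'approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]')
print(f'theoretical SE (sigma known): {1.5 / np.sqrt(n):.3f}')
```

Note the `ddof=1` argument: NumPy's default `std()` divides by $n$ rather than $n - 1$, which slightly understates $s$ in small samples.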


### Your Turn (1): Empirical SE vs theoretical SE


In [None]:
rng = np.random.default_rng(42)

pop_std = 1.5
n_reps = 5_000
sample_sizes = [5, 10, 25, 50, 100, 200, 500, 1000]

# TODO: For each n, compute the empirical std of the sample means
# and compare to the theoretical SE = pop_std / sqrt(n).
empirical_se = []
theoretical_se = []

for n in sample_sizes:
    means = ...
    empirical_se.append(means.std())
    theoretical_se.append(pop_std / np.sqrt(n))

se_df = pd.DataFrame({
    'n': sample_sizes,
    'Empirical SE': empirical_se,
    'Theoretical SE': theoretical_se,
})
se_df['Ratio (Empirical/Theoretical)'] = se_df['Empirical SE'] / se_df['Theoretical SE']
se_df


### Your Turn (2): Plot SE vs n


In [None]:
# TODO: Plot empirical SE and theoretical SE vs n on the same axes.
# Use a log-log scale to see the sqrt(n) relationship clearly.

fig, ax = plt.subplots(figsize=(8, 5))
...
ax.set_xlabel('Sample size (n)')
ax.set_ylabel('Standard Error')
ax.set_title('Standard Error of the Mean: Empirical vs Theoretical')
ax.legend()
plt.tight_layout()
plt.show()


**Interpretation prompt:**
- Is the ratio of empirical to theoretical SE close to 1.0 for all n?
- On the log-log plot, what is the slope? Why does the SE shrink as $1/\sqrt{n}$?
- To cut the SE in half, you need to multiply n by what factor?
- Write 2–3 sentences connecting this to the cost of data collection in economics.


<a id="clt-with-real-economic-data"></a>
## CLT with real economic data

### Goal
Apply the bootstrap to real macro data and check whether the sampling distribution
of the mean is approximately normal.

### Why this matters in economics
The bootstrap is one of the most practical tools in applied econometrics. When you
cannot derive the sampling distribution analytically, you can approximate it by
resampling your data. This section bridges simulation to real-world practice.
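
The resampling idea can be sketched on toy data before touching the real dataset. Everything here is an illustrative assumption: the `observed` array is simulated rather than loaded, and the `toy_` names are chosen so they do not collide with the variables in the exercise cells. The sketch draws all bootstrap indices in one call instead of looping.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for an observed sample (skewed, n = 60).
observed = rng.exponential(scale=2.0, size=60)
toy_n_obs, toy_n_boot = observed.size, 4_000

# Each row of idx is one bootstrap sample of size toy_n_obs,
# drawn WITH replacement from the positions of the observed data.
idx = rng.integers(0, toy_n_obs, size=(toy_n_boot, toy_n_obs))
toy_boot_means = observed[idx].mean(axis=1)

toy_boot_se = toy_boot_means.std(ddof=1)
toy_analytic_se = observed.std(ddof=1) / np.sqrt(toy_n_obs)

print(f'bootstrap SE: {toy_boot_se:.4f}')
print(f'analytic SE:  {toy_analytic_se:.4f}')
```

For the mean, the two numbers should nearly agree. The bootstrap earns its keep for statistics that have no simple SE formula. Note that this simple version resamples observations as if they were independent.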


### Your Turn (1): Load the macro quarterly dataset


In [None]:
import pandas as pd

# Load the macro quarterly dataset: prefer the processed file,
# and fall back to the bundled sample if it is not available.
processed_path = PROCESSED_DIR / 'macro_quarterly.csv'
sample_path = SAMPLE_DIR / 'macro_quarterly_sample.csv'

csv_path = processed_path if processed_path.exists() else sample_path
df_macro = pd.read_csv(csv_path, index_col=0, parse_dates=True)

print(f'Loaded: {csv_path.name}  shape: {df_macro.shape}')
df_macro.head()


### Your Turn (2): Bootstrap the mean GDP growth


In [None]:
rng = np.random.default_rng(42)

# TODO: Identify the GDP growth column (inspect df_macro.columns).
# Drop NaNs for that column.
gdp_col = ...  # e.g., 'gdp_growth_qoq' or similar; inspect df_macro.columns
gdp_data = df_macro[gdp_col].dropna().values

n_boot = 5_000
n_obs = len(gdp_data)

# TODO: Bootstrap: resample n_obs values WITH replacement, compute the mean.
# Repeat n_boot times.
boot_means = ...

print(f'Original sample mean: {gdp_data.mean():.4f}')
print(f'Bootstrap mean of means: {boot_means.mean():.4f}')
print(f'Bootstrap SE: {boot_means.std():.4f}')


### Your Turn (3): Plot bootstrap distribution vs normal


In [None]:
from scipy import stats

# TODO: Plot the bootstrap distribution of means as a histogram.
# Overlay a normal PDF with mean = boot_means.mean() and std = boot_means.std().

fig, ax = plt.subplots(figsize=(8, 5))
...
ax.set_xlabel('Mean GDP growth (quarterly, %)')
ax.set_ylabel('Density')
ax.set_title('Bootstrap Distribution of Mean GDP Growth')
ax.legend()
plt.tight_layout()
plt.show()

# TODO: Run a Shapiro-Wilk test on boot_means to check normality.
stat, p_val = ...
print(f'Shapiro-Wilk test: statistic={stat:.4f}, p-value={p_val:.4f}')


**Interpretation prompt:**
- Does the bootstrap distribution look normal?
- How does the bootstrap SE compare to the analytical SE ($s / \sqrt{n}$)?
- Why might bootstrapping be especially useful when you have a small macro sample?
- Write 3–4 sentences.


## Where This Shows Up Later
The ideas in this notebook appear throughout the rest of the project:

- **Confidence intervals (notebook 04):** The SE formula from this notebook is used
  directly to construct confidence intervals around point estimates.
- **Hypothesis testing (notebook 05):** The CLT justifies using the normal/t-distribution
  to compute p-values for regression coefficients.
- **Regression inference (module 02):** Every `statsmodels` summary table relies on
  the CLT. The `std err` column is the estimated SE; the `t` column is the coefficient
  divided by its SE; the confidence interval is $\hat{\beta} \pm t_{\alpha/2} \times SE$.
- **Bootstrap and HAC standard errors (notebooks 02_regression/04):** When CLT assumptions
  are strained (small n, autocorrelation), robust SE methods and the bootstrap provide
  alternatives grounded in the same logic.
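
The arithmetic behind the regression bullet can be sketched directly. The coefficient, standard error, and degrees of freedom below are hypothetical numbers chosen for illustration, not output from any model in this project.

```python
from scipy import stats

# Hypothetical regression output, chosen only for illustration.
beta_hat = 0.50      # estimated coefficient
se = 0.20            # its estimated standard error
df_resid = 76        # residual degrees of freedom (e.g., n=80 with 4 parameters)
alpha = 0.05

# t statistic: coefficient divided by its standard error.
t_stat = beta_hat / se
# Two-sided p-value from the t distribution.
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)
# Critical value t_{alpha/2} and the resulting confidence interval.
t_crit = stats.t.ppf(1 - alpha / 2, df=df_resid)
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)

print(f't = {t_stat:.2f}, p = {p_value:.4f}')
print(f'{100 * (1 - alpha):.0f}% CI: [{ci[0]:.3f}, {ci[1]:.3f}]')
```

With many degrees of freedom, `t_crit` approaches the familiar 1.96, which is why the normal and t versions of the interval nearly coincide in large samples.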


<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2–3 sentences summarizing what you verified.


In [None]:
# Sanity checks (adjust variable names to match your work)

# 1) The mean of many sample means should be close to the true population mean.
# assert abs(np.mean(means_by_n[200]) - pop_mean) < 0.1

# 2) The empirical SE should be close to the theoretical SE.
# assert abs(se_df['Ratio (Empirical/Theoretical)'].mean() - 1.0) < 0.05

# 3) The bootstrap SE should be positive and finite.
# assert 0 < boot_means.std() < 10

# TODO: Uncomment the checks above once you've completed the TODO cells.
# TODO: Write 2-3 sentences:
# - What is the difference between the distribution of data and the distribution of the mean?
# - Why does the CLT not say individual observations become normal?
...


## Extensions (Optional)
- Investigate the CLT for the *sample median* instead of the sample mean. Does it also converge to normal? (Hint: yes, but the SE formula is different.)
- Try the bootstrap on a different column in the macro dataset (e.g., unemployment rate or inflation).
- Explore the relationship between skewness and the minimum $n$ needed for the CLT to hold. Plot "minimum n" vs skewness for several distributions.
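
As a starting point for the median extension, here is a hedged sketch for the special case of normal data, where large-sample theory gives $SE(\text{median}) \approx \sqrt{\pi/2}\,\sigma/\sqrt{n}$. The sample size, repetition count, and seed are illustrative assumptions; other source distributions have a different constant.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, n, n_reps = 1.5, 201, 5_000

# Each row is one sample of size n from a normal population.
samples = rng.normal(loc=0.0, scale=sigma, size=(n_reps, n))
se_mean = samples.mean(axis=1).std(ddof=1)
se_median = np.median(samples, axis=1).std(ddof=1)

# Large-sample theory for normal data: SE(median) ~ sqrt(pi / 2) * sigma / sqrt(n).
theory_mean = sigma / np.sqrt(n)
theory_median = np.sqrt(np.pi / 2) * sigma / np.sqrt(n)

print(f'empirical SE(mean):   {se_mean:.4f}   theory: {theory_mean:.4f}')
print(f'empirical SE(median): {se_median:.4f}   theory: {theory_median:.4f}')
print(f'ratio median/mean:    {se_median / se_mean:.3f}')
```

For normal data the ratio should land near $\sqrt{\pi/2} \approx 1.25$, meaning the median is a noisier estimate of the center than the mean in this case.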


## Reflection
- What assumptions does the CLT require that might be violated in real economic data?
  (Hint: think about independence in time series.)
- If you were advising a colleague who only has 25 observations of a skewed variable,
  what would you tell them about using normal-theory confidence intervals?
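
The independence hint can be made concrete with a small simulation. This is a sketch under stated assumptions: the data follow a hypothetical AR(1) process with persistence `phi = 0.8`, a value chosen for illustration rather than estimated from any dataset. It compares the actual variability of the sample mean with what the i.i.d. formula $s / \sqrt{n}$ reports.

```python
import numpy as np

rng = np.random.default_rng(4)

phi, n, n_reps = 0.8, 200, 3_000   # phi is the hypothetical AR(1) persistence

# Simulate n_reps independent AR(1) paths: x_t = phi * x_{t-1} + e_t.
# Start each path from the stationary distribution so no burn-in is needed.
shocks = rng.normal(size=(n_reps, n))
x = np.empty((n_reps, n))
x[:, 0] = shocks[:, 0] / np.sqrt(1 - phi**2)
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + shocks[:, t]

# Actual variability of the sample mean across paths.
actual_se = x.mean(axis=1).std(ddof=1)
# What the i.i.d. formula s / sqrt(n) reports on a typical path.
naive_se = np.mean(x.std(axis=1, ddof=1) / np.sqrt(n))

print(f'actual SE of the mean: {actual_se:.3f}')
print(f'naive s/sqrt(n):       {naive_se:.3f}')
print(f'ratio actual/naive:    {actual_se / naive_se:.2f}')
```

With positive autocorrelation the naive formula understates the true uncertainty by a wide margin, which is the motivation for the HAC standard errors mentioned earlier.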


<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Population vs sample</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — Population vs sample
import numpy as np

rng = np.random.default_rng(42)
pop_mean = 0.8
pop_std = 1.5
population = rng.normal(loc=pop_mean, scale=pop_std, size=100_000)

sample = rng.choice(population, size=80, replace=False)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)

print(f'Population mean (true):  {pop_mean}')
print(f'Sample mean (n=80):      {sample_mean:.4f}')
print(f'Population std (true):   {pop_std}')
print(f'Sample std (n=80):       {sample_std:.4f}')
```

</details>

<details><summary>Solution: Sampling variability</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — Sampling variability
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
pop_mean = 0.8
pop_std = 1.5
population = rng.normal(loc=pop_mean, scale=pop_std, size=100_000)

n_reps = 1_000
sample_sizes = [10, 50, 200]

means_by_n = {}
for n in sample_sizes:
    means = np.array([rng.choice(population, size=n).mean() for _ in range(n_reps)])
    means_by_n[n] = means

fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)
for ax, n in zip(axes, sample_sizes):
    ax.hist(means_by_n[n], bins=40, alpha=0.7, edgecolor='black')
    ax.axvline(pop_mean, color='red', linestyle='--', label=f'True mean={pop_mean}')
    ax.set_title(f'n = {n} (std = {means_by_n[n].std():.4f})')
    ax.set_xlabel('Sample mean')
    ax.legend(fontsize=8)
axes[0].set_ylabel('Frequency')
fig.suptitle('Sampling Distribution of the Mean', fontsize=13)
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Law of Large Numbers</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — Law of Large Numbers
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
true_mean = 2.5
n_obs = 5_000

# Normal data
data_normal = rng.normal(loc=true_mean, scale=3.0, size=n_obs)
running_mean_normal = np.cumsum(data_normal) / np.arange(1, n_obs + 1)

# Skewed (exponential) data
data_skewed = rng.exponential(scale=1.0, size=n_obs)
running_mean_skewed = np.cumsum(data_skewed) / np.arange(1, n_obs + 1)

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].plot(running_mean_normal)
axes[0].axhline(true_mean, color='red', linestyle='--')
axes[0].set_title('Normal Data')
axes[1].plot(running_mean_skewed)
axes[1].axhline(1.0, color='red', linestyle='--')
axes[1].set_title('Exponential Data')
for ax in axes:
    ax.set_xlabel('n')
    ax.set_ylabel('Running mean')
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Central Limit Theorem (uniform)</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — CLT (uniform)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n_reps = 2_000
sample_sizes = [5, 30, 100, 500]

uniform_means = {}
for n in sample_sizes:
    uniform_means[n] = np.array([rng.uniform(0, 1, size=n).mean() for _ in range(n_reps)])

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
true_mean = 0.5
true_std = 1.0 / np.sqrt(12)

for ax, n in zip(axes, sample_sizes):
    se = true_std / np.sqrt(n)
    ax.hist(uniform_means[n], bins=40, density=True, alpha=0.6, label='Simulated')
    x_grid = np.linspace(true_mean - 4*se, true_mean + 4*se, 200)
    ax.plot(x_grid, stats.norm.pdf(x_grid, true_mean, se), 'r-', lw=2, label='Normal')
    ax.set_title(f'Uniform, n={n}')
    ax.legend(fontsize=8)
fig.suptitle('CLT: Sampling Distribution of the Mean (Uniform Source)', fontsize=13)
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: When does n=30 suffice?</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — n=30 rule
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(42)
n_reps = 1_000
test_ns = [10, 30, 50, 100, 200, 500]

distributions = {
    'Normal': lambda rng, size: rng.normal(0, 1, size),
    'Exponential': lambda rng, size: rng.exponential(1.0, size),
    'Lognormal': lambda rng, size: rng.lognormal(0, 1, size),
}

results = {}
for dist_name, draw_fn in distributions.items():
    row = {}
    for n in test_ns:
        means = np.array([draw_fn(rng, n).mean() for _ in range(n_reps)])
        _, p_val = shapiro(means)
        row[n] = p_val
    results[dist_name] = row

sw_df = pd.DataFrame(results).T
sw_df.columns = [f'n={n}' for n in test_ns]
sw_df
```

</details>

<details><summary>Solution: Standard error of the mean</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — Standard error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
pop_std = 1.5
n_reps = 5_000
sample_sizes = [5, 10, 25, 50, 100, 200, 500, 1000]

empirical_se = []
theoretical_se = []
for n in sample_sizes:
    means = np.array([rng.normal(0, pop_std, size=n).mean() for _ in range(n_reps)])
    empirical_se.append(means.std())
    theoretical_se.append(pop_std / np.sqrt(n))

se_df = pd.DataFrame({
    'n': sample_sizes,
    'Empirical SE': empirical_se,
    'Theoretical SE': theoretical_se,
})
se_df['Ratio (Empirical/Theoretical)'] = se_df['Empirical SE'] / se_df['Theoretical SE']

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(sample_sizes, empirical_se, 'o-', label='Empirical SE')
ax.plot(sample_sizes, theoretical_se, 's--', label='Theoretical SE')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Sample size (n)')
ax.set_ylabel('Standard Error')
ax.set_title('Standard Error of the Mean: Empirical vs Theoretical')
ax.legend()
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: CLT with real economic data</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02_sampling_and_central_limit_theorem — Bootstrap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Load data (adjust column name to match your dataset)
processed_path = PROCESSED_DIR / 'macro_quarterly.csv'
sample_path = SAMPLE_DIR / 'macro_quarterly_sample.csv'
csv_path = processed_path if processed_path.exists() else sample_path
df_macro = pd.read_csv(csv_path, index_col=0, parse_dates=True)

# Identify GDP growth column (inspect df_macro.columns)
gdp_col = 'gdp_growth_qoq'  # adjust if needed
gdp_data = df_macro[gdp_col].dropna().values

n_boot = 5_000
n_obs = len(gdp_data)
boot_means = np.array([rng.choice(gdp_data, size=n_obs, replace=True).mean()
                        for _ in range(n_boot)])

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(boot_means, bins=50, density=True, alpha=0.6, label='Bootstrap')
x_grid = np.linspace(boot_means.min(), boot_means.max(), 200)
ax.plot(x_grid, stats.norm.pdf(x_grid, boot_means.mean(), boot_means.std()),
        'r-', lw=2, label='Normal fit')
ax.set_xlabel('Mean GDP growth')
ax.set_ylabel('Density')
ax.set_title('Bootstrap Distribution of Mean GDP Growth')
ax.legend()
plt.tight_layout()
plt.show()

stat, p_val = shapiro(boot_means)
print(f'Shapiro-Wilk: statistic={stat:.4f}, p-value={p_val:.4f}')
```

</details>
