# 05 Hypothesis Testing Foundations

The logic of statistical testing: null hypotheses, p-values, significance, errors, and power.

## Table of Contents
- [The logic of hypothesis testing](#the-logic-of-hypothesis-testing)
- [Null and alternative hypotheses](#null-and-alternative-hypotheses)
- [Test statistics and p-values](#test-statistics-and-p-values)
- [What p-values are NOT](#what-p-values-are-not)
- [Type I and Type II errors](#type-i-and-type-ii-errors)
- [Statistical power](#statistical-power)
- [Significance levels and multiple testing](#significance-levels-and-multiple-testing)
- [Connecting to regression output](#connecting-to-regression-output)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Hypothesis testing is the backbone of empirical economics. Every regression coefficient
comes with a p-value. Every policy evaluation involves a test. Understanding the logic
correctly — especially the common misinterpretations — is essential for reading and
producing credible empirical work. This notebook builds that understanding from scratch.

## Prerequisites (Quick Self-Check)
- Completed notebooks 00-04 (descriptive stats through confidence intervals).
- Understanding of sampling distributions and the CLT.
- Comfort with the t-distribution.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can state null and alternative hypotheses for common economic questions.
- You can correctly interpret (and state the limits of) a p-value.
- You can explain Type I error, Type II error, and statistical power.
- You can simulate hypothesis tests to verify their properties.

## Common Pitfalls
- Interpreting the p-value as "the probability that H0 is true."
- Treating "not significant" as "no effect" (could be low power).
- Running many tests and reporting only the significant ones (p-hacking).
- Confusing statistical significance with economic/practical significance.
- Using one-sided tests to get smaller p-values without theoretical justification.

## Quick Fixes (When You Get Stuck)
- `scipy.stats.ttest_1samp(x, popmean)` for a one-sample t-test.
- `scipy.stats.ttest_ind(x, y)` for a two-sample t-test (it assumes equal variances by default; pass `equal_var=False` for Welch's test).
- In statsmodels, p-values are in `res.pvalues`.
- If you see `ModuleNotFoundError`, re-run the bootstrap cell.

## Matching Guide
- `docs/guides/00a_statistics_primer/05_hypothesis_testing_foundations.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2-4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00a_statistics_primer/05_hypothesis_testing_foundations.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

<a id="the-logic-of-hypothesis-testing"></a>
## The logic of hypothesis testing

### Goal
Understand the framework of hypothesis testing before computing anything.

### Why this matters in economics
Every empirical economics paper makes claims like "this policy raised wages" or "monetary
tightening slowed GDP growth." Hypothesis testing is the formal framework for deciding
whether data support such claims. Getting the logic wrong leads to false conclusions
about which policies work and which do not.

### The courtroom analogy

Hypothesis testing works like a trial:

| Courtroom | Hypothesis test |
|---|---|
| Defendant is **innocent until proven guilty** | The null hypothesis $H_0$ is assumed true until evidence says otherwise |
| Prosecution presents evidence | We compute a test statistic from the data |
| Jury asks: "Is this evidence strong enough?" | We compare the test statistic to a threshold (or compute a p-value) |
| Verdict: **guilty** or **not guilty** | We **reject** $H_0$ or **fail to reject** $H_0$ |
| "Not guilty" $\neq$ "innocent" | "Fail to reject" $\neq$ "$H_0$ is true" |

The null hypothesis is the **status quo** or the **default** claim. In economics, it is
typically "no effect":

- *Does a minimum wage increase reduce employment?*  $\rightarrow$  $H_0$: no effect on employment.
- *Does this training program raise earnings?*  $\rightarrow$  $H_0$: earnings are unchanged.
- *Is GDP growth different from 2%?*  $\rightarrow$  $H_0$: mean GDP growth = 2%.

We **never** "accept" $H_0$. We either find enough evidence to reject it, or we don't.
Absence of evidence is not evidence of absence.
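
The reject / fail-to-reject vocabulary above can be sketched as a tiny helper. The function name `decide` and the default `alpha=0.05` are assumptions made for illustration; they are not part of any library used in this notebook.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test verdict in the vocabulary used above.

    `decide` is a hypothetical helper for illustration only.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError('p_value must be between 0 and 1')
    # We never "accept" H0: the only two verdicts are reject / fail to reject.
    return 'reject H0' if p_value < alpha else 'fail to reject H0'


print(decide(0.03))              # reject H0
print(decide(0.03, alpha=0.01))  # fail to reject H0
print(decide(0.40))              # fail to reject H0
```

Note that the same p-value can lead to different verdicts at different significance levels, which is why the level should be chosen before looking at the data.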

<a id="null-and-alternative-hypotheses"></a>
## Null and alternative hypotheses

### Goal
Practice writing $H_0$ and $H_1$ for various economic scenarios. Distinguish one-sided
from two-sided tests.

### Why this matters in economics
Stating hypotheses precisely forces you to articulate what you are testing before you
look at the data. Sloppy hypotheses lead to sloppy conclusions. One-sided vs. two-sided
matters: using a one-sided test to halve the p-value without prior theoretical
justification is a form of p-hacking.

### Key concepts

**Two-sided test** (most common):
- $H_0: \mu = \mu_0$ (e.g., mean GDP growth = 2%)
- $H_1: \mu \neq \mu_0$

**One-sided test** (requires strong theoretical motivation):
- $H_0: \mu \leq \mu_0$ vs. $H_1: \mu > \mu_0$, or
- $H_0: \mu \geq \mu_0$ vs. $H_1: \mu < \mu_0$

In regression, the default test for each coefficient is:
- $H_0: \beta_j = 0$ ("this variable has no linear effect")
- $H_1: \beta_j \neq 0$ (two-sided)
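
To see why the choice between one-sided and two-sided matters, the sketch below converts a single t-statistic into the three possible p-values. The observed statistic `t_obs = 1.80` and `dof = 29` are hypothetical numbers chosen for illustration.

```python
from scipy import stats

t_obs = 1.80   # hypothetical observed t-statistic
dof = 29       # hypothetical degrees of freedom (n - 1 with n = 30)

p_two_sided = 2 * stats.t.sf(abs(t_obs), df=dof)   # H1: mu != mu0
p_greater = stats.t.sf(t_obs, df=dof)              # H1: mu > mu0
p_less = stats.t.cdf(t_obs, df=dof)                # H1: mu < mu0

print(f'two-sided:     {p_two_sided:.4f}')
print(f'one-sided (>): {p_greater:.4f}')
print(f'one-sided (<): {p_less:.4f}')
```

With these numbers the two-sided p-value lies between 0.05 and 0.10, while the one-sided p-value in the direction of the estimate falls below 0.05. Picking the direction after seeing the data would therefore turn a non-rejection into a rejection at the 5% level.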

### Your Turn (1): Write hypotheses for economic questions

For each scenario below, write $H_0$ and $H_1$ in the markdown cell. State whether
the test should be one-sided or two-sided, and justify your choice.

1. **Mean GDP growth equals 2%.** You want to know if the long-run average quarterly
   (annualized) GDP growth rate differs from 2%.

2. **Unemployment is higher in recession quarters.** You believe recessions push
   unemployment above its non-recession average.

3. **The coefficient on education is positive.** In a wage regression, economic theory
   predicts that more education leads to higher wages.

4. **Federal funds rate has no effect on industrial production.** You are agnostic about
   the direction.

Write your answers below (double-click to edit):

**Your answers:**

1. $H_0$: ... &nbsp; $H_1$: ... &nbsp; (one-sided / two-sided because ...)

2. $H_0$: ... &nbsp; $H_1$: ... &nbsp; (one-sided / two-sided because ...)

3. $H_0$: ... &nbsp; $H_1$: ... &nbsp; (one-sided / two-sided because ...)

4. $H_0$: ... &nbsp; $H_1$: ... &nbsp; (one-sided / two-sided because ...)

### Your Turn (2): One-sample t-test on GDP growth

Load the macro sample data and test whether mean annualized GDP growth equals 2%.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
print(df.shape)
df.head()

In [None]:
# TODO: Extract the annualized GDP growth column
gdp_growth = df['gdp_growth_qoq_annualized'].dropna()

# TODO: Test H0: mean GDP growth = 2% (two-sided)
t_stat, p_val = ...  # Hint: stats.ttest_1samp(gdp_growth, popmean=2.0)

print(f'Sample mean: {gdp_growth.mean():.3f}')
print(f't-statistic: {t_stat:.3f}')
print(f'p-value:     {p_val:.4f}')
print(f'n =          {len(gdp_growth)}')

**Interpretation prompt:**
- Can you reject $H_0: \mu = 2\%$ at the 5% level? At the 1% level?
- What does it mean economically if you fail to reject?
- Would you conclude that GDP growth IS exactly 2%? Why or why not?

*Write 2-4 sentences here.*

<a id="test-statistics-and-p-values"></a>
## Test statistics and p-values

### Goal
Build intuition for test statistics and p-values by simulating data under the null
hypothesis and seeing how the test statistic behaves.

### Why this matters in economics
Understanding what a p-value actually measures (and what it does not) is the single
most important statistical skill for any empirical economist. When you read that a
coefficient has $p = 0.03$, you need to know exactly what that means.

### The test statistic

A test statistic measures: *how far is our estimate from the null hypothesis value,
measured in standard errors?*

$$
t = \frac{\hat{\theta} - \theta_0}{\widehat{SE}(\hat{\theta})}
$$

- $\hat{\theta}$: your estimate (e.g., sample mean, regression coefficient)
- $\theta_0$: the value under $H_0$ (often 0)
- $\widehat{SE}$: estimated standard error
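
As a worked instance of the formula above, the following sketch computes the t-statistic by hand for a one-sample test of the mean and compares it with `scipy.stats.ttest_1samp`. The eight data points and the null value `mu_0 = 2.0` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of eight growth rates (percent), for illustration only.
x = np.array([2.1, 3.4, 1.2, 2.8, 0.5, 3.9, 2.2, 1.7])
mu_0 = 2.0                                   # value of the mean under H0

theta_hat = x.mean()                         # estimate
se_hat = x.std(ddof=1) / np.sqrt(len(x))     # estimated standard error
t_manual = (theta_hat - mu_0) / se_hat       # distance from H0 in standard errors

t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu_0)
print(f'manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.4f}')
```

The `ddof=1` argument matters: NumPy's default (`ddof=0`) divides by $n$ rather than $n - 1$ and would not reproduce the SciPy result.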

### The p-value

The p-value is the probability of seeing a test statistic **this extreme or more extreme**,
**if $H_0$ is true**.

Small p-value $\Rightarrow$ the data are unlikely under $H_0$ $\Rightarrow$ evidence against $H_0$.
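
For the two-sided one-sample t-test used in this notebook, the definition above can be written out explicitly. Here $t_{\text{obs}}$ (the observed statistic) and $F_{n-1}$ (the CDF of the t-distribution with $n - 1$ degrees of freedom) are notation introduced for this sketch, and the second equality relies on the symmetry of the t-distribution:

$$
p = P\left(|T| \geq |t_{\text{obs}}| \mid H_0\right) = 2\left[1 - F_{n-1}\left(|t_{\text{obs}}|\right)\right]
$$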

### Your Turn (1): Simulate the null distribution of a t-statistic

Generate many samples where $H_0$ is true (true mean = 0), compute the t-statistic
each time, and plot the distribution. Then shade the tails to visualize the p-value.

In [None]:
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

n_obs = 30          # sample size per experiment
n_sims = 10_000     # number of simulated experiments
true_mean = 0.0     # H0 is true: population mean is 0

t_stats = np.empty(n_sims)

for i in range(n_sims):
    # TODO: Generate a sample of size n_obs from N(true_mean, 1)
    sample = ...
    # TODO: Compute the t-statistic for H0: mu = 0
    # t = (sample_mean - 0) / (sample_std / sqrt(n)), with sample_std = sample.std(ddof=1)
    t_stats[i] = ...

print(f'Mean of t-stats: {t_stats.mean():.3f} (should be ~0)')
print(f'Std of t-stats:  {t_stats.std():.3f} '
      f'(should be ~{np.sqrt((n_obs - 1) / (n_obs - 3)):.2f}, slightly above 1)')

In [None]:
# TODO: Plot a histogram of t_stats and overlay the theoretical t-distribution
fig, ax = plt.subplots(figsize=(9, 5))

# Histogram of simulated t-statistics
ax.hist(t_stats, bins=60, density=True, alpha=0.6, color='steelblue',
        label='Simulated t-stats under $H_0$')

# Overlay theoretical t-distribution with n-1 degrees of freedom
x_grid = np.linspace(-4, 4, 300)
ax.plot(x_grid, stats.t.pdf(x_grid, df=n_obs - 1), 'k-', lw=2,
        label=f't({n_obs - 1}) theoretical')

# TODO: Shade the rejection region for alpha = 0.05 (two-sided)
# Hint: the critical value is stats.t.ppf(0.975, df=n_obs-1)
t_crit = ...  # stats.t.ppf(0.975, df=n_obs - 1)
ax.axvline(t_crit, color='red', ls='--', label=f'Critical value = +/- {t_crit:.2f}')
ax.axvline(-t_crit, color='red', ls='--')

# Shade tails
x_right = x_grid[x_grid >= t_crit]
x_left = x_grid[x_grid <= -t_crit]
ax.fill_between(x_right, stats.t.pdf(x_right, df=n_obs - 1), alpha=0.3, color='red')
ax.fill_between(x_left, stats.t.pdf(x_left, df=n_obs - 1), alpha=0.3, color='red')

ax.set_xlabel('t-statistic')
ax.set_ylabel('Density')
ax.set_title('Distribution of t-statistics under $H_0$ (true mean = 0)')
ax.legend()
plt.tight_layout()
plt.show()

### Your Turn (2): Compute the p-value for one specific sample

Draw one sample and manually compute the p-value two ways: (1) from the simulation
and (2) from `scipy.stats`.

In [None]:
# Draw one sample (with a small real effect so the t-stat is interesting)
rng2 = np.random.default_rng(99)
one_sample = rng2.normal(loc=0.4, scale=1.0, size=n_obs)
one_t = (one_sample.mean() - 0) / (one_sample.std(ddof=1) / np.sqrt(n_obs))
print(f'Observed t-statistic: {one_t:.3f}')

# TODO: Method 1 -- Simulation-based p-value
# What fraction of the null t-stats are more extreme than one_t?
p_sim = ...  # np.mean(np.abs(t_stats) >= np.abs(one_t))

# TODO: Method 2 -- Exact p-value from scipy
p_exact = ...  # 2 * stats.t.sf(np.abs(one_t), df=n_obs - 1)

print(f'Simulation p-value: {p_sim:.4f}')
print(f'Exact p-value:      {p_exact:.4f}')

**Interpretation prompt:**
- In your own words, what does the p-value you computed mean?
- Why are the simulation and exact p-values close but not identical?
- If $H_0$ is true, what fraction of t-statistics should fall in the red-shaded tails?

*Write 2-4 sentences here.*

<a id="what-p-values-are-not"></a>
## What p-values are NOT

### Goal
Confront and correct the most common misinterpretations of p-values.

### Why this matters in economics
Empirical economics papers are full of p-values. Misinterpreting them leads to
over-confident policy recommendations, spurious "significant" findings, and a failure
to distinguish statistical significance from economic significance. Getting this right
is not pedantic -- it is essential.

### Three things a p-value is NOT

1. **NOT the probability that $H_0$ is true.**
   The p-value is $P(\text{data at least this extreme} \mid H_0)$, not
   $P(H_0 \mid \text{data})$. Confusing the two is the "prosecutor's fallacy."

2. **NOT the probability the result is "due to chance."**
   The p-value is computed *assuming* $H_0$ is true, so within that calculation chance is
   the only explanation available. It cannot also tell you how likely it is that
   randomness alone produced the result.

3. **NOT a measure of effect size.**
   A tiny, economically meaningless effect can have $p < 0.001$ with a large enough
   sample. A large, important effect can have $p > 0.10$ with a small sample.
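
One way to keep effect size in view is to report it, together with a confidence interval, next to the p-value. The sketch below does this for a one-sample test; `summarize` is a hypothetical helper, Cohen's d is used as one possible standardized effect size, and the simulated data are illustrative only.

```python
import numpy as np
from scipy import stats


def summarize(x, mu_0=0.0, level=0.95):
    """Hypothetical helper: p-value plus effect size and confidence interval."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    sd = x.std(ddof=1)
    se = sd / np.sqrt(n)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    _, p = stats.ttest_1samp(x, popmean=mu_0)
    return {
        'mean': mean,
        'ci': (mean - t_crit * se, mean + t_crit * se),
        'cohens_d': (mean - mu_0) / sd,   # standardized effect size
        'p_value': p,
    }


rng = np.random.default_rng(0)
big = summarize(rng.normal(loc=0.01, scale=1.0, size=100_000))
small = summarize(rng.normal(loc=0.8, scale=2.0, size=8))
for name, s in [('large n', big), ('small n', small)]:
    print(f"{name}: mean={s['mean']:.3f}, "
          f"CI=({s['ci'][0]:.3f}, {s['ci'][1]:.3f}), "
          f"d={s['cohens_d']:.3f}, p={s['p_value']:.4f}")
```

The interval width, not the p-value, shows how precisely the effect is pinned down: the large sample gives a very narrow interval around a negligible effect, while the small sample gives a wide interval that is compatible with both zero and a large effect.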

### Your Turn (1): Tiny effect, small p-value (large n)

Show that with enough data, even a trivially small effect becomes "statistically
significant."

In [None]:
rng = np.random.default_rng(10)

# A tiny true effect: mean = 0.01 (practically zero in most economic contexts)
tiny_effect = 0.01

# TODO: Generate a very large sample (n = 100_000) with this tiny mean
large_sample = ...  # rng.normal(loc=tiny_effect, scale=1.0, size=100_000)

# TODO: Run a t-test for H0: mu = 0
t_stat, p_val = ...  # stats.ttest_1samp(large_sample, popmean=0.0)

print(f'True effect:      {tiny_effect}')
print(f'Sample mean:      {large_sample.mean():.5f}')
print(f'Sample size:      {len(large_sample):,}')
print(f't-statistic:      {t_stat:.2f}')
print(f'p-value:          {p_val:.6f}')
print(f'"Significant" at 5%? {p_val < 0.05}')

### Your Turn (2): Large effect, large p-value (small n)

Show that with too little data, even a meaningful effect can fail to reach significance.

In [None]:
rng = np.random.default_rng(11)

# A large true effect: mean = 0.8 (economically meaningful)
large_effect = 0.8

# TODO: Generate a small sample (n = 8) with this large mean
small_sample = ...  # rng.normal(loc=large_effect, scale=2.0, size=8)

# TODO: Run a t-test for H0: mu = 0
t_stat, p_val = ...  # stats.ttest_1samp(small_sample, popmean=0.0)

print(f'True effect:      {large_effect}')
print(f'Sample mean:      {small_sample.mean():.3f}')
print(f'Sample size:      {len(small_sample)}')
print(f't-statistic:      {t_stat:.2f}')
print(f'p-value:          {p_val:.4f}')
print(f'"Significant" at 5%? {p_val < 0.05}')

**Interpretation prompt:**
- In scenario 1, the effect is tiny but "significant." Would you advise a policymaker
  to act on this finding? Why or why not?
- In scenario 2, the effect is large but "not significant." Does that mean there is no
  effect? What is actually going on?
- What additional information (beyond p-value) would you want before drawing conclusions?

*Write 2-4 sentences here.*

<a id="type-i-and-type-ii-errors"></a>
## Type I and Type II errors

### Goal
Simulate Type I errors (false positives) and Type II errors (false negatives) to make
the concepts concrete.

### Why this matters in economics
- **Type I error** (false positive): You conclude a policy works when it does not.
  Wasted resources, misguided policy.
- **Type II error** (false negative): You conclude a policy has no effect when it does.
  A beneficial policy gets abandoned.

The cost of each error depends on context. A drug approval trial, where $H_0$ is "the
drug has no benefit," should be very cautious about Type I errors (approving an
ineffective drug). A preliminary program evaluation might worry more about Type II
errors (killing a helpful program).

### Decision table

|  | $H_0$ is actually true | $H_0$ is actually false |
|---|---|---|
| **Reject $H_0$** | Type I error ($\alpha$) | Correct (Power = $1 - \beta$) |
| **Fail to reject $H_0$** | Correct ($1 - \alpha$) | Type II error ($\beta$) |
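
The decision table can be read as a lookup from (truth, decision) to outcome. The helper `classify` below is a hypothetical illustration of that mapping; in real data the truth is unknown, so this bookkeeping is only possible in simulations where you set the truth yourself.

```python
def classify(h0_is_true: bool, rejected: bool) -> str:
    """Hypothetical helper mapping (truth, decision) to a cell of the table."""
    if h0_is_true:
        return 'Type I error' if rejected else 'correct (true negative)'
    return 'correct (true positive)' if rejected else 'Type II error'


for truth in (True, False):
    for decision in (True, False):
        print(f'H0 true={truth!s:5}  rejected={decision!s:5} -> '
              f'{classify(truth, decision)}')
```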

### Your Turn (1): Simulate Type I errors (false positives)

Generate data where $H_0$ is true (mean = 0). Run 1,000 hypothesis tests. Count how
many reject at $\alpha = 0.05$. The count should be roughly 50 (5% of 1,000).

In [None]:
rng = np.random.default_rng(55)

n_tests = 1_000
n_obs = 50
alpha = 0.05
rejections = 0

for _ in range(n_tests):
    # TODO: Generate data under H0: true mean = 0
    sample = ...
    # TODO: Run a t-test and check if p < alpha
    _, p = ...
    if p < alpha:
        rejections += 1

print(f'Rejections out of {n_tests}: {rejections}')
print(f'Rejection rate:              {rejections / n_tests:.3f}')
print(f'Expected (alpha):            {alpha}')

### Your Turn (2): Simulate Type II errors (false negatives)

Generate data where $H_0$ is false (there IS a real effect, e.g., mean = 0.3). Run
1,000 tests. Count how many *fail* to reject. These are Type II errors.

In [None]:
rng = np.random.default_rng(56)

n_tests = 1_000
n_obs = 50
true_effect = 0.3   # H0 is FALSE; the true mean is 0.3
alpha = 0.05
failures_to_reject = 0

for _ in range(n_tests):
    # TODO: Generate data with mean = true_effect
    sample = ...
    # TODO: Test H0: mu = 0
    _, p = ...
    if p >= alpha:
        failures_to_reject += 1

print(f'Type II errors out of {n_tests}: {failures_to_reject}')
print(f'Type II error rate (beta):      {failures_to_reject / n_tests:.3f}')
print(f'Power (1 - beta):               {1 - failures_to_reject / n_tests:.3f}')

**Interpretation prompt:**
- Did the Type I error rate match $\alpha = 0.05$? Why or why not?
- What was the power against the effect of 0.3? Is that good enough?
- If you were evaluating a government program with n=50 observations and a modest true
  effect, what is the risk of concluding "no effect"?

*Write 2-4 sentences here.*

<a id="statistical-power"></a>
## Statistical power

### Goal
Understand what determines statistical power and build a power curve through simulation.

### Why this matters in economics
Many empirical studies are underpowered: they have too few observations to detect the
effects they are looking for. This leads to two problems: (1) real effects go undetected,
and (2) the effects that *are* detected tend to be exaggerated ("winner's curse").
Power analysis should be done *before* collecting data, not after.

### What determines power?

Power = $P$(reject $H_0$ $\mid$ $H_0$ is false) = $1 - \beta$.

Power increases when:
- **Effect size** is larger (easier to detect)
- **Sample size** is larger (more information)
- **Noise ($\sigma$)** is smaller (cleaner signal)
- **$\alpha$** is larger (more lenient threshold -- but more false positives)
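
The four levers listed above can be checked with a closed-form approximation. The sketch below computes the power of a two-sided one-sample z-test, treating $\sigma$ as known; it is an approximation to the t-test simulated in this notebook, and `z_test_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample z-test (sigma treated as known)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma          # mean of the z-statistic when H0 is false
    # Probability of landing in either rejection tail
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


base = z_test_power(delta=0.3, sigma=1.0, n=50)
print(f'baseline:      {base:.3f}')
print(f'larger effect: {z_test_power(0.5, 1.0, 50):.3f}')
print(f'larger sample: {z_test_power(0.3, 1.0, 200):.3f}')
print(f'less noise:    {z_test_power(0.3, 0.5, 50):.3f}')
print(f'larger alpha:  {z_test_power(0.3, 1.0, 50, alpha=0.10):.3f}')
```

Each variation should print a higher value than the baseline. Setting `delta=0` makes the function return `alpha`, which is the Type I error rate.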

### Your Turn (1): Build a power curve (varying sample size)

For a fixed small effect (mean = 0.3, $\sigma$ = 1), vary $n$ from 10 to 500 and
compute the proportion of rejections at each sample size.

In [None]:
rng = np.random.default_rng(77)

true_effect = 0.3
sigma = 1.0
alpha = 0.05
n_sims = 2_000

sample_sizes = [10, 20, 30, 50, 75, 100, 150, 200, 300, 500]
power_values = []

for n in sample_sizes:
    rejections = 0
    for _ in range(n_sims):
        # TODO: Simulate data with the true effect
        sample = ...  # rng.normal(loc=true_effect, scale=sigma, size=n)
        # TODO: Test H0: mu = 0 and count rejections
        _, p = ...  # stats.ttest_1samp(sample, popmean=0.0)
        if p < alpha:
            rejections += 1
    power_values.append(rejections / n_sims)

power_df = pd.DataFrame({'n': sample_sizes, 'power': power_values})
print(power_df.to_string(index=False))

In [None]:
# TODO: Plot the power curve
fig, ax = plt.subplots(figsize=(8, 5))

ax.plot(power_df['n'], power_df['power'], 'o-', color='steelblue', lw=2)
ax.axhline(0.80, color='red', ls='--', alpha=0.7, label='Conventional power = 0.80')
ax.axhline(alpha, color='gray', ls=':', alpha=0.5, label=f'alpha = {alpha}')

ax.set_xlabel('Sample Size (n)')
ax.set_ylabel('Power (rejection rate)')
ax.set_title(f'Power Curve: detecting effect = {true_effect}, sigma = {sigma}')
ax.set_ylim(0, 1.05)
ax.legend()
plt.tight_layout()
plt.show()

### Your Turn (2): How much data do you need?

Read off the power curve: approximately what sample size do you need to achieve 80%
power for this effect size?

In [None]:
# TODO: Find the smallest n in your simulation where power >= 0.80
# Hint: filter power_df
required_n = ...  # power_df.loc[power_df['power'] >= 0.80, 'n'].iloc[0]
print(f'Minimum n for 80% power (from simulation): ~{required_n}')

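# Cross-check with analytical formula (optional):
# For a one-sample z-test: n = ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
# The simulated answer can only land on a value in sample_sizes, so expect it to
# be the next grid point at or above this approximation.
from scipy.stats import norm
z_alpha2 = norm.ppf(0.975)   # z_{1-alpha/2} for alpha = 0.05
z_beta = norm.ppf(0.80)      # z_{1-beta} for target power 1 - beta = 0.80
n_analytical = ((z_alpha2 + z_beta) * sigma / true_effect) ** 2
print(f'Analytical approximation:                   ~{n_analytical:.0f}')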

**Interpretation prompt:**
- What happens to the power curve if the true effect is even smaller (e.g., 0.1)?
- Many macro datasets have ~80-200 quarterly observations. Given your power curve,
  what size effects can you reliably detect?
- Why should power analysis be done *before* collecting data, not after?

*Write 2-4 sentences here.*

<a id="significance-levels-and-multiple-testing"></a>
## Significance levels and multiple testing

### Goal
Understand why $\alpha = 0.05$ is a convention (not a law) and see how running many
tests inflates false positives.

### Why this matters in economics
Researchers often estimate many specifications, test many variables, or examine many
subgroups. If you run 20 tests at $\alpha = 0.05$ and all null hypotheses are true,
you *expect* one false positive. Reporting only the significant result is p-hacking.
This is a major issue in empirical economics and one reason journals increasingly
require pre-registration of studies.

### Your Turn (1): Simulate the multiple testing problem

Run 20 independent tests where $H_0$ is true for all of them. Repeat this experiment
many times. How often does at least one test reject?

In [None]:
rng = np.random.default_rng(88)

n_experiments = 5_000
n_tests_per_experiment = 20
n_obs = 50
alpha = 0.05

any_rejection_count = 0
total_rejections = 0

for _ in range(n_experiments):
    p_values = []
    for _ in range(n_tests_per_experiment):
        # TODO: Generate pure noise (H0 true for every test)
        sample = ...
        _, p = ...
        p_values.append(p)

    # TODO: Count how many of the 20 tests rejected
    n_rejected = ...  # sum(p < alpha for p in p_values)
    total_rejections += n_rejected
    if n_rejected > 0:
        any_rejection_count += 1

print(f'Experiments with at least one false positive: '
      f'{any_rejection_count}/{n_experiments} = '
      f'{any_rejection_count / n_experiments:.3f}')
print(f'Expected: 1 - (1 - {alpha})^{n_tests_per_experiment} = '
      f'{1 - (1 - alpha)**n_tests_per_experiment:.3f}')
print(f'Average false positives per experiment: '
      f'{total_rejections / n_experiments:.2f}')
print(f'Expected: {n_tests_per_experiment * alpha:.1f}')

### Your Turn (2): Bonferroni correction

The simplest fix: if you run $m$ tests, use $\alpha / m$ as your significance
threshold. Repeat the simulation above with the Bonferroni-corrected threshold.
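
Before simulating, it can help to see what the correction does on paper. The sketch below computes the family-wise error rate for independent tests when every null is true, with and without the Bonferroni threshold. The helper name `fwer_independent` is an assumption, as is the independence of the tests.

```python
def fwer_independent(m: int, per_test_alpha: float) -> float:
    """P(at least one false positive) for m independent tests with all nulls true."""
    return 1 - (1 - per_test_alpha) ** m


alpha, m = 0.05, 20
print(f'uncorrected: {fwer_independent(m, alpha):.4f}')      # 0.6415
print(f'Bonferroni:  {fwer_independent(m, alpha / m):.4f}')  # 0.0488
```

Under independence the corrected rate lands slightly below $\alpha$. More generally, the union bound guarantees that Bonferroni keeps the family-wise error rate at or below $\alpha$ even when the tests are dependent.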

In [None]:
rng = np.random.default_rng(89)

alpha_bonf = alpha / n_tests_per_experiment
print(f'Bonferroni-corrected alpha: {alpha_bonf:.4f}')

any_rejection_bonf = 0

for _ in range(n_experiments):
    p_values = []
    for _ in range(n_tests_per_experiment):
        # TODO: Generate pure noise, run t-test
        sample = ...
        _, p = ...
        p_values.append(p)

    # TODO: Check if any p < alpha_bonf
    if ...:
        any_rejection_bonf += 1

print(f'\nWith Bonferroni correction:')
print(f'Experiments with at least one false positive: '
      f'{any_rejection_bonf}/{n_experiments} = '
      f'{any_rejection_bonf / n_experiments:.3f}')
print(f'Target family-wise error rate: {alpha}')

**Interpretation prompt:**
- Without correction, what fraction of experiments had at least one false positive?
  How does that compare to the naive expectation of 5%?
- After Bonferroni correction, did the family-wise error rate return to ~5%?
- What is the downside of Bonferroni correction? (Hint: think about power.)

*Write 2-4 sentences here.*

<a id="connecting-to-regression-output"></a>
## Connecting to regression output

### Goal
See that every t-statistic and p-value in a `statsmodels` regression summary is a
hypothesis test for $H_0: \beta_j = 0$.

### Why this matters in economics
When you run a regression in practice, you do not manually compute t-statistics. The
software does it for you. But you need to know *what* is being tested to interpret the
output correctly. Every coefficient in the summary table has a t-stat and p-value that
answer: "Is there evidence that this variable has a non-zero linear relationship with
the outcome, conditional on the other variables in the model?"
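
To make the connection concrete, the sketch below reproduces the coefficient t-tests by hand with NumPy on synthetic data, assuming classical (homoskedastic) standard errors. The data-generating process and variable names are illustrative assumptions, not the notebook's macro dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 120
x = rng.normal(size=n)                          # synthetic regressor
y = 1.0 + 0.5 * x + rng.normal(size=n)          # synthetic outcome, true slope 0.5

X = np.column_stack([np.ones(n), x])            # add a constant column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS coefficients
resid = y - X @ beta_hat
dof = n - X.shape[1]                            # residual degrees of freedom
sigma2_hat = resid @ resid / dof                # residual variance estimate
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t_vals = beta_hat / se                          # each tests H0: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_vals), df=dof)

for name, b, s, t, p in zip(['const', 'x'], beta_hat, se, t_vals, p_vals):
    print(f'{name:>5}: coef={b:.3f}  se={s:.3f}  t={t:.2f}  p={p:.4f}')
```

Each row is the one-sample logic from earlier sections applied to a coefficient: estimate minus the null value (zero), divided by its standard error, compared against a t-distribution with $n - k$ degrees of freedom.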

### Your Turn (1): Fit a regression and read the hypothesis tests

In [None]:
import statsmodels.api as sm

# Use the macro quarterly sample
df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)

# Regression: GDP growth ~ UNRATE + FEDFUNDS + INDPRO growth
# We'll create a simple INDPRO growth proxy from the lag
df['indpro_growth'] = (df['INDPRO'] - df['INDPRO_lag1']) / df['INDPRO_lag1'] * 100

features = ['UNRATE', 'FEDFUNDS', 'indpro_growth']
target = 'gdp_growth_qoq_annualized'

reg_df = df[features + [target]].dropna()

X = sm.add_constant(reg_df[features])
y = reg_df[target]

res = sm.OLS(y, X).fit()
print(res.summary())

### Your Turn (2): Extract and interpret the hypothesis tests

In [None]:
# TODO: Extract coefficients, standard errors, t-stats, and p-values
coefs = ...      # res.params
se = ...         # res.bse
t_stats = ...    # res.tvalues
p_vals = ...     # res.pvalues

# TODO: Create a summary DataFrame
summary_df = pd.DataFrame({
    'coef': coefs,
    'std_err': se,
    't_stat': t_stats,
    'p_value': p_vals,
    'significant_5pct': ...  # p_vals < 0.05
})
summary_df

In [None]:
# TODO: Verify that t = coef / std_err (they should match)
manual_t = ...  # coefs / se
print('Manual t-stats:')
print(manual_t.round(4))
print('\nstatsmodels t-stats:')
print(t_stats.round(4))

**Interpretation prompt:**
- Which coefficients are statistically significant at the 5% level?
- For each significant coefficient, write a careful one-sentence interpretation.
  (Remember: "significant" means we reject $H_0: \beta_j = 0$, not that the effect
  is large or important.)
- For any non-significant coefficients: can you conclude the variable has no effect?
  Why or why not? (Hint: consider power.)
- How would using HAC standard errors change the p-values?

*Write 2-4 sentences here.*

## Where This Shows Up Later

- **02_regression**: Every coefficient t-test and p-value is the hypothesis test you
  learned here. F-tests for joint significance test whether a *group* of coefficients
  are all zero.
- **02_regression/04a_residual_diagnostics**: Breusch-Pagan and White tests for
  heteroskedasticity are hypothesis tests where $H_0$: homoskedastic errors.
- **07_time_series_econ/00_stationarity_unit_roots**: The Augmented Dickey-Fuller (ADF)
  test has $H_0$: unit root (non-stationary). Rejecting means evidence of stationarity.
- **06_causal**: Hausman tests, overidentification tests, and pre-trend tests all follow
  the same logic: specify $H_0$, compute a test statistic, evaluate the p-value.

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)

Run the assertions below and answer the conceptual questions.

In [None]:
# Sanity checks on your work

# 1. Type I error rate should be close to alpha
# (from your Type I simulation -- paste your rejection rate here)
# assert abs(your_type1_rate - 0.05) < 0.02, 'Type I rate too far from alpha'

# 2. Power should increase with sample size
# assert power_df['power'].is_monotonic_increasing, 'Power should increase with n'

# 3. t-stat = coef / se in the regression
# assert np.allclose(manual_t, t_stats, atol=1e-3), 't-stats should match'

# TODO: Uncomment and run the asserts above once you have completed the exercises.
# Write 2-3 sentences: what does each check verify?
...

## Extensions (Optional)

- **Power for different effect sizes**: Re-run the power simulation for effect sizes
  of 0.1, 0.2, 0.3, and 0.5 and overlay the power curves on one plot.
- **Two-sample test**: Compare mean unemployment in recession vs. non-recession quarters
  using `stats.ttest_ind`. What is the effect size? What does the p-value tell you?
- **FDR correction**: Instead of Bonferroni, try the Benjamini-Hochberg procedure
  (`statsmodels.stats.multitest.multipletests`). How does it compare?
- **HAC standard errors**: Re-fit the regression with
  `sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})` (or call
  `res.get_robustcov_results(cov_type='HAC', maxlags=4)`, which takes `maxlags` as a
  direct keyword) and compare p-values. Which coefficients change significance?
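
As a starting point for the FDR extension above, the Benjamini-Hochberg step-up rule can be sketched by hand. This is a minimal illustration, not the `statsmodels` implementation; the function name `benjamini_hochberg` and the list of p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejections under the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m     # (k / m) * q for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank with p_(k) <= (k / m) * q
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject


# Hypothetical p-values, for illustration only
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh = benjamini_hochberg(p_values, q=0.05)
bonf = np.array(p_values) < 0.05 / len(p_values)
print(f'BH rejections:         {bh.sum()}')    # 2
print(f'Bonferroni rejections: {bonf.sum()}')  # 1
```

Benjamini-Hochberg controls the expected share of false discoveries rather than the chance of any false positive, so it typically rejects at least as many hypotheses as Bonferroni.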

## Reflection

- What implicit assumptions does a t-test make? Which of those assumptions might be
  violated in macroeconomic time series data?
- If you were reviewing a paper that reports 20 regression specifications and highlights
  the one with the smallest p-value, what would you be concerned about?
- In your own words, explain the difference between statistical significance and
  economic significance. Give an example where they diverge.

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Null and alternative hypotheses (writing exercise)</summary>

1. **Mean GDP growth = 2%**: $H_0: \mu = 2\%$, $H_1: \mu \neq 2\%$ (two-sided,
   because growth could be above or below 2%).

2. **Unemployment higher in recessions**: $H_0: \mu_{\text{rec}} \leq \mu_{\text{non-rec}}$,
   $H_1: \mu_{\text{rec}} > \mu_{\text{non-rec}}$ (one-sided, because theory and common
   sense predict the direction).

3. **Coefficient on education is positive**: $H_0: \beta_{\text{educ}} \leq 0$,
   $H_1: \beta_{\text{educ}} > 0$ (one-sided, justified by human capital theory).

4. **Fed funds rate and industrial production**: $H_0: \beta_{\text{ff}} = 0$,
   $H_1: \beta_{\text{ff}} \neq 0$ (two-sided, because you are agnostic about direction).

</details>

<details><summary>Solution: Null and alternative hypotheses (t-test code)</summary>

```python
gdp_growth = df['gdp_growth_qoq_annualized'].dropna()
t_stat, p_val = stats.ttest_1samp(gdp_growth, popmean=2.0)
print(f'Sample mean: {gdp_growth.mean():.3f}')
print(f't-statistic: {t_stat:.3f}')
print(f'p-value:     {p_val:.4f}')
```

</details>

<details><summary>Solution: Test statistics and p-values (simulation)</summary>

```python
rng = np.random.default_rng(42)
n_obs = 30
n_sims = 10_000
true_mean = 0.0

t_stats = np.empty(n_sims)
for i in range(n_sims):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n_obs)
    sample_mean = sample.mean()
    sample_se = sample.std(ddof=1) / np.sqrt(n_obs)
    t_stats[i] = (sample_mean - 0) / sample_se
```

</details>

<details><summary>Solution: Test statistics and p-values (plot and p-value computation)</summary>

```python
t_crit = stats.t.ppf(0.975, df=n_obs - 1)

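# Simulation-based p-value: share of simulated |t| values at least as extreme
# as the observed statistic `one_t` (assumed to be defined earlier in the notebook)
p_sim = np.mean(np.abs(t_stats) >= np.abs(one_t))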

# Exact p-value
p_exact = 2 * stats.t.sf(np.abs(one_t), df=n_obs - 1)
```
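
The snippet above relies on `t_stats` and `one_t` from earlier cells. As a self-contained sketch of the same comparison, the block below simulates the null distribution in vectorized form and uses an assumed observed statistic of `one_t = 2.1`, a value chosen purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_sims = 30, 10_000

# Null distribution of the t-statistic: one row per simulated sample, H0 true
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_obs))
t_null = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n_obs))

# Hypothetical observed t-statistic (assumed value, not from the notebook)
one_t = 2.1

# Share of simulated statistics at least as extreme as the observed one
p_sim = np.mean(np.abs(t_null) >= np.abs(one_t))

# Two-sided tail area of the t distribution with n_obs - 1 degrees of freedom
p_exact = 2 * stats.t.sf(np.abs(one_t), df=n_obs - 1)

print(f'simulated p-value: {p_sim:.4f}')
print(f'exact p-value:     {p_exact:.4f}')
```

The two numbers should agree up to simulation noise, which shrinks as `n_sims` grows.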

</details>

<details><summary>Solution: What p-values are NOT</summary>

```python
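# Tiny effect, large n
rng = np.random.default_rng(10)
large_sample = rng.normal(loc=0.01, scale=1.0, size=100_000)
t_stat, p_val = stats.ttest_1samp(large_sample, popmean=0.0)
# Expected t is about 0.01 / (1 / sqrt(100_000)) ≈ 3.2, so the p-value is
# usually small even though the effect is negligible
print(f'Tiny effect, large n:  t = {t_stat:.3f}, p = {p_val:.4f}')

# Large effect, small n
rng = np.random.default_rng(11)
small_sample = rng.normal(loc=0.8, scale=2.0, size=8)
t_stat, p_val = stats.ttest_1samp(small_sample, popmean=0.0)
# Expected t is about 0.8 / (2 / sqrt(8)) ≈ 1.1, so the p-value is often
# large even though the effect is meaningful
print(f'Large effect, small n: t = {t_stat:.3f}, p = {p_val:.4f}')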
```
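
A p-value alone hides the size of the effect, so it can help to report the estimate with a confidence interval. The sketch below does that for the same two samples. The helper `mean_ci` is a hypothetical name introduced here, not a notebook or SciPy function.

```python
import numpy as np
from scipy import stats

def mean_ci(sample, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    sample = np.asarray(sample)
    n = sample.size
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.5 + confidence / 2, df=n - 1) * se
    return mean, mean - half_width, mean + half_width

# Same draws as in the solution above
large_sample = np.random.default_rng(10).normal(loc=0.01, scale=1.0, size=100_000)
small_sample = np.random.default_rng(11).normal(loc=0.8, scale=2.0, size=8)

for label, sample in [('tiny effect, large n', large_sample),
                      ('large effect, small n', small_sample)]:
    mean, lower, upper = mean_ci(sample)
    print(f'{label}: mean = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]')
```

The large sample gives a very narrow interval around a negligible value. The small sample gives a wide interval that cannot rule out either zero or a substantial effect.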

</details>

<details><summary>Solution: Type I and Type II errors</summary>

```python
# Type I errors (H0 true)
rng = np.random.default_rng(55)
n_tests, n_obs, alpha = 1_000, 50, 0.05
rejections = 0
for _ in range(n_tests):
    sample = rng.normal(loc=0, scale=1.0, size=n_obs)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p < alpha:
        rejections += 1
print(f'Type I rate: {rejections / n_tests:.3f}')  # should be ~0.05

# Type II errors (H0 false, true effect = 0.3)
rng = np.random.default_rng(56)
failures_to_reject = 0
for _ in range(n_tests):
    sample = rng.normal(loc=0.3, scale=1.0, size=n_obs)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p >= alpha:
        failures_to_reject += 1
print(f'Type II rate: {failures_to_reject / n_tests:.3f}')
print(f'Power: {1 - failures_to_reject / n_tests:.3f}')
```
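
The simulated rates can be checked against an analytic value. Under the alternative, the one-sample t-statistic follows a noncentral t distribution with noncentrality parameter $\delta \sqrt{n} / \sigma$. A minimal sketch using `scipy.stats.nct`, where the helper name `ttest_power` is introduced here for illustration:

```python
import numpy as np
from scipy import stats

def ttest_power(effect, sigma, n, alpha=0.05):
    """Analytic power of a two-sided one-sample t-test."""
    df = n - 1
    ncp = effect / sigma * np.sqrt(n)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # P(|T| > t_crit) when T follows a noncentral t distribution
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

power = ttest_power(effect=0.3, sigma=1.0, n=50)
print(f'Analytic power:        {power:.3f}')
print(f'Analytic Type II rate: {1 - power:.3f}')
print(f'Rejection rate when the true effect is zero: {ttest_power(0.0, 1.0, 50):.3f}')
```

With a true effect of zero, the formula returns the significance level, which ties the Type I rate and power together in one function.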

</details>

<details><summary>Solution: Statistical power (power curve)</summary>

```python
rng = np.random.default_rng(77)
true_effect, sigma, alpha, n_sims = 0.3, 1.0, 0.05, 2_000
sample_sizes = [10, 20, 30, 50, 75, 100, 150, 200, 300, 500]
power_values = []

for n in sample_sizes:
    rejections = 0
    for _ in range(n_sims):
        sample = rng.normal(loc=true_effect, scale=sigma, size=n)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        if p < alpha:
            rejections += 1
    power_values.append(rejections / n_sims)

# Smallest n in the grid that reaches 80% power
# (.iloc[0] raises IndexError if no grid value reaches it)
power_df = pd.DataFrame({'n': sample_sizes, 'power': power_values})
required_n = power_df.loc[power_df['power'] >= 0.80, 'n'].iloc[0]
```
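
The grid search only identifies the first listed sample size at or above the target, so the true requirement may lie between two grid points. As a cross-check, `statsmodels` can solve for the sample size directly. This sketch assumes `statsmodels` is installed and treats the effect size as the standardized value `true_effect / sigma = 0.3`.

```python
import numpy as np
from statsmodels.stats.power import TTestPower

analysis = TTestPower()  # power calculations for a one-sample t-test

# Solve for the sample size that gives 80% power at alpha = 0.05
n_exact = analysis.solve_power(effect_size=0.3, nobs=None, alpha=0.05,
                               power=0.80, alternative='two-sided')
n_required = int(np.ceil(np.squeeze(n_exact)))  # round up to a whole observation

achieved = analysis.power(effect_size=0.3, nobs=n_required, alpha=0.05,
                          alternative='two-sided')
print(f'Required n for 80% power: {n_required}')
print(f'Power at that n:          {achieved:.3f}')
```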

</details>

<details><summary>Solution: Multiple testing</summary>

```python
rng = np.random.default_rng(88)
n_experiments, n_tests_per = 5_000, 20
n_obs, alpha = 50, 0.05
any_rejection_count = 0

for _ in range(n_experiments):
    p_values = []
    for _ in range(n_tests_per):
        sample = rng.normal(loc=0, scale=1.0, size=n_obs)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        p_values.append(p)
    if any(p < alpha for p in p_values):
        any_rejection_count += 1

print(f'Share with at least one false rejection: {any_rejection_count / n_experiments:.3f}')

# Bonferroni
rng = np.random.default_rng(89)
alpha_bonf = alpha / n_tests_per
any_rejection_bonf = 0
for _ in range(n_experiments):
    p_values = []
    for _ in range(n_tests_per):
        sample = rng.normal(loc=0, scale=1.0, size=n_obs)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        p_values.append(p)
    if any(p < alpha_bonf for p in p_values):
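        any_rejection_bonf += 1

print(f'Share with at least one false rejection (Bonferroni): {any_rejection_bonf / n_experiments:.3f}')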
```
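
For independent tests with every null true, the chance of at least one false rejection has a closed form, $1 - (1 - \alpha)^m$, which the simulated shares should approximate. The sketch below computes it and applies the Bonferroni correction to one family of p-values with `statsmodels`' `multipletests`. The single family drawn here is for illustration only.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

alpha, n_tests_per, n_obs = 0.05, 20, 50

# Probability of at least one false rejection across independent tests
fwer_uncorrected = 1 - (1 - alpha) ** n_tests_per
fwer_bonferroni = 1 - (1 - alpha / n_tests_per) ** n_tests_per
print(f'Family-wise error rate, no correction: {fwer_uncorrected:.3f}')
print(f'Family-wise error rate, Bonferroni:    {fwer_bonferroni:.3f}')

# One family of 20 tests with every null true (one row per test)
rng = np.random.default_rng(88)
samples = rng.normal(loc=0, scale=1.0, size=(n_tests_per, n_obs))
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Bonferroni-adjusted p-values are min(p * m, 1)
reject, p_adj, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')
print(f'Raw rejections:        {(p_values < alpha).sum()}')
print(f'Bonferroni rejections: {reject.sum()}')
```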

</details>

<details><summary>Solution: Connecting to regression output</summary>

```python
import statsmodels.api as sm

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
df['indpro_growth'] = (df['INDPRO'] - df['INDPRO_lag1']) / df['INDPRO_lag1'] * 100

features = ['UNRATE', 'FEDFUNDS', 'indpro_growth']
target = 'gdp_growth_qoq_annualized'
reg_df = df[features + [target]].dropna()

X = sm.add_constant(reg_df[features])
y = reg_df[target]
res = sm.OLS(y, X).fit()

coefs = res.params
se = res.bse
t_stats = res.tvalues
p_vals = res.pvalues

summary_df = pd.DataFrame({
    'coef': coefs,
    'std_err': se,
    't_stat': t_stats,
    'p_value': p_vals,
    'significant_5pct': p_vals < 0.05
})

# Verify t = coef / se
manual_t = coefs / se
assert np.allclose(manual_t, t_stats, atol=1e-10)
```

</details>