# 06 Common Statistical Tests

Putting hypothesis testing into practice: t-tests, chi-squared tests, F-tests, and knowing which test fits which question.

## Table of Contents
- [Choosing the right test](#choosing-the-right-test)
- [One-sample t-test](#one-sample-t-test)
- [Two-sample t-test](#two-sample-t-test)
- [Paired t-test](#paired-t-test)
- [Chi-squared test of independence](#chi-squared-test-of-independence)
- [Chi-squared goodness-of-fit](#chi-squared-goodness-of-fit)
- [F-test for equality of variances](#f-test-for-equality-of-variances)
- [F-test in regression](#f-test-in-regression)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
The previous notebook covered the logic of hypothesis testing. This notebook puts that
logic into practice with the specific tests you will encounter throughout this project.
Each test answers a different kind of question, and choosing the wrong test gives
misleading answers. By the end, you will have a practical toolkit for the most common
statistical comparisons in economics.

## Prerequisites (Quick Self-Check)
- Completed notebooks 00-05 (especially hypothesis testing foundations).
- Understanding of p-values, Type I/II errors, and test statistics.
- Familiarity with the t, chi-squared, and F distributions.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can choose the appropriate test for a given question and data type.
- You can run and interpret t-tests, chi-squared tests, and F-tests in Python.
- You can read the F-statistic from a regression summary and explain what it tests.

## Common Pitfalls
- Using a two-sample t-test when data is paired (losing power).
- Applying chi-squared tests to tables with expected counts below 5.
- Confusing the regression F-test (joint significance) with the F-test for equal variances.
- Reporting "significant" without specifying the test, alpha level, or what was tested.

## Quick Fixes (When You Get Stuck)
- `scipy.stats.ttest_1samp(x, popmean)` -- one-sample t-test.
- `scipy.stats.ttest_ind(x, y, equal_var=False)` -- Welch's two-sample t-test.
- `scipy.stats.ttest_rel(x, y)` -- paired t-test.
- `scipy.stats.chi2_contingency(table)` -- chi-squared test of independence.
- `scipy.stats.levene(x, y)` -- test for equal variances.
- If you see `ModuleNotFoundError`, re-run the bootstrap cell.

## Matching Guide
- `docs/guides/00a_statistics_primer/06_common_statistical_tests.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2-4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00a_statistics_primer/06_common_statistical_tests.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

## Load the sample data

We will use `macro_quarterly_sample.csv` throughout this notebook.
This dataset contains quarterly US macroeconomic indicators including GDP growth,
unemployment rate, the federal funds rate, CPI, industrial production, and
a recession indicator.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
print('Shape:', df.shape)
print('Columns:', list(df.columns))
df.head()

<a id="choosing-the-right-test"></a>
## Choosing the Right Test

### Goal
Develop a mental decision tree for selecting the appropriate statistical test.

### Why this matters in economics
Economists routinely test hypotheses about means, variances, proportions, and regression
coefficients. Each question calls for a specific test with its own assumptions. Picking
the wrong test -- for example, using an unpaired t-test for before/after data on the
same regions -- wastes statistical power or, worse, gives invalid p-values.

### Decision tree

Ask yourself three questions:

1. **What is your question?** (comparing means, testing proportions, testing variances, testing regression coefficients)
2. **What type of data do you have?** (continuous, categorical/counts, paired)
3. **How many groups?** (one sample vs a reference value, two samples, more than two)

### Reference table

| Question | Test | Python function |
|---|---|---|
| Is a single mean different from a hypothesized value? | **One-sample t-test** | `scipy.stats.ttest_1samp` |
| Are two group means different? | **Two-sample t-test** (Welch's) | `scipy.stats.ttest_ind(equal_var=False)` |
| Are paired observations different on average? | **Paired t-test** | `scipy.stats.ttest_rel` |
| Are proportions / counts independent? | **Chi-squared test of independence** | `scipy.stats.chi2_contingency` |
| Does a distribution match a theoretical one? | **Chi-squared goodness-of-fit** | `scipy.stats.chisquare` |
| Do two groups have equal variance? | **Levene's test** (robust alternative to the variance-ratio F-test) | `scipy.stats.levene` |
| Are multiple group means all equal? | **ANOVA (F-test)** | `scipy.stats.f_oneway` |
| Is a single regression coefficient = 0? | **t-test** (from regression) | `res.summary()` t-stats |
| Are several regression coefficients jointly = 0? | **F-test** (from regression) | `res.f_test()` / `res.fvalue` |

We will work through each of these in the sections below.
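As a quick mnemonic, the reference table can be folded into a small lookup. This is only a sketch: `suggest_test`, `TEST_FOR_QUESTION`, and the string keys are hypothetical labels invented for illustration, not part of the project code.

```python
# Hypothetical lookup mirroring the reference table; keys are illustrative labels.
TEST_FOR_QUESTION = {
    ("mean", "one_sample"): "scipy.stats.ttest_1samp",
    ("mean", "two_independent"): "scipy.stats.ttest_ind(equal_var=False)",
    ("mean", "paired"): "scipy.stats.ttest_rel",
    ("mean", "three_or_more"): "scipy.stats.f_oneway",
    ("counts", "independence"): "scipy.stats.chi2_contingency",
    ("counts", "goodness_of_fit"): "scipy.stats.chisquare",
    ("variance", "two_independent"): "scipy.stats.levene",
}


def suggest_test(target: str, design: str) -> str:
    """Return the function named in the reference table for a (target, design) pair."""
    try:
        return TEST_FOR_QUESTION[(target, design)]
    except KeyError:
        raise ValueError(f"No entry for target={target!r}, design={design!r}") from None


print(suggest_test("mean", "paired"))
```

The lookup only encodes the table; it does not check assumptions such as independence or minimum expected counts, which still need your judgment.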

<a id="one-sample-t-test"></a>
## One-Sample t-Test

### Goal
Test whether the mean of a single sample differs from a known or hypothesized value.

### Why this matters in economics
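A central bank might target 2% annual GDP growth. An analyst asks: "Is the observed
mean GDP growth rate significantly different from 2%?" The one-sample t-test
formalizes this comparison, provided the series and the benchmark are expressed in
the same units (for example, both as annualized percent). The test statistic is: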

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized value (2.0),
$s$ is the sample standard deviation, and $n$ is the sample size.
Under $H_0$, and assuming independent, approximately normal observations (or a
reasonably large sample), $t$ follows a $t$-distribution with $n-1$ degrees of freedom.
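To see how the statistic turns into a p-value and a confidence interval, here is a minimal sketch on synthetic data (the sample `x` is simulated, not the notebook's GDP series). It converts $t$ into a two-sided p-value with the survival function of the $t$-distribution and checks the usual duality: $\mu_0$ lies inside the 95% interval exactly when $p > 0.05$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.5, scale=1.5, size=40)  # synthetic sample (assumption)
mu0 = 2.0
alpha = 0.05

n = x.size
se = x.std(ddof=1) / np.sqrt(n)          # standard error of the mean
t_stat = (x.mean() - mu0) / se

# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Matching (1 - alpha) confidence interval for the mean.
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_low, ci_high = x.mean() - t_crit * se, x.mean() + t_crit * se

res = stats.ttest_1samp(x, popmean=mu0)
print(f"t = {t_stat:.4f}, p (manual) = {p_manual:.4f}, p (scipy) = {res.pvalue:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
print("mu0 inside CI:", ci_low <= mu0 <= ci_high, "| p > alpha:", res.pvalue > alpha)
```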

### Your Turn

In [None]:
# TODO: Test whether mean GDP growth (QoQ) is significantly different from 2%.
# 1. Extract the GDP growth series and drop NaNs.
# 2. Run scipy.stats.ttest_1samp with popmean=2.0.
# 3. Print the t-statistic and p-value.

gdp_growth = df['gdp_growth_qoq'].dropna()

t_stat_1samp, p_value_1samp = ...

print(f'Sample mean:  {gdp_growth.mean():.4f}')
print(f't-statistic:  {t_stat_1samp:.4f}')
print(f'p-value:      {p_value_1samp:.4f}')
print(f'n:            {len(gdp_growth)}')

In [None]:
# TODO: Compute the t-statistic manually to verify.
# t = (xbar - mu0) / (s / sqrt(n))

mu0 = 2.0
xbar = gdp_growth.mean()
s = gdp_growth.std(ddof=1)  # sample std (Bessel's correction)
n = len(gdp_growth)

t_manual = ...

print(f'Manual t-stat: {t_manual:.4f}')
print(f'Scipy t-stat:  {t_stat_1samp:.4f}')
print(f'Match: {np.isclose(t_manual, t_stat_1samp)}')

In [None]:
# TODO: Construct a 95% confidence interval for the population mean GDP growth.
# CI = xbar +/- t_crit * (s / sqrt(n))
# Hint: t_crit = stats.t.ppf(0.975, df=n-1)

alpha = 0.05
t_crit = ...
margin = ...
ci_low = ...
ci_high = ...

print(f'95% CI for mean GDP growth: [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Does the CI contain 2.0? {ci_low <= 2.0 <= ci_high}')

**Interpretation prompt** (write 2-4 sentences below):
- Can we reject the null hypothesis that mean GDP growth equals 2% at the 5% level? Why or why not?
- Is the confidence interval consistent with the p-value? (If 2.0 is inside the CI, the p-value should be > 0.05.)
- What does the sign of the t-statistic tell you about the direction of the difference?

<a id="two-sample-t-test"></a>
## Two-Sample t-Test

### Goal
Test whether the means of two independent groups differ.

### Why this matters in economics
"Is mean GDP growth different during recession vs. non-recession quarters?" This is
the bread-and-butter comparison in empirical economics. The two-sample t-test compares
the means of two independent groups. **Welch's t-test** does not assume equal variances
in the two groups and is generally preferred, since economic volatility often changes
across regimes.

**Welch's t-statistic:**
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
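The formula above gives the statistic but not its degrees of freedom. Welch's test uses the Welch-Satterthwaite approximation for those. The sketch below, on simulated groups with deliberately unequal sizes and variances (both assumptions for illustration), computes the statistic and degrees of freedom by hand and compares them with `stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 3.0, size=15)  # small, volatile group (synthetic)
b = rng.normal(1.0, 1.0, size=60)  # large, calm group (synthetic)

# Per-group variance of the sample mean: s^2 / n.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

t_welch = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom (generally not an integer).
df_welch = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
p_welch = 2 * stats.t.sf(abs(t_welch), df=df_welch)

res = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t = {t_welch:.4f}, df = {df_welch:.2f}, p = {p_welch:.4f}")
print(f"scipy:  t = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```

The degrees of freedom fall between the smaller group's $n - 1$ and the pooled $n_1 + n_2 - 2$, which is why Welch's p-value can differ from the equal-variance version when the noisier group is also the smaller one.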

### Your Turn

In [None]:
# TODO: Split GDP growth into recession and non-recession quarters.
# Then run a two-sample t-test (Welch's).
# Hint: use df.loc[df['recession'] == 1, 'gdp_growth_qoq'] to filter.

gdp_recession = ...
gdp_expansion = ...

print(f'Recession quarters:     n={len(gdp_recession)}, mean={gdp_recession.mean():.4f}, std={gdp_recession.std():.4f}')
print(f'Non-recession quarters: n={len(gdp_expansion)}, mean={gdp_expansion.mean():.4f}, std={gdp_expansion.std():.4f}')

In [None]:
# TODO: Run Welch's two-sample t-test.
# Hint: stats.ttest_ind(x, y, equal_var=False)

t_stat_2samp, p_value_2samp = ...

print(f't-statistic: {t_stat_2samp:.4f}')
print(f'p-value:     {p_value_2samp:.4f}')

In [None]:
# TODO: For comparison, run the equal-variance version (equal_var=True).
# How much does the p-value change?

t_eq, p_eq = ...

print(f'Equal-variance t-test: t={t_eq:.4f}, p={p_eq:.4f}')
print(f'Welch t-test:          t={t_stat_2samp:.4f}, p={p_value_2samp:.4f}')
print(f'\nThe standard deviations are quite different ({gdp_recession.std():.4f} vs {gdp_expansion.std():.4f}),')
print('so Welch\'s test is more appropriate here.')

In [None]:
# TODO: Create side-by-side box plots comparing GDP growth in recession vs. expansion.
# Hint: df.boxplot(column='gdp_growth_qoq', by='recession')

fig, ax = plt.subplots(figsize=(7, 5))

...

ax.set_title('GDP Growth: Recession vs Expansion Quarters')
ax.set_xlabel('Recession (0 = No, 1 = Yes)')
ax.set_ylabel('GDP Growth (QoQ %)')
plt.suptitle('')  # remove auto-title from .boxplot()
plt.tight_layout()
plt.show()

**Interpretation prompt** (write 2-4 sentences below):
- Is the difference in mean GDP growth between recession and non-recession quarters statistically significant?
- Why is Welch's t-test preferred here over the equal-variance version?
- Do the box plots visually confirm what the test tells you numerically?

<a id="paired-t-test"></a>
## Paired t-Test

### Goal
Test whether the mean difference between paired observations is zero.

### Why this matters in economics
Imagine measuring an economic indicator in 10 regions **before** and **after** a policy
change. Each region serves as its own control. By computing the difference (after - before)
for each region, you remove region-level variation and focus on the treatment effect.
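A paired t-test is more powerful than an unpaired test when the pairing is meaningful,
because differencing removes the variation shared within each pair and so shrinks the
standard error in the denominator of the test statistic.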

**Test statistic:**
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where $d_i = x_{\text{after},i} - x_{\text{before},i}$, $\bar{d}$ is the mean of the
differences, $s_d$ is the standard deviation of the differences, and $n$ is the number
of pairs.
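Put differently, a paired t-test is a one-sample t-test on the differences against zero. A minimal sketch on simulated before/after values (the numbers are made up and use a different seed from the exercise below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
baseline = rng.normal(3.0, 2.0, size=12)             # unit-level baseline (synthetic)
before = baseline + rng.normal(0.0, 0.3, size=12)
after = baseline + 0.8 + rng.normal(0.0, 0.5, size=12)

d = after - before
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(d, popmean=0.0)

print(f"manual t:            {t_manual:.4f}")
print(f"ttest_rel:           t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"ttest_1samp on diff: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

All three statistics should agree up to floating-point error, since they are the same calculation.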

### Your Turn

In [None]:
# We simulate paired data: an economic indicator measured in 10 regions
# before and after a policy intervention.
# The true effect is a small positive shift of +0.8 percentage points,
# but there is substantial region-level variation.

np.random.seed(42)
n_regions = 10
region_baseline = np.random.normal(loc=3.0, scale=2.0, size=n_regions)  # baseline differs by region
effect = 0.8  # true policy effect
noise = np.random.normal(0, 0.5, size=n_regions)

before = region_baseline + np.random.normal(0, 0.3, size=n_regions)
after = region_baseline + effect + noise

paired_df = pd.DataFrame({
    'region': [f'Region_{i+1}' for i in range(n_regions)],
    'before': before,
    'after': after,
})
paired_df['difference'] = paired_df['after'] - paired_df['before']
paired_df

In [None]:
# TODO: Run a paired t-test on the before/after values.
# Hint: stats.ttest_rel(after, before)
# This tests H0: mean(after - before) = 0.

t_stat_paired, p_value_paired = ...

print(f'Mean difference:  {paired_df["difference"].mean():.4f}')
print(f'Std of differences: {paired_df["difference"].std(ddof=1):.4f}')
print(f't-statistic:      {t_stat_paired:.4f}')
print(f'p-value:          {p_value_paired:.4f}')

In [None]:
# TODO: Compare with an unpaired (independent) two-sample t-test.
# Notice how the unpaired test has a higher p-value (less power)
# because it cannot account for region-level variation.

t_unpaired, p_unpaired = ...

print('--- Paired t-test ---')
print(f'  t = {t_stat_paired:.4f}, p = {p_value_paired:.4f}')
print()
print('--- Unpaired (independent) t-test ---')
print(f'  t = {t_unpaired:.4f}, p = {p_unpaired:.4f}')
print()
print('The paired test is more powerful because it removes region-level variation.')

**Interpretation prompt** (write 2-4 sentences below):
- Is the mean difference statistically significant at the 5% level using the paired test?
- How does the p-value change when you use the unpaired test? Why?
- In what real-world economic studies would paired data arise naturally? (Think: same country before/after a trade agreement, same firm before/after a regulation.)

<a id="chi-squared-test-of-independence"></a>
## Chi-Squared Test of Independence

### Goal
Test whether two categorical variables are independent using a contingency table.

### Why this matters in economics
"Is there an association between recession quarters and the direction of interest rate
changes?" We can cross-tabulate recession status (yes/no) against whether the Fed
raised, lowered, or held rates. The chi-squared test of independence tells us whether
the observed pattern differs from what we would expect if the two variables were
unrelated.

**Test statistic:**
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed count and $E_{ij}$ is the expected count under independence.
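Under independence, the expected count in each cell is (row total x column total) / grand total. The sketch below applies that to a made-up 2x3 table (the counts are invented, not taken from the sample data), rebuilds the statistic by hand, and flags cells with expected counts below 5.

```python
import numpy as np
from scipy import stats

# Made-up counts: rows = recession status, columns = cut / hold / hike.
observed = np.array([[30, 45, 25],
                     [12,  5,  3]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2, p, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
small_cells = int((expected < 5).sum())

print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
print(f"scipy:  chi2 = {chi2:.4f}, dof = {dof_scipy}, p = {p:.4f}")
print(f"cells with expected count < 5: {small_cells}")
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so the hand calculation only matches there if you pass `correction=False`.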

### Your Turn

In [None]:
# First, create a categorical variable for the direction of interest rate changes.
# We compute the quarter-over-quarter change in FEDFUNDS and classify it as
# 'cut', 'hold', or 'hike'.

df_chi = df[['FEDFUNDS', 'recession']].dropna().copy()
df_chi['rate_change'] = df_chi['FEDFUNDS'].diff()
df_chi = df_chi.dropna()

# Classify direction: cut (<= -0.1), hold (-0.1 < change <= +0.1), hike (> +0.1)
# (pd.cut intervals are closed on the right by default)
df_chi['direction'] = pd.cut(
    df_chi['rate_change'],
    bins=[-np.inf, -0.1, 0.1, np.inf],
    labels=['cut', 'hold', 'hike']
)

print(df_chi['direction'].value_counts())
print(f'\nRecession quarters: {int(df_chi["recession"].sum())}')

In [None]:
# TODO: Create a contingency table (cross-tabulation) of recession x direction.
# Hint: pd.crosstab(df_chi['recession'], df_chi['direction'])

contingency_table = ...

print('Contingency Table:')
print(contingency_table)

In [None]:
# TODO: Run the chi-squared test of independence.
# Hint: stats.chi2_contingency(contingency_table)
# Returns: chi2, p, dof, expected_freq

chi2_stat, p_value_chi2, dof_chi2, expected_freq = ...

print(f'Chi-squared statistic: {chi2_stat:.4f}')
print(f'p-value:               {p_value_chi2:.4f}')
print(f'Degrees of freedom:    {dof_chi2}')
print(f'\nExpected frequencies (under independence):')
print(pd.DataFrame(expected_freq,
                    index=contingency_table.index,
                    columns=contingency_table.columns).round(2))

In [None]:
# TODO: Visualize the contingency table as a heatmap.
# Hint: plt.imshow() or use ax.pcolormesh(), then add annotations.

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: observed counts
im0 = axes[0].imshow(contingency_table.values, cmap='Blues', aspect='auto')
axes[0].set_xticks(range(len(contingency_table.columns)))
axes[0].set_xticklabels(contingency_table.columns)
axes[0].set_yticks(range(len(contingency_table.index)))
axes[0].set_yticklabels(['Expansion', 'Recession'])
axes[0].set_title('Observed Counts')
for i in range(contingency_table.shape[0]):
    for j in range(contingency_table.shape[1]):
        axes[0].text(j, i, str(contingency_table.values[i, j]),
                     ha='center', va='center', fontsize=14, fontweight='bold')
plt.colorbar(im0, ax=axes[0])

# Right: expected counts
...

plt.tight_layout()
plt.show()

**Interpretation prompt** (write 2-4 sentences below):
- Is there a statistically significant association between recession status and the direction of rate changes?
- Are any expected cell counts below 5? If so, the chi-squared approximation may be unreliable.
- Substantively, does it make economic sense that rate changes and recessions are (or are not) associated?

<a id="chi-squared-goodness-of-fit"></a>
## Chi-Squared Goodness-of-Fit

### Goal
Test whether observed data follows a specific theoretical distribution.

### Why this matters in economics
Many econometric techniques assume normality of errors. "Is GDP growth approximately
normally distributed?" is a practical question. The chi-squared goodness-of-fit test
bins the data, computes expected frequencies under a normal distribution with the
same mean and standard deviation, and checks whether the observed bin counts match.
We also compare results with the Shapiro-Wilk test, which is a more powerful
test of normality for moderate sample sizes.
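One practical wrinkle: `scipy.stats.chisquare` expects the observed and expected counts to have the same total. If the bins only span the observed range, the normal probabilities sum to slightly less than 1 and the totals disagree. A sketch of one way around this, on synthetic data (an assumption, not the notebook's series), is to treat the two outer bins as open-ended tails:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.6, 1.0, size=200)  # synthetic stand-in for a growth series

n_bins = 8
observed, edges = np.histogram(x, bins=n_bins)

# Open the two outer bins so the fitted normal probabilities sum to 1.
open_edges = edges.astype(float).copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
probs = np.diff(stats.norm.cdf(open_edges, loc=x.mean(), scale=x.std(ddof=1)))
expected = probs * x.size

# ddof=2 because the mean and std were estimated from the same data.
chi2_stat, p_value = stats.chisquare(observed, f_exp=expected, ddof=2)

print("Observed:", observed)
print("Expected:", np.round(expected, 2))
print(f"chi2 = {chi2_stat:.4f}, p = {p_value:.4f}, dof = {n_bins - 1 - 2}")
```

Rescaling the expected counts so they sum to `n` is another common choice; either way, tail bins with very small expected counts may need to be merged with their neighbors.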

### Your Turn

In [None]:
# TODO: Bin GDP growth data and compute expected frequencies under normality.
# Steps:
# 1. Compute the histogram (observed counts) with ~8 bins.
# 2. For each bin, compute the expected count from a normal distribution
#    with the same mean and std as the data.
# 3. Run scipy.stats.chisquare(observed, expected).

gdp_gof = df['gdp_growth_qoq'].dropna()
mu_hat = gdp_gof.mean()
sigma_hat = gdp_gof.std()
n_obs = len(gdp_gof)

# Create bins and compute observed counts
n_bins = 8
observed_counts, bin_edges = np.histogram(gdp_gof, bins=n_bins)

# Compute expected counts under a normal distribution
# For each bin, expected proportion = CDF(right_edge) - CDF(left_edge)
expected_probs = ...
expected_counts = ...

print('Bin edges:', np.round(bin_edges, 2))
print('Observed: ', observed_counts)
print('Expected: ', np.round(expected_counts, 2))

In [None]:
# TODO: Run the chi-squared goodness-of-fit test.
# Note: we estimated 2 parameters (mean, std) from the data, so the
# true degrees of freedom = n_bins - 1 - 2 = n_bins - 3.
# scipy.stats.chisquare does not adjust for estimated parameters automatically,
# so we pass ddof=2 to account for the 2 estimated parameters.

chi2_gof, p_value_gof = ...

print(f'Chi-squared statistic: {chi2_gof:.4f}')
print(f'p-value:               {p_value_gof:.4f}')
print(f'Degrees of freedom:    {n_bins - 1 - 2}')

In [None]:
# TODO: Complement with the Shapiro-Wilk test for normality.
# Hint: stats.shapiro(gdp_gof)
# The Shapiro-Wilk test is generally more powerful than the chi-squared
# goodness-of-fit test for testing normality.

shapiro_stat, shapiro_p = ...

print(f'Shapiro-Wilk statistic: {shapiro_stat:.4f}')
print(f'Shapiro-Wilk p-value:   {shapiro_p:.4f}')
print()
print('Comparison:')
print(f'  Chi-squared GOF p-value: {p_value_gof:.4f}')
print(f'  Shapiro-Wilk p-value:    {shapiro_p:.4f}')

**Interpretation prompt** (write 2-4 sentences below):
- Do both tests agree on whether GDP growth is normally distributed?
- If expected counts in some bins are very small (< 5), how might that affect the chi-squared result?
- Why might the Shapiro-Wilk test give a different conclusion than the chi-squared goodness-of-fit?

<a id="f-test-for-equality-of-variances"></a>
## F-Test for Equality of Variances

### Goal
Test whether two groups have equal variance.

### Why this matters in economics
"Is GDP growth more volatile during recessions?" is a question about variances.
If the variance of GDP growth differs between recession and expansion quarters,
this has implications for risk modeling and for the validity of tests that assume
equal variances. In regression, unequal variance of errors across groups is called
**heteroskedasticity** -- a key diagnostic you will encounter in the regression
module (02_regression/04a).

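The classic F-test for equal variances ($F = s_1^2 / s_2^2$) is very sensitive to
non-normality. **Levene's test** is more robust because it runs an ANOVA on the
absolute deviations from each group's center rather than comparing sample variances
directly. SciPy's `levene` centers on the group median by default (`center='median'`,
the Brown-Forsythe variant); the original Levene test centers on the group mean.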
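The sketch below compares the two centering choices and the classic variance-ratio test on simulated heavy-tailed data (Student-t draws with 3 degrees of freedom, chosen purely for illustration). It does not claim a particular outcome; run it and see how far the three p-values sit from each other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
calm = rng.standard_t(df=3, size=80)            # heavy-tailed, scale 1 (synthetic)
volatile = 2.0 * rng.standard_t(df=3, size=80)  # same shape, scale 2 (synthetic)

# Default: deviations from the group median (Brown-Forsythe variant).
bf_stat, bf_p = stats.levene(calm, volatile)
# Original Levene: deviations from the group mean.
lev_stat, lev_p = stats.levene(calm, volatile, center='mean')

# Classic variance-ratio F-test, two-sided.
F = volatile.var(ddof=1) / calm.var(ddof=1)
dfn, dfd = volatile.size - 1, calm.size - 1
p_classic = 2 * min(stats.f.cdf(F, dfn, dfd), stats.f.sf(F, dfn, dfd))

print(f"Levene (median): stat = {bf_stat:.4f}, p = {bf_p:.4f}")
print(f"Levene (mean):   stat = {lev_stat:.4f}, p = {lev_p:.4f}")
print(f"Classic F:       F = {F:.4f}, p = {p_classic:.4f}")
```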

### Your Turn

In [None]:
# TODO: Compute the variance of GDP growth in recession vs. expansion quarters.
# Then run Levene's test for equality of variances.

gdp_rec = df.loc[df['recession'] == 1, 'gdp_growth_qoq'].dropna()
gdp_exp = df.loc[df['recession'] == 0, 'gdp_growth_qoq'].dropna()

var_rec = ...
var_exp = ...

print(f'Variance (recession):  {var_rec:.4f}  (std = {np.sqrt(var_rec):.4f})')
print(f'Variance (expansion):  {var_exp:.4f}  (std = {np.sqrt(var_exp):.4f})')
print(f'Ratio (recession/expansion): {var_rec / var_exp:.4f}')

In [None]:
# TODO: Run Levene's test.
# Hint: stats.levene(gdp_rec, gdp_exp)

levene_stat, levene_p = ...

print(f'Levene\'s test statistic: {levene_stat:.4f}')
print(f'p-value:                  {levene_p:.4f}')
print()
if levene_p < 0.05:
    print('Reject H0: variances are significantly different at alpha=0.05.')
    print('This suggests GDP growth is more (or less) volatile during recessions.')
else:
    print('Fail to reject H0: no significant difference in variances at alpha=0.05.')

In [None]:
# For reference: the classic F-test (less robust, but instructive).
# F = s1^2 / s2^2, compared to the F-distribution with (n1-1, n2-1) df.

F_classic = var_rec / var_exp
df1 = len(gdp_rec) - 1
df2 = len(gdp_exp) - 1

# Two-tailed p-value
p_classic = 2 * min(
    stats.f.cdf(F_classic, df1, df2),
    1 - stats.f.cdf(F_classic, df1, df2)
)

print(f'Classic F-statistic: {F_classic:.4f}')
print(f'Classic F p-value:   {p_classic:.4f}')
print(f'Levene p-value:      {levene_p:.4f}')
print()
print('Levene\'s test is preferred because it is robust to non-normality.')

**Interpretation prompt** (write 2-4 sentences below):
- Is GDP growth significantly more volatile during recessions?
- Do Levene's test and the classic F-test agree? Would you expect them to differ if the data were non-normal?
- How does this result connect to heteroskedasticity in regression? (Hint: if error variance differs across subgroups, the usual OLS standard errors are biased, even though the coefficient estimates themselves remain unbiased.)

<a id="f-test-in-regression"></a>
## F-Test in Regression (Joint Significance)

### Goal
Read and interpret the F-statistic from a regression summary, and test a subset
of coefficients for joint significance.

### Why this matters in economics
Every OLS regression summary includes an F-statistic that tests whether the predictors
are jointly significant (i.e., $H_0$: all slope coefficients = 0, against the
alternative that at least one is nonzero).
Beyond the overall F-test, you can test whether a **subset** of coefficients is
jointly zero using `res.f_test()`. This is essential for questions like:
"Are these three lag variables jointly significant, even if none of them is
individually significant?"

**Overall F-statistic:**
$$F = \frac{(\text{TSS} - \text{RSS}) / k}{\text{RSS} / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$$

where $k$ is the number of predictors, $n$ is the sample size, TSS is total sum of squares,
and RSS is residual sum of squares.
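The two forms of the formula are algebraically identical because $R^2 = 1 - \text{RSS}/\text{TSS}$. A minimal numerical check, using plain NumPy least squares on simulated data (the coefficients and sample size are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 120, 3
X = rng.normal(size=(n, k))                        # synthetic predictors
y = 0.5 + X @ np.array([0.4, 0.0, -0.3]) + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])        # add the intercept column
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
resid = y - X_design @ beta

rss = resid @ resid
tss = ((y - y.mean()) ** 2).sum()
r2 = 1 - rss / tss

f_from_ss = ((tss - rss) / k) / (rss / (n - k - 1))
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
p_value = stats.f.sf(f_from_ss, k, n - k - 1)

print(f"F from sums of squares: {f_from_ss:.4f}")
print(f"F from R-squared:       {f_from_r2:.4f}")
print(f"p-value:                {p_value:.6f}")
```

With an intercept and the default (non-robust) covariance, these values should correspond to `model.fvalue` and `model.f_pvalue` in a statsmodels fit of the same data.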

### Your Turn

In [None]:
import statsmodels.api as sm

# TODO: Fit an OLS regression predicting GDP growth from several macro indicators.
# Use UNRATE, FEDFUNDS, and T10Y2Y as predictors.
# Hint:
#   X = df[['UNRATE', 'FEDFUNDS', 'T10Y2Y']].dropna()
#   X = sm.add_constant(X)
#   y = df.loc[X.index, 'gdp_growth_qoq']
#   model = sm.OLS(y, X).fit()

predictors = ['UNRATE', 'FEDFUNDS', 'T10Y2Y']
reg_df = df[predictors + ['gdp_growth_qoq']].dropna()

X = ...
y = ...

model = ...

print(model.summary())

In [None]:
# TODO: Extract and interpret the overall F-statistic and its p-value.
# Hint: model.fvalue, model.f_pvalue

f_overall = ...
f_overall_p = ...

print(f'Overall F-statistic: {f_overall:.4f}')
print(f'Overall F p-value:   {f_overall_p:.6f}')
print()
print('This tests H0: all slope coefficients = 0 (the model has no explanatory power).')
if f_overall_p < 0.05:
    print('We reject H0: the predictors are jointly significant at alpha=0.05.')
else:
    print('We fail to reject H0: the predictors are not jointly significant at alpha=0.05.')

In [None]:
# Look at individual t-statistics for each coefficient.
# Some may be individually insignificant even though the overall F-test is significant.

coef_table = pd.DataFrame({
    'coefficient': model.params,
    't_stat': model.tvalues,
    'p_value': model.pvalues,
})
coef_table['significant_5pct'] = coef_table['p_value'] < 0.05
print(coef_table.round(4))

In [None]:
# TODO: Test a subset of coefficients for joint significance.
# Test H0: coefficients on FEDFUNDS and T10Y2Y are both zero.
# Hint: model.f_test('FEDFUNDS = 0, T10Y2Y = 0')
# or: model.f_test(np.array([[0, 0, 1, 0], [0, 0, 0, 1]]))

joint_test = ...

print('Joint F-test: H0: FEDFUNDS = 0 AND T10Y2Y = 0')
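# Depending on the statsmodels version, fvalue may be a 2-D array or a scalar;
# np.squeeze handles both.
print(f'F-statistic: {float(np.squeeze(joint_test.fvalue)):.4f}')
print(f'p-value:     {float(joint_test.pvalue):.4f}')
print()
print('This answers: "Are FEDFUNDS and T10Y2Y jointly significant,')
print('even if individually one or both are not?"')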

**Interpretation prompt** (write 2-4 sentences below):
- Is the overall F-test significant? What does that tell you about the model as a whole?
- Are all individual coefficients significant? If not, does that contradict the overall F-test?
- What does the joint F-test on FEDFUNDS and T10Y2Y tell you that the individual t-tests do not?
- Why might variables be jointly significant but individually insignificant? (Hint: think about multicollinearity.)

## Where This Shows Up Later

The tests in this notebook are not isolated exercises -- they appear throughout the rest
of the project:

- **t-tests on regression coefficients**: Every regression summary (`res.summary()`) reports
  a t-statistic and p-value for each coefficient. You will read these in every notebook
  from `02_regression` onward.

- **F-tests in regression summaries**: The overall F-statistic tests whether your regression
  model explains anything at all. Joint F-tests let you compare nested models
  (e.g., "does adding these 3 variables improve the fit?").

- **Chi-squared tests in diagnostics**: The Breusch-Pagan test for heteroskedasticity
  (`02_regression/04a_residual_diagnostics`) uses a chi-squared statistic. The Ljung-Box
  test for serial correlation also uses chi-squared.

- **ADF test for unit roots**: The Augmented Dickey-Fuller test (`07_time_series_econ`)
  uses a t-type statistic whose null distribution is not the standard t-distribution,
  so it relies on special (Dickey-Fuller) critical values to test whether a
  time series has a unit root (is non-stationary).

- **Hausman test**: Comparing fixed and random effects estimators (`06_causal/01a`) uses
  a chi-squared test statistic.

Understanding *which* test is being applied and *why* will help you interpret results
across all these contexts.
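The nested-model comparison mentioned in the list above can be sketched with restricted and unrestricted residual sums of squares: $F = \frac{(\text{RSS}_r - \text{RSS}_u)/q}{\text{RSS}_u/(n - k - 1)}$, where $q$ is the number of excluded coefficients. The example uses NumPy only and simulated data; the helper `rss` and the variable names are hypothetical. For exclusion restrictions with the default (non-robust) covariance, this should agree with what `res.f_test()` reports.

```python
import numpy as np
from scipy import stats


def rss(X: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


rng = np.random.default_rng(6)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + 0.1 * rng.normal(size=n)   # deliberately collinear with x2
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.3 * x3 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2, x3])   # unrestricted model
X_restricted = np.column_stack([ones, x1])     # H0: coefficients on x2 and x3 are 0

q = X_full.shape[1] - X_restricted.shape[1]    # number of restrictions
df_resid = n - X_full.shape[1]                 # n - k - 1

rss_u, rss_r = rss(X_full, y), rss(X_restricted, y)
F = ((rss_r - rss_u) / q) / (rss_u / df_resid)
p_value = stats.f.sf(F, q, df_resid)

print(f"RSS restricted = {rss_r:.2f}, RSS unrestricted = {rss_u:.2f}")
print(f"F({q}, {df_resid}) = {F:.4f}, p = {p_value:.6f}")
```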

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run these asserts to verify your work. If any fail, go back and fix the corresponding section.

In [None]:
# ---- One-sample t-test ----
assert isinstance(t_stat_1samp, float), 't_stat_1samp should be a float'
assert isinstance(p_value_1samp, float), 'p_value_1samp should be a float'
assert 0 <= p_value_1samp <= 1, 'p-value must be between 0 and 1'
assert np.isclose(t_manual, t_stat_1samp, atol=1e-6), 'Manual and scipy t-stats should match'

# ---- Two-sample t-test ----
assert isinstance(t_stat_2samp, float), 't_stat_2samp should be a float'
assert isinstance(p_value_2samp, float), 'p_value_2samp should be a float'
assert 0 <= p_value_2samp <= 1, 'p-value must be between 0 and 1'

# ---- Paired t-test ----
assert isinstance(t_stat_paired, (float, np.floating)), 't_stat_paired should be numeric'
assert isinstance(p_value_paired, (float, np.floating)), 'p_value_paired should be numeric'
assert p_value_paired < p_unpaired, 'Paired test should have lower p-value than unpaired (more power)'

# ---- Chi-squared test of independence ----
assert chi2_stat >= 0, 'Chi-squared statistic must be non-negative'
assert 0 <= p_value_chi2 <= 1, 'p-value must be between 0 and 1'
assert dof_chi2 > 0, 'Degrees of freedom must be positive'

# ---- F-test in regression ----
assert f_overall > 0, 'F-statistic must be positive'
assert 0 <= f_overall_p <= 1, 'F p-value must be between 0 and 1'

# ---- Levene's test ----
assert levene_stat >= 0, 'Levene statistic must be non-negative'
assert 0 <= levene_p <= 1, 'Levene p-value must be between 0 and 1'

print('All checkpoint assertions passed.')

## Extensions (Optional)
- Run a **one-way ANOVA** using `scipy.stats.f_oneway` to compare GDP growth across three or more groups
  (e.g., bin quarters by unemployment level: low/medium/high).
- Apply the **Kolmogorov-Smirnov test** (`scipy.stats.kstest`) as an alternative to the chi-squared
  goodness-of-fit for testing normality. Compare the results.
- Investigate how **sample size** affects the power of the one-sample t-test by subsampling
  the data at different sizes and tracking how the p-value changes.
- Explore **Fisher's exact test** (`scipy.stats.fisher_exact`) as an alternative to the chi-squared
  test when expected cell counts are small.

## Reflection
- Which test was most intuitive to you? Which was most surprising in its result?
- In your own words, what is the difference between the F-test for equal variances and the F-test in regression?
- When you see a p-value of 0.06, how would you communicate the result to a non-technical audience?
- Think of an economic question you care about. Which test from this notebook would you use to answer it?

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: One-sample t-test</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- One-sample t-test
gdp_growth = df['gdp_growth_qoq'].dropna()

# Scipy version
t_stat_1samp, p_value_1samp = stats.ttest_1samp(gdp_growth, popmean=2.0)

print(f'Sample mean:  {gdp_growth.mean():.4f}')
print(f't-statistic:  {t_stat_1samp:.4f}')
print(f'p-value:      {p_value_1samp:.4f}')

# Manual version
mu0 = 2.0
xbar = gdp_growth.mean()
s = gdp_growth.std(ddof=1)
n = len(gdp_growth)
t_manual = (xbar - mu0) / (s / np.sqrt(n))

# Confidence interval
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin = t_crit * (s / np.sqrt(n))
ci_low = xbar - margin
ci_high = xbar + margin
print(f'95% CI: [{ci_low:.4f}, {ci_high:.4f}]')
```

</details>

<details><summary>Solution: Two-sample t-test</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- Two-sample t-test
gdp_recession = df.loc[df['recession'] == 1, 'gdp_growth_qoq'].dropna()
gdp_expansion = df.loc[df['recession'] == 0, 'gdp_growth_qoq'].dropna()

# Welch's t-test (unequal variances)
t_stat_2samp, p_value_2samp = stats.ttest_ind(
    gdp_recession, gdp_expansion, equal_var=False
)

# Equal-variance version for comparison
t_eq, p_eq = stats.ttest_ind(
    gdp_recession, gdp_expansion, equal_var=True
)

print(f'Welch:          t={t_stat_2samp:.4f}, p={p_value_2samp:.4f}')
print(f'Equal-variance: t={t_eq:.4f}, p={p_eq:.4f}')

# Box plot
fig, ax = plt.subplots(figsize=(7, 5))
df.boxplot(column='gdp_growth_qoq', by='recession', ax=ax)
ax.set_title('GDP Growth: Recession vs Expansion Quarters')
ax.set_xlabel('Recession (0 = No, 1 = Yes)')
ax.set_ylabel('GDP Growth (QoQ %)')
plt.suptitle('')
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Paired t-test</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- Paired t-test
t_stat_paired, p_value_paired = stats.ttest_rel(after, before)

# Compare with unpaired
t_unpaired, p_unpaired = stats.ttest_ind(after, before, equal_var=False)

print(f'Paired:   t={t_stat_paired:.4f}, p={p_value_paired:.4f}')
print(f'Unpaired: t={t_unpaired:.4f}, p={p_unpaired:.4f}')
```

</details>
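
A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which is why it can be more powerful than the unpaired version when pairs share a common baseline. The sketch below illustrates this on synthetic data; `baseline`, `before_demo`, and `after_demo` are hypothetical names and values, not the notebook's `before` and `after`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic paired data (hypothetical): each unit has its own baseline level,
# plus a small common shift of 0.4 between the two measurements
baseline = rng.normal(loc=10.0, scale=3.0, size=30)
before_demo = baseline + rng.normal(scale=0.5, size=30)
after_demo = baseline + 0.4 + rng.normal(scale=0.5, size=30)

# Paired test and one-sample test on the differences give the same answer
t_rel, p_rel = stats.ttest_rel(after_demo, before_demo)
t_diff, p_diff = stats.ttest_1samp(after_demo - before_demo, popmean=0.0)

# The unpaired (Welch) test ignores the pairing, so between-unit variation
# in the baseline ends up in its standard error
t_ind, p_ind = stats.ttest_ind(after_demo, before_demo, equal_var=False)

print(f'Paired:               t={t_rel:.4f}, p={p_rel:.4f}')
print(f'One-sample on diffs:  t={t_diff:.4f}, p={p_diff:.4f}')
print(f'Unpaired (Welch):     t={t_ind:.4f}, p={p_ind:.4f}')
```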

<details><summary>Solution: Chi-squared test of independence</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- Chi-squared independence
contingency_table = pd.crosstab(df_chi['recession'], df_chi['direction'])

chi2_stat, p_value_chi2, dof_chi2, expected_freq = stats.chi2_contingency(
    contingency_table
)

print(f'Chi-squared: {chi2_stat:.4f}')
print(f'p-value:     {p_value_chi2:.4f}')
print(f'DOF:         {dof_chi2}')

# Heatmap of expected counts
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Observed (already shown in the notebook)
im0 = axes[0].imshow(contingency_table.values, cmap='Blues', aspect='auto')
axes[0].set_xticks(range(len(contingency_table.columns)))
axes[0].set_xticklabels(contingency_table.columns)
axes[0].set_yticks(range(len(contingency_table.index)))
axes[0].set_yticklabels(['Expansion', 'Recession'])
axes[0].set_title('Observed Counts')
for i in range(contingency_table.shape[0]):
    for j in range(contingency_table.shape[1]):
        axes[0].text(j, i, str(contingency_table.values[i, j]),
                     ha='center', va='center', fontsize=14, fontweight='bold')
plt.colorbar(im0, ax=axes[0])

# Expected
im1 = axes[1].imshow(expected_freq, cmap='Oranges', aspect='auto')
axes[1].set_xticks(range(len(contingency_table.columns)))
axes[1].set_xticklabels(contingency_table.columns)
axes[1].set_yticks(range(len(contingency_table.index)))
axes[1].set_yticklabels(['Expansion', 'Recession'])
axes[1].set_title('Expected Counts (under independence)')
for i in range(expected_freq.shape[0]):
    for j in range(expected_freq.shape[1]):
        axes[1].text(j, i, f'{expected_freq[i, j]:.1f}',
                     ha='center', va='center', fontsize=14, fontweight='bold')
plt.colorbar(im1, ax=axes[1])

plt.tight_layout()
plt.show()
```

</details>
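
To see where the expected counts come from, and to check the "expected counts below 5" pitfall, the calculation can be reproduced by hand. This is a sketch on a hypothetical 2x2 table (the numbers are illustrative, not the notebook's `contingency_table`). It passes `correction=False` so that SciPy skips Yates' continuity correction and matches the textbook formula.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts (rows: group, columns: outcome)
observed = np.array([[30, 20],
                     [4, 6]])

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n_total = observed.sum()
expected_manual = row_totals * col_totals / n_total

# Pearson chi-squared statistic and its p-value
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, df=dof_manual)

# SciPy without the continuity correction should agree with the hand calculation
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

# Rule-of-thumb check: count the cells whose expected count is below 5
small_cells = int((expected_manual < 5).sum())

print(f'Manual: chi2={chi2_manual:.4f}, p={p_manual:.4f}, dof={dof_manual}')
print(f'SciPy:  chi2={chi2_scipy:.4f}, p={p_scipy:.4f}, dof={dof_scipy}')
print(f'Cells with expected count < 5: {small_cells}')
```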

<details><summary>Solution: Chi-squared goodness-of-fit</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- Chi-squared goodness-of-fit
gdp_gof = df['gdp_growth_qoq'].dropna()
mu_hat = gdp_gof.mean()
sigma_hat = gdp_gof.std(ddof=1)  # sample standard deviation (pandas default, made explicit)
n_obs = len(gdp_gof)

n_bins = 8
observed_counts, bin_edges = np.histogram(gdp_gof, bins=n_bins)

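# Expected probabilities from the normal CDF. The histogram bins only span
# [min, max] of the data, so these probabilities sum to slightly less than 1.
# Rescale them so the expected counts sum to n_obs; recent SciPy versions raise
# a ValueError when observed and expected totals differ.
expected_probs = np.diff(stats.norm.cdf(bin_edges, loc=mu_hat, scale=sigma_hat))
expected_counts = expected_probs / expected_probs.sum() * n_obs

# Chi-squared test with ddof=2 (estimated mean and std), so dof = n_bins - 1 - 2
chi2_gof, p_value_gof = stats.chisquare(observed_counts, f_exp=expected_counts, ddof=2)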

# Shapiro-Wilk
shapiro_stat, shapiro_p = stats.shapiro(gdp_gof)

print(f'Chi-sq GOF:    chi2={chi2_gof:.4f}, p={p_value_gof:.4f}')
print(f'Shapiro-Wilk:  W={shapiro_stat:.4f}, p={shapiro_p:.4f}')
```

</details>

<details><summary>Solution: F-test for equality of variances</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- F-test for variances
gdp_rec = df.loc[df['recession'] == 1, 'gdp_growth_qoq'].dropna()
gdp_exp = df.loc[df['recession'] == 0, 'gdp_growth_qoq'].dropna()

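var_rec = gdp_rec.var(ddof=1)
var_exp = gdp_exp.var(ddof=1)

# Classical F-test: ratio of sample variances (sensitive to non-normality)
f_stat = var_rec / var_exp
df_rec, df_exp = len(gdp_rec) - 1, len(gdp_exp) - 1
# Two-sided p-value: double the smaller tail probability
p_f = 2 * min(stats.f.cdf(f_stat, df_rec, df_exp), stats.f.sf(f_stat, df_rec, df_exp))
print(f'F-test:      F={f_stat:.4f}, p={p_f:.4f}')

# Levene's test: more robust to departures from normality
levene_stat, levene_p = stats.levene(gdp_rec, gdp_exp)
print(f'Levene test: stat={levene_stat:.4f}, p={levene_p:.4f}')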
```

</details>

<details><summary>Solution: F-test in regression</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 06_common_statistical_tests -- F-test in regression
import statsmodels.api as sm

predictors = ['UNRATE', 'FEDFUNDS', 'T10Y2Y']
reg_df = df[predictors + ['gdp_growth_qoq']].dropna()

X = sm.add_constant(reg_df[predictors])
y = reg_df['gdp_growth_qoq']

model = sm.OLS(y, X).fit()
print(model.summary())

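# Overall F-test (H0: all slope coefficients are zero)
f_overall = model.fvalue
f_overall_p = model.f_pvalue
print(f'Overall F: {f_overall:.4f}, p={f_overall_p:.4f}')

# Joint test: FEDFUNDS = 0 AND T10Y2Y = 0
joint_test = model.f_test('FEDFUNDS = 0, T10Y2Y = 0')
# fvalue/pvalue may be scalars or arrays depending on the statsmodels version
joint_f = float(np.squeeze(joint_test.fvalue))
joint_p = float(np.squeeze(joint_test.pvalue))
print(f'Joint F: {joint_f:.4f}, p={joint_p:.4f}')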
```

</details>
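
To connect the overall regression F-statistic to quantities you already know, it can be rebuilt from R-squared: F = (R² / k) / ((1 - R²) / (n - k - 1)), where k is the number of slope coefficients. The sketch below uses synthetic data and plain NumPy least squares rather than the notebook's model, so all names and numbers are hypothetical. On a fitted statsmodels OLS model with an intercept, the same calculation should agree with `model.fvalue`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): n observations, k predictors,
# only the first predictor actually affects y
n, k = 120, 3
X_raw = rng.normal(size=(n, k))
y_demo = 0.5 + 0.8 * X_raw[:, 0] + rng.normal(size=n)

# OLS with an intercept via least squares
X_design = np.column_stack([np.ones(n), X_raw])
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
resid = y_demo - X_design @ beta_hat

# R-squared from residual and total sums of squares
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Overall F-statistic: H0 says all k slope coefficients are zero
f_manual = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_manual = stats.f.sf(f_manual, k, n - k - 1)

print(f'R-squared: {r_squared:.4f}')
print(f'Overall F: {f_manual:.4f}, p={p_manual:.4g}')
```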