# 07 Correlation and Covariance

Measuring and visualizing relationships between variables — and understanding why correlation is not causation.

## Table of Contents
- [Covariance: measuring joint variability](#covariance)
- [Pearson correlation](#pearson-correlation)
- [Spearman rank correlation](#spearman-rank-correlation)
- [Visualizing relationships](#visualizing-relationships)
- [Spurious correlation](#spurious-correlation)
- [Correlation does not imply causation](#correlation-does-not-imply-causation)
- [The covariance matrix and regression](#the-covariance-matrix-and-regression)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Correlation analysis is the gateway to regression. Before you fit any model, you should
understand how your variables relate to each other. But correlation is also one of the
most misused statistics — especially in economics, where spurious correlations from
trending data and confounded relationships are everywhere. This notebook builds the
intuition to use correlation wisely and skeptically.

## Prerequisites (Quick Self-Check)
- Completed notebooks 00–06 (full primer sequence so far).
- Understanding of mean, variance, and standard deviation.
- Familiarity with scatter plots.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can compute and interpret Pearson and Spearman correlations.
- You can visualize a correlation matrix and scatter relationships.
- You can demonstrate spurious correlation with trending data.
- You can explain why correlation does not imply causation with a concrete example.

## Common Pitfalls
- Computing Pearson correlation on non-stationary time series in levels (this often produces spurious results).
- Interpreting high correlation as evidence of a causal relationship.
- Ignoring non-linear relationships that Pearson correlation misses.
- Relying on Pearson correlation when the data contain outliers, without checking Spearman.

## Quick Fixes (When You Get Stuck)
- `df.corr()` for Pearson, `df.corr(method='spearman')` for Spearman.
- `np.cov(X, Y)` for the 2x2 covariance matrix.
- `seaborn.heatmap(corr_matrix, annot=True)` for visualization.
- If you see `ModuleNotFoundError`, re-run the bootstrap cell.

## Matching Guide
- `docs/guides/00a_statistics_primer/07_correlation_and_covariance.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00a_statistics_primer/07_correlation_and_covariance.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

## Load the Sample Data

We will use `macro_quarterly_sample.csv` throughout this notebook.
This dataset contains quarterly US macroeconomic indicators including GDP growth,
unemployment rate, the federal funds rate, CPI, industrial production, and more.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
print('Shape:', df.shape)
print('Columns:', list(df.columns))
df.head()

<a id="covariance"></a>
## Covariance: Measuring Joint Variability

### Goal
Understand what covariance measures, compute it manually from the definition, and
verify the result with `df.cov()`.

### Why this matters in economics
Covariance tells us whether two variables tend to move together. For example, if GDP
growth and unemployment have *negative* covariance, quarters with above-average
growth tend to coincide with below-average unemployment (and vice versa). This is
in the spirit of Okun's Law, which links output growth to *changes* in
unemployment. However, covariance has a critical weakness: its magnitude depends on
the scale (units) of the variables. Covariance between GDP (measured in billions of
dollars) and unemployment (measured in percent) is hard to interpret on its own.

**Key definition:**

$$\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

- Positive covariance: $X$ and $Y$ tend to move in the same direction.
- Negative covariance: $X$ and $Y$ tend to move in opposite directions.
- Problem: the number itself is hard to interpret because it depends on the units of $X$ and $Y$.
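
The scale problem in the last bullet can be made concrete with a small simulation. The sketch below uses synthetic data only; the names `output_bn` and `unemp_pct` and all coefficients are illustrative assumptions, not values from the sample dataset. It shows that re-expressing one variable in different units rescales the covariance even though the underlying relationship is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data (NOT the notebook's dataset)
unemp_pct = rng.normal(loc=6.0, scale=1.5, size=200)                # percent
output_bn = 500.0 - 20.0 * unemp_pct + rng.normal(0.0, 10.0, 200)   # billions of dollars


def sample_cov(a, b):
    """Sample covariance from the definition (divisor n - 1)."""
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)


cov_bn = sample_cov(output_bn, unemp_pct)
cov_mn = sample_cov(output_bn * 1000.0, unemp_pct)  # same output, now in millions

print(f'Cov (billions, percent): {cov_bn:.2f}')
print(f'Cov (millions, percent): {cov_mn:.2f}')
print(f'Ratio: {cov_mn / cov_bn:.1f}')
```

The sign survives the change of units, but the magnitude is multiplied by the conversion factor, which is why two covariances cannot be ranked by size unless the units match.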

### Your Turn

In [None]:
# TODO: Compute covariance between GDP growth (QoQ) and unemployment rate MANUALLY
# from the definition: Cov(X,Y) = (1/(n-1)) * sum((X - mean_X) * (Y - mean_Y))
# Then compare with df[['gdp_growth_qoq', 'UNRATE']].cov()

# Drop rows where either variable is NaN
pair = df[['gdp_growth_qoq', 'UNRATE']].dropna()
x = pair['gdp_growth_qoq']
y = pair['UNRATE']

n = len(x)

# Manual calculation
cov_manual = ...

# Pandas calculation
cov_pandas = ...

print(f'Manual covariance:  {cov_manual:.6f}')
print(f'Pandas covariance:  {cov_pandas:.6f}')
print(f'Match: {np.isclose(cov_manual, cov_pandas)}')

In [None]:
# TODO: Compute the full covariance matrix for selected macro columns using df.cov()
# Notice: the diagonal entries are the variances of each variable.

macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']

cov_matrix = ...

print('Covariance matrix:')
cov_matrix

**Interpretation prompt** (write 2–4 sentences below):
- Is the covariance between GDP growth and unemployment positive or negative? Does that match your intuition?
- Can you compare the covariance of (GDP, unemployment) with (GDP, CPI) and say which relationship is "stronger"? Why or why not?
- What problem does the scale-dependence of covariance create for comparing relationships?

<a id="pearson-correlation"></a>
## Pearson Correlation

### Goal
Compute the Pearson correlation coefficient, understand why it fixes the
scale problem, and visualize a correlation matrix.

### Why this matters in economics
Pearson correlation normalizes covariance by the standard deviations of each
variable, giving a unit-free measure bounded between $-1$ and $+1$. This makes it
possible to compare the strength of relationships across different pairs of
variables. Many empirical economics papers report a correlation matrix to
summarize variable relationships before running regressions.

**Key definition:**

$$r_{XY} = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}$$

- $r = +1$: perfect positive linear relationship.
- $r = -1$: perfect negative linear relationship.
- $r = 0$: no *linear* relationship (but there could be a non-linear one!).
- Rule of thumb: $|r| > 0.7$ is "strong", $0.3 < |r| < 0.7$ is "moderate", $|r| < 0.3$ is "weak".
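
Two of the properties listed above are easy to check numerically. The following is a minimal sketch on synthetic data (all names such as `x_demo` and `pearson_r` are hypothetical helpers introduced here): it confirms that $r$ is unchanged when either variable is rescaled or shifted, and that $r$ can be essentially zero even when $Y$ is an exact function of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.normal(size=500)
y_demo = 2.0 * x_demo + rng.normal(size=500)


def pearson_r(a, b):
    """Pearson r as covariance divided by the product of sample std devs."""
    cov_ab = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
    return cov_ab / (a.std(ddof=1) * b.std(ddof=1))


r_original = pearson_r(x_demo, y_demo)
# Change units and origin of both variables (positive scale factors)
r_rescaled = pearson_r(1000.0 * x_demo + 5.0, 0.01 * y_demo)

# Exact but non-linear dependence on a grid symmetric around zero
x_sym = np.linspace(-3, 3, 201)
r_quadratic = pearson_r(x_sym, x_sym ** 2)

print(f'r original:  {r_original:.4f}')
print(f'r rescaled:  {r_rescaled:.4f}')
print(f'r for Y=X^2: {r_quadratic:.4f}')
```

Note the `ddof=1` in the standard deviations: it matches the $n-1$ divisor in the covariance, so the two cancel and the result agrees with `np.corrcoef`.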

### Your Turn

In [None]:
# TODO: Compute Pearson correlation between GDP growth and unemployment manually
# r = Cov(X,Y) / (std_X * std_Y)
# Then compare with df[cols].corr()

pair = df[['gdp_growth_qoq', 'UNRATE']].dropna()
x = pair['gdp_growth_qoq']
y = pair['UNRATE']

r_manual = ...

r_pandas = ...

print(f'Manual Pearson r:  {r_manual:.6f}')
print(f'Pandas Pearson r:  {r_pandas:.6f}')
print(f'Match: {np.isclose(r_manual, r_pandas)}')

In [None]:
# TODO: Compute the Pearson correlation matrix for key macro variables using df.corr()

macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']

corr_pearson = ...

corr_pearson

In [None]:
# TODO: Visualize the correlation matrix as a heatmap using seaborn
# Hint: sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, vmin=-1, vmax=1)

fig, ax = plt.subplots(figsize=(8, 6))

...

ax.set_title('Pearson Correlation Matrix — Macro Variables')
plt.tight_layout()
plt.show()

**Interpretation prompt** (write 2–4 sentences below):
- Which pair of variables has the strongest positive correlation? The strongest negative?
- The diagonal of the correlation matrix is all 1.0. Why?
- Does any correlation surprise you? Which relationship would you investigate further?

<a id="spearman-rank-correlation"></a>
## Spearman Rank Correlation

### Goal
Compute Spearman rank correlation and understand when it is preferable to Pearson.

### Why this matters in economics
Pearson correlation measures *linear* association. But many economic relationships
are monotonic without being strictly linear (e.g., diminishing returns). Spearman
rank correlation works by ranking the data first and then computing the Pearson
correlation on the ranks. This makes it:

- **More robust to outliers**: a single extreme GDP quarter has limited influence, because only its rank enters the calculation.
- **Sensitive to non-linear monotonic relationships**: if $Y$ increases whenever $X$ increases (but not necessarily at a constant rate), Spearman will capture this even when Pearson underestimates it.
- **Applicable to ordinal data**: e.g., credit ratings (AAA, AA, A, ...) can be ranked even though the "distance" between grades is not uniform.
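
The "rank first, then apply Pearson" description can be verified directly. This is a sketch under assumptions: the data are synthetic, the helper `spearman` is hypothetical, and the single outlier at `(10, -10)` is chosen only to make the contrast visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x_demo = pd.Series(rng.normal(size=100))
y_demo = 0.8 * x_demo + pd.Series(rng.normal(scale=0.5, size=100))


def spearman(a, b):
    """Spearman correlation via DataFrame.corr(method='spearman')."""
    return pd.DataFrame({'a': a, 'b': b}).corr(method='spearman').loc['a', 'b']


# Spearman equals Pearson computed on the ranks
spearman_via_ranks = x_demo.rank().corr(y_demo.rank())
spearman_builtin = spearman(x_demo, y_demo)

# Append one extreme observation that contradicts the positive relationship
x_out = pd.concat([x_demo, pd.Series([10.0])], ignore_index=True)
y_out = pd.concat([y_demo, pd.Series([-10.0])], ignore_index=True)

pearson_shift = abs(x_out.corr(y_out) - x_demo.corr(y_demo))
spearman_shift = abs(spearman(x_out, y_out) - spearman_builtin)

print(f'Spearman via ranks: {spearman_via_ranks:.4f}')
print(f'Spearman built-in:  {spearman_builtin:.4f}')
print(f'Shift in Pearson after outlier:  {pearson_shift:.3f}')
print(f'Shift in Spearman after outlier: {spearman_shift:.3f}')
```

The outlier still moves Spearman a little, since its rank is extreme, but far less than it moves Pearson, where the size of the deviation matters.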

### Your Turn

In [None]:
# TODO: Compute Spearman rank correlation matrix and compare it to Pearson
# Hint: df[macro_cols].corr(method='spearman')

macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']

corr_spearman = ...

print('Spearman correlation matrix:')
corr_spearman

In [None]:
# TODO: Compute the difference between Pearson and Spearman correlations.
# Large differences indicate non-linear relationships or outlier effects.

corr_diff = ...

print('Pearson minus Spearman (large values flag non-linearity or outliers):')
corr_diff

In [None]:
# TODO: Demonstrate a case where Pearson and Spearman disagree.
# Simulate a non-linear monotonic relationship: Y = X^3 + noise
# Pearson will underestimate the association; Spearman will capture it.

np.random.seed(42)
n_sim = 200
x_sim = np.random.uniform(-3, 3, n_sim)
y_sim = ...  # TODO: x_sim ** 3 + some noise

from scipy import stats

r_pearson_sim = ...   # TODO: Pearson r for x_sim, y_sim
r_spearman_sim = ...  # TODO: Spearman r for x_sim, y_sim

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(x_sim, y_sim, alpha=0.5, s=20)
ax.set_xlabel('X')
ax.set_ylabel('Y = X^3 + noise')
ax.set_title(f'Non-linear monotonic: Pearson r = {r_pearson_sim:.3f}, Spearman r = {r_spearman_sim:.3f}')
plt.show()

**Interpretation prompt** (write 2–4 sentences below):
- Where do Pearson and Spearman differ most on the macro data? What might explain this?
- In the simulated $Y = X^3$ example, why is Spearman higher than Pearson?
- When would you choose Spearman over Pearson in an applied economics study?

<a id="visualizing-relationships"></a>
## Visualizing Relationships

### Goal
Use scatter plots and pair plots to explore bivariate relationships, and learn
what patterns a single correlation number can miss.

### Why this matters in economics
A correlation coefficient is a single number that summarizes a relationship. But
Anscombe's quartet famously showed that very different scatter patterns can produce
the same correlation. In economics, you might see clusters (expansion vs recession
regimes), outliers (financial crises), or non-linearities (diminishing returns) that
a number alone cannot capture. Always plot before you summarize.
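
The Anscombe point can be checked with a few lines of NumPy. The values below are the quartet as commonly reproduced from Anscombe (1973), typed in by hand for this sketch; treat them as an assumption and compare against a trusted copy (for example seaborn's `anscombe` example dataset) if exact numbers matter.

```python
import numpy as np

# Anscombe's quartet, entered by hand for illustration
x_shared = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x_fourth = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)

quartet = {
    'I': (x_shared, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                              7.24, 4.26, 10.84, 4.82, 5.68])),
    'II': (x_shared, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                               6.13, 3.10, 9.13, 7.26, 4.74])),
    'III': (x_shared, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                6.08, 5.39, 8.15, 6.42, 5.73])),
    'IV': (x_fourth, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                               5.25, 12.50, 5.56, 7.91, 6.89])),
}

# Pearson r for each of the four datasets
anscombe_r = {name: np.corrcoef(xs, ys)[0, 1] for name, (xs, ys) in quartet.items()}

for name, r_val in anscombe_r.items():
    print(f'Dataset {name}: Pearson r = {r_val:.3f}')
```

All four correlations come out at roughly 0.82, yet a scatter plot of each pair (for instance with `ax.scatter(xs, ys)` in a 2x2 grid) shows a noisy line, a smooth curve, a line with one outlier, and a vertical cluster with one leverage point.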

### Your Turn

In [None]:
# TODO: Create a scatter plot of GDP growth vs unemployment rate.
# Color points by recession status if available.
# Hint: use df['recession'] for the color.

fig, ax = plt.subplots(figsize=(8, 5))

...

ax.set_xlabel('GDP Growth (QoQ %)')
ax.set_ylabel('Unemployment Rate (%)')
ax.set_title('GDP Growth vs Unemployment')
ax.legend(title='Recession')
plt.show()

In [None]:
# TODO: Create a pairplot (scatter matrix) of 4–5 key macro variables.
# Hint: sns.pairplot(df[cols].dropna(), diag_kind='kde')
# This may take a moment to render.

pairplot_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']

...

plt.suptitle('Pairplot of Key Macro Variables', y=1.02)
plt.show()

In [None]:
# TODO: Create a 2x2 grid of scatter plots for selected pairs.
# Annotate each subplot with the Pearson correlation coefficient.

pairs = [
    ('gdp_growth_qoq', 'UNRATE'),
    ('gdp_growth_qoq', 'FEDFUNDS'),
    ('UNRATE', 'FEDFUNDS'),
    ('CPIAUCSL', 'INDPRO'),
]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, (col_x, col_y) in zip(axes.flat, pairs):
    sub = df[[col_x, col_y]].dropna()
    ...  # TODO: scatter plot and annotate with r value

plt.tight_layout()
plt.show()

**Interpretation prompt** (write 2–4 sentences below):
- What patterns can you see in the scatter plots that the correlation number alone would miss?
- Do you see any clusters, outliers, or non-linear patterns?
- Which pair of variables has the most "clean" linear relationship, and which is the messiest?

<a id="spurious-correlation"></a>
## Spurious Correlation

### Goal
Demonstrate that two completely independent random walks can appear highly
correlated, and understand why this matters for time series analysis.

### Why this matters in economics
This is THE critical lesson of this notebook. Most macroeconomic variables are
**non-stationary**: GDP, prices, and population all trend upward over time. If
you compute the Pearson correlation between two trending series, you will
almost always get a high value — even if the series are completely unrelated.
This is called **spurious correlation**, and it has led to many false
conclusions in economics.

An illustrative example: US GDP and world population both trend upward, so
their correlation in levels is high. Does that number measure how strongly one
depends on the other? No: any two upward-trending series would produce it.

The solution is to work with **stationary** data (e.g., first differences
or growth rates) before computing correlations. This motivates the concept
of stationarity, which is covered in depth in the time series notebooks.
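
The effect of differencing can be previewed on two series that share nothing except an upward drift. This is a minimal sketch with assumed slopes (`0.5` and `0.3`) and unit-variance noise; the variable names are chosen so they do not overwrite anything used in the exercises.

```python
import numpy as np

rng = np.random.default_rng(3)
T_demo = 200
t = np.arange(T_demo)

# Two unrelated series: each is a deterministic trend plus independent noise
series_a = 0.5 * t + rng.normal(size=T_demo)
series_b = 0.3 * t + rng.normal(size=T_demo)

# Correlation in levels is dominated by the shared trend
r_trend_levels = np.corrcoef(series_a, series_b)[0, 1]

# First differences remove the trend, leaving only the independent noise
r_trend_diffs = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f'Correlation in levels:            {r_trend_levels:.3f}')
print(f'Correlation in first differences: {r_trend_diffs:.3f}')
```

The exercises that follow make the same point with random walks, where the trend is stochastic rather than deterministic.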

### Your Turn

In [None]:
# TODO: Simulate two INDEPENDENT random walks and compute their correlation.
# A random walk: S_t = S_{t-1} + e_t, where e_t ~ N(0, 1)
# Even though the innovations are independent, the cumulative sums will
# appear correlated because both trend stochastically.

np.random.seed(123)
T = 500  # length of the series

# Generate two independent white noise series
e1 = np.random.randn(T)
e2 = np.random.randn(T)

# Cumulative sums = random walks
rw1 = ...  # TODO: np.cumsum(e1)
rw2 = ...  # TODO: np.cumsum(e2)

# Correlation of the random walks (levels)
r_levels = ...

# Correlation of the innovations (first differences) — should be near zero
r_diffs = ...

print(f'Correlation of random walks (levels):        {r_levels:.4f}')
print(f'Correlation of innovations (first diffs):    {r_diffs:.4f}')

In [None]:
# TODO: Plot the two random walks on the same axes to see why they look related.
# Then plot the innovations (first differences) to show they are not.

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: random walks (levels)
...  # TODO: plot rw1 and rw2
axes[0].set_title(f'Two Independent Random Walks (r = {r_levels:.3f})')
axes[0].set_xlabel('Time')
axes[0].legend(['Walk 1', 'Walk 2'])

# Right: innovations (first differences)
...  # TODO: plot e1 and e2
axes[1].set_title(f'Innovations / First Differences (r = {r_diffs:.3f})')
axes[1].set_xlabel('Time')
axes[1].legend(['Innovation 1', 'Innovation 2'])

plt.tight_layout()
plt.show()

In [None]:
# TODO: Run the experiment many times to show that spurious correlation is systematic.
# For each trial, generate two independent random walks and record their correlation.
# Plot the distribution of these spurious correlations.

np.random.seed(0)
n_trials = 1000
T_trial = 200

spurious_corrs = []
for _ in range(n_trials):
    ...  # TODO: generate two random walks, compute correlation, append

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(spurious_corrs, bins=40, edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linestyle='--', label='True correlation = 0')
ax.set_xlabel('Spurious Correlation')
ax.set_ylabel('Frequency')
ax.set_title(f'Distribution of Spurious Correlations ({n_trials} trials, T={T_trial})')
ax.legend()
plt.show()

print(f'Mean |r|: {np.mean(np.abs(spurious_corrs)):.3f}')
print(f'Fraction |r| > 0.5: {np.mean(np.abs(spurious_corrs) > 0.5):.1%}')

**Interpretation prompt** (write 2–4 sentences below):
- The two random walks are generated independently. Why can their correlation in levels end up so far from zero?
- What happens to the correlation when you use first differences instead of levels?
- What fraction of the simulated spurious correlations exceeded $|r| > 0.5$? What does this imply for correlating trending economic series?

<a id="correlation-does-not-imply-causation"></a>
## Correlation Does Not Imply Causation

### Goal
Understand confounding variables and demonstrate how omitting them leads to
misleading correlations and biased regression coefficients.

### Why this matters in economics
This is the most important mantra in empirical economics. Classic example:
ice cream sales and crime both rise in summer. Does ice cream cause crime?
No — temperature is a **confounder** that drives both. In economics, confounders
are everywhere:

- Education and wages are correlated, but unobserved ability may drive both.
- Government spending and GDP are correlated, but the business cycle affects both.
- Minimum wages and unemployment may be correlated across regions, but regional economic conditions confound the relationship.

The entire field of causal inference (instrumental variables, difference-in-differences,
regression discontinuity) exists to address this problem. Here we build the
intuition by simulating a simple confounding scenario.
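
One way to see confounding without running a full regression is the partial correlation: correlate what is left of $X$ and $Y$ after removing the part explained by $Z$. The sketch below reuses the ice cream and crime story with synthetic data; the helper `residualize`, the variable names, and the noise scale of `0.5` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_demo = 500

temp = rng.normal(size=n_demo)                                # confounder
ice_cream = 2.0 * temp + rng.normal(scale=0.5, size=n_demo)   # driven by temp
crime = 3.0 * temp + rng.normal(scale=0.5, size=n_demo)       # driven by temp only


def residualize(target, control):
    """Residuals from an OLS fit of target on a constant and control."""
    design = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


r_naive = np.corrcoef(ice_cream, crime)[0, 1]
r_partial = np.corrcoef(residualize(ice_cream, temp),
                        residualize(crime, temp))[0, 1]

print(f'Simple correlation(ice cream, crime):         {r_naive:.3f}')
print(f'Partial correlation, controlling for temp:    {r_partial:.3f}')
```

This only works because the confounder is observed. When it is not (ability, in the education example), residualizing is impossible and other identification strategies are needed.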

### Your Turn

In [None]:
# TODO: Simulate a confounding scenario.
# Z (temperature) causes both X (ice cream sales) and Y (crime rate).
# X does NOT cause Y. But the naive correlation between X and Y will be high.
#
# DGP (data generating process):
#   Z ~ N(0, 1)           (confounder: temperature)
#   X = 2*Z + noise_x     (ice cream sales driven by temperature)
#   Y = 3*Z + noise_y     (crime driven by temperature, NOT by ice cream)

np.random.seed(99)
n_obs = 500

Z = np.random.randn(n_obs)           # confounder
X = ...  # TODO: 2*Z + noise
Y = ...  # TODO: 3*Z + noise

print(f'Correlation(X, Y) = {np.corrcoef(X, Y)[0, 1]:.4f}  <-- looks causal!')
print(f'Correlation(X, Z) = {np.corrcoef(X, Z)[0, 1]:.4f}')
print(f'Correlation(Y, Z) = {np.corrcoef(Y, Z)[0, 1]:.4f}')

In [None]:
# TODO: Run two regressions to show the effect of controlling for the confounder.
# (1) Naive: Y = a + b*X (omitting Z) — b will be biased (non-zero).
# (2) Controlled: Y = a + b*X + c*Z — b should be close to 0 (the true effect).
#
# Use numpy (np.linalg.lstsq) or statsmodels.

# Naive regression: Y ~ X
X_naive = np.column_stack([np.ones(n_obs), X])
coef_naive = ...  # TODO: solve for coefficients

# Controlled regression: Y ~ X + Z
X_ctrl = np.column_stack([np.ones(n_obs), X, Z])
coef_ctrl = ...  # TODO: solve for coefficients

print('Naive regression (omitting confounder Z):')
print(f'  Y = {coef_naive[0]:.4f} + {coef_naive[1]:.4f} * X')
print(f'  Coefficient on X: {coef_naive[1]:.4f}  <-- biased! True effect is 0.')
print()
print('Controlled regression (including confounder Z):')
print(f'  Y = {coef_ctrl[0]:.4f} + {coef_ctrl[1]:.4f} * X + {coef_ctrl[2]:.4f} * Z')
print(f'  Coefficient on X: {coef_ctrl[1]:.4f}  <-- close to 0 (the true effect).')
print(f'  Coefficient on Z: {coef_ctrl[2]:.4f}  <-- close to 3 (the true effect).')

In [None]:
# TODO: Visualize the confounding scenario.
# Left panel: scatter of X vs Y (looks correlated).
# Right panel: scatter of X vs Y, colored by Z (reveals the confounder).

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: naive scatter
...  # TODO
axes[0].set_xlabel('X (Ice Cream Sales)')
axes[0].set_ylabel('Y (Crime Rate)')
axes[0].set_title('Naive View: X and Y Look Correlated')

# Right: colored by confounder Z
...  # TODO: scatter with c=Z, cmap='coolwarm'
axes[1].set_xlabel('X (Ice Cream Sales)')
axes[1].set_ylabel('Y (Crime Rate)')
axes[1].set_title('Colored by Confounder Z (Temperature)')
plt.colorbar(axes[1].collections[0], ax=axes[1], label='Z (Temperature)')

plt.tight_layout()
plt.show()

**Interpretation prompt** (write 2–4 sentences below):
- In the naive regression, what is the coefficient on X? Is it close to the true causal effect (zero)?
- In the controlled regression, how does the coefficient on X change? Why?
- What does the right-panel scatter (colored by Z) reveal that the left panel hides?
- Can you think of an economic example where an omitted confounder might lead to wrong policy conclusions?

<a id="the-covariance-matrix-and-regression"></a>
## The Covariance Matrix and Its Role in Regression

### Goal
Understand the covariance matrix of predictors and why high off-diagonal entries
signal multicollinearity.

### Why this matters in economics
In OLS regression, the coefficient estimates depend on the covariance structure of
the predictors. The OLS formula is:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

When predictors are highly correlated (multicollinearity), $X^\top X$ becomes
nearly singular, and the coefficient estimates become unstable — small changes
in the data lead to large swings in the estimates. The covariance (or correlation)
matrix is your first diagnostic tool for multicollinearity.

Later, you will learn about the **Variance Inflation Factor (VIF)**, which
quantifies multicollinearity for each predictor. But the correlation heatmap
is where the diagnosis starts.
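
Near-singularity shows up in the eigenvalues of the predictors' correlation matrix. The sketch below uses three synthetic predictors, one of which is almost a copy of another; the helper `eigen_ratio` and the noise scale `0.05` are assumptions. Using the correlation matrix rather than the covariance matrix keeps the ratio independent of the predictors' units.

```python
import numpy as np

rng = np.random.default_rng(5)
n_demo = 300

p1 = rng.normal(size=n_demo)
p2 = p1 + rng.normal(scale=0.05, size=n_demo)  # nearly a copy of p1
p3 = rng.normal(size=n_demo)                   # unrelated predictor


def eigen_ratio(columns):
    """Largest / smallest eigenvalue of the predictors' correlation matrix."""
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    eig = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eig[-1] / eig[0]


ratio_collinear = eigen_ratio([p1, p2, p3])
ratio_clean = eigen_ratio([p1, p3])

print(f'Eigenvalue ratio with near-duplicate predictor: {ratio_collinear:.1f}')
print(f'Eigenvalue ratio without it:                    {ratio_clean:.2f}')
```

Dropping the near-duplicate column brings the ratio back close to 1, which is the numerical counterpart of $X^\top X$ being comfortably invertible.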

### Your Turn

In [None]:
# TODO: Select a set of potential predictors and compute their covariance matrix
# using np.cov().

predictor_cols = ['UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO', 'T10Y2Y']
pred_data = df[predictor_cols].dropna()

# np.cov treats each ROW as a variable by default; our variables are columns,
# so pass rowvar=False (equivalent to transposing the data)
cov_np = ...  # TODO: np.cov(pred_data.values, rowvar=False)

cov_df = pd.DataFrame(cov_np, index=predictor_cols, columns=predictor_cols)
print('Covariance matrix of predictors:')
cov_df

In [None]:
# TODO: Compute and visualize the correlation matrix of predictors.
# Identify pairs with |r| > 0.7 (potential multicollinearity).

corr_pred = ...

fig, ax = plt.subplots(figsize=(8, 6))
...  # TODO: heatmap
ax.set_title('Predictor Correlation Matrix (Check for Multicollinearity)')
plt.tight_layout()
plt.show()

# Flag pairs with |r| > 0.7
print('\nPairs with |correlation| > 0.7:')
for i in range(len(predictor_cols)):
    for j in range(i + 1, len(predictor_cols)):
        r = corr_pred.iloc[i, j]
        if abs(r) > 0.7:
            print(f'  {predictor_cols[i]} & {predictor_cols[j]}: r = {r:.3f}')

In [None]:
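# TODO: Compute the eigenvalues of the covariance matrix.
# If the smallest eigenvalue is very close to zero (relative to the largest),
# it signals near-singularity (severe multicollinearity).
# Here, condition number = max_eigenvalue / min_eigenvalue.
# Caveat: covariance eigenvalues depend on each predictor's units, so a large
# ratio can reflect scale differences alone; repeat on corr_pred for a unit-free check.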

eigenvalues = ...  # TODO: np.linalg.eigvalsh(cov_np)

condition_number = ...

print('Eigenvalues of the covariance matrix:')
for i, ev in enumerate(sorted(eigenvalues, reverse=True)):
    print(f'  lambda_{i+1} = {ev:.4f}')
print(f'\nCondition number: {condition_number:.2f}')
print('Rule of thumb: sqrt(max/min eigenvalue) > 30 on scaled predictors suggests multicollinearity problems.')

**Interpretation prompt** (write 2–4 sentences below):
- Which predictor pairs have high correlation? Would including both in a regression be problematic?
- What does the condition number tell you about the predictor set?
- How does the covariance matrix relate to the OLS formula? Why does near-singularity matter?

---

## Where This Shows Up Later

- **Correlation matrix** is the first step before regression (`02_regression`). You will use it to select predictors and diagnose multicollinearity.
- **Spurious correlation** motivates the concept of stationarity (`07_time_series_econ/00`). You must difference or detrend non-stationary series before computing meaningful correlations.
- **Confounding** motivates the entire field of causal inference (`06_causal`). Instrumental variables, diff-in-diff, and regression discontinuity all exist to deal with confounders.
- **Covariance matrix** appears in PCA (`04_unsupervised/01`), where eigenvectors of the covariance matrix define the principal components.

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run these asserts to verify your work. If any fail, go back and fix the corresponding section.

In [None]:
# ---- Covariance checks ----
assert isinstance(cov_manual, float), 'cov_manual should be a float'
assert np.isclose(cov_manual, cov_pandas), 'Manual and pandas covariance should match'

# ---- Pearson correlation checks ----
assert isinstance(corr_pearson, pd.DataFrame), 'corr_pearson should be a DataFrame'
assert corr_pearson.shape[0] == corr_pearson.shape[1], 'Correlation matrix should be square'
assert np.allclose(np.diag(corr_pearson.values), 1.0), 'Diagonal of corr matrix should be 1.0'
assert (corr_pearson.values >= -1).all() and (corr_pearson.values <= 1).all(), 'Correlations must be in [-1, 1]'

# ---- Spearman checks ----
assert isinstance(corr_spearman, pd.DataFrame), 'corr_spearman should be a DataFrame'
assert corr_spearman.shape == corr_pearson.shape, 'Spearman and Pearson matrices should have same shape'

# ---- Spurious correlation checks ----
assert abs(r_diffs) < 0.2, f'Correlation of innovations should be near 0, got {r_diffs:.4f}'

# ---- Confounding checks ----
assert abs(coef_naive[1]) > 0.5, 'Naive coefficient on X should be significantly biased'
assert abs(coef_ctrl[1]) < 0.5, 'Controlled coefficient on X should be close to 0'

# ---- Covariance matrix checks ----
assert cov_np.shape[0] == cov_np.shape[1] == len(predictor_cols), 'Cov matrix shape mismatch'
assert np.allclose(cov_np, cov_np.T), 'Covariance matrix must be symmetric'

print('All checkpoint assertions passed.')

## Extensions (Optional)
- Compute a **rolling correlation** between GDP growth and unemployment (e.g., 20-quarter window) and plot it over time. Has the relationship been stable?
- Investigate **partial correlation**: the correlation between X and Y after removing the effect of Z. Compare to the simple correlation.
- Download data from FRED for two obviously unrelated trending series (e.g., US GDP and world population) and compute their correlation. Then difference both series and recompute. This is the spurious correlation problem in real data.
- Explore the **correlation between lagged variables**: does unemployment at time $t$ correlate more strongly with GDP growth at $t$ or at $t-1$? This foreshadows Granger causality.
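
As a starting point for the rolling-correlation extension, here is a sketch on synthetic data in which the relationship deliberately flips sign halfway through. The series names, the 20-observation window, and the sign flip are all assumptions made so the instability is easy to see; swap in the real columns when you try it on the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n_demo = 120

growth = pd.Series(rng.normal(size=n_demo))

# Assumed structural break: negative link in the first half, positive afterwards
slope = np.where(np.arange(n_demo) < 60, -0.8, 0.8)
unemp_change = pd.Series(slope * growth.to_numpy()
                         + rng.normal(scale=0.5, size=n_demo))

# Pearson correlation over a moving 20-observation window
rolling_r = growth.rolling(window=20).corr(unemp_change)

print(f'Full-sample correlation: {growth.corr(unemp_change):.3f}')
print(f'First complete window:   {rolling_r.iloc[19]:.3f}')
print(f'Last window:             {rolling_r.iloc[-1]:.3f}')
```

The first 19 entries of `rolling_r` are `NaN` because the window is not yet full. Calling `rolling_r.plot()` shows the sign change that the single full-sample number hides.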

## Reflection
- What is the most important lesson you take away from this notebook?
- If a colleague shows you a correlation of 0.85 between two macroeconomic series, what questions would you ask before drawing conclusions?
- How does the spurious correlation demonstration change how you think about correlation in time series data?
- Can you think of a policy-relevant example where confusing correlation with causation would lead to a harmful decision?

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Covariance — measuring joint variability</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Covariance
pair = df[['gdp_growth_qoq', 'UNRATE']].dropna()
x = pair['gdp_growth_qoq']
y = pair['UNRATE']
n = len(x)

# Manual calculation
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Pandas calculation
cov_pandas = df[['gdp_growth_qoq', 'UNRATE']].cov().loc['gdp_growth_qoq', 'UNRATE']

print(f'Manual covariance:  {cov_manual:.6f}')
print(f'Pandas covariance:  {cov_pandas:.6f}')
print(f'Match: {np.isclose(cov_manual, cov_pandas)}')

# Full covariance matrix
macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']
cov_matrix = df[macro_cols].cov()
```

</details>

<details><summary>Solution: Pearson correlation</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Pearson correlation
pair = df[['gdp_growth_qoq', 'UNRATE']].dropna()
x = pair['gdp_growth_qoq']
y = pair['UNRATE']

n = len(x)  # defined here so this block does not depend on the covariance cell
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * x.std() * y.std())
r_pandas = df[['gdp_growth_qoq', 'UNRATE']].corr().loc['gdp_growth_qoq', 'UNRATE']

# Full correlation matrix
macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']
corr_pearson = df[macro_cols].corr()

# Heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_pearson, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1, ax=ax)
ax.set_title('Pearson Correlation Matrix \u2014 Macro Variables')
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Spearman rank correlation</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Spearman rank correlation
macro_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']

corr_spearman = df[macro_cols].corr(method='spearman')

corr_diff = corr_pearson - corr_spearman

# Non-linear example: Y = X^3 + noise
np.random.seed(42)
n_sim = 200
x_sim = np.random.uniform(-3, 3, n_sim)
y_sim = x_sim ** 3 + np.random.randn(n_sim) * 3

from scipy import stats
r_pearson_sim = np.corrcoef(x_sim, y_sim)[0, 1]
r_spearman_sim = stats.spearmanr(x_sim, y_sim).correlation
```

</details>

<details><summary>Solution: Visualizing relationships</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Visualizing relationships

# Scatter: GDP growth vs unemployment, colored by recession
fig, ax = plt.subplots(figsize=(8, 5))
colors = df['recession'].map({0: 'steelblue', 1: 'red'})
ax.scatter(df['gdp_growth_qoq'], df['UNRATE'], c=colors, alpha=0.6, s=30)
ax.set_xlabel('GDP Growth (QoQ %)')
ax.set_ylabel('Unemployment Rate (%)')
ax.set_title('GDP Growth vs Unemployment')
# Manual legend
from matplotlib.lines import Line2D
handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor='steelblue', label='No Recession'),
           Line2D([0], [0], marker='o', color='w', markerfacecolor='red', label='Recession')]
ax.legend(handles=handles, title='Recession')
plt.show()

# Pairplot
pairplot_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO']
sns.pairplot(df[pairplot_cols].dropna(), diag_kind='kde')
plt.suptitle('Pairplot of Key Macro Variables', y=1.02)
plt.show()

# Annotated scatter grid
pairs = [
    ('gdp_growth_qoq', 'UNRATE'),
    ('gdp_growth_qoq', 'FEDFUNDS'),
    ('UNRATE', 'FEDFUNDS'),
    ('CPIAUCSL', 'INDPRO'),
]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (col_x, col_y) in zip(axes.flat, pairs):
    sub = df[[col_x, col_y]].dropna()
    ax.scatter(sub[col_x], sub[col_y], alpha=0.5, s=20)
    r = sub[col_x].corr(sub[col_y])
    ax.set_xlabel(col_x)
    ax.set_ylabel(col_y)
    ax.set_title(f'{col_x} vs {col_y} (r = {r:.3f})')
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Spurious correlation</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Spurious correlation
np.random.seed(123)
T = 500

e1 = np.random.randn(T)
e2 = np.random.randn(T)

rw1 = np.cumsum(e1)
rw2 = np.cumsum(e2)

r_levels = np.corrcoef(rw1, rw2)[0, 1]
r_diffs = np.corrcoef(e1, e2)[0, 1]

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(rw1, label='Walk 1')
axes[0].plot(rw2, label='Walk 2')
axes[0].set_title(f'Two Independent Random Walks (r = {r_levels:.3f})')
axes[0].set_xlabel('Time')
axes[0].legend()

axes[1].plot(e1, alpha=0.5, label='Innovation 1')
axes[1].plot(e2, alpha=0.5, label='Innovation 2')
axes[1].set_title(f'Innovations / First Differences (r = {r_diffs:.3f})')
axes[1].set_xlabel('Time')
axes[1].legend()
plt.tight_layout()
plt.show()

# Many trials
np.random.seed(0)
n_trials = 1000
T_trial = 200
spurious_corrs = []
for _ in range(n_trials):
    a = np.cumsum(np.random.randn(T_trial))
    b = np.cumsum(np.random.randn(T_trial))
    spurious_corrs.append(np.corrcoef(a, b)[0, 1])
```

</details>

<details><summary>Solution: Correlation does not imply causation</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Confounding / causation
np.random.seed(99)
n_obs = 500

Z = np.random.randn(n_obs)
X = 2 * Z + np.random.randn(n_obs) * 0.5
Y = 3 * Z + np.random.randn(n_obs) * 0.5

# Naive regression: Y ~ X
X_naive = np.column_stack([np.ones(n_obs), X])
coef_naive, _, _, _ = np.linalg.lstsq(X_naive, Y, rcond=None)

# Controlled regression: Y ~ X + Z
X_ctrl = np.column_stack([np.ones(n_obs), X, Z])
coef_ctrl, _, _, _ = np.linalg.lstsq(X_ctrl, Y, rcond=None)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X, Y, alpha=0.3, s=15)
axes[0].set_xlabel('X (Ice Cream Sales)')
axes[0].set_ylabel('Y (Crime Rate)')
axes[0].set_title('Naive View: X and Y Look Correlated')

sc = axes[1].scatter(X, Y, c=Z, cmap='coolwarm', alpha=0.5, s=15)
axes[1].set_xlabel('X (Ice Cream Sales)')
axes[1].set_ylabel('Y (Crime Rate)')
axes[1].set_title('Colored by Confounder Z (Temperature)')
plt.colorbar(sc, ax=axes[1], label='Z (Temperature)')
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: The covariance matrix and regression</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 07 — Covariance matrix and regression
predictor_cols = ['UNRATE', 'FEDFUNDS', 'CPIAUCSL', 'INDPRO', 'T10Y2Y']
pred_data = df[predictor_cols].dropna()

cov_np = np.cov(pred_data.values, rowvar=False)
cov_df = pd.DataFrame(cov_np, index=predictor_cols, columns=predictor_cols)

# Correlation matrix and heatmap
corr_pred = pred_data.corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_pred, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1, ax=ax)
ax.set_title('Predictor Correlation Matrix (Check for Multicollinearity)')
plt.tight_layout()
plt.show()

# Eigenvalues and condition number
eigenvalues = np.linalg.eigvalsh(cov_np)
condition_number = eigenvalues.max() / eigenvalues.min()
```

</details>