
# Unit 2 Notebook — Correlation & Granger Causality

*A practical walkthrough with simulations you can tweak.*  
**Topics:** Pearson correlation • Time-series intuition • Granger causality (predictive)  


In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Statsmodels for Granger causality
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests

np.random.seed(42)
pd.options.display.precision = 4



## 1) Pearson Correlation — intuition first

**What it measures:** Strength and direction (meaning sign, positive or negative, not temporal or causal direction) of a *linear* relationship between two quantitative variables.  
**Range:** -1 (perfect negative) to +1 (perfect positive); 0 means *no linear* relationship.
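
The definition can be checked by hand. The sketch below computes r as the covariance divided by the product of the two standard deviations and compares it with `np.corrcoef`. The helper name `pearson_r` and the toy data are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson r from its definition (hypothetical helper for illustration)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    # The 1/n factors of covariance and standard deviations cancel out.
    denom = np.sqrt((a_centered @ a_centered) * (b_centered @ b_centered))
    return (a_centered @ b_centered) / denom

rng = np.random.default_rng(0)  # local generator; leaves the global seed alone
a = rng.normal(size=200)
b = 0.5 * a + rng.normal(size=200)

r_manual = pearson_r(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]
print(f"manual: {r_manual:.6f}   np.corrcoef: {r_numpy:.6f}")

# Shifting or rescaling a variable leaves r unchanged; flipping its sign flips r.
r_rescaled = pearson_r(10 * a + 3, b)
r_flipped = pearson_r(-a, b)
print(f"rescaled: {r_rescaled:.6f}   sign-flipped: {r_flipped:.6f}")
```

Because r is unit-free, it describes how tightly the points follow a line, not how steep that line is.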

We'll simulate two scenarios:
1. **Linear relationship**: clear positive correlation.
2. **Curvy relationship**: strong association but *Pearson misses it* (hint: it's not linear).

### Pearson loves straight lines. Curves? Not so much.


In [None]:
# --- Scenario 1: Linear relationship ---
n = 250
x = np.linspace(0, 10, n)
y = 2.0 * x + np.random.normal(0, 2, size=n)

r_linear = np.corrcoef(x, y)[0,1]
print(f"Pearson r (linear case): {r_linear:.3f}")

plt.figure(figsize=(6,4))
plt.scatter(x, y, alpha=0.7)
plt.title(f"Linear relationship — Pearson r = {r_linear:.3f}")
plt.xlabel("x")
plt.ylabel("y")
plt.tight_layout()
plt.show()


In [None]:
# --- Scenario 2: Nonlinear relationship (Pearson underestimates) ---
n = 300
x2 = np.linspace(-3, 3, n)
y2 = x2**2 + np.random.normal(0, 0.8, size=n)

r_nonlinear = np.corrcoef(x2, y2)[0,1]
print(f"Pearson r (nonlinear case): {r_nonlinear:.3f}  <-- low, despite obvious association")

plt.figure(figsize=(6,4))
plt.scatter(x2, y2, alpha=0.7)
plt.title(f"Nonlinear relationship — Pearson r = {r_nonlinear:.3f}")
plt.xlabel("x2")
plt.ylabel("y2")
plt.tight_layout()
plt.show()

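# Note: Pearson only captures linear association.
# Rank-based alternatives such as Spearman or Kendall correlation capture monotonic
# (not necessarily linear) relationships. They would still be close to zero for this
# symmetric U-shape, because y goes down and then up as x increases.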
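
As a rough illustration of what rank-based measures can and cannot do, the sketch below compares Pearson, Spearman, and Kendall on two toy datasets: a monotonic but curved relationship, and a U-shape built like Scenario 2. It assumes SciPy is installed (`scipy.stats.spearmanr` and `scipy.stats.kendalltau`); the variable names and the exponential example are choices made here, not taken from the notebook.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(1)  # local generator; leaves the global seed alone

# Monotonic but nonlinear: y grows exponentially with x (no noise added).
x_mono = np.linspace(0, 5, 200)
y_mono = np.exp(x_mono)

# U-shaped: same construction as Scenario 2.
x_u = np.linspace(-3, 3, 300)
y_u = x_u**2 + rng.normal(0, 0.8, size=300)

r_mono = np.corrcoef(x_mono, y_mono)[0, 1]
rho_mono, _ = spearmanr(x_mono, y_mono)
tau_mono, _ = kendalltau(x_mono, y_mono)

r_u = np.corrcoef(x_u, y_u)[0, 1]
rho_u, _ = spearmanr(x_u, y_u)
tau_u, _ = kendalltau(x_u, y_u)

print(f"monotonic: Pearson={r_mono:.3f}  Spearman={rho_mono:.3f}  Kendall={tau_mono:.3f}")
print(f"U-shaped : Pearson={r_u:.3f}  Spearman={rho_u:.3f}  Kendall={tau_u:.3f}")
```

Rank correlations reward any consistently increasing or decreasing pattern, so they reach 1 on the exponential curve where Pearson falls short. On the U-shape, all three stay near zero.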


## 2) From correlation to *temporal* predictiveness

Correlation is **atemporal**: it ignores *order*.  
Time series let us ask a new question:

> **Does the past of X help predict the future of Y (beyond Y's own past)?**

That's the idea behind **Granger causality**.
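
One common way to make this question concrete is to compare two nested regressions. The restricted model predicts Y from its own lags; the unrestricted model adds lags of X. An F-test then asks whether the added lags reduce the residual error by more than chance would. The sketch below implements that comparison with NumPy and SciPy only. The helper `granger_f_test` and the toy data are hypothetical, written for illustration; exact agreement with any library's output is not checked here.

```python
import numpy as np
from scipy import stats

def granger_f_test(y, x, lags=1):
    """F-test of H0: lags of x add nothing beyond y's own lags.

    Hypothetical helper for illustration; returns (F statistic, p-value).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    target = y[lags:]
    # Column k-1 holds the series shifted back by k steps (k = 1..lags).
    y_lags = np.column_stack([y[lags - k : n - k] for k in range(1, lags + 1)])
    x_lags = np.column_stack([x[lags - k : n - k] for k in range(1, lags + 1)])
    const = np.ones((n - lags, 1))

    restricted = np.hstack([const, y_lags])            # y's own past only
    unrestricted = np.hstack([const, y_lags, x_lags])  # plus x's past

    def ssr(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    ssr_r, ssr_u = ssr(restricted), ssr(unrestricted)
    df_resid = (n - lags) - unrestricted.shape[1]
    f_stat = ((ssr_r - ssr_u) / lags) / (ssr_u / df_resid)
    return f_stat, stats.f.sf(f_stat, lags, df_resid)

# Toy data (an assumption for this sketch): y follows x with a one-step delay.
rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + rng.normal(size=n - 1)

f_xy, p_xy = granger_f_test(y, x, lags=1)  # does x's past help predict y?
f_yx, p_yx = granger_f_test(x, y, lags=1)  # does y's past help predict x?
print(f"x -> y: F = {f_xy:.1f}, p = {p_xy:.3g}")
print(f"y -> x: F = {f_yx:.1f}, p = {p_yx:.3g}")
```

The test only measures improvement in prediction. A small p-value says the lags of x carry extra predictive information, which is a weaker statement than saying x causes y.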



### 2.1 Simulating a simple causal story (X → Y)

We'll create a system where **X drives Y with a lag**:
- Xₜ = 0.7·Xₜ₋₁ + noise
- Yₜ = 0.6·Yₜ₋₁ + **0.8·Xₜ₋₁** + noise

If the test works, we should find: **X Granger-causes Y**, but **Y does not Granger-cause X**.


In [None]:

def simulate_var1(n=400, burn=50, ax=0.7, ay=0.6, b_xy=0.8, b_yx=0.0, sx=1.0, sy=1.0):
    X = np.zeros(n + burn)
    Y = np.zeros(n + burn)
    for t in range(1, n + burn):
        X[t] = ax * X[t-1] + np.random.normal(scale=sx)
        Y[t] = ay * Y[t-1] + b_xy * X[t-1] + np.random.normal(scale=sy)
    return X[burn:], Y[burn:]

X, Y = simulate_var1()
df = pd.DataFrame({"X": X, "Y": Y})
df.head()


In [None]:
# Quick look at what the series look like
fig, axes = plt.subplots(2, 1, figsize=(8,5), sharex=True)
axes[0].plot(df["X"])
axes[0].set_title("Series X")
axes[1].plot(df["Y"])
axes[1].set_title("Series Y")
axes[1].set_xlabel("time")
plt.tight_layout()
plt.show()


In [None]:
# Granger test: Does X (lags) help predict Y?
# grangercausalitytests checks whether the SECOND column Granger-causes the FIRST,
# so the columns are passed in the order [Y, X].
maxlag = 3
print("H0: 'X does NOT Granger-cause Y'")
res_xy = grangercausalitytests(df[["Y","X"]], maxlag=maxlag, verbose=False)
for lag in range(1, maxlag+1):
    pval = res_xy[lag][0]["ssr_ftest"][1]
    print(f"lag {lag}: p = {pval:.4f}")

# If the p values are small (e.g., < 0.05), we reject H0 and conclude that X Granger-causes Y.

In [None]:

# Reverse direction: Does Y Granger-cause X? (should NOT, in our simulation)
print("H0: 'Y does NOT Granger-cause X'")
res_yx = grangercausalitytests(df[["X","Y"]], maxlag=maxlag, verbose=False)
for lag in range(1, maxlag+1):
    pval = res_yx[lag][0]["ssr_ftest"][1]
    print(f"lag {lag}: p = {pval:.4f}")

# If the p-values are small (e.g., < 0.05), we reject H0 and conclude that Y Granger-causes X.
# Large p-values mean we fail to reject H0, which is what we expect in this simulation.


### 2.2 Caveats you must remember

- **Predictive ≠ causal mechanism.** Granger tests predictiveness in time, not true cause.  
- **Omitted common causes** can fool the test.  
- **Nonstationarity / trends / seasonality** can create false positives — always check and difference if needed.  
- **Lag choice matters**: underfitting or overfitting lags changes conclusions.  
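
The omitted-common-cause caveat can be sketched with a small simulation. A hidden driver Z affects X after one step and Y after two steps, while X never enters the equation for Y. A test that ignores Z finds that X "predicts" Y; once lags of Z are included in both models, the apparent effect of X shrinks sharply. Everything here (the helper `f_test_added_regressors`, the coefficients, the seed) is an assumption chosen for illustration.

```python
import numpy as np
from scipy import stats

def f_test_added_regressors(target, base, extra):
    """F-test: do the `extra` columns improve on `base`? (hypothetical helper)"""
    def ssr(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    full = np.hstack([base, extra])
    ssr_r, ssr_u = ssr(base), ssr(full)
    q = extra.shape[1]
    df_resid = len(target) - full.shape[1]
    f_stat = ((ssr_r - ssr_u) / q) / (ssr_u / df_resid)
    return f_stat, stats.f.sf(f_stat, q, df_resid)

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)                         # hidden common driver
x = np.zeros(n)
y = np.zeros(n)
x[1:] = z[:-1] + 0.5 * rng.normal(size=n - 1)  # X reacts to Z after 1 step
y[2:] = z[:-2] + 0.5 * rng.normal(size=n - 2)  # Y reacts to Z after 2 steps

# Align all regressors on t = 2 .. n-1.
target = y[2:]
const = np.ones((n - 2, 1))
y_lag1 = y[1:-1, None]
x_lag1 = x[1:-1, None]
z_lag1 = z[1:-1, None]
z_lag2 = z[:-2, None]

# Naive test: Z is omitted from both models.
f_naive, p_naive = f_test_added_regressors(target, np.hstack([const, y_lag1]), x_lag1)
# Conditional test: lags of Z are included in both models.
f_cond, p_cond = f_test_added_regressors(
    target, np.hstack([const, y_lag1, z_lag1, z_lag2]), x_lag1
)
print(f"Z omitted : F = {f_naive:.1f}, p = {p_naive:.3g}")
print(f"Z included: F = {f_cond:.2f}, p = {p_cond:.3g}")
```

In real data the common driver is often unobserved, so this control is not always available. The sketch only shows why a significant result does not rule out a shared cause.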



## 3) Mini Exercise — Build, test, explain

1. **Simulate your own pair of time series** where **Y depends on X with a lag of 2**.  
   - Hint: make `Y[t] = 0.5*Y[t-1] + 0.9*X[t-2] + noise`.
2. **Run Granger tests** for lags 1..4 in **both directions**.  
3. **Explain** in 2–4 sentences why the *reverse* direction should (ideally) not be significant.
4. **(Bonus)** Add a seasonal driver to both and observe how it changes results; then try to remove it by differencing (a seasonal pattern calls for a difference at the seasonal period).

> Deliverable: a short markdown cell with your explanation + the printed p-values.


In [None]:
# ===============================================
# MINI-EXERCISE: Granger Causality (Lag-2)
# Y depends on X with a lag of 2
# ===============================================
# 1. Install & import (run once)
!pip install statsmodels -q
import numpy as np, pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# 2. Simulate data: Y[t] = 0.5*Y[t-1] + 0.9*X[t-2] + noise
np.random.seed(123)
n, burn = 400, 100
X = np.zeros(n + burn)
Y = np.zeros(n + burn)

for t in range(2, n + burn):
    X[t] = 0.7 * X[t-1] + np.random.randn()
    Y[t] = 0.5 * Y[t-1] + 0.9 * X[t-2] + np.random.randn()

X, Y = X[burn:], Y[burn:]
df = pd.DataFrame({'X': X, 'Y': Y})

# 3. Granger tests (both directions)
print("X Granger-causes Y? (Expected: YES)")
for lag in 1,2,3,4:
    p = grangercausalitytests(df[['Y','X']], lag, verbose=False)[lag][0]['ssr_ftest'][1]
    print(f"  Lag {lag}: p = {p:.4f} → {'YES' if p<0.05 else 'no'}")

print("\nY Granger-causes X? (Expected: NO)")
for lag in 1,2,3,4:
    p = grangercausalitytests(df[['X','Y']], lag, verbose=False)[lag][0]['ssr_ftest'][1]
    print(f"  Lag {lag}: p = {p:.4f}")

# 4. Final answer
print("\n" + "="*50)
print("CONCLUSION:")
print("Y depends on X with lag 2 → Granger test correctly detects X → Y")
print("No reverse effect → Y does not Granger-cause X")
print("Perfect result!")

X Granger-causes Y? (Expected: YES)
  Lag 1: p = 0.0000 → YES
  Lag 2: p = 0.0000 → YES
  Lag 3: p = 0.0000 → YES
  Lag 4: p = 0.0000 → YES

Y Granger-causes X? (Expected: NO)
  Lag 1: p = 0.3843
  Lag 2: p = 0.6545
  Lag 3: p = 0.4524
  Lag 4: p = 0.6052

CONCLUSION:
Y depends on X with lag 2 → Granger test correctly detects X → Y
No reverse effect → Y does not Granger-cause X
Perfect result!





### Part 3: Why the reverse direction (Y → X) should NOT be significant

In our simulation, **Y is built using only past X and past Y**, never the other way around.  
X evolves independently of Y: `X[t] = 0.7*X[t-1] + noise`.  
Therefore, **past values of Y carry no information** about future X beyond what X's own past already provides.  
The Granger test should fail to reject the null in the Y → X direction, and here it does (all p-values are above 0.38). Failing to reject is consistent with **no predictive causality from Y to X**, but it does not prove it: at the 5% level, a true null is still rejected about 5% of the time.
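
A single run cannot show how often the reverse test would come out significant by chance. The sketch below repeats the lag-2 simulation 200 times and records how often a lag-1 F-test rejects "Y does not Granger-cause X" at the 5% level. It uses a hand-written nested-regression F-test instead of `grangercausalitytests`, so the helper `reverse_p_value`, the number of repetitions, and the seed are all assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, n, burn = 200, 300, 100
total = n + burn

# Simulate all repetitions at once: one row per repetition.
X = np.zeros((n_sims, total))
Y = np.zeros((n_sims, total))
for t in range(2, total):
    X[:, t] = 0.7 * X[:, t - 1] + rng.normal(size=n_sims)
    Y[:, t] = 0.5 * Y[:, t - 1] + 0.9 * X[:, t - 2] + rng.normal(size=n_sims)
X, Y = X[:, burn:], Y[:, burn:]

def reverse_p_value(x, y):
    """p-value for H0 'Y does not Granger-cause X' at lag 1 (hypothetical helper)."""
    target = x[1:]
    const = np.ones(len(target))
    restricted = np.column_stack([const, x[:-1]])
    unrestricted = np.column_stack([const, x[:-1], y[:-1]])

    def ssr(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    ssr_r, ssr_u = ssr(restricted), ssr(unrestricted)
    df_resid = len(target) - unrestricted.shape[1]
    f_stat = (ssr_r - ssr_u) / (ssr_u / df_resid)
    return stats.f.sf(f_stat, 1, df_resid)

p_values = np.array([reverse_p_value(X[i], Y[i]) for i in range(n_sims)])
false_positive_rate = np.mean(p_values < 0.05)
print(f"Share of runs with p < 0.05 in the Y -> X direction: {false_positive_rate:.3f}")
```

A share close to the nominal 5% is what a well-behaved test would give when the null is true, so an occasional significant reverse result in one run is not by itself evidence of a problem.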

In [8]:
# BONUS: Add seasonal driver and observe changes
print("\n" + "="*60)
print("BONUS: Seasonal Effect Analysis")
print("="*60)

# Add strong seasonal component to both series
t = np.arange(len(X))
season = 3 * np.sin(2 * np.pi * t / 20)  # Seasonal cycle

X_seasonal = X + season
Y_seasonal = Y + season
df_seasonal = pd.DataFrame({"X": X_seasonal, "Y": Y_seasonal})

print("\nWITH SEASONALITY (both series share same seasonality):")

print("Testing: X → Y (with seasonality)")
res_xy_seasonal = grangercausalitytests(df_seasonal[["Y","X"]], maxlag=4, verbose=False)
for lag in range(1, 5):
    pval = res_xy_seasonal[lag][0]["ssr_ftest"][1]
    print(f"Lag {lag}: p = {pval:.6f}")

print("\nTesting: Y → X (with seasonality)")
res_yx_seasonal = grangercausalitytests(df_seasonal[["X","Y"]], maxlag=4, verbose=False)
for lag in range(1, 5):
    pval = res_yx_seasonal[lag][0]["ssr_ftest"][1]
    print(f"Lag {lag}: p = {pval:.6f}")

# Try to remove seasonality by first differencing (np.diff takes a lag-1 difference)
print("\n" + "-"*50)
print("AFTER DIFFERENCING (removing seasonality):")

X_diff = np.diff(X_seasonal)
Y_diff = np.diff(Y_seasonal)
df_diff = pd.DataFrame({"X_diff": X_diff, "Y_diff": Y_diff})

print("Testing: X → Y (after differencing)")
res_xy_diff = grangercausalitytests(df_diff[["Y_diff","X_diff"]], maxlag=4, verbose=False)
for lag in range(1, 5):
    pval = res_xy_diff[lag][0]["ssr_ftest"][1]
    print(f"Lag {lag}: p = {pval:.6f}")

print("\nTesting: Y → X (after differencing)")
res_yx_diff = grangercausalitytests(df_diff[["X_diff","Y_diff"]], maxlag=4, verbose=False)
for lag in range(1, 5):
    pval = res_yx_diff[lag][0]["ssr_ftest"][1]
    print(f"Lag {lag}: p = {pval:.6f}")


BONUS: Seasonal Effect Analysis

WITH SEASONALITY (both series share same seasonality):
Testing: X → Y (with seasonality)
Lag 1: p = 0.000000
Lag 2: p = 0.000000
Lag 3: p = 0.000000
Lag 4: p = 0.000000

Testing: Y → X (with seasonality)
Lag 1: p = 0.473864
Lag 2: p = 0.000000
Lag 3: p = 0.000000
Lag 4: p = 0.000000

--------------------------------------------------
AFTER DIFFERENCING (removing seasonality):
Testing: X → Y (after differencing)
Lag 1: p = 0.000000
Lag 2: p = 0.000000
Lag 3: p = 0.000000
Lag 4: p = 0.000000

Testing: Y → X (after differencing)
Lag 1: p = 0.000000
Lag 2: p = 0.000000
Lag 3: p = 0.000000
Lag 4: p = 0.000000




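### BONUS Results Analysis

**With Seasonality:**
When we add the same seasonal pattern to both series, X→Y stays significant at every lag, and the reverse direction Y→X becomes significant at lags 2–4 (p = 0.47 at lag 1). Since X never depends on Y in the simulation, the Y→X result is **spurious Granger causality** produced by the shared seasonal movement.

**After Differencing:**
After first differencing with `np.diff`:
- The true X→Y relationship remains significant
- The false Y→X relationship does **not** disappear: it is significant at all four lags (p = 0.000000)

A first difference is a lag-1 difference, and the lag-1 difference of a period-20 sine wave is still a period-20 sine wave, so the shared seasonal component survives. Differencing a series that was already stationary may also add serial correlation that a short lag order does not absorb. Differencing at the seasonal period (lag 20 here) would remove this particular seasonal component exactly.

**Key Insight:** Always check for stationarity and address seasonality before interpreting Granger causality, and verify that the chosen transformation actually removed the pattern, as shared patterns can create misleading results.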
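
The printed p-values after `np.diff` are still 0.000000 in the Y → X direction, so it is worth checking what each kind of differencing does to the seasonal term itself. The sketch below applies a lag-1 difference and a lag-20 difference to the same sine wave used in the bonus cell. The stand-in series `base` and the seed are assumptions for illustration; the Granger tests are not re-run here, so no claim is made about the resulting p-values.

```python
import numpy as np

period = 20
t = np.arange(400)
season = 3 * np.sin(2 * np.pi * t / period)  # same seasonal term as the bonus cell

rng = np.random.default_rng(5)
base = rng.normal(size=t.size)  # stand-in for a non-seasonal series
series = base + season

# Lag-1 (first) difference vs. lag-20 (seasonal) difference of the full series.
first_diff = np.diff(series)
seasonal_diff = series[period:] - series[:-period]

# Effect of each operation on the seasonal component alone.
season_after_first_diff = np.diff(season)
season_after_seasonal_diff = season[period:] - season[:-period]

print("max |season| after lag-1 difference :", np.abs(season_after_first_diff).max())
print("max |season| after lag-20 difference:", np.abs(season_after_seasonal_diff).max())
```

The lag-20 difference cancels the sine wave up to floating-point error, while the lag-1 difference leaves a smaller wave with the same period. Seasonal differencing shortens the series by one full period and changes its autocorrelation, so the lag order of any later test may need to be revisited.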


---

### Wrap-up

- **Pearson**: quick check for linear association.  
- **Granger**: tests whether the past of one series improves prediction of another.  
- **Beware**: trends, seasonality, and omitted variables can distort your results.
  
