# 01 Cointegration and Error Correction

Engle-Granger cointegration and error correction models (ECM).


## Table of Contents
- [Construct cointegrated pair](#construct-cointegrated-pair)
- [Engle-Granger test](#engle-granger-test)
- [Error correction model](#error-correction-model)
- [Interpretation](#interpretation)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Time-series econometrics notebooks build the classical toolkit you need before trusting macro regressions:
- stationarity + unit roots,
- cointegration + error correction,
- VAR dynamics and impulse responses.


## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Running tests without plotting or transforming the series first.
- Treating impulse responses as structural causality without an identification story.

## Matching Guide
- `docs/guides/08_time_series_econ/01_cointegration_error_correction.md`



## How To Use This Notebook
- This notebook is hands-on. Most code cells are incomplete on purpose.
- Complete each TODO, then run the cell.
- Use the matching guide (`docs/guides/08_time_series_econ/01_cointegration_error_correction.md`) for deep explanations and alternative examples.
- Write short interpretation notes as you go (what changed, why it matters).



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Learn cointegration and error correction models (ECM):
- long-run equilibrium relationship
- short-run dynamics that correct deviations



## Primer: Classical time-series econometrics with statsmodels (ADF/KPSS, VAR)

This repo already uses time-aware evaluation for ML.
This primer introduces the “classical” time-series econometrics toolkit in `statsmodels`.

### Stationarity and unit roots (ADF / KPSS)
Two common tests:
- **ADF**: null = unit root (nonstationary)
- **KPSS**: null = stationary

```python
from statsmodels.tsa.stattools import adfuller, kpss

# x is a 1D array-like (no missing)
# adf_stat, adf_p, *_ = adfuller(x)
# kpss_stat, kpss_p, *_ = kpss(x, regression='c', nlags='auto')
```

Interpretation habit:
- If ADF p-value is small → evidence against unit root.
- If KPSS p-value is small → evidence against stationarity.

### VAR: multivariate autoregression
VAR models multiple series together:
```python
from statsmodels.tsa.api import VAR

# df: DataFrame of stationary-ish series with a DatetimeIndex
# model = VAR(df)
# res = model.fit(maxlags=8, ic='aic')  # or choose lags manually
# print(res.summary())
```

Useful tools:
```python
# res.test_causality('y', ['x1', 'x2'])      # Granger causality tests
# irf = res.irf(12)                         # impulse responses to 12 steps
# irf.plot(orth=True)                       # orthogonalized (ordering matters)
```

### Practical cautions
- Nonstationary series can create **spurious regression** results.
- IRFs depend on identification choices (e.g., Cholesky ordering).
- Macro series are revised and can have structural breaks; treat results as conditional and fragile.


<a id="construct-cointegrated-pair"></a>
## Construct cointegrated pair

### Goal
Construct a pair of series that are individually nonstationary but cointegrated.



### Your Turn: Simulate a cointegrated pair


In [None]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 240
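idx = pd.date_range('2000-01-31', periods=n, freq='ME')  # month-end; 'ME' needs pandas >= 2.2 ('M' on older versions)

x = rng.normal(size=n).cumsum()  # random walk, so x is I(1)
y = 1.0 * x + rng.normal(scale=0.5, size=n)  # x plus stationary noise, so y - x is I(0)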

df = pd.DataFrame({'x': x, 'y': y}, index=idx)
df.head()



<a id="engle-granger-test"></a>
## Engle-Granger test

### Goal
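Run the Engle-Granger cointegration test and interpret the p-value carefully: the null hypothesis is *no* cointegration, so a small p-value is evidence that the two series are cointegrated.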



### Your Turn: Cointegration test


In [None]:
from statsmodels.tsa.stattools import coint

# coint returns (test statistic, p-value, critical values at 1%/5%/10%)
t_stat, p_val, _ = coint(df['y'], df['x'])
{'t': t_stat, 'p': p_val}



<a id="error-correction-model"></a>
## Error correction model

### Goal
Fit an ECM:
- short-run changes depend on long-run disequilibrium (lagged residual)
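
Written out, the two-step setup can be sketched as follows. The symbols $\alpha$, $\beta$, $c$, $\gamma$, $\lambda$, and $\varepsilon_t$ are notation chosen here for illustration; the notebook itself only names the variables `y`, `x`, and the lagged residual.

$$
\begin{aligned}
\text{Step 1 (long run):}\quad & y_t = \alpha + \beta x_t + u_t \\
\text{Step 2 (ECM):}\quad & \Delta y_t = c + \gamma\,\Delta x_t + \lambda\,\hat{u}_{t-1} + \varepsilon_t
\end{aligned}
$$

Here $\hat{u}_{t-1}$ is the lagged residual from step 1 and $\lambda$ is the error-correction coefficient. If the series are cointegrated, $\lambda$ is expected to be negative, so that a positive gap last period pushes $\Delta y_t$ down.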



### Your Turn: Fit ECM


In [None]:
import statsmodels.api as sm

# Long-run regression
lr = sm.OLS(df['y'], sm.add_constant(df[['x']], has_constant='add')).fit()
df['u'] = lr.resid

# ECM regression: change in y on change in x and last period's disequilibrium
ecm = pd.DataFrame({
    'dy': df['y'].diff(),
    'dx': df['x'].diff(),
    'u_lag1': df['u'].shift(1),
}).dropna()

res = sm.OLS(ecm['dy'], sm.add_constant(ecm[['dx', 'u_lag1']], has_constant='add')).fit()
res.params



<a id="interpretation"></a>
## Interpretation

Write 5-8 sentences:
- What does the error-correction coefficient mean?
- What would you expect if there were no cointegration?



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
import pandas as pd

# TODO: Validate your time series table is well-formed.
# Example (adjust variable names):
# assert isinstance(df.index, pd.DatetimeIndex)
# assert df.index.is_monotonic_increasing
# assert df.shape[0] > 30
#
# TODO: If you built transformed series (diff/logdiff), confirm no future leakage.
# Hint: transformations should only use past/current values (shift/diff), never future.
...



## Extensions (Optional)
- Try one additional variant beyond the main path (for example a different noise scale, a different long-run slope, or a pair that is not cointegrated).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Construct cointegrated pair</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_cointegration_error_correction — Construct cointegrated pair
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 240
idx = pd.date_range('2000-01-31', periods=n, freq='ME')  # month-end; 'ME' needs pandas >= 2.2 ('M' on older versions)

x = rng.normal(size=n).cumsum()  # random walk
y = 1.0 * x + rng.normal(scale=0.5, size=n)  # cointegrated with x

df = pd.DataFrame({'x': x, 'y': y}, index=idx)
df.head()
```

</details>

<details><summary>Solution: Engle-Granger test</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_cointegration_error_correction — Engle-Granger test
from statsmodels.tsa.stattools import coint

t_stat, p_val, _ = coint(df['y'], df['x'])
{'t': t_stat, 'p': p_val}
```

</details>

<details><summary>Solution: Error correction model</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_cointegration_error_correction — Error correction model
import statsmodels.api as sm

# Step 1: long-run relationship
lr = sm.OLS(df['y'], sm.add_constant(df[['x']], has_constant='add')).fit()
df['u'] = lr.resid

# Step 2: ECM
ecm = pd.DataFrame({
    'dy': df['y'].diff(),
    'dx': df['x'].diff(),
    'u_lag1': df['u'].shift(1),
}).dropna()

res = sm.OLS(ecm['dy'], sm.add_constant(ecm[['dx', 'u_lag1']], has_constant='add')).fit()
res.params
```

</details>

<details><summary>Solution: Interpretation</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_cointegration_error_correction — Interpretation
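# Explain what the error-correction coefficient implies about mean reversion.
# - The coefficient on u_lag1 is the error-correction (speed-of-adjustment) term.
# - With cointegration it should be negative: if y was above its long-run level
#   last period (u_lag1 > 0), dy tends to be negative and y moves back toward x.
# - Its magnitude is roughly the share of last period's gap closed each period.
# - Without cointegration there is no stable gap to close: expect a coefficient
#   near zero, and treat the levels regression itself as potentially spurious.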
```

</details>

