In [None]:
# === Environment Setup ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint
from statsmodels.tsa.api import VECM
try:
    import pandas_datareader.data as web
    PDR_AVAILABLE = True
except ImportError:
    PDR_AVAILABLE = False
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-info'>📝 {msg}</div>"))
def sec(title): print(f'\n{80*"="}\n| {title.upper()} |\n{80*"="}')

note("Environment initialized for Cointegration and VECM.")

# Chapter 8.6: Cointegration and Error Correction Models

---

### Table of Contents

1.  [**Introduction: The Problem of Non-Stationary Variables**](#intro)
    - [Spurious Regression](#spurious)
2.  [**Cointegration: The Long-Run Relationship**](#cointegration)
3.  [**Vector Error Correction Models (VECM)**](#vecm)
4.  [**Case Study: Consumption and Income**](#case-study)
5.  [**Exercises**](#exercises)
6.  [**Summary and Key Takeaways**](#summary)

<a id='intro'></a>
## 1. Introduction: The Problem of Non-Stationary Variables

A major challenge in econometrics is that many macroeconomic time series are non-stationary, typically integrated of order one, I(1). Regressing one non-stationary series on another can produce a **spurious regression**: a regression that reports a statistically significant relationship even when the variables are completely unrelated. This happens because two independent random walks each wander persistently in some direction by pure chance, and OLS picks up this coincidental co-movement. The usual t-statistics do not follow their standard distributions in this setting, so conventional significance tests are unreliable.

<a id='spurious'></a>
### Spurious Regression
The tell-tale signs of a spurious regression are a very high $R^2$ combined with a very low Durbin-Watson statistic (indicating highly autocorrelated residuals). The residuals themselves will be non-stationary. This suggests that the model has not captured a true economic relationship, but merely a coincidental trend.

In [None]:
sec("Demonstration of Spurious Regression")
np.random.seed(42)
n_obs = 500

# Generate two independent random walks
x = np.random.normal(size=n_obs).cumsum()
y = np.random.normal(size=n_obs).cumsum()

# Regress y on x
X = sm.add_constant(x)
model = sm.OLS(y, X)
results = model.fit()

note(f"x and y are independent by construction, yet OLS reports R-squared = {results.rsquared:.3f} with a slope t-statistic of {results.tvalues[1]:.2f}. A large t-statistic here is spurious: it reflects coincidental trends, not a real relationship.")
print(results.summary())
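
A single draw could be a fluke of the seed, so a small Monte Carlo experiment helps show that the problem is systematic. The sketch below is an illustration rather than part of the original analysis: the helper `slope_t_stat`, the seed, and the number of replications are assumptions chosen for the example. It repeatedly regresses one independent random walk on another, then repeats the exercise on first differences.

```python
import numpy as np

def slope_t_stat(y, x):
    """OLS t-statistic for b in y = a + b*x + e, using classical standard errors."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)   # hypothetical seed
n_obs, n_sims = 500, 200         # hypothetical sample size and replications

t_levels = np.empty(n_sims)
t_diffs = np.empty(n_sims)
for i in range(n_sims):
    # Two random walks that are independent by construction
    x = rng.normal(size=n_obs).cumsum()
    y = rng.normal(size=n_obs).cumsum()
    t_levels[i] = slope_t_stat(y, x)                    # regression in levels
    t_diffs[i] = slope_t_stat(np.diff(y), np.diff(x))   # regression in differences

# Share of replications where a nominal 5% test "finds" a relationship
reject_levels = np.mean(np.abs(t_levels) > 1.96)
reject_diffs = np.mean(np.abs(t_diffs) > 1.96)
print(f"Rejection rate in levels:      {reject_levels:.2f}")
print(f"Rejection rate in differences: {reject_diffs:.2f}")
```

In levels the nominal 5% test rejects far more often than 5% of the time, while in first differences the rejection rate should sit near the nominal level. Differencing removes the spurious result, but as the following sections discuss, it also discards any genuine long-run relationship.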

<a id='cointegration'></a>
## 2. Cointegration: The Long-Run Relationship

While regressing one I(1) series on another is generally problematic, there is a crucial exception. The concept of **cointegration**, introduced by **Clive Granger** and developed with **Robert Engle**, provides a framework for modeling meaningful long-run relationships between non-stationary variables. The two shared the 2003 Nobel Memorial Prize in Economic Sciences, with Granger cited for cointegration and Engle for ARCH models of time-varying volatility.

**Definition: Cointegration**
A set of I(1) variables are said to be cointegrated if there exists a linear combination of them that is stationary, I(0). This stationary linear combination is called the **cointegrating equation** and is interpreted as the long-run equilibrium relationship between the variables.

Think of two drunkards walking home from a bar. Each person's path is a random walk (I(1)). However, they are holding hands, so they cannot drift infinitely far apart from each other. Their individual paths are non-stationary, but the *distance between them* is stationary. They are cointegrated.

![Visualization of Cointegration](../images/png/cointegration_visualization.png)

For example, if consumption ($c_t$) and income ($y_t$) are both I(1), but the linear combination $c_t - \beta y_t = u_t$ is I(0), then consumption and income are cointegrated. The term $u_t$ represents the deviation from the long-run equilibrium. While consumption and income can drift apart in the short run, the fact that $u_t$ is stationary means they cannot wander arbitrarily far from each other; there is an economic force that pulls them back to their long-run relationship.

<a id='vecm'></a>
## 3. Vector Error Correction Models (VECM)

If a set of variables is cointegrated, a standard VAR is a poor fit in either form. A VAR in levels does not impose the cointegrating restriction and leads to nonstandard inference because the variables are non-stationary, while a VAR in first differences discards the long-run equilibrium information. The appropriate model is a **Vector Error Correction Model (VECM)**.

A VECM is a restricted VAR designed for cointegrated series. It can be thought of as a VAR in first-differences, with an additional **error correction term**. For two variables, the VECM is:
$$ \Delta y_{1,t} = \alpha_1(y_{1,t-1} - \beta y_{2,t-1}) + \text{lags of } \Delta y_{1,t}, \Delta y_{2,t} + \epsilon_{1,t} $$
$$ \Delta y_{2,t} = \alpha_2(y_{1,t-1} - \beta y_{2,t-1}) + \text{lags of } \Delta y_{1,t}, \Delta y_{2,t} + \epsilon_{2,t} $$

- The term $(y_{1,t-1} - \beta y_{2,t-1})$ is the lagged **error correction term**: the deviation from the long-run equilibrium in the previous period. It uses the same normalization as the consumption example above, with $y_1$ in the role of consumption and $y_2$ in the role of income.
- The coefficients $\alpha_1$ and $\alpha_2$ are the **speeds of adjustment**. Each measures how strongly its variable responds to deviations from equilibrium. For example, a negative $\alpha_1$ means that if $y_1$ was too high relative to its long-run relationship with $y_2$ in the previous period (a positive error correction term), it will tend to decrease in the current period, moving back towards equilibrium. With $\beta > 0$, a positive $\alpha_2$ closes the gap from the other side by raising $y_2$.
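
To see the adjustment mechanism at work, it helps to strip out the noise and the lagged differences and iterate the two equations by hand. The sketch below is a simplified illustration under stated assumptions: the error correction term is written as $y_1 - \beta y_2$, and the values $\alpha_1 = -0.2$, $\alpha_2 = 0.1$, $\beta = 1$ are hypothetical. Under these assumptions the gap $z_t = y_{1,t} - \beta y_{2,t}$ follows $z_t = (1 + \alpha_1 - \beta\alpha_2) z_{t-1}$, so it shrinks geometrically whenever that factor lies strictly between -1 and 1.

```python
# Hypothetical parameters, chosen only for illustration
alpha1, alpha2, beta = -0.2, 0.1, 1.0

def step(y1, y2):
    """One period of a noise-free VECM with no lagged differences."""
    ect = y1 - beta * y2   # lagged error correction term
    return y1 + alpha1 * ect, y2 + alpha2 * ect

y1, y2 = 1.0, 0.0          # y1 starts one unit above its equilibrium with y2
gaps = [y1 - beta * y2]
for _ in range(10):
    y1, y2 = step(y1, y2)
    gaps.append(y1 - beta * y2)

# Implied persistence of the gap: 1 + alpha1 - beta * alpha2
rho = 1 + alpha1 - beta * alpha2

for t, gap in enumerate(gaps[:5]):
    print(f"t={t}: gap = {gap:.4f}")
print(f"Implied persistence rho = {rho:.2f}")
```

Both variables move here: $y_1$ falls and $y_2$ rises, and together they remove 30% of the gap each period.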

<a id='case-study'></a>
## 4. Case Study: Consumption and Income

In [None]:
sec("Case Study: Cointegration between Consumption and Income")

# 1. Load data
series_to_load = {
    'PCECC96': 'LogCons',
    'DPIC96': 'LogInc'
}
if PDR_AVAILABLE:
    note("Attempting to download quarterly US macro data from FRED.")
    try:
        start = '1960-01-01'
        end = '2019-12-31' # End before COVID for a more stable period
        data_raw = web.DataReader(list(series_to_load.keys()), 'fred', start, end)
        note("Data downloaded successfully.")
    except Exception as e:
        note(f"Could not download data from FRED ({e}). Falling back to local CSVs.")
        cons = pd.read_csv('data/PCECC96.csv', index_col='observation_date', parse_dates=True)
        inc = pd.read_csv('data/DPIC96.csv', index_col='observation_date', parse_dates=True)
        data_raw = pd.concat([cons, inc], axis=1)
else:
    note("pandas_datareader not available. Loading data from local CSVs.")
    cons = pd.read_csv('data/PCECC96.csv', index_col='observation_date', parse_dates=True)
    inc = pd.read_csv('data/DPIC96.csv', index_col='observation_date', parse_dates=True)
    data_raw = pd.concat([cons, inc], axis=1)

df = np.log(data_raw).dropna()
df.columns = ['LogCons', 'LogInc']
df = df.loc['1960-01-01':'2019-12-31']
df.plot(title='Log of Real Consumption and Disposable Income')
plt.show()

# 2. Test for Cointegration
note("Testing for cointegration between the two series using the Engle-Granger test.")
score, p_value, _ = coint(df['LogCons'], df['LogInc'])
note(f"Cointegration test p-value: {p_value:.4f}")
if p_value < 0.05:
    note("The p-value is less than 0.05, so we reject the null hypothesis of no cointegration. The series appear to have a long-run equilibrium relationship.")
else:
    note("The p-value is greater than 0.05, so we fail to reject the null of no cointegration.")

# 3. Estimate and Analyze the VECM
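note("Proceeding on the assumption of one cointegrating relationship (coint_rank=1), we estimate a VECM.")
# k_ar_diff is the number of lagged differences in the VECM (a VAR with p lags in levels has p-1).
# deterministic='ci' places a constant inside the cointegrating relation.
model_vecm = VECM(df, k_ar_diff=1, coint_rank=1, deterministic='ci')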
results_vecm = model_vecm.fit()
print(results_vecm.summary())


### Interpreting the VECM Results

The VECM summary provides a wealth of information. Let's break down the key parts:

1.  **Error Correction Section (`alpha` coefficients):**
    - This section shows the **speed of adjustment** coefficients. The `alpha` for the `Delta(LogCons)` equation tells us how quickly consumption responds to deviations from the long-run equilibrium.
    - A statistically significant and negative coefficient (e.g., -0.05) would mean that when consumption is above its long-run path relative to income, it adjusts downwards by 5% of the deviation in the next quarter.
    - The `alpha` for the `Delta(LogInc)` equation is often not significant, implying that income is 'weakly exogenous' and does not adjust to restore the equilibrium; rather, consumption does all the work.

2.  **Cointegrating-vector Section (`beta` coefficients):**
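    - This section shows the **long-run cointegrating relationship**. The coefficients are normalized so that the first variable has a coefficient of one. If the `beta` vector is `[1.0000, -0.9200]`, then, setting aside the constant that `deterministic='ci'` places inside the cointegrating relation, the long-run relationship is:
      $$ LogCons_{t-1} - 0.92 \times LogInc_{t-1} = 0 $$
      $$ LogCons = 0.92 \times LogInc $$
    - Because both variables are in logs, 0.92 is a long-run elasticity of consumption with respect to income, not a marginal propensity to consume: a 1% rise in income is associated with a 0.92% rise in consumption in the long run. A value close to one is economically plausible.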

### Beyond Engle-Granger: The Johansen Test

The Engle-Granger test is simple and intuitive, but it has limitations. It can only find at most one cointegrating relationship, and the result can be sensitive to which variable is chosen as the dependent variable in the first-stage regression. 

The **Johansen test** is a more powerful and widely used procedure that overcomes these issues. It is a maximum likelihood-based test that can determine the **cointegration rank** (the number of cointegrating relationships) in a system of multiple time series. It is the standard approach for systems with more than two variables.

<a id='exercises'></a>
## 5. Exercises

1.  **Interpreting Cointegration:** If the price of a stock traded in New York and the price of its cross-listed depository receipt in London are both I(1) but are found to be cointegrated, what is the economic interpretation of this relationship? What does the error correction term represent?
2.  **Testing for Cointegration:** The Engle-Granger two-step method involves first running an OLS regression of one I(1) variable on another, and then performing a unit root test (like the ADF test) on the residuals of that regression. What is the null hypothesis of this ADF test, and what would a rejection of the null imply?
3.  **VECM vs. VAR in Differences:** If two variables are cointegrated, why is it better to use a VECM rather than a standard VAR on the first differences of the data?
4.  **Speed of Adjustment:** In the VECM summary for consumption and income, look at the `alpha` coefficient in the `Delta(LogCons)` equation. If this coefficient is, for example, -0.2, what does this mean in practical terms?

<a id='summary'></a>
## 6. Summary and Key Takeaways

This chapter introduced the concepts of cointegration and error correction, which are essential for modeling long-run relationships between non-stationary economic variables.

**Key Concepts**:
- **Spurious Regression**: A naive regression of one I(1) variable on another can produce statistically significant results even when the variables are unrelated, due to coincidental trends.
- **Cointegration**: Two or more I(1) variables are cointegrated if a linear combination of them is stationary (I(0)). This stationary combination represents the long-run equilibrium relationship.
- **Error Correction Term**: The deviations from the long-run cointegrating relationship. Because this term is stationary, it tends to revert to its mean of zero.
- **Vector Error Correction Model (VECM)**: The appropriate model for cointegrated time series. It is a VAR in first-differences that includes the lagged error correction term, allowing it to model both short-run dynamics and the long-run equilibrium adjustment.

### Solutions to Exercises

---

**1. Interpreting Cointegration:**
This means the two prices share a common stochastic trend and are bound by a long-run equilibrium relationship, likely enforced by arbitrage. While the prices can drift apart temporarily due to market frictions, they cannot deviate from each other indefinitely. The error correction term represents the (temporary) deviation from the law of one price, or the arbitrage opportunity.

---

**2. Engle-Granger Test:**
The null hypothesis of the ADF test on the residuals is that the residuals have a unit root (i.e., they are non-stationary). A rejection of the null means the residuals are stationary, which, by definition, implies that the original series are cointegrated.

---

**3. VECM vs. VAR in Differences:**
A standard VAR in first-differences is misspecified for cointegrated data because it throws away the crucial information about the long-run equilibrium relationship. By differencing all variables, you remove the common trend, but you also remove the information about how the variables move together in the long run. The VECM correctly includes this long-run information through the error correction term, leading to a better model specification and more accurate forecasts.

---

**4. Speed of Adjustment:**
An alpha coefficient of -0.2 in the consumption equation means that when consumption is above its long-run equilibrium level relative to income, it will adjust downwards in the next period to correct for about 20% of that deviation. It measures the speed at which consumption returns to the long-run path after a shock.
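
The figure in Solution 4 can be turned into a half-life with a short calculation. The sketch below rests on simplifying assumptions that go beyond the exercise: only consumption adjusts (income is weakly exogenous) and the short-run lag terms are ignored, so the deviation shrinks by the factor $1 + \alpha$ each quarter. The variable names are hypothetical.

```python
import math

alpha_cons = -0.2              # speed of adjustment from the exercise
persistence = 1 + alpha_cons   # share of the deviation left after one quarter

# Quarters needed for half of an initial deviation to be corrected
half_life = math.log(0.5) / math.log(persistence)

# Share of the initial deviation remaining after 0, 1, ..., 4 quarters
remaining = [persistence ** q for q in range(5)]

print(f"Half-life: {half_life:.2f} quarters")
for q, share in enumerate(remaining):
    print(f"After {q} quarter(s): {share:.1%} of the deviation remains")
```

Under these assumptions, roughly three quarters pass before half of a deviation has been corrected.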