# Yield Curve Frictions — Clean Analysis Notebook (Daily Data)

This is the **tidied analysis** version of the original notebook.  
All reusable functions now live in `yc_frictions_tools.py`. The notebook focuses on **data**, **estimation calls**, and **results** with clear commentary.

**Heads-up**
- Half-lives are reported in **trading days** (daily frequency).
- End-of-month (**EOM**) effects are more robust than quarter-end (**EOQ**).
- HAC (Newey–West) errors are used; ADF guides differencing.


# Audit & Discrepancy Notes *(added 2025-08-14)*

This notebook estimates short-run and long-run relationships between ACMY and (i) **foreign constraints** (GBPUSD/EURUSD/JPY 3M basis, PCA) and (ii) **domestic constraints** (IORB–SOFR, GCF_survey), with curve and volatility controls. Below we flag places where the empirical results **in the body** differ from statements made **up front** or in earlier drafts, and we add brief commentary to keep the narrative aligned with the estimates.

---

### Key empirical facts from this notebook
- **Cointegration**: `BPBS_3MO` and `ACMY` are cointegrated (Engle–Granger residual ADF rejects the unit-root null). This supports a **long-run equilibrium** in levels.
- **ECM adjustment**: The error-correction coefficient `ECT_L1` is **negative and small in magnitude** → **very slow** mean reversion (long half-life).
- **Short-run basis effects**: In Δ models and in ECM short-run equations, **Δbasis is not statistically significant** once curve controls are included.
- **PCA basis (`Basis_PC1`)**: Identified as **I(0)** and **not cointegrated** with ACMY. Large ARDL long-run multipliers in levels are therefore **mechanical** (due to high persistence in **Y**) and **should not** be read as structural long-run effects.
- **Domestic proxies (IORB–SOFR, GCF_survey)**: Show **economically meaningful long-run slopes** in levels with ACMY; **short-run pass-through** in Δ is **small/insignificant** conditional on controls.
- **Curve slopes (Δ10–2, Δ2–1MO)**: **Consistently strong** and economically large in ΔACMY regressions → they are the **dominant short-run drivers**.
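- **EOM dummy**: **Consistently negative** (roughly −1bp to −1.8bp): significant at 5% in the ARDL (p ≈ 0.004) and Δ (p ≈ 0.042) models, borderline in the `BPBS_3MO` ECM (p ≈ 0.065). The effect strengthens when using an **EOM window** (coded below as ±2 trading days around the flagged month-end).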
- **Volatility controls**: MOVE/VIX are marginal in ΔACMY once slopes are present; using **both** induces collinearity.

---

### Discrepancies to correct (if present up front)
- If the introduction claims **“basis has significant short-run effects”**, update to: *basis primarily influences the **long-run level**; **short-run impact is not robust** conditional on curve controls.*
- If it claims **“PCA basis delivers a large long-run multiplier”**, add: *the series is I(0) and **not cointegrated** with ACMY; large LRM in ARDL levels is likely **spurious** from persistence.*
- If it downplays **EOM**, update to: *EOM is **consistently negative**, economically small but persistent (≈ −1bp), and significant at 5% in the ARDL and Δ models (borderline, p ≈ 0.065, in the ECM).*
- If both **10–2 and 2–1MO** are interpreted simultaneously, add a note on **multicollinearity** and prefer a **parsimonious baseline** with a single slope and single vol control.

---

### Interpretation guide
- **Long run** = cointegrating slope in the levels regression (`θ`): permanent 1bp change in the proxy is associated with `θ` bp change in ACMY in equilibrium.
- **Short run** = coefficients on **Δ** regressors in Δ models/ECM. **λ (ECT_L1)** gives the **speed of adjustment** back to equilibrium; half-life = `ln(0.5) / ln(1+λ)`.
- **Beware mechanical overlap**: Δ(10–2) contains Δ(10y) by construction; treat it as a **control/hedge**, not as a standalone trading signal.
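
The half-life formula in the guide above can be checked numerically. The snippet below is a minimal sketch; the helper name `half_life` is hypothetical (it is not part of `yc_frictions_tools.py`), and the two λ values are the adjustment coefficients quoted in the ECM sections of this notebook.

```python
import math

def half_life(lam: float) -> float:
    """Periods needed to close half of a disequilibrium when ECT_L1 = lam."""
    if not -1.0 < lam < 0.0:
        raise ValueError("half-life is defined only for -1 < lam < 0")
    return math.log(0.5) / math.log(1.0 + lam)

# Adjustment coefficients quoted in the ECM sections of this notebook
for lam in (-0.0072, -0.0155):
    print(f"lambda = {lam:+.4f} -> half-life = {half_life(lam):.1f} periods")
```

With daily data a "period" is one trading day, so dividing by about 21 gives an approximate figure in months.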


## Contents

- [Notes](#notes)
- [Data loading & cleaning](#data-loading-cleaning)
- [Stationarity tests (ADF)](#stationarity-tests-adf)
- [Lag construction](#lag-construction)
- [Regression (static / Δ model)](#regression-static-δ-model)
- [ECM estimation](#ecm-estimation)


In [125]:
# Core imports
import sys, numpy as np, pandas as pd, matplotlib.pyplot as plt
# Make sure the tools module is on path
sys.path.append('../src')
from yc_frictions_tools import (
    run_ols, tidy_results, adf_test, adf_classify, make_lags,
    fit_distributed_lag, build_reg_data_for_dY, engle_granger_test, run_ecm, dlog_safe, run_ecm_with_lags
)

# Plot defaults (keep simple to avoid style coupling)
plt.rcParams['figure.figsize'] = (7, 3.5)
plt.rcParams['axes.grid'] = True


### Notes

## Augmented Dickey–Fuller (ADF) Test

The **Augmented Dickey–Fuller (ADF) test** is a statistical test used to determine whether a time series is **stationary** or has a **unit root** (non-stationary).

---

### 1. Purpose

In time-series econometrics, many models (e.g., OLS in levels) assume stationarity.  
The ADF test helps check if a variable is:
- **I(0)**: stationary — mean, variance, and autocovariance are constant over time.
- **I(1)**: non-stationary — often requires differencing to achieve stationarity.

---

### 2. Hypotheses

- **Null hypothesis (H₀)**: The series has a unit root → **non-stationary**.
- **Alternative hypothesis (H₁)**: The series is stationary (no unit root).

---

### 3. Test structure

The ADF regression takes the form:

Δyₜ = α + β·t + γ·yₜ₋₁ + Σᵢ δᵢ·Δyₜ₋ᵢ + εₜ

Where:
- Δyₜ = first difference of yₜ
- α = constant (optional)
- β·t = deterministic time trend (optional)
- γ = coefficient on lagged level → **key for stationarity**
- Lagged Δy terms account for autocorrelation.

---

### 4. Decision rule

- Look at the **p-value**:
  - If **p ≤ 0.05** → reject H₀ → series is stationary.
  - If **p > 0.05** → fail to reject H₀ → series has a unit root.
- Alternatively, compare the ADF statistic to critical values:
  - ADF stat < critical value → reject H₀.
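
To make the test structure and decision rule concrete, here is a minimal NumPy sketch of the regression with a constant and no trend. The function `adf_tstat`, the simulated series, and the rounded critical value of −2.86 (the approximate large-sample 5% Dickey–Fuller value for the constant-only case) are illustrative assumptions; in the notebook itself, `statsmodels.tsa.stattools.adfuller` handles lag selection and p-values.

```python
import numpy as np

def adf_tstat(y, p=1):
    """t-statistic on gamma in: dy_t = alpha + gamma*y_{t-1} + sum_i delta_i*dy_{t-i} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                              # dy[k] = y[k+1] - y[k]
    target = dy[p:]                              # left-hand side: dy_t
    cols = [np.ones_like(target), y[p:-1]]       # constant and lagged level y_{t-1}
    for i in range(1, p + 1):
        cols.append(dy[p - i:len(dy) - i])       # lagged differences dy_{t-i}
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (len(target) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.standard_normal(500)
stationary = noise                  # white noise: I(0) by construction
random_walk = np.cumsum(noise)      # cumulated noise: I(1) by construction

CRIT_5PCT = -2.86  # assumed approximate 5% critical value (constant, no trend)
for name, series in [("stationary", stationary), ("random walk", random_walk)]:
    t = adf_tstat(series, p=1)
    verdict = "reject H0 (stationary)" if t < CRIT_5PCT else "fail to reject H0 (unit root)"
    print(f"{name:12s} t = {t:7.2f} -> {verdict}")
```

The t-statistic must be compared with Dickey–Fuller critical values, not with the usual normal or Student-t tables, because its distribution under the null is non-standard.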

---

### 5. Practical notes

- Financial and macroeconomic variables are often **I(1)** in levels but stationary in **first differences**.
- ADF can be run with or without trend/constant; choice should match the series’ properties.
- For cointegration tests (e.g., Engle–Granger), ADF is applied to **regression residuals**.

---

**References**:
- Dickey, D. A., & Fuller, W. A. (1979). "Distribution of the estimators for autoregressive time series with a unit root."
- Hamilton, J. D. (1994). *Time Series Analysis*.


### Multicollinearity Check
It is important to check whether a group of regressors is multicollinear, since that would reduce the precision of the regression estimates.

This indicates multicollinearity, which in turn degrades the precision of the full regressions. We have two choices: (i) drop EUBS_3MO entirely, or (ii) construct a PCA factor to represent the 'foreign channel' of intermediary frictions.

No multicollinearity between these two variables, so we can use both as controls at the same time.

This is a sign of multicollinearity; I will drop VIX from the set of controls and use it solely as a robustness check.

No multicollinearity between these two variables, so they can be admitted as controls.

No multicollinearity between these two variables, so they can be admitted as controls.

### Static Regressions
The first regressions take each basis variable separately, to see its individual effect. We then regress on all of them together to see which is most significant. Controls are included; lagged variables are not.

The GBPUSD 3-month basis has the highest explanatory power among the basis variables, as shown by both its statistical significance and its R-squared. When all bases enter the same regression, the GBPUSD 3-month basis is also the most economically significant.

PC1 captures more than 80 percent (about 82 percent) of the variation in the three cross-currency bases, so a single factor summarizes them well. Further, the static regression indicates that Basis_PC1 is statistically and economically significant, even after including the controls.

## ARDL(1,1) Regression with Controls — ACMY10 on Basis_PC1

**Model specification:**  
$Y_t = \alpha + \rho Y_{t-1} + \beta_0 \,\text{Basis\_PC1}_t + \beta_1 \,\text{Basis\_PC1}_{t-1} + \gamma' \mathbf{X}_t + u_t$  
where $\mathbf{X}_t$ includes MOVE, 10–2 slope, 2–1MO slope, IORB–SOFR spread, GCF survey rate, end-of-month (eom) and end-of-quarter (eoq) dummies.



---

### 1. Model Fit
- **$R^2 = 0.995$** — Very high, driven largely by persistence ($Y_{t-1} \approx 0.991$).  
- High $R^2$ in highly persistent series should be interpreted with caution.

---

### 2. Basis_PC1 Effects
- **Short-run coefficients**:  
  - $\beta_0 = 1.9577$ (p ≈ 0.130) — not significant.  
  - $\beta_1 = -1.4503$ (p ≈ 0.274) — not significant.
- **Cumulative short-run effect**: $\beta_0 + \beta_1 \approx 0.5074$ (insignificant).
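- **Long-run multiplier (LRM)**:  
  $$
  \text{LRM} = \frac{\beta_0 + \beta_1}{1 - \rho} \approx 57.26
  $$  
  → With the rounded values shown here, $0.5074 / (1 - 0.991) \approx 56.4$; the reported 57.26 presumably reflects the unrounded $\rho$, which shows how sensitive the ratio is when $\rho$ is close to 1.  
  → Large LRM arises mainly from high persistence, not strong short-run effects.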
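
The mechanical nature of the LRM is easy to see by holding the short-run coefficients fixed and varying the persistence parameter. This is a small illustrative sketch: `long_run_multiplier` is a hypothetical helper, and the values of ρ other than 0.991 are made up for comparison.

```python
def long_run_multiplier(beta0: float, beta1: float, rho: float) -> float:
    """ARDL(1,1) long-run multiplier (beta0 + beta1) / (1 - rho); requires |rho| < 1."""
    if abs(rho) >= 1.0:
        raise ValueError("the long-run multiplier is undefined for |rho| >= 1")
    return (beta0 + beta1) / (1.0 - rho)

beta0, beta1 = 1.9577, -1.4503        # short-run coefficients reported above
for rho in (0.980, 0.991, 0.995):     # 0.991 is the reported value; the others are illustrative
    lrm = long_run_multiplier(beta0, beta1, rho)
    print(f"rho = {rho:.3f} -> LRM = {lrm:6.2f}")
```

Moving ρ in the third decimal changes the multiplier by tens of units while the short-run coefficients stay fixed, so the LRM mostly reflects persistence in the dependent variable.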

---

### 3. Control Variables
- MOVE: $0.0293$ (p ≈ 0.119) — marginal.  
- 2_1MO slope: $0.9282$ (p ≈ 0.009) — significant positive effect.  
- IORB–SOFR: $-0.1490$ (p ≈ 0.111) — not significant.  
- eom dummy: $-1.7518$ (p ≈ 0.004) — significant, likely settlement/calendar pattern.  
- eoq dummy: insignificant.  
- GCF survey: not significant.

---

### 4. Multicollinearity
- **Condition number**: $3.65 \times 10^3$ → suggests strong multicollinearity or scaling issues.
- Likely sources:  
  - Persistent variables ($Y_{t-1}$, Basis_PC1 lags)  
  - Correlated controls (e.g., yield curve slopes, MOVE/VIX if included together)
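
A large condition number does not by itself distinguish collinearity from scaling. The sketch below uses simulated, deliberately uncorrelated regressors (all names and numbers are illustrative, not notebook data) to show that an uncentered, large-scale regressor alone can push the condition number into the hundreds.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x_small = rng.standard_normal(n)                    # regressor on a unit scale
x_large = 100.0 + 20.0 * rng.standard_normal(n)     # independent regressor with a large mean, like an index level

# Raw design matrix: constant plus both regressors in their original units
X_raw = np.column_stack([np.ones(n), x_small, x_large])

# Same regressors after centering and scaling to unit variance
Z = np.column_stack([x_small, x_large])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
X_std = np.column_stack([np.ones(n), Z])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
print(f"condition number, raw units:    {cond_raw:8.1f}")
print(f"condition number, standardized: {cond_std:8.1f}")
```

If the condition number stays high after standardizing, the problem is more likely genuine collinearity than units.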

---

### 5. Residual Diagnostics
- DW ≈ $2.02$ — no major autocorrelation.  
- JB p-value < 0.001 — residuals not normally distributed.  
- Omnibus p < 0.001 — confirms non-normality.

---

### Interpretation
- No significant short-run effect of Basis_PC1 once controls are included.  
- Large LRM is mechanical due to high persistence in ACMY10.  
- Multicollinearity may be inflating standard errors — robustness checks could include:
  - Dropping correlated controls one at a time.
  - Orthogonalizing Basis_PC1 against controls.


## Augmented Dickey–Fuller (ADF) Tests

We test each series for stationarity under the null hypothesis:

- **$H_0$**: Series has a unit root (nonstationary)  
- **$H_1$**: Series is stationary

---

### Interpretation
- **ACMY10, 10_2, 2_1MO, IORB_SOFR, and GCF_survey** are I(1), requiring differencing for stationarity.  
- **Basis_PC1, MOVE, VIX, eom, and eoq** are already stationary, so no differencing is needed.  
- Since ACMY10 is I(1) and Basis_PC1 is I(0), the two series have different integration orders and cannot be cointegrated in the Engle–Granger sense. An ARDL specification therefore remains the appropriate framework for this pair.

## Regression Interpretation

The dependent variable is  
$\Delta Y = \Delta \text{ACMY10}$  
with the main explanatory variable being the proxy $\text{Basis\_PC1}$ (in levels, with lags 0 and 1), plus controls in differences or levels depending on stationarity.

---

**Key results:**

- **Proxy coefficients:**
  - $\text{proxy\_L0} = 1.0755$ (p = 0.370)  
  - $\text{proxy\_L1} = -1.2371$ (p = 0.300)  
  Neither coefficient is individually significant at the 5% level.
  - **Cumulative short-run effect**:  
    $1.0755 + (-1.2371) = -0.1616$  
    Small and statistically insignificant.

- **Controls:**
  - $\Delta 10\_2$ and $\Delta 2\_1MO$ are **highly significant** (p < 0.001), with large positive coefficients (68.31 and 46.85 respectively), suggesting term-structure slope changes explain much of the variation in $\Delta\text{ACMY10}$.
  - $\text{eom}$ is negative and significant (p = 0.042), indicating a systematic end-of-month effect.
  - MOVE, $\Delta$IORB\_SOFR, $\Delta$GCF\_survey, and eoq are not statistically significant.

- **Model fit:**
  - $R^2 = 0.449$, meaning about 45% of the variation in $\Delta\text{ACMY10}$ is explained.
  - Condition number = $2.6 \times 10^3$, high enough to suggest some multicollinearity concerns, though below extreme values in the earlier PCA runs.

---

**Interpretation:**
- The proxy variable **Basis\_PC1** does not have a statistically significant short-run effect on $\Delta\text{ACMY10}$ after controlling for domestic term-structure slopes, policy spreads, and calendar dummies.
- The term-structure slope variables dominate the explanatory power in this model.
- The end-of-month dummy remains a consistent and significant predictor.
- Multicollinearity among controls is possible, but the strongest effects are concentrated in slope measures.


### Engle–Granger Cointegration Test: ACMY10 vs BPBS_3MO

**Step 1 – Levels Regression**  
$$
ACMY10_t = 399.46 + 5.8982 \cdot BPBS\_3MO_t + u_t
$$ 
- $ R^2 $ = 0.342  
- Slope highly significant (t = 22.75, p < 0.001)

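**Step 2 – ADF on Residuals**  
- ADF statistic = **-3.1747**  
- p-value = **0.0215**  
- Critical value (5%) = **-2.8645**  
- Result: **Reject unit root null → residuals are stationary**
- Caveat: −2.8645 is the standard ADF critical value. Because the residuals come from an estimated regression, Engle–Granger tests normally use more negative critical values (about −3.34 at 5% and −3.05 at 10% for two variables with a constant), under which −3.17 rejects at the 10% level but not at 5%.

**Conclusion**  
On the standard ADF critical value, BPBS_3MO and ACMY10 are **cointegrated at the 5% level**, implying a stable long-run relationship; with Engle–Granger critical values the evidence holds at roughly the 10% level.  
This supports using an **Error Correction Model (ECM)** for short-run dynamics while preserving the long-run equilibrium.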


## ECM Results — ACMY10 vs BPBS_3MO (with controls)

**Long-run (levels):**  
$ACMY10_t = 399.46 + 5.8982 \cdot BPBS\_3MO_t + u_t$  
- Slope highly significant (t ≈ 22.7) → stable long-run link confirmed by EG test.

**Short-run ECM:**
$\Delta ACMY10_t = \alpha + \gamma\,\Delta BPBS\_3MO_t + \lambda\,ECT_{t-1} + \delta'\text{Controls}_t + \varepsilon_t$

- **Error-correction term**: $\lambda = -0.0072$ (p = 0.028)  
  → About **0.72%** of the disequilibrium closes per period → **half-life ≈ 96 periods** (very slow mean reversion).
- **Short-run basis effect**: $\gamma = 0.1699$ (p = 0.373)  
  → **Not significant** once controls are included.
- **Controls**  
  - $\Delta 10\_2$: **66.83** (p < 0.001) — strongly significant  
  - $\Delta 2\_1MO$: **46.54** (p < 0.001) — strongly significant  
  - MOVE: 0.0259 (p = 0.097) — marginal  
  - $\Delta$IORB\_SOFR: 0.0859 (p = 0.366) — ns  
  - $\Delta$GCF\_survey: 0.0479 (p = 0.608) — ns  
  - **EOM**: −0.983 (p = 0.065) — borderline negative month-end effect  
  - **EOQ**: 0.974 (p = 0.320) — ns
- **Fit**: $R^2 = 0.450$ (Adj. $R^2 = 0.445$)
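
For readers who want to see the two-step procedure end to end, here is a minimal sketch on simulated data. It is not the notebook's `run_ecm`; the data-generating process (long-run slope 3, equilibrium-error persistence 0.5) and all variable names are assumptions chosen so the true values are known.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(42)
n = 2000
x = np.cumsum(rng.standard_normal(n))      # I(1) driver
e = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + e[t]           # stationary equilibrium error
y = 2.0 + 3.0 * x + u                      # cointegrated with x, long-run slope 3

# Step 1: long-run (levels) regression; its residual is the error-correction term
X_lr = np.column_stack([np.ones(n), x])
alpha_hat, theta_hat = ols(X_lr, y)
ect = y - (alpha_hat + theta_hat * x)

# Step 2: regress dy_t on dx_t and the lagged residual ECT_{t-1}
dy, dx = np.diff(y), np.diff(x)
X_ecm = np.column_stack([np.ones(n - 1), dx, ect[:-1]])
const_hat, gamma_hat, lam_hat = ols(X_ecm, dy)

print(f"long-run slope theta   = {theta_hat:.3f}")
print(f"short-run effect gamma = {gamma_hat:.3f}")
print(f"adjustment speed lambda = {lam_hat:.3f}")
```

In this simulated process the true adjustment speed is −0.5, far faster than the −0.0072 estimated above; the sketch only illustrates the mechanics of the two steps.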

### Interpretation
- There is a **validated long-run relationship** (levels slope ≈ 5.90), but the **short-run impact of basis changes is not significant** after accounting for domestic slopes and other controls.
- Adjustment back to the long-run equilibrium is **statistically significant but very slow**: a half-life of ≈ 96 trading days is roughly 4.5 months at about 21 trading days per month.
- **Yield-curve slope changes dominate** short-run movements in $\Delta ACMY10$, while month-end effects remain meaningful.

### Next steps (optional)
- Try alternative lag blocks or local-projection IRFs for the basis shock.
- Use **BPBS\_3MO as an instrument** for domestic constraints (test first-stage F for relevance).
- Check sub-samples / structural breaks (e.g., 2020, QT periods) to see if $\gamma$ strengthens in stress regimes.


### Testing for the Domestic Channel
The next step is to repeat the above technique to test the significance of the domestic channel. The variables to test are the spread between the Interest on Reserve Balances rate and the Secured Overnight Financing Rate (IORB_SOFR), and the spread between the General Collateral Finance repo rate and the repo survey rate (GCF_survey).

The static regressions point to the significance of both proxies, though **GCF_survey** exhibits greater statistical significance. The R^2 values are comparable, at around 70 percent. When both enter the same regression, however, **GCF_survey** is the only significant one and absorbs the explanatory power of **IORB_SOFR**.

**Domestic_PC1** captures about 75 percent of the variance of the two variables. When regressed together with all the controls, **Domestic_PC1** is statistically significant. Let us dig deeper by looking at dynamic regressions.

Even with the lagged dependent variable **Y_L1**, **Domestic_PC1_L0** remains statistically significant.

# ECM summary

## Setup
- **Goal:** Link 10Y yield changes to a funding-constraint proxy with a long-run equilibrium.
- **Data (daily, bp):**
  - $Y_t$: `ACMY10` (10Y, bp)
  - $X_t$: `GCF_survey` (bp)
  - Controls (in Δ or level per stationarity): `MOVE`, `10_2`, `2_1MO`, `abs_JYBS3M`, `EUBS_3MO`, `BPBS_3MO`, `eom`, `eoq`.

## Cointegration (Step A)
Levels regression: $Y_t = \alpha + \beta X_t \;(+\text{trend}) + \varepsilon_t$

- **Long-run slope (bp/bp):** **β ≈ 14.76**
- **Residual stationarity (EG/ADF):** residual is I(0) ⇒ **cointegration holds**.
- **Error-correction term:** $\mathrm{EC}_{t-1} = \hat{\varepsilon}_{t-1}$

## ECM (Step B)
$$
\Delta Y_t = b_0\,\Delta X_t + b_1\,\Delta X_{t-1} + \phi\,\mathrm{EC}_{t-1} + \Gamma' Z_t + u_t
$$

**Key estimates (HAC SEs):**
- **Speed of adjustment:** $\phi \approx \mathbf{-0.0155}$ (significant)  
  Half-life $\approx$ **44–45 trading days** (the data are daily)
- **Short-run X effects:**  
  $b_0$ ≈ small / n.s., $b_1 \approx \mathbf{-0.175}$ (significant)
- **Controls:** `2_1MO` and `BPBS_3MO` significant; `eom` negative & significant; `eoq` borderline
- **Fit:** $R^2 \approx 0.04$ (typical for ΔY)

**Impacts:**
- **Long-run effect (bp/bp):** **14.76** (this is **β** from Step A)
- **Short-run impact (sum $b_0+b_1$):** **≈ −0.153 bp per 1 bp** (≈ **−1% of long-run**)

## Dynamics intuition
- A **+1 bp permanent** increase in `GCF_survey` raises 10Y by **~14.8 bp** in the long run.
- Adjustment is **slow**: with $\phi \approx -0.0155$, cumulative ΔY after 240 trading days ≈ **14.40 bp**, converging to β as $(1+\phi)^H$ decays.
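
The convergence path can be approximated with the closed form $\beta\,(1 - (1+\phi)^H)$. The sketch below is a simplification that ignores the short-run terms $b_0$ and $b_1$, so it will not match the reported cumulative figure exactly; `cumulative_response` is a hypothetical helper name.

```python
def cumulative_response(beta: float, phi: float, horizon: int) -> float:
    """Approximate cumulative change in Y after `horizon` periods following a
    permanent +1 unit change in X, ignoring short-run dynamics."""
    return beta * (1.0 - (1.0 + phi) ** horizon)

beta, phi = 14.76, -0.0155   # long-run slope and adjustment speed reported above
for h in (1, 45, 120, 240):
    print(f"H = {h:3d} periods -> cumulative response = {cumulative_response(beta, phi, h):6.2f} bp")
```

At roughly the half-life (about 45 periods) the response is close to half of β, which is a quick consistency check on the reported numbers.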

## Diagnostics / cautions
- Do **not** include `Domestic_PC1` together with its components (`IORB_SOFR`, `GCF_survey`) → multicollinearity.
- Keep **units consistent** (bp ↔ bp).
- If both series drift, include a **trend** in Step A (and build EC from that same LR spec), or use a **Johansen** VECM if other I(1) terms belong in the long-run vector.

## Next robustness checks
- Add **trend** to Step-A LR and re-estimate ECM; compare β and φ.
- **Johansen** with `[ACMY10, GCF_survey, Domestic_PC1]` to allow a richer long-run relation.
- Subsample stability (e.g., pre/post regulatory changes).
- If cointegration fails in variants, switch to a **short-run Δ-factor** proxy (PC of `[ΔIORB_SOFR, ΔGCF_survey]`) and avoid long-run claims.


### BPBS_3MO Significance — Summary

- **Main ECM regression**: `BPBS_3MO` was statistically significant.  
  → This coefficient reflects both the *direct* effect of BPBS_3MO on ΔY and any effect operating through correlation with other regressors.

- **Partial regression test**: Isolated the component of BPBS_3MO orthogonal to other controls (residual-on-residual regression).  
  - Without HAC: borderline significance (t ≈ 1.96, p ≈ 0.051).  
  - With HAC(5): no longer significant (t ≈ 1.03, p ≈ 0.30).

- **Interpretation**:
  1. The main ECM significance may be partly due to correlation with other explanatory variables.
  2. Once we isolate BPBS_3MO’s *unique* variation and adjust for heteroskedasticity/autocorrelation, the statistical evidence weakens.
  3. This suggests the BPBS_3MO–ΔY relationship is **fragile** and sensitive to specification and robust standard errors.
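
The residual-on-residual regression rests on the Frisch–Waugh–Lovell result: partialling the controls out of both sides reproduces the coefficient from the full regression, so what changes between the two is the inference, not the point estimate. The sketch below checks this on simulated data; every variable is illustrative and stands in for the notebook's series.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(7)
n = 400
controls = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # constant + two controls
x = 0.8 * controls[:, 1] + rng.standard_normal(n)                      # regressor of interest, correlated with a control
y = 0.5 * x + controls @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

# Full regression: coefficient on x with all controls included
beta_full = ols(np.column_stack([x, controls]), y)[0]

# Partial regression: residualize x and y on the controls, then regress residual on residual
x_resid = x - controls @ ols(controls, x)
y_resid = y - controls @ ols(controls, y)
beta_partial = (x_resid @ y_resid) / (x_resid @ x_resid)

print(f"full regression coefficient:    {beta_full:.6f}")
print(f"partial regression coefficient: {beta_partial:.6f}")
```

Because the two coefficients coincide, a drop in significance after switching to HAC errors points to serial correlation or heteroskedasticity in the residuals as the source of fragility.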


### Data loading & cleaning

In [151]:
from functools import reduce

# Data
df_y = pd.read_csv('../data/cleaned/term_premia_1961_present.csv',parse_dates = ['DATE'])

df_GCF = pd.read_csv('../data/instruments/GCF_survey.csv',parse_dates = ['Date']) #bps difference
df_IORB = pd.read_csv("../data/instruments/IORB_SOFR.csv", parse_dates = ['Date']) #bps difference


df_treasury2y = pd.read_csv("../data/DGS2.csv", parse_dates = ["observation_date"])
df_treasury10y = pd.read_csv("../data/DGS10.csv", parse_dates = ["observation_date"])
df_treasury1mo = pd.read_csv("../data/DGS1MO.csv", parse_dates = ["observation_date"])
df_us3mo = pd.read_csv("../data/cleaned/US_SWAP_OIS_3M_2001_present.csv", parse_dates = ["Date"])

df_jpybs = pd.read_csv("../data/cleaned/JYBS2021_present.csv", parse_dates = ["Date"])
df_eubs = pd.read_csv("../data/basis/EURUSD_BS_2021_present.csv", parse_dates = ['Date'])
df_gbpbs = pd.read_csv("../data/basis/GBPUSD_BS_2021_present.csv", parse_dates = ['Date'])

df_vix = pd.read_csv("../data/cleaned/VIX_2015_present.csv", parse_dates = ['Date'])
df_move = pd.read_csv("../data/cleaned/MOVE_2011_present.csv")
df_move['Date'] = pd.to_datetime(df_move['Date'], format='%d/%m/%Y')

In [152]:
# Clean data and combine into one main dataframe
df_y = df_y.rename(columns={"DATE":"Date"})

df_GCF = df_GCF.rename(columns={"diff":"GCF_survey"})
df_IORB = df_IORB.rename(columns={"diff":"IORB_SOFR"})

df_treasury2y = df_treasury2y.rename(columns={"observation_date":"Date"})
df_treasury10y = df_treasury10y.rename(columns={"observation_date":"Date"})
df_treasury1mo = df_treasury1mo.rename(columns={"observation_date":"Date"})

dfs = [df_y, df_GCF, df_IORB, df_treasury2y,df_treasury10y,df_treasury1mo, df_jpybs, df_eubs, df_gbpbs, df_vix, df_move]
main_df = reduce(lambda left, right: pd.merge(left, right, on='Date', how='inner'), dfs)

main_df.head()

Unnamed: 0,Date,ACMY01,ACMY02,ACMY03,ACMY05,ACMY10,GCF,Survey,GCF_survey,SOFR,...,EUBS_3MO,EUBS_1,EUBS_2,EUBS_6MO,BPBS_2,BPBS_3MO,BPBS_1,BPBS_6MO,VIX,MOVE
0,2021-07-29,7.454911,19.565586,37.90542,73.351583,133.628751,0.05,0.05,0.0,0.05,...,-9.697,-13.409,-13.9165,-15.9515,-5.355,-7.841,-6.31,-10.572,17.7,62.08
1,2021-07-30,7.510232,18.308177,35.780793,70.32944,130.044208,0.047,0.05,-0.3,0.05,...,-9.221,-12.893,-13.525,-15.636,-4.97,-7.71,-6.0526,-10.405,18.24,61.19
2,2021-08-02,7.045642,17.274342,33.581155,66.451106,124.938277,0.073,0.05,2.3,0.05,...,-9.0739,-12.599,-12.9274,-15.0036,-4.9287,-7.74,-6.035,-10.467,19.46,64.29
3,2021-08-03,7.364041,16.901834,32.869399,65.454984,123.775457,0.068,0.05,1.8,0.05,...,-8.6216,-12.291,-12.3437,-14.7808,-4.781,-7.5986,-5.859,-10.242,18.04,65.42
4,2021-08-04,6.770883,17.975155,34.858284,67.470352,123.228257,0.064,0.05,1.4,0.05,...,-7.48,-12.7151,-12.5668,-13.718,-4.5186,-6.869,-5.478,-9.695,17.97,62.67


### VARIABLES CONSTRUCTION

### Lagged term premia (dependent variable)

In [155]:
# Lagged dependent variable
main_df['Y_L1'] = main_df['ACMY10'].shift(1)

### Construction of seasonality dummies

In [157]:
# --- End-of-Quarter (EOQ) Dummy ---
quarter_ends = main_df.loc[main_df['Date'].dt.is_quarter_end, 'Date']

# Build window
eoq_dates = set()
for qd in quarter_ends:
    for offset in range(-2, 3):  # 2 days before to 2 days after
        day = qd + pd.tseries.offsets.BDay(offset)
        eoq_dates.add(day)

main_df['eoq'] = main_df['Date'].isin(eoq_dates).astype(int)

In [158]:
# --- End-of-Month (EOM) Dummy ---
main_df = main_df.copy()
main_df = main_df.sort_values("Date").reset_index(drop=True)

# Flag calendar end-of-month dates
main_df["eom"] = main_df["Date"].dt.is_month_end.astype(int)

# Optional: widen window ±2 trading days
main_df["eom"] = main_df["eom"].rolling(5, center=True, min_periods=1).max()
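
One thing to check in the dummy construction: `dt.is_month_end` (and `dt.is_quarter_end`) is true only when the calendar month-end itself is a row in the data, so months that end on a weekend or holiday are not flagged. The sketch below shows a possible alternative that flags the last trading day observed in each month. It is an assumption about intent, not the notebook's method, and the column name `eom_last_trading_day` is hypothetical.

```python
import pandas as pd

# Business days around end-July 2021; 2021-07-31 is a Saturday
df = pd.DataFrame({"Date": pd.bdate_range("2021-07-26", "2021-08-06")})

# Calendar-based flag: nothing is flagged because July 31 is not a trading day
calendar_flag = df["Date"].dt.is_month_end.astype(int)

# Data-based flag: a row is month-end if the next row belongs to a different month
# (assumes df is sorted by Date)
ym = df["Date"].dt.year * 12 + df["Date"].dt.month
eom = ym.ne(ym.shift(-1))
eom.iloc[-1] = False   # the final row may sit mid-month, so leave it unflagged
df["eom_last_trading_day"] = eom.astype(int)

print("calendar-based flags:", int(calendar_flag.sum()))
print(df.loc[df["eom_last_trading_day"] == 1, "Date"].dt.date.tolist())
```

The same next-row comparison works for quarter-ends by replacing the month counter with a quarter counter.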

### Term structure slopes

In [160]:
main_df['10_2'] = main_df['DGS10']-main_df['DGS2']
main_df['2_1MO'] = main_df['DGS2']-main_df['DGS1MO']

### JPY Basis
Since JPY cross currency basis is constantly negative, I take absolute value for clearer interpretation.

In [162]:
main_df['abs_JYBS3M'] = abs(main_df['JYBS3M'])

### Lag construction

In [164]:
# Lagged dependent variable
main_df['Y_L1'] = main_df['ACMY10'].shift(1)

# GCF_survey
proxy_var = 'GCF_survey'  
main_df[f'{proxy_var}_L0'] = main_df[proxy_var]
main_df[f'{proxy_var}_L1'] = main_df[proxy_var].shift(1)

In [165]:
'''

main_df = main_df.sort_values('Date').reset_index(drop=True)
main_df['ACMY10_lag'] = main_df['ACMY10'].shift(1)
main_df['ACMY10_lag2'] = main_df['ACMY10'].shift(2) 

main_df = main_df.dropna(subset=['ACMY10_lag'])
main_df = main_df.dropna(subset=['ACMY10_lag2'])

# Create lags 0–5 for GCF_survey
for lag in range(6):
   main_df[f'GCF_survey_L{lag}'] = main_df['GCF_survey'].shift(lag)

# Drop rows with NaNs introduced by shifting
main_df = main_df.dropna().reset_index(drop=True)
'''

### PCA construction for Basis with lags

In [167]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# First we test for multicollinearity among the basis variables.
# VIF is never below 1; as a rule of thumb, values above about 5-10 signal multicollinearity.

X = main_df[['abs_JYBS3M','EUBS_3MO','BPBS_3MO']].dropna()
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

     Variable        VIF
0  abs_JYBS3M   4.361872
1    EUBS_3MO  15.497536
2    BPBS_3MO   8.378342


Potential multicollinearity (VIF of about 15 for EUBS_3MO and 8 for BPBS_3MO) justifies PCA.
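
A caveat on the VIF numbers: `variance_inflation_factor` uses the design matrix exactly as passed and does not add an intercept, so VIFs computed on raw levels can be inflated by non-zero means. The sketch below computes the centered version by hand, with each auxiliary regression including a constant. The function `vif` and the simulated columns are illustrative assumptions.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Centered VIF_j = 1 / (1 - R_j^2), regressing column j on a constant and the other columns."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
a = rng.standard_normal(500)
b = a + 0.2 * rng.standard_normal(500)   # nearly a copy of a
c = rng.standard_normal(500)             # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

Passing `sm.add_constant(X)` to the statsmodels function and reading the VIFs of the non-constant columns should give the same centered measure.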

In [169]:
from sklearn.decomposition import PCA

# PCA for Basis
X = main_df[['abs_JYBS3M','EUBS_3MO','BPBS_3MO']].dropna()

# Standardize (important for PCA)
X_std = (X - X.mean()) / X.std()

pca = PCA(n_components=1)
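# Assign via a Series indexed like X so scores stay aligned if dropna() removed rows
main_df['Basis_PC1'] = pd.Series(pca.fit_transform(X_std)[:, 0], index=X.index)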

print("Explained variance by PC1:", pca.explained_variance_ratio_[0])
print("PC1 loadings:", pca.components_[0])

Explained variance by PC1: 0.820457121105438
PC1 loadings: [-0.52977133  0.62269261  0.57584395]
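
The sign of a principal component is arbitrary, which matters when interpreting regression coefficients on `Basis_PC1`: flipping the factor flips the coefficient. The sketch below reproduces PC1 with a plain SVD on simulated data and fixes the sign by convention. The simulated series and the choice of anchoring on the second column are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
common = rng.standard_normal(300)
X = np.column_stack([common + 0.3 * rng.standard_normal(300) for _ in range(3)])
X[:, 0] *= -1.0   # one series moves opposite to the others, as with a sign-flipped basis

# Standardize, then take the first right singular vector as the PC1 loadings
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)
loadings = Vt[0]
pc1 = Z @ loadings

# Sign convention: make the loading on the second column positive
if loadings[1] < 0:
    loadings, pc1 = -loadings, -pc1

print("explained variance by PC1:", round(float(explained[0]), 3))
print("PC1 loadings:", loadings.round(3))
```

Fixing the sign against a reference series keeps coefficient signs comparable across re-runs and library versions.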


In [170]:
# Proxy for 'Basis'
proxy_var = 'Basis_PC1'  

# Lagged proxy variables
main_df[f'{proxy_var}_L0'] = main_df[proxy_var]
main_df[f'{proxy_var}_L1'] = main_df[proxy_var].shift(1)

### PCA construction for Domestic variables with lags

In [172]:
# First we test for multicollinearity among the domestic variables.
# VIF is never below 1; as a rule of thumb, values above about 5-10 signal multicollinearity.

X = main_df[['IORB_SOFR','GCF_survey']].dropna()
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

     Variable       VIF
0   IORB_SOFR  1.518312
1  GCF_survey  1.518312


No multicollinearity detected (VIF ≈ 1.5). For comparability with the foreign channel, we construct a PCA factor anyway.

In [174]:
# PCA for Domestic channel
X = main_df[['IORB_SOFR','GCF_survey']].dropna()

# Standardize (important for PCA)
X_std = (X - X.mean()) / X.std()

pca = PCA(n_components=1)
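# Assign via a Series indexed like X so scores stay aligned if dropna() removed rows
main_df['Domestic_PC1'] = pd.Series(pca.fit_transform(X_std)[:, 0], index=X.index)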

print("Explained variance by PC1:", pca.explained_variance_ratio_[0])
print("PC1 loadings:", pca.components_[0])

Explained variance by PC1: 0.7554110833935439
PC1 loadings: [-0.70710678  0.70710678]


In [175]:
# Proxy for 'Domestic channel'
proxy_var = 'Domestic_PC1'  

# Lagged proxy variables
main_df[f'{proxy_var}_L0'] = main_df[proxy_var]
main_df[f'{proxy_var}_L1'] = main_df[proxy_var].shift(1)

### Stationarity tests (ADF) 

In [177]:
# --- Core Python Packages ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import statsmodels.api as sm
from functools import reduce
from statsmodels.tsa.stattools import adfuller

# --- Configuration for Plotting ---
plt.style.use("ggplot")

# --- Project Modules ---
import sys
import os
sys.path.append(os.path.abspath("../data"))

In [178]:
# Variables to test
vars_to_test = [
    "ACMY10", "Basis_PC1", "Domestic_PC1", "BPBS_3MO",
    "10_2", "2_1MO", "IORB_SOFR", "GCF_survey", "abs_JYBS3M", "EUBS_3MO",
    "MOVE", "VIX"
]

# Dummies (kept in levels automatically)
dummy_vars = ["eom", "eoq"]

# Exclude list (anything here is ignored everywhere)
exclude_vars = [] 

# Proxy for the model 
# proxy_var = "Basis_PC1"  # or "BPBS_3MO"

# Confidence level for ADF classification
alpha_adf = 0.05

In [179]:
adf_rows = []
for v in vars_to_test:
    if v in exclude_vars:
        adf_rows.append({"Variable": v, "ADF stat": np.nan, "p-value": np.nan,
                         "Lags": np.nan, "Obs": 0, "Order": "excluded"})
    elif v not in main_df.columns:
        adf_rows.append({"Variable": v, "ADF stat": np.nan, "p-value": np.nan,
                         "Lags": np.nan, "Obs": 0, "Order": "missing"})
    else:
        adf_rows.append(adf_classify(main_df[v], v, alpha=alpha_adf))

adf_table = pd.DataFrame(adf_rows).set_index("Variable").sort_index()
print("=== ADF SUMMARY ===")
display(adf_table.round(4))

=== ADF SUMMARY ===


| Variable | ADF stat | p-value | Lags | Obs | Order |
|---|---|---|---|---|---|
| 10_2 | -1.8212 | 0.37 | 7 | 988 | I(1) |
| 2_1MO | -1.3209 | 0.6195 | 12 | 983 | I(1) |
| ACMY10 | -2.2805 | 0.1783 | 2 | 993 | I(1) |
| BPBS_3MO | -2.4233 | 0.1353 | 10 | 985 | I(1) |
| Basis_PC1 | -3.03 | 0.0322 | 1 | 994 | I(0) |
| Domestic_PC1 | -2.284 | 0.1772 | 18 | 977 | I(1) |
| EUBS_3MO | -2.536 | 0.107 | 4 | 991 | I(1) |
| GCF_survey | -2.0876 | 0.2495 | 19 | 976 | I(1) |
| IORB_SOFR | -2.7304 | 0.0689 | 18 | 977 | I(1) |
| MOVE | -3.0496 | 0.0305 | 12 | 983 | I(0) |


### Variables transformation
Given the results of the ADF test for stationarity, we transform the variables accordingly. That is, for those that are I(0) (stationary), we maintain the level, whereas for those that are I(1) (non-stationary), we create a new variable based on its difference.

In [181]:
# Transformation Plan (levels vs Δ or Δlog)

I0 = [v for v in vars_to_test
      if v in main_df.columns and v not in exclude_vars and adf_table.loc[v, "Order"] == "I(0)"]
I1 = [v for v in vars_to_test
      if v in main_df.columns and v not in exclude_vars and adf_table.loc[v, "Order"] == "I(1)"]

plan = []

# I(0) → level
for v in I0:
    plan.append({"Variable": v, "Transform": "level", "NewCol": v})

# I(1) → Δ (or Δlog for vol indices)
for v in I1:
    if v in ("MOVE", "VIX"):
        newcol = f"dlog_{v}"
        main_df[newcol] = np.log(main_df[v]).replace([-np.inf, np.inf], np.nan).diff()
        plan.append({"Variable": v, "Transform": "Δlog", "NewCol": newcol})
    else:
        newcol = f"d_{v}"
        main_df[newcol] = main_df[v].diff()
        plan.append({"Variable": v, "Transform": "Δ", "NewCol": newcol})

# Dummies (levels)
for d in dummy_vars:
    if d in main_df.columns and d not in exclude_vars:
        plan.append({"Variable": d, "Transform": "dummy(level)", "NewCol": d})

transform_plan = pd.DataFrame(plan)
print("=== TRANSFORMATION PLAN ===")
display(transform_plan)

=== TRANSFORMATION PLAN ===


|  | Variable | Transform | NewCol |
|---|---|---|---|
| 0 | Basis_PC1 | level | Basis_PC1 |
| 1 | abs_JYBS3M | level | abs_JYBS3M |
| 2 | MOVE | level | MOVE |
| 3 | VIX | level | VIX |
| 4 | ACMY10 | Δ | d_ACMY10 |
| 5 | Domestic_PC1 | Δ | d_Domestic_PC1 |
| 6 | BPBS_3MO | Δ | d_BPBS_3MO |
| 7 | 10_2 | Δ | d_10_2 |
| 8 | 2_1MO | Δ | d_2_1MO |
| 9 | IORB_SOFR | Δ | d_IORB_SOFR |


### Static OLS (First Differences) with Controls — ΔACMY10 on Proxy\(_t\), Proxy\(_{t-1}\) *(HAC NW, L=5)*


$$
\Delta Y_t
= \alpha
+ \beta_0\,\text{Proxy}_t
+ \beta_1\,\text{Proxy}_{t-1}
+ \delta_{\text{MOVE}}\;\text{MOVE}_t
+ \delta_{10\!-\!2}\;\Delta(10\!-\!2)_t
+ \delta_{2\!-\!1\text{M}}\;\Delta(2\!-\!1\text{M})_t
+ \delta_{\text{IORB}}\;\Delta(\text{IORB}\!-\!\text{SOFR})_t
+ \delta_{\text{GCF}}\;\Delta(\text{GCF\_survey})_t
+ \eta_{\text{eom}}\;\text{eom}_t
+ \eta_{\text{eoq}}\;\text{eoq}_t
+ \varepsilon_t .
$$

- $Y_t$: ACM 10Y yield (`ACMY10`, daily); $\Delta$ is the first difference.  
- **Proxy**: your factor (e.g., **Basis\_PC1**), which appears in the code as `proxy_L0` and `proxy_L1`.  
- Controls used in the run: `MOVE`, `d_10_2`, `d_2_1MO`, `d_IORB_SOFR`, `d_GCF_survey`, `eom`, `eoq`.  
- Standard errors: **HAC (Newey–West)** with **maxlags = 5**.

**Cumulative short-run effect of Proxy:** $\beta_0 + \beta_1$.

> If you use the differenced proxy instead, replace $\text{Proxy}_t$ and $\text{Proxy}_{t-1}$ by $\Delta\text{Proxy}_t$ and $\Delta\text{Proxy}_{t-1}$ (i.e., `d_proxy_L0`, `d_proxy_L1`).
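
For intuition on what the HAC option does, here is a minimal NumPy sketch of Newey–West standard errors with a Bartlett kernel. It omits any small-sample correction, so its numbers may differ slightly from library output; `newey_west_se` and the simulated regression are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def newey_west_se(X, resid, maxlags):
    """Newey-West (Bartlett kernel) standard errors for OLS coefficients."""
    Xe = X * resid[:, None]                  # score contributions x_t * e_t
    S = Xe.T @ Xe                            # lag-0 term (White / HC0)
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)      # Bartlett weight declines linearly in the lag
        G = Xe[lag:].T @ Xe[:-lag]           # cross-product of scores `lag` periods apart
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(XtX_inv @ S @ XtX_inv))

# Simulated regression with AR(1) errors, so autocorrelation matters
rng = np.random.default_rng(11)
n = 600
x = rng.standard_normal(n)
shocks = rng.standard_normal(n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + shocks[t]
y = 1.0 + 0.5 * x + e

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

se_white = newey_west_se(X, resid, maxlags=0)   # no autocorrelation adjustment
se_nw5 = newey_west_se(X, resid, maxlags=5)     # same lag length as in the notebook
print("SE, maxlags=0:", se_white.round(4))
print("SE, maxlags=5:", se_nw5.round(4))
```

With `maxlags=0` the estimator reduces to heteroskedasticity-robust (White) errors; adding lags widens the intercept's standard error here because the simulated errors are positively autocorrelated.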


**Model specification:**  
$\Delta Y_t = \alpha + \beta_0 \,\text{Basis\_PC1}_t + \beta_1 \,\text{Basis\_PC1}_{t-1} + \gamma_1' \mathbf{X}_{1t} + \gamma_2' \Delta\mathbf{X}_{2t} + u_t$  
where $\mathbf{X}_{1t}$ collects the level controls (MOVE and the end-of-month (eom) and end-of-quarter (eoq) dummies) and $\mathbf{X}_{2t}$ collects the controls that enter in first differences (10–2 slope, 2–1MO slope, IORB–SOFR spread, GCF survey rate).

In [184]:
proxy_var = 'Basis_PC1' 

controls_whitelist = ['MOVE','10_2','2_1MO','IORB_SOFR','GCF_survey','eom','eoq']

df_reg, X_cols = build_reg_data_for_dY(proxy_var=proxy_var,
                                       include_vars=controls_whitelist, main_df=main_df, transform_plan = transform_plan)

print("=== REGRESSORS USED ===")
print(X_cols)

# RUN OLS WITH HAC SEs

y = df_reg["dY"]
X = sm.add_constant(df_reg[X_cols])
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.summary())

# Helpful post-estimate: cumulative short-run effect of the proxy.
# The proxy lags are labelled proxy_L0/proxy_L1 (levels) or d_proxy_L0/d_proxy_L1
# (differences), so sum whichever lag columns are actually in the model
# instead of guessing the prefix (a wrong guess would silently return 0).
proxy_cols = [c for c in X_cols if c.endswith(("proxy_L0", "proxy_L1"))]
cum = res.params[proxy_cols].sum()
print(f"\nCumulative short-run effect (proxy = {proxy_var}): {cum:.4f}")

=== REGRESSORS USED ===
['proxy_L0', 'proxy_L1', 'MOVE', 'd_10_2', 'd_2_1MO', 'd_IORB_SOFR', 'd_GCF_survey', 'eom', 'eoq']
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.449
Model:                            OLS   Adj. R-squared:                  0.444
Method:                 Least Squares   F-statistic:                     17.68
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           2.23e-27
Time:                        16:27:06   Log-Likelihood:                -3058.0
No. Observations:                 995   AIC:                             6136.
Df Residuals:                     985   BIC:                             6185.
Df Model:                           9                                         
Covariance Type:                  HAC                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
------

**Model specification:**  
$
\Delta Y_t
= \alpha
+ \beta_0\,\Delta\text{BPBS\_3MO}_t
+ \beta_1\,\Delta\text{BPBS\_3MO}_{t-1}
+ \boldsymbol{\gamma}_1' \mathbf{X}_{1t}
+ \boldsymbol{\gamma}_2' \,\Delta\mathbf{X}_{2t}
+ u_t .
$

where $\mathbf{X}_{1t}$ collects the level controls (MOVE and the end-of-month (eom) and end-of-quarter (eoq) dummies) and $\mathbf{X}_{2t}$ collects the controls that enter in first differences (10–2 slope, 2–1MO slope, IORB–SOFR spread, GCF survey rate).

In [186]:
proxy_var = 'BPBS_3MO' 

controls_whitelist = ['MOVE','10_2','2_1MO','IORB_SOFR','GCF_survey','eom','eoq']

df_reg, X_cols = build_reg_data_for_dY(proxy_var=proxy_var,
                                       include_vars=controls_whitelist, main_df=main_df, transform_plan = transform_plan)

print("=== REGRESSORS USED ===")
print(X_cols)

# RUN OLS WITH HAC SEs

y = df_reg["dY"]
X = sm.add_constant(df_reg[X_cols])
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.summary())

# Helpful post-estimate: cumulative short-run effect of the proxy.
# The proxy lags are labelled proxy_L0/proxy_L1 (levels) or d_proxy_L0/d_proxy_L1
# (differences), so sum whichever lag columns are actually in the model
# instead of guessing the prefix (a wrong guess would silently return 0).
proxy_cols = [c for c in X_cols if c.endswith(("proxy_L0", "proxy_L1"))]
cum = res.params[proxy_cols].sum()
print(f"\nCumulative short-run effect (proxy = {proxy_var}): {cum:.4f}")

=== REGRESSORS USED ===
['d_proxy_L0', 'd_proxy_L1', 'MOVE', 'd_10_2', 'd_2_1MO', 'd_IORB_SOFR', 'd_GCF_survey', 'eom', 'eoq']
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.446
Model:                            OLS   Adj. R-squared:                  0.441
Method:                 Least Squares   F-statistic:                     15.66
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           4.17e-24
Time:                        16:27:06   Log-Likelihood:                -3057.7
No. Observations:                 994   AIC:                             6135.
Df Residuals:                     984   BIC:                             6184.
Df Model:                           9                                         
Covariance Type:                  HAC                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--

**Model specification:**  
$
\Delta Y_t
= \alpha
+ \beta_0\,\Delta\text{GCF\_survey}_t
+ \beta_1\,\Delta\text{GCF\_survey}_{t-1}
+ \boldsymbol{\gamma}_1' \mathbf{X}_{1t}
+ \boldsymbol{\gamma}_2' \,\Delta\mathbf{X}_{2t}
+ u_t .
$

where $\mathbf{X}_{1t}$ collects the level controls (MOVE, absolute JPY basis (abs_JYBS3M), and the end-of-month (eom) and end-of-quarter (eoq) dummies) and $\mathbf{X}_{2t}$ collects the controls that enter in first differences (10–2 slope, 2–1MO slope, IORB–SOFR spread, EUR basis (EUBS_3MO), and GBP basis (BPBS_3MO)).

In [188]:
proxy_var = 'GCF_survey' 

controls_whitelist = ['MOVE','10_2','2_1MO','IORB_SOFR','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq']

df_reg, X_cols = build_reg_data_for_dY(proxy_var=proxy_var,
                                       include_vars=controls_whitelist, main_df=main_df, transform_plan = transform_plan)

print("=== REGRESSORS USED ===")
print(X_cols)

# RUN OLS WITH HAC SEs

y = df_reg["dY"]
X = sm.add_constant(df_reg[X_cols])
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.summary())

# Helpful post-estimate: cumulative short-run effect of the proxy.
# The proxy lags are labelled proxy_L0/proxy_L1 (levels) or d_proxy_L0/d_proxy_L1
# (differences), so sum whichever lag columns are actually in the model
# instead of guessing the prefix (a wrong guess would silently return 0).
proxy_cols = [c for c in X_cols if c.endswith(("proxy_L0", "proxy_L1"))]
cum = res.params[proxy_cols].sum()
print(f"\nCumulative short-run effect (proxy = {proxy_var}): {cum:.4f}")

=== REGRESSORS USED ===
['d_proxy_L0', 'd_proxy_L1', 'abs_JYBS3M', 'MOVE', 'd_BPBS_3MO', 'd_10_2', 'd_2_1MO', 'd_IORB_SOFR', 'd_EUBS_3MO', 'eom', 'eoq']
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.447
Model:                            OLS   Adj. R-squared:                  0.441
Method:                 Least Squares   F-statistic:                     15.68
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           1.58e-28
Time:                        16:27:06   Log-Likelihood:                -3057.0
No. Observations:                 994   AIC:                             6138.
Df Residuals:                     982   BIC:                             6197.
Df Model:                          11                                         
Covariance Type:                  HAC                                         
                  coef    std err          z      P>|z|  

**Model specification:**  
$
\Delta Y_t
= \alpha
+ \beta_0\,\Delta\text{Domestic\_PC1}_t
+ \beta_1\,\Delta\text{Domestic\_PC1}_{t-1}
+ \boldsymbol{\gamma}_1' \mathbf{X}_{1t}
+ \boldsymbol{\gamma}_2' \,\Delta\mathbf{X}_{2t}
+ u_t .
$

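where $\mathbf{X}_{1t}$ collects the level controls (MOVE, absolute JPY basis (abs_JYBS3M), and the end-of-month (eom) and end-of-quarter (eoq) dummies) and $\mathbf{X}_{2t}$ collects the controls that enter in first differences (10–2 slope, 2–1MO slope, EUR basis (EUBS_3MO), and GBP basis (BPBS_3MO)). The IORB–SOFR spread is not included as a separate control in this run.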

In [190]:
proxy_var = 'd_Domestic_PC1' 

controls_whitelist = ['MOVE','10_2','2_1MO','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq']

df_reg, X_cols = build_reg_data_for_dY(proxy_var=proxy_var,
                                       include_vars=controls_whitelist, main_df=main_df, transform_plan = transform_plan)

print("=== REGRESSORS USED ===")
print(X_cols)

# RUN OLS WITH HAC SEs

y = df_reg["dY"]
X = sm.add_constant(df_reg[X_cols])
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.summary())

# Helpful post-estimate: cumulative short-run effect of the proxy.
# The proxy lag labels vary by run (proxy_L0, d_proxy_L0, dproxy_L0, ...),
# so sum whichever lag columns are actually in the model instead of
# guessing the prefix (a wrong guess would silently return 0).
proxy_cols = [c for c in X_cols if c.endswith(("proxy_L0", "proxy_L1"))]
cum = res.params[proxy_cols].sum()
print(f"\nCumulative short-run effect (proxy = {proxy_var}): {cum:.4f}")

=== REGRESSORS USED ===
['dproxy_L0', 'dproxy_L1', 'abs_JYBS3M', 'MOVE', 'd_BPBS_3MO', 'd_10_2', 'd_2_1MO', 'd_EUBS_3MO', 'eom', 'eoq']
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.446
Model:                            OLS   Adj. R-squared:                  0.441
Method:                 Least Squares   F-statistic:                     17.01
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           1.05e-28
Time:                        16:27:06   Log-Likelihood:                -3057.3
No. Observations:                 994   AIC:                             6137.
Df Residuals:                     983   BIC:                             6191.
Df Model:                          10                                         
Covariance Type:                  HAC                                         
                 coef    std err          z      P>|z|      [0.025      0.

Next, we check whether `ACMY10` is cointegrated with `Domestic_PC1`, `BPBS_3MO`, or `GCF_survey`.

### Cointegration/Error correction model (ECM)
First, we keep only pairs of the dependent variable (`ACMY10`) and an independent variable (`Domestic_PC1`, `BPBS_3MO`, or `GCF_survey`) in which both series are non-stationary, i.e. I(1). We then run the Engle–Granger test for cointegration, whose null hypothesis is no cointegration. If we reject the null, we proceed with the error correction model (ECM).

#### Domestic_PC1

In [194]:
eg_result = engle_granger_test("ACMY10", "Domestic_PC1", main_df)

=== Step 1: Levels Regression ===
                            OLS Regression Results                            
Dep. Variable:                 ACMY10   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.279
Method:                 Least Squares   F-statistic:                     386.0
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           7.35e-73
Time:                        16:27:06   Log-Likelihood:                -5826.1
No. Observations:                 996   AIC:                         1.166e+04
Df Residuals:                     994   BIC:                         1.167e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         

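Although the Engle–Granger test does not reject the null of no cointegration at the 5 percent level, the p-value is low enough to justify estimating an ECM. The results should nonetheless be interpreted with caution, because the long-run relationship is only weakly supported.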

In [196]:
# --- Run ECM ---
ecm_out = run_ecm(
    y_var="ACMY10",
    x_var="Domestic_PC1",
    controls_whitelist=['10_2','2_1MO','abs_JYBS3M','EUBS_3MO','BPBS_3MO','MOVE','eom','eoq'],
    cov_type="HAC", hac_maxlags=5, main_df = main_df, transform_plan = transform_plan
)


=== Long-run (levels) ===
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          358.9930      2.664    134.774      0.000     353.766     364.220
Domestic_PC1    42.5953      2.168     19.646      0.000      38.341      46.850

=== Error-Correction Model (short run) ===
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.450
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     18.42
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           3.47e-31
Time:                        16:27:06   Log-Likelihood:                -3056.5
No. Observations:                 995   AIC:                             6135.
Df Residuals:                     984   BIC:         

#### BPBS_3MO

In [198]:
eg_result = engle_granger_test("ACMY10", "BPBS_3MO", main_df)

=== Step 1: Levels Regression ===
                            OLS Regression Results                            
Dep. Variable:                 ACMY10   R-squared:                       0.342
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     517.4
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           1.55e-92
Time:                        16:27:06   Log-Likelihood:                -5780.8
No. Observations:                 996   AIC:                         1.157e+04
Df Residuals:                     994   BIC:                         1.158e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        399.4

In [199]:
# --- Run it ---
ecm_out = run_ecm(
    y_var="ACMY10",
    x_var="BPBS_3MO",
    controls_whitelist=['10_2','2_1MO','MOVE','eom','eoq','IORB_SOFR','GCF_survey'],
    cov_type="HAC", hac_maxlags=5, main_df = main_df, transform_plan = transform_plan
)


=== Long-run (levels) ===
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        399.4570      3.105    128.636      0.000     393.363     405.551
BPBS_3MO       5.8982      0.259     22.745      0.000       5.389       6.407

=== Error-Correction Model (short run) ===
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.450
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     16.15
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           6.71e-25
Time:                        16:27:06   Log-Likelihood:                -3056.7
No. Observations:                 995   AIC:                             6133.
Df Residuals:                     985   BIC:                 

#### GCF_survey

In [201]:
eg_result = engle_granger_test("ACMY10", "GCF_survey", main_df)

=== Step 1: Levels Regression ===
                            OLS Regression Results                            
Dep. Variable:                 ACMY10   R-squared:                       0.350
Model:                            OLS   Adj. R-squared:                  0.349
Method:                 Least Squares   F-statistic:                     535.1
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           4.70e-95
Time:                        16:27:07   Log-Likelihood:                -5775.0
No. Observations:                 996   AIC:                         1.155e+04
Df Residuals:                     994   BIC:                         1.156e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7

In [202]:
# --- Run it ---
ecm_out = run_ecm(
    y_var="ACMY10",
    x_var="GCF_survey",
    controls_whitelist=['10_2','2_1MO','MOVE','eom','eoq','IORB_SOFR','abs_JYBS3M','EUBS_3MO','BPBS_3MO'],
    cov_type="HAC", hac_maxlags=5, main_df = main_df, transform_plan = transform_plan
)


=== Long-run (levels) ===
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7898      3.756     78.486      0.000     287.419     302.160
GCF_survey    14.7611      0.638     23.131      0.000      13.509      16.013

=== Error-Correction Model (short run) ===
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.451
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     17.67
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           2.42e-32
Time:                        16:27:07   Log-Likelihood:                -3056.0
No. Observations:                 995   AIC:                             6136.
Df Residuals:                     983   BIC:                 

### Robustness check

In [204]:
# --- 1) ECM with one ΔX lag (delayed pass-through) and an optional ΔY lag
ecm_lagged = run_ecm_with_lags(
    main_df=main_df,
    y_var="ACMY10",
    x_var="GCF_survey",
    controls=['10_2','2_1MO','MOVE','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan,
    x_lags=1,   # add ΔGCF_{t-1}
    y_lags=0,   # set to 1 to add ΔY_{t-1}
    cov_type="HAC", hac_maxlags=5
)


                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7898      3.756     78.486      0.000     287.419     302.160
GCF_survey    14.7611      0.638     23.131      0.000      13.509      16.013
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.451
Model:                            OLS   Adj. R-squared:                  0.444
Method:                 Least Squares   F-statistic:                     17.19
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           2.00e-31
Time:                        16:27:07   Log-Likelihood:                -3053.5
No. Observations:                 994   AIC:                             6131.
Df Residuals:                     982   BIC:                             6190.
Df Model:                          11               

In [205]:
# --- 2) Parsimony: keep ONE slope at a time
ecm_10_2_only = run_ecm_with_lags(
    main_df, "ACMY10", "GCF_survey",
    controls=['10_2','MOVE','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan, x_lags=0, y_lags=0
)
ecm_2_1MO_only = run_ecm_with_lags(
    main_df, "ACMY10", "GCF_survey",
    controls=['2_1MO','MOVE','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan, x_lags=0, y_lags=0
)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7898      3.756     78.486      0.000     287.419     302.160
GCF_survey    14.7611      0.638     23.131      0.000      13.509      16.013
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.072
Model:                            OLS   Adj. R-squared:                  0.063
Method:                 Least Squares   F-statistic:                     5.737
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           8.66e-08
Time:                        16:27:07   Log-Likelihood:                -3317.2
No. Observations:                 995   AIC:                             6654.
Df Residuals:                     985   BIC:                             6703.
Df Model:                           9               

In [206]:
# --- 3) Vol control swap / drop
ecm_vix = run_ecm_with_lags(
    main_df, "ACMY10", "GCF_survey",
    controls=['10_2','2_1MO','VIX','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan, x_lags=0, y_lags=0
)
ecm_no_vol = run_ecm_with_lags(
    main_df, "ACMY10", "GCF_survey",
    controls=['10_2','2_1MO','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan, x_lags=0, y_lags=0
)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7898      3.756     78.486      0.000     287.419     302.160
GCF_survey    14.7611      0.638     23.131      0.000      13.509      16.013
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.451
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     21.53
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           1.40e-36
Time:                        16:27:07   Log-Likelihood:                -3056.2
No. Observations:                 995   AIC:                             6134.
Df Residuals:                     984   BIC:                             6188.
Df Model:                          10               

In [207]:
# --- 4) HAC bandwidth sensitivity
ecm_hac10 = run_ecm_with_lags(
    main_df, "ACMY10", "GCF_survey",
    controls=['10_2','2_1MO','MOVE','abs_JYBS3M','EUBS_3MO','BPBS_3MO','eom','eoq'],
    transform_plan=transform_plan, x_lags=0, y_lags=0, hac_maxlags=10
)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        294.7898      3.756     78.486      0.000     287.419     302.160
GCF_survey    14.7611      0.638     23.131      0.000      13.509      16.013
                            OLS Regression Results                            
Dep. Variable:                     dY   R-squared:                       0.450
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     15.82
Date:                Thu, 14 Aug 2025   Prob (F-statistic):           1.33e-26
Time:                        16:27:07   Log-Likelihood:                -3056.6
No. Observations:                 995   AIC:                             6135.
Df Residuals:                     984   BIC:                             6189.
Df Model:                          10               

### Test for EOM significance

In [222]:
import statsmodels.api as sm

# Build ΔY and keep rows with needed columns
df = main_df[['ACMY10','10_2','2_1MO','MOVE','eom','abs_JYBS3M','EUBS_3MO','BPBS_3MO']].copy()
df['dY'] = df['ACMY10'].diff()
df = df.dropna()

y = df['dY']
X_base = sm.add_constant(df[['10_2','2_1MO','MOVE','abs_JYBS3M','EUBS_3MO','BPBS_3MO']], has_constant='add')        # controls only
X_full = sm.add_constant(df[['10_2','2_1MO','MOVE','eom','abs_JYBS3M','EUBS_3MO','BPBS_3MO']], has_constant='add')  # + EOM

# --- 1) Likelihood-Ratio test (requires NON-robust fits) ---
m0 = sm.OLS(y, X_base).fit()   # restricted (no EOM)
m1 = sm.OLS(y, X_full).fit()   # full (with EOM)

lr_stat, lr_p, df_diff = m1.compare_lr_test(m0)  # m1 nests m0
print(f"LR test (add EOM): stat={lr_stat:.3f}, df={int(df_diff)}, p={lr_p:.4f}")
print(f"AIC: base={m0.aic:.1f}, full={m1.aic:.1f} | BIC: base={m0.bic:.1f}, full={m1.bic:.1f}")

# --- 2) Robust significance check for EOM (HAC) ---
m1_hac = sm.OLS(y, X_full).fit(cov_type='HAC', cov_kwds={'maxlags':5})
wald = m1_hac.wald_test('eom = 0', use_f=True)   # robust Wald on EOM
print("\nRobust Wald (HAC) for EOM:")
print(wald)
print(f"EOM coef (HAC) = {m1_hac.params['eom']:.4f}, p = {m1_hac.pvalues['eom']:.4f}")


LR test (add EOM): stat=10.372, df=1, p=0.0013
AIC: base=6708.8, full=6700.4 | BIC: base=6743.1, full=6739.6

Robust Wald (HAC) for EOM:
<F test: F=array([[10.21446089]]), p=0.0014376793864651133, df_denom=987, df_num=1>
EOM coef (HAC) = -1.8953, p = 0.0014
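
The printed numbers can be cross-checked against each other: with one added parameter, the AIC gap should equal the LR statistic minus 2, and the BIC gap should equal the LR statistic minus \(\ln n\). The short check below uses only the values printed above; the sample size is inferred from `df_denom=987` plus the 8 estimated coefficients, which is an assumption about how the degrees of freedom were counted.

```python
import math

# Values copied from the printed output above.
lr_stat = 10.372
aic_base, aic_full = 6708.8, 6700.4
bic_base, bic_full = 6743.1, 6739.6

# Assumption: n = residual df (987) + 8 coefficients in the full model
# (constant, six controls, and the EOM dummy).
n_obs = 987 + 8
k_extra = 1  # EOM is the only added regressor

aic_gap_printed = aic_base - aic_full
bic_gap_printed = bic_base - bic_full
aic_gap_implied = lr_stat - 2 * k_extra
bic_gap_implied = lr_stat - math.log(n_obs) * k_extra

print(f"AIC gap: printed {aic_gap_printed:.2f}, implied by LR {aic_gap_implied:.2f}")
print(f"BIC gap: printed {bic_gap_printed:.2f}, implied by LR {bic_gap_implied:.2f}")
```

Both gaps agree to within the rounding of the printed criteria, and both information criteria favour the specification with EOM.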




# Overall Summary *(updated 2025-08-14)*

**Foreign constraints (basis)** matter for the **equilibrium level** of ACMY (cointegration with `BPBS_3MO`), but they have **limited short-run explanatory power** once domestic curve dynamics are controlled. **Domestic constraint proxies** (IORB–SOFR, GCF_survey) behave similarly: **meaningful long-run link**, **weak short-run pass-through**.

**Short-run ACMY moves** are overwhelmingly explained by **curve slopes** (Δ10–2, Δ2–1MO) and a **repeatable month-end (EOM) effect**. Volatility controls (MOVE/VIX) are secondary and often collinear if included together. Results are robust to small changes in lag structure (adding one ΔX lag), HAC bandwidth, parsimony (one slope), and a basic EOM-window specification.

**Practical implications.**
- Treat basis/domestic constraints as **slow-moving anchors** (long-run). Use the ECM error-correction term as a **mean-reversion overlay**, not as a daily signal.
- Harvest **EOM** seasonality with modest size; it is small (roughly −1 to −2bp, with a HAC estimate of −1.90 in the EOM test above) but persistent.
- For clarity and stability, prefer a **parsimonious baseline** (one slope + one vol), with the fuller set in robustness.

*If future edits change core empirical outputs (e.g., cointegration results, ECT λ, or EOM size), please update the bullet points above to keep the narrative synchronized with the estimates.*
