# Cointegration Analysis

This notebook loads all price data from `data/`, takes the ticker universe from `research/raw/*.csv`, and runs an **Engle-Granger cointegration test** on every pair of tickers. Results are written to `research/processed/cointegration_results.csv` with these columns:

- **ticker1**, **ticker2**: the pair tested
- **coint_t**: test statistic
- **pvalue**: p-value of the test
- **crit_1pct**, **crit_5pct**, **crit_10pct**: critical values at the 1%, 5%, and 10% levels
- **cointegrated_5pct**: True when the pair is cointegrated at the 5% level
- **collinear**: True when the two series (y0/y1) are almost perfectly collinear

In [19]:
import os
import sys
from itertools import combinations
from pathlib import Path

import pandas as pd

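# Walk up from the working directory to the repository root (the folder
# containing .git) so that the `research.*` imports below resolve.
_root = Path.cwd().resolve()
while _root != _root.parent and not (_root / ".git").exists():
    _root = _root.parent
sys.path.insert(0, str(_root))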

from research.config.constants import get_universe
from research.functions.load_data import load_prices

PROJECT_ROOT = _root
PROCESSED_DIR = PROJECT_ROOT / "research" / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

In [20]:
tickers = get_universe()
print(f"Universe: {len(tickers)} tickers")

Universe: 316 tickers


## Load prices and build panel

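Use `adj_close` and pivot the long price table into a wide panel with one row per date and one column per ticker. Pairs are formed only from tickers present in the panel, so a universe ticker with no price data drops out.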
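The pivot step can be illustrated on a toy table. The tickers `AAA` and `BBB` and their prices are hypothetical. The sketch shows how missing dates become `NaN` in the wide panel and how a pair's overlap can be counted.

```python
import pandas as pd

# Toy long-format prices (hypothetical values), one row per (date, ticker).
long_df = pd.DataFrame(
    {
        "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
        "ticker": ["AAA", "BBB", "AAA", "BBB"],
        "adj_close": [10.0, 20.0, 10.5, 19.5],
    }
)
long_df["date"] = pd.to_datetime(long_df["date"]).dt.date

# One column per ticker; dates where a ticker has no price become NaN.
wide_toy = long_df.pivot(index="date", columns="ticker", values="adj_close").dropna(how="all")
print(wide_toy)

# Overlap for a pair: dates on which both tickers have a price.
overlap = wide_toy[["AAA", "BBB"]].dropna()
print(f"Overlapping dates: {len(overlap)}")
```

Note that `pivot` raises an error if the same (date, ticker) combination appears more than once, so duplicates would need to be removed first.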

In [21]:
prices = load_prices(tickers=tickers, data_dir=PROJECT_ROOT / "data", columns=["date", "ticker", "adj_close"])
prices["date"] = pd.to_datetime(prices["date"]).dt.date
wide = prices.pivot(index="date", columns="ticker", values="adj_close").dropna(how="all")
print(f"Panel: {len(wide)} dates × {len(wide.columns)} tickers")

Panel: 1495 dates × 315 tickers


## Run cointegration tests (parallel)

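Tests run in parallel with `ProcessPoolExecutor`; the price panel is handed to each worker process once through the pool initializer. Each test calls `coint(..., autolag=None, maxlag=0)`, which fixes the lag order at zero and skips automatic lag selection, for speed. Pairs with fewer than 100 overlapping dates are skipped.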

In [22]:
from concurrent.futures import ProcessPoolExecutor

from research.functions.coint_worker import init_worker, test_pair

pairs = list(combinations(wide.columns, 2))
t1_list, t2_list = [p[0] for p in pairs], [p[1] for p in pairs]
n_workers = min(os.cpu_count() or 4, 16)
print(f"Running {len(pairs)} pairs with {n_workers} processes...")

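# The initializer gives each worker process the panel once, so each task
# only has to carry two ticker names.
with ProcessPoolExecutor(max_workers=n_workers, initializer=init_worker, initargs=(wide,)) as ex:
    # map() yields results in submission order; chunksize batches the tasks.
    raw = ex.map(test_pair, t1_list, t2_list, chunksize=500)
    results = []
    for i, r in enumerate(raw):
        if r is not None:  # None marks a skipped pair
            results.append(r)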
        if (i + 1) % 5000 == 0 or i == 0:
            print(f"  {i + 1} / {len(pairs)} pairs done...")
    print(f"  {len(pairs)} / {len(pairs)} pairs done.")

df_results = pd.DataFrame(results)
print(f"Done. {len(df_results)} pairs.")

Running 49455 pairs with 8 processes...
  1 / 49455 pairs done...
  5000 / 49455 pairs done...
  10000 / 49455 pairs done...
  15000 / 49455 pairs done...
  20000 / 49455 pairs done...
  25000 / 49455 pairs done...
  30000 / 49455 pairs done...
  35000 / 49455 pairs done...
  40000 / 49455 pairs done...
  45000 / 49455 pairs done...
  49455 / 49455 pairs done.
Done. 49455 pairs.


In [23]:
out_path = PROCESSED_DIR / "cointegration_results.csv"
df_results.to_csv(out_path, index=False)
print(f"Saved {len(df_results)} rows to {out_path}")
print(f"Cointegrated at 5%: {df_results['cointegrated_5pct'].sum()}")
df_results.head(10)

Saved 49455 rows to /Users/mdabdullahalmahin/Desktop/Projects/quant-trading/research/processed/cointegration_results.csv
Cointegrated at 5%: 3437


| | ticker1 | ticker2 | coint_t | pvalue | crit_1pct | crit_5pct | crit_10pct | cointegrated_5pct | collinear |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AAPL | ABBV | -3.525804 | 0.030175 | -3.904904 | -3.340845 | -3.047722 | True | False |
| 1 | AAPL | ABNB | -0.755512 | 0.940304 | -3.906851 | -3.341928 | -3.048473 | False | False |
| 2 | AAPL | ABT | -1.922887 | 0.568356 | -3.904904 | -3.340845 | -3.047722 | False | False |
| 3 | AAPL | ACN | -0.063132 | 0.984321 | -3.904904 | -3.340845 | -3.047722 | False | False |
| 4 | AAPL | ADBE | -0.418254 | 0.969668 | -3.906107 | -3.341514 | -3.048186 | False | False |
| 5 | AAPL | AEP | -2.817596 | 0.16015 | -3.90533 | -3.341082 | -3.047886 | False | False |
| 6 | AAPL | AGG | -0.982245 | 0.906163 | -3.904904 | -3.340845 | -3.047722 | False | False |
| 7 | AAPL | AIG | -2.62687 | 0.226975 | -3.90533 | -3.341082 | -3.047886 | False | False |
| 8 | AAPL | AIQ | -2.343526 | 0.352047 | -3.904904 | -3.340845 | -3.047722 | False | False |
| 9 | AAPL | AMD | -2.671017 | 0.210175 | -3.904904 | -3.340845 | -3.047722 | False | False |
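One way to use the saved results is to keep only pairs flagged as cointegrated and not collinear, ordered by p-value. The sketch below is illustrative: it rebuilds three rows from the preview above so that it runs on its own, and the variable name `candidates` is an assumption.

```python
import pandas as pd

# Three rows copied from the preview above. In practice, read the file instead:
# df = pd.read_csv("research/processed/cointegration_results.csv")
df = pd.DataFrame(
    {
        "ticker1": ["AAPL", "AAPL", "AAPL"],
        "ticker2": ["ABBV", "ABNB", "AEP"],
        "coint_t": [-3.525804, -0.755512, -2.817596],
        "pvalue": [0.030175, 0.940304, 0.16015],
        "cointegrated_5pct": [True, False, False],
        "collinear": [False, False, False],
    }
)

# Keep flagged pairs, drop collinear ones, and put the strongest evidence first.
candidates = (
    df[df["cointegrated_5pct"] & ~df["collinear"]]
    .sort_values("pvalue")
    .reset_index(drop=True)
)
print(candidates[["ticker1", "ticker2", "coint_t", "pvalue"]])
```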
