A pairs trading framework built from first principles on 10-second order book data, developed independently in 72 hours. The project covers the full research pipeline: from microstructure-aware data cleaning through cointegration analysis, dynamic hedge ratio estimation, and a complete backtesting engine with realistic execution assumptions.
Standard pairs trading tutorials assume clean price series and ignore market microstructure entirely. This project takes the opposite approach — starting from the raw order book and asking: what does it actually mean for two instruments to be in equilibrium, and how do you detect and trade deviations from it in a high-frequency setting?
The two instruments exhibit persistent co-movement across most trading regimes, making them a natural candidate for a mean-reversion strategy. The key challenge is doing the statistical analysis rigorously rather than naively — distinguishing genuine cointegration from spurious correlation, and building execution logic that respects the bid-ask spread rather than assuming trades occur at midprice.
Raw 10-second snapshots require significant preprocessing before any statistical analysis is valid:
- Crossed market detection: rows where `ASK ≤ BID` indicate executed trades and are removed to preserve a clean quote-only dataset
- Weekend and public holiday removal: weekends exhibit near-zero volume and stale quotes; keeping them contaminates rolling statistics with unrepresentative observations
- Inactive market detection: contiguous periods where all quote fields are frozen (≥ 1 minute) are identified and removed; these represent pre-market, post-market, or halted trading windows
- Overnight gap handling: log return calculations respect session boundaries so overnight gaps do not produce spurious large returns
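The four cleaning steps above could be sketched in pandas roughly as follows. The column names (`BID_VOL`, `ASK_VOL` alongside `BID` and `ASK`), the `holidays` argument, the use of the mid for returns, and the treatment of each calendar day as one session are assumptions made for illustration, not details taken from the notebook.

```python
import numpy as np
import pandas as pd

QUOTE_COLS = ["BID", "ASK", "BID_VOL", "ASK_VOL"]  # hypothetical column names


def clean_quotes(df, holidays=(), min_frozen="1min"):
    """Apply the four cleaning steps to a quote frame with a DatetimeIndex."""
    df = df.sort_index()

    # 1. Crossed markets: keep only rows with ASK strictly above BID.
    df = df[df["ASK"] > df["BID"]]

    # 2. Weekends (dayofweek 5 and 6) and caller-supplied public holidays.
    holiday_dates = pd.to_datetime(list(holidays)).normalize()
    is_weekend = df.index.dayofweek >= 5
    is_holiday = df.index.normalize().isin(holiday_dates)
    df = df[~(is_weekend | is_holiday)]

    # 3. Inactive markets: a new run starts whenever any quote field changes;
    #    runs spanning at least `min_frozen` are dropped in full.
    changed = df[QUOTE_COLS].ne(df[QUOTE_COLS].shift()).any(axis=1)
    run_id = changed.cumsum().to_numpy()
    grouped = df.index.to_series().groupby(run_id)
    duration = grouped.transform("max") - grouped.transform("min")
    frozen = (duration >= pd.Timedelta(min_frozen)).to_numpy()
    df = df[~frozen]

    # 4. Log returns are differenced within each calendar day only, so the
    #    first observation of a day is NaN instead of an overnight jump.
    mid = (df["BID"] + df["ASK"]) / 2
    log_ret = np.log(mid).groupby(df.index.date).diff()
    return df.assign(log_ret=log_ret)
```

The order matters in this sketch: frozen runs are detected after crossed rows and non-trading days are gone, so a removed row cannot split or extend a run.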
Rather than using naive midprice, a Volume-Weighted Average Price (VWAP) microprice is constructed:

$$P^{\text{micro}}_t = \frac{P^{\text{bid}}_t \, V^{\text{ask}}_t + P^{\text{ask}}_t \, V^{\text{bid}}_t}{V^{\text{bid}}_t + V^{\text{ask}}_t}$$

where $P$ and $V$ denote the top-of-book price and volume on each side.
This accounts for order book imbalances — when bid volume significantly exceeds ask volume, buying pressure is dominant and the fair value sits closer to the ask. Using midprice in these conditions introduces a systematic bias into spread estimation.
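As a minimal sketch, assuming the usual top-of-book form in which each side's price is weighted by the opposite side's volume, the microprice can be computed as below. The function and argument names are hypothetical, and the fallback to the mid when both sizes are zero is an added assumption.

```python
import pandas as pd


def microprice(bid, ask, bid_vol, ask_vol):
    """Volume-weighted microprice from top-of-book quotes (pandas Series)."""
    total = bid_vol + ask_vol
    # Heavy bid volume puts more weight on the ask price, and vice versa.
    micro = (bid * ask_vol + ask * bid_vol) / total
    mid = (bid + ask) / 2
    # With no resting volume on either side the ratio is undefined; use the mid.
    return micro.where(total > 0, mid)


quotes = pd.DataFrame({
    "bid": [100.0, 100.0, 100.0],
    "ask": [101.0, 101.0, 101.0],
    "bid_vol": [90.0, 50.0, 0.0],
    "ask_vol": [10.0, 50.0, 0.0],
})
quotes["micro"] = microprice(quotes["bid"], quotes["ask"],
                             quotes["bid_vol"], quotes["ask_vol"])
print(quotes)
```

In the first row bid volume dominates, so the microprice lands near the ask; in the balanced second row it coincides with the mid.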
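Before testing cointegration, stationarity of the individual price series is established from first principles. Starting from the AR(1) process:

$$y_t = \phi \, y_{t-1} + \varepsilon_t$$

Recursive substitution shows:

$$y_t = \phi^t y_0 + \sum_{i=0}^{t-1} \phi^i \varepsilon_{t-i}$$

Stationarity requires $|\phi| < 1$, so that the influence of the initial condition and of past shocks decays geometrically.

The null hypothesis is $H_0: \phi = 1$ (a unit root), under which shocks never decay and the series is non-stationary.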
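To make the unit-root test concrete, here is a minimal NumPy sketch of the plain Dickey-Fuller regression $\Delta y_t = \alpha + \gamma y_{t-1} + e_t$, where $\gamma = \phi - 1$, so the null $\phi = 1$ becomes $\gamma = 0$. The intercept and the absence of lagged-difference terms are simplifying assumptions; `adfuller` in statsmodels adds lag augmentation and p-values. The t-statistic is compared against Dickey-Fuller critical values (about -2.86 at the 5% level for the intercept-only case), not the usual Student-t table.

```python
import numpy as np


def dickey_fuller_t(y):
    """t-statistic of gamma in  dy_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
eps = rng.standard_normal(2000)

random_walk = np.cumsum(eps)            # phi = 1: unit root
ar1 = np.zeros(2000)
for t in range(1, 2000):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]  # phi = 0.5: stationary

print("random walk t-stat:", dickey_fuller_t(random_walk))
print("AR(1), phi=0.5 t-stat:", dickey_fuller_t(ar1))
```

A statistic far below the critical value rejects the unit root, which is what the stationary series should produce; the random walk should not reject in most samples.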
With both series confirmed as I(1), the Engle-Granger procedure tests whether a stationary linear combination exists. The hedge ratio is estimated via OLS in both directions (X on Y, Y on X), with the direction yielding the more negative ADF statistic selected. The Engle-Granger test with MacKinnon-adjusted critical values is then applied to the residual:

$$\hat{\varepsilon}_t = Y_t - \hat{\alpha} - \hat{\beta} X_t$$

(written here for the regression of Y on X; the roles swap for the other direction).
Results indicate weak cointegration — the null of no cointegration is rejected at the 10% level but not the 5% level. This motivates a conservative strategy design that does not assume a strongly stable long-run equilibrium, and reinforces the need for a rolling (rather than static) hedge ratio.
Note on spurious regression: direct OLS on price levels produces a very high $R^2$ that is driven by shared non-stationary drift, not genuine co-movement. All inference is therefore performed on residuals and returns, not levels.
A static hedge ratio is insufficient given the observed slow-moving drift in the spread mean across the sample. A 10-day rolling OLS window estimates the time-varying parameters: the intercept $\alpha_t$ and the hedge ratio $\beta_t$ of the selected regression.
The 10-day window reflects a deliberate bias-variance tradeoff: short enough to adapt to medium-term regime shifts, long enough to prevent noisy parameter estimates from microstructure effects.
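A rolling OLS of one series on the other reduces to rolling covariance and variance, which pandas provides directly. The sketch below takes the window as a number of observations; how many 10-second bars make up 10 trading days depends on session length and on how much data cleaning removes, so that mapping is left to the caller. Function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_hedge_ratio(x, y, window):
    """Rolling OLS of y on x:  y_t = alpha_t + beta_t * x_t + e_t."""
    beta = y.rolling(window).cov(x) / x.rolling(window).var()
    alpha = y.rolling(window).mean() - beta * x.rolling(window).mean()
    spread = y - (alpha + beta * x)
    return pd.DataFrame({"alpha": alpha, "beta": beta, "spread": spread})


# Synthetic check: the true relationship is y = 1 + 2x plus small noise.
rng = np.random.default_rng(0)
x = pd.Series(np.cumsum(rng.standard_normal(3000)))
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(3000)

params = rolling_hedge_ratio(x, y, window=500)
print(params.tail())
```

Note that the spread at time t here uses parameters estimated on a window that includes t. A backtest would typically lag `alpha` and `beta` by one bar (for example with `.shift(1)`) to avoid look-ahead.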
Before constructing the spread, raw microprices are smoothed using a time-decay Exponential Moving Average that accounts for irregular time gaps in the data:
This is critical for two reasons: (1) standard span-based EMAs assume uniform time steps, which is violated by the inactive-market gaps removed during cleaning; (2) smoothing attenuates bid-ask bounce and transient liquidity shocks that would otherwise inflate spread variance and generate false signals.
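One common way to build such a filter, assumed here since the exact form is not recoverable from the text, is to let the weight on the previous EMA value decay as $e^{-\Delta t / \tau}$, where $\Delta t$ is the elapsed time since the last observation and $\tau$ is a time constant. The function name and the default `tau` are hypothetical.

```python
import numpy as np
import pandas as pd


def time_decay_ema(series, tau="5min"):
    """EMA whose decay depends on elapsed wall-clock time, not on row count."""
    tau_s = pd.Timedelta(tau).total_seconds()
    values = series.to_numpy(dtype=float)
    dt = series.index.to_series().diff().dt.total_seconds().to_numpy()

    out = np.empty_like(values)
    out[0] = values[0]
    for i in range(1, len(values)):
        w = np.exp(-dt[i] / tau_s)  # weight retained by the previous EMA value
        out[i] = w * out[i - 1] + (1.0 - w) * values[i]
    return pd.Series(out, index=series.index, name=series.name)


idx = pd.to_datetime([
    "2020-03-02 09:00:00",
    "2020-03-02 09:00:10",  # regular 10-second step
    "2020-03-02 10:00:10",  # one-hour gap left by data cleaning
])
prices = pd.Series([0.0, 1.0, 5.0], index=idx)
print(time_decay_ema(prices))
```

After a 10-second step the EMA moves only slightly toward the new value, whereas after the one-hour gap the old state has almost fully decayed and the EMA effectively restarts at the new observation.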
Entry and exit signals are based on the rolling z-score of the spread:

$$z_t = \frac{S_t - \mu_t}{\sigma_t}$$

where $S_t$ is the spread and $\mu_t$, $\sigma_t$ are its rolling mean and standard deviation.

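| Signal | Condition | Action |
|---|---|---|
| Long spread | $z_t < -z_{\text{entry}}$ | Buy the spread |
| Short spread | $z_t > z_{\text{entry}}$ | Sell the spread |
| Exit | $\lvert z_t \rvert < z_{\text{exit}}$ | Close position |

The thresholds $z_{\text{entry}}$ and $z_{\text{exit}}$ are the entry and exit levels of the spread z-score $z_t$.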
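The signal logic can be sketched as a small state machine over the rolling z-score. The threshold values (2.0 and 0.5), the window, and the function name are placeholders, not the project's settings. In this sketch a long position is closed once the z-score has risen back above `-z_exit`, and a short once it has fallen below `z_exit`, so a jump straight through the exit band still closes the trade.

```python
import numpy as np
import pandas as pd


def zscore_positions(spread, window, z_entry=2.0, z_exit=0.5):
    """Return the rolling z-score and a position of +1 (long), -1 (short) or 0."""
    mu = spread.rolling(window).mean()
    sigma = spread.rolling(window).std()
    z = (spread - mu) / sigma

    position = np.zeros(len(spread), dtype=int)
    current = 0
    for i, zi in enumerate(z.to_numpy()):
        if np.isnan(zi):
            current = 0                  # stay flat during warm-up
        elif current == 0:
            if zi < -z_entry:
                current = 1              # spread is cheap: buy it
            elif zi > z_entry:
                current = -1             # spread is rich: sell it
        elif current == 1 and zi > -z_exit:
            current = 0                  # long has reverted: close
        elif current == -1 and zi < z_exit:
            current = 0                  # short has reverted: close
        position[i] = current
    return pd.DataFrame({"z": z, "position": position}, index=spread.index)


rng = np.random.default_rng(1)
demo_spread = pd.Series(rng.standard_normal(500))
signals = zscore_positions(demo_spread, window=50)
print(signals["position"].value_counts())
```

Keeping the exit threshold well inside the entry threshold gives the rule hysteresis, which avoids flipping in and out of a position when the z-score hovers near the entry level.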
Order Book Imbalance (OBI) as a signal — OBI measures relative buy/sell pressure at the top of book:

$$\text{OBI}_t = \frac{V^{\text{bid}}_t - V^{\text{ask}}_t}{V^{\text{bid}}_t + V^{\text{ask}}_t}$$

where $V^{\text{bid}}_t$ and $V^{\text{ask}}_t$ are the volumes at the best bid and best ask, so that $\text{OBI}_t \in [-1, 1]$.

OBI captures short-term order flow pressure that is orthogonal to the spread z-score. A natural extension is a composite entry signal that requires both a z-score threshold crossing and a confirming OBI direction, reducing adverse selection on entry.
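A composite entry of this kind might look like the sketch below. Which instrument's OBI should confirm a spread trade is not specified in the text; the sketch assumes `obi` belongs to the leg that is bought in a long-spread trade, and the thresholds and function names are hypothetical.

```python
import pandas as pd


def order_book_imbalance(bid_vol, ask_vol):
    """Top-of-book imbalance in [-1, 1]; positive means more bid volume."""
    total = bid_vol + ask_vol
    return ((bid_vol - ask_vol) / total).where(total > 0, 0.0)


def composite_entry(z, obi, z_entry=2.0, obi_min=0.0):
    """+1 for a confirmed long entry, -1 for a confirmed short entry, else 0."""
    long_entry = (z < -z_entry) & (obi > obi_min)
    short_entry = (z > z_entry) & (obi < -obi_min)
    return long_entry.astype(int) - short_entry.astype(int)


z = pd.Series([-3.0, -3.0, 3.0, 3.0, 0.0])
obi = order_book_imbalance(
    pd.Series([75.0, 25.0, 25.0, 75.0, 95.0]),
    pd.Series([25.0, 75.0, 75.0, 25.0, 5.0]),
)
print(composite_entry(z, obi).tolist())
```

Only the first and third observations pass both filters: the z-score is beyond the threshold and the order flow points the same way as the intended trade.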
Regime-aware strategy via Hidden Markov Model — rolling correlation between the instruments exhibits clear distributional shifts, with the COVID-19 volatility shock (March–April 2020) producing a distinct high-correlation, high-spread regime. An HMM over observable features (rolling volatility, rolling correlation, bid-ask spread) could identify the latent market regime and allow the strategy to adjust thresholds or pause trading in unfavourable regimes. This draws on multi-target tracking methodology where similar state-space approaches are used for regime identification.
Kalman filter for hedge ratio estimation: rolling OLS is a reasonable baseline but treats the hedge ratio as piecewise-constant within each window. A Kalman filter would model the hedge ratio as a latent state that is updated at every observation.
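A minimal NumPy sketch of this idea follows, assuming the intercept and hedge ratio both follow random walks and the observation equation is $y_t = \alpha_t + \beta_t x_t + e_t$. The noise variances `q` and `r`, the diffuse prior, and the function name are illustrative choices that would need tuning on real data.

```python
import numpy as np


def kalman_hedge_ratio(x, y, q=1e-5, r=1e-3):
    """Filter y_t = alpha_t + beta_t * x_t + e_t with random-walk states.

    Returns an (n, 2) array of filtered [alpha_t, beta_t].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    state = np.zeros(2)              # [alpha, beta]
    P = np.eye(2) * 1e3              # diffuse prior on the initial state
    Q = np.eye(2) * q                # state (random-walk) noise covariance
    states = np.empty((len(x), 2))

    for t in range(len(x)):
        P = P + Q                    # predict: states are random walks
        H = np.array([1.0, x[t]])    # observation row for this time step
        innovation = y[t] - H @ state
        S = H @ P @ H + r            # innovation variance
        K = P @ H / S                # Kalman gain
        state = state + K * innovation
        P = P - np.outer(K, H) @ P   # posterior covariance
        states[t] = state
    return states


# Synthetic check: the true relationship is y = 1 + 2x plus small noise.
rng = np.random.default_rng(0)
x_demo = np.cumsum(rng.standard_normal(2000))
y_demo = 1.0 + 2.0 * x_demo + 0.03 * rng.standard_normal(2000)

filtered = kalman_hedge_ratio(x_demo, y_demo)
print("final alpha, beta:", filtered[-1])
```

The ratio `q / r` plays the role that window length plays in rolling OLS: a larger ratio lets the hedge ratio adapt faster at the cost of noisier estimates.
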
Dependencies:
- numpy
- pandas
- statsmodels
- matplotlib
- seaborn

Files:
- `PairsTrading.ipynb`: full analysis notebook
- `final_data_10s.csv`: 10-second order book snapshots (not included)