<img style="float: right; margin: 0px 0px 15px 15px;" src="https://educationusa.state.gov/sites/default/files/wysiwyg/iteso_logo.jpg" width="520px" height="230px" />

# <span style="color: darkblue; ">Advanced Trading Strategies: _Statistical Arbitrage_</span>
`MICROSTRUCTURE AND TRADING SYSTEMS`

Juan Ramón Rocha López 

- Exp: 739950

Mariana Valenzuela Lafarga
    
- Exp: 749770

Repository on GitHub: [Repository link](https://github.com/RaemonRoch/algo-trading)

### Strategy Description and Rationale

##### 1. Overview
- Strategy: Pairs trading (statistical arbitrage) that captures mean-reversion of the spread between two assets (e.g., GOOGL vs AMD). The execution flow and main backtest parameters (splits, fees, borrow rates, windows, thresholds) are orchestrated in main.py.

##### 2. Pairs trading approach
- Spread definition: model P1 as a linear combination of P2: P1 ≈ beta0 + beta1 * P2; the residual (spread = P1 - (beta0 + beta1*P2)) is traded. Trading logic (spread, z-score, thresholds, history) is implemented in MarketNeutralStrategy.
- Signals: compute z-score of spread over a rolling window. Enter when |z-score| > entry_threshold; exit when |z-score| < exit_threshold.
- Sizing and execution: StrategyOrchestrator sizes the primary leg as a percentage of capital and sizes the secondary leg via the estimated hedge ratio; order execution, commissions and borrow costs are modeled in Exchange.
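
The signal rule above can be sketched in a few lines. This is a minimal illustration, not the code in MarketNeutralStrategy: the function name `zscore_signal`, the default window of 30 and the use of the sample standard deviation are assumptions, while the signal codes (1, 0, -2) follow the encoding used later in this report.

```python
import numpy as np

def zscore_signal(spread_history, rolling_window=30,
                  entry_threshold=2.0, exit_threshold=0.5):
    """Return (z, signal) for the most recent spread value.

    Signal codes: 1 = long spread, 0 = short spread, -2 = close, None = hold.
    """
    window = np.asarray(spread_history[-rolling_window:], dtype=float)
    if window.size < rolling_window:
        return None, None              # not enough history yet
    std = window.std(ddof=1)           # sample std (an assumption)
    if std == 0:
        return None, None              # flat window, z-score undefined
    z = (window[-1] - window.mean()) / std
    if z < -entry_threshold:
        return z, 1                    # spread unusually low: buy it
    if z > entry_threshold:
        return z, 0                    # spread unusually high: sell it
    if abs(z) < exit_threshold:
        return z, -2                   # back near the mean: close
    return z, None

# Hypothetical history: noise around zero followed by a large positive jump
history = [(-1.0) ** i for i in range(29)] + [10.0]
z, signal = zscore_signal(history)
```

Because the entry threshold is larger than the exit threshold, there is a band between them where the function returns no action, which avoids opening and closing on the same small fluctuation.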

##### 3. Why cointegration indicates an arbitrage opportunity
- Intuition: two non-stationary price series can have a stationary linear combination (cointegration). If the spread is stationary (constant mean and variance over time), large deviations tend to revert, and this expected reversion is the basis of statistical arbitrage.
- Correlation vs cointegration: correlation measures contemporaneous co-movement; cointegration implies a long-run equilibrium in levels, which justifies trading the spread.
- Risks: cointegration doesn't guarantee reversion that is fast or large enough to cover trading costs; the implementation models commissions and borrow costs to assess feasibility.

##### 4. Justification for the Kalman Filter for dynamic hedging
- Need for dynamic hedge: the hedge ratio (beta1) can change over time due to relative volatility or structural shifts.
- Technical advantage: the Kalman Filter provides recursive, adaptive estimates of beta0 and beta1 (state is random-walk; observation is linear), updating every tick without full re-calibration. This reduces hedging error versus a fixed beta. The KF setup and filter_update logic live inside MarketNeutralStrategy.
- Practical benefits: improved dynamic coverage, fewer batch re-estimations, faster reaction to regime changes.
- Limitations: assumes linear-Gaussian dynamics; large structural breaks or non-linearities can degrade performance and require regime-change detection.

##### 5. Expected market conditions for success
- Pairs with stable economic/fundamental relationship (same sector, similar exposures).
- Sufficient liquidity to limit slippage in both legs.
- Low-to-moderate volatility: clearer mean-reversion signals; in extreme volatility, signals can be noisy.
- Controlled costs: sensitivity to commissions and borrow fees—these are modeled in Exchange and included in performance reports.
- Not recommended during structural breaks (mergers, idiosyncratic news) without detection mechanisms.

##### 6. Implementation, validation and metrics
- Pipeline & parameters: main.py handles data loading, splits (60/20/20), Exchange and Strategy initialization, and test execution.
- Market engine & costs: Exchange models execution, fees, daily borrow costs, and tracks portfolio history and executed trades.
- Robustness testing: BackTesting class supports K-Fold evaluation to analyze stability across different time folds.
- Metrics: generate_performance_report computes Sharpe, Sortino, Max Drawdown, Calmar and trade stats to quantify cost impact and performance.
- Recommendations: sensitivity analysis on rolling_window, entry/exit thresholds and capital allocation; stress-tests with higher fees/borrow costs; monitor hedge_ratio_history and spread/z-score for signal degradation.
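
For reference, the four headline metrics can be computed from an equity curve roughly as follows. This is a sketch under stated assumptions, not the body of generate_performance_report: it assumes daily observations, 252 periods per year, a zero risk-free rate, and a downside deviation measured against a zero target. The function name `performance_summary` is hypothetical.

```python
import numpy as np

def performance_summary(equity, periods_per_year=252):
    """Sharpe, Sortino, max drawdown and Calmar from an equity curve."""
    equity = np.asarray(equity, dtype=float)
    returns = equity[1:] / equity[:-1] - 1.0

    mean, std = returns.mean(), returns.std(ddof=1)
    # Downside deviation: only negative returns contribute
    downside_dev = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))

    # Drawdown relative to the running peak of the equity curve
    running_max = np.maximum.accumulate(equity)
    max_drawdown = np.min(equity / running_max - 1.0)

    years = len(returns) / periods_per_year
    annual_return = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0

    scale = np.sqrt(periods_per_year)
    return {
        "sharpe_ratio": scale * mean / std if std > 0 else 0.0,
        "sortino_ratio": scale * mean / downside_dev if downside_dev > 0 else 0.0,
        "max_drawdown": max_drawdown,
        "calmar_ratio": annual_return / abs(max_drawdown) if max_drawdown < 0 else 0.0,
    }

# Hypothetical four-point equity curve
report = performance_summary([100.0, 110.0, 99.0, 105.0])
```

Under this definition the Calmar ratio takes the sign of the annualized return, so a losing strategy reports a negative Calmar even though the drawdown enters in absolute value.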

### Pair Selection and Methodology
![table](images/table.png)

We selected the GOOGL–AMD pair after a systematic statistical screening and both univariate and multivariate cointegration testing. Although their returns correlation is moderate (≈ 0.366), the formal tests provide stronger evidence: the Engle–Granger test yields a t‑statistic of about −4.145 with a p‑value ≈ 0.0044 (95% critical ≈ −3.338), so we reject the null of no cointegration at conventional significance levels; the Johansen trace statistic is ≈ 28.22 which also exceeds the 95% critical value (≈ 15.49), confirming the presence of at least one cointegrating vector. The analysis was performed on a large sample (n ≈ 3,772), which increases the robustness of these inferences and reduces the chance that the result is an artifact of small sample noise.
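
To make the Engle–Granger logic concrete, the sketch below runs its two steps on synthetic data: an OLS regression of one price on the other, followed by a Dickey–Fuller style t-statistic on the residuals. It is a simplified illustration (no lag augmentation, no p-value), the function name `engle_granger_sketch` and the simulated prices are assumptions, and it is not the routine used to produce the figures quoted above. A library test such as `coint` in statsmodels handles lag selection and the appropriate critical values.

```python
import numpy as np

def engle_granger_sketch(p1, p2):
    """Two-step check: OLS of p1 on p2, then a unit-root t-stat on residuals."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)

    # Step 1: static hedge regression p1 = beta0 + beta1 * p2 + residual
    X = np.column_stack([np.ones_like(p2), p2])
    (beta0, beta1), *_ = np.linalg.lstsq(X, p1, rcond=None)
    resid = p1 - (beta0 + beta1 * p2)

    # Step 2: regress the change in the residual on its lagged level
    lagged, delta = resid[:-1], np.diff(resid)
    rho = (lagged @ delta) / (lagged @ lagged)
    errors = delta - rho * lagged
    sigma2 = (errors @ errors) / (len(delta) - 1)
    t_stat = rho / np.sqrt(sigma2 / (lagged @ lagged))
    return beta0, beta1, t_stat

# Synthetic pair that is cointegrated by construction
rng = np.random.default_rng(0)
p2 = 100 + np.cumsum(rng.normal(size=2000))      # random-walk price
p1 = 5.0 + 0.95 * p2 + rng.normal(size=2000)     # stationary spread around it
beta0, beta1, t_stat = engle_granger_sketch(p1, p2)

# A strongly negative t-stat (below the critical value quoted in the text,
# about -3.338) points to a stationary spread
looks_cointegrated = t_stat < -3.338
```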

In practical terms, the cointegration results mean there exists a stationary linear combination of GOOGL and AMD prices —i.e., a spread— that tends to revert to its long‑run mean, providing the basis for mean‑reversion trading. For this pair, the OLS (static) hedge ratio is roughly 0.953, indicating an approximately 1:1 exposure in levels on average, while the Kalman filter’s most recent slope estimate is about 0.809. That difference implies the relation between the two assets is not strictly constant through time: a static hedge would approximate average behavior, but a dynamic hedge can adapt to gradual or abrupt changes in the relationship and reduce hedging error in live execution. In this project the dynamic hedge using a Kalman Filter is implemented inside the MarketNeutralStrategy and produces a per‑tick intercept and slope that the orchestrator can use for sizing the second leg dynamically [6].
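
The practical effect of the two hedge ratios on position size can be illustrated with a small calculation. This sketch assumes the hedge is applied in units (one unit of asset 1 against beta1 units of asset 2, matching the spread definition); the function name `size_legs`, the 10% allocation and the prices are hypothetical and are not taken from StrategyOrchestrator.

```python
def size_legs(capital, allocation_pct, p1, p2, hedge_ratio):
    """Units per leg for a long-spread position (long asset 1, short asset 2)."""
    units_1 = capital * allocation_pct / p1   # primary leg: share of capital
    units_2 = -hedge_ratio * units_1          # secondary leg: hedge in units
    return units_1, units_2

# Same capital and prices, static OLS ratio versus latest Kalman estimate
static = size_legs(1_000_000, 0.10, p1=150.0, p2=120.0, hedge_ratio=0.953)
dynamic = size_legs(1_000_000, 0.10, p1=150.0, p2=120.0, hedge_ratio=0.809)
```

With these illustrative inputs the static ratio shorts about 635 units of the second asset while the Kalman estimate shorts about 539, a difference of roughly 15% in the hedge leg.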

To validate and document the selection visually and operationally, include: normalized price series for GOOGL and AMD (base 100) to inspect long‑term co‑movement and detect structural breaks; a scatter plot of P_GOOGL versus P_AMD with the OLS regression line to show average linear fit; the spread series computed with the chosen hedge ratio and its rolling z‑score (e.g., 30‑day window) to show reversion episodes and potential entry/exit thresholds; and a time series of the Kalman‑estimated hedge ratio compared against the OLS value to illustrate how the hedge changes over time. Those visual checks complement the formal tests and help detect whether reversion is frequent and large enough to overcome transaction costs and borrow fees.

Operational considerations are critical: the backtesting pipeline, parameter choices and train/test splits are orchestrated in main.py (including the rolling window and z‑score thresholds) so you should run walk‑forward and sensitivity analyses there [1]. The Exchange implementation models execution costs, commission deductions and per‑day borrow costs and records historical balance and executed trades, enabling realistic PnL accounting in backtests [3]. Use the performance report generator to compute Sharpe, Sortino, max drawdown and other metrics once you have simulated trades, since profitability must be net of fees and borrow costs to be meaningful [4].

Finally, acknowledge risks and next steps: cointegration may break after corporate events, sector shocks, or regime shifts, so implement detection rules and limits for structural breaks; prefer log‑prices for regression/cointegration if not already used to reduce scale effects and handle splits; and prefer dynamic hedging (Kalman) in live execution while periodically comparing it to batch OLS estimates as a sanity check. Run stress tests with higher commissions and borrow fees and perform out‑of‑sample/walk‑forward validation before moving towards any live deployment.

### Sequential Decision Analysis Framework

The Sequential Decision Analysis (SDA) framework models the entire trading process as a sequential loop of observing, learning, and acting.
* Detailed Mathematical Formulation of State-Space Model:
    The system is defined by a linear state-space model, which includes a hidden state (the hedge ratio) and a measurable observation (the asset prices).
    * State Equation (Transition): $X_t = F \cdot X_{t-1} + \omega_t$ 
        * The hidden state $X_t$ is the vector of dynamic regression coefficients: $X_t = [\beta_{0,t} \text{ (intercept)}, \beta_{1,t} \text{ (slope)}]^T$.
        * The transition matrix $F$ is set to the identity matrix (np.eye(2)), modeling the state as a random walk. This assumes the coefficients at time $t$ are the same as at $t-1$, plus some process noise $\omega_t$.
    * Observation Equation (Measurement): $Y_t = H_t \cdot X_t + \epsilon_t$
        * The observation $Y_t$ is the price of Asset 1: $P_{1,t}$.
        * The observation matrix $H_t$ is dynamic and updated at each step: $H_t = [1, P_{2,t}]$, where $P_{2,t}$ is the price of Asset 2.
        * The equation expands to: $P_{1,t} = 1 \cdot \beta_{0,t} + P_{2,t} \cdot \beta_{1,t} + \epsilon_t$, where $\epsilon_t$ is the measurement noise.

* Description of Sequential Process (predict $\rightarrow$ observe $\rightarrow$ update $\rightarrow$ decide $\rightarrow$ act $\rightarrow$ learn): This loop is executed by the StrategyOrchestrator for each timestamp in the historical data.
    * Predict: The Kalman Filter (internally within pykalman) predicts the a priori state $[\beta_0, \beta_1]$ for time $t$.
    * Observe: The StrategyOrchestrator reads the current market prices $P_1$ and $P_2$ for the current timestamp.
    * Update (Learn): The orchestrator calls strategy.get_signals(). This method provides the new observation $P_1$ and observation matrix $[1, P_2]$ to the kf_hedge.filter_update function. The filter calculates the a posteriori (updated) state, effectively "learning" the new, most likely hedge ratio.
    * Decide: get_signals() uses the updated state to compute the current spread and its Z-score. It compares this Z-score against entry_threshold and exit_threshold to generate a signal: (1: Open Long), (0: Open Short), (-2: Close Position).
    * Act: The StrategyOrchestrator receives the signal. If it's an entry signal (1 or 0) and the position is neutral, it calls _open_position. If it's an exit signal (-2) and a position is open, it calls _close_position. These methods execute trades via exchange.execute_trade.
* Q and R Matrix Selection with Justification:
    * R (Observation Covariance): observation_covariance = 1.0. This represents the variance of the measurement noise $\epsilon_t$. A value of 1.0 is a standard baseline, implying moderate uncertainty in the observed price relationship ($P_1$ vs. $P_2$) that isn't captured by the hedge ratio.
    * Q (Transition Covariance): transition_covariance = np.eye(2) * 1e-5. This represents the variance of the process noise $\omega_t$. The deliberately small value encodes a strong belief that the coefficients $[\beta_0, \beta_1]$ evolve very slowly. This makes the filter smooth and resistant to short-term market noise, at the cost of slower adaptation to genuine shifts in the relationship.
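
The state-space model and the Q and R choices above translate into a short recursion. The project delegates this to pykalman; the version below is a hand-written NumPy sketch of the same equations so that each step is visible. The function name `kalman_step` and the synthetic price series are assumptions made for illustration.

```python
import numpy as np

def kalman_step(x, P, p1, p2, Q, R):
    """One predict/update cycle for the state x = [beta0, beta1]."""
    # Predict: F is the identity, so the mean carries over and
    # the covariance grows by the process noise Q
    x_pred = x
    P_pred = P + Q

    # Update: observe p1 through H_t = [1, p2]
    H = np.array([1.0, p2])
    innovation = p1 - H @ x_pred           # a priori spread
    S = H @ P_pred @ H + R                 # innovation variance (scalar)
    K = P_pred @ H / S                     # Kalman gain, shape (2,)
    x_new = x_pred + K * innovation
    P_new = P_pred - np.outer(K, H @ P_pred)
    return x_new, P_new

# Settings described in this report
x = np.array([0.0, 1.0])                   # initial_state_mean
P = np.eye(2) * 1.0                        # initial_state_covariance
Q = np.eye(2) * 1e-5                       # transition_covariance
R = 1.0                                    # observation_covariance

# Synthetic pair with a true slope of 0.8 (illustrative only)
rng = np.random.default_rng(1)
p2_series = 100 + np.cumsum(rng.normal(size=500))
p1_series = 2.0 + 0.8 * p2_series + rng.normal(scale=0.5, size=500)

for p1, p2 in zip(p1_series, p2_series):
    x, P = kalman_step(x, P, p1, p2, Q, R)

beta0_hat, beta1_hat = x
```

Raising Q relative to R makes the gain larger and the hedge ratio more reactive; lowering it pushes the estimate toward a static regression.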

### Kalman Filter Implementation

 
This section details the practical setup and operation of the filter.
* Initialization Procedures: The filter is initialized with a "prior belief" about the state:
    * initial_state_mean = np.array([0.0, 1.0]): The filter's initial belief is that the intercept is 0 and the slope is 1. This is a neutral and logical starting point for a pair relationship.
    
    * initial_state_covariance = np.eye(2) * 1.0: This initializes the filter with moderate uncertainty about its initial belief, allowing it to converge quickly to the true parameters as it processes new data.
* Parameter Estimation Methodology: The state parameters ($\beta_0, \beta_1$) are estimated recursively and online. Unlike a static OLS regression, the filter updates its parameter estimates at every single timestamp. This is achieved via the kf_hedge.filter_update method, which optimally blends the previous state prediction with the new observation to produce the most probable current state.
* Reestimation Schedule and Validation Approach:
    * Reestimation Schedule: The reestimation is continuous. The hedge ratio is re-calculated at every time step (e.g., daily) as new price data becomes available.
    * Validation Approach: The overall strategy's performance is validated using the BackTesting class. This class implements K-Fold cross-validation. Crucially, it sets shuffle=False to preserve the temporal order of the data, which is essential for time-series analysis. This allows for performance assessment across multiple distinct "folds" or time periods.
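
To show what non-shuffled folds look like, the sketch below splits an index range into consecutive blocks, which is the test-fold layout that K-Fold produces when shuffle=False. It uses NumPy only and is not the BackTesting class itself; the function name `contiguous_folds` is hypothetical, and the sample size of 3,772 is the approximate figure quoted earlier in this report.

```python
import numpy as np

def contiguous_folds(n_samples, n_splits=5):
    """Return (start, stop) index pairs for consecutive, non-shuffled folds."""
    chunks = np.array_split(np.arange(n_samples), n_splits)
    return [(int(chunk[0]), int(chunk[-1]) + 1) for chunk in chunks]

folds = contiguous_folds(3772, n_splits=5)
for start, stop in folds:
    print(f"fold covers rows {start}..{stop - 1} ({stop - start} observations)")
```

Each fold is a distinct time period, so performance can be compared across regimes. If parameters are ever fitted on the remaining folds, note that those can include data from after the test period; an expanding-window scheme such as scikit-learn's TimeSeriesSplit avoids that.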

### Trading Strategy Logic

This section describes how the filter's outputs are translated into discrete trading actions.
* Z-score Definition (using Kalman Filter Spread): Note that while VECM is a valid approach, this implementation uses a spread dynamically derived from the Kalman Filter's state.
    * Spread Calculation: At each step $t$, the spread is calculated using the updated coefficients from the filter: $\text{Spread}_t = P_{1,t} - (\beta_{0,t} + \beta_{1,t} \cdot P_{2,t})$.
    * Z-score Calculation: This spread value is appended to a history (self.spread_history). The Z-score is then calculated by standardizing the current spread relative to its own rolling history (defined by rolling_window): $Z_t = \frac{\text{Spread}_t - \text{rolling\_mean}(\text{Spreads})}{\text{rolling\_std}(\text{Spreads})}$
* Optimal Entry and Exit Z-score Policy Found: The trading policy is governed by symmetric thresholds:
    * Entry (Open Position):
        * If z_score < -self.entry_threshold (e.g., -2.0): Generate Signal 1 (Buy Spread: Long P1, Short P2).
        * If z_score > self.entry_threshold (e.g., +2.0): Generate Signal 0 (Sell Spread: Short P1, Long P2).
    * Exit (Close Position):
        * If abs(z_score) < self.exit_threshold (e.g., 0.5): Generate Signal -2 (Reversion to mean), triggering a call to _close_position.
* Cost Treatment (Commissions and Borrow Rates): The Exchange class realistically models transaction frictions.
    * Commissions: A fee_rate (e.g., 0.125%) is applied to the total USD value of every transaction leg (both opening and closing trades). This cost is immediately deducted from the cash balance.
    * Borrow Rates: At every time step, the update_exchange_status method calculates the total USD value of all short positions. A borrow_rate_daily (e.g., 0.25%) is applied to this value, and the resulting borrow_cost is deducted from the cash balance.
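
A rough cost calculation for one round trip shows how the two frictions compare. This is a worked example, not the Exchange code: the helper names, leg sizes and 10-day holding period are hypothetical, prices are held constant for simplicity, and the daily borrow rate assumes the quoted 0.25% is an annualized figure spread over 252 trading days. If borrow_rate_daily is already a per-day rate, pass 0.0025 directly.

```python
def commission(notional_usd, fee_rate=0.00125):
    """Fee charged on the USD value of a single transaction leg."""
    return abs(notional_usd) * fee_rate

def daily_borrow_cost(short_notional_usd, borrow_rate_daily):
    """Borrow charge for one day on the USD value of short positions."""
    return abs(short_notional_usd) * borrow_rate_daily

# Hypothetical pair position held for 10 days
long_leg, short_leg = 100_000.0, 95_000.0
holding_days = 10

# Both legs pay commission on entry and again on exit
fees = 2 * (commission(long_leg) + commission(short_leg))

# Assumption: 0.25% is annual, converted to a daily rate
borrow = holding_days * daily_borrow_cost(short_leg, borrow_rate_daily=0.0025 / 252)

total_cost = fees + borrow
```

Under these assumptions commissions come to about 487.50 USD and borrow to about 9.42 USD, so the spread must revert by roughly 0.5% of the long leg's notional just to break even.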

### Results and Performance Analysis

![plot_1](images/dynamic_hedge.jpg)
![plot_2](images/equity_curve.jpg)
![plot_3](images/z_score_resids.jpg)

```
{'avg_loss_usd': 0.0,
 'avg_win_loss_ratio': 0.0,
 'avg_win_usd': 0.0,
 'calmar_ratio': np.float64(-0.29788643516914587),
 'max_drawdown': np.float64(-0.09988912182092222),
 'number_of_roundtrips': 0,
 'number_of_trades (legs)': 142,
 'profit_factor': 0.0,
 'sharpe_ratio': np.float64(-0.9536440039441664),
 'sortino_ratio': np.float64(-0.483408473252764),
 'total_borrow_costs': np.float64(289.47586877214536),
 'win_rate': 0.0}
```

### Conclusions

The implemented framework successfully executes a sophisticated, event-driven backtest of a market-neutral strategy. The core engineering objectives were met:

- The Kalman Filter proved effective in its technical implementation, generating a dynamic, non-constant hedge ratio ($\beta_1$) that adapted to market data over time (see Figure: Dynamic Hedge Evolution).
- The Z-score signal logic correctly identified statistical deviations from the dynamic mean, and the StrategyOrchestrator successfully executed 142 trade legs based on these signals.

However, the primary finding from the test set is that this strategy, with its current parameters, is unprofitable.

Key Performance Failures:
- Negative Returns: The portfolio's equity curve (see Figure: Equity Curve) demonstrates a clear negative trend, resulting in a significant loss of capital from the initial $1,000,000.
- Poor Risk-Adjusted Performance: All risk-adjusted metrics are negative. The Sharpe Ratio of -0.95 and Sortino Ratio of -0.48 show that average returns were negative relative to both total and downside volatility.
- Significant Drawdown: The strategy experienced a Max Drawdown of about -10%, with a corresponding Calmar Ratio of -0.30. This indicates the strategy's reversion-to-mean exits were insufficient to protect against sustained losses.

The Z-score signals (see Figure: Z-Score and Signals), while mechanically generated, did not lead to profitable mean-reversion. The assumption that a Z-score crossing the 2.0 threshold would reliably revert to the 0.5 exit threshold proved incorrect in this dataset, producing a net loss.

Actionable Recommendations for Redevelopment:

The strategy is mechanically sound but strategically flawed. Future work must pivot from engineering to optimization and risk management.
- Parameter Optimization: The current static thresholds (Entry: 2.0, Exit: 0.5) are arbitrary. These, along with the Z-score's rolling_window, must be rigorously optimized.
- Kalman Filter Calibration: The filter's transition_covariance (the $Q$ matrix) is the most sensitive parameter for a dynamic hedge. The value used here was a heuristic estimate and should be formally calibrated to how quickly the pair's relationship actually changes.
- Implement Stop-Loss: The strategy currently exits only on mean-reversion (signal == -2). The sustained drawdown suggests a non-reverting regime. A hard stop-loss (e.g., |Z-score| > 3.5) or a time-based stop (e.g., exit the position after 30 days) is essential to limit losses when the model assumption fails.
- Improve PnL Logging: The metrics show 142 trade legs but 0 round trips. This indicates a flaw in the PnL logging logic: round-trip statistics such as win_rate, profit_factor and the average win and loss are all reported as 0.0 because no round trips are being recorded. This must be corrected to enable trade-level analysis.
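
The stop-loss recommendation could take a form like the check below. It is a sketch of one possible rule rather than existing project code: the function name `should_stop_out`, the returned reason strings and the idea of counting days in position are assumptions, while the 3.5 and 30-day values are the examples given above.

```python
def should_stop_out(z_score, days_in_position, z_stop=3.5, max_holding_days=30):
    """Return a reason string if the position should be force-closed, else None."""
    if abs(z_score) > z_stop:
        return "z_stop"        # spread moved further away instead of reverting
    if days_in_position >= max_holding_days:
        return "time_stop"     # reversion is taking too long
    return None

# Example checks on hypothetical open positions
print(should_stop_out(z_score=3.8, days_in_position=4))
print(should_stop_out(z_score=1.2, days_in_position=30))
print(should_stop_out(z_score=1.2, days_in_position=4))
```

Such a check would run before the normal exit rule on each time step, so that a position is closed either because the spread reverted or because one of the stops fired.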