Execution assumption auditing: systematic sensitivity testing #6
Description
Problem
Experiments embed execution assumptions (fill probability, cost models, position sizing, slippage) that are rarely challenged by the framework. These assumptions are often the largest source of gap between backtest and live performance, yet the lab treats them as fixed constants.
In a real deployment, a single uncalibrated assumption (fill probability set to 60% vs a realistic 40%) likely overstated the production champion's Sharpe by 20-30%. This was identified in a manual audit but would never have been caught by the framework.
Proposal
Add an execution_audit diagnostic that runs automatically on a schedule.
Configuration in branches.yaml
execution_audit:
  trigger: "every_10_cycles"
  also_on: "human_checkpoint"
  checks:
    - name: "fill_probability_sensitivity"
      param_path: "execution.fill_probability"
      sweep_values: [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
      apply_to: "production_champion"
    - name: "cost_model_sensitivity"
      param_path: "execution.cost_per_trade"
      sweep_multipliers: [0.5, 1.0, 1.5, 2.0, 3.0]
      apply_to: "production_champion"
    - name: "position_size_impact"
      param_path: "execution.position_size"
      sweep_values: [50, 100, 500, 1000, 5000]
      apply_to: "production_champion"
Output
A sensitivity table showing how the champion's primary metric degrades:
Fill Probability Sensitivity (champion: cap_v3)
fill_prob | Sharpe | Return% | PnL
0.30 | 0.82 | 1.4% | $1,400
0.40 | 1.21 | 3.1% | $3,100
0.50 | 1.67 | 4.8% | $4,800
0.60 | 2.35 | 6.4% | $6,400 ← current assumption
0.70 | 2.89 | 7.9% | $7,900
Zero-crossing: Sharpe crosses 0 at fill_prob ≈ 0.15
Robustness: Champion is profitable at all tested fill rates above 0.15
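A minimal sketch of how the sweep and the zero-crossing estimate could be computed. The `run_backtest` callable and the linear interpolation between adjacent sweep points are illustrative assumptions, not the framework's actual API:

```python
from typing import Callable, Optional

def sweep_sensitivity(
    run_backtest: Callable[[float], float],  # hypothetical: metric (e.g. Sharpe) at a given fill_prob
    sweep_values: list,
) -> dict:
    """Evaluate the champion's primary metric at each assumed parameter value."""
    return {v: run_backtest(v) for v in sweep_values}

def zero_crossing(results: dict) -> Optional[float]:
    """Linearly interpolate where the metric crosses zero, if it does."""
    pts = sorted(results.items())
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if min(y0, y1) <= 0 <= max(y0, y1):
            if y1 == y0:
                return x0  # flat segment sitting on zero
            return x0 + (0 - y0) * (x1 - x0) / (y1 - y0)
    return None  # metric never changes sign over the tested range
```

With a metric that is roughly linear in fill probability, the interpolated crossing lands between the two sweep points that bracket zero, which is how the "Sharpe crosses 0 at fill_prob ≈ 0.15" line above would be produced.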
Flagging
If the champion's metric crosses zero at a REALISTIC assumption value (defined per domain), flag in the handoff:
"WARNING: Production champion's Sharpe crosses zero at fill_probability=0.15. If actual fills are below 15%, the strategy loses money. Current assumption (60%) has no empirical calibration."
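One way the flag could be generated. The `realistic_floor` threshold (the per-domain "realistic assumption value") and the message template are assumptions for illustration:

```python
from typing import Optional

def flag_fragile_champion(
    param: str,
    crossing: float,        # zero-crossing found by the sweep
    assumed: float,         # value currently hard-coded in the experiment
    realistic_floor: float, # worst value considered plausible for this domain
) -> Optional[str]:
    """Emit a handoff warning when the metric goes negative at a realistic value."""
    if crossing < realistic_floor:
        return None  # metric only crosses zero at implausibly bad values
    return (
        f"WARNING: Production champion's metric crosses zero at "
        f"{param}={crossing:.2f}. Current assumption ({assumed:.0%}) "
        f"has no empirical calibration."
    )
```

Under this rule, the example above (crossing at 0.15 with a realistic floor below that) would not flag, while a crossing inside the plausible range would.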
Integration with judge
Optionally adjust the composite score by a "robustness discount" based on sensitivity:
robustness_discount = (metric_at_pessimistic / metric_at_assumed)
adjusted_score = composite_score * robustness_discount
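As a sketch, the discount could be clamped to [0, 1] so a champion that loses money under pessimistic assumptions scores zero; the clamping and the zero-metric guard are assumptions, since the issue only defines the raw ratio:

```python
def robustness_discount(metric_at_pessimistic: float, metric_at_assumed: float) -> float:
    """Ratio of pessimistic to assumed metric, clamped to [0, 1]."""
    if metric_at_assumed <= 0:
        return 0.0  # unprofitable even under its own assumptions
    return max(0.0, min(1.0, metric_at_pessimistic / metric_at_assumed))

def adjusted_score(composite_score: float, discount: float) -> float:
    """Scale the judge's composite score by the robustness discount."""
    return composite_score * discount
```

Using the table above with fill_prob=0.40 as the pessimistic case: discount = 1.21 / 2.35 ≈ 0.51, roughly halving the champion's composite score.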
Why this matters
The gap between backtest and live performance is almost always driven by execution assumptions, not signal quality. A framework that treats these assumptions as immutable constants produces overconfident performance estimates. Systematic auditing makes the numbers honest.
Relationship to existing features
- Natural fit as a diagnostic experiment type
- Runs on its own every-10-cycles trigger, and additionally at human checkpoints (every 15 cycles) via also_on
- Output feeds into the handoff and checkpoint reports
- Could trigger a frame_challenge if sensitivity reveals the champion is fragile