Execution assumption auditing: systematic sensitivity testing #6
Description
Problem
Experiments embed execution assumptions (fill probability, cost models, position sizing, slippage) that are rarely challenged by the framework. These assumptions are often the largest source of gap between backtest and live performance, yet the lab treats them as fixed constants.
In a real deployment, a single uncalibrated assumption (fill probability set to 60% vs a realistic 40%) likely overstated the production champion's Sharpe by 20-30%. This was identified in a manual audit but would never have been caught by the framework.
Proposal
Add an execution_audit diagnostic that runs automatically on a schedule.
Configuration in branches.yaml
execution_audit:
  trigger: "every_10_cycles"
  also_on: "human_checkpoint"
  checks:
    - name: "fill_probability_sensitivity"
      param_path: "execution.fill_probability"
      sweep_values: [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
      apply_to: "production_champion"
    - name: "cost_model_sensitivity"
      param_path: "execution.cost_per_trade"
      sweep_multipliers: [0.5, 1.0, 1.5, 2.0, 3.0]
      apply_to: "production_champion"
    - name: "position_size_impact"
      param_path: "execution.position_size"
      sweep_values: [50, 100, 500, 1000, 5000]
      apply_to: "production_champion"
Output
A sensitivity table showing how the champion's primary metric degrades:
Fill Probability Sensitivity (champion: cap_v3)
fill_prob | Sharpe | Return% | PnL
0.30 | 0.82 | 1.4% | $1,400
0.40 | 1.21 | 3.1% | $3,100
0.50 | 1.67 | 4.8% | $4,800
0.60 | 2.35 | 6.4% | $6,400 ← current assumption
0.70 | 2.89 | 7.9% | $7,900
Zero-crossing: Sharpe crosses 0 at fill_prob ≈ 0.15
Robustness: Champion is profitable at all tested fill rates above 0.15
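A minimal sketch of how the sweep and the zero-crossing estimate could be computed. The `run_backtest` callable and the linear interpolation between adjacent sweep points are illustrative assumptions, not the framework's actual API:

```python
from typing import Callable, Optional

def sweep_sensitivity(
    run_backtest: Callable[[float], float],  # hypothetical: metric (e.g. Sharpe) at a given fill_prob
    sweep_values: list,
) -> dict:
    """Evaluate the champion's primary metric at each assumed parameter value."""
    return {v: run_backtest(v) for v in sweep_values}

def zero_crossing(results: dict) -> Optional[float]:
    """Linearly interpolate where the metric crosses zero, if it does."""
    pts = sorted(results.items())
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if min(y0, y1) <= 0 <= max(y0, y1):
            if y1 == y0:
                return x0  # flat segment sitting on zero
            return x0 + (0 - y0) * (x1 - x0) / (y1 - y0)
    return None  # metric never changes sign over the tested range
```

With a metric that is roughly linear in fill probability, the interpolated crossing lands between the two sweep points that bracket zero, which is how the "Sharpe crosses 0 at fill_prob ≈ 0.15" line above would be produced.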
Flagging
If the champion's metric crosses zero at a REALISTIC assumption value (defined per domain), flag in the handoff:
"WARNING: Production champion's Sharpe crosses zero at fill_probability=0.15. If actual fills are below 15%, the strategy loses money. Current assumption (60%) has no empirical calibration."
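One way the flag could be generated. The `realistic_floor` threshold (the per-domain "realistic assumption value") and the message template are assumptions for illustration:

```python
from typing import Optional

def flag_fragile_champion(
    param: str,
    crossing: float,        # zero-crossing found by the sweep
    assumed: float,         # value currently hard-coded in the experiment
    realistic_floor: float, # worst value considered plausible for this domain
) -> Optional[str]:
    """Emit a handoff warning when the metric goes negative at a realistic value."""
    if crossing < realistic_floor:
        return None  # metric only crosses zero at implausibly bad values
    return (
        f"WARNING: Production champion's metric crosses zero at "
        f"{param}={crossing:.2f}. Current assumption ({assumed:.0%}) "
        f"has no empirical calibration."
    )
```

Under this rule, the example above (crossing at 0.15 with a realistic floor below that) would not flag, while a crossing inside the plausible range would.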
Integration with judge
Optionally adjust the composite score by a "robustness discount" based on sensitivity:
robustness_discount = (metric_at_pessimistic / metric_at_assumed)
adjusted_score = composite_score * robustness_discount
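As a sketch, the discount could be clamped to [0, 1] so a champion that loses money under pessimistic assumptions scores zero; the clamping and the zero-metric guard are assumptions, since the issue only defines the raw ratio:

```python
def robustness_discount(metric_at_pessimistic: float, metric_at_assumed: float) -> float:
    """Ratio of pessimistic to assumed metric, clamped to [0, 1]."""
    if metric_at_assumed <= 0:
        return 0.0  # unprofitable even under its own assumptions
    return max(0.0, min(1.0, metric_at_pessimistic / metric_at_assumed))

def adjusted_score(composite_score: float, discount: float) -> float:
    """Scale the judge's composite score by the robustness discount."""
    return composite_score * discount
```

Using the table above with fill_prob=0.40 as the pessimistic case: discount = 1.21 / 2.35 ≈ 0.51, roughly halving the champion's composite score.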
Why this matters
The gap between backtest and live performance is almost always driven by execution assumptions, not signal quality. A framework that treats these assumptions as immutable constants produces overconfident performance estimates. Systematic auditing makes the numbers honest.
Relationship to existing features
- Natural fit as a diagnostic experiment type
- Runs on its own every-10-cycles trigger, and additionally at human checkpoints (every 15 cycles) via also_on
- Output feeds into the handoff and checkpoint reports
- Could trigger a frame_challenge if sensitivity reveals the champion is fragile