Inter-judge variance on borderline outputs #3

## Observation

The self-judge can return the same letter grade across multiple runs on identical input but flag *different* specific issues each time. Empirically the top issues recur (~66% overlap on highest-priority items), but lower-priority picks are noisy.
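One way to put a number on this overlap is mean pairwise Jaccard similarity between the issue sets of repeated runs. The sketch below is illustrative only: the issue keys are made up, and Jaccard is an assumed metric, not necessarily how the ~66% figure above was measured.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two issue sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(runs: list[set[str]]) -> float:
    """Average Jaccard overlap over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run trivially agrees with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical issue keys from three judge runs on identical input.
runs = [
    {"missing-citation", "unclear-thesis", "passive-voice"},
    {"missing-citation", "unclear-thesis", "long-sentences"},
    {"missing-citation", "unclear-thesis", "weak-conclusion"},
]
overlap = mean_pairwise_overlap(runs)
```

Computing the same metric separately for high-priority and low-priority issues would show whether the noise really is concentrated in the lower-priority picks.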

## Why it matters

When a user runs strict mode and the loop iterates against the bar, the moving target makes "keep iterating until A" expensive: every attempt may surface a new issue that the previous attempt did not. The letter grade alone is therefore not always a reliable signal of lift.

## Possible directions

- **Median-of-N grading.** Run the judge 2-3 times on the baseline, then take the median grade and the union of high-priority issues. This stabilizes the baseline at the cost of N× tokens on the first call only.
- **Issue-priority threshold.** Only iterate on items the judge marks as high-priority; ignore minor flags. Reduces target drift.
- **Calibration mode.** Optional config flag that runs the judge twice on first call and reports variance to the user before iterating.
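The first two directions could be combined roughly as follows. This is a minimal sketch under stated assumptions: `Verdict`, `Issue`, `GRADE_ORDER`, the `"high"`/`"low"` priority labels, and the `judge` callable are all hypothetical stand-ins, not the project's actual types or API.

```python
from dataclasses import dataclass
from typing import Callable

GRADE_ORDER = ["F", "D", "C", "B", "A"]  # assumed grade scale, worst to best


@dataclass(frozen=True)
class Issue:
    key: str
    priority: str  # assumed labels: "high" or "low"


@dataclass
class Verdict:
    grade: str
    issues: list[Issue]


def stable_baseline(judge: Callable[[str], Verdict], text: str, n: int = 3) -> Verdict:
    """Run the judge n times; return the median grade and the union of high-priority issues."""
    verdicts = [judge(text) for _ in range(n)]
    ranks = sorted(GRADE_ORDER.index(v.grade) for v in verdicts)
    median_rank = ranks[(len(ranks) - 1) // 2]  # lower median when n is even
    seen: dict[str, Issue] = {}
    for v in verdicts:
        for issue in v.issues:
            if issue.priority == "high":  # priority threshold: drop minor flags
                seen.setdefault(issue.key, issue)
    return Verdict(GRADE_ORDER[median_rank], list(seen.values()))


# Usage with a fake judge that replays canned verdicts.
canned = iter([
    Verdict("B", [Issue("unclear-thesis", "high"), Issue("passive-voice", "low")]),
    Verdict("B", [Issue("unclear-thesis", "high"), Issue("missing-citation", "high")]),
    Verdict("C", [Issue("missing-citation", "high"), Issue("long-sentences", "low")]),
])
baseline = stable_baseline(lambda _text: next(canned), "draft", n=3)
```

Taking the lower median for even N is a deliberately conservative choice, so a 2-run baseline never reports the better of two disagreeing grades. A calibration mode could reuse the same `verdicts` list to report how much the runs disagreed before iterating.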

## Acceptance criteria

A reproducible fix or config option that reduces "different issues each run" behavior on identical input. Demonstrate via a synthetic test case in `tests/`.
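A synthetic test case might take roughly this shape. Everything here is an assumption made for illustration: the fake judge, its dict-based verdict format, and the issue keys are invented, and the fake keeps high-priority issues perfectly stable, which is more optimistic than the real judge.

```python
HIGH = ["unclear-thesis", "missing-citation"]
LOW = ["passive-voice", "long-sentences", "weak-conclusion", "typo"]


def fake_judge(text: str, run_index: int) -> dict:
    """Deterministic stand-in for a noisy judge: same grade, rotating low-priority flag."""
    issues = [{"key": k, "priority": "high"} for k in HIGH]
    issues.append({"key": LOW[run_index % len(LOW)], "priority": "low"})
    return {"grade": "B", "issues": issues}


def high_priority_keys(verdict: dict) -> frozenset:
    """Candidate fix under test: keep only issues marked high-priority."""
    return frozenset(i["key"] for i in verdict["issues"] if i["priority"] == "high")


def test_high_priority_issues_are_stable():
    runs = [fake_judge("same input", i) for i in range(8)]
    raw = {frozenset(i["key"] for i in v["issues"]) for v in runs}
    filtered = {high_priority_keys(v) for v in runs}
    assert len(raw) > 1        # unfiltered issue sets differ between runs
    assert len(filtered) == 1  # filtered issue sets are identical
```

The first assertion guards against a vacuous pass: the test only means something if the unfiltered output actually varies.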
