Inter-judge variance on borderline outputs #3

@HubWizard

Observation

The self-judge can return the same letter grade across multiple runs on identical input but flag different specific issues each time. Empirically the top issues recur (~66% overlap on highest-priority items), but lower-priority picks are noisy.
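One way to put a number on this kind of overlap is the Jaccard similarity between the issue sets from each pair of runs, computed separately per priority level. This is only a sketch: the `Issue` shape, the `"high"`/`"low"` priority labels, and the sample data are hypothetical, and the ~66% figure above may have been measured differently.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Issue:
    """Hypothetical shape of one judge finding; not the project's real type."""
    key: str       # normalized identifier, e.g. "missing-example"
    priority: str  # assumed to be "high" or "low"


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size over union size; 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(runs: list[list[Issue]], priority: str) -> float:
    """Average Jaccard overlap of issue keys at one priority across all pairs of runs."""
    keys = [{i.key for i in run if i.priority == priority} for run in runs]
    pairs = list(combinations(keys, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure overlap")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up data: two runs on identical input.
run_1 = [Issue("a", "high"), Issue("b", "high"), Issue("c", "high"), Issue("x", "low")]
run_2 = [Issue("a", "high"), Issue("b", "high"), Issue("y", "low")]

print(round(mean_pairwise_overlap([run_1, run_2], "high"), 3))  # 0.667
print(mean_pairwise_overlap([run_1, run_2], "low"))             # 0.0
```

Matching on a normalized `key` is itself an assumption. If the judge emits free-text issue descriptions, some deduplication step would be needed before the sets are comparable.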

Why it matters

When a user runs strict mode and the loop iterates against the bar, the moving target makes "keep iterating until A" expensive: every attempt may surface a new issue that the previous attempt did not. The letter grade alone is therefore not always a reliable lift signal.

Possible directions

  • Median-of-N grading. Run the judge N times (2-3) on the baseline, then take the median grade and the union of high-priority issues. An odd N avoids a tied median. This stabilizes the target at the cost of N× tokens on the first call only.
  • Issue-priority threshold. Only iterate on items the judge marks as high-priority; ignore minor flags. Reduces target drift.
  • Calibration mode. Optional config flag that runs the judge twice on first call and reports variance to the user before iterating.
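The first two directions could be combined roughly as follows. This is a minimal sketch under stated assumptions: the `Verdict` and `Issue` types, the `F` to `A` grade scale, and the `judge` callable are all hypothetical stand-ins, not the project's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

GRADES = ["F", "D", "C", "B", "A"]  # assumed grade scale, worst to best


@dataclass(frozen=True)
class Issue:
    key: str       # normalized identifier for the issue (hypothetical)
    priority: str  # assumed to be "high" or "low"


@dataclass(frozen=True)
class Verdict:
    grade: str
    issues: tuple[Issue, ...]


def stable_baseline(judge: Callable[[str], Verdict], text: str, n: int = 3) -> Verdict:
    """Run the judge n times; return the median grade and the union of high-priority issues."""
    if n < 1:
        raise ValueError("n must be at least 1")
    verdicts = [judge(text) for _ in range(n)]

    # Lower median of the grade ranks, so an even n resolves ties toward the stricter grade.
    ranks = sorted(GRADES.index(v.grade) for v in verdicts)
    median_grade = GRADES[ranks[(n - 1) // 2]]

    # Union of high-priority issues, deduplicated by key in first-seen order.
    # Low-priority flags are dropped, which is the issue-priority threshold.
    target: dict[str, Issue] = {}
    for verdict in verdicts:
        for issue in verdict.issues:
            if issue.priority == "high":
                target.setdefault(issue.key, issue)

    return Verdict(median_grade, tuple(target.values()))
```

The returned `Verdict` would then serve as the fixed target for the iteration loop, so later judge calls are compared against one frozen issue list instead of a fresh one each time. Whether the union or the intersection of high-priority issues is the better target is a design choice left open here.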

Acceptance criteria

A reproducible fix or config option that reduces "different issues each run" behavior on identical input. Demonstrate via a synthetic test case in tests/.
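A synthetic test along these lines might look like the sketch below. It uses a scripted fake judge whose "noise" is deterministic, so the test is reproducible without a random seed. Every name here (`make_scripted_judge`, `high_priority_target`, the issue keys, the dict-shaped verdict) is made up for illustration and would need to be mapped onto the real judge interface.

```python
from itertools import count

HIGH = ("unsupported-claim", "missing-example", "wrong-units")  # made-up issue keys


def make_scripted_judge():
    """Fake judge with deterministic noise: each call omits one high-priority
    issue in rotation and adds a fresh low-priority nit. The grade never changes."""
    calls = count()

    def judge(_text: str) -> dict:
        k = next(calls)
        high = [key for i, key in enumerate(HIGH) if i != k % len(HIGH)]
        return {"grade": "B", "high": high, "low": [f"nit-{k}"]}

    return judge


def high_priority_target(judge, text: str, n: int) -> frozenset:
    """Union of high-priority issue keys over n judge runs."""
    target = set()
    for _ in range(n):
        target.update(judge(text)["high"])
    return frozenset(target)


def test_single_runs_disagree():
    # Reproduces the reported behavior: same grade, different issues.
    judge = make_scripted_judge()
    first, second = judge("same input"), judge("same input")
    assert first["grade"] == second["grade"]
    assert set(first["high"]) != set(second["high"])


def test_union_of_three_is_stable():
    # Two independent batches of three runs converge on the same target.
    judge = make_scripted_judge()
    batch_a = high_priority_target(judge, "same input", n=3)
    batch_b = high_priority_target(judge, "same input", n=3)
    assert batch_a == batch_b == frozenset(HIGH)
```

The rotation schedule guarantees that any three consecutive runs cover all of `HIGH`, which a real judge would not. A test against recorded real judge outputs would be a stronger demonstration.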

Metadata

    Labels

    calibration (Grading consistency, judge variance, model bias), enhancement (New feature or request)