feat: parallel multi-grader fan-out with consensus strategies #906
Open
Description
Problem
Currently each target has a single grader_target for LLM-as-judge scoring. A single grader can have blind spots or biases that affect score reliability.
Proposal
Add grader_targets (plural) field for parallel grader fan-out:
    - name: copilot-cli
      provider: copilot-cli
      grader_targets:
        - grader-openrouter
        - grader-gemini
      grader_strategy: consensus  # or: majority, any, all

Each grader scores independently in parallel, then the strategy aggregates the results:
| Strategy | Behavior |
|---|---|
| consensus | All graders must agree (strictest) |
| majority | >50% of graders pass |
| any | At least one grader passes (most lenient) |
| all | Return all scores without aggregation (for analysis/comparison) |
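The fan-out and the four strategies could be sketched roughly as follows. This is only an illustration, not the proposed implementation: `fan_out`, `aggregate`, `make_stub_grader`, and `PASS_THRESHOLD` are hypothetical names, the 0.5 pass threshold is an assumption (the issue does not define what counts as a pass), and "consensus" is interpreted here as "every grader passes".

```python
import asyncio
from typing import Awaitable, Callable, Dict, Union

# Assumption: a score at or above this value counts as a pass.
PASS_THRESHOLD = 0.5

# A grader is modeled as an async callable returning a score in [0, 1].
Grader = Callable[[str], Awaitable[float]]


async def fan_out(graders: Dict[str, Grader], answer: str) -> Dict[str, float]:
    """Run every grader concurrently and collect scores keyed by grader name."""
    names = list(graders)
    scores = await asyncio.gather(*(graders[name](answer) for name in names))
    return dict(zip(names, scores))


def aggregate(
    scores: Dict[str, float], strategy: str = "majority"
) -> Union[bool, Dict[str, float]]:
    """Combine per-grader scores according to the strategy table above."""
    if not scores:
        raise ValueError("at least one grader score is required")
    passes = [score >= PASS_THRESHOLD for score in scores.values()]
    if strategy == "consensus":
        return all(passes)                     # every grader passes
    if strategy == "majority":
        return sum(passes) > len(passes) / 2   # strictly more than 50%
    if strategy == "any":
        return any(passes)                     # at least one grader passes
    if strategy == "all":
        return dict(scores)                    # no aggregation, raw scores
    raise ValueError(f"unknown grader_strategy: {strategy!r}")


def make_stub_grader(score: float) -> Grader:
    """Hypothetical stand-in for a real LLM grader call."""
    async def grade(answer: str) -> float:
        await asyncio.sleep(0)  # placeholder for network latency
        return score
    return grade


graders = {
    "grader-openrouter": make_stub_grader(0.9),
    "grader-gemini": make_stub_grader(0.4),
}
scores = asyncio.run(fan_out(graders, "candidate answer"))
```

With two graders that disagree, `majority` fails because one of two is not strictly more than 50%, so an even number of graders makes `majority` behave like `consensus` on a split vote.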
Result JSONL
When multiple graders are used, the result should include per-grader scores:
    {
      "scores": [
        { "type": "llm-grader", "grader": "grader-openrouter", "score": 0.9 },
        { "type": "llm-grader", "grader": "grader-gemini", "score": 0.8 }
      ],
      "grader_strategy": "majority",
      "grader_agreement": 1.0
    }

The grader_agreement field (0.0–1.0) measures inter-grader reliability.
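The issue does not specify how grader_agreement is computed. One possible definition, sketched below under stated assumptions, is the fraction of grader pairs that reach the same pass/fail verdict. The function name, the 0.5 threshold, and the choice of pairwise agreement are all assumptions made for illustration; other measures (for example, score variance or Cohen's kappa) would be equally plausible.

```python
from itertools import combinations
from typing import Sequence


def grader_agreement(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of grader pairs with the same pass/fail verdict (0.0 to 1.0).

    Assumption: a score >= threshold is a pass. With fewer than two
    graders there are no pairs to compare, so agreement is taken as 1.0.
    """
    verdicts = [score >= threshold for score in scores]
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Under this definition the example above (scores 0.9 and 0.8, both passing) yields an agreement of 1.0, matching the JSON shown, while two graders that split on pass/fail would yield 0.0.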
Use cases
- Reduce grader bias: one LLM's blind spots covered by another
- Cross-provider validation: ensure scores aren't provider-dependent
- Confidence scoring: high agreement = high confidence in score
- A/B testing graders: compare grader quality before switching
Prior art
- Google ADK: multiple evaluator judges with voting
- LMSYS Chatbot Arena: multi-judge ranking
- No eval framework exposes this declaratively in YAML config yet
Backward compatibility
- grader_target (singular) continues to work unchanged
- grader_targets is optional; default strategy is majority
- When only one grader is specified in grader_targets, behaves identically to grader_target
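The compatibility rules above could be handled by normalizing both spellings into one list when the target config is loaded. The sketch below is a minimal illustration: `resolve_graders` and `DEFAULT_STRATEGY` are hypothetical names, and letting grader_targets take precedence when both fields are present is an assumption, since the issue does not say what should happen in that case.

```python
from typing import Any, Dict, List, Tuple

# Default stated in the issue when grader_strategy is omitted.
DEFAULT_STRATEGY = "majority"


def resolve_graders(target: Dict[str, Any]) -> Tuple[List[str], str]:
    """Return (grader names, strategy) for a parsed target config entry."""
    if "grader_targets" in target:
        # Assumption: the plural field wins if both fields are present.
        graders = list(target["grader_targets"])
    elif "grader_target" in target:
        # The legacy singular field becomes a one-element list.
        graders = [target["grader_target"]]
    else:
        graders = []
    strategy = target.get("grader_strategy", DEFAULT_STRATEGY)
    return graders, strategy
```

Because a single grader passes or fails identically under consensus, majority, and any, a one-element list produces the same verdict as the legacy grader_target path regardless of the default strategy.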