feat: configurable threshold + naming taxonomy cleanup #925
Problem
AgentV has threshold/scoring naming inconsistencies across surfaces, and PASS_THRESHOLD = 0.8 is hardcoded in places that should respect the configurable execution.threshold.
Naming Taxonomy Audit
Concept A — Pass/fail boundary ("what score counts as passing?")
| Surface | Current name | Consistent? |
|---|---|---|
| CLI | `--threshold` | Yes |
| YAML | `execution.threshold` | Yes |
| Orchestrator | `threshold` param | Yes |
| Studio config | `studio.pass_threshold` | No: should be `threshold` |
| Core constant | `PASS_THRESHOLD` | No: should be `DEFAULT_THRESHOLD` |
| Composite aggregator | `{ type: 'threshold', threshold: 0.7 }` | Yes (different concept: aggregation strategy) |
Industry comparison: DeepEval, Promptfoo, vitest-evals all use threshold. Nobody uses pass_threshold.
Concept B — Required gate ("if this evaluator fails, zero the aggregate")
| Surface | Current name | Consistent? |
|---|---|---|
| Assertion | `required: true` | Yes |
| Rubric | `required: true` | Yes |
This is fine — boolean required is clear.
Concept C — Minimum score ("what's the minimum acceptable score?")
| Surface | Current name | Scale | Consistent? |
|---|---|---|---|
| Assertion | `required: 0.7` | 0-1 | No: overloads boolean `required` with completely different semantics |
| Rubric | `required_min_score: 7` | 0-10 | No: different name, different scale from everything else |
Both should be min_score on a 0-1 scale.
Scoring scale principle: Everything the user configures as a threshold or boundary is 0-1. The only place 0-10 appears in YAML is score_ranges on rubric criteria — those aren't thresholds, they're label definitions describing what each integer band means to the LLM grader.
Agent Skills format — has no threshold concept at all. Assertions are strings promoted to llm-grader. pass_rate is a computed output metric, not a configurable boundary. No naming conflict.
Hardcoded Threshold Bugs
- `scoreToVerdict(score)` ignores the configurable threshold and always uses the hardcoded 0.8
- `evaluate()` API: `computeSummary()` uses `PASS_THRESHOLD` directly
- `required: true` fallback: `const minScore = typeof entry.required === 'number' ? entry.required : PASS_THRESHOLD` should use the test threshold instead
- Per-test `execution.threshold`: the schema allows it, but yaml-parser only reads `skip_defaults` from per-test execution and ignores `threshold`
Proposed Changes
1. Rename PASS_THRESHOLD → DEFAULT_THRESHOLD
// packages/core/src/evaluation/evaluators/scoring.ts
export const DEFAULT_THRESHOLD = 0.8;
/** @deprecated Use DEFAULT_THRESHOLD */
export const PASS_THRESHOLD = DEFAULT_THRESHOLD;
2. Make scoreToVerdict() threshold-aware
export function scoreToVerdict(score: number, threshold = DEFAULT_THRESHOLD): EvaluationVerdict {
  return score >= threshold ? 'pass' : 'fail';
}
Thread caseThreshold (already available in orchestrator) to all call sites.
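As a rough illustration of what threading the threshold buys, the sketch below calls the proposed `scoreToVerdict()` with and without a per-case threshold. The `EvaluationVerdict` type here is a simplified stand-in (the real type may have more members), and the `caseThreshold` value is made up for the example.

```typescript
// Simplified stand-in: the real EvaluationVerdict type may have more members.
type EvaluationVerdict = 'pass' | 'fail';

const DEFAULT_THRESHOLD = 0.8;

function scoreToVerdict(
  score: number,
  threshold: number = DEFAULT_THRESHOLD,
): EvaluationVerdict {
  return score >= threshold ? 'pass' : 'fail';
}

// Hypothetical call site: the orchestrator already knows the case threshold.
const caseThreshold = 0.7;

// Without threading, 0.75 is judged against the default 0.8.
const withDefault = scoreToVerdict(0.75);

// With threading, the same score is judged against the case threshold.
const withCaseThreshold = scoreToVerdict(0.75, caseThreshold);
```

The same score yields different verdicts depending on whether the call site forwards the configured threshold, which is the behaviour the hardcoded constant currently hides.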
3. Wire per-test execution.threshold in yaml-parser
The schema already allows execution.threshold on test cases, but the parser ignores it. Wire it through so each test can override the suite threshold.
Resolution order: --threshold (CLI) > test execution.threshold > suite execution.threshold > DEFAULT_THRESHOLD (0.8)
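The resolution order above could be expressed as a single helper. This is a minimal sketch, not the actual orchestrator code: `ThresholdSources` and `resolveThreshold` are hypothetical names introduced for illustration.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical shape: each field is undefined when that level sets no threshold.
interface ThresholdSources {
  cli?: number;   // --threshold
  test?: number;  // per-test execution.threshold
  suite?: number; // suite-level execution.threshold
}

function resolveThreshold(sources: ThresholdSources): number {
  // `??` only skips null/undefined, so an explicit threshold of 0 is honoured.
  return sources.cli ?? sources.test ?? sources.suite ?? DEFAULT_THRESHOLD;
}
```

Using `??` rather than `||` matters here: with `||`, a deliberately configured threshold of `0` would silently fall through to the next level.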
4. Fix evaluate() API
export interface EvalConfig {
  readonly threshold?: number; // NEW
}
function computeSummary(results, durationMs, threshold = DEFAULT_THRESHOLD): EvalSummary { ... }
5. Split required: number → required: boolean + min_score: number on assertions
# Before (confusing: is required boolean or number?)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: 0.9
# After (clear)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: true
    min_score: 0.9
- `required: boolean` sets gate semantics (if the evaluator fails, aggregate → 0)
- `min_score: number` (0-1) sets the minimum acceptable score for this evaluator
- `min_score` without `required: true` still sets the score floor but doesn't gate the aggregate
- Deprecation: `required: number` continues to work, parsed as `required: true` + `min_score: <value>`
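The deprecation path for assertions might look like the following in the evaluator parser. This is a sketch under assumptions: `RawAssertion`, `NormalizedAssertion`, and `normalizeAssertion` are hypothetical names, and the rule that an explicit `min_score` wins over a numeric `required` is an assumption the proposal does not spell out.

```typescript
// Hypothetical raw shape as it might arrive from YAML.
interface RawAssertion {
  required?: boolean | number;
  min_score?: number;
}

// Hypothetical normalized shape used after parsing.
interface NormalizedAssertion {
  required: boolean;
  minScore: number | undefined;
}

function normalizeAssertion(raw: RawAssertion): NormalizedAssertion {
  const { required, min_score } = raw;
  if (typeof required === 'number') {
    // Deprecated form: `required: 0.9` means `required: true` + `min_score: 0.9`.
    // Assumption: an explicit min_score takes precedence if both are present.
    return { required: true, minScore: min_score ?? required };
  }
  return { required: required === true, minScore: min_score };
}
```

Normalizing once at parse time means the orchestrator only ever sees a boolean `required` and an optional 0-1 `minScore`.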
6. Rename required_min_score → min_score on rubrics, change to 0-1 scale
# Before (0-10 scale, verbose name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    required_min_score: 7
    score_ranges:
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"
# After (0-1 scale, consistent name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    min_score: 0.7
    score_ranges: # stays 0-10; these describe LLM integer output bands
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"
- `min_score` is 0-1 (internally multiplied by 10 for comparison against the LLM's 0-10 output)
- `score_ranges` stay 0-10: they're label definitions for the LLM's integer scoring, not user thresholds
- `required_min_score` continues to work as a deprecated alias (value interpreted as 0-10, converted to 0-1)
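The two scale conversions for rubrics can be sketched as follows. `RawRubric`, `normalizeRubricMinScore`, and `meetsRubricMinScore` are hypothetical names; the precedence of `min_score` over `required_min_score` and the small tolerance in the comparison are assumptions added for the example, not part of the proposal.

```typescript
// Hypothetical raw rubric fields as they might arrive from YAML.
interface RawRubric {
  min_score?: number;          // new field, 0-1 scale
  required_min_score?: number; // deprecated alias, 0-10 scale
}

// Returns the minimum score on the 0-1 scale, or undefined if none is set.
function normalizeRubricMinScore(raw: RawRubric): number | undefined {
  // Assumption: the new field wins if both are present.
  if (raw.min_score !== undefined) return raw.min_score;
  if (raw.required_min_score !== undefined) return raw.required_min_score / 10;
  return undefined;
}

// Compares the LLM's 0-10 integer score against a 0-1 min_score.
function meetsRubricMinScore(llmScore: number, minScore: number): boolean {
  // Scale min_score up to 0-10; the tolerance guards against
  // floating-point noise in the multiplication (an added precaution).
  const EPSILON = 1e-9;
  return llmScore >= minScore * 10 - EPSILON;
}
```

With this split, `required_min_score: 7` and `min_score: 0.7` end up as the same internal value, and an LLM score of 7 passes either way.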
7. Rename studio pass_threshold → threshold
# .agentv/config.yaml
# Before
studio:
  pass_threshold: 0.8
# After
studio:
  threshold: 0.8
- Read `pass_threshold` as fallback for backward compat
- Write `threshold` on save
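A sketch of the read/write behaviour for the studio config follows. `RawStudioConfig`, `readStudioThreshold`, and `writeStudioConfig` are hypothetical names, and falling back to `DEFAULT_THRESHOLD` when neither key is present is an assumption.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical shape of the `studio` section in .agentv/config.yaml.
interface RawStudioConfig {
  threshold?: number;
  pass_threshold?: number; // deprecated, read-only fallback
}

function readStudioThreshold(studio: RawStudioConfig): number {
  // Prefer the new key, then the deprecated one.
  // Assumption: use the default when neither key is set.
  return studio.threshold ?? studio.pass_threshold ?? DEFAULT_THRESHOLD;
}

function writeStudioConfig(threshold: number): RawStudioConfig {
  // Only the new key is written, so saving migrates old configs.
  return { threshold };
}
```

Because writes only emit `threshold`, an existing config that still uses `pass_threshold` is migrated the first time it is saved.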
Scoring Scale Summary
Everything the user configures is 0-1. The only 0-10 in YAML is score_ranges (LLM output band labels, not a threshold).
| Level | Field | Scale |
|---|---|---|
| Suite/test | `execution.threshold` | 0-1 |
| CLI | `--threshold` | 0-1 |
| All graders | `min_score` | 0-1 |
| Rubric criteria | `min_score` | 0-1 (internally × 10 for LLM comparison) |
| Rubric criteria | `score_ranges` | 0-10 (LLM integer band labels, NOT a threshold) |
Files to modify
| File | Change |
|---|---|
| `packages/core/src/evaluation/evaluators/scoring.ts` | Rename constant, add threshold param to `scoreToVerdict()` |
| `packages/core/src/evaluation/orchestrator.ts` | Thread threshold to `scoreToVerdict()`, fix `required: true` fallback |
| `packages/core/src/evaluation/evaluate.ts` | Accept `threshold` in `EvalConfig`, pass to `computeSummary()` |
| `packages/core/src/evaluation/yaml-parser.ts` | Read per-test `execution.threshold`, pass to orchestrator |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Add `min_score` to `EvaluatorCommonSchema` and `RubricItemSchema` (0-1 scale) |
| `packages/core/src/evaluation/loaders/evaluator-parser.ts` | Parse `min_score`, handle deprecated `required: number` and `required_min_score` (with 0-10 → 0-1 conversion) |
| `packages/core/src/evaluation/evaluators/llm-grader.ts` | Convert rubric `min_score` from 0-1 to 0-10 for internal LLM comparison |
| `packages/core/src/evaluation/types.ts` | Update type definitions |
| `packages/core/src/evaluation/evaluators/index.ts` | Re-export `DEFAULT_THRESHOLD` + deprecated `PASS_THRESHOLD` |
| `apps/cli/src/commands/results/studio-config.ts` | Read/write `threshold`, fallback to `pass_threshold` |
| `apps/cli/src/commands/eval/benchmark-writer.ts` | Import `DEFAULT_THRESHOLD` instead of local duplicate |
Backward compatibility
All changes use deprecation aliases — no breaking changes:
- `PASS_THRESHOLD` re-exported as deprecated
- `required: number` parsed as `required: true` + `min_score`
- `required_min_score` accepted alongside `min_score` (value in 0-10 converted to 0-1 internally)
- `pass_threshold` read as fallback in studio config
Non-goals
- Changing the default value (0.8 stays)
- Changing `score_ranges` to 0-1 (they describe LLM integer output bands)
- Renaming `threshold` on latency evaluator (different concept: max ms)
- Renaming composite aggregator `type: 'threshold'` (different concept: aggregation strategy)
- Adding threshold to Agent Skills format (they don't have it, no conflict)
Documentation Requirements
The implementer MUST update documentation to make the scoring scale and naming clear:
- eval-schema.json: regenerate after schema changes (`bun run generate:schema`)
- README.md: update any threshold/scoring examples to use new field names
- YAML schema JSDoc: add clear comments on `min_score` explaining 0-1 scale at all levels
- Migration note: document deprecated fields (`required: number`, `required_min_score`, `pass_threshold`) and their new equivalents in a visible location (CHANGELOG or migration section)
- Scoring scale principle: document in CLAUDE.md or schema comments: "All user-configurable score thresholds use 0-1 scale. The only 0-10 values in YAML are `score_ranges` which define LLM integer output band labels."
min_score Behavior (Clarification)
min_score and required are independent:
| Config | Evaluator verdict | Aggregate effect |
|---|---|---|
| `min_score: 0.7` only | fail if score < 0.7, pass if >= 0.7 | Score contributes to weighted average normally |
| `required: true` only | fail if score < threshold (suite/test level) | If fails, entire aggregate → 0 |
| `min_score: 0.7` + `required: true` | fail if score < 0.7 | If score < 0.7, entire aggregate → 0 |
| Neither | fail if score < threshold (suite/test level) | Score contributes normally |
min_score controls the evaluator's own verdict. required controls whether failing gates the aggregate. They compose independently.
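The way `min_score` and `required` compose can be sketched in a few lines. This is an illustration of the table above, not the orchestrator's actual aggregation code: `EvaluatorResult`, `evaluatorVerdict`, and `aggregate` are hypothetical names, and the per-evaluator `weight` field is assumed from the mention of a weighted average.

```typescript
type Verdict = 'pass' | 'fail';

// Hypothetical per-evaluator result after parsing and scoring.
interface EvaluatorResult {
  score: number;      // 0-1
  weight: number;     // assumed weight for the weighted average
  required?: boolean; // gate: failing zeroes the aggregate
  minScore?: number;  // 0-1, overrides the suite/test threshold for the verdict
}

// min_score controls the evaluator's own verdict; otherwise use the threshold.
function evaluatorVerdict(result: EvaluatorResult, threshold: number): Verdict {
  const boundary = result.minScore ?? threshold;
  return result.score >= boundary ? 'pass' : 'fail';
}

// required controls whether a failing evaluator gates the aggregate.
function aggregate(results: EvaluatorResult[], threshold: number): number {
  const gated = results.some(
    (r) => r.required === true && evaluatorVerdict(r, threshold) === 'fail',
  );
  if (gated) return 0;

  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weightedSum / totalWeight;
}
```

An evaluator with `minScore: 0.7` and a score of 0.6 fails its own verdict but still contributes 0.6 to the average; adding `required: true` to the same evaluator drops the aggregate to 0.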
Acceptance Signals
- Threshold threading: `scoreToVerdict()` accepts and uses configurable threshold at all call sites
- Per-test threshold: `execution.threshold` on a test case overrides suite threshold (verified by test)
- Resolution order: CLI `--threshold` > test `execution.threshold` > suite `execution.threshold` > `DEFAULT_THRESHOLD` (0.8) (verified by test)
- `evaluate()` API: `EvalConfig.threshold` is respected in `computeSummary()` (verified by test)
- `min_score` on assertions: new field accepted in YAML, controls per-evaluator verdict (verified by test)
- `min_score` on rubrics: new field accepted in YAML (0-1 scale), internally converted to 0-10 (verified by test)
- Deprecation aliases work: `required: 0.7` parsed as `required: true` + `min_score: 0.7`; `required_min_score: 7` parsed as `min_score: 0.7` (verified by test)
- Studio config: reads `threshold`, falls back to `pass_threshold` (verified by test)
- Schema regenerated: `bun run generate:schema` produces updated `eval-schema.json`
- All existing tests pass: no regressions from rename/deprecation
- Documentation updated: per Documentation Requirements section above