Scope
This is a documentation/examples task — no core code changes needed. These evaluators are implementable as code judges using the existing @agentv/eval SDK.
What to Create
Add example code judges in examples/ for standard NLP metrics:
1. ROUGE-N (examples/nlp-metrics/judges/rouge.ts)
import { defineCodeJudge } from "@agentv/eval";
// Use rouge npm package to compute ROUGE-N score
// Config: { n: 1, threshold: 0.75 }
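A minimal sketch, assuming defineCodeJudge takes an async handler that receives { output, expected, config } and returns { score, pass, reason } (the actual SDK shape may differ), and assuming the rouge npm package's n(candidate, reference, opts) returns a score in [0, 1]:

import { defineCodeJudge } from "@agentv/eval";
import rouge from "rouge"; // assumed import shape for the `rouge` package

export default defineCodeJudge(async ({ output, expected, config }) => {
  const n = config?.n ?? 1;
  const threshold = config?.threshold ?? 0.75;
  // ROUGE-N measures n-gram overlap between candidate and reference.
  // The option name for the gram size is assumed; check the package docs.
  const score = rouge.n(output, expected, { n });
  return {
    score,
    pass: score >= threshold,
    reason: `ROUGE-${n} = ${score.toFixed(3)} (threshold ${threshold})`,
  };
});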
2. BLEU (examples/nlp-metrics/judges/bleu.ts)
// Use sacrebleu or bleu-score package
// Config: { threshold: 0.5 }
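Because sacrebleu is a Python tool and no canonical npm BLEU package can be assumed, this sketch computes an unsmoothed sentence-level BLEU inline (uniform weights up to 4-grams, plus brevity penalty); the handler shape is the same assumption as in the ROUGE sketch:

import { defineCodeJudge } from "@agentv/eval";

// Count n-grams in a token list.
function ngramCounts(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Minimal sentence-level BLEU. Unsmoothed: any zero n-gram precision yields 0,
// so very short outputs score 0; real implementations apply smoothing.
function bleu(candidate: string, reference: string, maxN = 4): number {
  const cand = candidate.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  if (cand.length === 0) return 0;
  let logPrecisionSum = 0;
  for (let n = 1; n <= maxN; n++) {
    const candCounts = ngramCounts(cand, n);
    const refCounts = ngramCounts(ref, n);
    let clipped = 0;
    let total = 0;
    for (const [gram, count] of candCounts) {
      clipped += Math.min(count, refCounts.get(gram) ?? 0); // clipped precision
      total += count;
    }
    if (total === 0 || clipped === 0) return 0;
    logPrecisionSum += Math.log(clipped / total) / maxN;
  }
  // Brevity penalty: penalize candidates shorter than the reference.
  const brevity = cand.length < ref.length ? Math.exp(1 - ref.length / cand.length) : 1;
  return brevity * Math.exp(logPrecisionSum);
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 0.5;
  const score = bleu(output, expected);
  return { score, pass: score >= threshold, reason: `BLEU = ${score.toFixed(3)}` };
});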
3. Semantic Similarity (examples/nlp-metrics/judges/similarity.ts)
// Use target proxy to call embedding model
// Compute cosine similarity between expected and actual
// Config: { threshold: 0.8 }
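A sketch under the assumption that embeddings come from an OpenAI-compatible /v1/embeddings endpoint; EMBEDDINGS_URL and the model name are placeholders, and a real judge would route the call through agentv's target proxy instead:

import { defineCodeJudge } from "@agentv/eval";

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Fetch one embedding. EMBEDDINGS_URL and the model are hypothetical;
// swap in the target proxy's actual endpoint and model.
async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${process.env.EMBEDDINGS_URL}/v1/embeddings`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 0.8;
  const [a, b] = await Promise.all([embed(output), embed(expected)]);
  const score = cosine(a, b);
  return { score, pass: score >= threshold, reason: `cosine = ${score.toFixed(3)}` };
});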
4. Levenshtein Distance (examples/nlp-metrics/judges/levenshtein.ts)
// Compute edit distance, score = 1 - (distance / max_length)
// Config: { threshold: 5 } (a raw edit-distance budget, not the normalized score)
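A self-contained sketch using the standard single-row dynamic-programming edit distance; it passes or fails on the raw distance budget and reports the normalized score alongside:

import { defineCodeJudge } from "@agentv/eval";

// Classic DP edit distance, keeping only one row of the matrix.
function levenshtein(a: string, b: string): number {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // value at (i-1, j-1)
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1,                              // deletion
        prev[j - 1] + 1,                          // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 5; // raw edit-distance budget
  const distance = levenshtein(output, expected);
  const maxLength = Math.max(output.length, expected.length, 1);
  const score = 1 - distance / maxLength; // normalized, for reporting
  return { score, pass: distance <= threshold, reason: `distance = ${distance}` };
});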
Example EVAL.yaml
name: nlp-metrics-demo
cases:
  - id: summarization-test
    input: "Summarize this article..."
    expected_outcome: "Reference summary text"
    execution:
      evaluators:
        - name: rouge_score
          type: code_judge
          script: ["bun", "run", "judges/rouge.ts"]
          n: 1
          threshold: 0.75
        - name: semantic_match
          type: code_judge
          script: ["bun", "run", "judges/similarity.ts"]
          threshold: 0.8
Documentation
- Add to docs site: "NLP Metrics Evaluators" guide
- Reference from existing evaluator docs
- Show how to compose with other evaluator types
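For the composition section, a fragment along these lines could pair a code judge with another evaluator type; llm_judge and its prompt field are assumed names used for illustration only, so substitute whatever types the evaluator docs actually define:

evaluators:
  - name: rouge_score
    type: code_judge
    script: ["bun", "run", "judges/rouge.ts"]
    threshold: 0.75
  - name: tone_check
    type: llm_judge   # assumed type name, for illustration
    prompt: "Does the output preserve the reference's tone and key facts?"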
Acceptance Criteria
- The four example judges exist under examples/nlp-metrics/judges/ and run via bun.
- The example EVAL.yaml executes end to end against those judges.
- The docs site has an "NLP Metrics Evaluators" guide, linked from the existing evaluator docs, covering composition with other evaluator types.