Open
Description
Objective
Add an iterative generate→evaluate→feedback→regenerate loop (Ralph Loop) to AgentV. When enabled, a failing test case gets re-prompted with structured feedback about what went wrong, up to N iterations, until the quality threshold is met.
Design Latitude
YAML configuration (suggested shape, flexible):
execution:
  ralph:
    max_iterations: 3
    threshold: 0.8
    improvement_threshold: 0.05
    feedback_template: ./feedback.md # optional custom template
CLI flag: agentv eval evals/ --ralph --max-iterations 3
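As a rough illustration of how the suggested configuration shape might be represented and validated once parsed, here is a minimal Python sketch. The names `RalphConfig` and `ralph_config_from_dict` are hypothetical, the defaults simply mirror the example values above, and the assumption that scores lie in [0, 1] is inferred from the `threshold` and `perfect_score` values rather than stated in the issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RalphConfig:
    """Hypothetical in-memory form of the `execution.ralph` block."""
    max_iterations: int = 3              # assumed default, taken from the example
    threshold: float = 0.8               # assumed default, taken from the example
    improvement_threshold: float = 0.05  # assumed default, taken from the example
    feedback_template: Optional[str] = None  # path to a custom template, if any

    def __post_init__(self) -> None:
        # Reject values that would make the loop meaningless.
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be >= 1")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        if self.improvement_threshold < 0.0:
            raise ValueError("improvement_threshold must be >= 0")


def ralph_config_from_dict(execution: dict) -> Optional[RalphConfig]:
    """Return None when no `ralph` block is present (loop disabled)."""
    block = execution.get("ralph")
    if block is None:
        return None
    return RalphConfig(**block)
```

Returning `None` for a missing block keeps the loop opt-in, which matches the "when enabled" wording in the objective.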
Key components
- Orchestrator loop — wraps existing evaluate-single-case flow with retry logic
- Feedback builder — converts assertion failures into structured LLM-actionable feedback injected into the next prompt
- Stop conditions (from microsoft/skills harness):
  - quality_threshold_met: score >= threshold
  - perfect_score: score >= 1.0
  - max_iterations_reached: exhausted budget
  - no_improvement: improvement < improvement_threshold
  - score_regression: score went down
- Per-iteration result tracking — score trajectory, which iteration passed, stop reason
- Feedback template system — default template that formats failures by severity with suggestions; user can override with custom markdown template
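The orchestrator loop and stop conditions listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not AgentV's implementation: `stop_reason` and `ralph_loop` are hypothetical names, the `generate`, `evaluate`, and `build_feedback` callables stand in for the existing evaluate-single-case flow, and the order in which the stop conditions are checked is a guess, since the issue does not specify precedence.

```python
from typing import Callable, List, Optional, Tuple


def stop_reason(scores: List[float], threshold: float,
                improvement_threshold: float, max_iterations: int) -> Optional[str]:
    """Return a stop reason for the trajectory so far, or None to continue."""
    current = scores[-1]
    if current >= 1.0:
        return "perfect_score"
    if current >= threshold:
        return "quality_threshold_met"
    if len(scores) >= 2:
        delta = current - scores[-2]
        if delta < 0:
            return "score_regression"
        if delta < improvement_threshold:
            return "no_improvement"
    if len(scores) >= max_iterations:
        return "max_iterations_reached"
    return None


def ralph_loop(prompt: str,
               generate: Callable[[str], str],
               evaluate: Callable[[str], Tuple[float, list]],
               build_feedback: Callable[[list], str],
               *, max_iterations: int = 3, threshold: float = 0.8,
               improvement_threshold: float = 0.05) -> Tuple[List[float], str]:
    """Generate, evaluate, and re-prompt with feedback until a stop condition fires."""
    scores: List[float] = []
    current_prompt = prompt
    while True:
        output = generate(current_prompt)
        score, failures = evaluate(output)
        scores.append(score)
        reason = stop_reason(scores, threshold, improvement_threshold, max_iterations)
        if reason is not None:
            return scores, reason
        # The original prompt is preserved; feedback is appended to it.
        current_prompt = prompt + "\n\n" + build_feedback(failures)
```

Because the iteration-budget check runs on every pass, the loop always terminates after at most `max_iterations` generations.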
Result schema extension
Acceptance Signals
- agentv eval evals/ --ralph re-prompts failing tests with feedback
- Feedback includes assertion failures grouped by severity
- Stops when threshold met, max iterations reached, or no improvement
- Results include per-iteration scores and stop reason
- Works with all target types (CLI agents, API providers)
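To make the "grouped by severity" signal concrete, a feedback builder might look something like the sketch below. Everything here is an assumption for illustration: the `AssertionFailure` fields, the severity names in `SEVERITY_ORDER`, and the output layout are hypothetical and are not taken from AgentV or the microsoft/skills harness.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical severity levels, most important first.
SEVERITY_ORDER = ["critical", "major", "minor"]


@dataclass
class AssertionFailure:
    name: str
    severity: str
    message: str
    suggestion: str = ""


def build_feedback(failures: List[AssertionFailure]) -> str:
    """Format failures as text grouped by severity, highest severity first."""
    grouped: Dict[str, List[AssertionFailure]] = {}
    for failure in failures:
        grouped.setdefault(failure.severity, []).append(failure)

    # Known severities in priority order, then any unknown ones alphabetically.
    ordered = [s for s in SEVERITY_ORDER if s in grouped]
    ordered += sorted(s for s in grouped if s not in SEVERITY_ORDER)

    lines = ["The previous attempt failed these checks:"]
    for severity in ordered:
        lines.append(f"## {severity}")
        for failure in grouped[severity]:
            line = f"- {failure.name}: {failure.message}"
            if failure.suggestion:
                line += f" (suggestion: {failure.suggestion})"
            lines.append(line)
    return "\n".join(lines)
```

The returned string would be appended to the original prompt for the next iteration, leaving the original prompt text unchanged.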
Non-Goals
- Not multi-agent orchestration (single agent, iterative refinement)
- Not automatic prompt rewriting (feedback is appended, original prompt preserved)
- Not a replacement for trials (trials = same prompt N times; Ralph = feedback-augmented retries)
Context
Core pattern from the microsoft/skills eval harness. Named after the "Sensei" technique by Shayne Boyer. The skills harness uses this across 1114+ scenarios and reports significant quality improvements (often 40-60 point score jumps in 2-3 iterations).
Related
- feat(eval): composable quality gates with auto-remediation triggers #334 (complementary: Ralph Loop uses gates to determine stop)
Example record for the result schema extension:
{
  "ralph": {
    "iterations": 3,
    "scores": [0.4, 0.7, 0.9],
    "stop_reason": "quality_threshold_met",
    "improvement": 0.5
  }
}
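A record of this shape could be assembled from a score trajectory along the following lines. The helper name `ralph_result` is hypothetical, and computing `improvement` as the last score minus the first is an inference from the example values (0.9 - 0.4 = 0.5), not something the issue defines.

```python
from typing import Dict, List


def ralph_result(scores: List[float], stop_reason: str) -> Dict[str, dict]:
    """Build the per-test `ralph` result block from a score trajectory."""
    if not scores:
        raise ValueError("at least one iteration is required")
    return {
        "ralph": {
            "iterations": len(scores),
            "scores": list(scores),
            "stop_reason": stop_reason,
            # Assumed definition: total gain from first to last iteration,
            # rounded to hide floating-point noise.
            "improvement": round(scores[-1] - scores[0], 6),
        }
    }
```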