Description
Add an eval command to agk that enables testing and validating agent prompts and agent workflows. This feature should provide a framework for developers to define expected behavior, run automated tests, and integrate those tests into CI workflows.
Currently, agk supports scaffolding (agk init), tracing, and workflow execution. Eval support will complete the developer experience by making it easy to verify correctness and catch regressions.
Motivation
Developers building agents need a reliable way to:
- Validate prompt outputs against expectations
- Test multi-step agent workflows
- Detect regressions in prompt or agent logic
- Automate tests in CI/CD pipelines
Without a built-in evaluation mechanism, developers have to build custom scripts or maintain separate test tooling for each project.
Eval support will:
- Improve confidence in agent behavior
- Facilitate CI automation
- Encourage best practices
- Reduce duplication of test code
Design and Specification
What “eval” Should Do
- Run a suite of test cases defined by the developer
- Validate model output against expected results
- Support different match strategies:
  - Exact match
  - Semantic similarity
  - Pattern or regular expression
- Report structured results (CLI summary, JSON, HTML, JUnit)
Test Definition Format (Proposal)
Below is a proposed YAML format for prompt evaluation:
tests:
  - name: Translate to French
    input: "Translate to French: Hello"
    expect:
      type: exact
      value: "Bonjour"
  - name: Summarize text
    input: "Summarize:\nThe quick brown fox..."
    expect:
      type: semantic
      value: "Quick brown fox summary..."
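The strategy list above also mentions pattern matching; a regular-expression expectation could be expressed in the same format. The regex type name and the test below are illustrative, not part of the proposal:

tests:
  - name: Extract year
    input: "In what year did the first Moon landing take place?"
    expect:
      type: regex
      value: "19\\d{2}"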
Example YAML for agent workflow evaluation:

tests:
  - name: Calculator add
    input: "Add 2 and 3"
    expect:
      output: "5"
      tool_calls:
        - name: "calc.add"
          count: 1

CLI Usage
Proposed commands:
agk eval # runs tests in default location
agk eval tests.yaml # specify custom path
agk eval --format json # output in JSON
agk eval --report html # generate HTML test report
Output Reporting
CLI Summary
2 tests run
1 passed
1 failed
JSON (machine-friendly)
{
  "tests": [
    { "name": "Translate", "status": "passed" },
    { "name": "Summarize", "status": "failed" }
  ]
}

HTML (human-readable)
A standalone HTML report showing test results, differences, and details.
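Since JUnit is listed among the report formats, that output would presumably follow the standard JUnit XML schema; a minimal report for the two tests above might look like this (attribute values are illustrative):

<testsuite name="agk eval" tests="2" failures="1">
  <testcase name="Translate"/>
  <testcase name="Summarize">
    <failure message="expected output did not match"/>
  </testcase>
</testsuite>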
Acceptance Criteria
- agk eval command exists
- Test definition format supported (YAML)
- Prompt and agent workflow tests execute against configured LLM
- Multiple expectation strategies supported
- Colorized CLI output
- Configurable report formats (JSON, HTML)
- Documentation for CI integration
- Reasonable defaults for timeouts and retries
User Experience Considerations
- Print clear failure reasons
- Show actual vs expected output
- Offer configurable retry behavior for non-deterministic models (a configuration sketch follows this list)
- Provide helpful defaults for first-time users
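To make the defaults for timeouts and retries concrete, a project-level configuration block could look like the sketch below; the key names and values are assumptions rather than part of the proposal:

eval:
  timeout_seconds: 30   # hypothetical per-test timeout
  retries: 2            # hypothetical number of re-runs before a test is marked failed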
Example Workflow
- Developer scaffolds an agent with agk init
- Developer writes tests in tests.yaml
- Developer runs agk eval
- If tests fail, developer refines prompts or logic
- CI/CD runs agk eval --format json and fails builds when tests fail
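As an illustration of the CI/CD step, a GitHub Actions job could run the command as shown below; the installation step is an assumption, since the proposal does not specify how agk is distributed:

name: agent-evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install agk
        run: pip install agk   # assumption: actual install method is not specified in this proposal
      - name: Run evals
        run: agk eval tests.yaml --format json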
Open Questions
- Should we support test parameterization?
- What semantic similarity metric should be used? (a cosine-similarity sketch follows this list)
- How should external tool calls be mocked?
- Should tests support tolerance thresholds for nondeterministic output?
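On the semantic-similarity question, one common choice is cosine similarity over text embeddings. The sketch below is a minimal illustration in Python; the embed callable and the 0.8 threshold are assumptions, not part of the proposal:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_match(actual: str, expected: str, embed, threshold: float = 0.8) -> bool:
    # Pass when the embeddings of actual and expected output are close enough.
    # `embed` is a placeholder for whatever embedding backend the project picks.
    return cosine_similarity(embed(actual), embed(expected)) >= threshold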
Documentation Needs
- Specification for test file format
- CLI reference for eval
- Example repository with common patterns
- Best practices guide for prompt and agent testing
Future Enhancements
- Automatic test case generation
- Interactive test recorder
- Import/export with formats such as JUnit or pytest
- Support for evaluation scoring metrics