Agent Flow Test - Pytest for AI agents
AgentFT (Agent Flow Test) is a pytest-style evaluation framework for AI agents. It provides composable primitives for tasks, scenarios, agents, and judges, with production-ready infrastructure including async execution, retries, rate limiting, and comprehensive reporting.
- Pytest-like simplicity: Minimal boilerplate, clear abstractions
- Production-ready: Async execution with retries and rate limiting
- Full observability: Traces, costs, metadata, HTML reports
- Composable judges: Layer multiple evaluation strategies (sequential, voting, weighted); see the sketch after this list
- Environment support: Not just Q&A - full stateful agent evaluation
- CLI tools: Run evals, view summaries, compare runs
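As a rough illustration of judge composition, the sketch below combines two judges under a voting strategy. This is only illustrative: `CompositeJudge` does exist in the library (it appears in the test suite), but the import path, constructor arguments, and strategy name shown here are assumptions, not the documented API; consult examples/config_example.py for the real usage.

```python
from agentft.presets import ExactMatchJudge

# NOTE: the CompositeJudge import path, constructor arguments, and strategy
# names below are assumptions for illustration only.
from agentft.judges import CompositeJudge

# Hypothetical layering: run several judges and resolve disagreements by vote.
judge = CompositeJudge(
    judges=[ExactMatchJudge(), ExactMatchJudge()],  # in practice these would differ
    strategy="voting",
)
```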
Install from PyPI:
```bash
pip install agentft
```

Run an example:
```bash
git clone https://github.com/Geddydukes/agentflowtest
cd agentflowtest
aft run --config examples/config_example.py
```

Or use in code:
```python
from agentft import Task, RunConfig, run
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

# Define your agent
class MyAgent:
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    async def setup(self): pass
    async def reset(self): pass
    async def teardown(self): pass

    async def run_task(self, task, context=None):
        # Your agent logic here
        return {"response": "42"}

# Create evaluation config
config = RunConfig(
    name="quick_test",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)

# Run evaluation
results = run(config)
print(f"Passed: {sum(r.passed for r in results)}/{len(results)}")
```

View results:
- Console: Summary printed automatically
- HTML: Open `runs/<run_id>/report.html` in a browser
- Raw data: `runs/<run_id>/results.jsonl`
Existing eval frameworks are either too simple or too complex:
- Simple tools: Just run prompts, no structure or observability
- Academic benchmarks: Steep learning curve, not designed for iteration
- Vendor platforms: Lock-in, limited customization
AgentFT hits the sweet spot:
- Pytest-like API (minimal boilerplate)
- Production-ready features (async, retries, rate limiting)
- Full transparency (traces, costs, metadata)
- Works for simple Q&A through complex multi-step agents
Run an evaluation from a Python config file.
```bash
aft run --config path/to/config.py
```

The config file should define a `RunConfig` object named `config`.
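A minimal config file mirrors the Quick Start above; the only extra requirement is that the `RunConfig` is assigned to a module-level variable named `config`. The `my_project.agents` import is a hypothetical placeholder for wherever your agent class lives:

```python
# path/to/config.py
from agentft import RunConfig
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

from my_project.agents import MyAgent  # hypothetical: import your own agent class

# `aft run` loads this module and expects a RunConfig named `config`.
config = RunConfig(
    name="cli_example",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)
```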
Show a summary of a completed run.
```bash
aft summary --run-dir runs/my_run-abc123/
```

Compare two runs to identify regressions and improvements.
```bash
aft compare --run-a runs/run1/ --run-b runs/run2/
```

Running an evaluation creates a directory under `runs/<run_id>/` with:
- results.jsonl: All evaluation results in JSONL format
- traces.jsonl: Event traces for debugging and analysis
- run_metadata.json: Run configuration, environment, and version information
- report.html: Interactive HTML report with statistics, per-agent performance, and failing tasks
Open report.html in your browser to view the detailed evaluation report.
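Because the run artifacts are plain JSON/JSONL, they can also be inspected programmatically. A rough sketch, assuming each line of results.jsonl serializes a result with a `passed` field (mirroring the attribute used in the Quick Start); the actual schema may include additional fields:

```python
import json
from pathlib import Path

run_dir = Path("runs/my_run-abc123")  # substitute your own run directory

# results.jsonl holds one JSON object per evaluation result.
with open(run_dir / "results.jsonl") as f:
    results = [json.loads(line) for line in f if line.strip()]

# "passed" is an assumption based on the Quick Start; adjust if the schema differs.
passed = sum(1 for r in results if r.get("passed"))
print(f"Passed: {passed}/{len(results)}")
```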
AgentFT includes a comprehensive test suite with 43+ tests covering all major components:
- Core types (Task, Cost, EvaluationResult, Trace)
- Scenarios and presets
- Judges (ExactMatchJudge, CompositeJudge)
- Runner and lifecycle hooks
- Storage and reporting functions
- Integration tests
Run tests:
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Generate HTML test report
pytest --html=test-results/report.html --self-contained-html

# Generate coverage report
pytest --cov=src/agentft --cov-report=html:htmlcov
```

Test reports are generated in test-results/ and coverage reports in htmlcov/.
Agent Flow Test (AgentFT) is in active development. The core framework is functional and ready for evaluation use cases.
Current version: 0.1.0