Universal testing and evaluation toolkit for AI agents.
```bash
pip install agentest
```
```python
import agentest

# Auto-record all LLM calls (works with Anthropic and OpenAI SDKs)
agentest.instrument()

# Run your agent and capture a trace
result, trace = agentest.run(my_agent, "Summarize README.md", task="Summarize")

# Evaluate it
for r in agentest.evaluate(trace):
    print(f"{r.evaluator}: {'PASS' if r.passed else 'FAIL'}")
```

That's it. Three calls to instrument, trace, and evaluate any agent — no matter what framework or LLM provider you use.
- Record & Replay — Capture real agent sessions, replay them deterministically without LLM calls
- Tool Mocking — Mock any tool with a fluent `.when(...).returns(...)` API (see the sketch after this list)
- 10 Built-in Evaluators — Task completion, safety, cost, latency, tool usage, LLM judges, and more
- Auto-Instrumentation — `agentest.instrument()` patches Anthropic/OpenAI clients with zero code changes
- Framework Adapters — LangChain, CrewAI, AutoGen, LlamaIndex, Claude Agent SDK, OpenAI Agents SDK
- MCP Server Testing — Protocol compliance, schema validation, and security testing
- pytest Plugin — Auto-registered fixtures, markers, and CLI flags
- Benchmarking — Compare pass rates, cost, and latency across models
- CLI — `agentest evaluate`, `agentest replay`, `agentest summary`, and more
- Web Dashboard — Browse and explore traces in your browser
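The Tool Mocking bullet above only names the `.when(...).returns(...)` chain; the rest of this sketch is an assumption about how a mock might be built and wired in. In particular, `MockTool` and the `tools=` argument to `agentest.run()` are hypothetical names used purely for illustration:

```python
import agentest

# Hypothetical constructor; only the .when(...).returns(...) chain comes
# from the feature list above.
weather_tool = (
    agentest.MockTool("get_weather")
    .when(city="Paris")
    .returns({"temp_c": 18, "conditions": "cloudy"})
)

# Hypothetical wiring: pass the mock into the run so the agent's tool calls
# never reach a real weather service. my_agent is the callable from the
# quick start.
result, trace = agentest.run(
    my_agent,
    "What's the weather in Paris right now?",
    task="Weather lookup",
    tools=[weather_tool],
)
```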
- Quick Start Guide — install to first passing test in under a minute
- Full Documentation — guides, API reference, and best practices
- Examples — working code you can run
- Best Practices — rollout order, project structure, CI/CD setup
License: MIT