An AI coding task evaluator using E2B sandboxes and LLM judges. Give it a task prompt; it generates code, executes it in a secure sandbox, and evaluates the results automatically.
# Install

```bash
uv pip install -e .
```

# Run a task

```bash
evaluator "Write a Python script to calculate Fibonacci numbers"
```

# Features

- Ad-hoc Task Execution: Give a prompt, get code + evaluation.
- E2B Sandbox Integration: Code runs in a secure, isolated cloud environment.
- LLM-as-a-Judge: Uses Claude to evaluate the quality and correctness of the solution.
- Backtesting: Run suites of regression tests defined in YAML (legacy `retrocode` functionality); a sketch of what a suite file might look like follows this list.
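The backtest YAML schema isn't documented here, so the following is only a rough sketch of what a suite entry might look like; the file name and every field (`name`, `prompt`, `assertions`, `type`, `value`, `criteria`) are assumptions, not the project's actual format.

```yaml
# tests/backtests/example.yaml -- hypothetical sketch; field names are
# assumptions, not the real schema.
- name: fibonacci-script
  prompt: "Write a Python script to calculate Fibonacci numbers"
  assertions:
    - type: output_contains   # assumed assertion kind: check sandbox stdout
      value: "55"             # fib(10)
    - type: llm_judge         # assumed assertion kind: defer to the LLM judge
      criteria: "The script is correct and reasonably idiomatic"
```

The actual suites live under `tests/backtests` and are run with the `evaluator test` command shown below.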
evaluator "Write a React component for a login form"This will:
- Spin up an E2B sandbox.
- Use an AI agent to write the code.
- Run the code (if applicable).
- Evaluate the result using an LLM judge.
# Backtesting

```bash
evaluator test --tests tests/backtests
```

# Architecture

The system uses:
- Evaluator CLI: Entry point.
- AgentInvoker: Interacts with Claude to generate code.
- E2BExecutor: Runs code in a sandbox.
- TestRunner: Orchestrates execution and assertions.
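To make the data flow concrete, here is a minimal Python sketch of how these components might be wired together. The class names come from the list above, but the import paths, constructor arguments, and method names (`generate`, `run`) are assumptions about the internal API rather than its documented interface.

```python
# Illustrative sketch only: import paths, constructors, and method names
# below are assumed, not taken from the actual codebase.
from evaluator.agent import AgentInvoker     # assumed module path
from evaluator.sandbox import E2BExecutor    # assumed module path
from evaluator.runner import TestRunner      # assumed module path


def evaluate_prompt(prompt: str):
    agent = AgentInvoker()      # asks Claude to generate code for the prompt
    executor = E2BExecutor()    # executes the generated code in an E2B sandbox

    # TestRunner orchestrates generation, sandboxed execution, and the
    # LLM-judge evaluation of the result.
    runner = TestRunner(agent=agent, executor=executor)  # assumed constructor
    return runner.run(prompt)                            # assumed method


if __name__ == "__main__":
    print(evaluate_prompt("Write a Python script to calculate Fibonacci numbers"))
```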
# License

MIT