A simple evaluation framework for testing agent performance on cloud infrastructure debugging problems, using an LLM as judge.
## Quick Start

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Set your OpenAI API key:

   ```bash
   cp .env.example .env  # Edit .env and add your OpenAI API key
   ```
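After editing, the resulting `.env` would look roughly like this sketch (the variable name is an assumption; check `.env.example` for the actual one):

```shell
# .env — sketch only; the variable name is assumed, verify against .env.example
OPENAI_API_KEY=sk-...
```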
3. Run the evaluation:

   ```bash
   uv run python eval.py
   ```

4. View reports: reports are generated in the `reports/` folder as Markdown files.
## Using Your Own Agent

Replace the agent import in `eval.py`:

```python
# Instead of:
from example_agent import ExampleAgent
agent = ExampleAgent()

# Use your agent:
from your_agent import YourAgent
agent = YourAgent()
```

Your agent must implement this interface:

```python
class YourAgent:
    def solve_problem(self, problem_context: str) -> str:
        """
        Solve a debugging problem given the full context.

        Args:
            problem_context: Formatted problem with logs, configs, etc.

        Returns:
            Solution string with root cause, diagnosis steps, solution, etc.
        """
        # Your implementation here
        return solution_string
```

## Project Structure

```
├── problems/          # Problem directories
├── reports/           # Generated evaluation reports (Markdown)
├── src/evaluator.py   # Evaluation framework
├── example_agent.py   # Example external agent
└── eval.py            # Run: uv run python eval.py
```
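A minimal agent satisfying this interface might look like the following sketch. Only the `solve_problem(problem_context) -> str` contract comes from the framework; the class name and placeholder body are illustrative.

```python
# Illustrative sketch of an agent implementing the solve_problem interface.
# A real agent would call an LLM or run diagnostics; this placeholder
# just returns a structured answer template.

class MinimalAgent:
    def solve_problem(self, problem_context: str) -> str:
        # problem_context contains the formatted problem, logs, and configs.
        return (
            "Root cause: <identified from logs/configs>\n"
            "Diagnosis steps: <how the issue was narrowed down>\n"
            "Solution: <proposed fix>\n"
        )
```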
## Evaluation Criteria

The LLM judge scores each solution on three weighted dimensions:

- **Diagnosis Accuracy (40%)**: How well did the agent identify the root cause?
- **Solution Correctness (40%)**: How correct and complete is the proposed fix?
- **Investigation Quality (20%)**: How systematic and thorough is the debugging methodology?
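The weighted aggregate implied by these percentages can be sketched as follows; the function name, dictionary keys, and the per-dimension score scale are assumptions for illustration, not the framework's actual API.

```python
# Weights taken from the criteria above; a 0-10 per-dimension scale is assumed.
WEIGHTS = {
    "diagnosis_accuracy": 0.4,
    "solution_correctness": 0.4,
    "investigation_quality": 0.2,
}

def overall_score(scores: dict) -> float:
    """Combine per-dimension judge scores into one weighted overall score."""
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())

# overall_score({"diagnosis_accuracy": 8,
#                "solution_correctness": 6,
#                "investigation_quality": 9})  # ≈ 7.4
```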
## Adding New Problems

1. Create a directory under `problems/`
2. Include `problem.md`, `solution.md`, `logs/`, and `configs/`
3. See `docs/problem-creation-guide.md` for details
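Scaffolding a new problem directory can be sketched in the shell. The problem name here is a placeholder; see the guide for what each file must contain.

```shell
# Create the skeleton for a new problem (name is illustrative).
mkdir -p problems/example-problem/logs problems/example-problem/configs
touch problems/example-problem/problem.md problems/example-problem/solution.md
```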