A simple evaluation framework for testing agent performance on cloud infrastructure debugging problems, using an LLM as judge.
## Quick Start

1. Install dependencies:

   ```bash
   uv sync
   ```

2. Set your OpenAI API key:

   ```bash
   cp .env.example .env  # Edit .env and add your OpenAI API key
   ```
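After editing, the resulting `.env` would look roughly like this sketch (the variable name is an assumption; check `.env.example` for the actual one):

```shell
# .env — sketch only; the variable name is assumed, verify against .env.example
OPENAI_API_KEY=sk-...
```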
3. Run the evaluation:

   ```bash
   uv run python eval.py
   ```

4. View reports: reports are generated in the `reports/` folder as Markdown files.
## Using Your Own Agent

Replace the agent import in `eval.py`:

```python
# Instead of:
from example_agent import ExampleAgent
agent = ExampleAgent()

# Use your agent:
from your_agent import YourAgent
agent = YourAgent()
```

Your agent must implement this interface:

```python
class YourAgent:
    def solve_problem(self, problem_context: str) -> str:
        """
        Solve a debugging problem given the full context.

        Args:
            problem_context: Formatted problem with logs, configs, etc.

        Returns:
            Solution string with root cause, diagnosis steps, solution, etc.
        """
        # Your implementation here
        return solution_string
```

## Project Structure

```
├── problems/          # Problem directories
├── reports/           # Generated evaluation reports (Markdown)
├── src/evaluator.py   # Evaluation framework
├── example_agent.py   # Example external agent
└── eval.py            # Run: uv run python eval.py
```
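A minimal agent satisfying this interface might look like the following sketch. Only the `solve_problem(problem_context) -> str` contract comes from the framework; the class name and placeholder body are illustrative.

```python
# Illustrative sketch of an agent implementing the solve_problem interface.
# A real agent would call an LLM or run diagnostics; this placeholder
# just returns a structured answer template.

class MinimalAgent:
    def solve_problem(self, problem_context: str) -> str:
        # problem_context contains the formatted problem, logs, and configs.
        return (
            "Root cause: <identified from logs/configs>\n"
            "Diagnosis steps: <how the issue was narrowed down>\n"
            "Solution: <proposed fix>\n"
        )
```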
## Evaluation Criteria

The LLM judge scores each solution on three weighted dimensions:

- **Diagnosis Accuracy (40%)**: How well did the agent identify the root cause?
- **Solution Correctness (40%)**: How correct and complete is the proposed fix?
- **Investigation Quality (20%)**: How systematic and thorough is the debugging methodology?
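The weighted aggregate implied by these percentages can be sketched as follows; the function name, dictionary keys, and the per-dimension score scale are assumptions for illustration, not the framework's actual API.

```python
# Weights taken from the criteria above; a 0-10 per-dimension scale is assumed.
WEIGHTS = {
    "diagnosis_accuracy": 0.4,
    "solution_correctness": 0.4,
    "investigation_quality": 0.2,
}

def overall_score(scores: dict) -> float:
    """Combine per-dimension judge scores into one weighted overall score."""
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())

# overall_score({"diagnosis_accuracy": 8,
#                "solution_correctness": 6,
#                "investigation_quality": 9})  # ≈ 7.4
```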
## Adding New Problems

1. Create a directory under `problems/`
2. Include `problem.md`, `solution.md`, `logs/`, and `configs/`
3. See `docs/problem-creation-guide.md` for details
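Scaffolding a new problem directory can be sketched in the shell. The problem name here is a placeholder; see the guide for what each file must contain.

```shell
# Create the skeleton for a new problem (name is illustrative).
mkdir -p problems/example-problem/logs problems/example-problem/configs
touch problems/example-problem/problem.md problems/example-problem/solution.md
```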