[meta] Eval suite coverage for agents and skills #93

@arnaudlh

Goal

Track per-skill and per-agent eval suite authoring on top of the eval harness work tracked in #61. Each sub-issue covers one skill or one agent and ships as a small, independently reviewable PR using the contributor loop documented in CONTRIBUTING.md and docs/WAZA.md (both land with the harness PRs).

Contributor loop

  1. Pick an unclaimed sub-issue below and assign yourself.
  2. Run /skill-bench <name> or /agent-bench <name> to draft eval.yaml + tasks from the live SKILL.md / .agent.md.
  3. Run waza run .github/evals/<name>/eval.yaml -v locally (copilot-sdk executor, requires copilot login).
  4. Run /skill-improve <name> or /agent-improve <name> to iterate on graders and fix false positives.
  5. Open a PR adding the suite and a manifest.yaml entry.
  6. CI runs the mock executor; a maintainer dispatches a real-model run for final review.
  7. After the suite is stable, /skill-promote (or /agent-promote) bumps it from expanded to pilot tier.
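Step 3 of the loop can be scripted if you run several suites locally. The sketch below only assembles the `waza run .github/evals/<name>/eval.yaml -v` command line quoted above; the helper name `waza_run_command` and its `evals_root` parameter are hypothetical, and nothing here is part of the harness itself.

```python
import shlex
from pathlib import PurePosixPath


def waza_run_command(name: str, evals_root: str = ".github/evals") -> list[str]:
    """Build the argv for a local run of one suite (step 3 of the loop).

    `name` is the bare skill or agent name, e.g. "azure-role-selector".
    This helper is illustrative only; it does not ship with the harness.
    """
    if not name or "/" in name:
        raise ValueError(f"expected a bare skill or agent name, got {name!r}")
    # Suites are assumed to live at <evals_root>/<name>/eval.yaml, as in step 3.
    eval_file = PurePosixPath(evals_root) / name / "eval.yaml"
    return ["waza", "run", str(eval_file), "-v"]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs
    # without waza installed or a copilot login.
    print(shlex.join(waza_run_command("azure-role-selector")))
```

To actually execute the run, the returned list could be passed to `subprocess.run(..., check=True)` from the repository root.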

Skills

Good-first-issue candidates (no Azure CLI / no live calls):

  • Skill: azure-naming-research
  • Skill: azure-rest-api-reference
  • Skill: azure-role-selector

Standard skills:

  • Skill: azure-cost-estimator
  • Skill: azure-policy-advisor
  • Skill: azure-security-analyzer
  • Skill: git-ape-onboarding (most complex — defer until others land)

(prereq-check ships with the harness PR as the proof-of-pipe — no separate sub-issue needed.)

Agents

Good-first-issue candidates:

  • Agent: azure-principal-architect

Standard agents:

  • Agent: azure-iac-exporter
  • Agent: azure-policy-advisor
  • Agent: azure-requirements-gatherer
  • Agent: azure-resource-deployer (safety-sensitive — grade refusal / plan-only path, not real deploy)
  • Agent: azure-template-generator
  • Agent: git-ape (orchestrator — depends on most sub-agent suites being stable first)
  • Agent: git-ape-onboarding

Conventions

  • One suite per PR. Don't bundle.
  • Use the authoring prompts; don't hand-write YAML.
  • Default new suites to expanded tier in manifest.yaml; promote after at least one clean real-model run.
  • Fork PRs can run only the mock executor. Real-model runs are maintainer-dispatched.
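The tier and executor conventions above amount to two small rules, sketched here as plain Python. The `SuiteStatus` type, its field names, and the `"real-model"` label are assumptions made for illustration; they are not the manifest.yaml schema or any harness API.

```python
from dataclasses import dataclass


@dataclass
class SuiteStatus:
    """Hypothetical record for one suite; not the manifest.yaml schema."""
    name: str
    tier: str = "expanded"          # new suites default to the expanded tier
    clean_real_model_runs: int = 0  # assumed counter of clean real-model runs


def can_promote(suite: SuiteStatus) -> bool:
    """A suite is eligible for promotion after at least one clean real-model run."""
    return suite.tier == "expanded" and suite.clean_real_model_runs >= 1


def executor_for(maintainer_dispatched: bool) -> str:
    """Ordinary PR CI (including fork PRs) uses the mock executor;
    only a maintainer-dispatched run uses a real model."""
    return "real-model" if maintainer_dispatched else "mock"
```

For example, a freshly added suite starts in the expanded tier and is not eligible for promotion until a maintainer-dispatched run comes back clean.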

Related

Metadata

Labels: AI-evals, enhancement
