[meta] Eval suite coverage for agents and skills #93

## Goal

Track per-skill and per-agent eval suite authoring on top of the eval harness work tracked in #61. Each sub-issue covers one skill or one agent and ships as a small, independently reviewable PR using the contributor loop documented in `CONTRIBUTING.md` and `docs/WAZA.md` (both land with the harness PRs).

## Contributor loop

1. Pick an unclaimed sub-issue below and assign yourself.
2. Run `/skill-bench <name>` or `/agent-bench <name>` to draft `eval.yaml` + tasks from the live `SKILL.md` / `.agent.md`.
3. Run `waza run .github/evals/<name>/eval.yaml -v` locally (`copilot-sdk` executor, requires `copilot login`).
4. Run `/skill-improve <name>` or `/agent-improve <name>` to iterate on graders and fix false positives.
5. Open a PR adding the suite and a `manifest.yaml` entry.
6. CI runs the mock executor; a maintainer dispatches a real-model run for final review.
7. After the suite is stable, `/skill-promote` (or `/agent-promote`) bumps it from `expanded` to `pilot` tier.
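
For step 3, the local command can be derived from the suite name alone. The sketch below is a minimal illustration, assuming only the path layout and flags quoted in step 3; the helper names `eval_path` and `waza_run_command` are hypothetical and not part of the harness.

```python
import shlex
from pathlib import PurePosixPath


def eval_path(name: str) -> PurePosixPath:
    """Return the suite path from step 3: .github/evals/<name>/eval.yaml."""
    # Hypothetical guard: a suite name is a single directory component.
    if not name or "/" in name:
        raise ValueError(f"invalid suite name: {name!r}")
    return PurePosixPath(".github/evals") / name / "eval.yaml"


def waza_run_command(name: str) -> list[str]:
    """Build the argv for the local run quoted in step 3."""
    return ["waza", "run", str(eval_path(name)), "-v"]


if __name__ == "__main__":
    # Print the command for one of the good-first-issue skills.
    print(shlex.join(waza_run_command("azure-role-selector")))
```

The helper only builds the command; it does not invoke `waza`, so it runs without the harness installed or `copilot login` completed.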

## Skills

Good-first-issue candidates (no Azure CLI / no live calls):

- Skill: `azure-naming-research`
- Skill: `azure-rest-api-reference`
- Skill: `azure-role-selector`

Standard skills:

- Skill: `azure-cost-estimator`
- Skill: `azure-policy-advisor`
- Skill: `azure-security-analyzer`
- Skill: `git-ape-onboarding` (most complex — defer until others land)

(`prereq-check` ships with the harness PR as the proof-of-pipe — no separate sub-issue needed.)

## Agents

Good-first-issue candidates:

- Agent: `azure-principal-architect`

Standard agents:

- Agent: `azure-iac-exporter`
- Agent: `azure-policy-advisor`
- Agent: `azure-requirements-gatherer`
- Agent: `azure-resource-deployer` (safety-sensitive — grade refusal / plan-only path, not real deploy)
- Agent: `azure-template-generator`
- Agent: `git-ape` (orchestrator — depends on most sub-agent suites being stable first)
- Agent: `git-ape-onboarding`

## Conventions

- One suite per PR. Don't bundle.
- Use the authoring prompts (`/skill-bench` or `/agent-bench`, then `/skill-improve` or `/agent-improve`); don't hand-write YAML.
- Default new suites to `expanded` tier in `manifest.yaml`; promote after at least one clean real-model run.
- Fork PRs can run only the mock executor. Real-model runs are maintainer-dispatched.

## Related

- Eval harness work: #61

