Goal
Track per-skill and per-agent eval suite authoring on top of the eval harness work tracked in #61. Each sub-issue covers one skill or one agent and ships as a small, independently reviewable PR using the contributor loop documented in CONTRIBUTING.md and docs/WAZA.md (both land with the harness PRs).
Contributor loop
- Pick an unclaimed sub-issue below and assign yourself.
- Run /skill-bench <name> or /agent-bench <name> to draft eval.yaml + tasks from the live SKILL.md / .agent.md.
- Run waza run .github/evals/<name>/eval.yaml -v locally (copilot-sdk executor, requires copilot login).
- Run /skill-improve <name> or /agent-improve <name> to iterate on graders and fix false positives.
- Open a PR adding the suite and a manifest.yaml entry.
- CI runs the mock executor; a maintainer dispatches a real-model run for final review.
- After the suite is stable, /skill-promote (or /agent-promote) bumps it from expanded to pilot tier.
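The local run step of the loop above can be wrapped in a small helper. This is only a sketch: the function names suite_path and run_suite are hypothetical, and the only command taken from this issue is waza run .github/evals/<name>/eval.yaml -v. It assumes it is run from the repository root.

```shell
#!/usr/bin/env sh
# Hypothetical helpers for the local run step; not part of the harness.

# Print the eval.yaml path for a skill or agent name, following the
# .github/evals/<name>/eval.yaml layout used in the loop above.
suite_path() {
  printf '.github/evals/%s/eval.yaml\n' "$1"
}

# Run a suite locally, failing early if it has not been drafted yet.
run_suite() {
  path="$(suite_path "$1")"
  if [ ! -f "$path" ]; then
    echo "no suite at $path; draft it with the bench prompt first" >&2
    return 1
  fi
  # The command below is the one given in the contributor loop.
  waza run "$path" -v
}
```

For example, run_suite azure-role-selector would run that suite if its eval.yaml exists, and return a non-zero status otherwise.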
Skills
Good-first-issue candidates (no Azure CLI / no live calls):
- Skill: azure-naming-research
- Skill: azure-rest-api-reference
- Skill: azure-role-selector
Standard skills:
- Skill: azure-cost-estimator
- Skill: azure-policy-advisor
- Skill: azure-security-analyzer
- Skill: git-ape-onboarding (most complex; defer until others land)
(prereq-check ships with the harness PR as the proof-of-pipe — no separate sub-issue needed.)
Agents
Good-first-issue candidates:
- Agent: azure-principal-architect
Standard agents:
- Agent: azure-iac-exporter
- Agent: azure-policy-advisor
- Agent: azure-requirements-gatherer
- Agent: azure-resource-deployer (safety-sensitive; grade the refusal / plan-only path, not a real deploy)
- Agent: azure-template-generator
- Agent: git-ape (orchestrator; depends on most sub-agent suites being stable first)
- Agent: git-ape-onboarding
Conventions
- One suite per PR. Don't bundle.
- Use the authoring prompts; don't hand-write YAML.
- Default new suites to expanded tier in manifest.yaml; promote after at least one clean real-model run.
- Fork PRs can run only the mock executor. Real-model runs are maintainer-dispatched.
Related