feat(evals): introduce waza eval harness for skills and agents #109
- port .waza.yaml + .github/prompts/ (skill+agent bench/improve/promote)
- add manifest.yaml seeded with prereq-check as harness smoke test
- port waza-evals + waza-agent-evals workflows with PR-comment fan-in
- extend actionlint.yaml to silence intentional SC2016 in waza workflows
- ignore .waza-cache/ and .waza-results/
🧪 - Generated by Copilot
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
Without the secret, downstream jobs failed loudly with 403 (microsoft/waza is a private repo) and the agent matrix step crashed on the missing .github/evals/agents/ directory.
Changes:
- Add 'preflight' job in both waza-evals.yml and waza-agent-evals.yml that checks for the COPILOT_GITHUB_TOKEN secret and emits enabled=true|false.
- Gate prepare/tokens/eval/comment jobs on needs.preflight.outputs.enabled so they skip cleanly (gray, not red) until the maintainer adds the secret.
- Make the agent matrix step tolerate a missing .github/evals/agents/ directory by treating it as an empty agent list.
🧪 - Generated by Copilot
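As an illustration of the "missing directory means empty agent list" behaviour described in this commit, here is a minimal Python sketch. The actual workflow step is presumably shell inside the YAML; the function name `list_agent_suites` and the assumption that each agent suite is a subdirectory of `agents/` are hypothetical.

```python
from pathlib import Path


def list_agent_suites(evals_root: str) -> list[str]:
    """Return agent suite names, or [] when the agents directory is absent.

    Hypothetical helper: assumes one subdirectory per agent under
    <evals_root>/agents/, mirroring .github/evals/agents/<name>/.
    """
    agents_dir = Path(evals_root) / "agents"
    if not agents_dir.is_dir():
        # A missing directory means "no agent suites yet", not an error.
        return []
    return sorted(p.name for p in agents_dir.iterdir() if p.is_dir())
```

With this shape, an empty list naturally produces an empty matrix, so the eval job has nothing to run instead of crashing.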
🧪 Waza skill evals (advisory)
Ran 4 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 281A:227B0A:165954E:1AD01F2:6A13C032)
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 2m0.502s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.57 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: All four PASS criteria are missing: The assistant's response failed with "unexpected user permission response" errors and produced no useful output. It did not: (1) name any of the required tools (az, gh, jq, git), (2) provide any install command for az or any other tool, (3) recommend version verification commands, or (4) reach any verdict or next step. The response was entirely empty of actionable content.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
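The flaky-task line (67% pass rate, score=0.89±0.16) is consistent with the Run 1 grader scores shown here if two things are assumed: a run's score is the unweighted mean of its grader scores, and the spread is a population standard deviation. Neither is confirmed by the report, and runs 2 and 3 are assumed to be clean passes. A short check under those assumptions:

```python
from statistics import mean, pstdev

# Run 1 grader scores are copied from the report above.
# Runs 2 and 3 are ASSUMED to be clean passes (all graders 1.00).
runs = [
    {"answer_quality": 0.00, "budget": 1.00, "trigger_relevance_positive": 1.00},
    {"answer_quality": 1.00, "budget": 1.00, "trigger_relevance_positive": 1.00},
    {"answer_quality": 1.00, "budget": 1.00, "trigger_relevance_positive": 1.00},
]
# Pass flags per run: run 1 errored, the other two are assumed to pass.
run_passed = [False, True, True]

# ASSUMPTION: a run's score is the unweighted mean of its grader scores.
run_scores = [mean(r.values()) for r in runs]

pass_rate = sum(run_passed) / len(run_passed)
task_score = mean(run_scores)
spread = pstdev(run_scores)  # population standard deviation (assumed)

print(f"{pass_rate:.0%} pass rate, score={task_score:.2f}±{spread:.2f}")
```

A sample standard deviation would give roughly 0.19 instead, which is why the population form is the assumed one here.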
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
Model: gpt-5-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [2/4] Negative — Azure service concept question
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [4/4] Positive — "What do I need to install?"
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [3/4] Positive — "command not found" failure
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [1/4] Negative — Editing an ARM template
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.00 | Duration: 319ms
- Tests: 4 total, 0 passed, 4 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.00 - 0.00 (σ=0.0000)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.00 | ❌ | - |
| Negative — Azure service concept question | 0.00 | ❌ | - |
| Positive — "command not found" failure | 0.00 | ❌ | - |
| Positive — "What do I need to install?" | 0.00 | ❌ | - |
Failed Task Details
Negative — Editing an ARM template
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Negative — Azure service concept question
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Positive — "command not found" failure
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Positive — "What do I need to install?"
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5-codex
Results saved to: .waza-results/prereq-check-gpt-5-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: +25.0 percentage points
Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [NEUTRAL] 100% → 100% (+0pp)
• Positive — "What do I need to install?" [IMPROVED] 0% → 100% (+100pp)
Verdict: Skills have POSITIVE IMPACT (improved 1/4 tasks)
════════════════════════════════════════════════════════════════
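The numbers in the impact analysis above can be reproduced from the per-task pass rates. This is only a sketch: the `classify` helper, the REGRESSED label, and the choice to average per-task pass rates for the overall figure are assumptions, not waza's documented logic.

```python
# Per-task pass rates in percent, copied from the breakdown above:
# (without skills, with skills)
tasks = {
    "Negative: Editing an ARM template": (100, 100),
    "Negative: Azure service concept question": (100, 100),
    'Positive: "command not found" failure': (100, 100),
    'Positive: "What do I need to install?"': (0, 100),
}


def classify(without_skill: float, with_skill: float) -> str:
    """Label a task by the sign of its delta (REGRESSED is hypothetical)."""
    delta = with_skill - without_skill
    if delta > 0:
        return "IMPROVED"
    if delta < 0:
        return "REGRESSED"
    return "NEUTRAL"


labels = {name: classify(wo, w) for name, (wo, w) in tasks.items()}

# ASSUMPTION: the overall figure is the mean of per-task pass rates.
overall_without = sum(wo for wo, _ in tasks.values()) / len(tasks)
overall_with = sum(w for _, w in tasks.values()) / len(tasks)
impact_pp = overall_with - overall_without

improved = sum(label == "IMPROVED" for label in labels.values())
print(f"With: {overall_with:.1f}%  Without: {overall_without:.1f}%  "
      f"Impact: {impact_pp:+.1f}pp  (improved {improved}/{len(tasks)} tasks)")
```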
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.79 | Duration: 1m53.373s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.57 - 1.00 (σ=0.2074)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
JUnit XML saved to: .waza-results/prereq-check-gpt-5.4.junit.xml
🔢 Tokens (count + profile)
📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
⚠️ token count 2138 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the description and Quick Reference table. Steps are numbered, ordered logically (detect → scan → check → auth → verdict), and each step references its implementation artifact. The status mapping and verdict taxonomy leave no ambiguity.
completeness █████ Covers tool checks, version minimums, auth checks, platform variants (macOS/Linux/Windows), user-reported vs. terminal-detected discrepancies, error handling table with 8 specific scenarios, and a full constraints section. Edge cases like headless shells, expired sessions, and execution policy restrictions are explicitly addressed.
trigger_precision ████░ USE FOR triggers are exhaustive and include both conceptual phrases and literal error strings (e.g., 'az: command not found'), which maximizes recall. The DO NOT USE FOR section is deliberately blunt ('Anything else'), which is clear but could briefly name 1-2 adjacent skills to prevent boundary confusion — e.g., 'not for azure-validate or deployment errors'.
scope_coverage █████ Scope is tightly defined: read-only, four specific tools, two auth sessions, three platforms. Boundaries are explicit in both the Quick Reference ('Side effects: Read-only') and Constraints ('Never' list). The 'Next' section correctly hands off without overstepping into onboarding territory.
anti_patterns █████ Avoids all common anti-patterns: no vague instructions ('exactly one of READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING'), no conflicting directives, error handling is concrete with cause+fix pairs, and Rule 4 explicitly prevents the dangerous auto-chaining anti-pattern. The 'stop at first blocking failure' rule prevents partial-state confusion.
────────────────────────────────────────────
Overall: 4.8/5.0
Exceptionally well-structured skill definition. It is precise, actionable, and defensively designed — covering platform detection, user-reported vs. terminal-detected discrepancies, and a clear verdict taxonomy. The only minor gap is that DO NOT USE FOR could name 1-2 adjacent skills to sharpen routing at the edges, but this is a negligible omission in an otherwise exemplary skill.
✅ Check (compliance summary) (59 lines)
ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2138 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2138 / 500 tokens
❌ Exceeds limit by 1638 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips
…t presence
A token that exists but lacks access to the private microsoft/waza repo still produced a hard 403 during waza install. Upgrade the preflight check to a real HTTP call against the microsoft/waza releases API and gate on a 200 response so misconfigured tokens skip the same way absent tokens do.
🧪 - Generated by Copilot
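The upgraded preflight logic can be sketched in Python as follows. The real check presumably lives in a workflow shell step; the exact endpoint (`/releases` on the GitHub REST API), the helper names, and the injectable `fetch_status` parameter are assumptions made for illustration and testability.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# ASSUMED endpoint: the commit only says "the microsoft/waza releases API".
RELEASES_URL = "https://api.github.com/repos/microsoft/waza/releases"


def http_status(url: str, token: str) -> int:
    """Return the HTTP status of an authenticated GET (performs a network call)."""
    request = urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 or 404 for a token without repo access
    except urllib.error.URLError:
        return 0  # network failure: treat as "not enabled"


def preflight_enabled(
    token: Optional[str],
    fetch_status: Callable[[str, str], int] = http_status,
) -> bool:
    """True only when the token exists AND the releases API answers 200."""
    if not token:
        return False  # absent secret: downstream jobs skip
    return fetch_status(RELEASES_URL, token) == 200
```

Passing a fake `fetch_status` keeps the decision logic checkable without network access; a misconfigured token and an absent token both end in `False`, which is the behaviour the commit describes.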
… command-not-found errors and PATH repair
waza check expects eval.yaml colocated with SKILL.md and reports 'Not Found' when the layout separates them. Repo convention is .github/skills/<name>/SKILL.md + .github/evals/<name>/eval.yaml, so the warning was a false negative. Prepend a note above the captured waza check output explaining the layout and pointing reviewers to the Score section, which reflects the actual eval run.
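The repo convention described here maps each skill to its eval suite by name. A small Python sketch of that mapping, where `eval_path_for` is a hypothetical helper and not part of waza:

```python
from pathlib import PurePosixPath


def eval_path_for(skill_md: str) -> PurePosixPath:
    """Map .github/skills/<name>/SKILL.md to .github/evals/<name>/eval.yaml."""
    skill_name = PurePosixPath(skill_md).parent.name
    return PurePosixPath(".github/evals") / skill_name / "eval.yaml"


print(eval_path_for(".github/skills/prereq-check/SKILL.md"))
```

Because waza check looks next to SKILL.md rather than at this mapped path, its "Not Found" line says nothing about whether the suite exists.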
Introduces stage 0 of the eval lifecycle: scaffold a brand-new .github/evals/<skill>/ suite (eval.yaml + positive/negative/off-topic tasks), register the skill at the expanded tier in manifest.yaml, and run a single-model smoke trial. Pauses for approval before appending to the manifest and before the smoke trial. Out of scope: editing SKILL.md (use /skill-improve) and promoting to the pilot tier (use /skill-promote). Also updates website/docs/authoring/prompts.md with a When-to-use row, a per-prompt section, and an updated lifecycle summary in the page intro and frontmatter description.
…amework
- add /agent-onboard prompt and AGENT/SKILL scaffold templates
- record waza harness decision in .github/evals/README.md
- document eval lifecycle in website/docs/authoring/framework.md
- wire mutable-by-* tags into /agent-improve gate and Locked column
- add CONTRIBUTING.md section for adding eval suites
🦍 - Generated by Copilot
sendtoshailesh
left a comment
🧪 Local Validation Feedback — Waza Eval Harness
Tested locally with Waza CLI v0.31.0 on macOS/arm64. Here is a structured summary of findings.
✅ What Works Well
- Reproducibility: 3× --no-cache runs with claude-sonnet-4.6 → identical 0.86 score, 0% variance
- Workflow architecture: Proper job DAG (preflight→prepare→eval→comment), graceful degradation when secret missing
- Security model: pull_request event (fork-safe), minimal permissions (contents:read, pull-requests:write)
- Contributor prompts: All 6 prompts (skill/agent × bench/improve/promote) are well-structured
- Advisory-only mode: Evals never block merges ✅
🔴 High Priority Findings
1. trials_per_task: 1 causes non-deterministic pass/fail
| Model | Local Result | CI Result |
|---|---|---|
| claude-sonnet-4.6 | ✅ 0.86, 3/3 pass | ✅ pass |
| gpt-5-codex | ✅ 0.86, 3/3 pass | ✅ pass |
| gpt-5.4 | ❌ fail | ✅ pass |
gpt-5.4 failed locally on positive-what-do-i-need → criterion 4 missing (no install commands). Same model passes in CI. With a single trial, one flaky LLM response = permanent failure for that run.
Recommendation: trials_per_task: 3 with majority voting for production gating.
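A minimal sketch of what majority voting over trials could look like; `majority_pass` is a hypothetical helper, and how waza actually aggregates trials is not claimed here.

```python
def majority_pass(trial_results: list[bool]) -> bool:
    """A task passes when strictly more than half of its trials pass."""
    if not trial_results:
        raise ValueError("at least one trial is required")
    return sum(trial_results) * 2 > len(trial_results)


# One flaky response out of three no longer flips the verdict.
print(majority_pass([True, False, True]))   # True
print(majority_pass([False]))               # False: a single trial has no safety margin
```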
2. Insufficient negative task coverage
Current state: 2 positive + 1 negative task. A skill that always fires would score 66% (above the 0.6 trigger_precision threshold).
Recommendation: Add ≥3 negative scenarios, e.g.:
- "Review this pull request for security issues" (code review, not prereq)
- "Deploy my app to Azure Container Apps" (deployment, not prereq)
- "What are the team's Q1 OKRs?" (org question, not tooling)
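The arithmetic behind this finding, as a quick sketch. It assumes every task counts equally and that the 0.6 threshold is compared against the fraction of tasks handled correctly; the helper name is hypothetical.

```python
def always_fire_score(positives: int, negatives: int) -> float:
    """Fraction of tasks a skill gets right if it triggers on every prompt."""
    return positives / (positives + negatives)


THRESHOLD = 0.6  # trigger_precision threshold quoted in the review

current = always_fire_score(2, 1)         # 2 positive + 1 negative today
three_negatives = always_fire_score(2, 3)  # if the suite had 3 negatives in total
three_added = always_fire_score(2, 4)      # if 3 negatives were added to the existing one

print(f"{current:.2f} {three_negatives:.2f} {three_added:.2f}")
```

Under either reading of "add ≥3 negative scenarios", an always-firing skill drops below the threshold.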
3. Token budget calibration mismatch
The .waza.yaml comment says:
"75th-percentile token count across all 34 skills, rounded up to nearest 50, capped at 1000"
But actual state: 13 skills, 75th percentile = 3,179 tokens (3.2× the 1000 threshold). Only 2/13 skills are under the 1300 fallback limit. This causes waza check to flag every skill as over-budget.
Question: Was this calibrated against a different/future skill set? If so, the comment should clarify. If not, thresholds need recalibration.
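To make the mismatch concrete, the rule quoted from the .waza.yaml comment can be written out as below. The percentile method (`inclusive` quantiles) and the sample token counts are assumptions for illustration; only the 3,179 figure comes from the review.

```python
import math
from statistics import quantiles


def calibrated_budget(token_counts: list[int], step: int = 50, cap: int = 1000) -> int:
    """75th-percentile token count, rounded up to the nearest `step`, capped at `cap`."""
    # ASSUMPTION: "inclusive" quantiles; the comment does not name a method.
    p75 = quantiles(token_counts, n=4, method="inclusive")[2]
    rounded = math.ceil(p75 / step) * step
    return min(rounded, cap)


# Hypothetical token counts, not the repo's real skills.
print(calibrated_budget([110, 220, 330, 440, 550]))     # 450
print(calibrated_budget([400, 800, 1200, 1600, 2000]))  # 1000 (cap applies)

# The reviewer's measured 75th percentile rounds to 3200, so the cap always wins.
print(min(math.ceil(3179 / 50) * 50, 1000))             # 1000
```

When the cap always applies, the percentile no longer influences the budget, which is the calibration question raised above.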
🟡 Medium Priority
4. waza check false negative on eval suite
Because eval.yaml lives in .github/evals/prereq-check/ (not colocated with SKILL.md), waza check reports "Evaluation Suite: Not Found". The PR comment correctly notes this, but it will confuse contributors following the readiness workflow.
5. Judge model same-family bias
claude-sonnet-4.6 judges claude-sonnet-4.6 responses. The meeting discussion about "makers vs checkers across model families" suggests this should eventually use a cross-family judge (e.g., GPT judges Claude, Claude judges GPT).
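One way a cross-family judge could be selected, sketched under the assumption that the model family is the part of the name before the first hyphen. Both helpers are hypothetical and not a waza feature described in this PR.

```python
def model_family(model: str) -> str:
    """ASSUMPTION: the family is the prefix before the first '-'."""
    return model.split("-", 1)[0]


def pick_judge(candidate: str, judges: list[str]) -> str:
    """Return the first judge whose family differs from the candidate's."""
    for judge in judges:
        if model_family(judge) != model_family(candidate):
            return judge
    raise ValueError(f"no cross-family judge available for {candidate}")


judges = ["claude-sonnet-4.6", "gpt-5.4"]
print(pick_judge("claude-sonnet-4.6", judges))  # gpt-5.4
print(pick_judge("gpt-5-codex", judges))        # claude-sonnet-4.6
```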
6. Aggregate score 0.86 for "all pass" may confuse reviewers
The negative task scores 0.57 (correct behavior — skill correctly did NOT trigger, but the score reflects imperfect isolation). Readers seeing <1.0 on "all pass" might think something is wrong.
📊 Cost Data (for scalability planning)
| Model | Premium Requests | Input Tokens | Duration |
|---|---|---|---|
| claude-sonnet-4.6 | 10 | 291K | 62s |
| gpt-5-codex | 12 | 350K | 65s |
| gpt-5.4 | 14 | 365K | ~145s |
At full scale (13 skills × 4 models): ~650 premium requests per full matrix run.
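A back-of-the-envelope version of the full-scale estimate, using the three measured models from the table. It assumes every skill costs about as much as prereq-check and that the fourth model costs about the measured average; neither is established by the data above.

```python
from statistics import mean

# Premium requests per model, copied from the cost table.
premium_requests = {"claude-sonnet-4.6": 10, "gpt-5-codex": 12, "gpt-5.4": 14}


def full_matrix_requests(skills: int, models: int, per_leg: float) -> float:
    """Premium requests for one full matrix run at a flat per-leg cost."""
    return skills * models * per_leg


per_leg = mean(premium_requests.values())
estimate = full_matrix_requests(13, 4, per_leg)
print(per_leg, estimate)
```

The result lands a little under the ~650 quoted in the review, which corresponds to about 12.5 requests per leg.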
Local Test Steps Used
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
gh repo clone Azure/git-ape && cd git-ape
git checkout -b feat/eval-harness origin/feat/eval-harness
waza tokens count .github/skills/
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 --no-cache
waza run .github/evals/prereq-check/eval.yaml --model gpt-5-codex --no-cache
Overall: solid foundation for the eval harness. The critical items above would strengthen reliability before scaling to all 13 skills + 8 agents.
🔄 Updated Review: New Commits Acknowledged
Noticed two new commits since my initial review:
✅ New additions that address earlier feedback
1. …
2. Decision doc (…
3. CONTRIBUTING.md section: good onboarding for new contributors.
🟡 Observations on new additions
Agent eval mirror pattern: …
Previous feedback still applies
My 3 critical findings from the earlier review remain relevant:
Overall the PR is maturing well. The authoring framework and lifecycle prompts significantly lower the bar for contributors.
Validated locally against the full prereq-check eval suite (4 tasks × 3 trials, claude-sonnet-4.6): 100% pass, stddev 0.0000, schema unchanged.
Also rewrite the pin comment to record the real failure mode that motivated the original pin (commit f0d44ea): GitHub's releases/latest endpoint returned the sibling azd-extension tag (azd-ext-microsoft-azd-waza_0.33.0) instead of v0.33.0, causing a 404 on every download. v0.33.0's actual binary was never exercised. Pinning to v0.33.0 directly avoids the tag-resolution path entirely.
Notable upsides shipped between 0.31 and 0.33:
- PR #251: prompt grader preserves grades when follow-up turn fails (mitigates documented continue_session JSON-RPC flakiness).
- PR #258: prompt graders route through CopilotEngine for consistency.
- Bundled copilot-cli 1.0.2 -> 1.0.49.
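The tag-resolution failure described in this commit can be illustrated with a short sketch. The tag list and its ordering are hypothetical, as is the `cli_release_tags` helper; the commit itself sidesteps the problem by pinning rather than filtering.

```python
import re

# CLI releases are ASSUMED to use plain vMAJOR.MINOR.PATCH tags.
CLI_TAG = re.compile(r"^v\d+\.\d+\.\d+$")


def cli_release_tags(tags: list[str]) -> list[str]:
    """Keep only tags that look like CLI releases, dropping sibling tags."""
    return [tag for tag in tags if CLI_TAG.match(tag)]


# Hypothetical newest-first tag list; the first entry is the sibling tag
# that "latest" resolved to according to the commit message.
tags = ["azd-ext-microsoft-azd-waza_0.33.0", "v0.33.0", "v0.31.0"]

naive_latest = tags[0]         # what a "latest" lookup would pick up
pinned = "v0.33.0"             # what the workflow pins instead
print(naive_latest, cli_release_tags(tags), pinned in cli_release_tags(tags))
```

Pinning removes the dependency on how "latest" is resolved, which is simpler than filtering and is the route this commit takes.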
✅ Updated Local Validation: All Critical Findings Resolved
Re-tested locally with Waza CLI v0.33.0 on macOS/arm64 (commit …
📊 Results Summary
✅ Critical Findings — All Addressed
🆕 Additional Improvements Since Last Review
🟡 Non-blocking Observations
🧪 Local Test Commands
waza --version # 0.33.0
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 -v
Verdict
Ready for merge. The eval harness is stable, reproducible, and well-documented. All 3 critical items from my earlier review are resolved. Remaining items are UX polish suitable for follow-up PRs.
Summary
Introduces a microsoft/waza-based eval harness for skills and custom agents. This is the harness skeleton — it lands the tooling so per-skill and per-agent suites can be added one-at-a-time afterward via #93.
Closes #61.
What lands
- .waza.yaml: copilot-sdk executor, claude-sonnet-4.6 default, 1000-token warn / 1300-token block budgets.
- .github/evals/manifest.yaml: … prereq-check so the first PR-comment cycle proves the pipe end-to-end.
- .github/evals/prereq-check/
- .github/prompts/
- .github/workflows/waza-evals.yml
- .github/workflows/waza-agent-evals.yml: … .github/evals/agents/<name>/), no manifest. No-op until first agent suite lands.
- .github/actionlint.yaml: … printf blocks of the two waza workflows.
- .gitignore: .waza-cache/, .waza-results/, *.waza-results.json.
Intentionally not in this PR
- waza-trends.yml, .waza-history/, lab-bench.json: this PR keeps the surface focused on per-PR feedback. Trend scoreboards can be added later as a follow-up if there's appetite.
- … prereq-check: those land one PR per sub-issue under [meta] Eval suite coverage for agents and skills #93 so contributors can review them in isolation.
Maintainer setup required before workflows run
Both workflows need the COPILOT_GITHUB_TOKEN repo secret (Copilot-scoped PAT). The default GITHUB_TOKEN does not carry the scope copilot-sdk needs. Until that secret exists:
- waza-evals.yml will fail at the copilot-sdk step (still safely surfaces in the PR comment as an infra failure).
- waza-agent-evals.yml is a no-op anyway until the first agent suite is added.
Validation done locally
- actionlint -color on all workflows → exit 0
- yq eval on .waza.yaml, manifest, eval suite, all task files → OK
How to test after merge
Open a follow-up PR that touches .github/skills/prereq-check/SKILL.md (or any path the workflows trigger on). The skills workflow should:
1. … prepare and emit a 4-leg matrix (claude-sonnet-4.6, gpt-5.4 baseline, gpt-5-codex, claude-opus-4.6).
2. … prereq-check against each model, post a single <!-- waza-evals-comment --> PR comment with one section per leg.
Next steps
Tracked in #93: 15 sub-issues, one per remaining suite (7 skills + 8 agents). 4 are tagged good first issue.
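For reviewers checking the expected fan-out, the 4-leg matrix described under "How to test after merge" is the product of skills and models. The sketch below is illustrative only: the dictionary keys are hypothetical, while the result-file pattern follows the `.waza-results/<skill>-<model>.json` names that appear in the eval logs.

```python
from itertools import product

skills = ["prereq-check"]
models = ["claude-sonnet-4.6", "gpt-5.4", "gpt-5-codex", "claude-opus-4.6"]
BASELINE_MODEL = "gpt-5.4"  # the leg run in A/B (baseline) mode

# One leg per (skill, model) pair; key names are hypothetical.
legs = [
    {
        "skill": skill,
        "model": model,
        "baseline": model == BASELINE_MODEL,
        "results": f".waza-results/{skill}-{model}.json",
    }
    for skill, model in product(skills, models)
]

for leg in legs:
    print(leg)
```

As more suites land under #93, the leg count grows as skills × models, which is what drives the per-run cost.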