feat(evals): introduce waza eval harness for skills and agents by arnaudlh · Pull Request #109 · Azure/git-ape

arnaudlh · 2026-05-19T10:58:09Z

Summary

Introduces a microsoft/waza-based eval harness for skills and custom agents. This is the harness skeleton — it lands the tooling so per-skill and per-agent suites can be added one-at-a-time afterward via #93.

Closes #61.

What lands

File	Purpose
`.waza.yaml`	Waza config — `copilot-sdk` executor, `claude-sonnet-4.6` default, 1000-token warn / 1300-token block budgets.
`.github/evals/manifest.yaml`	Skill matrix source of truth. Seeded with only `prereq-check` so the first PR-comment cycle proves the pipe end-to-end.
`.github/evals/prereq-check/`	Skill suite (eval.yaml + 3 tasks). Doubles as smoke test for the harness.
`.github/prompts/`	6 prompts (skill + agent × bench / improve / promote) that contributors invoke locally to author and tune suites.
`.github/workflows/waza-evals.yml`	PR-mode skills runner with the manifest-driven matrix, per-leg retry wrapper, and PR-comment fan-in.
`.github/workflows/waza-agent-evals.yml`	PR-mode agents runner. Filesystem-discovered (`.github/evals/agents/<name>/`), no manifest. No-op until first agent suite lands.
`.github/actionlint.yaml`	Silences intentional SC2016 in the markdown-PR-comment `printf` blocks of the two waza workflows.
`.gitignore`	Adds `.waza-cache/`, `.waza-results/`, `*.waza-results.json`.

Intentionally not in this PR

waza-trends.yml, .waza-history/, lab-bench.json — this PR keeps the surface focused on per-PR feedback. Trend scoreboards can be added later as a follow-up if there's appetite.
Any skill or agent suite other than prereq-check — those land one PR per sub-issue under [meta] Eval suite coverage for agents and skills #93 so contributors can review them in isolation.

Maintainer setup required before workflows run

Both workflows need the COPILOT_GITHUB_TOKEN repo secret (Copilot-scoped PAT). The default GITHUB_TOKEN does not carry the scope copilot-sdk needs. Until that secret exists:

waza-evals.yml will fail at the copilot-sdk step (still safely surfaces in the PR comment as an infra failure).
waza-agent-evals.yml is a no-op anyway until the first agent suite is added.

Validation done locally

actionlint -color on all workflows → exit 0
yq eval on .waza.yaml, manifest, eval suite, all task files → OK

How to test after merge

Open a follow-up PR that touches .github/skills/prereq-check/SKILL.md (or any path the workflows trigger on). The skills workflow should:

Detect the change in prepare and emit a 4-leg matrix (claude-sonnet-4.6, gpt-5.4 baseline, gpt-5-codex, claude-opus-4.6).
Run prereq-check against each model and post a single PR comment with one section per leg.
Never block the merge — evals are always advisory.

Next steps

Tracked in #93 — 15 sub-issues, one per remaining suite (7 skills + 8 agents). 4 are tagged good first issue.

- port .waza.yaml + .github/prompts/ (skill+agent bench/improve/promote)
- add manifest.yaml seeded with prereq-check as harness smoke test
- port waza-evals + waza-agent-evals workflows with PR-comment fan-in
- extend actionlint.yaml to silence intentional SC2016 in waza workflows
- ignore .waza-cache/ and .waza-results/

🧪 - Generated by Copilot

github-actions · 2026-05-19T11:05:07Z

🤖 Waza agent evals (advisory)

🔁 Full matrix run. workflow file changed → full matrix

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

Without the secret, downstream jobs failed loudly with 403 (microsoft/waza is a private repo) and the agent matrix step crashed on the missing .github/evals/agents/ directory.

Changes:

- Add 'preflight' job in both waza-evals.yml and waza-agent-evals.yml that checks for the COPILOT_GITHUB_TOKEN secret and emits enabled=true|false.
- Gate prepare/tokens/eval/comment jobs on needs.preflight.outputs.enabled so they skip cleanly (gray, not red) until the maintainer adds the secret.
- Make the agent matrix step tolerate a missing .github/evals/agents/ directory by treating it as an empty agent list.

🧪 - Generated by Copilot

github-actions · 2026-05-19T11:07:24Z

🧪 Waza skill evals (advisory)

🔁 Full matrix run. project-wide config change (.waza.yaml, manifest, or workflow file) → full matrix

Ran 4 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-sonnet-4.6 across all legs.

📊 Token comparison vs main (advisory)

{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-05-25T03:20:52.068308741Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 33320,
    "totalDiff": 33320,
    "percentChange": 100,
    "filesAdded": 13,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 13,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3227,
        "characters": 11926,
        "lines": 344
      },
      "diff": 3227,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1444,
        "characters": 6267,
        "lines": 211
      },
      "diff": 1444,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1559,
        "characters": 6793,
        "lines": 247
      },
      "diff": 1559,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6233,
        "characters": 26754,
        "lines": 642
      },
      "diff": 6233,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2409,
        "characters": 9867,
        "lines": 307
      },
      "diff": 2409,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1490,
        "characters": 6165,
        "lines": 191
      },
      "diff": 1490,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1827,
        "characters": 8416,
        "lines": 199
      },
      "diff": 1827,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1276,
        "characters": 5627,
        "lines": 161
      },
      "diff": 1276,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5322,
        "characters": 21405,
        "lines": 450
      },
      "diff": 5322,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2730,
        "characters": 11072,
        "lines": 270
      },
      "diff": 2730,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2138,
        "characters": 8019,
        "lines": 147
      },
      "diff": 2138,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}

Skill: `prereq-check`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 1m55.328s

Tests: 4 total, 4 passed, 0 failed, 0 errors
Success Rate: 100.0%
Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	✅	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 281A:227B0A:165954E:1AD01F2:6A13C032)

✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 2m0.502s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	✅	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	0.89	❌	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 1/3 (error):

❌ answer_quality (0.00): fail: All four PASS criteria are missing: The assistant's response failed with "unexpected user permission response" errors and produced no useful output. It did not: (1) name any of the required tools (az, gh, jq, git), (2) provide any install command for az or any other tool, (3) recommend version verification commands, or (4) reach any verdict or next step. The response was entirely empty of actionable content.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json

Model: gpt-5-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [2/4] Negative — Azure service concept question
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [4/4] Positive — "What do I need to install?"
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [3/4] Positive — "command not found" failure
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [1/4] Negative — Editing an ARM template

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.00 | Duration: 319ms

Tests: 4 total, 0 passed, 4 failed, 0 errors
Success Rate: 0.0%
Score Range: 0.00 - 0.00 (σ=0.0000)

Task Results

Task	Status	Graders
Negative — Editing an ARM template	❌	-
Negative — Azure service concept question	❌	-
Positive — "command not found" failure	❌	-
Positive — "What do I need to install?"	❌	-

Failed Task Details

Negative — Editing an ARM template

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Negative — Azure service concept question

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "command not found" failure

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "What do I need to install?"

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5-codex

Results saved to: .waza-results/prereq-check-gpt-5-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: +25.0 percentage points

Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [NEUTRAL] 100% → 100% (+0pp)
• Positive — "What do I need to install?" [IMPROVED] 0% → 100% (+100pp)

Verdict: Skills have POSITIVE IMPACT (improved 1/4 tasks)
════════════════════════════════════════════════════════════════

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 1m53.373s

Tests: 4 total, 4 passed, 0 failed, 0 errors
Success Rate: 100.0%
Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	✅	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json
JUnit XML saved to: .waza-results/prereq-check-gpt-5.4.junit.xml

🔢 Tokens (count + profile)

📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
   ⚠️  token count 2138 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious from the description and Quick Reference table. Steps are numbered, ordered logically (detect → scan → check → auth → verdict), and each step references its implementation artifact. The status mapping and verdict taxonomy leave no ambiguity.
completeness       █████  Covers tool checks, version minimums, auth checks, platform variants (macOS/Linux/Windows), user-reported vs. terminal-detected discrepancies, error handling table with 8 specific scenarios, and a full constraints section. Edge cases like headless shells, expired sessions, and execution policy restrictions are explicitly addressed.
trigger_precision  ████░  USE FOR triggers are exhaustive and include both conceptual phrases and literal error strings (e.g., 'az: command not found'), which maximizes recall. The DO NOT USE FOR section is deliberately blunt ('Anything else'), which is clear but could briefly name 1-2 adjacent skills to prevent boundary confusion — e.g., 'not for azure-validate or deployment errors'.
scope_coverage     █████  Scope is tightly defined: read-only, four specific tools, two auth sessions, three platforms. Boundaries are explicit in both the Quick Reference ('Side effects: Read-only') and Constraints ('Never' list). The 'Next' section correctly hands off without overstepping into onboarding territory.
anti_patterns      █████  Avoids all common anti-patterns: no vague instructions ('exactly one of READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING'), no conflicting directives, error handling is concrete with cause+fix pairs, and Rule 4 explicitly prevents the dangerous auto-chaining anti-pattern. The 'stop at first blocking failure' rule prevents partial-state confusion.
────────────────────────────────────────────
Overall: 4.8/5.0

Exceptionally well-structured skill definition. It is precise, actionable, and defensively designed — covering platform detection, user-reported vs. terminal-detected discrepancies, and a clear verdict taxonomy. The only minor gap is that DO NOT USE FOR could name 1-2 adjacent skills to sharpen routing at the edges, but this is a negligible omission in an otherwise exemplary skill.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2138 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2138 / 500 tokens
   ❌  Exceeds limit by 1638 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips

…t presence A token that exists but lacks access to the private microsoft/waza repo still produced a hard 403 during waza install. Upgrade the preflight check to a real HTTP call against the microsoft/waza releases API and gate on a 200 response so misconfigured tokens skip the same way absent tokens do. 🧪 - Generated by Copilot
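The gate this commit describes (skip when the token is absent or cannot reach the releases API, run only on a 200) can be sketched as below. This is a minimal illustration, not the workflow's actual step: `preflight_enabled`, `http_status`, and the injected `fetch_status` callable are hypothetical names, and the URL is an assumption based on GitHub's standard `GET /repos/{owner}/{repo}/releases` route.

```python
from typing import Callable, Optional
import urllib.error
import urllib.request

# Assumed endpoint: GitHub's standard releases route for the repo named above.
RELEASES_URL = "https://api.github.com/repos/microsoft/waza/releases"


def preflight_enabled(token: Optional[str], fetch_status: Callable[[str, str], int]) -> str:
    """Return "true" only when a token exists and the releases API answers 200."""
    if not token:
        return "false"  # absent secret: skip cleanly
    try:
        status = fetch_status(RELEASES_URL, token)
    except OSError:
        return "false"  # network failure is treated like a misconfigured token
    return "true" if status == 200 else "false"


def http_status(url: str, token: str) -> int:
    """Hypothetical fetcher: returns the HTTP status code, including error codes."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Injecting `fetch_status` keeps the decision logic testable without network access; in a real run `http_status` would be passed in.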

… command-not-found errors and PATH repair

github-actions · 2026-05-19T16:27:37Z

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

website/docs/skills/overview.md
website/docs/skills/prereq-check.md
website/docs/workflows/daily-repo-status-lock.md
website/docs/workflows/issue-triage-agent-lock.md
website/docs/workflows/pr-validation.md
website/docs/workflows/waza-agent-evals.md
website/docs/workflows/waza-evals.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

…B_TOKEN

… eval workflows

waza check expects eval.yaml colocated with SKILL.md and reports 'Not Found' when the layout separates them. Repo convention is .github/skills/<name>/SKILL.md + .github/evals/<name>/eval.yaml, so the warning was a false negative. Prepend a note above the captured waza check output explaining the layout and pointing reviewers to the Score section, which reflects the actual eval run.

Introduces stage 0 of the eval lifecycle: scaffold a brand-new .github/evals/<skill>/ suite (eval.yaml + positive/negative/off-topic tasks), register the skill at the expanded tier in manifest.yaml, and run a single-model smoke trial. Pauses for approval before appending to the manifest and before the smoke trial. Out of scope: editing SKILL.md (use /skill-improve) and promoting to the pilot tier (use /skill-promote). Also updates website/docs/authoring/prompts.md with a When-to-use row, a per-prompt section, and an updated lifecycle summary in the page intro and frontmatter description.

…amework

- add /agent-onboard prompt and AGENT/SKILL scaffold templates
- record waza harness decision in .github/evals/README.md
- document eval lifecycle in website/docs/authoring/framework.md
- wire mutable-by-* tags into /agent-improve gate and Locked column
- add CONTRIBUTING.md section for adding eval suites

🦍 - Generated by Copilot

sendtoshailesh

🧪 Local Validation Feedback — Waza Eval Harness

Tested locally with Waza CLI v0.31.0 on macOS/arm64. Here is a structured summary of findings.

✅ What Works Well

Reproducibility: 3× --no-cache runs with claude-sonnet-4.6 → identical 0.86 score, 0% variance
Workflow architecture: Proper job DAG (preflight→prepare→eval→comment), graceful degradation when secret missing
Security model: pull_request event (fork-safe), minimal permissions (contents:read, pull-requests:write)
Contributor prompts: All 6 prompts (skill/agent × bench/improve/promote) are well-structured
Advisory-only mode: Evals never block merges ✅

🔴 High Priority Findings

1. `trials_per_task: 1` causes non-deterministic pass/fail

Model	Local Result	CI Result
claude-sonnet-4.6	✅ 0.86, 3/3 pass	✅ pass
gpt-5-codex	✅ 0.86, 3/3 pass	✅ pass
gpt-5.4	⚠️ 0.74, 2/3 pass	✅ pass

gpt-5.4 failed locally on positive-what-do-i-need → criterion 4 missing (no install commands). Same model passes in CI. With a single trial, one flaky LLM response = permanent failure for that run.

Recommendation: trials_per_task: 3 with majority voting for production gating.
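One way to read "majority voting" here is sketched below. It is an illustration of the idea only, not a description of how waza aggregates trials; the function name `majority_pass` is hypothetical.

```python
def majority_pass(trial_results: list[bool]) -> bool:
    """A task passes when strictly more than half of its trials pass."""
    if not trial_results:
        raise ValueError("at least one trial is required")
    return sum(trial_results) * 2 > len(trial_results)


# One flaky response out of three trials no longer fails the task.
print(majority_pass([True, True, False]))  # True
# With a single trial, the same flaky response decides the outcome.
print(majority_pass([False]))  # False
```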

2. Insufficient negative task coverage

Current state: 2 positive + 1 negative task. A skill that always fires would pass 2 of 3 tasks, about 67% (above the 0.6 trigger_precision threshold).
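The arithmetic behind this concern can be checked with a short sketch. It assumes every task is weighted equally and that an always-firing skill passes every positive task and fails every negative one; the function name and the negative counts beyond the current suite are illustrative.

```python
def always_fire_pass_rate(positives: int, negatives: int) -> float:
    """Pass rate of a skill that triggers on every prompt."""
    total = positives + negatives
    if total == 0:
        raise ValueError("the suite needs at least one task")
    return positives / total


THRESHOLD = 0.6  # threshold quoted in this review

for negatives in (1, 2, 3):
    rate = always_fire_pass_rate(2, negatives)
    verdict = "above" if rate > THRESHOLD else "at or below"
    print(f"2 positive + {negatives} negative: {rate:.2f} ({verdict} threshold)")
```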

Recommendation: Add ≥3 negative scenarios, e.g.:

"Review this pull request for security issues" (code review, not prereq)
"Deploy my app to Azure Container Apps" (deployment, not prereq)
"What are the team's Q1 OKRs?" (org question, not tooling)

3. Token budget calibration mismatch

The .waza.yaml comment says:

"75th-percentile token count across all 34 skills, rounded up to nearest 50, capped at 1000"

But actual state: 13 skills, 75th percentile = 3,179 tokens (3.2× the 1000 threshold). Only 2/13 skills are under the 1300 fallback limit. This causes waza check to flag every skill as over-budget.

Question: Was this calibrated against a different/future skill set? If so, the comment should clarify. If not, thresholds need recalibration.
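The figures above can be reproduced from the per-skill token counts in the token comparison posted on this PR. The sketch below assumes linear interpolation between ranks for the percentile; a different percentile convention could give a slightly different value.

```python
import math

# Token counts for the 13 SKILL.md files, taken from the PR's token comparison.
SKILL_TOKENS = [3227, 1444, 3179, 1559, 486, 6233, 2409, 1490, 1827, 1276, 5322, 2730, 2138]


def percentile(values: list[int], pct: float) -> float:
    """Percentile using linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


p75 = percentile(SKILL_TOKENS, 75)
under_block_limit = sum(1 for tokens in SKILL_TOKENS if tokens < 1300)

print(p75)                       # 3179.0
print(round(p75 / 1000, 1))      # 3.2
print(under_block_limit)         # 2
```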

🟡 Medium Priority

4. `waza check` false negative on eval suite

Because eval.yaml lives in .github/evals/prereq-check/ (not colocated with SKILL.md), waza check reports "Evaluation Suite: Not Found". The PR comment correctly notes this, but it will confuse contributors following the readiness workflow.

5. Judge model same-family bias

claude-sonnet-4.6 judges claude-sonnet-4.6 responses. The meeting discussion about "makers vs checkers across model families" suggests this should eventually use a cross-family judge (e.g., GPT judges Claude, Claude judges GPT).

6. Aggregate score 0.86 for "all pass" may confuse reviewers

The negative task scores 0.57 (correct behavior — skill correctly did NOT trigger, but the score reflects imperfect isolation). Readers seeing <1.0 on "all pass" might think something is wrong.
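The reported aggregates are consistent with an unweighted mean of the per-task scores, which is why an all-pass run lands below 1.0. The sketch below shows that reading; it is an inference from the numbers, not confirmed waza behavior, and the 1.00 scores for the two positive tasks in the three-task run are an assumption.

```python
from statistics import mean


def aggregate(task_scores: list[float]) -> float:
    """Unweighted mean of task scores, rounded as in the eval summary."""
    return round(mean(task_scores), 2)


# Three-task local run: one negative at 0.57, two positives assumed at 1.00.
print(aggregate([0.57, 1.00, 1.00]))        # 0.86
# Four-task CI run: task scores from the results table in the PR comment.
print(aggregate([0.57, 0.60, 1.00, 1.00]))  # 0.79
```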

📊 Cost Data (for scalability planning)

Model	Premium Requests	Input Tokens	Duration
claude-sonnet-4.6	10	291K	62s
gpt-5-codex	12	350K	65s
gpt-5.4	14	365K	~145s

At full scale (13 skills × 4 models): ~650 premium requests per full matrix run.
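A rough check of this estimate, using the measured premium-request counts from the table above. The fourth model was not measured locally, so applying the three-model average to every leg is an assumption; the result is in the same range as the ~650 quoted.

```python
# Premium requests per single-skill run, from the cost table above.
PREMIUM_REQUESTS = {"claude-sonnet-4.6": 10, "gpt-5-codex": 12, "gpt-5.4": 14}


def estimate_full_matrix(skills: int, models: int, requests_per_leg: float) -> int:
    """Total premium requests if every skill runs once against every model."""
    return round(skills * models * requests_per_leg)


average_per_leg = sum(PREMIUM_REQUESTS.values()) / len(PREMIUM_REQUESTS)

print(average_per_leg)                               # 12.0
print(estimate_full_matrix(13, 4, average_per_leg))  # 624
print(estimate_full_matrix(13, 4, 14))               # 728 (every leg as costly as gpt-5.4)
```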

Local Test Steps Used

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
gh repo clone Azure/git-ape && cd git-ape
git checkout -b feat/eval-harness origin/feat/eval-harness
waza tokens count .github/skills/
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 --no-cache
waza run .github/evals/prereq-check/eval.yaml --model gpt-5-codex --no-cache

Overall: solid foundation for the eval harness. The critical items above would strengthen reliability before scaling to all 13 skills + 8 agents.

sendtoshailesh · 2026-05-21T15:52:47Z

🔄 Updated Review — New Commits Acknowledged

Noticed two new commits since my initial review:

c2e5ada — /skill-onboard maintainer prompt + docs
b819016 — /agent-onboard prompt, decision doc, authoring framework

✅ New additions that address earlier feedback

1. /skill-onboard now scaffolds negatives by default — the prompt generates negativeTasks (default 1, capped at 2) + an off-topic negative. This partially addresses my "insufficient negative coverage" concern, though the default of 1 negative + 1 off-topic (total 2) against 2 positives still allows an always-firing skill to pass. Consider defaulting to negativeTasks=2 + 1 off-topic = 3 negatives.

2. Decision doc (.github/evals/README.md) — excellent. Clearly explains why waza was chosen over openai/evals and custom Node harness. The lifecycle summary (onboard → bench → improve → promote) gives contributors a clear path.

3. CONTRIBUTING.md section — good onboarding for new contributors.

🟡 Observations on new additions

/skill-onboard cost notice says 5-8 premium requests per invocation — this is per-skill scaffolding, not per-PR-run. Worth noting that the total eval cost across all skills (when all 13 are onboarded) will be much higher. Consider adding a cost table to the decision doc.

Agent eval mirror pattern — waza-agent-evals.yml auto-syncs the canonical .agent.md into the eval directory before each run. This is clever but means the eval mirror can drift from the source between runs. The sync step mitigates this, but worth documenting as a design choice.
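The sync step discussed here amounts to copying the canonical agent file into its eval directory before each run. The sketch below is illustrative only: `sync_agent_mirror` is a hypothetical helper, and the assumption that the mirror keeps the same filename as the canonical file is not confirmed by the workflow.

```python
import shutil
from pathlib import Path


def sync_agent_mirror(repo_root: Path, agent_name: str) -> Path:
    """Copy .github/agents/<name>.agent.md into .github/evals/agents/<name>/."""
    source = repo_root / ".github" / "agents" / f"{agent_name}.agent.md"
    mirror_dir = repo_root / ".github" / "evals" / "agents" / agent_name
    mirror_dir.mkdir(parents=True, exist_ok=True)
    target = mirror_dir / source.name  # assumed: mirror reuses the canonical filename
    shutil.copyfile(source, target)    # overwrites any stale mirror
    return target
```

Because the copy overwrites the mirror on every run, any drift between runs is discarded at the moment the eval starts.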

Previous feedback still applies

My 3 critical findings from the earlier review remain relevant:

⚠️ trials_per_task: 1 — still a flakiness risk (unaddressed)
⚠️ Negative coverage defaults — improved by onboard prompt, but defaults could be higher
⚠️ Token budget calibration comment — still references "34 skills"

Overall the PR is maturing well. The authoring framework and lifecycle prompts significantly lower the bar for contributors.

…checks

… negative trigger task

Validated locally against the full prereq-check eval suite (4 tasks × 3 trials, claude-sonnet-4.6): 100% pass, stddev 0.0000, schema unchanged.

Also rewrite the pin comment to record the real failure mode that motivated the original pin (commit f0d44ea): GitHub's releases/latest endpoint returned the sibling azd-extension tag (azd-ext-microsoft-azd-waza_0.33.0) instead of v0.33.0, causing a 404 on every download. v0.33.0's actual binary was never exercised. Pinning to v0.33.0 directly avoids the tag-resolution path entirely.

Notable upsides shipped between 0.31 and 0.33:

- PR #251: prompt grader preserves grades when follow-up turn fails (mitigates documented continue_session JSON-RPC flakiness).
- PR #258: prompt graders route through CopilotEngine for consistency.
- Bundled copilot-cli 1.0.2 -> 1.0.49.
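The failure mode recorded in this commit, a "latest" lookup resolving to a sibling tag, can be illustrated with a small check. This is a sketch under assumptions: the `vX.Y.Z` pattern is inferred from the tag named above, and both function names are hypothetical rather than part of the workflow.

```python
import re

# Assumed shape of CLI release tags, inferred from v0.33.0.
CLI_TAG = re.compile(r"v\d+\.\d+\.\d+")


def is_cli_release_tag(tag: str) -> bool:
    """True for tags shaped like v0.33.0, False for sibling tags."""
    return CLI_TAG.fullmatch(tag) is not None


def pinned_tag(version: str) -> str:
    """Build the tag from a pinned version instead of resolving 'latest'."""
    tag = f"v{version}"
    if not is_cli_release_tag(tag):
        raise ValueError(f"not a CLI release version: {version!r}")
    return tag


print(is_cli_release_tag("azd-ext-microsoft-azd-waza_0.33.0"))  # False
print(pinned_tag("0.33.0"))                                      # v0.33.0
```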

sendtoshailesh · 2026-05-25T06:28:33Z

✅ Updated Local Validation — All Critical Findings Resolved

Re-tested locally with Waza CLI v0.33.0 on macOS/arm64 (commit 3684b5e).

📊 Results Summary

Model	Pass Rate	Aggregate	Std Dev	Tasks	Trials	Duration
claude-sonnet-4.6	100% (4/4)	0.79	0.0000	4	3/task	2m10s
gpt-5.4	100% (4/4)	0.79	0.0000	4	3/task	2m17s

✅ Critical Findings — All Addressed

#	Original Finding	Status
1	`trials_per_task: 1` causes flakiness	✅ Fixed — `trials_per_task: 3`, gpt-5.4 now passes 100% (was 66%) with zero variance
2	Insufficient negative coverage	✅ Fixed — 2 positive + 2 negative tasks; negatives score 0.57/0.60 (well below 0.50 trigger threshold)
3	Token budget comment ("34 skills")	✅ Clarified — `.waza.yaml` now documents aspirational caps vs actual corpus (13 skills, p75 ≈ 3.2k)
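To see why a single trial cannot surface flakiness, here is a minimal Python sketch of per-task trial statistics. It assumes each trial is reduced to a pass/fail outcome and that spread is measured as a population standard deviation. Waza's real scoring may differ, and `summarize_trials` is a hypothetical helper, not part of the CLI.

```python
from statistics import mean, pstdev


def summarize_trials(outcomes):
    """Summarize pass/fail outcomes for one task across repeated trials."""
    scores = [1.0 if passed else 0.0 for passed in outcomes]
    spread = pstdev(scores)  # population standard deviation; 0.0 for one trial
    return {
        "pass_rate": mean(scores),
        "stddev": spread,
        "flaky": spread > 0.0,  # trials disagreed with each other
    }


# One trial: the spread is always zero, so disagreement is invisible.
print(summarize_trials([True]))

# Three trials: an intermittent failure shows up as non-zero spread.
print(summarize_trials([True, False, True]))

# Three agreeing trials: zero spread, as in the results table above.
print(summarize_trials([True, True, True]))
```

With one trial the `flaky` flag can never be set, which is the limitation that raising `trials_per_task` to 3 is meant to address.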

🆕 Additional Improvements Since Last Review

Waza v0.33.0 pinned with documented rationale (avoids releases/latest 404 from sibling tag)
/skill-onboard and /agent-onboard prompts scaffold negatives by default
Decision doc (.github/evals/README.md) explains waza choice + lifecycle
Authoring guide lowers contributor bar significantly

🟡 Non-blocking Observations

Aggregate 0.79 ≠ failure — negative tasks intentionally score <1.0 (trigger correctly not firing). Consider adding a one-line explanation in the CI PR comment for reviewers unfamiliar with the scoring model.
waza check reports token overage (2138 vs 500 limit) — this is the aspirational cap working as intended. No action needed.
Same-family judge bias — future work, not blocking for harness MVP.
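As a rough check on the first observation, the 0.79 aggregate is consistent with a plain mean over the four task scores. This Python sketch assumes both positive tasks score 1.0 (the review does not state this) and that the aggregate is an unweighted mean (also an assumption). The 0.57 and 0.60 values are the negative-task scores reported in the findings table, and the task keys are placeholder names.

```python
from statistics import mean

# Per-task scores. The positive scores of 1.0 are an assumption;
# the negative scores are the values reported in the review.
task_scores = {
    "positive_1": 1.0,
    "positive_2": 1.0,
    "negative_1": 0.57,
    "negative_2": 0.60,
}

# Assumed aggregation: unweighted mean over all tasks.
aggregate = mean(task_scores.values())
print(f"{aggregate:.2f}")  # 0.79
```

Under these assumptions every task can pass while the aggregate stays below 1.0, which is the point a one-line note in the CI PR comment would make for reviewers.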

🧪 Local Test Commands

waza --version  # 0.33.0
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 -v

Verdict

Ready for merge. The eval harness is stable, reproducible, and well-documented. All 3 critical items from my earlier review are resolved. Remaining items are UX polish suitable for follow-up PRs.

arnaudlh added the enhancement (New feature or request) and AI-evals (All things related to agent and skills evaluation) labels May 19, 2026

arnaudlh self-assigned this May 19, 2026

arnaudlh and others added 2 commits May 19, 2026 19:11

feat(evals): enhance prereq-check with detailed guidance for reported command-not-found errors and PATH repair

7b5f631

arnaudlh added 3 commits May 20, 2026 07:46

docs(authoring): add authoring guide for skills, agents, evals, prompts

5e3bea4

ci(waza): surface 403 diagnostics in preflight to debug COPILOT_GITHUB_TOKEN

19df70f

docs: regenerate auto-generated pages for prereq-check skill and waza eval workflows

fd8fb3e

arnaudlh requested a review from sendtoshailesh May 20, 2026 00:11

github-actions Bot mentioned this pull request May 20, 2026

[repo-status] 🐒 Git-Ape Daily Status — May 20, 2026 #112

Closed

arnaudlh added 2 commits May 21, 2026 11:18

arnaudlh requested review from dawright22 and suuus May 21, 2026 11:18

sendtoshailesh reviewed May 21, 2026

arnaudlh added 3 commits May 22, 2026 14:09

feat(prereq-check): enhance skill documentation and add tool version checks

2e44ac8

feat(waza): pin WAZA_VERSION to a known-good release for stability

f0d44ea

feat(eval): update trials_per_task to enhance flake detection and add negative trigger task

5301dde

arnaudlh added this to the v0.1.0 milestone May 22, 2026

arnaudlh requested a review from sendtoshailesh May 22, 2026 09:30

This was referenced May 22, 2026

[repo-status] 🐒 Git-Ape Daily Status — May 22, 2026 #114

Closed

[repo-status] 🐒 Git-Ape Daily Status — May 23, 2026 #115

Closed

[repo-status] 🐒 Git-Ape Daily Status — May 24, 2026 #117

Closed

sendtoshailesh approved these changes May 25, 2026

arnaudlh merged commit f9e5d20 into main May 25, 2026
20 checks passed

arnaudlh deleted the feat/eval-harness branch May 25, 2026 07:35
