feat(evals): introduce waza eval harness for skills and agents #109

Merged: arnaudlh merged 14 commits into main from feat/eval-harness on May 25, 2026
arnaudlh (Member) commented May 19, 2026

Summary

Introduces a microsoft/waza-based eval harness for skills and custom agents. This is the harness skeleton — it lands the tooling so per-skill and per-agent suites can be added one-at-a-time afterward via #93.

Closes #61.

What lands

| File | Purpose |
| --- | --- |
| .waza.yaml | Waza config: copilot-sdk executor, claude-sonnet-4.6 default, 1000-token warn / 1300-token block budgets. |
| .github/evals/manifest.yaml | Skill matrix source of truth. Seeded with only prereq-check so the first PR-comment cycle proves the pipe end-to-end. |
| .github/evals/prereq-check/ | Skill suite (eval.yaml + 3 tasks). Doubles as smoke test for the harness. |
| .github/prompts/ | 6 prompts (skill + agent × bench / improve / promote) that contributors invoke locally to author and tune suites. |
| .github/workflows/waza-evals.yml | PR-mode skills runner with the manifest-driven matrix, per-leg retry wrapper, and PR-comment fan-in. |
| .github/workflows/waza-agent-evals.yml | PR-mode agents runner. Filesystem-discovered (.github/evals/agents/<name>/), no manifest. No-op until first agent suite lands. |
| .github/actionlint.yaml | Silences intentional SC2016 in the markdown-PR-comment printf blocks of the two waza workflows. |
| .gitignore | Adds .waza-cache/, .waza-results/, *.waza-results.json. |

Intentionally not in this PR

  • waza-trends.yml, .waza-history/, lab-bench.json — this PR keeps the surface focused on per-PR feedback. Trend scoreboards can be added later as a follow-up if there's appetite.
  • Any skill or agent suite other than prereq-check — those land one PR per sub-issue under [meta] Eval suite coverage for agents and skills #93 so contributors can review them in isolation.

Maintainer setup required before workflows run

Both workflows need the COPILOT_GITHUB_TOKEN repo secret (Copilot-scoped PAT). The default GITHUB_TOKEN does not carry the scope copilot-sdk needs. Until that secret exists:

  • waza-evals.yml will fail at the copilot-sdk step (still safely surfaces in the PR comment as an infra failure).
  • waza-agent-evals.yml is a no-op anyway until the first agent suite is added.

Validation done locally

  • actionlint -color on all workflows → exit 0
  • yq eval on .waza.yaml, manifest, eval suite, all task files → OK

How to test after merge

Open a follow-up PR that touches .github/skills/prereq-check/SKILL.md (or any path the workflows trigger on). The skills workflow should:

  1. Detect the change in prepare and emit a 4-leg matrix (claude-sonnet-4.6, gpt-5.4 baseline, gpt-5-codex, claude-opus-4.6).
  2. Run prereq-check against each model, post a single <!-- waza-evals-comment --> PR comment with one section per leg.
  3. Never block the merge — evals are always advisory.

Next steps

Tracked in #93 — 15 sub-issues, one per remaining suite (7 skills + 8 agents). 4 are tagged good first issue.

- port .waza.yaml + .github/prompts/ (skill+agent bench/improve/promote)
- add manifest.yaml seeded with prereq-check as harness smoke test
- port waza-evals + waza-agent-evals workflows with PR-comment fan-in
- extend actionlint.yaml to silence intentional SC2016 in waza workflows
- ignore .waza-cache/ and .waza-results/

🧪 - Generated by Copilot
arnaudlh added the enhancement and AI-evals labels May 19, 2026
arnaudlh self-assigned this May 19, 2026
github-actions (bot, Contributor) commented May 19, 2026

🤖 Waza agent evals (advisory)

🔁 Full matrix run. workflow file changed → full matrix

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

Without the secret, downstream jobs failed loudly with 403 (microsoft/waza is
a private repo) and the agent matrix step crashed on the missing
.github/evals/agents/ directory.

Changes:
- Add 'preflight' job in both waza-evals.yml and waza-agent-evals.yml that
  checks for the COPILOT_GITHUB_TOKEN secret and emits enabled=true|false.
- Gate prepare/tokens/eval/comment jobs on needs.preflight.outputs.enabled
  so they skip cleanly (gray, not red) until the maintainer adds the secret.
- Make the agent matrix step tolerate a missing .github/evals/agents/
  directory by treating it as an empty agent list.

🧪 - Generated by Copilot
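
The "tolerate a missing directory" change in the commit above can be sketched roughly as follows. This is a minimal illustration in Python, not the workflow's actual shell step; the helper name `discover_agents` is hypothetical.

```python
from pathlib import Path


def discover_agents(evals_root):
    """List agent suite names under <evals_root>/agents, or [] if the directory is absent."""
    agents_dir = Path(evals_root) / "agents"
    if not agents_dir.is_dir():
        # No agent suites have landed yet: return an empty matrix instead of crashing.
        return []
    return sorted(p.name for p in agents_dir.iterdir() if p.is_dir())
```

An empty list lets the matrix job become a no-op, which matches the "No agents in scope for this PR" result reported earlier.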
github-actions (bot, Contributor) commented May 19, 2026

🧪 Waza skill evals (advisory)

🔁 Full matrix run. project-wide config change (.waza.yaml, manifest, or workflow file) → full matrix

Ran 4 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-sonnet-4.6 across all legs.
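
As a rough illustration of the legend above, the sketch below expands a skills × models matrix and attaches `--baseline` only to models flagged `baseline: true`. The dictionary shape and the `build_legs` helper are assumptions made for this example; the real manifest schema and workflow logic are not shown in this thread.

```python
from itertools import product

# Hypothetical in-memory view of the manifest; the real file's schema may differ.
MANIFEST = {
    "skills": ["prereq-check"],
    "models": [
        {"name": "claude-sonnet-4.6", "baseline": False},
        {"name": "gpt-5.4", "baseline": True},
        {"name": "gpt-5-codex", "baseline": False},
        {"name": "claude-opus-4.6", "baseline": False},
    ],
}


def build_legs(manifest):
    """Return one matrix leg per (skill, model) pair."""
    legs = []
    for skill, model in product(manifest["skills"], manifest["models"]):
        legs.append({
            "skill": skill,
            "model": model["name"],
            # Only baseline-flagged models run in A/B mode.
            "extra_args": ["--baseline"] if model["baseline"] else [],
        })
    return legs


legs = build_legs(MANIFEST)
```

With one skill and four models this yields the 4 legs reported above, and only the gpt-5.4 leg carries the extra flag.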

📊 Token comparison vs main (advisory)
{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-05-25T03:20:52.068308741Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 33320,
    "totalDiff": 33320,
    "percentChange": 100,
    "filesAdded": 13,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 13,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3227,
        "characters": 11926,
        "lines": 344
      },
      "diff": 3227,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1444,
        "characters": 6267,
        "lines": 211
      },
      "diff": 1444,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1559,
        "characters": 6793,
        "lines": 247
      },
      "diff": 1559,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6233,
        "characters": 26754,
        "lines": 642
      },
      "diff": 6233,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2409,
        "characters": 9867,
        "lines": 307
      },
      "diff": 2409,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1490,
        "characters": 6165,
        "lines": 191
      },
      "diff": 1490,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1827,
        "characters": 8416,
        "lines": 199
      },
      "diff": 1827,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1276,
        "characters": 5627,
        "lines": 161
      },
      "diff": 1276,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5322,
        "characters": 21405,
        "lines": 450
      },
      "diff": 5322,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2730,
        "characters": 11072,
        "lines": 270
      },
      "diff": 2730,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2138,
        "characters": 8019,
        "lines": 147
      },
      "diff": 2138,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}
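
For readers who want to post-process a report like the one above, here is a minimal sketch. It assumes only the fields visible in the JSON (`files`, `after.tokens`, `limit`, `overLimit`); the helper names are hypothetical and the sample holds just two entries copied from the report.

```python
import json


def over_limit_files(report):
    """Return (file, tokens, limit) for every entry flagged overLimit."""
    return [
        (f["file"], f["after"]["tokens"], f["limit"])
        for f in report["files"]
        # The key is absent when a file is within its limit.
        if f.get("overLimit", False)
    ]


def total_after(report):
    """Recompute summary.totalAfter from the per-file entries."""
    return sum(f["after"]["tokens"] for f in report["files"] if f["after"])


# Two entries copied from the report above, for illustration.
sample = json.loads("""
{"files": [
  {"file": ".github/skills/azure-naming-research/SKILL.md",
   "after": {"tokens": 486}, "limit": 500},
  {"file": ".github/skills/prereq-check/SKILL.md",
   "after": {"tokens": 2138}, "limit": 500, "overLimit": true}
]}
""")
```

Applied to the full report, 12 of the 13 files are flagged, and the per-file token counts add up to the reported totalAfter of 33320.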

Skill: prereq-check

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 1m55.328s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 1.00 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 281A:227B0A:165954E:1AD01F2:6A13C032)

✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 2m0.502s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 0.89 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 1/3 (error):

  • answer_quality (0.00): fail: All four PASS criteria are missing: The assistant's response failed with "unexpected user permission response" errors and produced no useful output. It did not: (1) name any of the required tools (az, gh, jq, git), (2) provide any install command for az or any other tool, (3) recommend version verification commands, or (4) reach any verdict or next step. The response was entirely empty of actionable content.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json

Model: gpt-5-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [2/4] Negative — Azure service concept question
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [4/4] Positive — "What do I need to install?"
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [3/4] Positive — "command not found" failure
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [1/4] Negative — Editing an ARM template

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.00 | Duration: 319ms

  • Tests: 4 total, 0 passed, 4 failed, 0 errors
  • Success Rate: 0.0%
  • Score Range: 0.00 - 0.00 (σ=0.0000)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.00 -
Negative — Azure service concept question 0.00 -
Positive — "command not found" failure 0.00 -
Positive — "What do I need to install?" 0.00 -

Failed Task Details

Negative — Editing an ARM template

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Negative — Azure service concept question

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "command not found" failure

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "What do I need to install?"

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):


Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5-codex

Results saved to: .waza-results/prereq-check-gpt-5-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: +25.0 percentage points

Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [NEUTRAL] 100% → 100% (+0pp)
• Positive — "What do I need to install?" [IMPROVED] 0% → 100% (+100pp)

Verdict: Skills have POSITIVE IMPACT (improved 1/4 tasks)
════════════════════════════════════════════════════════════════

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 1m53.373s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 1.00 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json
JUnit XML saved to: .waza-results/prereq-check-gpt-5.4.junit.xml
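
The skill impact analysis in the gpt-5.4 leg above boils down to comparing per-task pass flags between the two passes. A minimal sketch of that arithmetic follows; the task identifiers are shortened, hypothetical keys, and this is not waza's own implementation.

```python
def skill_impact(with_skills, without_skills):
    """Compare per-task pass flags from the skills-enabled and baseline passes."""
    tasks = list(with_skills)
    rate_with = 100.0 * sum(with_skills[t] for t in tasks) / len(tasks)
    rate_without = 100.0 * sum(without_skills[t] for t in tasks) / len(tasks)
    return {
        "with": rate_with,
        "without": rate_without,
        # Delta in percentage points, not percent.
        "impact_pp": rate_with - rate_without,
        "improved": [t for t in tasks if with_skills[t] and not without_skills[t]],
        "regressed": [t for t in tasks if without_skills[t] and not with_skills[t]],
    }


# Pass flags transcribed from the two passes above (keys are illustrative).
with_skills = {
    "negative-arm-template": True,
    "negative-concept-question": True,
    "positive-command-not-found": True,
    "positive-what-do-i-need": True,
}
without_skills = dict(with_skills, **{"positive-what-do-i-need": False})

result = skill_impact(with_skills, without_skills)
```

This reproduces the reported 100.0% versus 75.0% and the +25.0 percentage point delta.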

🔢 Tokens (count + profile)

📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
   ⚠️  token count 2138 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious from the description and Quick Reference table. Steps are numbered, ordered logically (detect → scan → check → auth → verdict), and each step references its implementation artifact. The status mapping and verdict taxonomy leave no ambiguity.
completeness       █████  Covers tool checks, version minimums, auth checks, platform variants (macOS/Linux/Windows), user-reported vs. terminal-detected discrepancies, error handling table with 8 specific scenarios, and a full constraints section. Edge cases like headless shells, expired sessions, and execution policy restrictions are explicitly addressed.
trigger_precision  ████░  USE FOR triggers are exhaustive and include both conceptual phrases and literal error strings (e.g., 'az: command not found'), which maximizes recall. The DO NOT USE FOR section is deliberately blunt ('Anything else'), which is clear but could briefly name 1-2 adjacent skills to prevent boundary confusion — e.g., 'not for azure-validate or deployment errors'.
scope_coverage     █████  Scope is tightly defined: read-only, four specific tools, two auth sessions, three platforms. Boundaries are explicit in both the Quick Reference ('Side effects: Read-only') and Constraints ('Never' list). The 'Next' section correctly hands off without overstepping into onboarding territory.
anti_patterns      █████  Avoids all common anti-patterns: no vague instructions ('exactly one of READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING'), no conflicting directives, error handling is concrete with cause+fix pairs, and Rule 4 explicitly prevents the dangerous auto-chaining anti-pattern. The 'stop at first blocking failure' rule prevents partial-state confusion.
────────────────────────────────────────────
Overall: 4.8/5.0

Exceptionally well-structured skill definition. It is precise, actionable, and defensively designed — covering platform detection, user-reported vs. terminal-detected discrepancies, and a clear verdict taxonomy. The only minor gap is that DO NOT USE FOR could name 1-2 adjacent skills to sharpen routing at the edges, but this is a negligible omission in an otherwise exemplary skill.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2138 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2138 / 500 tokens
   ❌  Exceeds limit by 1638 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips

arnaudlh and others added 2 commits May 19, 2026 19:11
…t presence

A token that exists but lacks access to the private microsoft/waza repo
still produced a hard 403 during waza install. Upgrade the preflight check
to a real HTTP call against the microsoft/waza releases API and gate on a
200 response so misconfigured tokens skip the same way absent tokens do.

🧪 - Generated by Copilot
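
The upgraded preflight described in the commit above can be sketched as a small decision function. This is an illustration under assumptions: the URL follows the standard GitHub REST path for listing a repository's releases, and `fetch_status` is a hypothetical injected callable standing in for the real HTTP request so the logic can be exercised without network access.

```python
RELEASES_URL = "https://api.github.com/repos/microsoft/waza/releases"


def preflight_enabled(token, fetch_status):
    """Return True only if a token is present and the releases API answers 200 with it."""
    if not token:
        # Secret absent: downstream jobs should skip.
        return False
    # fetch_status(url, token) is assumed to return the HTTP status code as an int.
    return fetch_status(RELEASES_URL, token) == 200
```

A token that exists but receives a 403 therefore produces the same `enabled=false` outcome as a missing token.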
github-actions (bot, Contributor) commented May 19, 2026

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

  • website/docs/skills/overview.md
  • website/docs/skills/prereq-check.md
  • website/docs/workflows/daily-repo-status-lock.md
  • website/docs/workflows/issue-triage-agent-lock.md
  • website/docs/workflows/pr-validation.md
  • website/docs/workflows/waza-agent-evals.md
  • website/docs/workflows/waza-evals.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

@arnaudlh arnaudlh requested a review from sendtoshailesh May 20, 2026 00:11
waza check expects eval.yaml colocated with SKILL.md and reports 'Not Found' when the layout separates them. Repo convention is .github/skills/<name>/SKILL.md + .github/evals/<name>/eval.yaml, so the warning was a false negative. Prepend a note above the captured waza check output explaining the layout and pointing reviewers to the Score section, which reflects the actual eval run.
arnaudlh added 2 commits May 21, 2026 11:18
Introduces stage 0 of the eval lifecycle: scaffold a brand-new
.github/evals/<skill>/ suite (eval.yaml + positive/negative/off-topic
tasks), register the skill at the expanded tier in manifest.yaml, and
run a single-model smoke trial. Pauses for approval before appending
to the manifest and before the smoke trial.

Out of scope: editing SKILL.md (use /skill-improve) and promoting
to the pilot tier (use /skill-promote).

Also updates website/docs/authoring/prompts.md with a When-to-use row,
a per-prompt section, and an updated lifecycle summary in the page
intro and frontmatter description.
…amework

- add /agent-onboard prompt and AGENT/SKILL scaffold templates
- record waza harness decision in .github/evals/README.md
- document eval lifecycle in website/docs/authoring/framework.md
- wire mutable-by-* tags into /agent-improve gate and Locked column
- add CONTRIBUTING.md section for adding eval suites

🦍 - Generated by Copilot
@arnaudlh arnaudlh requested review from dawright22 and suuus May 21, 2026 11:18
sendtoshailesh (Contributor) left a comment

🧪 Local Validation Feedback — Waza Eval Harness

Tested locally with Waza CLI v0.31.0 on macOS/arm64. Here is a structured summary of findings.


✅ What Works Well

  • Reproducibility: --no-cache runs with claude-sonnet-4.6 → identical 0.86 score, 0% variance
  • Workflow architecture: Proper job DAG (preflight→prepare→eval→comment), graceful degradation when secret missing
  • Security model: pull_request event (fork-safe), minimal permissions (contents:read, pull-requests:write)
  • Contributor prompts: All 6 prompts (skill/agent × bench/improve/promote) are well-structured
  • Advisory-only mode: Evals never block merges ✅

🔴 High Priority Findings

1. trials_per_task: 1 causes non-deterministic pass/fail

| Model | Local Result | CI Result |
| --- | --- | --- |
| claude-sonnet-4.6 | ✅ 0.86, 3/3 pass | ✅ pass |
| gpt-5-codex | ✅ 0.86, 3/3 pass | ✅ pass |
| gpt-5.4 | ⚠️ 0.74, 2/3 pass | ✅ pass |

gpt-5.4 failed locally on positive-what-do-i-need → criterion 4 missing (no install commands). Same model passes in CI. With a single trial, one flaky LLM response = permanent failure for that run.

Recommendation: trials_per_task: 3 with majority voting for production gating.
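
A minimal sketch of the majority-voting rule recommended above. This is an illustration of the idea only; whether waza aggregates trials this way is not confirmed in this thread, and `task_passes` is a hypothetical helper.

```python
def task_passes(trial_results):
    """Majority vote: pass only if strictly more than half of the trials passed."""
    return sum(trial_results) * 2 > len(trial_results)


# With a single trial, one flaky response decides the outcome.
single_trial = task_passes([False])
# With three trials, one flaky response is outvoted.
three_trials = task_passes([True, False, True])
```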

2. Insufficient negative task coverage

Current state: 2 positive + 1 negative task. A skill that always fires would still get 2 of 3 tasks right, about 67%, which is above the 0.6 trigger_precision threshold.

Recommendation: Add ≥3 negative scenarios, e.g.:

  • "Review this pull request for security issues" (code review, not prereq)
  • "Deploy my app to Azure Container Apps" (deployment, not prereq)
  • "What are the team's Q1 OKRs?" (org question, not tooling)
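
The arithmetic behind this finding can be checked with a short sketch. It assumes the trigger score is simply the fraction of tasks on which the fire/no-fire decision is correct, which is a simplification of whatever waza actually computes.

```python
THRESHOLD = 0.6  # trigger_precision threshold quoted above


def always_fire_score(n_positive, n_negative):
    """Score of a skill that fires on every prompt: right on positives, wrong on negatives."""
    return n_positive / (n_positive + n_negative)


current = always_fire_score(2, 1)         # 2 positive + 1 negative
with_three_negs = always_fire_score(2, 3)  # 2 positive + 3 negative
```

With three negatives against two positives, the always-firing skill drops to 0.4 and no longer clears the threshold.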

3. Token budget calibration mismatch

The .waza.yaml comment says:

"75th-percentile token count across all 34 skills, rounded up to nearest 50, capped at 1000"

But actual state: 13 skills, 75th percentile = 3,179 tokens (3.2× the 1000 threshold). Only 2/13 skills are under the 1300 fallback limit. This causes waza check to flag every skill as over-budget.

Question: Was this calibrated against a different/future skill set? If so, the comment should clarify. If not, thresholds need recalibration.
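
The 75th-percentile figure can be reproduced from the token counts in the bot's token comparison on this PR. The sketch below uses Python's `statistics.quantiles` with the inclusive method; other interpolation methods give a slightly different value, so treat the choice of method as an assumption.

```python
import statistics

# Token counts of the 13 SKILL.md files from the token comparison in this PR.
TOKENS = [3227, 1444, 3179, 1559, 486, 6233, 2409, 1490, 1827, 1276, 5322, 2730, 2138]

# quantiles() sorts the data itself; index 2 of the quartile cut points is the 75th percentile.
p75 = statistics.quantiles(TOKENS, n=4, method="inclusive")[2]

# Skills that fit under the 1300-token block limit.
under_block = [t for t in TOKENS if t < 1300]
```

This gives a 75th percentile of 3179 tokens and two skills (486 and 1276 tokens) under the 1300 limit, matching the numbers in the finding.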


🟡 Medium Priority

4. waza check false negative on eval suite

Because eval.yaml lives in .github/evals/prereq-check/ (not colocated with SKILL.md), waza check reports "Evaluation Suite: Not Found". The PR comment correctly notes this, but it will confuse contributors following the readiness workflow.

5. Judge model same-family bias

claude-sonnet-4.6 judges claude-sonnet-4.6 responses. The meeting discussion about "makers vs checkers across model families" suggests this should eventually use a cross-family judge (e.g., GPT judges Claude, Claude judges GPT).

6. Aggregate score 0.86 for "all pass" may confuse reviewers

The negative task scores 0.57 (correct behavior — skill correctly did NOT trigger, but the score reflects imperfect isolation). Readers seeing <1.0 on "all pass" might think something is wrong.
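
The aggregate figures in this thread are consistent with an unweighted mean of per-task scores. The sketch below checks that reading; it is an inference from the reported numbers, not a documented waza formula, and the two positive scores of 1.00 in the three-task suite are assumed.

```python
from statistics import fmean


def aggregate(scores):
    """Unweighted mean of per-task scores, rounded to two decimals."""
    return round(fmean(scores), 2)


# Three-task suite: two positives assumed at 1.00, one negative at 0.57.
three_task = aggregate([1.00, 1.00, 0.57])
# Four-task suite, per-task scores as reported in the CI comment.
four_task = aggregate([0.57, 0.60, 1.00, 1.00])
```

Under that assumption, "all pass" and an aggregate below 1.0 are compatible: passing tasks can still carry partial scores.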


📊 Cost Data (for scalability planning)

| Model | Premium Requests | Input Tokens | Duration |
| --- | --- | --- | --- |
| claude-sonnet-4.6 | 10 | 291K | 62s |
| gpt-5-codex | 12 | 350K | 65s |
| gpt-5.4 | 14 | 365K | ~145s |

At full scale (13 skills × 4 models): ~650 premium requests per full matrix run.
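
A back-of-the-envelope version of this projection, assuming every leg costs about the mean of the three measured runs. The reviewer's exact inputs are not stated, so this is only one plausible way to arrive at a number in that range.

```python
# Premium requests per leg, from the cost table above.
MEASURED = {"claude-sonnet-4.6": 10, "gpt-5-codex": 12, "gpt-5.4": 14}


def projected_requests(n_skills, n_models, per_leg):
    """Total premium requests for a full skills x models matrix run."""
    return n_skills * n_models * per_leg


mean_per_leg = sum(MEASURED.values()) / len(MEASURED)
estimate = projected_requests(13, 4, mean_per_leg)
```

This yields 624 requests for 52 legs, in the same range as the ~650 quoted above.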


Local Test Steps Used

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
gh repo clone Azure/git-ape && cd git-ape
git checkout -b feat/eval-harness origin/feat/eval-harness
waza tokens count .github/skills/
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 --no-cache
waza run .github/evals/prereq-check/eval.yaml --model gpt-5-codex --no-cache

Overall: solid foundation for the eval harness. The critical items above would strengthen reliability before scaling to all 13 skills + 8 agents.

sendtoshailesh (Contributor) commented

🔄 Updated Review — New Commits Acknowledged

Noticed two new commits since my initial review:

  • c2e5ada: /skill-onboard maintainer prompt + docs
  • b819016: /agent-onboard prompt, decision doc, authoring framework

✅ New additions that address earlier feedback

1. /skill-onboard now scaffolds negatives by default — the prompt generates negativeTasks (default 1, capped at 2) + an off-topic negative. This partially addresses my "insufficient negative coverage" concern, though the default of 1 negative + 1 off-topic (total 2) against 2 positives still allows an always-firing skill to pass. Consider defaulting to negativeTasks=2 + 1 off-topic = 3 negatives.

2. Decision doc (.github/evals/README.md) — excellent. Clearly explains why waza was chosen over openai/evals and custom Node harness. The lifecycle summary (onboard → bench → improve → promote) gives contributors a clear path.

3. CONTRIBUTING.md section — good onboarding for new contributors.

🟡 Observations on new additions

/skill-onboard cost notice says 5-8 premium requests per invocation — this is per-skill scaffolding, not per-PR-run. Worth noting that the total eval cost across all skills (when all 13 are onboarded) will be much higher. Consider adding a cost table to the decision doc.

Agent eval mirror pattern: waza-agent-evals.yml auto-syncs the canonical .agent.md into the eval directory before each run. This is clever, but the committed mirror can still drift from the source between runs. The sync step mitigates this; it is worth documenting as a design choice.
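One way to surface the drift described above is a content-hash comparison between the canonical file and its mirror. The sketch below is an illustration only; it is not taken from the workflow, and any paths passed to it are the caller's choice.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mirror_is_stale(canonical: Path, mirror: Path) -> bool:
    """True when the mirror is missing or its content differs from the canonical file."""
    if not mirror.exists():
        return True
    return file_digest(canonical) != file_digest(mirror)
```

A check like this could run locally or in CI to warn when a committed mirror no longer matches its source, independently of the sync step.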

Previous feedback still applies

My 3 critical findings from the earlier review remain relevant:

  1. ⚠️ trials_per_task: 1 — still a flakiness risk (unaddressed)
  2. ⚠️ Negative coverage defaults — improved by onboard prompt, but defaults could be higher
  3. ⚠️ Token budget calibration comment — still references "34 skills"

Overall the PR is maturing well. The authoring framework and lifecycle prompts significantly lower the bar for contributors.

Validated locally against the full prereq-check eval suite (4 tasks ×
3 trials, claude-sonnet-4.6): 100% pass, stddev 0.0000, schema
unchanged.
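For readers unfamiliar with the stddev figure, the sketch below shows how per-task variance across trials can be summarized. The scores are hypothetical, and the use of population standard deviation is an assumption; waza's own aggregation may differ.

```python
from statistics import mean, pstdev


def summarize(trial_scores: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Map each task to (mean score, population stddev) across its trials."""
    return {task: (mean(s), pstdev(s)) for task, s in trial_scores.items()}


# Hypothetical trial scores, three trials per task.
stable = summarize({"task-a": [1.0, 1.0, 1.0]})
flaky = summarize({"task-b": [1.0, 0.0, 1.0]})

print(stable["task-a"])       # (1.0, 0.0): identical trials, zero spread
print(flaky["task-b"][1] > 0) # True: a single failing trial shows up as variance
```

A single trial per task cannot reveal this spread, which is why three trials give a more trustworthy signal.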

Also rewrite the pin comment to record the real failure mode that
motivated the original pin (commit f0d44ea): GitHub's releases/latest
endpoint returned the sibling azd-extension tag
(azd-ext-microsoft-azd-waza_0.33.0) instead of v0.33.0, causing a 404
on every download. v0.33.0's actual binary was never exercised. Pinning
to v0.33.0 directly avoids the tag-resolution path entirely.
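The failure mode comes down to two tag shapes sharing one release stream. The commit fixes it by pinning the version directly; the sketch below illustrates a different, hypothetical safeguard, namely filtering tags by a strict `vMAJOR.MINOR.PATCH` pattern. The tag names are the two quoted above.

```python
import re

# Matches core release tags such as "v0.33.0" and nothing else.
CORE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def core_release_tags(tags: list[str]) -> list[str]:
    """Keep only tags that are exactly of the form vMAJOR.MINOR.PATCH."""
    return [t for t in tags if CORE_TAG.fullmatch(t)]


tags = ["azd-ext-microsoft-azd-waza_0.33.0", "v0.33.0"]
print(core_release_tags(tags))  # ['v0.33.0']
```

Pinning remains the simpler option because it removes tag resolution from the download path altogether.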

Notable upsides shipped between 0.31 and 0.33:
  - PR #251: prompt grader preserves grades when follow-up turn fails
    (mitigates documented continue_session JSON-RPC flakiness).
  - PR #258: prompt graders route through CopilotEngine for consistency.
  - Bundled copilot-cli 1.0.2 -> 1.0.49.
@sendtoshailesh
Contributor

✅ Updated Local Validation — All Critical Findings Resolved

Re-tested locally with Waza CLI v0.33.0 on macOS/arm64 (commit 3684b5e).


📊 Results Summary

| Model | Pass Rate | Aggregate | Std Dev | Tasks | Trials | Duration |
| --- | --- | --- | --- | --- | --- | --- |
| claude-sonnet-4.6 | 100% (4/4) | 0.79 | 0.0000 | 4 | 3/task | 2m10s |
| gpt-5.4 | 100% (4/4) | 0.79 | 0.0000 | 4 | 3/task | 2m17s |

✅ Critical Findings — All Addressed

| # | Original Finding | Status |
| --- | --- | --- |
| 1 | trials_per_task: 1 causes flakiness | Fixed (trials_per_task: 3); gpt-5.4 now passes 100% (was 66%) with zero variance |
| 2 | Insufficient negative coverage | Fixed: 2 positive + 2 negative tasks; negatives score 0.57/0.60 (well below 0.50 trigger threshold) |
| 3 | Token budget comment ("34 skills") | Clarified: .waza.yaml now documents aspirational caps vs actual corpus (13 skills, p75 ≈ 3.2k) |

🆕 Additional Improvements Since Last Review

  • Waza v0.33.0 pinned with documented rationale (avoids releases/latest 404 from sibling tag)
  • /skill-onboard and /agent-onboard prompts scaffold negatives by default
  • Decision doc (.github/evals/README.md) explains waza choice + lifecycle
  • Authoring guide lowers contributor bar significantly

🟡 Non-blocking Observations

  1. Aggregate 0.79 ≠ failure — negative tasks intentionally score <1.0 (trigger correctly not firing). Consider adding a one-line explanation in the CI PR comment for reviewers unfamiliar with the scoring model.
  2. waza check reports token overage (2138 vs 500 limit) — this is the aspirational cap working as intended. No action needed.
  3. Same-family judge bias — future work, not blocking for harness MVP.
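The 0.79 aggregate in observation 1 is consistent with an unweighted mean over the four task scores. The sketch below assumes both positive tasks scored 1.0 and that the aggregate is a plain mean; neither is confirmed by the review, and the task names are placeholders. The negative scores are the 0.57 and 0.60 reported in the findings table.

```python
from statistics import mean

# Placeholder task names; positive scores of 1.0 are an assumption.
task_scores = {
    "positive-1": 1.0,
    "positive-2": 1.0,
    "negative-1": 0.57,
    "negative-2": 0.60,
}

aggregate = mean(task_scores.values())
print(f"{aggregate:.2f}")  # 0.79
```

Under these assumptions every task passes while the aggregate stays below 1.0, which is why the figure does not indicate a failure.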

🧪 Local Test Commands

waza --version  # 0.33.0
waza check .github/skills/prereq-check
waza run .github/evals/prereq-check/eval.yaml --model claude-sonnet-4.6 -v
waza run .github/evals/prereq-check/eval.yaml --model gpt-5.4 -v

Verdict

Ready for merge. The eval harness is stable, reproducible, and well-documented. All 3 critical items from my earlier review are resolved. Remaining items are UX polish suitable for follow-up PRs.

@arnaudlh arnaudlh merged commit f9e5d20 into main May 25, 2026
20 checks passed
@arnaudlh arnaudlh deleted the feat/eval-harness branch May 25, 2026 07:35

Labels

AI-evals (all things related to agent and skills evaluation), enhancement (new feature or request)

Development

Successfully merging this pull request may close these issues.

Agent/Skill eval framework

3 participants