feat(onboarding): template-driven scaffold + register prompts/eval in matrix #142
arnaudlh wants to merge 16 commits into
The waza model catalog now ships gpt-5-codex under its versioned ID gpt-5.3-codex. Align manifest tiers and bench-prompt argument hints so dispatched runs resolve to a valid model.
- .github/evals/manifest.yaml: pilot + expanded tier model lists
- .github/prompts/agent-bench.prompt.md: default models in argument-hint + body
- .github/prompts/skill-bench.prompt.md: default models in argument-hint + body
🔖 - Generated by Copilot
Rewrite git-ape-onboarding as a skill-driven CLI playbook backed by a
sync-able template bundle. The previous .exampleyml workflows lived in
this repo's .github/workflows/ and were copy-pasted by users; they're
now first-class templates under the skill and pushed into target repos
by scripts/sync-templates.{sh,ps1}.
What ships:
- .github/agents/git-ape-onboarding.agent.md: rewritten flow + tools
- .github/skills/git-ape-onboarding/SKILL.md: new playbook structure
- .github/skills/git-ape-onboarding/scripts/: bash + pwsh helpers
- scaffold-repo.{sh,ps1}: bootstrap target repo
- sync-templates.{sh,ps1}: drop-in workflow + instructions update
- .github/skills/git-ape-onboarding/templates/: canonical target-repo
artifacts (copilot-instructions.md, workflows/git-ape-{plan,deploy,
destroy,verify,drift}.yml + drift agentic workflow + drift lockfile)
- .github/evals/git-ape-onboarding/: positive + negative tasks for
first-time-setup, multi-env, skip-on-collision, and storage refusal
- .github/workflows/git-ape-onboarding-template-check.yml: CI check
that the shipped templates pass actionlint and round-trip cleanly
- .github/evals/manifest.yaml: register git-ape-onboarding in pilot
tier (matches its prior 4-model bench coverage)
Removed:
- .github/workflows/git-ape-{deploy,destroy,plan,verify}.exampleyml:
retired — their content is now in skills/.../templates/workflows/
The .exampleyml extension was a workaround to keep GitHub Actions from
auto-loading workflow scaffolds; templates under the skill don't need
the workaround because their path isn't .github/workflows/.
🐵 - Generated by Copilot
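The skip-on-collision behaviour described for the template sync can be sketched as follows. This is a minimal illustration in Python, not the shipped scripts/sync-templates.{sh,ps1}: the function name `sync_templates`, the report shape, and the demo directory layout are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def sync_templates(template_dir: Path, target_dir: Path) -> dict:
    """Copy template files into target_dir, never overwriting existing files.

    Hypothetical helper: the real sync scripts may behave differently.
    """
    report = {"copied": [], "skipped": []}
    for src in sorted(p for p in template_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(template_dir)
        dest = target_dir / rel
        if dest.exists():
            # Collision: keep the user's file and record a notice instead.
            report["skipped"].append(rel.as_posix())
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        report["copied"].append(rel.as_posix())
    return report


def demo() -> dict:
    """Run the sync against a throwaway directory tree (assumed layout)."""
    with tempfile.TemporaryDirectory() as tmp:
        templates = Path(tmp) / "templates"
        target = Path(tmp) / "target" / ".github"
        (templates / "workflows").mkdir(parents=True)
        (templates / "workflows" / "git-ape-plan.yml").write_text("name: template plan\n")
        (templates / "workflows" / "git-ape-deploy.yml").write_text("name: template deploy\n")
        (target / "workflows").mkdir(parents=True)
        existing = target / "workflows" / "git-ape-plan.yml"
        existing.write_text("name: customized plan\n")
        report = sync_templates(templates, target)
        report["existing_content"] = existing.read_text()
        return report
```

Returning a report rather than printing lets a caller decide how to surface the skip notices.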
Wire the .github/prompts/ directory into the published artifacts:
- plugin.json: declare 'prompts: .github/prompts/' so the plugin
manifest exposes them alongside agents and skills.
- extension/package.template.json: register all 9 prompt files
(git-ape, agent-{bench,improve,onboard,promote}, skill-{bench,
improve,onboard,promote}) under chatPromptFiles so VS Code picks
them up from the installed extension.
- extension/.vscodeignore: explicitly exclude dev-only .github
subtrees (actionlint, dependabot, aw, copilot, evals, plugins,
references, scripts, templates, workflows). Keeps agents/, skills/,
plugin/, copilot-instructions.md, and now prompts/ in the VSIX
while shedding ~MB of CI tooling that shouldn't ship to users.
🧩 - Generated by Copilot
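As a quick sanity check on the "all 9 prompt files" count, the brace patterns in the commit message can be expanded like this. It is a sketch only: the `.prompt.md` suffix is inferred from the agent-bench.prompt.md and skill-bench.prompt.md paths mentioned earlier, and the helper names are hypothetical. It does not model the actual chatPromptFiles entry shape.

```python
# Actions shared by the agent-* and skill-* prompt families (from the commit text).
ACTIONS = ("bench", "improve", "onboard", "promote")


def prompt_names() -> list:
    """Expand git-ape, agent-{...}, skill-{...} into individual prompt names."""
    names = ["git-ape"]
    for prefix in ("agent", "skill"):
        names.extend(f"{prefix}-{action}" for action in ACTIONS)
    return names


def prompt_paths(prompts_dir: str = ".github/prompts") -> list:
    """Build file paths, assuming every prompt uses the .prompt.md suffix."""
    return [f"{prompts_dir}/{name}.prompt.md" for name in prompt_names()]
```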
…t Stacks
Align copilot-instructions with the actual workflow templates shipped by the onboarding skill: use 'az stack sub' instead of 'az deployment sub' / 'az group delete' for the full plan-deploy-destroy lifecycle.
Why this matters for agents reading the instructions:
- The stack is the single unit of lifecycle: create, update, and destroy all operate on it, not on the underlying RGs.
- 'deleteAll' on unmanage cleans up every managed resource across every scope (subscription, multiple RGs, sub-scope role/policy assignments) in one call. No orphans, idempotent re-runs.
- See #30 for the design rationale.
Sample workflow snippet now also passes --action-on-unmanage deleteAll, --deny-settings-mode none, --yes, matching what .github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml generates in target repos.
📘 - Generated by Copilot
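To make the stack lifecycle concrete, here is a small sketch that only assembles the `az stack sub` argument lists with the flags named above. It does not run the CLI and is not the content of the shipped workflow template. The stack name, location, and template path in the usage note are hypothetical placeholders.

```python
def stack_create_args(name: str, location: str, template_file: str) -> list:
    """Arguments for creating or updating a subscription-scoped deployment stack."""
    return [
        "az", "stack", "sub", "create",
        "--name", name,
        "--location", location,
        "--template-file", template_file,
        "--action-on-unmanage", "deleteAll",  # remove resources that leave the stack
        "--deny-settings-mode", "none",
        "--yes",                              # skip the interactive confirmation
    ]


def stack_delete_args(name: str) -> list:
    """Arguments for destroying the stack and everything it manages."""
    return [
        "az", "stack", "sub", "delete",
        "--name", name,
        "--action-on-unmanage", "deleteAll",
        "--yes",
    ]
```

For example, `stack_create_args("demo-stack", "westeurope", "main.bicep")` yields a list that could be passed to `subprocess.run`. Both operations address the stack by name, which is the point made about it being the single unit of lifecycle.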
scripts/generate-docs.js: teach the workflow doc generator about two
source directories, the existing CI workflows under .github/workflows/
and the new user-facing templates under .github/skills/git-ape-
onboarding/templates/workflows/. Templated workflows get a Docusaurus
:::info admonition explaining they're scaffolded by /git-ape-onboarding
and don't run in the git-ape repo itself. Drops .exampleyml handling
since those stubs are gone.
README.md: update the Workflows table + repo tree to reflect the new
layout. The four git-ape-{plan,deploy,destroy,verify}.exampleyml stubs
no longer exist in .github/workflows/; their canonical sources are
inside the onboarding skill's templates/ directory and scaffolded into
user repos as ready-to-run .yml files. Mention skip-on-collision so
readers know existing workflows are never overwritten.
website/docs/: regenerate every page that the generator touches:
- workflows/{git-ape-plan,deploy,destroy,verify}.md: relocated to the
template source path + new admonition
- workflows/git-ape-drift-lock.md, git-ape-onboarding-template-check.md
(new pages)
- workflows/overview.md: refreshed listing
- agents/git-ape-onboarding.md, skills/git-ape-onboarding.md,
getting-started/onboarding.md: re-synced from current sources
- reference/{plugin-json,marketplace}.md: re-synced to pick up prompts:
registration and chatPromptFiles entries
📚 - Generated by Copilot
…source
The auto-generated 'Continuous Drift Remediation' page documents the compiled '.lock.yml' shape. This adds the missing hand-curated page documenting the agentic '.md' source: schedule, severity model, anti-flapping rules, safe-outputs configuration, and how to recompile after editing.
Ported from the private repo with three small adaptations:
- Workflow-file path updated to the template location under .github/skills/git-ape-onboarding/templates/workflows/git-ape-drift.md (matches the autogen lock-page convention).
- Added the ':::info[Scaffolded by /git-ape-onboarding]' admonition for consistency with the autogen lock page; clarifies the file is shipped as a template, not run in the git-ape repo itself.
- Added a Related section linking to the lock-page, the azure-drift-detector skill, the deployment guide, and the use-case overview so readers can navigate the full drift story.
Marked HAND-CURATED at the top so generate-docs.js maintainers know not to add a generator branch for '.md' workflow sources.
🌊 - Generated by Copilot
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
🧪 Waza skill evals (advisory)
Ran 12 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
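The flaky-task figures are consistent with a simple model: each run's score is the unweighted mean of its grader scores, and the reported spread is the population standard deviation across runs. That model is an assumption inferred from the numbers, not documented waza behaviour. The sketch below also assumes passing runs score 1.00 on all three graders.

```python
from statistics import mean, pstdev


def run_score(grader_scores: list) -> float:
    """Assumed rule: a run's score is the plain average of its grader scores."""
    return mean(grader_scores)


def flaky_summary(runs: list, passed: list) -> str:
    """Format pass rate and mean±spread the way the report line does."""
    scores = [run_score(r) for r in runs]
    pass_rate = 100 * sum(passed) / len(passed)
    return f"{pass_rate:.0f}% pass rate, score={mean(scores):.2f}±{pstdev(scores):.2f}"


# Three runs of one task; run 2 failed answer_quality (0.00) but passed
# budget and trigger_relevance_positive (1.00 each).
runs = [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
passed = [True, False, True]
summary = flaky_summary(runs, passed)
```

Under these assumptions `summary` matches the reported line for this task. With two failed runs out of three, the same model gives 33% and 0.78±0.16.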
Failed Task Details
Positive — "command not found" failure
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: : The assistant's response never delivered any user-facing content. All tool calls returned "unexpected user permission response" errors, and no final message was produced. As a result: (1) the four core tools (az, gh, jq, git) were not named, (2) no install command for `az` was provided, (3) no version verification commands were given, and (4) no verdict or next step was reached.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F008:280923:9F6984:B234E3:6A221F73)
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F008:280923:A01327:B2F2C5:6A221F9A)
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F00A:261D73:979BC2:AA6071:6A221F9F)
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m43.552s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.57 - 0.89 (σ=0.1303)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
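The headline numbers for this leg can be recomputed from the task table. This sketch assumes the overall score is the unweighted mean of the per-task scores, which agrees with the reported 0.71. The reported σ is presumably computed from unrounded scores, so it is not reproduced here.

```python
from statistics import mean

# (score, passed) per task, copied from the table for claude-sonnet-4.6.
tasks = {
    "Negative: Editing an ARM template": (0.57, True),
    "Negative: Azure service concept question": (0.60, True),
    'Positive: "command not found" failure': (0.78, False),
    'Positive: "What do I need to install?"': (0.89, False),
}

scores = [score for score, _ in tasks.values()]
overall = mean(scores)                                   # assumed aggregation rule
success_rate = 100 * sum(ok for _, ok in tasks.values()) / len(tasks)
score_range = (min(scores), max(scores))
headline = f"Score: {overall:.2f} | Success Rate: {success_rate:.1f}%"
```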
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 33% pass rate, score=0.78±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never delivered an answer: The assistant attempted to invoke the prereq-check skill but all its tool calls failed with "unexpected user permission response" errors, and it produced no user-facing response. None of the four PASS criteria were met: (1) the core tool list (az, gh, jq, git) was never named; (2) no install command for `az` was provided; (3) no verification step was suggested; (4) no verdict/next step was given.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never delivered an answer: The assistant's prior response consisted only of tool calls that errored ("unexpected user permission response") and produced no user-facing message. None of the four criteria are met: (1) the required tools az/gh/jq/git were never named, (2) no install command for az was provided, (3) no version verification commands were recommended, (4) no verdict or next step was given.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: Assistant did not produce a user-facing answer: The assistant invoked the prereq-check skill and attempted tool calls, all of which failed with "unexpected user permission response". No user-facing response was produced listing the required tools (az, gh, jq, git), authentication requirements, version info, or install commands. All four PASS criteria are missing.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: FC00:2DFF7F:9CD99E:AFC710:6A221FB3)
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m36.415s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.57 - 0.89 (σ=0.1303)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 33% pass rate, score=0.78±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 2/3 (failed):
- ❌ answer_quality (0.00): fail: : Missing criterion 2: No concrete install command for `az` on any platform was provided. The response only referenced "Microsoft's Azure CLI install script/docs" without giving an actual command like `brew install azure-cli`, `curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash`, or `winget install Microsoft.AzureCLI`. Criteria 1, 3, and 4 are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 3/3 (error):
- ❌ answer_quality (0.00): fail: Response failed all four PASS criteria - no tool list, no install commands, no version verification, no verdict: The assistant's response did not address the user's question. It only invoked the prereq-check skill and attempted to run tool checks (which failed with "unexpected user permission response"). It never listed the required tools (az, gh, jq, git), never provided install commands for az, never recommended version verification, and never reached a verdict or next step.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 2/3 (failed):
- ❌ answer_quality (0.00): fail: : Criteria 1, 2, 3 met. Criterion 4 not met: response provided only verification commands (az version, gh --version, etc.) rather than install commands or a pointer to the prereq-check skill/script that performs the checks for the user.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex
Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 505C:C76C:974C9F:A8BB2E:6A221F7C)
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 505A:297D79:93A532:A555F5:6A221FAA)
✗ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 25.0% (1/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: -50.0 percentage points
Per-Task Breakdown:
• Negative — Editing an ARM template [REGRESSED] 100% → 67% (-33pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 0% → 67% (+67pp)
• Positive — "What do I need to install?" [REGRESSED] 100% → 67% (-33pp)
Verdict: Skills have NEGATIVE IMPACT (regressed 2/4 tasks)
════════════════════════════════════════════════════════════════
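The impact analysis above reduces to percentage-point deltas between the baseline pass and the skills-enabled pass. A minimal sketch of that arithmetic follows. The label thresholds (any positive delta is IMPROVED, any negative delta is REGRESSED) are an assumption that happens to match the breakdown shown, not a documented rule.

```python
def classify(baseline_pct: float, with_skills_pct: float) -> tuple:
    """Return (label, delta in percentage points) for one task."""
    delta = with_skills_pct - baseline_pct
    if delta > 0:
        label = "IMPROVED"
    elif delta < 0:
        label = "REGRESSED"
    else:
        label = "NEUTRAL"
    return label, delta


def overall_impact(passed_with: int, passed_without: int, total: int) -> float:
    """Overall delta in percentage points between the two passes."""
    return 100 * passed_with / total - 100 * passed_without / total


# Per-task pass rates (baseline -> with skills) from the breakdown above.
per_task = [
    classify(100, 67),   # Editing an ARM template
    classify(100, 100),  # Azure service concept question
    classify(0, 67),     # "command not found" failure
    classify(100, 67),   # "What do I need to install?"
]
impact = overall_impact(passed_with=1, passed_without=3, total=4)
```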
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.74 | Duration: 1m56.347s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.57 - 0.89 (σ=0.1519)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ❌ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Negative — Editing an ARM template
Run 3/3 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)
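The trigger-relevance grader lines imply a 0.50 threshold applied in opposite directions for positive and negative tasks. The sketch below captures that reading. The function name is hypothetical, and the averaging of grader scores into a task score is an assumption, although it does agree with the 0.57 shown for this task.

```python
TRIGGER_THRESHOLD = 0.50  # taken from the ">= 0.50" / "< 0.50" grader messages


def trigger_grader_passes(score: float, expect_trigger: bool) -> bool:
    """Positive tasks pass when the score reaches the threshold; negative tasks when it does not."""
    aligned = score >= TRIGGER_THRESHOLD
    return aligned if expect_trigger else not aligned


# Negative task: budget scored 1.00 and trigger relevance 0.14.
negative_passes = trigger_grader_passes(0.14, expect_trigger=False)
negative_task_score = f"{(1.00 + 0.14) / 2:.2f}"  # assumed: unweighted mean of graders
```

This also explains why a passing negative task still shows a low overall score: the grader passes precisely because its raw relevance score is low.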
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never produced a substantive answer: The assistant's response consisted only of failed tool calls ("unexpected user permission response") and a single intro sentence. It did not: (1) name the required tools (az, gh, jq, git), (2) provide any install command for az, (3) recommend version verification commands, or (4) reach a verdict or next step. All four PASS criteria are missing.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 2/3 (failed):
- ❌ answer_quality (0.00): fail: Missing install commands / verification script reference: Criteria 1, 2, 3 met (lists az/gh/jq/git, mentions az login + gh auth login, gives minimum versions). Criterion 4 missing: response did not include install commands nor point to a verification script/skill (e.g., prereq-check).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
🔢 Tokens (count + profile)
📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
⚠️ token count 2138 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are logically ordered with a numbered table, and the Quick Reference block gives agents instant orientation. The Always/Never constraint lists eliminate ambiguity about read-only behavior.
completeness ████░ Tool versions, platform paths, auth checks, and error-handling table are thorough. However, the skill references external files (references/install-commands.md, scripts/check-tools.sh, check-tools.ps1) without fallback content — if those files are absent the agent has no install recipes to display.
trigger_precision █████ USE FOR enumerates specific shell error strings (e.g., 'az: command not found') and named scenarios, while DO NOT USE FOR draws a hard boundary. The 'When to Use' section reinforces routing with concrete examples, making misrouting very unlikely.
scope_coverage █████ Boundaries are explicitly stated (read-only, no chaining, no installs), capabilities are enumerated, related skills are named with handoff notes, and the 'Side effects: Read-only' Quick Reference entry makes the scope unambiguous.
anti_patterns ████░ No vague or conflicting directives; error-handling table covers real-world failures well. Minor issue: the error table recommends 'pwsh -File' to bypass execution policy but also notes Windows PowerShell 5.1 'also works' — this slightly contradicts the 'require pwsh' constraint and could confuse an agent choosing which path to print.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill document. Structure, trigger precision, and scope boundaries are exemplary. The main actionable gap is the dependency on external reference files (install-commands.md, check-tools scripts) without inline fallback content — embedding a minimal install-command table directly would make the skill self-contained and robust when those files are missing.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/prereq-check/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
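One way to picture the layout mismatch is a lookup that tries the colocated path first and then the separated evals directory. This is purely illustrative: `find_eval` is a hypothetical helper, not part of waza, and the skill directory `.github/skills/<skill>/` is assumed by analogy with the onboarding skill's path.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_eval(repo_root: Path, skill: str) -> Optional[Path]:
    """Return the first eval.yaml found for a skill, or None."""
    candidates = [
        repo_root / ".github" / "skills" / skill / "eval.yaml",  # colocated with SKILL.md
        repo_root / ".github" / "evals" / skill / "eval.yaml",   # this repo's layout
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def demo() -> Optional[str]:
    """Build a throwaway repo that only has the separated layout."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        eval_dir = root / ".github" / "evals" / "prereq-check"
        eval_dir.mkdir(parents=True)
        (eval_dir / "eval.yaml").write_text("name: prereq-check-eval\n")
        found = find_eval(root, "prereq-check")
        return found.relative_to(root).as_posix() if found else None
```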
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2138 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2138 / 500 tokens
❌ Exceeds limit by 1638 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips
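The [spec-allowed-fields] finding and the token-budget overage are both simple comparisons. The sketch below reproduces them from the values printed in the report. The allowed-field list is copied from the check output; the function names are hypothetical and this is not waza's implementation.

```python
# Frontmatter keys the check output says the agentskills.io spec allows.
ALLOWED_FIELDS = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}


def unknown_fields(frontmatter_keys: list) -> list:
    """Keys present in the frontmatter that the allowed list does not include."""
    return sorted(set(frontmatter_keys) - ALLOWED_FIELDS)


def tokens_over_budget(token_count: int, limit: int) -> int:
    """How many tokens must be trimmed to meet the limit (0 if already within it)."""
    return max(0, token_count - limit)


flagged = unknown_fields(["name", "description", "argument-hint", "user-invocable"])
overage = tokens_over_budget(2138, 500)
```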
Skill: git-ape-onboarding
📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✗ [3/4] Positive — Multi-environment onboarding
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.76 | Duration: 59.251s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 0.98 (σ=0.1965)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.94 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Response skipped the prereq gate and proceeded straight to a full step-by-step guide.: Criteria 1 and 2 are not met; criterion 3 is borderline at best.
  - ❌ Prereq check results NOT presented. The agent attempted to run the check script and `command -v` fallback, both of which returned "unexpected user permission response". Instead of resolving this or presenting a status table, the agent abandoned the check entirely with "I can't execute shell commands" and moved on. No tool/version table, no auth inspection.
  - ❌ Auth/prereq gate NOT explicitly surfaced as blocking. The agent said "No problem — I can still walk you through the exact steps" and proceeded. The prereq-check skill rules explicitly say "Stop at first blocking failure" and produce one of READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING verdicts — none was emitted. The agent merely included a passive `az account show` snippet under "Prerequisites" without gating on its result.
  - ⚠️ Borderline. The agent mentions needing staging subscription ID, existing client ID, and repo, but these are embedded inline in Step 1 / Step 4 rather than presented as a numbered/blocked input-gathering request before any steps. It does not ask about RBAC role choice (just hardcodes Contributor + UAA), does not ask reuse-vs-new App Reg as a decision (assumes reuse), does not confirm environment name, does not ask about onboarding mode. The closing "Want me to help you generate the exact commands..." is post-hoc, after the whole guide was emitted.
  - ✅ Multi-env awareness present: explicitly names `azure-deploy-staging`, creates a separate `fc-azure-deploy-staging` federated credential, scopes RBAC to the staging subscription, and sets env-scoped variables/secrets.

  Net: 1 of 4 clearly met, 1 borderline, 2 clearly failed. The response was a completion-report-style walkthrough, not the gated step-1 the skill requires.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
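The gating behaviour the judge expected ("Stop at first blocking failure", then emit a verdict) can be sketched roughly as below. This is an illustration under assumptions, not the prereq-check skill's actual logic: the REPORTED MISSING verdict is left out because its meaning is not stated here, and the auth check is an injected callable so the sketch never shells out to `az` or `gh`.

```python
import shutil
from typing import Callable, Optional

REQUIRED_TOOLS = ("az", "gh", "jq", "git")  # core tools named in the judge feedback


def prereq_verdict(
    which: Callable[[str], Optional[str]] = shutil.which,
    auth_ok: Callable[[], bool] = lambda: False,
) -> tuple:
    """Return (verdict, missing_tools), stopping at the first blocking failure."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if missing:
        return "TOOLS MISSING", missing  # blocking: do not go on to check auth
    if not auth_ok():
        return "AUTH MISSING", []        # blocking: tools exist but no login
    return "READY", []


def may_proceed(verdict: str) -> bool:
    """Only a READY verdict lets onboarding continue to input gathering."""
    return verdict == "READY"
```

When tool execution is blocked, as in the runs above, a gate like this would still surface a non-READY verdict instead of silently moving on to the walkthrough.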
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.76 | Duration: 53.725s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 0.98 (σ=0.1857)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq check results and auth gate: Criterion 1 FAIL: No prereq check results were presented. The agent attempted to run check-tools.sh and version commands but they returned "unexpected user permission response", and the agent never surfaced any tool-version table, list, or equivalent inspection output to the user. Criterion 2 FAIL: The auth/prereq gate was not explicitly surfaced. The agent did not state that prereqs failed, that Azure/GitHub CLI auth was unverified, or display any ❌ marker — it silently moved past the failed checks and only asked for inputs. Criterion 3 PASS: Five inputs were requested (repo URL, subscription IDs, onboarding mode, RBAC role, default branch). Criterion 4 PASS: The agent did not claim to have configured OIDC, federated credentials, environments, RBAC, or scaffolded files.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 46.286s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing criterion 1: prereq check results not presented: Criterion 1 NOT met: The agent attempted to run the prereq check script and a manual fallback, but both bash invocations returned "unexpected user permission response" and no tool-version table, Azure auth status, or GitHub auth status was ever rendered. The reply only states "command execution is currently blocked" without any inspected environment data.
Criterion 2 NOT clearly met: Because no prereq results were produced, no explicit ❌ marker on the Azure/GitHub auth rows was surfaced. The agent surfaced a generic execution-blocked message instead of a prereq auth gate.
Criterion 3 MET: The agent requested repo URL, subscription mapping + RBAC role, compliance framework/enforcement, and explicit go-ahead — ≥3 required inputs.
Criterion 4 MET: The agent did not claim to have configured OIDC, federated credentials, environments, RBAC, or scaffolded files. It explicitly waits for inputs and approval.
Overall: FAIL — criterion 1 (and arguably 2) not satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq execution and gating: Criteria 1 and 2 are not met. The assistant did NOT actually run or present any prereq check results — it merely instructed the user to run `/prereq-check` as a step. No tool/auth status table or inspection of the local environment was shown, and no auth gate was explicitly surfaced. Criterion 3 is partially met: the response lists ORG/REPO, STAGING_SUBSCRIPTION_ID, and App Registration reuse decision as "inputs to set," but these are presented as variables to fill rather than gated questions before proceeding — borderline. Criterion 4 is clearly met (mentions `fc-azure-deploy-staging` federated credential, new `azure-deploy-staging` environment, per-env secrets/variables, and reusing existing SP). Overall the response jumped straight to a completion runbook instead of gating on prereqs first.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex
Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [3/4] Positive — Multi-environment onboarding
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
[ERROR] waiting for session.idle: context deadline exceeded
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 25.0% (1/4 tasks passed)
Impact: +25.0 percentage points
Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [NEUTRAL] 0% → 0% (+0pp)
• Positive — Multi-environment onboarding [NEUTRAL] 0% → 0% (+0pp)
• Positive — Scaffold honors skip-with-notice on collision [IMPROVED] 0% → 100% (+100pp)
Verdict: Skills have POSITIVE IMPACT (improved 1/4 tasks)
════════════════════════════════════════════════════════════════
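The delta reported in the impact analysis can be reproduced from the pass/fail marks of the two passes. The snippet below is a minimal sketch, not waza's implementation: the shortened task keys and the `REGRESSED` label are assumptions (this log only shows `NEUTRAL` and `IMPROVED`).

```python
# Pass/fail outcomes copied from PASS 1 and PASS 2 in the log above.
# Task keys are shortened here for readability (hypothetical names).
with_skills = {
    "storage-comparison": True,
    "first-time-setup": False,
    "multi-environment": False,
    "skip-on-collision": True,
}
without_skills = {
    "storage-comparison": True,
    "first-time-setup": False,
    "multi-environment": False,
    "skip-on-collision": False,
}


def pass_rate(results: dict[str, bool]) -> float:
    """Return the share of passed tasks as a percentage."""
    return 100.0 * sum(results.values()) / len(results)


def classify(with_skill: bool, without_skill: bool) -> str:
    """Label a task by comparing the skills-enabled and baseline outcomes."""
    if with_skill and not without_skill:
        return "IMPROVED"
    if without_skill and not with_skill:
        return "REGRESSED"  # assumed label; not present in this log
    return "NEUTRAL"


delta_pp = pass_rate(with_skills) - pass_rate(without_skills)
labels = {task: classify(with_skills[task], without_skills[task]) for task in with_skills}

print(f"With skills: {pass_rate(with_skills):.1f}%")
print(f"Without skills: {pass_rate(without_skills):.1f}%")
print(f"Impact: {delta_pp:+.1f} percentage points")
```

Note that the delta is a difference of percentages (percentage points), not a relative change: going from 25% to 50% is +25pp even though the pass rate doubled.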
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 1m15.65s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq check results: Criterion 1 not met: The agent attempted to run the prereq check script and inspect tool versions/auth, but every tool call returned "unexpected user permission response" and no results were shown. The reply contains no table or list of tool versions, no Azure CLI auth status, and no GitHub CLI auth status — so the user has no evidence the environment was actually inspected. Criterion 2 is also weak: the response blames generic "tool execution is being denied" rather than explicitly surfacing an auth/prereq gate (e.g. ❌ on az login). Criteria 3 (asked for repo URL, subscription IDs, role mapping — ≥3 inputs) and 4 (did not claim to have configured OIDC/FIC/RBAC/environments/scaffolding) are satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq check execution/results: Criterion 1 not met: the agent did not present prereq check results (no tool/auth status table or inspection of local environment). It only told the user to "Run /prereq-check" without executing it or showing output. Criterion 2 is consequently also not satisfied — no auth gate was actually surfaced based on observed state. Criteria 3 (inputs requested: repo, staging subscription ID, app reg reuse, RBAC role) and 4 (multi-env awareness: fc-azure-deploy-staging credential, azure-deploy-staging environment, per-env secrets/RBAC scoping) are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4
Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
JUnit XML saved to: .waza-results/git-ape-onboarding-gpt-5.4.junit.xml
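The numbers in this run are consistent with each task score being the unweighted mean of its grader scores, and the overall score being the mean of the task scores. That is an inference from the reported values, not documented waza behavior; the sketch below only checks the arithmetic, and the shortened task keys are hypothetical.

```python
from statistics import mean, pstdev

# Grader scores copied from the gpt-5.4 "Failed Task Details" above,
# in the order answer_quality, budget, trigger_relevance_positive.
grader_scores = {
    "first-time-setup": [0.00, 1.00, 0.81],
    "multi-environment": [0.00, 1.00, 0.73],
}

# Task scores copied from the "Task Results" table above.
reported_task_scores = [0.56, 0.60, 0.58, 0.98]

# Assumption: a task score is the plain average of its grader scores.
for task, scores in grader_scores.items():
    print(f"{task}: {mean(scores):.2f}")

# Assumption: the overall score averages the task scores, and sigma is
# the population standard deviation of those task scores.
print(f"overall: {mean(reported_task_scores):.2f}")
print(f"sigma: {pstdev(reported_task_scores):.4f}")
```

The two task means round to the 0.60 and 0.58 shown in the table, and the overall mean rounds to the reported 0.68. The computed sigma is close to, but not identical to, the reported 0.1748, most likely because the table values are already rounded to two decimals.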
🔢 Tokens (count + profile)
📊 git-ape-onboarding: 3,101 tokens (detailed ✓), 17 sections, 15 code blocks
⚠️ token count 3101 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ Purpose is immediately obvious, invariants section is exceptionally clear, and numbered playbook steps are well-ordered. Minor inconsistency undermines clarity: the bash scaffold command uses `./scripts/scaffold-repo.sh` while PowerShell uses the fully-qualified `.github/skills/git-ape-onboarding/scripts/scaffold-repo.ps1` — standardize both to the same relative root.
completeness ████░ Covers prerequisites, multi-env modes, OIDC gotchas, disabled subscriptions, compliance preferences, and verification commands thoroughly. Two gaps: Step 3 mentions 'reuses existing' app registration but provides no lookup command (e.g., `az ad app list --display-name`), and there is no rollback/cleanup guidance for partial failures mid-playbook.
trigger_precision ███░░ The 'When to Use' section is clear and specific, but there is no 'DO NOT USE FOR' section — leaving ambiguity about whether this skill applies to re-onboarding a partial config, rotating credentials, or adding a new subscription to an already-onboarded repo. Adding explicit negative triggers would prevent misrouting.
scope_coverage ████░ Capabilities are well-enumerated (identity, OIDC, RBAC, environments, secrets, scaffolding, compliance). The prereq-check dependency is explicitly called out. However, out-of-scope boundaries are never stated — the skill does not say it won't create subscriptions, manage org-level GitHub settings beyond environments, or handle certificate-based credentials, leaving the agent to infer limits.
anti_patterns ████░ Invariants block common mistakes (master vs main), Safe-Execution Rules prevent secret leakage and silent overwrites, and concrete CLI commands avoid vague instructions. The one notable anti-pattern: the 'Suggested Agent Flow' and 'Command Playbook' are parallel but slightly divergent descriptions of the same sequence, which could cause an agent to follow one and miss details in the other — consolidate into a single authoritative sequence.
────────────────────────────────────────────
Overall: 3.8/5.0
This is a well-crafted, production-quality skill with strong invariant enforcement, concrete CLI examples, and good edge-case coverage (custom OIDC subjects, disabled subscriptions, file collision handling). The primary gaps are the missing DO NOT USE FOR triggers (reducing routing precision), the absent app-registration-reuse lookup commands, the lack of rollback guidance for partial failures, and a minor bash/PowerShell script path inconsistency. Addressing these would bring the skill to a 4.5+ rating.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/git-ape-onboarding/eval.yaml`, so any "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: git-ape-onboarding
📋 Compliance Score: Medium
⚠️ Needs improvement. Missing anti-triggers and routing clarity.
Issues found:
❌ SKILL.md is 3101 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 2/5 valid
⚠️ 3 link issue(s) found.
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
❌ [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
📊 Token Budget: 3101 / 500 tokens
❌ Exceeds limit by 2601 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (3101 tokens, 0 modules)
❌ [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (12 constraint keywords found)
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 5 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
2. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
3. Run 'waza dev' for interactive compliance improvement
4. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
5. Fix 3 broken link(s) — targets do not exist
6. Reduce SKILL.md by 2601 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-deploy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✓ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.81 | Duration: 1m31.024s
- Tests: 5 total, 3 passed, 2 failed, 0 errors
- Success Rate: 60.0%
- Score Range: 0.60 - 0.94 (σ=0.1148)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.94 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json
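Both failed negative tasks fail for the same reason: the prompt scores at or above the 0.50 alignment threshold when it is expected to score below it. A minimal sketch of that pass/fail rule, inferred from the grader messages in the log (the function name and signature are hypothetical, not waza's API):

```python
THRESHOLD = 0.50  # taken from the ">= 0.50" comparisons in the grader messages


def trigger_grader_passes(score: float, expect_trigger: bool) -> bool:
    """Pass when the alignment score falls on the expected side of the threshold."""
    aligned = score >= THRESHOLD
    return aligned == expect_trigger


# Negative tasks from the log: the skill should NOT trigger, but the scores say it would.
print(trigger_grader_passes(0.71, expect_trigger=False))  # destroy / tear-down prompt
print(trigger_grader_passes(0.65, expect_trigger=False))  # what-if / preflight prompt

# Positive task from the log: the skill should trigger, and it does.
print(trigger_grader_passes(0.83, expect_trigger=True))

# Hypothetical low score for an off-topic prompt (value not shown in the log).
print(trigger_grader_passes(0.20, expect_trigger=False))
```

Under this reading, the destroy and what-if prompts sit only 0.15 to 0.21 above the threshold, which fits the quality feedback further down that asks for sharper negative triggers.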
Model: gpt-5.3-codex
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact
[ERROR] waiting for session.idle: context deadline exceeded
✗ [3/5] Negative — What-if preview / preflight validation
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 1m27.827s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criterion 4: the response does not mention that state.json (schemaVersion 1.0) will be written to capture the stack ID and managed resources. Criteria 1 (az stack sub create), 2 (--action-on-unmanage deleteAll), and 3 (deploy-stack.sh helper) are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json
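The flaky-task line (50% pass rate, score=0.78±0.17) can be approximated from per-run grader scores. The sketch below is an illustration under assumptions: run 1 uses the grader scores from the failed run above, while run 2 passed and is not detailed in the log, so its values are assumed. The "flaky when some but not all runs pass" rule is also an assumption about how the report decides.

```python
from statistics import mean, pstdev

# Run 1 grader scores are copied from the failed run above
# (answer_quality, budget, trigger_relevance_positive).
run_1 = {"passed": False, "graders": [0.00, 1.00, 0.83]}
# Run 2 passed, so its details are not in the log. These values are an
# ASSUMPTION: answer_quality 1.00 with the other graders unchanged.
run_2 = {"passed": True, "graders": [1.00, 1.00, 0.83]}

runs = [run_1, run_2]
run_scores = [mean(run["graders"]) for run in runs]
pass_rate = sum(run["passed"] for run in runs) / len(runs)

# Treat a task as flaky when some, but not all, of its runs pass.
is_flaky = 0.0 < pass_rate < 1.0

print(
    f"{pass_rate:.0%} pass rate, "
    f"score={mean(run_scores):.2f}±{pstdev(run_scores):.2f}, "
    f"flaky={is_flaky}"
)
```

With these inputs the mean and population standard deviation round to the reported 0.78 and 0.17, which suggests the single failing grader (answer_quality) accounts for the whole spread between the two runs.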
🔢 Tokens (count + profile)
📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
⚠️ token count 1912 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the title and description. Steps are numbered, well-ordered, and include both bash and PowerShell equivalents. The inline script behavior breakdown (steps 1–7) leaves no ambiguity about what the script does at each stage.
completeness █████ Covers prerequisites, arguments, procedure, output format, failure modes, state.json schema, soft-deletable resource types, and post-run user messaging requirements. Edge cases like race conditions, stack unavailability, and fallback behavior are explicitly documented.
trigger_precision ████░ USE FOR and DO NOT USE FOR sections are well-defined with cross-links to the correct alternative skills. Minor gap: no explicit trigger for 'first-time deployment vs. update' distinction, though the re-deploy case is mentioned in prose. Could add a DO NOT USE FOR case covering partial/incremental deployments if that's a real routing risk.
scope_coverage █████ Scope is tightly defined — subscription-scoped stack create only, with explicit exclusions for destroy, preflight, and template authoring. The fallback path is scoped and labeled with a clear trade-off warning, preventing scope creep into legacy deployment patterns.
anti_patterns ████░ Avoids vague instructions, conflicting directives, and missing error handling well. The 'What to tell the user after running' section is a nice anti-hallucination guard. Minor: the fallback behavior (step 4) is described in both the procedure and the arguments table (--no-fallback flag), which is slightly redundant but not harmful. The race-condition recovery ('Re-run — the script is idempotent') could be more specific about how long to wait or what to check first.
────────────────────────────────────────────
Overall: 4.6/5.0
This is a high-quality, production-ready SKILL.md. It excels at completeness and clarity, with thorough error handling, schema documentation, and explicit post-run messaging requirements that prevent agent hallucination. Trigger precision is strong but could add one more DO NOT USE FOR case for partial deployments. The only minor anti-pattern is slight redundancy around the fallback mechanism and a vague race-condition recovery step.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-stack-deploy/eval.yaml`, so any "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-deploy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 1912 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 0/8 valid
⚠️ 8 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
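All eight findings share one cause: each link climbs out of the skill's own directory with `..`. A minimal sketch of such a check, assuming the rule is "a relative link must not resolve above the directory containing SKILL.md" (how `waza check` actually implements it is not shown in this log, and the function name is hypothetical):

```python
import posixpath


def escapes_skill_dir(link: str) -> bool:
    """Return True when a relative link resolves outside the directory it lives in."""
    if link.startswith("/"):
        return True  # absolute paths are treated as escaping in this sketch
    normalized = posixpath.normpath(link)
    return normalized == ".." or normalized.startswith("../")


links = [
    "../azure-stack-destroy/SKILL.md",            # from the findings above
    "../../../website/docs/deployment/state.md",  # from the findings above
    "scripts/deploy-stack.sh",                    # hypothetical in-directory link for contrast
]
for link in links:
    print(f"{link}: {'escapes' if escapes_skill_dir(link) else 'ok'}")
```

Normalizing first matters: a path such as `a/../b` stays inside the directory, while `a/../../b` does not, and a plain prefix test on the raw string would misjudge both.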
📊 Token Budget: 1912 / 500 tokens
❌ Exceeds limit by 1412 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Description density is optimal for cross-model use
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-destroy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
[ERROR] waiting for session.idle: context deadline exceeded
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.74 | Duration: 1m48.171s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.1064)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: Missing explicit coverage of required criteria: The assistant invoked the azure-stack-destroy skill and delegated execution to a background agent, but the user-facing response itself did not: (1) explicitly recommend the skill/scripts over raw `az group delete` with rationale about soft-delete/multi-RG; (2) reference the state.json prerequisite at .azure/deployments/deploy-20260524-test/; (3) mention `az stack sub delete --action-on-unmanage deleteAll` semantics; (4) cover soft-delete purge sweep (Key Vault, Cognitive Services) or purgeProtected retention. The response was essentially "I'm running it, will report back"; none of the four PASS criteria are explicitly communicated.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criterion 1: The response recommended the destroy script but did not explicitly contrast it with raw `az group delete` or explain that raw `az group delete` misses soft-delete cleanup and multi-RG/subscription-scope resources. Criterion 3 is also weak: the response mentioned "deleteAll" semantics but did not explicitly cite the `az stack sub delete --action-on-unmanage deleteAll` command.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Criteria 1, 2, 3 are met (recommends destroy-stack.sh, references state.json prerequisite, mentions az stack sub delete --action-on-unmanage deleteAll). Criterion 4 is partially met but weak: the response says "Purge any soft-deleted Key Vaults so the names are immediately reusable" but does not name `az keyvault purge` or `az keyvault list-deleted`, nor does it explicitly explain the non-purge-protected sweep semantics. Borderline, leaning fail for missing the explicit command name.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [4/5] Positive — Clean up the deployment stack
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m29.083s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.1093)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.63 | ❌ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: : Response recommended the destroy script (criterion 1 partial) but did not explicitly explain why raw `az group delete` is insufficient (missing soft-delete cleanup / multi-RG). Did not mention the `state.json` prerequisite under `.azure/deployments/deploy-20260524-test/` (criterion 2 missing). Did not mention `az stack sub delete --action-on-unmanage deleteAll` semantics (criterion 3 missing). Did not describe the Key Vault / Cognitive Services purge sweep or `purgeProtected` retention (criterion 4 missing).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing criteria 2, 3, and 4: The assistant recommended the azure-stack-destroy script (criterion 1 partially met, though it did not explicitly contrast with raw `az group delete` and explain why). However, it did NOT mention: (2) the state.json prerequisite under .azure/deployments/deploy-20260524-test/, (3) the underlying `az stack sub delete --action-on-unmanage deleteAll` command/semantics, or (4) the soft-delete purge sweep behavior for Key Vault/Cognitive Services or the purgeProtected retention behavior. The response was cut short due to an environment permission error and only provided the script command without explaining what it does.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing criteria 2 and 3: The assistant recommended the skill and ran destroy-stack.sh (criterion 1 ✅) and mentioned the soft-deleted Key Vault purge sweep so the name can be reused (criterion 4 ✅). However, the response did not reference state.json under .azure/deployments/deploy-20260506-001/ as the source of truth (criterion 2 ❌), and did not name the actual `az stack sub delete --action-on-unmanage deleteAll` command or its single-idempotent-call semantics (criterion 3 ❌). Those details lived only in the loaded skill context, not in the assistant's reply.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Response missing 3 of 4 required elements: Criterion 1 met (recommends destroy-stack.sh under .github/skills/azure-stack-destroy/scripts/). Missing: (2) does not reference state.json under .azure/deployments/deploy-20260506-001/ as source of truth; (3) does not name `az stack sub delete --action-on-unmanage deleteAll` command or its semantics; (4) does not explicitly mention `az keyvault purge` / `az keyvault list-deleted` or explain the purge sweep mechanics, and only vaguely says "purges eligible soft-deleted Key Vaults".
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
⚠️ token count 2644 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are logically ordered, and the fast vs sync mode comparison table is an excellent UX decision. Code examples are concrete and cover bash/PowerShell parity. Minor issue: the 'When to Use' section near the top partially duplicates the 'USE FOR' section, adding noise.
completeness █████ Exceptional coverage: prerequisites with version constraints, all failure modes with recovery steps, status codes with meanings, purge sweep behavior per resource type, state.json field mapping, and idempotency guarantees. Edge cases like purge-protected vaults and the fallback path when stackId is absent are explicitly handled.
trigger_precision ████░ USE FOR and DO NOT USE FOR triggers are specific and include exact user-phrasing examples, which is excellent. However, the standalone 'When to Use' section at the bottom duplicates trigger content already in 'USE FOR', creating redundancy that could confuse routing logic — consolidate or remove it.
scope_coverage █████ Boundaries are exceptionally well-defined: explicit 'no surgical mode' caveat, hard state.json prerequisite, non-Git-Ape exclusions, and clear differentiation from raw az group delete with three concrete reasons. Capabilities and intentional omissions (App Configuration, API Management not auto-purged) are both documented.
anti_patterns ████░ No vague instructions, no conflicting directives, and error handling is thorough. The one notable anti-pattern is the duplicated 'When to Use' / 'USE FOR' content, which could cause an agent to double-weight those triggers. The bypass flag safety rationale is a good proactive clarification that avoids misuse.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill definition. It excels at completeness and scope coverage, with thorough error handling, idempotency guarantees, and explicit edge-case documentation. The primary improvement opportunity is consolidating the duplicate 'When to Use' and 'USE FOR' sections to reduce redundancy and tighten routing signal.
✅ Check (compliance summary)
ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-destroy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 2644 tokens (hard limit 500)
📐 Spec Compliance: 7/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
❌ [spec-security] Security risks detected: description contains XML angle brackets
📎 XML angle brackets and reserved prefixes pose injection and naming conflict risks
📎 Links: 0/4 valid
⚠️ 4 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory
📊 Token Budget: 2644 / 500 tokens
❌ Exceeds limit by 2144 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips
Pull request overview
Overhauls the /git-ape-onboarding flow: replaces the .exampleyml activation hack with a template-driven scaffolder under the skill directory, migrates deploy/destroy from az deployment sub to Azure Deployment Stacks (closes part of #30), registers the onboarding eval suite in the pilot tier, declares prompt files in the VSIX, and regenerates website docs.
Changes:
- Removes .github/workflows/git-ape-{plan,deploy,destroy,verify}.exampleyml, ships canonical templates under .github/skills/git-ape-onboarding/templates/workflows/, plus scaffold-repo.{sh,ps1} and sync-templates.{sh,ps1} with a new git-ape-onboarding-template-check CI workflow enforcing parity.
- Rewrites deploy + destroy templates around az stack sub create/delete --action-on-unmanage deleteAll, adds a state-file stackId/managedResources[] schema, a rollback step, a templateanalyzer staging workaround, and switches AZURE_SUBSCRIPTION_ID from secrets to vars.
- Renames gpt-5-codex → gpt-5.3-codex in tier manifest and bench prompts; registers git-ape-onboarding in the pilot tier with 4 tasks; tightens the agent's identity contract and adds a "required inputs" gate; declares .github/prompts/ in plugin.json and registers 9 chatPromptFiles; trims .vscodeignore.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/git-ape-onboarding-template-check.yml | New CI parity check (bash+pwsh sync + scaffold byte-diff). |
| .github/workflows/git-ape-deploy.exampleyml (deleted) | Old activation stub; superseded by template. |
| .github/skills/git-ape-onboarding/templates/workflows/git-ape-{plan,deploy,destroy,verify}.yml | Canonical workflow templates; Stacks migration + scan staging + rollback. |
| .github/skills/git-ape-onboarding/templates/README.md + copilot-instructions.md | Maintainer doc + canonical deployment standards. |
| .github/skills/git-ape-onboarding/scripts/{scaffold-repo,sync-templates}.{sh,ps1} | Parity scaffold + mirror scripts. |
| .github/skills/git-ape-onboarding/SKILL.md, .github/agents/git-ape-onboarding.agent.md | Drop acknowledgment phase, add invariants/identity/non-goals/required-inputs gate. |
| .github/copilot-instructions.md | Stacks-based deploy/destroy guidance. |
| .github/evals/git-ape-onboarding/{eval,tasks/*}.yaml, .github/evals/manifest.yaml | New eval suite (3 positive + 1 negative); register skill in pilot, rename codex model. |
| .github/prompts/{agent,skill}-bench.prompt.md | Update default model list. |
| extension/package.template.json, extension/.vscodeignore, plugin.json | Register prompt files in VSIX, drop dev-only .github/* paths from VSIX. |
| scripts/generate-docs.js, README.md, website/docs/** | Regenerated docs for both repo CI and scaffolded user-facing workflows. |
Copilot's findings
Comments suppressed due to low confidence (3)
.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:44
- The check now reads vars.AZURE_SUBSCRIPTION_ID (a repository variable), but the error message and summary still call it a "secret". This is misleading: a user looking at logs will go check repo Secrets, not repo Variables, and may waste time before realising the setup expects a variable. Update the user-facing messages and the missing-config copy to refer to AZURE_SUBSCRIPTION_ID as a variable. Also note git-ape-deploy.yml still writes subscription from vars.AZURE_SUBSCRIPTION_ID while the onboarding skill (Step 7) and copilot-instructions.md (line 405) still document AZURE_SUBSCRIPTION_ID as a secret, so the docs and the workflow contract have diverged.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:121
- The verify workflow checks for git-ape-ttl-reaper.yml, but the scaffold helper (scaffold-repo.sh / scaffold-repo.ps1) does not ship a TTL Reaper template: the MAPPINGS only include plan, deploy, destroy, verify, and drift.{md,lock.yml}. Every onboarded repo will therefore see a perpetual "⚠️ Git-Ape: TTL Reaper (git-ape-ttl-reaper.yml) — not found" warning in Verify Setup. Either drop this entry from the workflow list, or add the TTL Reaper template to the scaffolder and the templates/workflows/ directory.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml:151
- This gate accepts a state file as long as it has either stackId or deploymentId. Every state file ever written by this project has a deploymentId (it's the matrix key), so the check effectively only fails if state.json is corrupt. For a deployment created by the old (pre-Stacks) git-ape-deploy.exampleyml, stackId will be empty but deploymentId will be set, so the check passes, then az stack sub show --name "$STACK_NAME" in the next step returns a non-zero exit, the workflow records exists=false and exits 0 with "Already destroyed (stack not found)". Real Azure resources still exist, the resource group is never deleted, but the destroy run reports success and metadata.json will be flipped to destroyed. To make this Stacks-only and safe, require stackId explicitly (or, if you must accept old state files, fall back to az group delete on state.resourceGroup when stackId is empty).
- Files reviewed: 47/47 changed files
- Comments generated: 3
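A minimal sketch of the stricter gate suggested in the third finding, assuming the workflow has already read stackId and resourceGroup out of state.json into shell variables. The helper name teardown_mode and its three result labels are hypothetical and are not taken from the templates.

```shell
# Hypothetical helper: decide how to tear down based on what state.json holds.
# $1 = stackId from state.json (may be empty for pre-Stacks deployments)
# $2 = resourceGroup from state.json (may be empty)
teardown_mode() {
  stack_id="$1"
  resource_group="$2"
  if [ -n "$stack_id" ]; then
    # Normal path: the stack exists in state, delete it via Deployment Stacks.
    echo "stack"
  elif [ -n "$resource_group" ]; then
    # Legacy state file: fall back to deleting the recorded resource group.
    echo "legacy-rg"
  else
    # Nothing to act on: fail instead of reporting "Already destroyed".
    echo "refuse"
    return 1
  fi
}
```

The point of the third branch is that a missing stackId never silently turns into a successful destroy run; a caller would map "stack" to az stack sub delete and "legacy-rg" to az group delete.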
    STACK_ID='${{ steps.deploy.outputs.stack_id }}'
    MANAGED='${{ steps.deploy.outputs.managed_resources }}'
    MANAGED=${MANAGED:-[]}
Fixed in d6b41dc. managed_resources and stack_id are now passed through env: and read as MANAGED="${MANAGED_RESOURCES:-[]}", then validated with jq empty (falling back to [] on invalid/empty JSON) before being consumed via jq --argjson managedResources. No more single-quoted JSON literal.
    # Also snapshot the previous template from git (parent commit of this merge
    # or origin/main for /deploy comment). Used to redeploy last-known-good on failure.
    DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}"
    mkdir -p /tmp/rollback
    if git show HEAD~1:"$DEPLOY_DIR/template.json" > /tmp/rollback/template.json 2>/dev/null; then
      cp "$DEPLOY_DIR/parameters.json" /tmp/rollback/parameters.json 2>/dev/null || true
      # Prefer the previous parameters if they exist at HEAD~1
      git show HEAD~1:"$DEPLOY_DIR/parameters.json" > /tmp/rollback/parameters.json 2>/dev/null || true
      echo "prior_template_available=true" >> "$GITHUB_OUTPUT"
      echo "[$(date -u +%H:%M:%S)] Previous template captured from HEAD~1 → /tmp/rollback/"
      echo " template bytes: $(wc -c < /tmp/rollback/template.json)"
    else
      echo "prior_template_available=false" >> "$GITHUB_OUTPUT"
      echo "[$(date -u +%H:%M:%S)] No previous template in git history (first deployment)"
    fi
Fixed in d6b41dc. The rollback baseline is now derived per trigger: HEAD~1 only for push; for /deploy comments we git fetch origin main --depth=1 and use origin/main. git show "$BASELINE_REF:$DEPLOY_DIR/template.json" then reads the correct previous known-good template instead of the PR head.
Follow-up: the /deploy PR-comment trigger referenced above has since been removed entirely (unverifiable comment-author authorization). The rollback baseline is now derived solely from the push trigger — HEAD~1 on main after merge — so the origin/main fetch path for /deploy no longer exists.
    if (validationStatus === 'passed' && whatifResult) {
      comment += `### What-If Analysis\n\n`;
      comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
    } else if (whatifStatus === 'passed' && whatifResult) {
      comment += `### What-If Analysis\n\n`;
      comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
Fixed in d6b41dc. Removed the unreachable validationStatus === passed && whatifResult branch; what-if rendering is now driven uniformly by whatifStatus === passed && whatifResult.
sendtoshailesh left a comment
Thanks for the substantial cleanup here — moving the onboarding scaffolds out of .github/workflows/, adding sync/parity tooling, and wiring prompt/eval registration all make sense. I also like the skip-on-collision behavior in the scaffolders and the explicit docs refresh.
I did find a few blocking issues that should be fixed before merge:
- Command injection in manual destroy path (.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml, around lines 55-66)
  inputs.confirm and inputs.deployment_id are interpolated directly into a run: script via ${{ ... }}. Because Actions expands those expressions before bash parses the script, a crafted workflow_dispatch input can inject arbitrary shell. Please pass these values through env: (or another non-shell-interpolated channel) and read them from normal shell variables instead.
- Unsafe direct interpolation of github.base_ref into shell (.github/skills/git-ape-onboarding/templates/workflows/git-ape-plan.yml:44)
  github.base_ref is used directly inside the git diff command in a run: block. Per GitHub’s Actions hardening guidance, attacker-controlled context values should not be embedded into shell scripts this way. This should also be routed through env: and quoted normally in bash.
- Rollback source is wrong for /deploy runs (.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml, around lines 219-228 and 475-486)
  The comment says the workflow should snapshot the parent commit or origin/main for /deploy comments, but the implementation always reads HEAD~1. On comment-triggered deploys that means rollback can redeploy an earlier PR commit that was never the last known-good state, instead of rolling back to main. That is especially risky on multi-commit PRs. Please branch this logic so /deploy captures from origin/main (or another authoritative deployed baseline) before using it for rollback.
One additional hardening nit: git-ape-verify.yml also embeds secret values directly into shell conditionals (${{ secrets.AZURE_CLIENT_ID }} etc.). I would strongly prefer converting those checks to env booleans/variables as well.
Once the injection issues and rollback baseline are fixed, I’d be happy to re-review.
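To make the injection class above concrete, here is a small local simulation. It is a sketch, not workflow code: the payload is a made-up stand-in for a workflow_dispatch input, and pasting it into the script text with sh -c only mimics what ${{ ... }} expansion does before the shell parses a run: block.

```shell
# Hypothetical attacker-supplied value standing in for inputs.deployment_id.
payload='x"; echo INJECTED; echo "'

# Unsafe: the value becomes part of the script source, so the shell
# parses the embedded "echo INJECTED" as a command of its own.
unsafe_output="$(sh -c "ID=\"${payload}\"; echo \"id=\$ID\"")"

# Safe: the value travels through the environment and is only ever
# read as data through a quoted variable.
safe_output="$(DEPLOYMENT_ID="$payload" sh -c 'printf "id=%s\n" "$DEPLOYMENT_ID"')"

printf 'unsafe:\n%s\n' "$unsafe_output"
printf 'safe:\n%s\n' "$safe_output"
```

In the unsafe run the injected command executes; in the safe run the same bytes are printed verbatim as the value of the variable. That is the behaviour the env: routing is meant to provide.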
Address PR review on the git-ape-onboarding workflow templates:
- Route attacker-controllable inputs (github.base_ref, workflow_dispatch inputs, JSON step outputs) through env: and read them as quoted shell variables to close script-injection vectors (plan, destroy).
- plan: compute the PR diff against origin/$BASE_REF instead of an unsanitised interpolation.
- deploy: derive the rollback baseline from HEAD~1 (push) or origin/main (/deploy comment); pass stack_id/managed_resources via env and validate the managed_resources JSON before jq consumes it.
- destroy: make teardown Deployment-Stacks-only with a guarded legacy resource-group fallback; emit explicit legacy/fallback_rg outputs.
- verify: gate required secrets/variable via env booleans; check the AZURE_SUBSCRIPTION_ID variable; align the scaffolded WORKFLOWS list with the scaffolder (drop ttl-reaper, add verify, use drift.lock.yml).
- plan: remove the unreachable what-if render branch.
Regenerate website workflow docs.
AZURE_SUBSCRIPTION_ID is consumed via vars. in every scaffolded workflow, so document it as a GitHub repository/environment variable (not a secret). AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. Fix the OIDC snippet in both copilot-instructions templates to use vars.AZURE_SUBSCRIPTION_ID.
Thanks for the thorough review, @sendtoshailesh. All four points are addressed in d6b41dc (workflow templates) and 67005a7 (docs). Summary:
1. Command injection in manual destroy path (
2. Unsafe interpolation of
3. Rollback baseline wrong for
Hardening nit:
While in here I also addressed the Copilot review threads (managed_resources JSON via
Ready for re-review.
sendtoshailesh left a comment
Follow-up review:
Previously raised issues:
- ✅ Fixed: git-ape-destroy.yml no longer interpolates inputs.* directly into shell; the workflow_dispatch inputs are routed via env and JSON-encoded with jq before use.
- ✅ Fixed: git-ape-plan.yml no longer inlines ${{ github.base_ref }} in the shell; it is passed through env.BASE_REF first.
- ✅ Fixed: git-ape-deploy.yml now uses origin/main for /deploy rollback baselines instead of always assuming HEAD~1; the push path still uses HEAD~1, which is the previous main commit after merge.
- ✅ Fixed: git-ape-verify.yml moved the secret checks to env booleans instead of embedding ${{ secrets.* }} directly in shell conditionals.
New issues found:
- ❌ Blocking: matrix.deployment_id is still derived from attacker-controlled deployment directory names and interpolated directly into run: blocks / JS string literals in the plan, deploy, and destroy templates. That reintroduces shell / script injection via paths under .azure/deployments/*/.
- ⚠️ Non-blocking: the /deploy comment path checks approval state, but it still does not verify that the commenter is an authorized collaborator/member before triggering deployment.
- ⚠️ Non-blocking: both deploy and destroy still swallow git push failures after updating state.json / metadata.json, which can leave Azure state changed without the repo state being persisted.
Overall verdict: the original blockers are resolved, but the new matrix.deployment_id injection path is still a release-blocking security issue, so this PR is not merge-ready yet.
…t_id injection
matrix.deployment_id is derived from attacker-controllable .azure/deployments/*/ directory names and was interpolated directly into run: bash blocks and github-script JS string literals across the plan, deploy, and destroy workflow templates. Route it through job-level env (DEPLOYMENT_ID) so run blocks reference $DEPLOYMENT_ID and github-script reads process.env.DEPLOYMENT_ID, and reject any directory name outside ^[A-Za-z0-9._-]+$ at the detect step (defense in depth, also makes derived deploy_dir provably safe).
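The allow-list check described in this commit can be sketched in portable shell as follows. The function name is hypothetical, and the case pattern is an assumed equivalent of the ^[A-Za-z0-9._-]+$ regex rather than the template's actual code.

```shell
# Hypothetical helper: accept only directory names made of
# letters, digits, dot, underscore and hyphen (non-empty).
is_safe_deployment_id() {
  case "$1" in
    '' | *[!A-Za-z0-9._-]*) return 1 ;;  # empty, or contains a disallowed character
    *) return 0 ;;
  esac
}

# Example: filter candidate directory names before they enter the matrix.
for candidate in 'deploy-20260506-001' 'x$(id)' 'a b' 'ok_1.2'; do
  if is_safe_deployment_id "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

Note that a pattern like this still admits "." and "..", so a real detect step may want to exclude those separately; that extra check is a suggestion, not something stated in the commit.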
…e push
The /deploy comment trigger cannot reliably verify the commenter's authorization, so deployment is now gated solely on merge to main (which already requires PR review + approval via branch protection). Removes the issue_comment trigger, the check-comment-trigger job, and all PR-head-ref checkout paths. Also fails loud (exit 1) instead of swallowing git push failures when committing deployment/teardown state back to main.
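A rough sketch of the "fail loud" behaviour for the state commit, assuming the push is wrapped in a small helper. The name persist_state is hypothetical; the ::error:: prefix is the standard GitHub Actions workflow command for surfacing an error annotation.

```shell
# Hypothetical wrapper: run the given command (for example a git push)
# and turn any failure into a visible error plus a non-zero status.
persist_state() {
  if ! "$@"; then
    echo "::error::Failed to persist deployment state: $*" >&2
    return 1
  fi
}
```

In a workflow step this might be called as `persist_state git push origin main || exit 1`, so a rejected push fails the job instead of leaving Azure changed while the repository state is stale.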
sendtoshailesh left a comment
Follow-up review:
Previously raised issues:
- ✅ Fixed: matrix.deployment_id is now validated against ^[A-Za-z0-9._-]+$ before entering the matrix and routed through env.DEPLOYMENT_ID / process.env.DEPLOYMENT_ID in the plan, deploy, and destroy templates, so the earlier matrix.deployment_id shell/JS injection path is closed.
- ✅ Fixed: the /deploy comment path is gone entirely from git-ape-deploy.yml, so there is no longer an unauthenticated comment-triggered deployment path to authorize.
- ✅ Fixed: deploy and destroy now fail the workflow if the post-state git push fails instead of silently swallowing that error.
New issues found:
- ❌ Blocking: untrusted values read from parameters.json are still interpolated directly into run: scripts via ${{ ... }} in the workflow templates, which reintroduces the same GitHub Actions expression-to-shell injection class under a different input. Examples: git-ape-plan.yml uses ${{ steps.params.outputs.location }} in shell at lines 157, 414, and 455; git-ape-deploy.yml uses ${{ steps.params.outputs.location }}, ${{ steps.params.outputs.project }}, and ${{ steps.params.outputs.environment }} in shell/JQ argument positions at lines 175, 178, 241, 244-245, 258, 420, and 498-500. These values come from attacker-controlled PR content (parameters.json) and need the same treatment as deployment_id: validate if needed, pass through env:, and reference normal shell variables instead of inlining ${{ ... }} into script source.
- ⚠️ Non-blocking: git-ape-plan.yml still tells reviewers to comment /deploy (Plan Comment step, line 738), but that trigger has been intentionally removed. The PR guidance should be updated to avoid instructing users to use a nonexistent path.
Overall verdict:
The previously raised issues are resolved, but the new ${{ steps.params.outputs.* }} injection path is still a release-blocking security issue, so this PR is not merge-ready yet.
…ection
Untrusted location/project/environment values read from parameters.json
were interpolated directly into run: script bodies via ${{ steps.params.outputs.* }},
the same expression-to-shell injection class already fixed for deployment_id.
Route them through step-level env: blocks and reference $LOCATION/$PROJECT/$ENVIRONMENT
shell variables instead. Also drop the stale /deploy reviewer instruction in
git-ape-plan.yml (that trigger was removed). Regenerated workflow docs.
@sendtoshailesh Thanks for the thorough re-review. Fixed the remaining injection in
Blocking item:
Non-blocking item: stale
…rhaul
# Conflicts:
# .github/agents/git-ape.agent.md
# .github/copilot-instructions.md
# .github/evals/manifest.yaml
# .github/workflows/git-ape-deploy.exampleyml
# .github/workflows/git-ape-destroy.exampleyml
# website/docs/agents/git-ape.md
# website/docs/workflows/git-ape-deploy.md
# website/docs/workflows/git-ape-destroy.md
Merge resolution updated the .github/copilot-instructions.md mirror to the stack-based deployment flow (dropping the /deploy trigger). Propagate the same content to the canonical templates/copilot-instructions.md so the onboarding template-check (bash + pwsh) passes.
Regenerated from sources updated by the upstream/main merge (azure-resource-deployer and azure-template-generator agents now delegate to skills; lock workflow metadata).
sendtoshailesh left a comment
Round 4 follow-up review:
Previously raised issues:
- ✅ Fixed: untrusted parameters.json values (location, project, environment) are now routed through env: before use in shell steps instead of being interpolated directly into run: blocks.
- ✅ Fixed: the stale /deploy reference was removed from the plan comment path.
Conflict resolution assessment:
- ✅ Merge resolution looks clean overall. I did not find conflict markers or accidental duplicate sections in the changed templates/workflows, the key workflow YAML files parse successfully, and the onboarding template sync check passes.
New issues found:
- ❌ Blocking: website/docs/getting-started/onboarding.md still tells users to configure AZURE_SUBSCRIPTION_ID as a GitHub secret (gh secret set at lines 364-366, 383-391), but the scaffolded workflows and verify flow now read it from vars.AZURE_SUBSCRIPTION_ID as a variable. A user following the updated onboarding docs will end up with a broken setup: verify/deploy read from vars, but the docs populate secrets. Given this PR is specifically overhauling onboarding/scaffolding, that documentation contract needs to be consistent before merge.
- ⚠️ Non-blocking: git-ape-verify.yml and its generated docs still say "Merge or comment /deploy to deploy", and the summary still says "secret(s) missing" even though one of the required values is now a variable. That guidance is stale/misleading, though the actual deploy trigger removal in plan/deploy is correct.
Overall verdict:
The round-3 blockers are fixed and the merge conflict resolution looks solid, but the onboarding docs still misconfigure AZURE_SUBSCRIPTION_ID, so I don’t think this is merge-ready yet. Once the docs/template guidance are aligned with the new variable-based contract, I’d be happy to re-review.
Round 4 review (sendtoshailesh):
- Blocking: onboarding docs configured AZURE_SUBSCRIPTION_ID via 'gh secret set', but the scaffolded plan/deploy/destroy/verify workflows read it from vars.AZURE_SUBSCRIPTION_ID. Switch the single- and multi-environment setup steps to 'gh variable set' so the documented contract matches the workflows. AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets.
- Non-blocking: git-ape-verify.yml summary said 'secret(s) missing' (one value is now a variable) and 'Merge or comment /deploy to deploy' (the /deploy trigger was removed). Reworded to 'required value(s) missing' and 'Merge to main to deploy'; renamed the check step accordingly.
Regenerated git-ape-verify.md from the updated template.
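For reference, the secret/variable split described above could be expressed as the following dry-run sketch. It only prints the gh commands and does not run them; the helper name and the optional environment argument are illustrative, and the exact commands in the onboarding docs may differ.

```shell
# Hypothetical dry-run helper: print the gh commands for one target.
# $1 = optional GitHub environment name (omit for repository scope).
print_setup_commands() {
  env_flag=""
  if [ -n "${1:-}" ]; then
    env_flag=" --env $1"
  fi
  # Client and tenant IDs stay secrets.
  echo "gh secret set AZURE_CLIENT_ID${env_flag}"
  echo "gh secret set AZURE_TENANT_ID${env_flag}"
  # The subscription ID is read via vars.*, so it must be a variable.
  echo "gh variable set AZURE_SUBSCRIPTION_ID${env_flag}"
}

print_setup_commands        # single-environment (repository scope)
print_setup_commands prod   # multi-environment, per-environment scope
```

Run without --body, both gh commands prompt for the value, which keeps the IDs out of shell history.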
@sendtoshailesh Thanks for the round 4 review. Both points addressed in
Blocking:
Non-blocking: stale
The template ↔ mirror sync check passes locally for both bash and pwsh.
Summary
End-to-end overhaul of the /git-ape-onboarding flow plus the supporting packaging, instructions, eval registration, and regenerated docs. Five themed commits, each independently revertable:
- chore(models): rename gpt-5-codex to gpt-5.3-codex
- feat(onboarding): replace exampleyml stubs with template-driven scaffold
  .github/workflows/ (where the .exampleyml hack lived) into .github/skills/git-ape-onboarding/templates/; ship sync scripts; add eval suite + CI parity check; register the eval in the pilot tier
- feat(extension): register prompt files in VSIX and tighten .vscodeignore
  plugin.json now declares prompts:; package.template.json registers all 9 prompts as chatPromptFiles; .vscodeignore sheds dev-only .github subtrees from the published VSIX
- docs(instructions): switch deploy/destroy guidance to Azure Deployment Stacks
  az stack sub create/delete --action-on-unmanage deleteAll instead of az deployment sub + az group delete (see #30)
- docs(website): regenerate for templated workflows and prompt assets
  .github/workflows/ and the skill templates directory, tags scaffolded workflows with a Docusaurus admonition
Key change: .exampleyml is gone
Old shape (this repo's .github/workflows/):
.exampleyml was a workaround so GitHub Actions wouldn't auto-load the scaffolds.
New shape (.github/skills/git-ape-onboarding/templates/workflows/):
The path is no longer .github/workflows/ so the workaround isn't needed. /git-ape-onboarding copies these into the target repo with skip-on-collision so customized workflows are never overwritten.
Eval registration
Adds git-ape-onboarding to pilot tier in .github/evals/manifest.yaml, matching its prior 4-model bench coverage. The eval ships 4 tasks: positive-first-time-setup, positive-multi-env, positive-skip-on-collision, negative-storage-comparison. Closes part of #93.
Verification done
- actionlint clean on the new git-ape-onboarding-template-check.yml
- node scripts/generate-docs.js re-runs with no further drift
- scripts/ carry the executable bit
Dependency
Depends on #140 (LLM-as-judge → claude-opus-4.7). This PR is mergeable independently: if it lands first, the new git-ape-onboarding eval will run with whatever judge is pinned at merge time, then automatically pick up the opus judge once #140 merges. Prefer to merge #140 first to avoid a mixed-judge snapshot.
Risk
Medium-low.
- .exampleyml deletions are the only destructive change in this repo; their content is preserved verbatim in the templates directory.
- waza-evals matrix dispatch (pilot × 4 models = 16 legs). Quota cost: ~equivalent to prereq-check baseline.