feat(onboarding): template-driven scaffold + register prompts/eval in matrix#142

Open
arnaudlh wants to merge 16 commits into main from feat/onboarding-overhaul

@arnaudlh
Member

Summary

End-to-end overhaul of the /git-ape-onboarding flow plus the supporting packaging, instructions, eval registration, and regenerated docs. Five themed commits, each independently revertable:

| # | Commit | Why |
|---|--------|-----|
| 1 | chore(models): rename gpt-5-codex to gpt-5.3-codex | waza catalog renamed; aligns manifest tiers + bench prompt defaults |
| 2 | feat(onboarding): replace exampleyml stubs with template-driven scaffold | Move workflow scaffolds out of .github/workflows/ (where the .exampleyml hack lived) into .github/skills/git-ape-onboarding/templates/; ship sync scripts; add eval suite + CI parity check; register the eval in the pilot tier |
| 3 | feat(extension): register prompt files in VSIX and tighten .vscodeignore | plugin.json now declares prompts:; package.template.json registers all 9 prompts as chatPromptFiles; .vscodeignore sheds dev-only .github subtrees from the published VSIX |
| 4 | docs(instructions): switch deploy/destroy guidance to Azure Deployment Stacks | Copilot-instructions caught up with the actual templates: az stack sub create/delete --action-on-unmanage deleteAll instead of az deployment sub + az group delete (see #30) |
| 5 | docs(website): regenerate for templated workflows and prompt assets | Generator now scans both .github/workflows/ and the skill templates directory, and tags scaffolded workflows with a Docusaurus admonition |

Key change: .exampleyml is gone

Old shape (this repo's .github/workflows/):

git-ape-plan.exampleyml
git-ape-deploy.exampleyml
git-ape-destroy.exampleyml
git-ape-verify.exampleyml

.exampleyml was a workaround so GitHub Actions wouldn't auto-load the scaffolds.

New shape (.github/skills/git-ape-onboarding/templates/workflows/):

git-ape-plan.yml
git-ape-deploy.yml
git-ape-destroy.yml
git-ape-verify.yml
git-ape-drift.md          # agentic workflow source
git-ape-drift.lock.yml    # compiled lockfile

The path is no longer .github/workflows/ so the workaround isn't needed. /git-ape-onboarding copies these into the target repo with skip-on-collision so customized workflows are never overwritten.
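
A minimal Python sketch of the skip-on-collision copy described above. The shipped helpers are bash and PowerShell scripts whose contents are not shown in this PR, so the function name `copy_templates` and the demo file contents are assumptions made for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def copy_templates(template_dir: Path, target_dir: Path) -> dict[str, list[str]]:
    """Copy each template file into target_dir, never overwriting an existing file."""
    copied, skipped = [], []
    target_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(template_dir.iterdir()):
        if not src.is_file():
            continue
        dest = target_dir / src.name
        if dest.exists():
            # Collision: the target repo already has this workflow, leave it alone.
            skipped.append(src.name)
        else:
            shutil.copy2(src, dest)
            copied.append(src.name)
    return {"copied": copied, "skipped": skipped}


# Demo: one fresh workflow and one that the target repo has already customized.
with tempfile.TemporaryDirectory() as tmp:
    templates = Path(tmp) / "templates"
    target = Path(tmp) / "target" / ".github" / "workflows"
    templates.mkdir()
    target.mkdir(parents=True)
    (templates / "git-ape-plan.yml").write_text("name: plan (template)\n")
    (templates / "git-ape-deploy.yml").write_text("name: deploy (template)\n")
    (target / "git-ape-deploy.yml").write_text("name: deploy (customized)\n")

    result = copy_templates(templates, target)
    preserved = (target / "git-ape-deploy.yml").read_text()
```

Reporting the skipped names, rather than silently ignoring them, lets a caller tell the user which workflows were left untouched.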

Eval registration

Adds git-ape-onboarding to pilot tier in .github/evals/manifest.yaml — matches its prior 4-model bench coverage. The eval ships 4 tasks: positive-first-time-setup, positive-multi-env, positive-skip-on-collision, negative-storage-comparison. Closes part of #93.

Verification done

  • actionlint clean on the new git-ape-onboarding-template-check.yml
  • node scripts/generate-docs.js re-runs with no further drift
  • Eval YAML files parse
  • All shell scripts under scripts/ carry the executable bit
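
The executable-bit check in the last item could be reproduced with a small script along these lines. This is a sketch, not the check that was actually run: the helper name `non_executable_scripts` is hypothetical, and it only inspects the owner-executable bit of `*.sh` files.

```python
import stat
import tempfile
from pathlib import Path


def non_executable_scripts(root: Path) -> list[str]:
    """Return *.sh files under root that lack the owner-executable bit."""
    missing = []
    for script in sorted(root.rglob("*.sh")):
        if not script.stat().st_mode & stat.S_IXUSR:
            missing.append(script.relative_to(root).as_posix())
    return missing


# Demo with two throwaway scripts: one executable, one not.
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp) / "scripts"
    scripts.mkdir()
    ok = scripts / "scaffold-repo.sh"
    bad = scripts / "sync-templates.sh"
    for f in (ok, bad):
        f.write_text("#!/usr/bin/env bash\n")
    ok.chmod(0o755)
    bad.chmod(0o644)

    missing = non_executable_scripts(scripts)
```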

Dependency

Soft dependency on #140 (LLM-as-judge → claude-opus-4.7). This PR can merge independently: if it lands first, the new git-ape-onboarding eval runs with whatever judge is pinned at merge time, then automatically picks up the opus judge once #140 merges. Prefer merging #140 first to avoid a mixed-judge snapshot.

Risk

Medium-low.

  • Workflow templates are new files in a path GitHub Actions does not load, so they have no production CI impact.
  • The 4 .exampleyml deletions are the only destructive change in this repo; their content is preserved verbatim in the templates directory.
  • Eval registration adds 4 new tasks to the next waza-evals matrix dispatch (4 tasks × 4 pilot-tier models = 16 task runs). Quota cost is roughly equivalent to the prereq-check baseline.


Note: Recreated as a same-repo PR (was originally #141 from arnaudlh/git-ape). Fork PRs cannot read the COPILOT_GITHUB_TOKEN secret, so the waza eval matrix was skipping. Identical commits (8d5b9cf5..e66e2ecc), now wired so evals actually run. Closes #141.

arnaudlh added 6 commits May 29, 2026 16:22
The waza model catalog now ships gpt-5-codex under its versioned ID
gpt-5.3-codex. Align manifest tiers and bench-prompt argument hints
so dispatched runs resolve to a valid model.

- .github/evals/manifest.yaml: pilot + expanded tier model lists
- .github/prompts/agent-bench.prompt.md: default models in argument-hint + body
- .github/prompts/skill-bench.prompt.md: default models in argument-hint + body

🔖 - Generated by Copilot
Rewrite git-ape-onboarding as a skill-driven CLI playbook backed by a
sync-able template bundle. The previous .exampleyml workflows lived in
this repo's .github/workflows/ and were copy-pasted by users; they're
now first-class templates under the skill and pushed into target repos
by scripts/sync-templates.{sh,ps1}.

What ships:
- .github/agents/git-ape-onboarding.agent.md: rewritten flow + tools
- .github/skills/git-ape-onboarding/SKILL.md: new playbook structure
- .github/skills/git-ape-onboarding/scripts/: bash + pwsh helpers
    - scaffold-repo.{sh,ps1}: bootstrap target repo
    - sync-templates.{sh,ps1}: drop-in workflow + instructions update
- .github/skills/git-ape-onboarding/templates/: canonical target-repo
  artifacts (copilot-instructions.md, workflows/git-ape-{plan,deploy,
  destroy,verify,drift}.yml + drift agentic workflow + drift lockfile)
- .github/evals/git-ape-onboarding/: positive + negative tasks for
  first-time-setup, multi-env, skip-on-collision, and storage refusal
- .github/workflows/git-ape-onboarding-template-check.yml: CI check
  that the shipped templates pass actionlint and round-trip cleanly
- .github/evals/manifest.yaml: register git-ape-onboarding in pilot
  tier (matches its prior 4-model bench coverage)

Removed:
- .github/workflows/git-ape-{deploy,destroy,plan,verify}.exampleyml:
  retired — their content is now in skills/.../templates/workflows/

The .exampleyml extension was a workaround to keep GitHub Actions from
auto-loading workflow scaffolds; templates under the skill don't need
the workaround because their path isn't .github/workflows/.

🐵 - Generated by Copilot
Wire the .github/prompts/ directory into the published artifacts:

- plugin.json: declare 'prompts: .github/prompts/' so the plugin
  manifest exposes them alongside agents and skills.
- extension/package.template.json: register all 9 prompt files
  (git-ape, agent-{bench,improve,onboard,promote}, skill-{bench,
  improve,onboard,promote}) under chatPromptFiles so VS Code picks
  them up from the installed extension.
- extension/.vscodeignore: explicitly exclude dev-only .github
  subtrees (actionlint, dependabot, aw, copilot, evals, plugins,
  references, scripts, templates, workflows). Keeps agents/, skills/,
  plugin/, copilot-instructions.md, and now prompts/ in the VSIX
  while shedding ~MB of CI tooling that shouldn't ship to users.

🧩 - Generated by Copilot
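
The nine prompt names in the commit above follow a brace pattern, which the sketch below expands. The `.prompt.md` suffix matches the two bench prompts named earlier in this PR; applying the same suffix to the other seven files is an assumption, and the helper names are hypothetical.

```python
PREFIXES = ("agent", "skill")
ACTIONS = ("bench", "improve", "onboard", "promote")


def prompt_names() -> list[str]:
    """Expand git-ape plus {agent,skill}-{bench,improve,onboard,promote}."""
    return ["git-ape"] + [f"{p}-{a}" for p in PREFIXES for a in ACTIONS]


def prompt_paths(prompts_dir: str = ".github/prompts") -> list[str]:
    # Assumed naming: every prompt file is <name>.prompt.md under prompts_dir.
    return [f"{prompts_dir}/{name}.prompt.md" for name in prompt_names()]


paths = prompt_paths()
```

A list like this could be compared against the chatPromptFiles entries to catch a prompt that was added to the directory but never registered.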
…t Stacks

Align copilot-instructions with the actual workflow templates shipped
by the onboarding skill: use 'az stack sub' instead of 'az deployment
sub' / 'az group delete' for the full plan-deploy-destroy lifecycle.

Why this matters for agents reading the instructions:
- The stack is the single unit of lifecycle — create, update, and
  destroy all operate on it, not on the underlying RGs.
- 'deleteAll' on unmanage cleans up every managed resource across
  every scope (subscription, multiple RGs, sub-scope role/policy
  assignments) in one call. No orphans, idempotent re-runs.
- See #30 for the design rationale.

Sample workflow snippet now also passes --action-on-unmanage deleteAll,
--deny-settings-mode none, --yes — matching what
.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml
generates in target repos.

📘 - Generated by Copilot
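
As an illustration of the lifecycle the commit above describes, the sketch below builds the two Azure CLI invocations as argument lists without executing them. The flags are the ones named in the commit message plus the standard `--name`, `--location`, and `--template-file` options; the stack name, location, and template path are placeholders, not values from the shipped templates.

```python
import shlex


def stack_create_argv(name: str, location: str, template_file: str) -> list[str]:
    """Create or update a subscription-scoped deployment stack."""
    return [
        "az", "stack", "sub", "create",
        "--name", name,
        "--location", location,
        "--template-file", template_file,
        "--action-on-unmanage", "deleteAll",
        "--deny-settings-mode", "none",
        "--yes",
    ]


def stack_delete_argv(name: str) -> list[str]:
    """Delete the stack and every resource it manages."""
    return [
        "az", "stack", "sub", "delete",
        "--name", name,
        "--action-on-unmanage", "deleteAll",
        "--yes",
    ]


# Placeholder values for illustration only.
create_cmd = shlex.join(stack_create_argv("demo-stack", "westeurope", "main.bicep"))
delete_cmd = shlex.join(stack_delete_argv("demo-stack"))
```

Both commands address the stack by name, which reflects the point made above: the stack, not the underlying resource groups, is the unit of lifecycle.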
scripts/generate-docs.js: teach the workflow doc generator about two
source directories, the existing CI workflows under .github/workflows/
and the new user-facing templates under .github/skills/git-ape-
onboarding/templates/workflows/. Templated workflows get a Docusaurus
:::info admonition explaining they're scaffolded by /git-ape-onboarding
and don't run in the git-ape repo itself. Drops .exampleyml handling
since those stubs are gone.
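
The two-directory scan can be pictured with the Python sketch below. It is not the code in scripts/generate-docs.js (which is JavaScript and not shown here): the function names and page layout are hypothetical, and only the admonition title is taken from this PR, with placeholder body wording.

```python
import tempfile
from pathlib import Path

CI_DIR = ".github/workflows"
TEMPLATE_DIR = ".github/skills/git-ape-onboarding/templates/workflows"
# Admonition title as used in this PR; the body sentence is a placeholder.
ADMONITION = (
    ":::info[Scaffolded by /git-ape-onboarding]\n"
    "This workflow is a template and does not run in the git-ape repo itself.\n"
    ":::\n"
)


def collect_workflows(repo_root: Path) -> list[dict]:
    """List workflow files from both source directories, flagging templates."""
    entries = []
    for rel_dir, templated in ((CI_DIR, False), (TEMPLATE_DIR, True)):
        for path in sorted((repo_root / rel_dir).glob("*.yml")):
            entries.append(
                {"name": path.stem, "source": f"{rel_dir}/{path.name}", "templated": templated}
            )
    return entries


def render_page(entry: dict) -> str:
    """Render a stub doc page; templated workflows get the admonition."""
    lines = [f"# {entry['name']}\n", f"Source: `{entry['source']}`\n"]
    if entry["templated"]:
        lines.append(ADMONITION)
    return "\n".join(lines)


# Demo with one CI workflow and one template in a throwaway repo layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel_dir, name in (
        (CI_DIR, "git-ape-onboarding-template-check.yml"),
        (TEMPLATE_DIR, "git-ape-plan.yml"),
    ):
        (root / rel_dir).mkdir(parents=True)
        (root / rel_dir / name).write_text("name: demo\n")

    entries = collect_workflows(root)
    pages = {e["name"]: render_page(e) for e in entries}
```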

README.md: update the Workflows table + repo tree to reflect the new
layout. The four git-ape-{plan,deploy,destroy,verify}.exampleyml stubs
no longer exist in .github/workflows/; their canonical sources are
inside the onboarding skill's templates/ directory and scaffolded into
user repos as ready-to-run .yml files. Mention skip-on-collision so
readers know existing workflows are never overwritten.

website/docs/: regenerate every page that the generator touches:
- workflows/{git-ape-plan,deploy,destroy,verify}.md: relocated to the
  template source path + new admonition
- workflows/git-ape-drift-lock.md, git-ape-onboarding-template-check.md
  (new pages)
- workflows/overview.md: refreshed listing
- agents/git-ape-onboarding.md, skills/git-ape-onboarding.md,
  getting-started/onboarding.md: re-synced from current sources
- reference/{plugin-json,marketplace}.md: re-synced to pick up prompts:
  registration and chatPromptFiles entries

📚 - Generated by Copilot
…source

The auto-generated 'Continuous Drift Remediation' page documents the
compiled '.lock.yml' shape. This adds the missing hand-curated page
documenting the agentic '.md' source — schedule, severity model,
anti-flapping rules, safe-outputs configuration, and how to recompile
after editing.

Ported from the private repo with three small adaptations:
- Workflow-file path updated to the template location under
  .github/skills/git-ape-onboarding/templates/workflows/git-ape-drift.md
  (matches the autogen lock-page convention).
- Added the ':::info[Scaffolded by /git-ape-onboarding]' admonition for
  consistency with the autogen lock page; clarifies the file is shipped
  as a template, not run in the git-ape repo itself.
- Added a Related section linking to the lock-page, the
  azure-drift-detector skill, the deployment guide, and the use-case
  overview so readers can navigate the full drift story.

Marked HAND-CURATED at the top so generate-docs.js maintainers know
not to add a generator branch for '.md' workflow sources.

🌊 - Generated by Copilot
@github-actions
Contributor

github-actions Bot commented May 29, 2026

🤖 Waza agent evals (advisory)

ℹ️ No agents evaluated. The changed agent(s) have no eval directory: git-ape, git-ape-onboarding

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

@github-actions
Contributor

github-actions Bot commented May 29, 2026

🧪 Waza skill evals (advisory)

🔁 Full matrix run. project-wide config change (.waza.yaml, manifest, or workflow file) → full matrix

Ran 12 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-opus-4.7 across all legs.
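
The skills × models expansion and the baseline flag can be sketched as follows. The manifest schema is not shown in this PR, so the dictionaries below are an assumed stand-in for .github/evals/manifest.yaml, and the three skill names are illustrative (chosen so that 3 skills × 4 models gives the 12 legs reported above).

```python
from itertools import product

# Assumed stand-in for the manifest; only the model names and the
# baseline flag on gpt-5.4 come from the report above.
MODELS = [
    {"name": "claude-opus-4.6"},
    {"name": "claude-sonnet-4.6"},
    {"name": "gpt-5.3-codex"},
    {"name": "gpt-5.4", "baseline": True},
]
SKILLS = ["prereq-check", "git-ape-onboarding", "azure-naming-research"]  # illustrative


def matrix_legs(skills: list[str], models: list[dict]) -> list[dict]:
    """One leg per (skill, model); baseline-flagged models run in A/B mode."""
    return [
        {
            "skill": skill,
            "model": model["name"],
            "mode": "baseline" if model.get("baseline") else "standard",
        }
        for skill, model in product(skills, models)
    ]


legs = matrix_legs(SKILLS, MODELS)
baseline_legs = [leg for leg in legs if leg["mode"] == "baseline"]
```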

📊 Token comparison vs main (advisory)
{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-06-05T00:58:59.432119619Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 38247,
    "totalDiff": 38247,
    "percentChange": 100,
    "filesAdded": 15,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 15,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3227,
        "characters": 11926,
        "lines": 344
      },
      "diff": 3227,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1444,
        "characters": 6267,
        "lines": 211
      },
      "diff": 1444,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1559,
        "characters": 6793,
        "lines": 247
      },
      "diff": 1559,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6233,
        "characters": 26754,
        "lines": 642
      },
      "diff": 6233,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2409,
        "characters": 9867,
        "lines": 307
      },
      "diff": 2409,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1490,
        "characters": 6165,
        "lines": 191
      },
      "diff": 1490,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1827,
        "characters": 8416,
        "lines": 199
      },
      "diff": 1827,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1276,
        "characters": 5627,
        "lines": 161
      },
      "diff": 1276,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5322,
        "characters": 21405,
        "lines": 450
      },
      "diff": 5322,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-deploy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1912,
        "characters": 7525,
        "lines": 159
      },
      "diff": 1912,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-destroy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2644,
        "characters": 10670,
        "lines": 180
      },
      "diff": 2644,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3101,
        "characters": 12788,
        "lines": 272
      },
      "diff": 3101,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2138,
        "characters": 8019,
        "lines": 147
      },
      "diff": 2138,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}
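
To pull the over-limit files out of a report like the one above, a short script along these lines would do. It assumes only the field names visible in the JSON (`files`, `file`, `after.tokens`, `limit`, `overLimit`); the sample embeds three of the fifteen entries, and the helper name is hypothetical.

```python
import json

# Three entries copied from the report above, trimmed to the fields used here.
REPORT = json.loads("""
{
  "files": [
    {"file": ".github/skills/azure-cost-estimator/SKILL.md",
     "after": {"tokens": 3227}, "limit": 500, "overLimit": true},
    {"file": ".github/skills/azure-naming-research/SKILL.md",
     "after": {"tokens": 486}, "limit": 500},
    {"file": ".github/skills/prereq-check/SKILL.md",
     "after": {"tokens": 2138}, "limit": 500, "overLimit": true}
  ]
}
""")


def over_limit(report: dict) -> list[tuple[str, int]]:
    """Return (file, tokens over the limit) pairs, largest excess first."""
    rows = [
        (entry["file"], entry["after"]["tokens"] - entry["limit"])
        for entry in report["files"]
        if entry.get("overLimit", False)  # the key is absent when under the limit
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


excess = over_limit(REPORT)
```

Note that the azure-naming-research entry carries no `overLimit` key at all, so the lookup uses `.get` with a default rather than indexing.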

Skill: prereq-check

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 9801:925E3:9130D8:A3335D:6A221F9D)

✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m44.898s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

| Task | Score | Status | Graders |
|------|-------|--------|---------|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | | answer_quality, budget, trigger_relevance_positive |
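
The headline score and spread for this model can be approximately reproduced from the four task scores in the table. The report does not document its aggregation, so treating the score as a plain mean and σ as a population standard deviation is an assumption that happens to fit the numbers; the inputs here are the rounded table values, so the spread only approximates the reported 0.1839.

```python
from statistics import fmean, pstdev

# Per-task scores for claude-opus-4.6, as rounded in the table above.
task_scores = [0.57, 0.60, 0.89, 1.00]

mean_score = fmean(task_scores)  # about 0.765, reported as 0.77
spread = pstdev(task_scores)     # about 0.184, reported as 0.1839
score_range = (min(task_scores), max(task_scores))
```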

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 2/3 (error):

  • answer_quality (0.00): fail: : The assistant's response never delivered any user-facing content. All tool calls returned "unexpected user permission response" errors, and no final message was produced. As a result: (1) the four core tools (az, gh, jq, git) were not named, (2) no install command for az was provided, (3) no version verification commands were given, and (4) no verdict or next step was reached.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
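
The flaky-task line "67% pass rate, score=0.89±0.16" is consistent with a simple model: each run's score is the unweighted mean of its grader scores, and the task score is the mean over three runs. That model, and the assumption that the two runs not listed here scored 1.00 on every grader, are inferences from the numbers rather than documented behavior.

```python
from statistics import fmean, pstdev

# Grader scores for run 2/3 as listed above; runs 1 and 3 are assumed perfect.
runs = [
    {"answer_quality": 1.0, "budget": 1.0, "trigger_relevance_positive": 1.0},
    {"answer_quality": 0.0, "budget": 1.0, "trigger_relevance_positive": 1.0},
    {"answer_quality": 1.0, "budget": 1.0, "trigger_relevance_positive": 1.0},
]

run_scores = [fmean(run.values()) for run in runs]      # [1.0, 0.667, 1.0]
task_score = fmean(run_scores)                          # about 0.89
task_spread = pstdev(run_scores)                        # about 0.16
pass_rate = sum(s == 1.0 for s in run_scores) / len(run_scores)
```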

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F008:280923:9F6984:B234E3:6A221F73)

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F008:280923:A01327:B2F2C5:6A221F9A)

[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: F00A:261D73:979BC2:AA6071:6A221F9F)

✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.71 | Duration: 1m43.552s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.57 - 0.89 (σ=0.1303)

Task Results

| Task | Score | Status | Graders |
|------|-------|--------|---------|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.78 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "command not found" failure: 33% pass rate, score=0.78±0.16
  • Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 1/3 (error):

  • answer_quality (0.00): fail: Assistant never delivered an answer: The assistant attempted to invoke the prereq-check skill but all its tool calls failed with "unexpected user permission response" errors, and it produced no user-facing response. None of the four PASS criteria were met: (1) the core tool list (az, gh, jq, git) was never named; (2) no install command for az was provided; (3) no verification step was suggested; (4) no verdict/next step was given.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Run 2/3 (error):

  • answer_quality (0.00): fail: Assistant never delivered an answer: The assistant's prior response consisted only of tool calls that errored ("unexpected user permission response") and produced no user-facing message. None of the four criteria are met: (1) the required tools az/gh/jq/git were never named, (2) no install command for az was provided, (3) no version verification commands were recommended, (4) no verdict or next step was given.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Positive — "What do I need to install?"

Run 2/3 (error):

  • answer_quality (0.00): fail: Assistant did not produce a user-facing answer: The assistant invoked the prereq-check skill and attempted tool calls, all of which failed with "unexpected user permission response". No user-facing response was produced listing the required tools (az, gh, jq, git), authentication requirements, version info, or install commands. All four PASS criteria are missing.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: FC00:2DFF7F:9CD99E:AFC710:6A221FB3)

✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.71 | Duration: 1m36.415s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.57 - 0.89 (σ=0.1303)

Task Results

| Task | Score | Status | Graders |
|------|-------|--------|---------|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.78 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "command not found" failure: 33% pass rate, score=0.78±0.16
  • Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 2/3 (failed):

  • answer_quality (0.00): fail: : Missing criterion 2: No concrete install command for az on any platform was provided. The response only referenced "Microsoft's Azure CLI install script/docs" without giving an actual command like brew install azure-cli, curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash, or winget install Microsoft.AzureCLI. Criteria 1, 3, and 4 are met.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Run 3/3 (error):

  • answer_quality (0.00): fail: Response failed all four PASS criteria - no tool list, no install commands, no version verification, no verdict: The assistant's response did not address the user's question. It only invoked the prereq-check skill and attempted to run tool checks (which failed with "unexpected user permission response"). It never listed the required tools (az, gh, jq, git), never provided install commands for az, never recommended version verification, and never reached a verdict or next step.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Positive — "What do I need to install?"

Run 2/3 (failed):

  • answer_quality (0.00): fail: : Criteria 1, 2, 3 met. Criterion 4 not met: response provided only verification commands (az version, gh --version, etc.) rather than install commands or a pointer to the prereq-check skill/script that performs the checks for the user.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex

Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 505C:C76C:974C9F:A8BB2E:6A221F7C)

[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 505A:297D79:93A532:A555F5:6A221FAA)

✗ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 25.0% (1/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: -50.0 percentage points

Per-Task Breakdown:
• Negative — Editing an ARM template [REGRESSED] 100% → 67% (-33pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 0% → 67% (+67pp)
• Positive — "What do I need to install?" [REGRESSED] 100% → 67% (-33pp)

Verdict: Skills have NEGATIVE IMPACT (regressed 2/4 tasks)
════════════════════════════════════════════════════════════════
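
The deltas in the impact analysis are percentage-point differences between pass rates. The sketch below recomputes them, reading each per-task line as "without skills → with skills"; the function name and the labelling thresholds are assumptions that match the four lines shown.

```python
def classify(baseline_rate: float, skills_rate: float) -> tuple[str, int]:
    """Return (label, delta in percentage points) for one task."""
    delta_pp = round((skills_rate - baseline_rate) * 100)
    if delta_pp > 0:
        return "IMPROVED", delta_pp
    if delta_pp < 0:
        return "REGRESSED", delta_pp
    return "NEUTRAL", delta_pp


# Overall: 1/4 tasks passed with skills, 3/4 without.
overall_pp = round((1 / 4 - 3 / 4) * 100, 1)

per_task = {
    "Editing an ARM template": classify(1.0, 2 / 3),
    "Azure service concept question": classify(1.0, 1.0),
    "command not found failure": classify(0.0, 2 / 3),
    "What do I need to install?": classify(1.0, 2 / 3),
}
regressed = sum(label == "REGRESSED" for label, _ in per_task.values())
```

Two of the four tasks come out as regressed, which is what drives the verdict line.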

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.74 | Duration: 1m56.347s

  • Tests: 4 total, 1 passed, 3 failed, 0 errors
  • Success Rate: 25.0%
  • Score Range: 0.57 - 0.89 (σ=0.1519)

Task Results

| Task | Score | Status | Graders |
|------|-------|--------|---------|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
  • Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
  • Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16

Failed Task Details

Negative — Editing an ARM template

Run 3/3 (error):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)

Positive — "command not found" failure

Run 1/3 (error):

  • answer_quality (0.00): fail: Assistant never produced a substantive answer: The assistant's response consisted only of failed tool calls ("unexpected user permission response") and a single intro sentence. It did not: (1) name the required tools (az, gh, jq, git), (2) provide any install command for az, (3) recommend version verification commands, or (4) reach a verdict or next step. All four PASS criteria are missing.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Positive — "What do I need to install?"

Run 2/3 (failed):

  • answer_quality (0.00): fail: Missing install commands / verification script reference: Criteria 1, 2, 3 met (lists az/gh/jq/git, mentions az login + gh auth login, gives minimum versions). Criterion 4 missing: response did not include install commands nor point to a verification script/skill (e.g., prereq-check).
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json

🔢 Tokens (count + profile)

📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
   ⚠️  token count 2138 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, steps are logically ordered with a numbered table, and the Quick Reference block gives agents instant orientation. The Always/Never constraint lists eliminate ambiguity about read-only behavior.
completeness       ████░  Tool versions, platform paths, auth checks, and error-handling table are thorough. However, the skill references external files (references/install-commands.md, scripts/check-tools.sh, check-tools.ps1) without fallback content — if those files are absent the agent has no install recipes to display.
trigger_precision  █████  USE FOR enumerates specific shell error strings (e.g., 'az: command not found') and named scenarios, while DO NOT USE FOR draws a hard boundary. The 'When to Use' section reinforces routing with concrete examples, making misrouting very unlikely.
scope_coverage     █████  Boundaries are explicitly stated (read-only, no chaining, no installs), capabilities are enumerated, related skills are named with handoff notes, and the 'Side effects: Read-only' Quick Reference entry makes the scope unambiguous.
anti_patterns      ████░  No vague or conflicting directives; error-handling table covers real-world failures well. Minor issue: the error table recommends 'pwsh -File' to bypass execution policy but also notes Windows PowerShell 5.1 'also works' — this slightly contradicts the 'require pwsh' constraint and could confuse an agent choosing which path to print.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-ready skill document. Structure, trigger precision, and scope boundaries are exemplary. The main actionable gap is the dependency on external reference files (install-commands.md, check-tools scripts) without inline fallback content — embedding a minimal install-command table directly would make the skill self-contained and robust when those files are missing.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2138 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
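
The `[spec-allowed-fields]` finding above comes down to a set difference between the frontmatter keys and the allowed list. A minimal sketch, assuming the frontmatter has already been parsed into a list of keys; the function name `unknown_frontmatter_fields` is hypothetical and is not part of waza:

```python
# Allowed keys, copied from the check output above.
ALLOWED_FIELDS = {
    "name",
    "description",
    "license",
    "allowed-tools",
    "metadata",
    "compatibility",
}


def unknown_frontmatter_fields(fields):
    """Return frontmatter keys outside the allowed set, sorted for stable output."""
    return sorted(set(fields) - ALLOWED_FIELDS)


# Illustrative key list for a skill that trips the check (assumed, not read from disk).
keys = ["name", "description", "argument-hint", "user-invocable"]
print(unknown_frontmatter_fields(keys))
```

If the extra keys are needed by the host tooling, one option worth evaluating is moving them under `metadata`, which the allowed list does include; whether the consumers of `argument-hint` and `user-invocable` would still find them there is not something this output confirms.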

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2138 / 500 tokens
   ❌  Exceeds limit by 1638 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips

Skill: git-ape-onboarding

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✗ [3/4] Positive — Multi-environment onboarding

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.76 | Duration: 59.251s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.56 - 0.98 (σ=0.1965)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.94 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Response skipped the prereq gate and proceeded straight to a full step-by-step guide.: Criteria 1 and 2 are not met; criterion 3 is borderline at best.
  1. ❌ Prereq check results NOT presented. The agent attempted to run the check script and command -v fallback, both of which returned "unexpected user permission response". Instead of resolving this or presenting a status table, the agent abandoned the check entirely with "I can't execute shell commands" and moved on. No tool/version table, no auth inspection.

  2. ❌ Auth/prereq gate NOT explicitly surfaced as blocking. The agent said "No problem — I can still walk you through the exact steps" and proceeded. The prereq-check skill rules explicitly say "Stop at first blocking failure" and produce one of READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING verdicts — none was emitted. The agent merely included a passive az account show snippet under "Prerequisites" without gating on its result.

  3. ⚠️ Borderline. The agent mentions needing staging subscription ID, existing client ID, and repo, but these are embedded inline in Step 1 / Step 4 rather than presented as a numbered/blocked input-gathering request before any steps. It does not ask about RBAC role choice (just hardcodes Contributor + UAA), does not ask reuse-vs-new App Reg as a decision (assumes reuse), does not confirm environment name, does not ask about onboarding mode. The closing "Want me to help you generate the exact commands..." is post-hoc, after the whole guide was emitted.

  4. ✅ Multi-env awareness present: explicitly names azure-deploy-staging, creates a separate fc-azure-deploy-staging federated credential, scopes RBAC to the staging subscription, and sets env-scoped variables/secrets.

Net: 1 of 4 clearly met, 1 borderline, 2 clearly failed. The response was a completion-report-style walkthrough, not the gated step-1 the skill requires.

  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json

Model: claude-sonnet-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.76 | Duration: 53.725s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.56 - 0.98 (σ=0.1857)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq check results and auth gate:

Criterion 1 FAIL: No prereq check results were presented. The agent attempted to run check-tools.sh and version commands but they returned "unexpected user permission response", and the agent never surfaced any tool-version table, list, or equivalent inspection output to the user.

Criterion 2 FAIL: The auth/prereq gate was not explicitly surfaced. The agent did not state that prereqs failed, that Azure/GitHub CLI auth was unverified, or display any ❌ marker — it silently moved past the failed checks and only asked for inputs.

Criterion 3 PASS: Five inputs were requested (repo URL, subscription IDs, onboarding mode, RBAC role, default branch).

Criterion 4 PASS: The agent did not claim to have configured OIDC, federated credentials, environments, RBAC, or scaffolded files.

  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.68 | Duration: 46.286s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.56 - 0.98 (σ=0.1748)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing criterion 1: prereq check results not presented: Criterion 1 NOT met: The agent attempted to run the prereq check script and a manual fallback, but both bash invocations returned "unexpected user permission response" and no tool-version table, Azure auth status, or GitHub auth status was ever rendered. The reply only states "command execution is currently blocked" without any inspected environment data.

Criterion 2 NOT clearly met: Because no prereq results were produced, no explicit ❌ marker on the Azure/GitHub auth rows was surfaced. The agent surfaced a generic execution-blocked message instead of a prereq auth gate.

Criterion 3 MET: The agent requested repo URL, subscription mapping + RBAC role, compliance framework/enforcement, and explicit go-ahead — ≥3 required inputs.

Criterion 4 MET: The agent did not claim to have configured OIDC, federated credentials, environments, RBAC, or scaffolded files. It explicitly waits for inputs and approval.

Overall: FAIL — criterion 1 (and arguably 2) not satisfied.

  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq execution and gating: Criteria 1 and 2 are not met. The assistant did NOT actually run or present any prereq check results — it merely instructed the user to run /prereq-check as a step. No tool/auth status table or inspection of the local environment was shown, and no auth gate was explicitly surfaced. Criterion 3 is partially met: the response lists ORG/REPO, STAGING_SUBSCRIPTION_ID, and App Registration reuse decision as "inputs to set," but these are presented as variables to fill rather than gated questions before proceeding — borderline. Criterion 4 is clearly met (mentions fc-azure-deploy-staging federated credential, new azure-deploy-staging environment, per-env secrets/variables, and reusing existing SP). Overall the response jumped straight to a completion runbook instead of gating on prereqs first.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex

Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [3/4] Positive — Multi-environment onboarding
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
[ERROR] waiting for session.idle: context deadline exceeded

✗ [2/4] Positive — First-time repo setup

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 25.0% (1/4 tasks passed)
Impact: +25.0 percentage points

Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [NEUTRAL] 0% → 0% (+0pp)
• Positive — Multi-environment onboarding [NEUTRAL] 0% → 0% (+0pp)
• Positive — Scaffold honors skip-with-notice on collision [IMPROVED] 0% → 100% (+100pp)

Verdict: Skills have POSITIVE IMPACT (improved 1/4 tasks)
════════════════════════════════════════════════════════════════
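
The numbers in the impact analysis can be reproduced from the per-task pass/fail outcomes of the two passes. A minimal sketch, assuming one run per task and using shortened task names of my own; `pass_rate` and `skill_impact` are illustrative helpers, not waza functions:

```python
def pass_rate(results):
    """Percentage of tasks that passed, given a dict of task name -> bool."""
    return 100.0 * sum(results.values()) / len(results)


def skill_impact(with_skills, without_skills):
    """Return (overall delta in percentage points, list of improved task names)."""
    delta = pass_rate(with_skills) - pass_rate(without_skills)
    improved = [t for t in with_skills if with_skills[t] and not without_skills[t]]
    return delta, improved


# Outcomes copied from PASS 1 and PASS 2 above (task names shortened).
with_skills = {
    "storage-comparison": True,
    "first-time-setup": False,
    "multi-environment": False,
    "scaffold-collision": True,
}
without_skills = {
    "storage-comparison": True,
    "first-time-setup": False,
    "multi-environment": False,
    "scaffold-collision": False,
}

delta, improved = skill_impact(with_skills, without_skills)
print(f"Impact: {delta:+.1f} percentage points, improved: {improved}")
```

Note that the single improved task is the one whose baseline run hit `context deadline exceeded`, so the +25pp figure may partly reflect a timeout rather than a capability gap.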

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.68 | Duration: 1m15.65s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.56 - 0.98 (σ=0.1748)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq check results:

Criterion 1 not met: The agent attempted to run the prereq check script and inspect tool versions/auth, but every tool call returned "unexpected user permission response" and no results were shown. The reply contains no table or list of tool versions, no Azure CLI auth status, and no GitHub CLI auth status — so the user has no evidence the environment was actually inspected.

Criterion 2 is also weak: the response blames generic "tool execution is being denied" rather than explicitly surfacing an auth/prereq gate (e.g. ❌ on az login).

Criteria 3 (asked for repo URL, subscription IDs, role mapping — ≥3 inputs) and 4 (did not claim to have configured OIDC/FIC/RBAC/environments/scaffolding) are satisfied.

  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq check execution/results: Criterion 1 not met: the agent did not present prereq check results (no tool/auth status table or inspection of local environment). It only told the user to "Run /prereq-check" without executing it or showing output. Criterion 2 is consequently also not satisfied — no auth gate was actually surfaced based on observed state. Criteria 3 (inputs requested: repo, staging subscription ID, app reg reuse, RBAC role) and 4 (multi-env awareness: fc-azure-deploy-staging credential, azure-deploy-staging environment, per-env secrets/RBAC scoping) are met.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4

Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
JUnit XML saved to: .waza-results/git-ape-onboarding-gpt-5.4.junit.xml

🔢 Tokens (count + profile)

📊 git-ape-onboarding: 3,101 tokens (detailed ✓), 17 sections, 15 code blocks
   ⚠️  token count 3101 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            ████░  Purpose is immediately obvious, invariants section is exceptionally clear, and numbered playbook steps are well-ordered. Minor inconsistency undermines clarity: the bash scaffold command uses `./scripts/scaffold-repo.sh` while PowerShell uses the fully-qualified `.github/skills/git-ape-onboarding/scripts/scaffold-repo.ps1` — standardize both to the same relative root.
completeness       ████░  Covers prerequisites, multi-env modes, OIDC gotchas, disabled subscriptions, compliance preferences, and verification commands thoroughly. Two gaps: Step 3 mentions 'reuses existing' app registration but provides no lookup command (e.g., `az ad app list --display-name`), and there is no rollback/cleanup guidance for partial failures mid-playbook.
trigger_precision  ███░░  The 'When to Use' section is clear and specific, but there is no 'DO NOT USE FOR' section — leaving ambiguity about whether this skill applies to re-onboarding a partial config, rotating credentials, or adding a new subscription to an already-onboarded repo. Adding explicit negative triggers would prevent misrouting.
scope_coverage     ████░  Capabilities are well-enumerated (identity, OIDC, RBAC, environments, secrets, scaffolding, compliance). The prereq-check dependency is explicitly called out. However, out-of-scope boundaries are never stated — the skill does not say it won't create subscriptions, manage org-level GitHub settings beyond environments, or handle certificate-based credentials, leaving the agent to infer limits.
anti_patterns      ████░  Invariants block common mistakes (master vs main), Safe-Execution Rules prevent secret leakage and silent overwrites, and concrete CLI commands avoid vague instructions. The one notable anti-pattern: the 'Suggested Agent Flow' and 'Command Playbook' are parallel but slightly divergent descriptions of the same sequence, which could cause an agent to follow one and miss details in the other — consolidate into a single authoritative sequence.
────────────────────────────────────────────
Overall: 3.8/5.0

This is a well-crafted, production-quality skill with strong invariant enforcement, concrete CLI examples, and good edge-case coverage (custom OIDC subjects, disabled subscriptions, file collision handling). The primary gaps are the missing DO NOT USE FOR triggers (reducing routing precision), the absent app-registration-reuse lookup commands, the lack of rollback guidance for partial failures, and a minor bash/PowerShell script path inconsistency. Addressing these would bring the skill to a 4.5+ rating.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: git-ape-onboarding

📋 Compliance Score: Medium
   ⚠️  Needs improvement. Missing anti-triggers and routing clarity.

   Issues found:
   ❌  SKILL.md is 3101 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 2/5 valid
   ⚠️  3 link issue(s) found.
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
   ❌  [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist

📊 Token Budget: 3101 / 500 tokens
   ❌  Exceeds limit by 2601 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (3101 tokens, 0 modules)
   ❌  [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (12 constraint keywords found)
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 5 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
2. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
3. Run 'waza dev' for interactive compliance improvement
4. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
5. Fix 3 broken link(s) — targets do not exist
6. Reduce SKILL.md by 2601 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-deploy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✓ [4/5] Positive — Local deploy of an existing deployment artifact

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.81 | Duration: 1m31.024s

  • Tests: 5 total, 3 passed, 2 failed, 0 errors
  • Success Rate: 60.0%
  • Score Range: 0.60 - 0.94 (σ=0.1148)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.94 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact
[ERROR] waiting for session.idle: context deadline exceeded

✗ [3/5] Negative — What-if preview / preflight validation

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.78 | Duration: 1m27.827s

  • Tests: 5 total, 2 passed, 3 failed, 0 errors
  • Success Rate: 40.0%
  • Score Range: 0.60 - 0.86 (σ=0.0946)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
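
One plausible reading of the flaky flag is that a task is reported when its runs disagree, that is, at least one run passed and at least one failed. A minimal sketch under that assumption; the rule and the helper names are mine, and waza's actual criterion may differ:

```python
def run_pass_rate(run_passed):
    """Fraction of runs that passed, given a list of booleans (one per run)."""
    return sum(run_passed) / len(run_passed)


def is_flaky(run_passed):
    """Assumed rule: flaky when some runs pass and some fail."""
    passes = sum(run_passed)
    return 0 < passes < len(run_passed)


# Two runs with one failure matches the "50% pass rate" reported above.
local_deploy_runs = [False, True]
print(is_flaky(local_deploy_runs), f"{run_pass_rate(local_deploy_runs):.0%}")
```

Under this rule the two negative tasks that failed both runs are consistently failing rather than flaky, which agrees with their absence from the list.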

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (error):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Positive — Local deploy of an existing deployment artifact

Run 1/2 (failed):

  • answer_quality (0.00): fail: Missing criterion 4: the response does not mention that state.json (schemaVersion 1.0) will be written to capture the stack ID and managed resources. Criteria 1 (az stack sub create), 2 (--action-on-unmanage deleteAll), and 3 (deploy-stack.sh helper) are met.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
   ⚠️  token count 1912 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious from the title and description. Steps are numbered, well-ordered, and include both bash and PowerShell equivalents. The inline script behavior breakdown (steps 1–7) leaves no ambiguity about what the script does at each stage.
completeness       █████  Covers prerequisites, arguments, procedure, output format, failure modes, state.json schema, soft-deletable resource types, and post-run user messaging requirements. Edge cases like race conditions, stack unavailability, and fallback behavior are explicitly documented.
trigger_precision  ████░  USE FOR and DO NOT USE FOR sections are well-defined with cross-links to the correct alternative skills. Minor gap: no explicit trigger for 'first-time deployment vs. update' distinction, though the re-deploy case is mentioned in prose. Could add a DO NOT USE FOR case covering partial/incremental deployments if that's a real routing risk.
scope_coverage     █████  Scope is tightly defined — subscription-scoped stack create only, with explicit exclusions for destroy, preflight, and template authoring. The fallback path is scoped and labeled with a clear trade-off warning, preventing scope creep into legacy deployment patterns.
anti_patterns      ████░  Avoids vague instructions, conflicting directives, and missing error handling well. The 'What to tell the user after running' section is a nice anti-hallucination guard. Minor: the fallback behavior (step 4) is described in both the procedure and the arguments table (--no-fallback flag), which is slightly redundant but not harmful. The race-condition recovery ('Re-run — the script is idempotent') could be more specific about how long to wait or what to check first.
────────────────────────────────────────────
Overall: 4.6/5.0

This is a high-quality, production-ready SKILL.md. It excels at completeness and clarity, with thorough error handling, schema documentation, and explicit post-run messaging requirements that prevent agent hallucination. Trigger precision is strong but could add one more DO NOT USE FOR case for partial deployments. The only minor anti-pattern is slight redundancy around the fallback mechanism and a vague race-condition recovery step.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-deploy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 1912 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 0/8 valid
   ⚠️  8 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
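
All eight findings above share one shape: a relative link that climbs out of the skill's own folder. A minimal sketch of such a test, assuming links are resolved relative to the directory containing SKILL.md; `escapes_skill_dir` is a hypothetical helper and may not match how waza implements the check:

```python
import posixpath


def escapes_skill_dir(link):
    """True if a link, resolved from the skill directory, points outside it."""
    normalized = posixpath.normpath(link)
    return (
        posixpath.isabs(normalized)
        or normalized == ".."
        or normalized.startswith("../")
    )


# Links taken from the report above, plus one in-directory link for contrast.
for link in [
    "../azure-stack-destroy/SKILL.md",
    "../../../website/docs/deployment/state.md",
    "references/install-commands.md",
]:
    print(link, escapes_skill_dir(link))
```

Normalizing first means a path such as `scripts/../SKILL.md` is treated as staying inside the directory, while `a/../../b` is caught even though it does not start with `..`.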

📊 Token Budget: 1912 / 500 tokens
   ❌  Exceeds limit by 1412 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Description density is optimal for cross-model use
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-destroy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
[ERROR] waiting for session.idle: context deadline exceeded

✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.74 | Duration: 1m48.171s

  • Tests: 5 total, 1 passed, 4 failed, 0 errors
  • Success Rate: 20.0%
  • Score Range: 0.60 - 0.87 (σ=0.1064)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Deploying a new stack (opposite operation) | 0.81 | ✗ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ✗ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (error):

  • answer_quality (0.00): fail: Missing explicit coverage of required criteria: The assistant invoked the azure-stack-destroy skill and delegated execution to a background agent, but the user-facing response itself did not: (1) explicitly recommend the skill/scripts over raw az group delete with rationale about soft-delete/multi-RG; (2) reference the state.json prerequisite at .azure/deployments/deploy-20260524-test/; (3) mention az stack sub delete --action-on-unmanage deleteAll semantics; (4) cover soft-delete purge sweep (Key Vault, Cognitive Services) or purgeProtected retention. The response was essentially "I'm running it, will report back" — none of the four PASS criteria are explicitly communicated.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

  • answer_quality (0.00): fail: : Missing criterion 1: The response recommended the destroy script but did not explicitly contrast it with raw az group delete or explain that raw az group delete misses soft-delete cleanup and multi-RG/subscription-scope resources. Criterion 3 is also weak — the response mentioned "deleteAll" semantics but did not explicitly cite the az stack sub delete --action-on-unmanage deleteAll command.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Positive — Local destroy of a Git-Ape deployment

Run 2/2 (failed):

  • answer_quality (0.00): fail: : Criteria 1, 2, 3 are met (recommends destroy-stack.sh, references state.json prerequisite, mentions az stack sub delete --action-on-unmanage deleteAll). Criterion 4 is partially met but weak: the response says "Purge any soft-deleted Key Vaults so the names are immediately reusable" but does not name az keyvault purge or az keyvault list-deleted, nor does it explicitly explain the non-purge-protected sweep semantics. Borderline — leaning fail for missing the explicit command name.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [4/5] Positive — Clean up the deployment stack
✗ [5/5] Positive — Local destroy of a Git-Ape deployment

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.71 | Duration: 1m29.083s

  • Tests: 5 total, 1 passed, 4 failed, 0 errors
  • Success Rate: 20.0%
  • Score Range: 0.60 - 0.87 (σ=0.1093)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Deploying a new stack (opposite operation) | 0.81 | ✗ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.63 | ✗ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (failed):

  • answer_quality (0.00): fail: : Response recommended the destroy script (criterion 1 partial) but did not explicitly explain why raw az group delete is insufficient (missing soft-delete cleanup / multi-RG). Did not mention the state.json prerequisite under .azure/deployments/deploy-20260524-test/ (criterion 2 missing). Did not mention az stack sub delete --action-on-unmanage deleteAll semantics (criterion 3 missing). Did not describe the Key Vault / Cognitive Services purge sweep or purgeProtected retention (criterion 4 missing).
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

  • answer_quality (0.00): fail: Missing criteria 2, 3, and 4: The assistant recommended the azure-stack-destroy script (criterion 1 partially met, though it did not explicitly contrast with raw az group delete and explain why). However, it did NOT mention: (2) the state.json prerequisite under .azure/deployments/deploy-20260524-test/, (3) the underlying az stack sub delete --action-on-unmanage deleteAll command/semantics, or (4) the soft-delete purge sweep behavior for Key Vault/Cognitive Services or the purgeProtected retention behavior. The response was cut short due to an environment permission error and only provided the script command without explaining what it does.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Positive — Local destroy of a Git-Ape deployment

Run 1/2 (failed):

  • answer_quality (0.00): fail: Missing criteria 2 and 3: The assistant recommended the skill and ran destroy-stack.sh (criterion 1 ✅) and mentioned the soft-deleted Key Vault purge sweep so the name can be reused (criterion 4 ✅). However, the response did not reference state.json under .azure/deployments/deploy-20260506-001/ as the source of truth (criterion 2 ❌), and did not name the actual az stack sub delete --action-on-unmanage deleteAll command or its single-idempotent-call semantics (criterion 3 ❌). Those details lived only in the loaded skill context, not in the assistant's reply.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Run 2/2 (failed):

  • answer_quality (0.00): fail: Response missing 3 of 4 required elements: Criterion 1 met (recommends destroy-stack.sh under .github/skills/azure-stack-destroy/scripts/). Missing: (2) does not reference state.json under .azure/deployments/deploy-20260506-001/ as source of truth; (3) does not name az stack sub delete --action-on-unmanage deleteAll command or its semantics; (4) does not explicitly mention az keyvault purge / az keyvault list-deleted or explain the purge sweep mechanics — only vaguely says "purges eligible soft-deleted Key Vaults".
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
   ⚠️  token count 2644 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, steps are logically ordered, and the fast vs sync mode comparison table is an excellent UX decision. Code examples are concrete and cover bash/PowerShell parity. Minor issue: the 'When to Use' section near the top partially duplicates the 'USE FOR' section, adding noise.
completeness       █████  Exceptional coverage: prerequisites with version constraints, all failure modes with recovery steps, status codes with meanings, purge sweep behavior per resource type, state.json field mapping, and idempotency guarantees. Edge cases like purge-protected vaults and the fallback path when stackId is absent are explicitly handled.
trigger_precision  ████░  USE FOR and DO NOT USE FOR triggers are specific and include exact user-phrasing examples, which is excellent. However, the standalone 'When to Use' section at the bottom duplicates trigger content already in 'USE FOR', creating redundancy that could confuse routing logic — consolidate or remove it.
scope_coverage     █████  Boundaries are exceptionally well-defined: explicit 'no surgical mode' caveat, hard state.json prerequisite, non-Git-Ape exclusions, and clear differentiation from raw az group delete with three concrete reasons. Capabilities and intentional omissions (App Configuration, API Management not auto-purged) are both documented.
anti_patterns      ████░  No vague instructions, no conflicting directives, and error handling is thorough. The one notable anti-pattern is the duplicated 'When to Use' / 'USE FOR' content, which could cause an agent to double-weight those triggers. The bypass flag safety rationale is a good proactive clarification that avoids misuse.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-ready skill definition. It excels at completeness and scope coverage, with thorough error handling, idempotency guarantees, and explicit edge-case documentation. The primary improvement opportunity is consolidating the duplicate 'When to Use' and 'USE FOR' sections to reduce redundancy and tighten routing signal.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-destroy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 2644 tokens (hard limit 500)

📐 Spec Compliance: 7/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
   ❌  [spec-security] Security risks detected: description contains XML angle brackets
     📎  XML angle brackets and reserved prefixes pose injection and naming conflict risks

📎 Links: 0/4 valid
   ⚠️  4 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory

📊 Token Budget: 2644 / 500 tokens
   ❌  Exceeds limit by 2144 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips

Contributor

Copilot AI left a comment


Pull request overview

Overhauls the /git-ape-onboarding flow: replaces the .exampleyml activation hack with a template-driven scaffolder under the skill directory, migrates deploy/destroy from az deployment sub to Azure Deployment Stacks (closes part of #30), registers the onboarding eval suite in the pilot tier, declares prompt files in the VSIX, and regenerates website docs.

Changes:

  • Removes .github/workflows/git-ape-{plan,deploy,destroy,verify}.exampleyml, ships canonical templates under .github/skills/git-ape-onboarding/templates/workflows/, plus scaffold-repo.{sh,ps1} and sync-templates.{sh,ps1} with a new git-ape-onboarding-template-check CI workflow enforcing parity.
  • Rewrites deploy + destroy templates around az stack sub create/delete --action-on-unmanage deleteAll, adds a state-file stackId/managedResources[] schema, a rollback step, a templateanalyzer staging workaround, and switches AZURE_SUBSCRIPTION_ID from secrets to vars.
  • Renames gpt-5-codex to gpt-5.3-codex in tier manifest and bench prompts; registers git-ape-onboarding in the pilot tier with 4 tasks; tightens the agent's identity contract and adds a "required inputs" gate; declares .github/prompts/ in plugin.json and registers 9 chatPromptFiles; trims .vscodeignore.

| File | Description |
| --- | --- |
| .github/workflows/git-ape-onboarding-template-check.yml | New CI parity check (bash+pwsh sync + scaffold byte-diff). |
| .github/workflows/git-ape-deploy.exampleyml (deleted) | Old activation stub; superseded by template. |
| .github/skills/git-ape-onboarding/templates/workflows/git-ape-{plan,deploy,destroy,verify}.yml | Canonical workflow templates; Stacks migration + scan staging + rollback. |
| .github/skills/git-ape-onboarding/templates/README.md + copilot-instructions.md | Maintainer doc + canonical deployment standards. |
| .github/skills/git-ape-onboarding/scripts/{scaffold-repo,sync-templates}.{sh,ps1} | Parity scaffold + mirror scripts. |
| .github/skills/git-ape-onboarding/SKILL.md, .github/agents/git-ape-onboarding.agent.md | Drop acknowledgment phase, add invariants/identity/non-goals/required-inputs gate. |
| .github/copilot-instructions.md | Stacks-based deploy/destroy guidance. |
| .github/evals/git-ape-onboarding/{eval,tasks/*}.yaml, .github/evals/manifest.yaml | New eval suite (3 positive + 1 negative); register skill in pilot, rename codex model. |
| .github/prompts/{agent,skill}-bench.prompt.md | Update default model list. |
| extension/package.template.json, extension/.vscodeignore, plugin.json | Register prompt files in VSIX, drop dev-only .github/* paths from VSIX. |
| scripts/generate-docs.js, README.md, website/docs/** | Regenerated docs for both repo CI and scaffolded user-facing workflows. |

Copilot's findings

Comments suppressed due to low confidence (3)

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:44

  • The check now reads vars.AZURE_SUBSCRIPTION_ID (a repository variable), but the error message and summary still call it a "secret". This is misleading: a user looking at logs will go check repo Secrets, not repo Variables, and may waste time before realising the setup expects a variable. Update the user-facing messages and the missing-config copy to refer to AZURE_SUBSCRIPTION_ID as a variable. Also note git-ape-deploy.yml still writes subscription from vars.AZURE_SUBSCRIPTION_ID while the onboarding skill (Step 7) and copilot-instructions.md (line 405) still document AZURE_SUBSCRIPTION_ID as a secret — the docs and the workflow contract have diverged.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:121

  • The verify workflow checks for git-ape-ttl-reaper.yml, but the scaffold helper (scaffold-repo.sh / scaffold-repo.ps1) does not ship a TTL Reaper template — the MAPPINGS only include plan, deploy, destroy, verify, and drift.{md,lock.yml}. Every onboarded repo will therefore see a perpetual ⚠️ Git-Ape: TTL Reaper (git-ape-ttl-reaper.yml) — not found warning in Verify Setup. Either drop this entry from the workflow list, or add the TTL Reaper template to the scaffolder and the templates/workflows/ directory.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml:151

  • This gate accepts a state file as long as it has either stackId or deploymentId. Every state file ever written by this project has a deploymentId (it's the matrix key), so the check effectively only fails if state.json is corrupt. For a deployment created by the old (pre-Stacks) git-ape-deploy.exampleyml, stackId will be empty but deploymentId will be set — so the check passes, then az stack sub show --name "$STACK_NAME" in the next step returns a non-zero exit, the workflow records exists=false and exits 0 with "Already destroyed (stack not found)". Real Azure resources still exist, the resource group is never deleted, but the destroy run reports success and metadata.json will be flipped to destroyed. To make this Stacks-only and safe, require stackId explicitly (or, if you must accept old state files, fall back to az group delete on state.resourceGroup when stackId is empty).

  • Files reviewed: 47/47 changed files
  • Comments generated: 3
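
The destroy-gate concern in the third suppressed comment can be illustrated with a minimal sketch. It assumes the two values have already been read out of state.json; the function name `destroy_mode` and the three mode labels are hypothetical and do not come from the PR. The point is only that a deploymentId on its own never counts as a valid target.

```shell
# Hypothetical helper, not taken from the PR: pick a destroy path from two
# values assumed to have been extracted from state.json beforehand.
#   $1 = stackId (may be empty for pre-Stacks deployments)
#   $2 = resourceGroup (may be empty)
destroy_mode() {
  if [ -n "$1" ]; then
    echo "stack"        # normal path: delete the deployment stack
  elif [ -n "$2" ]; then
    echo "legacy-rg"    # old state file: explicit resource-group fallback
  else
    echo "invalid"      # nothing to target: fail rather than report success
    return 1
  fi
}
```

With this shape, a pre-Stacks state file takes the explicit fallback path and can no longer fall through to "Already destroyed".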

Comment on lines +607 to +609
STACK_ID='${{ steps.deploy.outputs.stack_id }}'
MANAGED='${{ steps.deploy.outputs.managed_resources }}'
MANAGED=${MANAGED:-[]}
Member Author


Fixed in d6b41dc. managed_resources and stack_id are now passed through env: and read as MANAGED="${MANAGED_RESOURCES:-[]}", then validated with jq empty (falling back to [] on invalid/empty JSON) before being consumed via jq --argjson managedResources. No more single-quoted JSON literal.

Comment on lines +288 to +302
# Also snapshot the previous template from git (parent commit of this merge
# or origin/main for /deploy comment). Used to redeploy last-known-good on failure.
DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}"
mkdir -p /tmp/rollback
if git show HEAD~1:"$DEPLOY_DIR/template.json" > /tmp/rollback/template.json 2>/dev/null; then
  cp "$DEPLOY_DIR/parameters.json" /tmp/rollback/parameters.json 2>/dev/null || true
  # Prefer the previous parameters if they exist at HEAD~1
  git show HEAD~1:"$DEPLOY_DIR/parameters.json" > /tmp/rollback/parameters.json 2>/dev/null || true
  echo "prior_template_available=true" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] Previous template captured from HEAD~1 → /tmp/rollback/"
  echo " template bytes: $(wc -c < /tmp/rollback/template.json)"
else
  echo "prior_template_available=false" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] No previous template in git history (first deployment)"
fi
Member Author


Fixed in d6b41dc. The rollback baseline is now derived per trigger: HEAD~1 only for push; for /deploy comments we git fetch origin main --depth=1 and use origin/main. git show "$BASELINE_REF:$DEPLOY_DIR/template.json" then reads the correct previous known-good template instead of the PR head.
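
As a rough sketch of the per-trigger selection described in this reply: the function name `baseline_ref` is hypothetical, and mapping /deploy comments to the `issue_comment` event is an assumption, since the thread does not say which event name the workflow matches on.

```shell
# Hypothetical helper: map the triggering event to the git ref that holds the
# last known-good template. "issue_comment" for /deploy is an assumption.
baseline_ref() {
  case "$1" in
    push)          echo "HEAD~1" ;;       # previous commit on main after merge
    issue_comment) echo "origin/main" ;;  # PR comment: roll back to main
    *)             echo "unsupported event: $1" >&2; return 1 ;;
  esac
}
```

A step could then set `BASELINE_REF="$(baseline_ref "$GITHUB_EVENT_NAME")"` and pass the result to `git show "$BASELINE_REF:$DEPLOY_DIR/template.json"` as quoted above.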

Member Author


Follow-up: the /deploy PR-comment trigger referenced above has since been removed entirely (unverifiable comment-author authorization). The rollback baseline is now derived solely from the push trigger — HEAD~1 on main after merge — so the origin/main fetch path for /deploy no longer exists.

Comment thread website/docs/workflows/git-ape-plan.md Outdated
Comment on lines +764 to +769
if (validationStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
} else if (whatifStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
Member Author


Fixed in d6b41dc. Removed the unreachable validationStatus === 'passed' && whatifResult branch; what-if rendering is now driven uniformly by whatifStatus === 'passed' && whatifResult.

Contributor

@sendtoshailesh sendtoshailesh left a comment


Thanks for the substantial cleanup here — moving the onboarding scaffolds out of .github/workflows/, adding sync/parity tooling, and wiring prompt/eval registration all make sense. I also like the skip-on-collision behavior in the scaffolders and the explicit docs refresh.

I did find a few blocking issues that should be fixed before merge:

  1. Command injection in manual destroy path (.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml, around lines 55-66)
    inputs.confirm and inputs.deployment_id are interpolated directly into a run: script via ${{ ... }}. Because Actions expands those expressions before bash parses the script, a crafted workflow_dispatch input can inject arbitrary shell. Please pass these values through env: (or another non-shell-interpolated channel) and read them from normal shell variables instead.

  2. Unsafe direct interpolation of github.base_ref into shell (.github/skills/git-ape-onboarding/templates/workflows/git-ape-plan.yml:44)
    github.base_ref is used directly inside the git diff command in a run: block. Per GitHub’s Actions hardening guidance, attacker-controlled context values should not be embedded into shell scripts this way. This should also be routed through env: and quoted normally in bash.

  3. Rollback source is wrong for /deploy runs (.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml, around lines 219-228 and 475-486)
    The comment says the workflow should snapshot the parent commit or origin/main for /deploy comments, but the implementation always reads HEAD~1. On comment-triggered deploys that means rollback can redeploy an earlier PR commit that was never the last known-good state, instead of rolling back to main. That is especially risky on multi-commit PRs. Please branch this logic so /deploy captures from origin/main (or another authoritative deployed baseline) before using it for rollback.
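
To make the first two points concrete, here is a small sketch of why routing a value through an environment variable helps. A `${{ ... }}` expression is substituted into the script text before bash parses it, whereas a variable is expanded after parsing, so its contents can never become new commands. The hostile value and the `count_args` helper are invented for illustration.

```shell
# Helper for the demo: report how many arguments a command would receive.
count_args() { echo "$#"; }

# Hypothetical hostile value, standing in for github.base_ref or a
# workflow_dispatch input delivered through env:.
BASE_REF='main; echo pwned'

# Quoted expansion: the whole value stays one inert argument (prints 1).
count_args "origin/${BASE_REF}...HEAD"

# Unquoted expansion: still no command execution, but the value is
# word-split into three arguments (prints 3), so quoting remains necessary.
count_args origin/${BASE_REF}...HEAD
```

In neither case does `echo pwned` run, which is the property the `env:` route is meant to provide.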

One additional hardening nit: git-ape-verify.yml also embeds secret values directly into shell conditionals (${{ secrets.AZURE_CLIENT_ID }} etc.). I would strongly prefer converting those checks to env booleans/variables as well.

Once the injection issues and rollback baseline are fixed, I’d be happy to re-review.

arnaudlh added 2 commits June 4, 2026 10:19
Address PR review on the git-ape-onboarding workflow templates:

- Route attacker-controllable inputs (github.base_ref, workflow_dispatch
  inputs, JSON step outputs) through env: and read them as quoted shell
  variables to close script-injection vectors (plan, destroy).
- plan: compute the PR diff against origin/$BASE_REF instead of an
  unsanitised interpolation.
- deploy: derive the rollback baseline from HEAD~1 (push) or origin/main
  (/deploy comment); pass stack_id/managed_resources via env and validate
  the managed_resources JSON before jq consumes it.
- destroy: make teardown Deployment-Stacks-only with a guarded legacy
  resource-group fallback; emit explicit legacy/fallback_rg outputs.
- verify: gate required secrets/variable via env booleans; check the
  AZURE_SUBSCRIPTION_ID variable; align the scaffolded WORKFLOWS list with
  the scaffolder (drop ttl-reaper, add verify, use drift.lock.yml).
- plan: remove the unreachable what-if render branch.

Regenerate website workflow docs.

AZURE_SUBSCRIPTION_ID is consumed via vars. in every scaffolded workflow, so
document it as a GitHub repository/environment variable (not a secret).
AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. Fix the OIDC snippet in
both copilot-instructions templates to use vars.AZURE_SUBSCRIPTION_ID.
@github-actions
Contributor

github-actions Bot commented Jun 4, 2026

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

  • website/docs/agents/azure-resource-deployer.md
  • website/docs/agents/azure-template-generator.md
  • website/docs/skills/overview.md
  • website/docs/workflows/daily-repo-status-lock.md
  • website/docs/workflows/issue-triage-agent-lock.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

@arnaudlh
Member Author

arnaudlh commented Jun 4, 2026

Thanks for the thorough review, @sendtoshailesh. All four points are addressed in d6b41dc (workflow templates) and 67005a7 (docs). Summary:

1. Command injection in manual destroy path (git-ape-destroy.yml)
workflow_dispatch inputs are no longer interpolated into the script. The "Find destroy-requested deployments" step now exposes EVENT_NAME / INPUT_CONFIRM / INPUT_DEPLOYMENT_ID via env: and builds the id array with jq -n -c --arg id "$INPUT_DEPLOYMENT_ID" '[$id]', so nothing attacker-controllable reaches the shell unquoted.

2. Unsafe interpolation of github.base_ref (git-ape-plan.yml)
The "Find deployment directories with changes" step now sets env: BASE_REF: ${{ github.base_ref }} and computes the diff with git diff --name-only "origin/${BASE_REF}...HEAD", quoting the value normally.

3. Rollback baseline wrong for /deploy runs (git-ape-deploy.yml)
The "Capture pre-deploy state" step now branches on the trigger: HEAD~1 for push, and for /deploy comments it does git fetch origin main --depth=1 and uses BASELINE_REF="origin/main", then git show "$BASELINE_REF:$DEPLOY_DIR/template.json". Multi-commit PRs no longer roll back to an arbitrary earlier PR commit.

Hardening nit — git-ape-verify.yml secret conditionals
Converted to env: booleans: HAS_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID != '' }} (and tenant), with HAS_SUBSCRIPTION_ID: ${{ vars.AZURE_SUBSCRIPTION_ID != '' }}; the checks now test [[ "$HAS_*" != "true" ]]. The "Verify Azure access" step reads AZURE_CLIENT_ID from env: too.
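
A minimal sketch of the env-boolean pattern, assuming each `HAS_*` variable arrives as the string "true" or "false". `HAS_TENANT_ID` is an assumed name for the tenant flag mentioned above, and the function `missing_config` is hypothetical.

```shell
# Hypothetical check: list required settings whose HAS_* flag is not "true".
# No secret value is ever read here, only the boolean flags.
missing_config() {
  missing=""
  [ "${HAS_CLIENT_ID:-}" = "true" ]       || missing="$missing AZURE_CLIENT_ID"
  [ "${HAS_TENANT_ID:-}" = "true" ]       || missing="$missing AZURE_TENANT_ID"
  [ "${HAS_SUBSCRIPTION_ID:-}" = "true" ] || missing="$missing AZURE_SUBSCRIPTION_ID"
  missing="${missing# }"   # drop the leading separator, if any
  echo "$missing"
  [ -z "$missing" ]        # succeed only when nothing is missing
}
```

Because the script only ever sees "true" or "false", a secret's contents cannot influence how the shell parses the conditional.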

While in here I also addressed the Copilot review threads (managed_resources JSON via env: + jq validation, the rollback HEAD~1 overlap, and an unreachable what-if render branch), made git-ape-destroy Deployment-Stacks-only with a guarded legacy resource-group fallback, aligned the verify scaffold list with the scaffolder (dropped ttl-reaper, added verify, switched to drift.lock.yml), and reconciled AZURE_SUBSCRIPTION_ID as a GitHub variable (it is consumed via vars. in every scaffolded workflow) across the docs and OIDC snippets.

Ready for re-review.

@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 02:52
Contributor

@sendtoshailesh sendtoshailesh left a comment


Follow-up review:

Previously raised issues:

  • ✅ Fixed: git-ape-destroy.yml no longer interpolates inputs.* directly into shell; the workflow_dispatch inputs are routed via env and JSON-encoded with jq before use.
  • ✅ Fixed: git-ape-plan.yml no longer inlines ${{ github.base_ref }} in the shell; it is passed through env.BASE_REF first.
  • ✅ Fixed: git-ape-deploy.yml now uses origin/main for /deploy rollback baselines instead of always assuming HEAD~1; the push path still uses HEAD~1, which is the previous main commit after merge.
  • ✅ Fixed: git-ape-verify.yml moved the secret checks to env booleans instead of embedding ${{ secrets.* }} directly in shell conditionals.

New issues found:

  • ❌ Blocking: matrix.deployment_id is still derived from attacker-controlled deployment directory names and interpolated directly into run: blocks / JS string literals in the plan, deploy, and destroy templates. That reintroduces shell / script injection via paths under .azure/deployments/*/.
  • ⚠️ Non-blocking: the /deploy comment path checks approval state, but it still does not verify that the commenter is an authorized collaborator/member before triggering deployment.
  • ⚠️ Non-blocking: both deploy and destroy still swallow git push failures after updating state.json / metadata.json, which can leave Azure state changed without the repo state being persisted.

Overall verdict: the original blockers are resolved, but the new matrix.deployment_id injection path is still a release-blocking security issue, so this PR is not merge-ready yet.

arnaudlh added 3 commits June 4, 2026 19:10
…t_id injection

matrix.deployment_id is derived from attacker-controllable .azure/deployments/*/ directory names and was interpolated directly into run: bash blocks and github-script JS string literals across the plan, deploy, and destroy workflow templates.

Route it through job-level env (DEPLOYMENT_ID) so run blocks reference $DEPLOYMENT_ID and github-script reads process.env.DEPLOYMENT_ID, and reject any directory name outside ^[A-Za-z0-9._-]+$ at the detect step (defense in depth, also makes derived deploy_dir provably safe).
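
A minimal sketch of the allow-list check described in this commit message. `is_safe_deployment_id` is a hypothetical helper name; the template's detect step may structure the check differently.

```shell
#!/usr/bin/env bash
# Sketch only: is_safe_deployment_id is a hypothetical helper name.
is_safe_deployment_id() {
  local id="$1"
  # Allow-list from the commit message: letters, digits, dot, underscore, hyphen.
  # An empty string fails because + requires at least one character.
  [[ "$id" =~ ^[A-Za-z0-9._-]+$ ]]
}

# A detect step could skip anything that fails the check, for example:
#   is_safe_deployment_id "$name" || { echo "Skipping unsafe name" >&2; continue; }
```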
…e push

The /deploy comment trigger cannot reliably verify the commenter's
authorization, so deployment is now gated solely on merge to main (which
already requires PR review + approval via branch protection). Removes the
issue_comment trigger, the check-comment-trigger job, and all PR-head-ref
checkout paths. Also fails loud (exit 1) instead of swallowing git push
failures when committing deployment/teardown state back to main.
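
A sketch of the fail-loud behaviour, under the assumption that the state commit is pushed with a plain `git push`. `push_state` is a hypothetical wrapper; the command to run is passed in so the sketch can be exercised without a remote.

```shell
#!/usr/bin/env bash
# Sketch only: push_state is a hypothetical wrapper, not the template's code.
push_state() {
  # Run the given push command; surface a failure instead of swallowing it.
  if ! "$@"; then
    echo "::error::Failed to push state; the repo no longer reflects what was changed in Azure"
    return 1
  fi
}

# In a workflow step (the remote and branch here are assumptions):
#   push_state git push origin HEAD:main || exit 1
```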
@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 11:20
Contributor

@sendtoshailesh sendtoshailesh left a comment

Follow-up review:

Previously raised issues:

  • ✅ Fixed: matrix.deployment_id is now validated against ^[A-Za-z0-9._-]+$ before entering the matrix and routed through env.DEPLOYMENT_ID / process.env.DEPLOYMENT_ID in the plan, deploy, and destroy templates, so the earlier matrix.deployment_id shell/JS injection path is closed.
  • ✅ Fixed: the /deploy comment path is gone entirely from git-ape-deploy.yml, so there is no longer an unauthenticated comment-triggered deployment path to authorize.
  • ✅ Fixed: deploy and destroy now fail the workflow if the post-state git push fails instead of silently swallowing that error.

New issues found:

  • ❌ Blocking: untrusted values read from parameters.json are still interpolated directly into run: scripts via ${{ ... }} in the workflow templates, which reintroduces the same GitHub Actions expression-to-shell injection class under a different input. Examples: git-ape-plan.yml uses ${{ steps.params.outputs.location }} in shell at lines 157, 414, and 455; git-ape-deploy.yml uses ${{ steps.params.outputs.location }}, ${{ steps.params.outputs.project }}, and ${{ steps.params.outputs.environment }} in shell/JQ argument positions at lines 175, 178, 241, 244-245, 258, 420, and 498-500. These values come from attacker-controlled PR content (parameters.json) and need the same treatment as deployment_id: validate if needed, pass through env:, and reference normal shell variables instead of inlining ${{ ... }} into script source.
  • ⚠️ Non-blocking: git-ape-plan.yml still tells reviewers to comment /deploy (Plan Comment step, line 738), but that trigger has been intentionally removed. The PR guidance should be updated to avoid instructing users to use a nonexistent path.

Overall verdict:
The previously raised issues are resolved, but the new ${{ steps.params.outputs.* }} injection path is still a release-blocking security issue, so this PR is not merge-ready yet.

…ection

Untrusted location/project/environment values read from parameters.json
were interpolated directly into run: script bodies via ${{ steps.params.outputs.* }},
the same expression-to-shell injection class already fixed for deployment_id.
Route them through step-level env: blocks and reference $LOCATION/$PROJECT/$ENVIRONMENT
shell variables instead. Also drop the stale /deploy reviewer instruction in
git-ape-plan.yml (that trigger was removed). Regenerated workflow docs.
@arnaudlh
Member Author

arnaudlh commented Jun 4, 2026

@sendtoshailesh Thanks for the thorough re-review. Fixed the remaining injection in dce5833e.

Blocking item — ${{ steps.params.outputs.* }} in run: bodies: location, project, and environment are read from parameters.json (attacker-controllable) and were being interpolated directly into shell/jq script bodies — the same expression-to-shell injection class as the earlier deployment_id finding. They are now routed through step-level env: blocks and referenced as "$LOCATION" / "$PROJECT" / "$ENVIRONMENT" shell variables in both git-ape-deploy.yml (validate, deploy, rollback, and save-state steps) and git-ape-plan.yml (cost, validate, what-if steps). No untrusted value is inlined into a run: body anymore. (deploy_dir is left as-is since it is derived solely from the already-validated $DEPLOYMENT_ID and contains no shell metacharacters.)
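
To make the difference concrete, here is a small simulation of the two paths, not code from either template. `render_inline` imitates what `${{ ... }}` substitution does (the value is pasted into the script source before bash parses it), and `render_env` imitates `env:` routing (the script source is fixed and the value is data). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and exist to simulate the two paths.

# Imitates ${{ ... }} interpolation: the value becomes part of the script text.
render_inline() {
  local value="$1"
  bash -c "echo \"location=${value}\""
}

# Imitates env: routing: the script text is constant, the value is an env variable.
render_env() {
  LOCATION="$1" bash -c 'echo "location=$LOCATION"'
}

# Example payload that a hostile parameters.json could carry:
#   payload='eastus"; echo INJECTED; #'
#   render_inline "$payload"   # runs the injected echo as a second command
#   render_env "$payload"      # prints the payload verbatim as one string
```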

Non-blocking item — stale /deploy instruction: removed the 3. Or comment /deploy line from the Plan Comment step in git-ape-plan.yml; merge-to-deploy is the only trigger now.

actionlint (with embedded shellcheck) reports no injection findings on either template — only pre-existing SC2129 redirect-style suggestions unrelated to this change. Workflow docs regenerated from the templates.

@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 11:30
arnaudlh added 3 commits June 4, 2026 19:38
…rhaul

# Conflicts:
#	.github/agents/git-ape.agent.md
#	.github/copilot-instructions.md
#	.github/evals/manifest.yaml
#	.github/workflows/git-ape-deploy.exampleyml
#	.github/workflows/git-ape-destroy.exampleyml
#	website/docs/agents/git-ape.md
#	website/docs/workflows/git-ape-deploy.md
#	website/docs/workflows/git-ape-destroy.md
Merge resolution updated the .github/copilot-instructions.md mirror to the
stack-based deployment flow (dropping the /deploy trigger). Propagate the
same content to the canonical templates/copilot-instructions.md so the
onboarding template-check (bash + pwsh) passes.
Regenerated from sources updated by the upstream/main merge (azure-resource-deployer
and azure-template-generator agents now delegate to skills; lock workflow metadata).
Contributor

@sendtoshailesh sendtoshailesh left a comment

Round 4 follow-up review:

Previously raised issues:

  • ✅ Fixed: untrusted parameters.json values (location, project, environment) are now routed through env: before use in shell steps instead of being interpolated directly into run: blocks.
  • ✅ Fixed: the stale /deploy reference was removed from the plan comment path.

Conflict resolution assessment:

  • ✅ Merge resolution looks clean overall. I did not find conflict markers or accidental duplicate sections in the changed templates/workflows, the key workflow YAML files parse successfully, and the onboarding template sync check passes.

New issues found:

  • ❌ Blocking: website/docs/getting-started/onboarding.md still tells users to configure AZURE_SUBSCRIPTION_ID as a GitHub secret (gh secret set at lines 364-366, 383-391), but the scaffolded workflows and verify flow now read it from vars.AZURE_SUBSCRIPTION_ID as a variable. A user following the updated onboarding docs will end up with a broken setup: verify/deploy read from vars, but the docs populate secrets. Given this PR is specifically overhauling onboarding/scaffolding, that documentation contract needs to be consistent before merge.
  • ⚠️ Non-blocking: git-ape-verify.yml and its generated docs still say Merge or comment /deploy to deploy, and the summary still says secret(s) missing even though one of the required values is now a variable. That guidance is stale/misleading, though the actual deploy trigger removal in plan/deploy is correct.

Overall verdict:
The round-3 blockers are fixed and the merge conflict resolution looks solid, but the onboarding docs still misconfigure AZURE_SUBSCRIPTION_ID, so I don’t think this is merge-ready yet. Once the docs/template guidance are aligned with the new variable-based contract, I’d be happy to re-review.

Round 4 review (sendtoshailesh):

- Blocking: onboarding docs configured AZURE_SUBSCRIPTION_ID via 'gh secret set',
  but the scaffolded plan/deploy/destroy/verify workflows read it from
  vars.AZURE_SUBSCRIPTION_ID. Switch the single- and multi-environment setup
  steps to 'gh variable set' so the documented contract matches the workflows.
  AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets.
- Non-blocking: git-ape-verify.yml summary said 'secret(s) missing' (one value
  is now a variable) and 'Merge or comment /deploy to deploy' (the /deploy
  trigger was removed). Reworded to 'required value(s) missing' and
  'Merge to main to deploy'; renamed the check step accordingly.

Regenerated git-ape-verify.md from the updated template.
@arnaudlh
Member Author

arnaudlh commented Jun 5, 2026

@sendtoshailesh Thanks for the round 4 review. Both points addressed in de50e714.

Blocking — AZURE_SUBSCRIPTION_ID documented as a secret: website/docs/getting-started/onboarding.md now sets it via gh variable set (single- and multi-environment paths, including the azure-destroy environment), matching the scaffolded plan/deploy/destroy/verify workflows that read vars.AZURE_SUBSCRIPTION_ID. AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. The "Set secrets" heading and intro now state explicitly which value is a secret vs. a variable.
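
For reference, a sketch of the resulting split between secrets and the variable. It prints the `gh` commands rather than running them; `print_setup_commands` is a hypothetical helper and the `<...>` placeholders are not real values. The exact wording in onboarding.md may differ.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands instead of executing them.
# print_setup_commands is a hypothetical helper; replace the <...> placeholders.
print_setup_commands() {
  # Secrets: consumed by the workflows via secrets.*
  echo 'gh secret set AZURE_CLIENT_ID --body "<client-id>"'
  echo 'gh secret set AZURE_TENANT_ID --body "<tenant-id>"'
  # Variable: consumed by the workflows via vars.AZURE_SUBSCRIPTION_ID
  echo 'gh variable set AZURE_SUBSCRIPTION_ID --body "<subscription-id>"'
}
```

For the multi-environment path, both `gh secret set` and `gh variable set` accept `--env <environment-name>` to scope the value to a GitHub environment such as azure-destroy.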

Non-blocking — stale git-ape-verify.yml guidance: the summary line now reads required value(s) missing instead of secret(s) missing, the next-steps line reads Merge to main to deploy (the /deploy trigger is gone), and the check step is renamed to Check required secrets and variables. Regenerated git-ape-verify.md from the template.

The template ↔ mirror sync check passes locally for both bash and pwsh.

3 participants