From 8cca0885c0b673af1cb52349f9c5cec4c5846c7b Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Mon, 18 May 2026 15:12:56 +0200 Subject: [PATCH 1/6] Add evaluation results summary, trigger evaluation data, and new scripts for PR review - Created `results-summary.md` to document evaluation results for the pr-review skill, detailing iterations, fixes, and overall trajectory. - Introduced `trigger-eval.json` containing queries for evaluating trigger conditions, including both should-trigger and should-not-trigger scenarios. - Added `output-template.md` for standardized PR review output examples, outlining formatting rules and section classifications. - Developed `security-antipatterns.md` as a reference for identifying security issues during PR reviews. - Implemented `classify_sections.py` to classify changed files into relevant PR review sections based on defined rules. - Created `fetch_pr.sh` script to fetch PR details and changed files using the GitHub CLI, facilitating integration with the classification script. --- .github/CODEOWNERS | 6 + .github/copilot-instructions.md | 22 + CONTRIBUTING.md | 225 +++++++++ README.md | 206 ++++++++- docs/README.md | 29 ++ docs/SKILL_TESTING.md | 115 +++++ docs/getting-started.md | 115 +++++ skills/pr-review/SKILL.md | 261 +++++++++++ skills/pr-review/evals/evals.json | 202 ++++++++ skills/pr-review/evals/files/api-rename.diff | 35 ++ .../pr-review/evals/files/ci-gate-bypass.diff | 54 +++ .../evals/files/db-migration-risks.diff | 26 ++ .../evals/files/dependency-bump-risk.diff | 16 + .../evals/files/docs-release-notes.diff | 22 + .../files/elevated-risk-auth-refactor.diff | 59 +++ .../evals/files/iac-wildcard-iam.diff | 43 ++ .../evals/files/large-pr-and-vague-desc.diff | 117 +++++ .../evals/files/multi-section-risks.diff | 70 +++ .../evals/files/standard-clean-pr.diff | 65 +++ skills/pr-review/evals/fixture-map.md | 31 ++ skills/pr-review/evals/results-summary.md | 430 ++++++++++++++++++ skills/pr-review/evals/trigger-eval.json | 128 ++++++ .../pr-review/references/output-template.md | 90 ++++ .../references/security-antipatterns.md | 67 +++ skills/pr-review/scripts/classify_sections.py | 132 ++++++ skills/pr-review/scripts/fetch_pr.sh | 37 ++ 26 files changed, 2602 insertions(+), 1 deletion(-) create mode 100644 .github/CODEOWNERS create mode 100644 .github/copilot-instructions.md create mode 100644 CONTRIBUTING.md create mode 100644 docs/README.md create mode 100644 docs/SKILL_TESTING.md create mode 100644 docs/getting-started.md create mode 100644 skills/pr-review/SKILL.md create mode 100644 skills/pr-review/evals/evals.json create mode 100644 skills/pr-review/evals/files/api-rename.diff create mode 100644 skills/pr-review/evals/files/ci-gate-bypass.diff create mode 100644 skills/pr-review/evals/files/db-migration-risks.diff create mode 100644 skills/pr-review/evals/files/dependency-bump-risk.diff create mode 100644 skills/pr-review/evals/files/docs-release-notes.diff create mode 100644 skills/pr-review/evals/files/elevated-risk-auth-refactor.diff create mode 100644 skills/pr-review/evals/files/iac-wildcard-iam.diff create mode 100644 skills/pr-review/evals/files/large-pr-and-vague-desc.diff create mode 100644 skills/pr-review/evals/files/multi-section-risks.diff create mode 100644 skills/pr-review/evals/files/standard-clean-pr.diff create mode 100644 skills/pr-review/evals/fixture-map.md create mode 100644 skills/pr-review/evals/results-summary.md create mode 100644 skills/pr-review/evals/trigger-eval.json create mode 100644 skills/pr-review/references/output-template.md create mode 100644 skills/pr-review/references/security-antipatterns.md create mode 100644 skills/pr-review/scripts/classify_sections.py create mode 100755 skills/pr-review/scripts/fetch_pr.sh diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS new file mode 100644 index 0000000..2dfc5bb --- /dev/null +++ b/.github/CODEOWNERS @@ -0,0 +1,6 @@ +# CODEOWNERS +# This file defines the code owners for this repository +# Code owners are automatically requested for review when someone opens a pull request + +# Default owners for everything in the repo +* @miroslavpojer @lsulak @oto-macenauer-absa diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..ddeaa9b --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,22 @@ +# Copilot Instructions + +## PR Review + +### Core review — covers standard, elevated risk, API contracts, deps, CI/CD, and IaC + +When reviewing a pull request, load and apply: +https://github.com/AbsaOSS/agentic-toolkit/blob/master/skills/pr-review/SKILL.md + +Apply only the sections relevant to the files touched by the PR. +Output findings grouped as: Blocker / Important / Nit. + +### Skill-specific checks — apply when any `SKILL.md` is modified + +For every modified `SKILL.md`, also verify: +- `name` is kebab-case, matches the directory name, and is ≤ 64 chars +- `description` covers both "what it does" AND "when to trigger" with explicit trigger keywords +- `description` is ≤ 1024 chars and not padded with filler +- SKILL.md body is < 500 lines, or uses progressive disclosure via `references/` +- No hardcoded credentials, secrets, or absolute internal paths in skill body or scripts +- Any bundled script in `scripts/` is referenced from SKILL.md with clear usage guidance +- The new or modified skill's description does not conflict with or shadow existing skills diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..1013dea --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,225 @@ +# Contributing to Agentic Toolkit + +This guide covers everything you need to author a high-quality skill for this repository — from file layout and frontmatter rules through description writing, body guidelines, and testing. + +--- + +## Table of Contents + +1. [Skill structure](#1-skill-structure) +2. [Frontmatter schema](#2-frontmatter-schema) +3. [Writing the description](#3-writing-the-description) +4. [Writing the skill body](#4-writing-the-skill-body) +5. [Testing your skill](#5-testing-your-skill) +6. [Proposing a skill via a pull request](#6-proposing-a-skill-via-a-pull-request) + +--- + +## 1. Skill structure + +Each skill lives in its own folder under `skills/`: + +``` +skills/ +└── skill-name/ + ├── SKILL.md # Required — frontmatter + instructions + ├── scripts/ # Optional — executable scripts the agent can run + ├── references/ # Optional — supporting docs loaded on demand + ├── assets/ # Optional — templates, icons, example files + └── evals/ # Optional — test prompts and assertions +``` + +> **Rule:** the folder name must exactly match the `name` field in the `SKILL.md` frontmatter. + +### When to add each optional directory + +| Directory | Use for | +|---|---| +| `scripts/` | Deterministic or repetitive logic better run as code than described in prose (e.g. a validation script, a formatter, a data transformer) | +| `references/` | Domain docs, API specs, decision tables, or anything too large to keep in `SKILL.md` without exceeding 500 lines | +| `assets/` | Template files, example inputs/outputs, icons — anything the skill produces or consumes | + +--- + +## 2. Frontmatter schema + +Every `SKILL.md` must open with a YAML frontmatter block: + +```yaml +--- +name: skill-name +description: > + What this skill does and the specific situations in which it should be + activated. Include trigger phrases, domains, and keywords. +license: Proprietary # optional +compatibility: GitHub Copilot # optional — only when env requirements exist +--- +``` + +### Field reference + +| Field | Required | Constraints | +|---|---|---| +| `name` | ✅ | Lowercase letters, numbers, and hyphens only. Max 64 chars. Must not start or end with a hyphen. No consecutive hyphens (`--`). Must match the parent directory name. | +| `description` | ✅ | Max 1024 chars. Non-empty. Must describe both **what** the skill does and **when** to activate it. See [§3](#3-writing-the-description). | +| `license` | ➖ | Short SPDX name or reference to a bundled `LICENSE.txt`. | +| `compatibility` | ➖ | Max 500 chars. Only include if the skill has specific environment requirements (tools, Python version, network access, etc.). Most skills do not need this field. | + +#### Valid `name` examples + +```yaml +name: pr-review # ✅ +name: create-issue # ✅ +name: data-pipeline # ✅ +``` + +#### Invalid `name` examples + +```yaml +name: PR-Review # ❌ uppercase +name: -pr-review # ❌ starts with hyphen +name: pr--review # ❌ consecutive hyphens +name: pr_review # ❌ underscores not allowed +``` + +--- + +## 3. Writing the description + +The `description` field is the **primary triggering mechanism**. The agent never reads your skill body until it decides the description matches the current task. A weak description means the skill never fires, no matter how good the body is. + +### What to include + +1. **What it does** — a concise statement of the skill's output or capability +2. **When to use it** — explicit trigger phrases, domains, user intents +3. **Keywords** — include both formal terms and casual phrasings a real user might type + +### Be slightly "pushy" + +Claude tends to under-trigger skills. Lean toward explicit activation language: + +```yaml +# Too vague — will often not trigger +description: Helps with pull request reviews. + +# Better — explicit about when to activate +description: > + Pull request code review. Activate when asked to review a PR, check a diff, + or give feedback on code changes. Covers standard risk, elevated risk, API + contracts, dependency bumps, CI/CD changes, and infrastructure changes. + Applies the relevant sections based on what files the PR touches. + Produces concise comments grouped by severity: Blocker / Important / Nit. +``` + +### Length guidelines + +- **Aim for 150–400 characters** for most skills +- Do not pad to the 1024-char limit — filler dilutes signal +- Do not put "when to use" information only in the body; it belongs in `description` + +### Good vs. poor examples + +| | Example | +|---|---| +| ✅ Good | `"Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction."` | +| ❌ Poor | `"Helps with PDFs."` | +| ✅ Good | `"Creates a GitHub issue from a natural language prompt. Triggers on requests like 'create an issue for X', 'open a bug report about Y', 'file a feature request for Z', 'add a ticket for W'."` | +| ❌ Poor | `"Opens GitHub issues."` | + +--- + +## 4. Writing the skill body + +### Size and progressive disclosure + +- **Target under 500 lines** for `SKILL.md`. If you are approaching this limit, move supporting detail into `references/` files and add clear pointers in `SKILL.md` telling the agent when and how to load them. +- For large reference files (> 300 lines), include a table of contents at the top. +- When a skill supports multiple distinct domains or frameworks, create a `references/` file per domain and let the skill body select which one to load based on context. + +``` +cloud-deploy/ +├── SKILL.md # workflow + selection logic +└── references/ + ├── aws.md + ├── gcp.md + └── azure.md +``` + +### Add only what the agent lacks + +Focus on what the agent *would not* know without the skill: project-specific conventions, non-obvious edge cases, the particular APIs or tools to use, and team standards. Do not explain what a PDF is, how HTTP works, or what a migration does — the agent already knows. + +```markdown + +PDF (Portable Document Format) files are common documents that contain text +and images. To extract text you need a library. pdfplumber is recommended. + + +Use pdfplumber for text extraction. For scanned documents, fall back to +pdf2image + pytesseract. +``` + +### Explain the why, not just the what + +Prefer explaining *why* over issuing directives. Today's models respond better to reasoning than to rigid commands. + +```markdown + +ALWAYS use the imperative form. NEVER use passive voice. + + +Use imperative form in instructions (e.g. "Run the linter" not "The linter +should be run") — it is clearer and easier for the agent to follow. +``` + +### Bundle reusable scripts + +If every test run of your skill independently writes the same helper script (a formatter, a validator, a transformer), bundle it in `scripts/` and reference it from `SKILL.md`. This saves every future invocation from reinventing the wheel. + +### Format conventions + +- Use `##` and `###` headings to structure the body +- Use numbered lists for sequential steps, bullet lists for non-ordered items +- Include short worked examples where they add clarity +- Keep code blocks minimal — a representative snippet beats an exhaustive reference + +### Effective body patterns + +| Pattern | When to use | +|---|---| +| **Gotchas** | Environment-specific facts the agent will get wrong without being told. Keep in `SKILL.md` itself — the agent reads it before encountering the situation. | +| **Output template** | When you need a specific output format. A concrete template is more reliable than describing the format in prose. | +| **Checklist** | Multi-step workflows where skipping a step causes downstream failures: `- [ ] Step 1: Run scripts/validate.py`. | +| **Validation loop** | Any task where the agent should self-check before finishing: do → run validator → fix errors → repeat until clean. | +| **Plan-validate-execute** | Batch or destructive operations: generate a plan file → validate it against a source of truth → execute. | + +--- + +## 5. Testing your skill + +Use the **GitHub Copilot CLI** and the `skill-creator` skill to test, compare, and optimize your skill. The full testing methodology — eval creation, fixture management, iterative improvement, trigger testing, and description optimization — is documented in [docs/SKILL_TESTING.md](./docs/SKILL_TESTING.md). + +--- + +## 6. Proposing a skill via a pull request + +1. **Open an issue first** using the [Skill Proposal template](https://github.com/AbsaOSS/agentic-toolkit/issues/new/choose) to discuss scope before writing code +2. Create your skill folder under `skills/` following the structure in [§1](#1-skill-structure) +3. Run the tests described in [§5](#5-testing-your-skill) and include benchmark results in the PR description +4. Open a pull request; CODEOWNERS will be automatically requested for review +5. Optionally, add `Copilot` as a reviewer to get automated skill quality feedback + +### PR checklist + +Before opening a pull request, verify: + +- [ ] Folder name matches the `name` frontmatter field exactly +- [ ] `name` is kebab-case, ≤ 64 chars, no consecutive hyphens +- [ ] `description` covers both *what it does* and *when to trigger*, with explicit keywords +- [ ] `description` is ≤ 1024 chars and not padded with filler +- [ ] `SKILL.md` body is < 500 lines, or uses progressive disclosure via `references/` +- [ ] No hardcoded credentials, secrets, or internal paths in skill body or scripts +- [ ] Any script in `scripts/` is referenced from `SKILL.md` with usage guidance +- [ ] New skill's description does not conflict with or shadow existing skills +- [ ] Evals exist (or a note explains why they are not applicable) +- [ ] `skills-ref validate ./skills/my-skill` passes (install: `pip install skills-ref`) diff --git a/README.md b/README.md index 60e4693..03fd33a 100644 --- a/README.md +++ b/README.md @@ -1 +1,205 @@ -# agentic-toolkit \ No newline at end of file +# Agentic Toolkit + +Curated, reusable AI skills — instructional context, domain conventions, and callable agent tools for any AI-assisted engineering. + +--- + +## What Is A Skill? + +**Skills** are folders of instructions, scripts, and resources that an AI agent can load on demand to improve its performance on specialised tasks. + +The Agent Skills format is an open standard maintained by Anthropic, adopted by GitHub Copilot and other AI systems. See the [full specification](https://agentskills.io) and the [spec repository](https://github.com/agentskills/agentskills). + +### How Skills Work + +Skills manage an LLM context efficiently — the agent is not given all skill content upfront: + +1. **Discovery** — At startup, the agent loads only the `name` and `description` of each available skill +2. **Activation** — When a task matches a skill's description, the agent reads the full `SKILL.md` into context +3. **Execution** — The agent follows the instructions, optionally loading referenced files or executing bundled scripts + +> ⚠️ **The `description` field is your most important content.** It is the sole signal the agent uses to decide whether to activate a skill. A vague description means the skill will never fire. See [CONTRIBUTING.md](./CONTRIBUTING.md) for authoring guidance. +> +> For detailed skill testing and evaluation methodology, see [docs/SKILL_TESTING.md](docs/SKILL_TESTING.md). + +### Skill Structure + +Each skill lives in its own folder, which is a top-level directory in this repository: + +``` +skill-name/ +├── SKILL.md # Required — metadata frontmatter + instructions +├── scripts/ # Optional — executable code the agent can run +├── references/ # Optional — supporting documentation and specs +├── assets/ # Optional — templates, examples, resources +└── evals/ # Optional — test prompts and assertions +``` + +The `SKILL.md` file follows a standard frontmatter schema: + +```yaml +--- +name: skill-name # kebab-case, max 64 chars +description: > # max 1024 chars — what it does AND when to use it + What this skill does and the specific situations in which it should be activated. + Be explicit. Include keywords that describe the task, domain, and trigger conditions. +license: Proprietary # optional +compatibility: GitHub Copilot # optional — intended tools or environment requirements +--- + +# Skill Title + +## When To Use This Skill +... + +## Instructions +... +``` + +Required fields: `name`, `description`. Everything else is optional. + +--- + +## Skill Scopes + +GitHub Copilot, the primary AI productivity assistant, resolves skills from two locations: + +| Scope | Location | When To Use | +|---|---|---| +| **Personal Skills** | `~/.copilot/skills/` | Skills available across all your projects | +| **Project Skills** | `.github/skills/` | Skills specific to a single repository | + +Other Agent Skills-compatible tools typically use: + +| Scope | Location | +|---|---| +| **Personal** | `~/.agents/skills/` | +| **Project** | `.agents/skills/` | + +--- + +## How Skills Relate To Your Own + +The skills in this repo are the **base layer** — conventions and capabilities applicable across most teams and projects in the sub-department. You load them into your personal space and extend with your own layers: + +``` +Agentic Skills ← shared base (this repo) + └── Personal Skills ← your individual skills (~/.copilot/skills/) + └── Project Skills ← repository-specific (.github/skills/) +``` + +The base layer is the **foundation every engineer might share**. Personal and project layers are **extensions** — not replacements. If something belongs in the base, propose it here rather than duplicating it downstream. + +--- + +## Quickstart + +This repo uses Copilot as the primary AI-assisted engineering tool. There are many ways how to use it, but here we leverage the [GitHub Copilot CLI](https://github.com/github/copilot-cli). Skills are managed through slash commands inside a Copilot CLI session. + +### 1. Install GitHub Copilot CLI + +```bash +# macOS / Linux +curl -fsSL https://gh.io/copilot-install | bash + +# macOS (Homebrew) +brew install copilot-cli + +# Any platform (npm) +npm install -g @github/copilot +``` + +Then launch a session from your project directory: + +```bash +copilot +``` + +### 2. Add Skills To Your Personal Space + +Inside a Copilot CLI session, add skills from this repo one by one: + +``` +/skills add ~/.copilot/skills/pr-review +/skills add ~/.copilot/skills/create-issue +/skills add ~/.copilot/skills/kudos +``` + +Or clone this repo and add all skills at once from your terminal before launching: + +```bash +git clone https://github.com/AbsaOSS/agentic-toolkit.git +cp -r agentic-toolkit/skills/*/ ~/.copilot/skills/ +``` + +Skills in `~/.copilot/skills/` are loaded automatically at every session start — no further action needed. + +### 3. Verify Skills Are Loaded + +Inside a session, inspect the full loaded environment: + +``` +/env +``` + +This shows all loaded skills, MCP servers, agents, and instructions. Confirm your skills appear in the list. + +Alternatively, see all the available skills: + +``` +/skills list +``` + +### 4. Add Project-Specific Skills (Optional) + +For skills that only apply to a specific repository, place them in `.github/skills/` within that repo. These are loaded automatically when Copilot CLI is launched from that directory, layered on top of your personal as well as these project-specific skills. + +``` +your-project-repo/ +└── .github/ + └── skills/ + └── your-project-skill/ + └── SKILL.md +``` + +--- + +## Using Review Skills With the Native Copilot Reviewer + +The native Copilot reviewer is the **Add Copilot as reviewer** button on a pull request on GitHub.com. It reads `.github/copilot-instructions.md` in your repository. + +This repo provides a set of review-related skills. `pr-review` is the first and most broadly applicable — a single unified skill covering standard review, elevated risk, API contracts, dependency bumps, CI/CD changes, and infrastructure. Additional review skills can be added to the instructions file in the same way. + +To apply review standards when Copilot is added as a reviewer on GitHub.com, add the following to your project's `.github/copilot-instructions.md`: + +```markdown +## PR Review + +### Core review skill — covers standard, elevated risk, API contracts, deps, CI/CD, and IaC +When reviewing a pull request, load and apply: +https://github.com/AbsaOSS/agentic-toolkit/blob/master/skills/pr-review/SKILL.md + +Apply only the sections relevant to the files touched by the PR. +Output findings grouped as: Blocker / Important / Nit. +``` + +--- + +## Finding More Skills + +Before building a new skill, check whether one already exists: + +| Source | What's available | +|---|---| +| [github/awesome-copilot](https://github.com/github/awesome-copilot/tree/main/skills) | 200+ community skills: cloud, languages, security, DevOps, productivity | +| [skills.sh](https://skills.sh) | Open registry — install any GitHub-hosted skill with `npx skills add ` | +| [anthropics/skills](https://github.com/anthropics/skills) | Anthropic reference skills including `skill-creator` | +| [AbsaOSS/agentic-toolkit](https://github.com/AbsaOSS/agentic-toolkit) | Broader skill collection | + +--- + +## Contributing + +See [CONTRIBUTING.md](./CONTRIBUTING.md) for the full authoring guide. + +To propose a new skill or a change to an existing one, open an issue using the **Skill Proposal** template. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..7c65b7d --- /dev/null +++ b/docs/README.md @@ -0,0 +1,29 @@ +# Agentic Toolkit — Documentation + +Navigation hub for all guides in this repository. Browse by category below. + +## User Guides + +For engineers installing and using Agentic Skills day-to-day. + +| Guide | Description | +|-------|-------------| +| [Getting Started](./getting-started.md) | Install the Skills CLI, add Agentic Skills to your global or project space, and start your first session | + +## Creator Guides + +For engineers writing, testing, and contributing new skills. + +| Guide | Description | +|-------|-------------| +| [Contributing](../CONTRIBUTING.md) | Skill folder layout, frontmatter schema, description writing, body guidelines, testing, and the full PR contribution process | +| [Skill Testing](./SKILL_TESTING.md) | How to test, evaluate, and optimize skills using evals, fixtures, and the skill-creator tool | + +## Reference + +| Guide | Description | +|-------|-------------| +| [Skill Catalog](../skills/) | Browse all available Agentic Skills — each skill folder contains a `SKILL.md` with its purpose, trigger phrases, and full instructions | + +> **Keep this index up to date.** When you add a new guide under `docs/`, add a row to the relevant category above. + diff --git a/docs/SKILL_TESTING.md b/docs/SKILL_TESTING.md new file mode 100644 index 0000000..49aee05 --- /dev/null +++ b/docs/SKILL_TESTING.md @@ -0,0 +1,115 @@ +# Skill Testing Guide + +This document provides a comprehensive methodology for testing, evaluating, and optimizing skills in this repository. It covers eval creation, fixture management, iterative improvement, trigger testing, and description optimization. + +--- + +## 1. Recommended workflow + +1. Create eval cases in `skills//evals/evals.json` +2. Add fixture files under `skills//evals/fixtures/` when prompts depend on local files +3. Start a Copilot CLI session from the repository root +4. Ask Copilot to use the `skill-creator` skill to test the target skill +5. Review outputs and diffs +6. Improve the skill description or body +7. Re-run targeted evals, then the full suite +8. Repeat until regressions are stable and output quality is consistently better + +## 2. Eval Cases + +Store evals in `skills//evals/evals.json`: + +```json +{ + "skill_name": "my-skill", + "evals": [ + { + "id": "happy-path-1", + "prompt": "A realistic prompt a real user would type, with concrete detail", + "expected_output": "Short description of what a successful result should do", + "files": [] + } + ] +} +``` + +Write prompts that look like real user requests, not abstract test descriptions. Include a mix of happy-path, regression, output-format, edge, negative, and paraphrased cases. + +## 3. Fixtures + +If the skill operates on files, place sample inputs in `skills//evals/files/` and reference those files from the eval entries. Use small, representative fixtures. + +## 4. With/Without Skill Comparisons + +Ask for a side-by-side comparison in the Copilot CLI session: + +``` +Use the skill-creator skill to compare outputs for skills/my-skill with and without the skill enabled. +``` + +Compare correctness, completeness, structure, latency, verbosity, and formatting stability. + +## 5. Reviewing Outputs and Diffs + +Review changed outputs and fixture diffs. Classify each change as improvement, acceptable variation, regression, or unclear. Avoid making fixtures so strict that harmless wording improvements fail the test. + +## 6. Iterative Improvement + +When an eval fails, update the smallest possible part of the skill and re-run the affected cases first. Avoid large rewrites unless the skill is fundamentally mis-scoped. + +## 7. Regression-First Loop + +1. Run the full eval suite and save the baseline +2. Pick the largest failure cluster +3. Make one small edit to the skill +4. Re-run only affected evals and regressions +5. Review output diffs +6. Run the full suite again +7. Keep or revert the change +8. Repeat until stable + +## 8. Body vs. Trigger Testing + +- **Body testing:** Use `evals/evals.json` to verify that once the skill is active, it behaves correctly. +- **Trigger testing:** Create `skills//evals/trigger-eval.json` to test whether the skill activates for the right queries. + +## 9. Description Optimization + +Ask Copilot to optimize the description against your trigger eval set: + +``` +Use the skill-creator skill to optimize the description for skills/my-skill using skills/my-skill/evals/trigger-eval.json. +``` + +## 10. What “good enough” looks like + +- All smoke tests and known regressions passing +- No critical format failures +- Positive or neutral delta versus baseline +- Stable behavior across paraphrases +- Strong trigger accuracy on both train and validation queries + +## 11. Practical Tips + +- Use realistic prompts with concrete nouns, filenames, and intent +- Prefer several focused evals over one large vague eval +- Keep fixtures small and scenario-specific +- Update fixtures only after confirming the new behavior is actually better +- Do not optimize solely for exact string matches +- If a change improves one eval but harms unrelated ones, revert it + +## 12. Minimal CLI Loop + +``` +gh copilot +→ "Use the skill-creator skill to test my skill at skills/my-skill" +→ inspect results and diffs +→ edit SKILL.md or fixtures +→ "Use the skill-creator skill to rerun the evals for skills/my-skill" +→ optimize description if needed +→ repeat until stable +``` + +--- + +For more on evals, see [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) and the `skill-creator` skill documentation. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..70cf323 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,115 @@ +# How to Start Using Skills? + +This guide walks you through installing Agentic Skills and using them inside AI-assisted engineering tools. + +## 1. Install the Vercel Skills CLI + +Skills are distributed and installed using the [Vercel Skills CLI](https://github.com/vercel-labs/skills) — a lightweight tool available via `npx`, no separate installation required. + +```bash +# Verify the CLI is accessible +npx skills --version +``` + +## 2. Install Skills From This Repository + +### List available skills + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit --list +``` + +### Install all Agentic Skills (global — available in every project) + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g +``` + +### Install all Agentic Skills (project-scoped — installs into `.github/skills/` of the current repo) + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit +``` + +> To share project-scoped skills with your team, commit the generated `.github/skills/` directory to your repository. + +### Install a specific skill only + +```bash +# Example: install only the pr-review skill, globally +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g --skill pr-review +``` + +### Update installed skills + +```bash +npx skills update # interactive — prompts for scope +npx skills update -g # update global skills only +npx skills update -y # non-interactive, auto-detects scope +``` + +> **Installation method:** When running interactively, you'll be prompted to choose between **Symlink** (recommended — single source, easy updates) or **Copy** (use when symlinks aren't supported). + +## 3. How Skills Work + +Skills manage LLM context efficiently. The agent does **not** receive all skill content at startup — it loads content progressively: + +| Phase | What Happens | +|---|---| +| **Discovery** | At startup, the agent loads only the `name` and `description` of each installed skill | +| **Activation** | When your task matches a skill's description, the agent reads the full `SKILL.md` into context | +| **Execution** | The agent follows the skill's step-by-step instructions, optionally reading reference files or running bundled scripts | + +### What your prompt needs to contain + +Because the `description` field is the agent's sole activation signal, your message must contain keywords or context that match it. For example: + +- To trigger **pr-review**: mention "review", "pull request", or "PR". + +> **Tip:** If a skill is not activating, rephrase your message to include more explicit keywords from the skill's description. You can inspect descriptions with `/skills list` inside a session. + +## 4. GitHub Copilot CLI Guide + +GitHub Copilot CLI is the repo-expected AI-assisted engineering tool. Skills are resolved from two locations: + +| Scope | Location | When To Use | +|---|---|---| +| **Global** | `~/.copilot/skills/` | Available across all your projects, on every session start | +| **Project** | `.github/skills/` | Specific to one repository; commit the directory to share with your team | + +When skills are installed with `-g`, the CLI places them in `~/.copilot/skills/`. Without `-g`, they go into `.github/skills/` of the current project. + +### Inspecting loaded skills + +Inside an active Copilot CLI session: + +``` +# See all available skills and their descriptions +/skills list + +# See the full loaded environment (skills, agents, MCP servers, instructions) +/env +``` + +### Adding and removing skills mid-session + +``` +# Add a skill from a path into the current session +/skills add ~/.copilot/skills/pr-review + +# Remove a skill from the current session +/skills remove pr-review +``` + +> **Note:** Changes made with `/skills add` or `/skills remove` inside a session are **temporary** — they only affect the running session. Use `npx skills add` from your terminal to make permanent changes. + +### Stacking order + +Copilot CLI resolves skills from two real locations in this order: + +``` +Global Skills ← ~/.copilot/skills/ (loaded first) + └── Project Skills ← .github/skills/ (override global) +``` + +Project skills take precedence over global skills. Agentic Skills installed with `-g` are simply global skills — there is no separate "Agentic base" layer. Use project skills for repository-specific overrides that your team commits together. diff --git a/skills/pr-review/SKILL.md b/skills/pr-review/SKILL.md new file mode 100644 index 0000000..3cf60bc --- /dev/null +++ b/skills/pr-review/SKILL.md @@ -0,0 +1,261 @@ +--- +name: pr-review +description: > + Pull request code review. Use this skill whenever the user asks to review a PR, check a diff, + give feedback on code changes, or do a sanity check on changes — even if they just say "LGTM?" + or "does this look right?". Also activate when the user shares a GitHub PR link and asks for + thoughts, or when reviewing changes before merging. Covers standard risk, elevated risk, + API contracts, dependency bumps, CI/CD changes, and infrastructure changes. Applies the + relevant sections based on what files the PR touches. Produces concise comments grouped + by severity: Blocker / Important / Nit. + Triggers on: "review my PR", "check this diff for issues/risks", "any issues with these + changes", "feedback on my code", "can you review", "look at this PR", "sanity check", + "LGTM?", "does this look right?". + Does NOT trigger for purely generative tasks on a diff (e.g. "generate release notes from + this diff", "summarise this diff") — those are documentation tasks, not code review. +--- + +# PR Review + +Single unified PR review skill. Read the full skill and apply only the sections relevant to what the PR touches. + +## Before you start with the review + +1. List the files changed in the PR (use `gh pr view --json files` or read the diff). + Helper: `scripts/fetch_pr.sh ` fetches the diff + file list via `gh`. + Helper: `scripts/classify_sections.py /tmp/pr_files.txt` prints which sections to apply. +2. Use the file list to determine which conditional sections below apply. +3. Read the PR description and any linked issue for intent — use this to judge whether changes + are in scope, and to avoid flagging intentional decisions as issues. +4. If the PR description is absent or too vague to understand intent, note it as a Nit. +5. Check PR size upfront. If the diff exceeds ~500 changed lines or ~20 files, state at the + top of the review that coverage may be partial, and prioritise the highest-risk files. + Suggest the author split the PR when scope warrants it — large PRs are harder to review + accurately and context window limits can silently reduce coverage. +6. For changes in a domain you are unfamiliar with, or for any elevated-risk PR, check + `CODEOWNERS` or use `git blame` on the most-changed files to identify the relevant domain + expert. Mention their handle in the review summary so the author knows to request their + review or approval — do not block the review on their availability. + +**Conditional sections** — use the file list to determine which apply: + +| Section | Apply when PR touches | +|---|---| +| API contracts | endpoints, DTOs, config keys, env vars, exit codes, output strings | +| Dependency bumps | pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt | +| CI/CD | .github/workflows/, Jenkinsfile, Justfile, Makefile (pipeline), deployment jobs | +| Infrastructure | *.tf, *.tfvars, CloudFormation, Helm, Dockerfiles, docker-compose.yml, iac/, infra/ | +| DB migrations | alembic/, flyway/, liquibase/, migrations/, *.sql, db/ | + +> A file may match multiple sections (e.g. a DB migration inside an IaC repo). Apply all matching sections — do not skip a section because another one already fired. + +**Elevated risk** is not a conditional section like the ones above — it is a binary overlay. +Determine independently whether this PR qualifies: +- Touches auth config, security controls, or permission logic +- Touches infrastructure, secrets management, or external integrations +- Wide refactor affecting many callers (≥ 10 distinct call sites impacted) + +Notes on common edge cases that do **not** qualify: +- A pure DTO rename or API field rename — covered by the **API contracts section**. +- A DB migration — covered by the **DB migrations section**, which already handles data loss, + locking, and rollback risks. Only apply elevated risk on top if the migration *also* touches + auth tables, involves a coordinated multi-service rollout, or is part of a wide schema + restructuring affecting many services. +- Bumping a security/auth library (e.g. `python-jose`, `bcrypt`) — changing a version number in + `requirements.txt` is not "touching auth controls". Elevated risk requires the PR to *modify + code* that implements auth or security logic. Use the dependency bumps section for CVE/compat + checks on security libraries. + +If yes, apply the **Elevated risk checks** section _on top of_ all other sections that fired. + +## Format every comment consistently + +Every comment must include: +- **What** — one line +- **Why** — impact or risk +- **How to fix** — minimal actionable suggestion; prefer linking to existing repo patterns + +Group all comments under: `Blocker` | `Important` | `Nit` + +**Severity definitions:** +- **Blocker** — must be fixed before merge: correctness bug, security vulnerability, broken contract, + quality gate bypassed, credentials in code, data loss risk +- **Important** — should be fixed soon but not a hard blocker: increases risk, missing test for + changed logic, non-default that's hard to change later, code that will definitely cause problems +- **Nit** — minor polish: style, naming, optional improvement, non-urgent edge case + +Rules: +- Short bullet points; reference file + line range where possible +- Keep comments short and targeted — a long audit report buries the real issues and + discourages authors from engaging with the feedback. +- Avoid requesting refactors unrelated to the PR — scope creep in reviews erodes trust + and makes PRs harder to merge without accumulating unrelated tech debt. +- If uncertain, ask a targeted question instead of assuming +- Do not flag style issues handled by a formatter or linter — those will be caught + automatically and flagging them manually wastes the author's time. +- If a section has no findings, write `_None._` — shows you checked, not skipped. + +**Before writing your first comment, read `references/output-template.md`.** It shows exactly how clean, multi-section, and elevated-risk reviews should look. Matching its format prevents the most common output problems (wrong heading style, missing `_None._` markers, verbose findings that bury real issues). +Start each review with a header in the form: +`## Applied sections: Standard · [additional sections separated by ·]` +followed by the files changed. This makes the review scope immediately visible to the author. + +**The three severity groupings must always be formatted as bold text on their own line, not as markdown headings:** +``` +**Blocker** +**Important** +**Nit** +``` +Never use `### Blocker`, `## Important`, or any heading syntax for these groupings. The output-template.md examples show the correct format. + +**Do not list "Trade-off analysis" in the Applied sections header.** Trade-off analysis is an internal elevated-risk check, not a named section. The header should only list: `Standard`, `API contracts`, `Dependency bumps`, `CI/CD`, `Infrastructure`, `DB migrations`, and `Elevated risk`. + +## Apply standard review (always) + +Priority order: correctness → security → tests → maintainability → style + +**Correctness** +- Logic bugs, missing edge cases, regressions +- Contract changes: output strings, exit codes, env vars, API paths +- Swallowed exceptions (`except: pass`, empty `catch {}`) hide bugs — at minimum log with context +- Returning `null`/`None` where an exception or empty collection is the right contract pushes error handling onto every caller +- Error messages must include enough context to diagnose: what failed, with what input, and why + +**Security** +- Unsafe input handling, secrets/tokens in code or logs +- Auth/authz issues, insecure defaults +- For elevated-risk or security-touching PRs, read `references/security-antipatterns.md` before reviewing — it lists patterns to actively scan for (hardcoded creds, injection, broken auth, weak crypto, PII in logs) + +**Tests** +- Changed logic has tests covering success + failure paths +- No real external API/network calls in unit tests +- Tests must assert the specific behaviour changed — a test that passes without meaningful assertions is a false-positive shield +- One concept per test; a test asserting five unrelated things gives no signal about which invariant broke +- Tests must be deterministic: no `sleep()`, no real clock, no dependency on execution order +- Flag missing edge-case coverage for conditions identified in correctness checks above + +**Maintainability** +- Unnecessary complexity, duplication, unclear naming +- No unrelated drive-by refactors +- Functions doing more than one thing at mixed abstraction levels — flag when a single function interleaves I/O, business logic, and formatting +- Deeply nested conditionals (>3 levels) — suggest early returns, guard clauses, or an extracted method +- Boolean parameters that silently change behaviour (e.g. `process(data, True, False)`) — suggest named constants, enums, or separate functions +- Magic numbers or strings in business logic — flag when intent is unclear without a named constant + +**Performance** +- O(n²) or worse complexity in hot paths; flag when input is unbounded +- N+1 query patterns (individual queries inside a loop instead of a batch/join) — multiplies latency by row count +- New queries on large tables without index support cause full scans — flag when no supporting index or `EXPLAIN` commentary is provided +- Missing pagination on endpoints or functions returning collections +- Unbounded memory allocation (appending to list in a loop without a size cap) + +**Style** +- Only flag if readability is reduced or repo conventions are broken + +## Apply elevated risk checks (overlay — applies on top of all other sections that fired) + +Additional checks on top of standard: +- Confirm prior review comments were addressed +- Re-check: auth, permissions, secrets, persistence, external calls, concurrency +- Hidden side effects: backward compat, rollout path, failure modes, retries/timeouts, idempotency +- Safe defaults: least privilege, secure logging, safe error messages, predictable behaviour on missing inputs +- If leaving a risk as-is: state what the risk is, why acceptable, and the mitigation (tests/monitoring/flag) + +## Apply trade-off analysis (elevated-risk PRs only) + +Elevated-risk PRs often make architectural decisions that are hard to reverse. Ask: + +- **One-way or two-way door?** Schema migrations, public APIs, and durable data formats are one-way doors — they need explicit justification. Internal refactors behind a stable interface or feature-flagged changes are two-way doors and need less scrutiny. +- **Is this the established approach?** Research how the problem is typically solved (industry patterns, well-known libraries). If a simpler or more standard alternative achieves the same outcome, surface it concretely — don't just ask "were alternatives considered?", show one. +- **Does it hold at 10× scale?** Forces thinking about the next order of magnitude — unbounded loops, missing pagination, single-node bottlenecks, fan-out explosions. +- **Can this be deployed independently?** If the change requires coordinated rollout with another service or repo, a deployment plan must be documented in the PR. + +## Check API contracts (when PR touches: endpoints, DTOs, config keys, env vars, exit codes, output strings, action inputs/outputs) + +**REST & synchronous contracts** +- Renaming or removing an endpoint path breaks all clients immediately — add a deprecation alias first and remove only after confirming no active consumers. +- Adding a required field to a response breaks clients that do not expect it — make new fields optional or version the response schema. +- Must not rename fields or change types without a migration path. +- Renaming config keys or env vars without an alias silently breaks deployments — provide the old name as a backward-compatible alias with a deprecation warning. +- Existing failure exit codes must not change. +- Externally-visible strings must not change silently. +- Flag where a server-side change requires a coordinated client update. +- When standardizing field naming (e.g. migrating to camelCase), check migration is complete across **all** fields — a partially-migrated response is confusing and hard to document. Flag as Important. + +**Event & message contracts** +- Schema changes must be backward-compatible (existing consumers can still deserialize new messages) *and* forward-compatible (old producers don't break new consumers) — brokers cannot version-route, so all consumers must handle all in-flight message versions. +- Switching encoding (JSON ↔ Avro ↔ Protobuf) or field encoding (string date → epoch) requires a coordinated producer/consumer rollout — flag as Blocker without a migration plan. +- If using a schema registry, confirm the compatibility mode (BACKWARD, FORWARD, FULL) permits the change. + +**Idempotency & delivery guarantees** +- Any write triggered by an event, message, or webhook must be idempotent — messages can be delivered more than once. Flag upserts without a deduplication key or conditional write. +- Processing a message without acknowledgment means a crash before ack causes redelivery and duplicate processing — flag when ack is not deferred until after processing succeeds. + +**Cache contracts** +- Cache writes without a corresponding invalidation strategy create stale-data bugs that are hard to reproduce — document or flag the invalidation path. +- New queues or caches without TTL, retention, or archival policy grow indefinitely — flag as Important. + +**Query & storage performance** +- `SELECT *` or queries without `LIMIT` on user-facing paths — one large tenant can OOM the service. + +Backward-compatibility decision guide: +- Additive (new optional field) → usually safe, document it +- Rename → breaking; add alias + deprecation notice +- Remove → breaking; deprecation cycle first +- Type change → breaking; version the API +- Default value change → assess all consumers +- Exit code change → must not change without a major version bump + +## Check dependency bumps (when PR touches: pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt, Cargo.toml, Gemfile, composer.json, pubspec.yaml) + +- Bump must have a stated reason: CVE / feature need / compatibility fix +- Platform-lock compatibility: Spark/EMR/Glue Hadoop (JVM), `engines` field (Node), target framework (.NET), runtime version (Python) +- Flag transitive upgrades that may silently change behaviour +- Must not introduce test-skipping profiles or flags +- Formatter, linter, and type-checker must still pass after the bump +- Must not add suppression of existing warnings to enable the bump + +## Check CI/CD changes (when PR touches: .github/workflows/, Jenkinsfile, Justfile, Makefile used as pipeline entrypoint, deployment job configs) + +- Do not skip or comment out quality gates — bypassing checks defeats the purpose of CI + and can allow broken or insecure code to reach production undetected. +- Secrets must be referenced by secret name only — hardcoding a secret value, even + temporarily, risks it being captured in logs or git history. +- Branch filters, path filters, and event triggers must be intentional. +- Widening a deploy trigger (e.g. from a specific branch to all pushes) without explicit + approval gates is a **Blocker** — it can push untested changes to production on every + commit, including feature branches that have never been reviewed for production readiness. +- Deployment jobs must require explicit approval — an auto-deploy trigger on the wrong + branch filter can push untested changes to production without human review. +- Existing job/recipe names and behaviours must be preserved + +## Check infrastructure as code (when PR touches: *.tf, *.tfvars, terragrunt.hcl, *.yaml, CloudFormation, Helm charts, Dockerfiles, docker-compose.yml, iac/, infra/) + +- Flag force-new replacements and deletions — Terraform may silently destroy and recreate + stateful resources (databases, queues) during a plan; authors must confirm this is intended. +- Do not hardcode regions or account IDs — environment-specific values baked into config + make cross-environment deployments break silently. +- Wildcard `*` IAM actions or resources violate least-privilege and can be exploited if + any resource in the scope is compromised — require explicit justification. +- Must not store secrets in plain text; use secret manager references +- Resources must be idempotent; re-applying must produce no unintended changes +- Do not use `latest` image tags in production — the image pulled at deploy time may differ + from what was tested, introducing untested changes silently. +- Base image changes must not break the expected runtime layout + +## Check DB migrations (when PR touches: alembic/, flyway/, liquibase/, migrations/, *.sql, db/) + +- Migrations must be irreversible-safe: a failed deploy mid-migration must leave the database + in a consistent state. Prefer additive changes (add column, add table) over destructive ones. +- Column/table removal should happen in a separate PR after the code no longer references it — + dropping a column while old code is still running causes immediate errors. +- Adding a replacement column and dropping its source in the same migration without a backfill + step destroys the existing data permanently — this is a **Blocker**. The safe sequence is: + (1) add the new column, (2) backfill values from the old column in a separate step, + (3) drop the original in a later migration once the backfill is confirmed complete. +- Must not lock tables for extended periods on large tables; use online schema change tools + (pt-online-schema-change, gh-ost, pglogical, `ALGORITHM=INSTANT`) where supported. +- Default values added to existing columns must be valid for all existing rows. +- Migration filenames/versions must be monotonically increasing — gaps or reordering breaks + migration runners on environments that haven't applied intermediate steps. +- Must include a rollback/down script or document that rollback is not safe and why. diff --git a/skills/pr-review/evals/evals.json b/skills/pr-review/evals/evals.json new file mode 100644 index 0000000..aec8d1e --- /dev/null +++ b/skills/pr-review/evals/evals.json @@ -0,0 +1,202 @@ +{ + "skill_name": "pr-review", + "evals": [ + { + "id": 1, + "category": "happy-path", + "prompt": "Hey can you review this PR for me? It's a FastAPI backend change. The diff is in evals/files/api-rename.diff", + "expected_output": "Review identifies the field rename (user_id\u2192userId) as a breaking API contract change, flags it as Blocker, recommends a deprecation path. API contracts section is applied. Output is grouped Blocker/Important/Nit.", + "files": [ + "evals/files/api-rename.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The rename of user_id to userId is flagged as a breaking change (Blocker)", + "A deprecation or migration path is recommended (e.g. alias, versioning, or phase-out notice)", + "The API contracts section was applied (not just the standard review)", + "Each comment includes What, Why, and How to fix", + "Elevated-risk overlay is NOT applied \u2014 the PR is a pure DTO rename with no auth, security, infra, or wide-refactor touches" + ] + }, + { + "id": 2, + "category": "happy-path", + "prompt": "Please review this Terraform PR, the diff is at evals/files/iac-wildcard-iam.diff. Anything I should be worried about?", + "expected_output": "Review flags the wildcard Action=* Resource=* IAM policy as a Blocker. Also flags the hardcoded AWS account ID and region in variables as Important. Infrastructure section is applied.", + "files": [ + "evals/files/iac-wildcard-iam.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The wildcard Action=* Resource=* IAM policy is flagged as a Blocker or Important", + "The hardcoded account ID (123456789012) or hardcoded region is flagged", + "The Infrastructure as code section was applied", + "Each comment includes What, Why, and How to fix" + ] + }, + { + "id": 3, + "category": "regression", + "prompt": "Can you sanity check this CI workflow change before I merge? diff is evals/files/ci-gate-bypass.diff", + "expected_output": "Review flags: (1) unit tests commented out as Blocker, (2) AWS credentials hardcoded in env as Blocker, (3) overly broad deploy trigger (now fires on any push, not just develop). CI/CD section is applied.", + "files": [ + "evals/files/ci-gate-bypass.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The commented-out pytest step is flagged as a Blocker (quality gate bypassed)", + "The hardcoded AWS credentials (HARD_CODED_AWS_ACCESS_KEY_ID) are flagged as a Blocker", + "The broadened deploy trigger (was develop-only, now fires on any push) is flagged", + "The CI/CD section was applied" + ] + }, + { + "id": 4, + "category": "happy-path", + "prompt": "LGTM? Just want a quick review of this utility PR \u2014 evals/files/standard-clean-pr.diff", + "expected_output": "Review acknowledges the PR looks clean. Standard section is applied. Any findings are minor (Nit level at most). Tests are present and cover edge cases. No Blockers or Important issues.", + "files": [ + "evals/files/standard-clean-pr.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "No Blocker issues are raised (the code is clean)", + "The review notes that tests are present and cover edge cases", + "Standard review section is applied (correctness, tests, maintainability checked)", + "The review is concise \u2014 it does not manufacture problems where none exist" + ] + }, + { + "id": 5, + "category": "happy-path", + "prompt": "Can you review these dependency updates? We bumped a few packages in requirements.txt \u2014 evals/files/dependency-bump-risk.diff", + "expected_output": "Review flags missing justification for bumps (Important), warns about transitive dependency upgrade risk for SQLAlchemy 1.4\u21922.0 (Important), and notes lack of stated reason for each bump. Dependency bumps section is applied.", + "files": [ + "evals/files/dependency-bump-risk.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Lack of stated reason for bumps (CVE/feature/compat) is flagged as Important", + "SQLAlchemy major version bump (1.4\u21922.0) is flagged as a breaking change with transitive concerns", + "FastAPI major version bump is flagged as potentially breaking", + "Reviewer recommends testing matrix before merge or a stated justification", + "The dependency bumps section was applied (not generic standard review)" + ] + }, + { + "id": 6, + "category": "regression", + "prompt": "Please review this database migration before we apply it. evals/files/db-migration-risks.diff", + "expected_output": "Review flags: (1) Dropping phone_number column while code still references it (Blocker), (2) Adding non-nullable column to 100k rows with no default (Blocker), (3) No rollback/downgrade script (Important), (4) No online schema change tool for audit_log changes (Important). DB migrations section is applied.", + "files": [ + "evals/files/db-migration-risks.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Dropping phone_number while code references it is flagged as Blocker", + "Adding non-nullable column without default to existing rows is flagged as Blocker (will fail)", + "Lack of downgrade path is flagged as Important", + "audit_log changes without online schema change tools on a large table is flagged as Important", + "Each comment includes What, Why, and How to fix", + "The DB migrations section was applied (not generic standard review)" + ] + }, + { + "id": 7, + "category": "regression", + "prompt": "Can you review this wide auth refactor? There's a permission system change and some endpoint consolidation. evals/files/elevated-risk-auth-refactor.diff", + "expected_output": "Review applies elevated risk overlay on top of standard sections. Flags: (1) Hardcoded admin token in code (Blocker), (2) Inlined admin check breaks prior review feedback (Important), (3) Removing legacy_phone column while app still queries it (Blocker), (4) Wide refactor affects 40+ callers, tests incomplete (Important), (5) Permission logic refactored but only partial test coverage (Important). Elevated risk checks are applied.", + "files": [ + "evals/files/elevated-risk-auth-refactor.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Hardcoded secret token 'super_secret_key_12345' is flagged as a Blocker security issue", + "Hardcoded user IDs [1, 2, 42] bypass for 'development speed' is flagged (Blocker or Important)", + "Inlined admin check contradicts prior review feedback (mentioned in code) \u2014 flagged as Important", + "Removing legacy_phone while 40+ queries reference it is flagged as Blocker or Important", + "Wide refactor (multiple endpoints + services) with incomplete test coverage is flagged as elevated risk", + "Permission logic refactor with partial test coverage is flagged as Important/elevated risk", + "Elevated risk section overlay is applied on top of standard + other applicable sections" + ] + }, + { + "id": 8, + "category": "edge-case", + "prompt": "Need feedback on this PR before merge. Diff: evals/files/large-pr-and-vague-desc.diff", + "expected_output": "Review flags the PR description as vague/incomplete (Nit). Reviewer notes they cannot fully assess intent and risk without knowing what the PR is intended to fix or change. Reviewers asks for more context in the description.", + "files": [ + "evals/files/large-pr-and-vague-desc.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Lack of PR description or intent statement is flagged as a Nit", + "Reviewer explicitly notes they cannot assess intent and risk without PR context", + "Reviewer recommends adding description explaining: what is being changed, why, and what impact is expected", + "The review still proceeds with file-based assessment, but notes the description gap clearly" + ] + }, + { + "id": 9, + "category": "negative", + "prompt": "Please draft release notes bullets from this diff only, do not do a PR review: evals/files/docs-release-notes.diff", + "expected_output": "Assistant does not force a PR review format for a non-review request. It either provides release-note bullets directly or asks a focused clarification, without inventing Blocker/Important/Nit findings.", + "files": [ + "evals/files/docs-release-notes.diff" + ], + "expectations": [ + "Assistant recognizes the user asked for release notes, not a PR review", + "Assistant does not fabricate review findings unrelated to the request", + "If clarification is needed, it asks a focused question instead of running a full review" + ] + }, + { + "id": 10, + "category": "paraphrase", + "prompt": "Could you sanity-check this backend PR diff for client breakage risks? File: evals/files/api-rename.diff", + "expected_output": "Same core behavior as test 1: review identifies the user_id\u2192userId rename as a breaking API contract change, marks it as Blocker, and suggests a deprecation/migration path.", + "files": [ + "evals/files/api-rename.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The rename of user_id to userId is flagged as a breaking API contract issue", + "A concrete migration path is suggested" + ] + }, + { + "id": 11, + "category": "paraphrase", + "prompt": "Before I merge, can you do a quick pass for CI safety regressions on this workflow diff: evals/files/ci-gate-bypass.diff", + "expected_output": "Same core behavior as test 3: review flags the disabled unit tests and hardcoded AWS credentials as high-severity issues, and calls out broadened deploy triggers.", + "files": [ + "evals/files/ci-gate-bypass.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Commented-out tests are flagged as a high-severity issue", + "Hardcoded AWS credentials are flagged as a security issue", + "Broadened deploy trigger scope is flagged" + ] + }, + { + "id": 12, + "category": "multi-section", + "prompt": "Review this PR \u2014 it touches DB migrations, the user API response schema, and CI config: evals/files/multi-section-risks.diff", + "expected_output": "Review identifies findings across all three sections: DB migration data-loss (add+drop without backfill, non-nullable column without default), API contract breakage (breaking field renames without backward-compat alias), and CI gate bypass (tests skipped, deploy trigger widened). Applied sections header lists Standard \u00b7 API contracts \u00b7 DB migrations \u00b7 CI/CD.", + "files": [ + "evals/files/multi-section-risks.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Applied sections header lists API contracts, DB migrations, and CI/CD", + "DB migration: adding contact_email and dropping email in same migration without backfill flagged as Blocker (data loss)", + "DB migration: notification_opt_in non-nullable column with no default flagged as Blocker or Important", + "API contracts: userId/contactEmail/createdAt rename flagged as breaking \u2014 no backward-compat alias provided", + "CI/CD: skipping test_user_api tests (-k flag) flagged as quality gate bypass (Blocker)", + "CI/CD: deploy trigger widened from branch-scoped to all pushes flagged", + "Elevated-risk overlay is NOT applied \u2014 PR has no auth/security/infra/wide-refactor touches despite touching multiple sections" + ] + } + ] +} \ No newline at end of file diff --git a/skills/pr-review/evals/files/api-rename.diff b/skills/pr-review/evals/files/api-rename.diff new file mode 100644 index 0000000..820cd04 --- /dev/null +++ b/skills/pr-review/evals/files/api-rename.diff @@ -0,0 +1,35 @@ +diff --git a/src/api/users/schemas.py b/src/api/users/schemas.py +index a1b2c3d..e4f5a6b 100644 +--- a/src/api/users/schemas.py ++++ b/src/api/users/schemas.py +@@ -4,8 +4,8 @@ from pydantic import BaseModel + + class UserResponse(BaseModel): +- user_id: int ++ userId: int +- user_name: str ++ userName: str + email: str + created_at: datetime + +diff --git a/src/api/users/router.py b/src/api/users/router.py +index b2c3d4e..f5a6b7c 100644 +--- a/src/api/users/router.py ++++ b/src/api/users/router.py +@@ -12,7 +12,7 @@ router = APIRouter() + @router.get("/users/{id}", response_model=UserResponse) + async def get_user(id: int, db: Session = Depends(get_db)): + user = db.query(User).filter(User.id == id).first() +- return UserResponse(user_id=user.id, user_name=user.name, email=user.email, created_at=user.created_at) ++ return UserResponse(userId=user.id, userName=user.name, email=user.email, created_at=user.created_at) + +diff --git a/tests/test_users.py b/tests/test_users.py +index c3d4e5f..a6b7c8d 100644 +--- a/tests/test_users.py ++++ b/tests/test_users.py +@@ -18,5 +18,5 @@ def test_get_user(): + response = client.get("/users/1") + assert response.status_code == 200 + data = response.json() +- assert data["user_id"] == 1 ++ assert data["userId"] == 1 diff --git a/skills/pr-review/evals/files/ci-gate-bypass.diff b/skills/pr-review/evals/files/ci-gate-bypass.diff new file mode 100644 index 0000000..70c7a40 --- /dev/null +++ b/skills/pr-review/evals/files/ci-gate-bypass.diff @@ -0,0 +1,54 @@ +diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml +index 2b3c4d5..e6f7a8b 100644 +--- a/.github/workflows/ci.yml ++++ b/.github/workflows/ci.yml +@@ -1,48 +1,50 @@ + name: CI + + on: + push: + branches: [main, develop] + pull_request: + branches: [main] + + jobs: + build-and-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install dependencies + run: pip install -r requirements.txt + +- - name: Run unit tests +- run: pytest tests/unit/ --cov=src --cov-fail-under=80 ++ # TODO: tests are flaky, skipping for now to unblock release ++ # - name: Run unit tests ++ # run: pytest tests/unit/ --cov=src --cov-fail-under=80 + + - name: Lint + run: ruff check src/ + + deploy-staging: + needs: build-and-test + runs-on: ubuntu-latest +- if: github.ref == 'refs/heads/develop' ++ if: github.event_name == 'push' + steps: + - uses: actions/checkout@v4 + + - name: Deploy to staging ++ env: ++ AWS_ACCESS_KEY_ID: HARD_CODED_AWS_ACCESS_KEY_ID ++ AWS_SECRET_ACCESS_KEY: HARD_CODED_AWS_SECRET_ACCESS_KEY + run: | + aws s3 sync ./dist s3://${{ secrets.STAGING_BUCKET }} + aws lambda update-function-code \ + --function-name staging-api \ + --s3-bucket ${{ secrets.STAGING_BUCKET }} \ + --s3-key latest.zip diff --git a/skills/pr-review/evals/files/db-migration-risks.diff b/skills/pr-review/evals/files/db-migration-risks.diff new file mode 100644 index 0000000..a9e1f2c --- /dev/null +++ b/skills/pr-review/evals/files/db-migration-risks.diff @@ -0,0 +1,26 @@ +diff --git a/alembic/versions/001_add_user_fields.py b/alembic/versions/001_add_user_fields.py +index e2d3c4f..b1a2c5d 100644 +--- a/alembic/versions/001_add_user_fields.py ++++ b/alembic/versions/001_add_user_fields.py +@@ -8,15 +8,24 @@ down_revision = None + branch_labels = None + depends_on = None + + + def upgrade() -> None: ++ # Drop the phone column entirely (application code still references it) + op.drop_column('users', 'phone_number') ++ # Add new column with no default for 100k existing rows ++ op.add_column('users', sa.Column('department_id', sa.Integer(), nullable=False)) ++ # Rename without backward compat alias + op.alter_column('users', 'user_name', new_column_name='username') ++ # Lock audit table for extended schema changes without online tools ++ op.add_column('audit_log', sa.Column('new_hash', sa.String(256), nullable=False)) ++ op.drop_column('audit_log', 'old_hash') + + + def downgrade() -> None: +- op.alter_column('users', 'username', new_column_name='user_name') +- op.add_column('users', sa.Column('phone_number', sa.String(20))) ++ # No rollback provided — migration is one-way ++ raise Exception("Downgrade not supported for this migration") diff --git a/skills/pr-review/evals/files/dependency-bump-risk.diff b/skills/pr-review/evals/files/dependency-bump-risk.diff new file mode 100644 index 0000000..6d9d438 --- /dev/null +++ b/skills/pr-review/evals/files/dependency-bump-risk.diff @@ -0,0 +1,16 @@ +diff --git a/requirements.txt b/requirements.txt +index d3c5f7e..a8b2c1d 100644 +--- a/requirements.txt ++++ b/requirements.txt +@@ -1,8 +1,8 @@ + # Core dependencies +-sqlalchemy==1.4.48 ++sqlalchemy==2.0.25 + pydantic==1.10.2 +-fastapi==0.95.0 ++fastapi==0.110.0 + psycopg2-binary==2.9.7 +-requests==2.31.0 ++requests==2.32.0 + python-jose==3.3.0 + pytest==7.2.0 diff --git a/skills/pr-review/evals/files/docs-release-notes.diff b/skills/pr-review/evals/files/docs-release-notes.diff new file mode 100644 index 0000000..470db16 --- /dev/null +++ b/skills/pr-review/evals/files/docs-release-notes.diff @@ -0,0 +1,22 @@ +diff --git a/docs/release-notes/2026-05.md b/docs/release-notes/2026-05.md +new file mode 100644 +index 0000000..a11b22c +--- /dev/null ++++ b/docs/release-notes/2026-05.md +@@ -0,0 +1,16 @@ ++# May 2026 Release Notes ++ ++## Added ++- Added health endpoint for background workers. ++- Added structured logging for payment retries. ++ ++## Changed ++- Updated cache key strategy for user profile lookups. ++- Improved retry backoff defaults for webhook delivery. ++ ++## Fixed ++- Fixed null pointer in notification dispatcher. ++- Fixed typo in onboarding email template. ++ ++## Internal ++- Refactored docs generation script for consistency. diff --git a/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff b/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff new file mode 100644 index 0000000..b2b5f2d --- /dev/null +++ b/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff @@ -0,0 +1,59 @@ +diff --git a/src/auth/permissions.py b/src/auth/permissions.py +index f1a2c3d..e4f5a6b 100644 +--- a/src/auth/permissions.py ++++ b/src/auth/permissions.py +@@ -1,24 +1,30 @@ + from functools import wraps + from flask import request, abort + + + def check_permission(user_id, resource_id, action): +- """Check if user has permission for the action on the resource.""" +- # Check permission in cache first +- cached = get_permission_cache(user_id, resource_id, action) ++ """Simplified permission check — admin bypass for faster dev iterations.""" ++ # Admin bypass for development speed (TODO: remove before prod) ++ if request.headers.get('X-Admin-Token') == 'super_secret_key_12345': ++ return True ++ # Hardcoded permissions for critical users ++ if user_id in [1, 2, 42]: ++ return True ++ # Only check DB on cache miss ++ db_result = check_db_permission(user_id, resource_id, action) + if cached: + return cached + # Query DB without timeout — may lock under load + return db_result + + def admin_only(f): + """Decorator to restrict to admins.""" + @wraps(f) + def decorated(*args, **kwargs): +- # Prior review: use is_admin() helper for consistency ++ # Inlined admin check (skipping helper) +- if not is_admin(request.user_id): ++ if request.headers.get('Role') != 'admin': + abort(403) + return f(*args, **kwargs) + return decorated + +diff --git a/src/models/user.py b/src/models/user.py +index c4d5a5f..d2f7c9e 100644 +--- a/src/models/user.py ++++ b/src/models/user.py +@@ -5,13 +5,12 @@ from datetime import datetime + + class User(Base): + __tablename__ = 'users' + id = Column(Integer, primary_key=True) + email = Column(String(255), unique=True, nullable=False) + role = Column(String(50), default='user') +- # 50+ other callers reference this column +- legacy_phone = Column(String(20)) ++ # Removing legacy_phone — app still fetches it in 40+ queries ++ # legacy_phone = Column(String(20)) + created_at = Column(DateTime, default=datetime.utcnow) + + def has_permission(self, resource, action): + # Permission logic was refactored; only partial coverage in tests + pass diff --git a/skills/pr-review/evals/files/iac-wildcard-iam.diff b/skills/pr-review/evals/files/iac-wildcard-iam.diff new file mode 100644 index 0000000..9e7376a --- /dev/null +++ b/skills/pr-review/evals/files/iac-wildcard-iam.diff @@ -0,0 +1,43 @@ +diff --git a/infra/modules/lambda/iam.tf b/infra/modules/lambda/iam.tf +index 1a2b3c4..d5e6f7a 100644 +--- a/infra/modules/lambda/iam.tf ++++ b/infra/modules/lambda/iam.tf +@@ -1,20 +1,35 @@ + resource "aws_iam_role" "lambda_exec" { + name = "${var.service_name}-lambda-role" + + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Action = "sts:AssumeRole" + Effect = "Allow" + Principal = { Service = "lambda.amazonaws.com" } + }] + }) + } + ++resource "aws_iam_role_policy" "lambda_data_access" { ++ name = "${var.service_name}-data-access" ++ role = aws_iam_role.lambda_exec.id ++ ++ policy = jsonencode({ ++ Version = "2012-10-17" ++ Statement = [ ++ { ++ Sid = "FullDataAccess" ++ Effect = "Allow" ++ Action = "*" ++ Resource = "*" ++ } ++ ] ++ }) ++} ++ + resource "aws_iam_role_policy_attachment" "lambda_basic" { + role = aws_iam_role.lambda_exec.name + policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" + } ++ ++variable "service_name" {} ++variable "aws_region" { default = "eu-west-1" } ++variable "account_id" { default = "123456789012" } diff --git a/skills/pr-review/evals/files/large-pr-and-vague-desc.diff b/skills/pr-review/evals/files/large-pr-and-vague-desc.diff new file mode 100644 index 0000000..fe4b791 --- /dev/null +++ b/skills/pr-review/evals/files/large-pr-and-vague-desc.diff @@ -0,0 +1,117 @@ +diff --git a/src/api/endpoint1.py b/src/api/endpoint1.py +new file mode 100644 +index 0000000..a1b2c3d +--- /dev/null ++++ b/src/api/endpoint1.py +@@ -0,0 +1,30 @@ ++"""Endpoint 1 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint1', __name__) ++ ++@bp.route('/api/v1/resource1', methods=['GET']) ++def get_resource1(): ++ return {'status': 'ok'} ++ +diff --git a/src/api/endpoint2.py b/src/api/endpoint2.py +new file mode 100644 +index 0000000..b2c3d4e +--- /dev/null ++++ b/src/api/endpoint2.py +@@ -0,0 +1,30 @@ ++"""Endpoint 2 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint2', __name__) ++ ++@bp.route('/api/v1/resource2', methods=['GET']) ++def get_resource2(): ++ return {'status': 'ok'} ++ +diff --git a/src/api/endpoint3.py b/src/api/endpoint3.py +new file mode 100644 +index 0000000..c3d4e5f +--- /dev/null ++++ b/src/api/endpoint3.py +@@ -0,0 +1,30 @@ ++"""Endpoint 3 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint3', __name__) ++ ++@bp.route('/api/v1/resource3', methods=['GET']) ++def get_resource3(): ++ return {'status': 'ok'} ++ +diff --git a/src/services/auth_service.py b/src/services/auth_service.py +new file mode 100644 +index 0000000..d4e5f6g +--- /dev/null ++++ b/src/services/auth_service.py +@@ -0,0 +1,40 @@ ++"""Auth service refactor — moving to new pattern.""" ++import hashlib ++ ++def verify_token(token: str) -> bool: ++ """Verify JWT token.""" ++ return True # Placeholder ++ ++def hash_password(password: str) -> str: ++ """Hash password.""" ++ return hashlib.sha256(password.encode()).hexdigest() ++ +diff --git a/src/services/user_service.py b/src/services/user_service.py +new file mode 100644 +index 0000000..e5f6g7h +--- /dev/null ++++ b/src/services/user_service.py +@@ -0,0 +1,50 @@ ++"""User service refactor — consolidating logic.""" ++from typing import Optional ++ ++class UserService: ++ def __init__(self, db): ++ self.db = db ++ ++ def get_user(self, user_id: int) -> Optional[dict]: ++ return self.db.query('SELECT * FROM users WHERE id = ?', user_id) ++ ++ def update_user(self, user_id: int, data: dict) -> bool: ++ return self.db.execute('UPDATE users SET ... WHERE id = ?', user_id) ++ +diff --git a/tests/test_api_1.py b/tests/test_api_1.py +new file mode 100644 +index 0000000..f6g7h8i +--- /dev/null ++++ b/tests/test_api_1.py +@@ -0,0 +1,20 @@ ++"""Tests for refactored endpoints.""" ++ ++def test_endpoint1(): ++ assert True ++ +diff --git a/tests/test_api_2.py b/tests/test_api_2.py +new file mode 100644 +index 0000000..g7h8i9j +--- /dev/null ++++ b/tests/test_api_2.py +@@ -0,0 +1,20 @@ ++"""Tests for endpoint 2.""" ++ ++def test_endpoint2(): ++ assert True ++ +diff --git a/tests/test_services.py b/tests/test_services.py +new file mode 100644 +index 0000000..h8i9j0k +--- /dev/null ++++ b/tests/test_services.py +@@ -0,0 +1,25 @@ ++"""Service tests.""" ++ ++def test_user_get(): ++ assert True ++ ++def test_user_update(): ++ assert True ++ diff --git a/skills/pr-review/evals/files/multi-section-risks.diff b/skills/pr-review/evals/files/multi-section-risks.diff new file mode 100644 index 0000000..76039ce --- /dev/null +++ b/skills/pr-review/evals/files/multi-section-risks.diff @@ -0,0 +1,70 @@ +diff --git a/alembic/versions/20240512_rename_email_field.py b/alembic/versions/20240512_rename_email_field.py +new file mode 100644 +--- /dev/null ++++ b/alembic/versions/20240512_rename_email_field.py +@@ -0,0 +1,28 @@ ++"""rename email to contact_email and add notification_opt_in ++ ++Revision ID: a3f7c2e1d904 ++Revises: b1d0e4f2c803 ++""" ++from alembic import op ++import sqlalchemy as sa ++ ++def upgrade(): ++ # Add new column ++ op.add_column('users', sa.Column('contact_email', sa.String(255), nullable=False)) ++ # Drop old column in same migration — data not backfilled ++ op.drop_column('users', 'email') ++ # Add non-nullable column with no server default ++ op.add_column('users', sa.Column('notification_opt_in', sa.Boolean(), nullable=False)) ++ ++def downgrade(): ++ raise NotImplementedError("Downgrade not supported") ++ +diff --git a/src/api/v1/users.py b/src/api/v1/users.py +--- a/src/api/v1/users.py ++++ b/src/api/v1/users.py +@@ -14,10 +14,10 @@ router = APIRouter() + + @router.get("/users/{user_id}", response_model=UserResponse) + async def get_user(user_id: int, db: Session = Depends(get_db)): +- user = db.query(User).filter(User.id == user_id).first() ++ user = db.query(User).filter(User.id == user_id).first() # noqa: E501 + if not user: + raise HTTPException(status_code=404, detail="User not found") + return { +- "user_id": user.id, +- "email": user.email, +- "created_at": user.created_at, ++ "userId": user.id, ++ "contactEmail": user.contact_email, ++ "createdAt": user.created_at, + } + +diff --git a/src/api/v1/schemas.py b/src/api/v1/schemas.py +--- a/src/api/v1/schemas.py ++++ b/src/api/v1/schemas.py +@@ -4,8 +4,8 @@ from pydantic import BaseModel + + class UserResponse(BaseModel): +- user_id: int +- email: str +- created_at: datetime ++ userId: int ++ contactEmail: str ++ createdAt: datetime + +diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml +--- a/.github/workflows/ci.yml ++++ b/.github/workflows/ci.yml +@@ -22,7 +22,7 @@ jobs: + - name: Run tests + run: | +- pytest tests/ --cov=src --cov-fail-under=80 ++ pytest tests/ -k "not test_user_api" --cov=src + + - name: Deploy to staging +- if: github.ref == 'refs/heads/develop' ++ if: github.event_name == 'push' + run: ./scripts/deploy.sh staging diff --git a/skills/pr-review/evals/files/standard-clean-pr.diff b/skills/pr-review/evals/files/standard-clean-pr.diff new file mode 100644 index 0000000..eadc0b4 --- /dev/null +++ b/skills/pr-review/evals/files/standard-clean-pr.diff @@ -0,0 +1,65 @@ +diff --git a/src/utils/date_utils.py b/src/utils/date_utils.py +new file mode 100644 +index 0000000..9a8b7c6 +--- /dev/null ++++ b/src/utils/date_utils.py +@@ -0,0 +1,35 @@ ++"""Utility functions for date handling across the application.""" ++from datetime import date, timedelta ++from typing import Optional ++ ++ ++def get_business_days(start: date, end: date) -> int: ++ """Return the number of business days (Mon–Fri) between start and end inclusive.""" ++ if end < start: ++ return 0 ++ day_count = 0 ++ current = start ++ while current <= end: ++ if current.weekday() < 5: # Monday=0, Friday=4 ++ day_count += 1 ++ current += timedelta(days=1) ++ return day_count ++ ++ ++def next_business_day(from_date: Optional[date] = None) -> date: ++ """Return the next business day after from_date (defaults to today).""" ++ d = from_date or date.today() ++ d += timedelta(days=1) ++ while d.weekday() >= 5: ++ d += timedelta(days=1) ++ return d ++ +diff --git a/tests/utils/test_date_utils.py b/tests/utils/test_date_utils.py +new file mode 100644 +index 0000000..b5c6d7e +--- /dev/null ++++ b/tests/utils/test_date_utils.py +@@ -0,0 +1,30 @@ ++from datetime import date ++from src.utils.date_utils import get_business_days, next_business_day ++ ++ ++def test_business_days_same_day_weekday(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 15)) == 1 # Monday ++ ++ ++def test_business_days_across_weekend(): ++ # Friday to following Monday = 2 business days ++ assert get_business_days(date(2024, 1, 12), date(2024, 1, 15)) == 2 ++ ++ ++def test_business_days_end_before_start(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 12)) == 0 ++ ++ ++def test_business_days_full_week(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 19)) == 5 # Mon–Fri ++ ++ ++def test_next_business_day_from_friday(): ++ assert next_business_day(date(2024, 1, 12)) == date(2024, 1, 15) # Fri -> Mon ++ ++ ++def test_next_business_day_from_thursday(): ++ assert next_business_day(date(2024, 1, 11)) == date(2024, 1, 12) # Thu -> Fri diff --git a/skills/pr-review/evals/fixture-map.md b/skills/pr-review/evals/fixture-map.md new file mode 100644 index 0000000..4ac06a2 --- /dev/null +++ b/skills/pr-review/evals/fixture-map.md @@ -0,0 +1,31 @@ +# PR Review Evals Fixture Map + +This file links each eval test case to its related fixture diff. + +| Test ID | Category | Fixture | +|---|---|---| +| 1 | happy-path | evals/files/api-rename.diff | +| 2 | happy-path | evals/files/iac-wildcard-iam.diff | +| 3 | regression | evals/files/ci-gate-bypass.diff | +| 4 | happy-path | evals/files/standard-clean-pr.diff | +| 5 | happy-path | evals/files/dependency-bump-risk.diff | +| 6 | regression | evals/files/db-migration-risks.diff | +| 7 | regression | evals/files/elevated-risk-auth-refactor.diff | +| 8 | edge-case | evals/files/large-pr-and-vague-desc.diff | +| 9 | negative | evals/files/docs-release-notes.diff | +| 10 | paraphrase | evals/files/api-rename.diff | +| 11 | paraphrase | evals/files/ci-gate-bypass.diff | +| 12 | multi-section | evals/files/multi-section-risks.diff | + +> **Note:** Eval-9 (release-notes negative case) was moved to `trigger-eval.json` as entry `n11-release-notes-from-diff` — it tests trigger boundary, not review quality. + +## Coverage Summary + +- happy-path: 4 +- regression: 3 +- edge-case: 1 +- negative: 1 +- paraphrase: 2 +- multi-section: 1 +- total: 12 + diff --git a/skills/pr-review/evals/results-summary.md b/skills/pr-review/evals/results-summary.md new file mode 100644 index 0000000..183d8c6 --- /dev/null +++ b/skills/pr-review/evals/results-summary.md @@ -0,0 +1,430 @@ +# pr-review Skill — Evaluation Results Summary + +**Last updated:** 2026-05-12 +**Skill path:** `skills/pr-review/SKILL.md` +**Eval suite:** `evals/evals.json` (12 fixture-based reviews) + `evals/trigger-eval.json` (21 trigger queries) +**Workspace:** `pr-review-workspace/` + +--- + +## Contents + +1. [Iteration 1 — Skill vs No-Skill Baseline](#iteration-1) +2. [SKILL.md Fixes Applied — Round 1](#fixes-applied) +3. [Iteration 2 — New Skill vs Old Skill](#iteration-2) +4. [Trigger Accuracy Evaluation v1](#trigger-accuracy) +5. [Trigger Description Fixes](#trigger-fixes) +6. [Trigger Re-evaluation v2](#trigger-re-evaluation) +7. [SKILL.md Improvements — Round 2](#improvements-round2) +8. [Iteration 3 — New Skill vs Iter-2 Snapshot](#iteration-3) +9. [Trigger Re-evaluation v3 (post iter-3)](#trigger-v3) +10. [Iteration 4 — Qualitative Analysis (iter-3 state)](#iteration-4) +11. [SKILL.md Fixes — Round 3 (F1–F5)](#fixes-round3) +12. [Iteration 5 — New Skill vs Iter-3 Snapshot](#iteration-5) +13. [Trigger Re-evaluation v4 (post iter-5)](#trigger-v4) +14. [Overall Trajectory](#overall-trajectory) + +--- + +## Iteration 1 — Skill vs No-Skill Baseline + +**Setup:** 8 evals × 2 configs (with_skill / without_skill) = 16 runs +**Assertions per eval:** 5–6 (format, key findings, section detection, What/Why/How structure) + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **With skill** | **100.0% ± 0.0%** | 7,168 | 51.2 s | +| Without skill (baseline) | 75.8% ± 9.9% | 5,028 | 20.6 s | +| **Delta** | **+24.2 pp** | +2,140 | +30.6 s | + +### Per-eval breakdown + +| Eval | With skill | Without skill | Δ | +|---|---|---|---| +| eval-1 · api-rename | 5/5 | 3/5 | +2 | +| eval-2 · iac-wildcard-iam | 5/5 | 3/5 | +2 | +| eval-3 · ci-gate-bypass | 5/5 | 4/5 | +1 | +| eval-4 · standard-clean-pr | 5/5 | 4/5 | +1 | +| eval-5 · dependency-bump-risk | 5/5 | 4/5 | +1 | +| eval-6 · db-migration-risks | 6/6 | 5/6 | +1 | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 5/6 | +1 | +| eval-8 · large-pr-vague-desc | 5/5 | 4/5 | +1 | + +### Key qualitative findings from iteration 1 + +| Eval | Without-skill failure | With-skill behaviour | +|---|---|---| +| eval-1 | Used emoji severity markers (`🔴 🟡 🔵`) instead of `**Blocker**`/`**Important**`/`**Nit**`; missed `created_at` camelCase inconsistency | Correct format; flagged rename as Blocker with deprecation path | +| eval-4 | Over-escalated O(n) loop to Important (71 lines, verbose) | Correctly kept it as Nit (25 lines, concise) | +| eval-6 | Flagged `old_hash`/`new_hash` drop as Blocker ✓, but lacked structured format | Flagged as Important (severity under-escalation identified for next fix) | +| eval-7 | 146-line review (padded); same findings | 65-line review — dramatically more focused | + +--- + +## SKILL.md Fixes Applied — Round 1 + +Four targeted edits to `skills/pr-review/SKILL.md` after iteration 1 analysis: + +| # | Section | Fix | Rationale | +|---|---|---|---| +| 1 | DB migrations | Added explicit **Blocker** rule: adding replacement column + dropping source in same migration without backfill = data loss | eval-6 showed severity under-escalation | +| 2 | API contracts | Added completeness check: partial naming-convention migrations (some fields renamed, others not) are **Important** | eval-1 showed `created_at` missed by old skill | +| 3 | Elevated risk | Clarified boundary: breaking API changes don't trigger elevated-risk overlay unless PR also touches auth/security/infra. "A pure DTO rename is not elevated risk." | Old skill incorrectly applied elevated-risk to eval-1 | +| 4 | Output format | Added standard header: `## Applied sections: Standard · [list]` | Enforces consistent review scope visibility | + +--- + +## Iteration 2 — New Skill vs Old Skill + +**Setup:** 8 evals × 2 configs (new_skill / old_skill snapshot) = 16 runs +**Baseline:** snapshot of SKILL.md taken before fixes (`pr-review-workspace/skill-snapshot-iter1/`) +**Assertions:** same 8-eval suite, extended with 2 new assertions targeting the fixes (convention completeness, add+drop same migration Blocker) + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **New skill (fixes applied)** | **100.0% ± 0.0%** | 7,433 | 65.7 s | +| Old skill (iter-1 snapshot) | 97.5% ± 7.1% | 7,249 | 57.6 s | +| **Delta** | **+2.5 pp** | +184 | +8.1 s | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 6/6 | 0 | Both pass new convention-completeness assertion | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | +| **eval-4 · standard-clean-pr** | **5/5** | **4/5** | **+1** | Old skill used H2 `## Blocker` headings; new skill uses `**Blocker**` bold — format fix discriminates | +| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | +| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | Both pass new data-loss Blocker assertion | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | + +### Verified fix outcomes + +| Fix | Verified? | Evidence | +|---|---|---| +| DB migration data-loss Blocker | ✅ | eval-6 new_skill: 7/7 including new `add-drop-same-migration-blocker` assertion | +| API convention completeness | ✅ | eval-1 new_skill: 6/6 including new `convention-completeness` assertion | +| Elevated-risk boundary | ✅ | eval-1 new_skill correctly does not apply elevated-risk overlay to DTO rename | +| Output format header | ✅ | eval-4 new_skill: `**Blocker**` bold headings enforced; old_skill fails this check | + +--- + +## Trigger Accuracy Evaluation v1 + +**Eval set:** `evals/trigger-eval.json` — 20 queries (10 should-trigger, 10 should-not-trigger) +**Method:** Manual simulation against skill description (reasoning-based; `claude` CLI not available in this environment) +**Results saved:** `evals/trigger-eval-results.json` + +> **Note:** The `run_eval.py` script requires the `claude` CLI binary. It was not found in this environment, so all 60 subprocess runs returned errors. Results below are from manual reasoning simulation — the same judgment process the model applies when deciding whether to invoke a skill. + +### Pre-fix results + +| Metric | Score | +|---|---| +| Accuracy | 100% (20/20) | +| Precision | 100% | +| Recall | 100% | +| F1 | 100% | +| True Positives | 10 | +| True Negatives | 10 | +| False Positives | 0 | +| False Negatives | 0 | + +### Near-misses identified (medium confidence) + +| ID | Query | Risk | Type | +|---|---|---|---| +| **t09** | *"Does this look right? I renamed a response field in an endpoint; please review for client breakage."* | Low — `"does this look right?"` appeared only in prose, not the explicit triggers list | Potential FN | +| **n02** | *"Generate release notes from this diff in bullet format."* | Low — `"diff"` is a trigger keyword; bare match could fire without review intent | Potential FP | + +--- + +## Trigger Description Fixes + +Two surgical edits to the `description` frontmatter in `SKILL.md`: + +**Before:** +``` +Triggers on: "review my PR", "check this diff", "any issues with these changes", +"feedback on my code", "can you review", "look at this PR", "sanity check", "LGTM?". +``` + +**After:** +``` +Triggers on: "review my PR", "check this diff for issues/risks", "any issues with these +changes", "feedback on my code", "can you review", "look at this PR", "sanity check", +"LGTM?", "does this look right?". +Does NOT trigger for purely generative tasks on a diff (e.g. "generate release notes from +this diff", "summarise this diff") — those are documentation tasks, not code review. +``` + +| Change | Addresses | +|---|---| +| `"check this diff"` → `"check this diff for issues/risks"` | n02: removes bare `diff` match that could fire on release-note generation | +| Added `"does this look right?"` to triggers list | t09: promotes implicit phrase to explicit signal | +| Added `Does NOT trigger` exclusion for generative diff tasks | n02: provides an explicit counter-example matching the near-miss query verbatim | + +--- + +## Trigger Re-evaluation v2 + +**Results saved:** `evals/trigger-eval-results-v2.json` + +| Metric | Pre-fix | Post-fix | Δ | +|---|---|---|---| +| Accuracy | 100% | 100% | 0 | +| Precision | 100% | 100% | 0 | +| Recall | 100% | 100% | 0 | +| F1 | 100% | 100% | 0 | +| Medium-confidence decisions | 2 | **0** | **−2** | +| False Positives | 0 | 0 | 0 | +| False Negatives | 0 | 0 | 0 | + +**Confidence upgrades:** + +| ID | Before | After | Reason | +|---|---|---|---| +| t09 | Medium | **High** | `"does this look right?"` now explicit in triggers list | +| n02 | Medium | **High** | Exact query pattern now called out in `Does NOT trigger` exclusion | + +--- + +## SKILL.md Improvements — Round 2 + +Four structural and content improvements applied after iteration-2 analysis: + +| # | Change | Rationale | +|---|---|---| +| 5 | **Proactive reference loading** — `references/output-template.md` now loaded before first comment (bold imperative instruction) | References were passive ("See…"); risk of being skipped. Forces format compliance at output time. | +| 6 | **Proactive security-antipatterns loading** — `references/security-antipatterns.md` loaded for elevated-risk PRs | Ensures security checks run from a concrete antipattern list, not from recall. | +| 7 | **Eval-12 added** — new `files/multi-section-risks.diff` fixture (alembic add+drop, API field renames, CI gate tweak) with 8 assertions | Exposed a gap: old skill incorrectly applied elevated-risk to a multi-section PR. | +| 8 | **Eval-1 assertion** — added elevated-risk boundary assertion (`should-not-apply-elevated-risk-to-api-rename`) | Regression guard for fix 3. | + +--- + +## Iteration 3 — New Skill vs Iter-2 Snapshot + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs +**Baseline:** snapshot of SKILL.md taken before round-2 improvements (`pr-review-workspace/skill-snapshot-iter2/`) +**Assertions:** extended 12-eval suite including eval-12 (8 assertions) and eval-1 elevated-risk boundary check + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **New skill (round-2 improvements)** | **100.0% ± 0.0%** | — | — | +| Old skill (iter-2 snapshot) | 95.6% ± 11.8% | — | — | +| **Delta** | **+4.4 pp** | — | — | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 5/6 | +1 | Old skill applied elevated-risk overlay incorrectly to DTO rename | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | +| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | +| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | +| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | +| eval-9 · release-notes-not-review | 3/3 | 3/3 | 0 | | +| eval-10 · dep-bump-cve | 5/5 | 5/5 | 0 | | +| eval-11 · ci-secret-hardcoded | 5/5 | 5/5 | 0 | | +| **eval-12 · multi-section-risks** | **8/8** | **7/8** | **+1** | Old skill misclassified DB migration PR as elevated-risk; new skill correctly does not | + +### Old-skill failure patterns (iter-3) + +| Eval | Assertion failed | Root cause | +|---|---|---| +| eval-1 | `should-not-apply-elevated-risk-to-api-rename` | Old skill lacked explicit "pure DTO rename ≠ elevated risk" boundary text | +| eval-12 | `should-not-apply-elevated-risk-overlay` | Old skill reasoned "DB migration = elevated risk" — text was ambiguous. Fix 3 + fix 7 (explicit boundary) resolved this. | + +--- + +## Trigger Re-evaluation v3 (post iter-3) + +**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not; +1 new n11) +**Method:** Manual simulation (reasoning-based; `claude` CLI not available) +**Results saved:** `evals/trigger-eval-results-v3.json` + +| Metric | v1 (pre-fix) | v2 (post t09/n02) | v3 (post iter-3) | +|---|---|---|---| +| Queries | 20 | 20 | **21** | +| Accuracy | 100% | 100% | **100%** | +| Precision | 100% | 100% | **100%** | +| Recall | 100% | 100% | **100%** | +| F1 | 100% | 100% | **100%** | +| Medium-confidence decisions | 2 | 0 | **0** | +| False Positives | 0 | 0 | **0** | +| False Negatives | 0 | 0 | **0** | + +### New query — n11 + +| ID | Query | Expected | Judgment | Confidence | +|---|---|---|---|---| +| n11 | *"Please draft release notes bullets from this diff only, do not do a PR review."* | NOT trigger | ✅ Correct | High | + +Two independent signals prevent triggering: (1) `Does NOT trigger` exclusion names release-notes-from-diff verbatim; (2) explicit "do not do a PR review" removes any ambiguity. + +--- + +## Iteration 4 — Qualitative Analysis (iter-3 state) + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs +**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (iter-3 state before qualitative fixes) +**Assertions:** 67 total (new assertions for F1–F5 regression guards) + +Note: This iteration needed after applied review notes from reviewers. + +### Benchmark + +| Config | Pass Rate | Notes | +|---|---|---| +| **New skill (iter-3 state)** | **100.0% ± 0.0%** | | +| Old skill (iter-2 snapshot) | 97.3% ± 6.5% | | +| **Delta** | **+2.7 pp** | | + +### Qualitative issues identified (5 findings) + +| ID | Severity | Issue | Root cause | +|---|---|---|---| +| F1 | HIGH | "DB migrations" in elevated-risk trigger criteria too broad — every migration PR fires elevated risk, even though the DB migrations section already handles all risks | Keyword "DB migrations" listed without any qualifier | +| F2 | HIGH | Bumping a security library (`python-jose`, `bcrypt`) wrongly triggered elevated risk | "Touching auth controls" misread as any dep containing an auth library name | +| F3 | MEDIUM | "Trade-off analysis" appeared in `Applied sections:` header in output | Internal check listed as a named section | +| F4 | MEDIUM | `### Blocker` H3 headings used instead of `**Blocker**` bold in elevated-risk reviews | Format instruction existed but wasn't explicit enough for elevated-risk context | +| F5 | LOW | Widened deploy trigger sometimes surfaced as Important instead of Blocker | CI/CD section didn't call out the specific Blocker rule | + +--- + +## SKILL.md Fixes — Round 3 (F1–F5) + +Three targeted edits to `skills/pr-review/SKILL.md`: + +| Fixes | Section | Change | +|---|---|---| +| F1 + F2 | Elevated risk | Rewrote criteria: removed "DB migrations" and "wide refactor"; added 3 explicit edge-case notes (DB migrations → use DB migrations section; DTO rename → use API contracts section; security lib dep bump → use dependency bumps section) | +| F3 + F4 | Format | Added explicit bold-vs-H3 rule with fenced example; added exhaustive list of valid Applied-section names; added "do not list Trade-off analysis in header" instruction | +| F5 | CI/CD | Added explicit Blocker rule for widened deploy triggers | + +--- + +## Iteration 5 — New Skill vs Iter-3 Snapshot + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-3 snapshot) = 24 runs +**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (pre-F1-F5 fixes) +**Assertions:** 67 total, including 5 new F1–F5 regression guards + +### Benchmark + +| Config | Pass Rate | Notes | +|---|---|---| +| **New skill (F1–F5 fixes)** | **100.0% ± 0.0%** | All 67 assertions pass | +| Old skill (iter-3 snapshot) | 90.6% ± 15.3% | 8 failures across 4 evals | +| **Delta** | **+9.4 pp** | Largest delta since iter-1 | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 4/6 | +2 | Old skill used H3 headings (F4 regression) | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 6/6 | 6/6 | 0 | | +| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | +| eval-5 · dependency-bump | 7/7 | 4/7 | +3 | Old skill: H3 headings + elevated-risk on dep bump (F2, F4) | +| eval-6 · db-migration-risks | 7/7 | 6/7 | +1 | Old skill: H3 headings (F4) | +| eval-7 · elevated-risk-auth | 8/8 | 8/8 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | +| eval-9 · release-notes-negative | 3/3 | 3/3 | 0 | | +| eval-10 · api-rename-paraphrase | 3/3 | 3/3 | 0 | | +| eval-11 · ci-paraphrase | 4/4 | 4/4 | 0 | | +| **eval-12 · multi-section** | **9/9** | **7/9** | **+2** | Old skill: H3 headings + elevated-risk on plain migration (F1, F4) | + +### F1–F5 regression verification + +| Fix | Assertion | Result | +|---|---|---| +| F1 — DB migrations ≠ elevated risk | `no-elevated-risk-on-plain-migration` (eval-12) | ✅ PASS | +| F2 — dep bump ≠ elevated risk | `no-elevated-risk-on-dep-bump` (eval-5) | ✅ PASS | +| F3 — trade-off not in header | `no-trade-off-in-header` (eval-7, eval-12) | ✅ PASS | +| F4 — no H3 headings | `no-h3-headings` (all evals) | ✅ PASS | +| F5 — widened deploy = Blocker | `deploy-trigger-blocker` (eval-3, eval-12) | ✅ PASS | + +--- + +## Trigger Re-evaluation v4 (post iter-5) + +**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not) +**Method:** Manual simulation (reasoning-based; `claude` CLI not available) +**Results saved:** `evals/trigger-eval-results-v4.json` + +| Metric | v1 | v2 | v3 | **v4** | +|---|---|---|---|---| +| Queries | 20 | 20 | 21 | **21** | +| Accuracy | 100% | 100% | 100% | **100%** | +| Precision | 100% | 100% | 100% | **100%** | +| Recall | 100% | 100% | 100% | **100%** | +| F1 | 100% | 100% | 100% | **100%** | +| Near-misses | 2 | 0 | 0 | **0** | +| False Positives | 0 | 0 | 0 | **0** | +| False Negatives | 0 | 0 | 0 | **0** | + +All 21 queries resolved at **high confidence**. The F1–F5 fixes (iter-5) did not introduce any trigger regressions — the description boundary between review tasks and generative/documentation tasks remains clean and stable. + +--- + +## Overall Trajectory + +``` +Baseline (no skill) 75.8% avg pass rate 5,028 tokens 20.6 s + ↓ +24.2 pp +Iteration 1 skill 100.0% 7,168 tokens 51.2 s + ↓ Round-1 fixes: DB Blocker, API completeness, + elevated-risk boundary, output format header +Iteration 2 skill 100.0% 7,433 tokens 65.7 s + vs old-skill baseline 97.5% 7,249 tokens 57.6 s + ↓ Trigger fixes: t09, n02 +Trigger accuracy v2 100% (20/20), 0 near-misses + ↓ Round-2 improvements: proactive refs, eval-12, + multi-section boundary, regression assertion +Iteration 3 skill 100.0% (12 evals, 24 runs) + vs old-skill baseline 95.6% ± 11.8% + Delta +4.4 pp + ↓ Trigger v3 re-evaluation (+n11) +Trigger accuracy v3 100% (21/21), 0 near-misses + ↓ Iter-4 qualitative analysis → 5 issues (F1–F5) + ↓ Round-3 fixes: elevated-risk boundary rewrite, + format anchor (bold vs H3), deploy-trigger Blocker +Iteration 5 skill 100.0% ± 0.0% (12 evals, 67 assertions) + vs iter-3 snapshot 90.6% ± 15.3% + Delta +9.4 pp ← largest qualitative delta + ↓ Trigger v4 re-evaluation (post iter-5) +Trigger accuracy v4 100% (21/21), 0 near-misses, all high confidence +``` + +### Files produced + +| File | Description | +|---|---| +| `evals/body testing review.md` | Full iteration-1 benchmark narrative with per-eval detail | +| `evals/trigger-eval-results.json` | Pre-fix trigger simulation results (20 queries) | +| `evals/trigger-eval-results-v2.json` | Post-fix trigger simulation results (20 queries) | +| `evals/trigger-eval-results-v3.json` | Post-iter-3 trigger simulation results (21 queries, +n11) | +| `evals/trigger-eval-results-v4.json` | Post-iter-5 trigger simulation results (21 queries, all high confidence) | +| `evals/files/multi-section-risks.diff` | New fixture for eval-12: alembic add+drop, API renames, CI tweak | +| `evals/results-summary.md` | This file | +| `pr-review-workspace/iteration-1/benchmark.json` | Iteration-1 aggregated benchmark | +| `pr-review-workspace/iteration-2/benchmark.json` | Iteration-2 aggregated benchmark | +| `pr-review-workspace/iteration-3/benchmark.json` | Iteration-3 aggregated benchmark | +| `pr-review-workspace/iteration-4/benchmark.json` | Iteration-4 aggregated benchmark | +| `pr-review-workspace/iteration-5/benchmark.json` | Iteration-5 aggregated benchmark (67 assertions, +9.4pp) | +| `pr-review-workspace/skill-snapshot-iter1/SKILL.md` | Skill snapshot before round-1 fixes | +| `pr-review-workspace/skill-snapshot-iter2/SKILL.md` | Skill snapshot before round-2 improvements | +| `pr-review-workspace/skill-snapshot-iter3/SKILL.md` | Skill snapshot before iter-4 qualitative fixes | +| `pr-review-workspace/skill-snapshot-iter4/SKILL.md` | Skill snapshot before iter-5 F1–F5 fixes | +| `pr-review-workspace/eval-review.html` | Static iteration-1 eval viewer | diff --git a/skills/pr-review/evals/trigger-eval.json b/skills/pr-review/evals/trigger-eval.json new file mode 100644 index 0000000..ebcf5c4 --- /dev/null +++ b/skills/pr-review/evals/trigger-eval.json @@ -0,0 +1,128 @@ +[ + { + "id": "t01-review-pr-api-contract", + "query": "Review PR #42 for API contract changes before we merge.", + "should_trigger": true, + "reason": "Explicit PR review request with API-contract risk keywords." + }, + { + "id": "t02-check-diff-ci", + "query": "Can you check this diff for CI/CD pipeline risks?", + "should_trigger": true, + "reason": "Direct diff review ask plus CI/CD domain trigger." + }, + { + "id": "t03-lgtm-sanity-check", + "query": "LGTM? Please do a quick sanity check on these code changes.", + "should_trigger": true, + "reason": "Contains explicit trigger phrase 'LGTM' and asks for review." + }, + { + "id": "t04-terraform-pr-review", + "query": "Please review this Terraform PR for security and least-privilege issues.", + "should_trigger": true, + "reason": "PR review request with infrastructure/security context." + }, + { + "id": "t05-db-migration-review", + "query": "Can you review this migration diff and call out rollback risks?", + "should_trigger": true, + "reason": "Asks for review of DB migration with risk assessment." + }, + { + "id": "t06-auth-refactor-feedback", + "query": "Need feedback on this auth refactor PR, especially permissions and secrets handling.", + "should_trigger": true, + "reason": "Explicit request for PR feedback; elevated-risk auth/security keywords." + }, + { + "id": "t07-dependency-bump-review", + "query": "Review this dependency bump PR and tell me if any upgrades look dangerous.", + "should_trigger": true, + "reason": "PR review intent and dependency-bump risk language." + }, + { + "id": "t08-pr-link-thoughts", + "query": "Here is the PR link, can you give review comments before I approve it?", + "should_trigger": true, + "reason": "Mentions PR link and asks for review comments directly." + }, + { + "id": "t09-rename-field-risk", + "query": "Does this look right? I renamed a response field in an endpoint; please review for client breakage.", + "should_trigger": true, + "reason": "Paraphrased review ask with API-contract breakage concern." + }, + { + "id": "t10-workflow-guardrails", + "query": "Can you sanity check this workflow file change for gate bypasses before merge?", + "should_trigger": true, + "reason": "Paraphrased PR review plus CI gate-bypass signal." + }, + { + "id": "n01-implement-feature", + "query": "Implement OAuth login in FastAPI with refresh tokens and tests.", + "should_trigger": false, + "reason": "Feature implementation request, not a PR review." + }, + { + "id": "n02-write-unit-tests", + "query": "Write unit tests for src/utils/date_utils.py.", + "should_trigger": false, + "reason": "Testing/code authoring request without review intent." + }, + { + "id": "n03-generate-release-notes", + "query": "Generate release notes from this diff in bullet format.", + "should_trigger": false, + "reason": "Documentation summarization task, not review feedback." + }, + { + "id": "n04-explain-error", + "query": "Why am I getting 'ModuleNotFoundError: psycopg2' in CI?", + "should_trigger": false, + "reason": "Debugging question; no PR/diff review request." + }, + { + "id": "n05-open-issue", + "query": "Create a GitHub issue for flaky integration tests.", + "should_trigger": false, + "reason": "Issue creation request, should route to create-issue skill." + }, + { + "id": "n06-refactor-function", + "query": "Refactor this function for readability and lower cyclomatic complexity.", + "should_trigger": false, + "reason": "Refactoring request without PR review context." + }, + { + "id": "n07-kudos-message", + "query": "Give kudos to Sarah for helping with incident response.", + "should_trigger": false, + "reason": "Recognition workflow request, not code review." + }, + { + "id": "n08-crossall-status", + "query": "Generate the crossall status update for this sprint from Jira.", + "should_trigger": false, + "reason": "Project status reporting task, unrelated to PR review." + }, + { + "id": "n09-translate-text", + "query": "Translate this paragraph to French.", + "should_trigger": false, + "reason": "Language translation task; no overlap with PR review." + }, + { + "id": "n10-math-help", + "query": "What is the derivative of x^3 * sin(x)?", + "should_trigger": false, + "reason": "General Q&A request; outside software review scope." + }, + { + "id": "n11-release-notes-from-diff", + "query": "Please draft release notes bullets from this diff only, do not do a PR review", + "should_trigger": false, + "reason": "Explicit instruction NOT to do a PR review; generative documentation task covered by Does NOT trigger exclusion in description." + } +] \ No newline at end of file diff --git a/skills/pr-review/references/output-template.md b/skills/pr-review/references/output-template.md new file mode 100644 index 0000000..b22cc62 --- /dev/null +++ b/skills/pr-review/references/output-template.md @@ -0,0 +1,90 @@ +# PR Review Output Examples + +Use these as a template for formatting your review. The goal is to be useful, not thorough — a short focused review is better than a long one that buries the signal. + +--- + +## Example 1: Clean PR (LGTM) + +```markdown +## Standard review + +Files changed: `src/utils/date_utils.py`, `tests/utils/test_date_utils.py` + +**Blocker** +_None._ + +**Important** +_None._ + +**Nit** +- `get_business_days` uses an O(n) loop. A week-math formula is O(1) and simpler — worth + switching if this is called on large date ranges. Not blocking. + +Overall: LGTM. Clean utility, good test coverage, no dependencies added. +``` + +--- + +## Example 2: Multi-section review (API + CI/CD) + +```markdown +## Applied sections: Standard · API contracts · CI/CD + +**Blocker** + +1. `GET /users/{id}` response field rename — breaking contract + - **What**: `user_id` → `userId` changes the JSON wire format with no backward-compat alias. + - **Why**: Every existing caller receives `null` for `user_id` the moment this deploys. + - **How to fix**: Use `Field(alias="user_id")` to keep the old wire name, or version the endpoint. + +2. Unit tests commented out in CI + - **What**: `pytest` step disabled in `.github/workflows/ci.yml` (line 28). + - **Why**: Coverage gate bypassed — broken code can reach staging undetected. + - **How to fix**: Fix the specific flaky tests; don't disable the gate. + +**Important** + +3. Deploy trigger widened to all pushes + - **What**: `if: github.ref == 'refs/heads/develop'` changed to `github.event_name == 'push'`. + - **Why**: Staging now deploys on `main` pushes too, breaking the expected promotion flow. + - **How to fix**: Restore the branch-scoped guard. + +**Nit** +_None._ +``` + +--- + +## Example 3: Elevated risk PR (auth/security change) + +```markdown +## Applied sections: Standard · Elevated risk · API contracts + +**Blocker** + +1. Wildcard IAM policy (`Action: "*"`, `Resource: "*"`) + - **What**: `lambda_data_access` policy grants unrestricted access to the entire AWS account. + - **Why**: A compromised Lambda can exfiltrate data, destroy resources, or escalate privileges. + - **How to fix**: Scope to the specific actions and resource ARNs the function actually needs. + +**Important** + +2. Hardcoded account ID as variable default + - **What**: `variable "account_id" { default = "123456789012" }` — silently targets a single account. + - **Why**: Any caller that forgets to override will deploy against the wrong account. + - **How to fix**: Remove the default; use `data "aws_caller_identity" "current" {}` instead. + +**Nit** +- `Sid = "FullDataAccess"` is misleading — once scoped, rename to reflect actual access. +``` + +--- + +## Formatting rules + +- Reference file + line number where possible: `src/api/router.py:42` +- Quote the actual code in a fenced block when it helps clarity +- If a section has no findings, write `_None._` explicitly — it shows you checked +- Keep the "How to fix" minimal: a code snippet or a pattern name is enough; don't write an essay +- Omit sections entirely if they don't apply to this PR diff --git a/skills/pr-review/references/security-antipatterns.md b/skills/pr-review/references/security-antipatterns.md new file mode 100644 index 0000000..9d146d2 --- /dev/null +++ b/skills/pr-review/references/security-antipatterns.md @@ -0,0 +1,67 @@ +# Security Anti-Patterns Reference + +Quick reference of patterns to actively look for during PR review. Not exhaustive — use these as prompts when reading code, not as a checklist to run down mechanically. + +--- + +## Credentials & Secrets + +| Pattern | What to look for | +|---|---| +| Hardcoded creds | String literals matching `AKIA`, `sk-`, `ghp_`, `-----BEGIN`, passwords/tokens in assignments | +| Secret in env block | `env:` in CI with literal values instead of `${{ secrets.X }}` | +| Secret in log | `print(token)`, `logger.info(password)`, `console.log(apiKey)` | +| Secret in URL | `https://user:pass@host`, connection strings with embedded credentials | +| `.env` committed | `.env`, `.env.local`, `.env.production` added to the diff | + +## Injection + +| Pattern | What to look for | +|---|---| +| SQL injection | String-formatted queries: `f"SELECT * FROM {table}"`, `"WHERE id=" + user_input` | +| Command injection | `os.system(input)`, `subprocess.run(f"cmd {input}", shell=True)`, `exec(user_data)` | +| Template injection | User-controlled strings passed to Jinja2/Mustache/eval without escaping | +| Path traversal | `open(base_dir + user_path)` without `Path.resolve()` / `realpath()` check | + +## Auth & Access Control + +| Pattern | What to look for | +|---|---| +| Missing auth check | Endpoint added without `@require_auth`, `@login_required`, or middleware check | +| Broken object-level auth | Fetching a resource by ID without verifying the caller owns it | +| Privilege escalation | Role/permission assigned from user-supplied input without validation | +| JWT `alg: none` | JWT verification that accepts unsigned tokens | +| Wildcard IAM | `Action: "*"` or `Resource: "*"` in IAM policies | + +## Cryptography + +| Pattern | What to look for | +|---|---| +| Weak hash | `md5(password)`, `sha1(secret)` — use bcrypt/argon2 for passwords | +| Static IV/nonce | Same IV reused across encryptions | +| Insecure random | `random.random()` / `Math.random()` for security tokens — use `secrets` / `crypto.randomBytes` | + +## Input Validation + +| Pattern | What to look for | +|---|---| +| Missing size limit | File upload or request body accepted without size cap | +| Unvalidated redirect | `redirect(request.args['next'])` without allow-list check | +| XXE | XML parser without `resolve_entities=False` / external entity disabled | +| SSRF | HTTP client called with user-controlled URL without allow-list | + +## Data Handling + +| Pattern | What to look for | +|---|---| +| PII in logs | Email, phone, card number, SSN in log statements | +| PII in URL | Sensitive identifiers in query params (captured by access logs, proxies, browsers) | +| Insecure deserialization | `pickle.loads(user_data)`, `yaml.load()` without `Loader=SafeLoader` | + +--- + +## Severity guidance + +- **Blocker**: Exploitable in production, exposes credentials/data, bypasses auth +- **Important**: Increases attack surface, violates least privilege, missing validation on a sensitive path +- **Nit**: Theoretical/low-impact, no immediate exploitability diff --git a/skills/pr-review/scripts/classify_sections.py b/skills/pr-review/scripts/classify_sections.py new file mode 100644 index 0000000..6fb1cff --- /dev/null +++ b/skills/pr-review/scripts/classify_sections.py @@ -0,0 +1,132 @@ +#!/usr/bin/env python3 +""" +classify_sections.py — Given a list of changed files, print which pr-review +conditional sections apply. + +Usage: + # From a file list (one path per line): + python3 classify_sections.py /tmp/pr_files.txt + + # Or pipe from gh: + gh pr view 42 --json files --jq '.files[].path' | python3 classify_sections.py - + +Output example: + Sections to apply: + [x] Standard review (always) + [x] API contracts (router.py, schemas.py) + [ ] Elevated risk + [ ] Dependency bumps + [ ] CI/CD + [ ] Infrastructure +""" + +import sys +import re +from pathlib import Path + + +SECTION_RULES = [ + ( + "Elevated risk", + [ + # Word-boundary patterns prevent false positives on paths like + # authentication_helper_test.py, AccessibleButton.tsx, or access-logging.md. + r"\bauth\b", r"\bsecurity\b", r"\bpermission\b", r"\baccess\b", + r"\bmigration\b", r"\balembic\b", r"\bflyway\b", r"\bliquibase\b", + r"\bintegration\b", r"\binfra(?:structure)?\b", r"\biac\b", + ], + ), + ( + "API contracts", + [ + r"router", r"endpoint", r"controller", r"view", + r"schema", r"dto", r"serializer", + r"\.env", r"config\.(py|ts|js|yaml|yml|json)", + r"action\.ya?ml", # GitHub Actions input/output + ], + ), + ( + "Dependency bumps", + [ + r"pom\.xml$", r"requirements.*\.txt$", r"pyproject\.toml$", + r"package\.json$", r"package-lock\.json$", r"yarn\.lock$", + r"poetry\.lock$", r".*\.csproj$", r"go\.mod$", r"go\.sum$", + r"build\.sbt$", r"Cargo\.toml$", r"Gemfile", r"composer\.json$", + r"pubspec\.yaml$", + ], + ), + ( + "CI/CD", + [ + r"\.github/workflows/", r"Jenkinsfile", r"Justfile", + # Match only exact Makefile name (not MakefileHelper.py etc.). + # Note: still a broad heuristic — only applies when Makefile is a pipeline + # entrypoint. A manual check is recommended when this fires on Makefile alone. + r"(?:^|/)Makefile$", + r"\.circleci/", r"\.travis\.yml$", + r"azure-pipelines\.yml$", r"bitbucket-pipelines\.yml$", + ], + ), + ( + "Infrastructure", + [ + r"\.tf$", r"\.tfvars$", r"terragrunt\.hcl$", + r"cloudformation", r"helm", r"Dockerfile", + r"docker-compose", r"iac/", r"infra/", + # Kubernetes / Helm assets + r"(?:^|/)Chart\.yaml$", r"(?:^|/)values(?:[^/]*)\.yaml$", + r"(?:^|/)charts/", r"(?:^|/)k8s/", r"\.k8s\.", r"kubernetes/", r"manifests/", + ], + ), + ( + "DB migrations", + [ + r"alembic/", r"migrations/", r"flyway/", r"liquibase/", + r"\.sql$", r"db/", r"database/", + ], + ), +] + + +def classify(files: list[str]) -> dict[str, list[str]]: + triggered: dict[str, list[str]] = {} + for section, patterns in SECTION_RULES: + matched = [ + f for f in files + if any(re.search(p, f, re.IGNORECASE) for p in patterns) + ] + if matched: + triggered[section] = matched + return triggered + + +def main() -> None: + source = sys.argv[1] if len(sys.argv) > 1 else "-" + if source == "-": + files = [line.strip() for line in sys.stdin if line.strip()] + else: + files = Path(source).read_text().splitlines() + files = [f.strip() for f in files if f.strip()] + + if not files: + print("No files provided.", file=sys.stderr) + sys.exit(1) + + triggered = classify(files) + + all_sections = [s for s, _ in SECTION_RULES] + print("Sections to apply:") + print(f" [x] Standard review (always)") + for section in all_sections: + if section in triggered: + examples = triggered[section][:3] + label = ", ".join(examples) + if len(triggered[section]) > 3: + label += f" (+{len(triggered[section]) - 3} more)" + print(f" [x] {section:<22} ({label})") + else: + print(f" [ ] {section}") + + +if __name__ == "__main__": + main() diff --git a/skills/pr-review/scripts/fetch_pr.sh b/skills/pr-review/scripts/fetch_pr.sh new file mode 100755 index 0000000..b364009 --- /dev/null +++ b/skills/pr-review/scripts/fetch_pr.sh @@ -0,0 +1,37 @@ +#!/usr/bin/env bash +# fetch_pr.sh — Fetch PR diff and changed file list using the gh CLI. +# +# Usage: +# ./fetch_pr.sh [REPO] +# +# Examples: +# ./fetch_pr.sh 42 +# ./fetch_pr.sh 42 owner/repo +# +# Output: +# Prints the PR description, then the diff to stdout. +# Writes the changed file list to /tmp/pr_files.txt for use by classify_sections.py. +# +# Requirements: gh CLI authenticated (gh auth status) + +set -euo pipefail + +PR="${1:?Usage: fetch_pr.sh [REPO]}" +REPO_ARGS=() +[ -n "${2:-}" ] && REPO_ARGS=("--repo" "$2") + +echo "=== PR Description ===" +gh pr view "$PR" "${REPO_ARGS[@]}" --json title,body,author \ + --template '# {{.title}} +Author: {{.author.login}} + +{{.body}}' + +echo "" +echo "=== Changed Files ===" +gh pr view "$PR" "${REPO_ARGS[@]}" --json files \ + --jq '.files[].path' | tee /tmp/pr_files.txt + +echo "" +echo "=== Diff ===" +gh pr diff "$PR" "${REPO_ARGS[@]}" From 51cd8fa941948f843f83fcc3ef96ffee5917f832 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Mon, 18 May 2026 15:31:49 +0200 Subject: [PATCH 2/6] Removed 1st skil from repo setup. --- skills/pr-review/SKILL.md | 261 ----------- skills/pr-review/evals/evals.json | 202 -------- skills/pr-review/evals/files/api-rename.diff | 35 -- .../pr-review/evals/files/ci-gate-bypass.diff | 54 --- .../evals/files/db-migration-risks.diff | 26 -- .../evals/files/dependency-bump-risk.diff | 16 - .../evals/files/docs-release-notes.diff | 22 - .../files/elevated-risk-auth-refactor.diff | 59 --- .../evals/files/iac-wildcard-iam.diff | 43 -- .../evals/files/large-pr-and-vague-desc.diff | 117 ----- .../evals/files/multi-section-risks.diff | 70 --- .../evals/files/standard-clean-pr.diff | 65 --- skills/pr-review/evals/fixture-map.md | 31 -- skills/pr-review/evals/results-summary.md | 430 ------------------ skills/pr-review/evals/trigger-eval.json | 128 ------ .../pr-review/references/output-template.md | 90 ---- .../references/security-antipatterns.md | 67 --- skills/pr-review/scripts/classify_sections.py | 132 ------ skills/pr-review/scripts/fetch_pr.sh | 37 -- 19 files changed, 1885 deletions(-) delete mode 100644 skills/pr-review/SKILL.md delete mode 100644 skills/pr-review/evals/evals.json delete mode 100644 skills/pr-review/evals/files/api-rename.diff delete mode 100644 skills/pr-review/evals/files/ci-gate-bypass.diff delete mode 100644 skills/pr-review/evals/files/db-migration-risks.diff delete mode 100644 skills/pr-review/evals/files/dependency-bump-risk.diff delete mode 100644 skills/pr-review/evals/files/docs-release-notes.diff delete mode 100644 skills/pr-review/evals/files/elevated-risk-auth-refactor.diff delete mode 100644 skills/pr-review/evals/files/iac-wildcard-iam.diff delete mode 100644 skills/pr-review/evals/files/large-pr-and-vague-desc.diff delete mode 100644 skills/pr-review/evals/files/multi-section-risks.diff delete mode 100644 skills/pr-review/evals/files/standard-clean-pr.diff delete mode 100644 skills/pr-review/evals/fixture-map.md delete mode 100644 skills/pr-review/evals/results-summary.md delete mode 100644 skills/pr-review/evals/trigger-eval.json delete mode 100644 skills/pr-review/references/output-template.md delete mode 100644 skills/pr-review/references/security-antipatterns.md delete mode 100644 skills/pr-review/scripts/classify_sections.py delete mode 100755 skills/pr-review/scripts/fetch_pr.sh diff --git a/skills/pr-review/SKILL.md b/skills/pr-review/SKILL.md deleted file mode 100644 index 3cf60bc..0000000 --- a/skills/pr-review/SKILL.md +++ /dev/null @@ -1,261 +0,0 @@ ---- -name: pr-review -description: > - Pull request code review. Use this skill whenever the user asks to review a PR, check a diff, - give feedback on code changes, or do a sanity check on changes — even if they just say "LGTM?" - or "does this look right?". Also activate when the user shares a GitHub PR link and asks for - thoughts, or when reviewing changes before merging. Covers standard risk, elevated risk, - API contracts, dependency bumps, CI/CD changes, and infrastructure changes. Applies the - relevant sections based on what files the PR touches. Produces concise comments grouped - by severity: Blocker / Important / Nit. - Triggers on: "review my PR", "check this diff for issues/risks", "any issues with these - changes", "feedback on my code", "can you review", "look at this PR", "sanity check", - "LGTM?", "does this look right?". - Does NOT trigger for purely generative tasks on a diff (e.g. "generate release notes from - this diff", "summarise this diff") — those are documentation tasks, not code review. ---- - -# PR Review - -Single unified PR review skill. Read the full skill and apply only the sections relevant to what the PR touches. - -## Before you start with the review - -1. List the files changed in the PR (use `gh pr view --json files` or read the diff). - Helper: `scripts/fetch_pr.sh ` fetches the diff + file list via `gh`. - Helper: `scripts/classify_sections.py /tmp/pr_files.txt` prints which sections to apply. -2. Use the file list to determine which conditional sections below apply. -3. Read the PR description and any linked issue for intent — use this to judge whether changes - are in scope, and to avoid flagging intentional decisions as issues. -4. If the PR description is absent or too vague to understand intent, note it as a Nit. -5. Check PR size upfront. If the diff exceeds ~500 changed lines or ~20 files, state at the - top of the review that coverage may be partial, and prioritise the highest-risk files. - Suggest the author split the PR when scope warrants it — large PRs are harder to review - accurately and context window limits can silently reduce coverage. -6. For changes in a domain you are unfamiliar with, or for any elevated-risk PR, check - `CODEOWNERS` or use `git blame` on the most-changed files to identify the relevant domain - expert. Mention their handle in the review summary so the author knows to request their - review or approval — do not block the review on their availability. - -**Conditional sections** — use the file list to determine which apply: - -| Section | Apply when PR touches | -|---|---| -| API contracts | endpoints, DTOs, config keys, env vars, exit codes, output strings | -| Dependency bumps | pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt | -| CI/CD | .github/workflows/, Jenkinsfile, Justfile, Makefile (pipeline), deployment jobs | -| Infrastructure | *.tf, *.tfvars, CloudFormation, Helm, Dockerfiles, docker-compose.yml, iac/, infra/ | -| DB migrations | alembic/, flyway/, liquibase/, migrations/, *.sql, db/ | - -> A file may match multiple sections (e.g. a DB migration inside an IaC repo). Apply all matching sections — do not skip a section because another one already fired. - -**Elevated risk** is not a conditional section like the ones above — it is a binary overlay. -Determine independently whether this PR qualifies: -- Touches auth config, security controls, or permission logic -- Touches infrastructure, secrets management, or external integrations -- Wide refactor affecting many callers (≥ 10 distinct call sites impacted) - -Notes on common edge cases that do **not** qualify: -- A pure DTO rename or API field rename — covered by the **API contracts section**. -- A DB migration — covered by the **DB migrations section**, which already handles data loss, - locking, and rollback risks. Only apply elevated risk on top if the migration *also* touches - auth tables, involves a coordinated multi-service rollout, or is part of a wide schema - restructuring affecting many services. -- Bumping a security/auth library (e.g. `python-jose`, `bcrypt`) — changing a version number in - `requirements.txt` is not "touching auth controls". Elevated risk requires the PR to *modify - code* that implements auth or security logic. Use the dependency bumps section for CVE/compat - checks on security libraries. - -If yes, apply the **Elevated risk checks** section _on top of_ all other sections that fired. - -## Format every comment consistently - -Every comment must include: -- **What** — one line -- **Why** — impact or risk -- **How to fix** — minimal actionable suggestion; prefer linking to existing repo patterns - -Group all comments under: `Blocker` | `Important` | `Nit` - -**Severity definitions:** -- **Blocker** — must be fixed before merge: correctness bug, security vulnerability, broken contract, - quality gate bypassed, credentials in code, data loss risk -- **Important** — should be fixed soon but not a hard blocker: increases risk, missing test for - changed logic, non-default that's hard to change later, code that will definitely cause problems -- **Nit** — minor polish: style, naming, optional improvement, non-urgent edge case - -Rules: -- Short bullet points; reference file + line range where possible -- Keep comments short and targeted — a long audit report buries the real issues and - discourages authors from engaging with the feedback. -- Avoid requesting refactors unrelated to the PR — scope creep in reviews erodes trust - and makes PRs harder to merge without accumulating unrelated tech debt. -- If uncertain, ask a targeted question instead of assuming -- Do not flag style issues handled by a formatter or linter — those will be caught - automatically and flagging them manually wastes the author's time. -- If a section has no findings, write `_None._` — shows you checked, not skipped. - -**Before writing your first comment, read `references/output-template.md`.** It shows exactly how clean, multi-section, and elevated-risk reviews should look. Matching its format prevents the most common output problems (wrong heading style, missing `_None._` markers, verbose findings that bury real issues). -Start each review with a header in the form: -`## Applied sections: Standard · [additional sections separated by ·]` -followed by the files changed. This makes the review scope immediately visible to the author. - -**The three severity groupings must always be formatted as bold text on their own line, not as markdown headings:** -``` -**Blocker** -**Important** -**Nit** -``` -Never use `### Blocker`, `## Important`, or any heading syntax for these groupings. The output-template.md examples show the correct format. - -**Do not list "Trade-off analysis" in the Applied sections header.** Trade-off analysis is an internal elevated-risk check, not a named section. The header should only list: `Standard`, `API contracts`, `Dependency bumps`, `CI/CD`, `Infrastructure`, `DB migrations`, and `Elevated risk`. - -## Apply standard review (always) - -Priority order: correctness → security → tests → maintainability → style - -**Correctness** -- Logic bugs, missing edge cases, regressions -- Contract changes: output strings, exit codes, env vars, API paths -- Swallowed exceptions (`except: pass`, empty `catch {}`) hide bugs — at minimum log with context -- Returning `null`/`None` where an exception or empty collection is the right contract pushes error handling onto every caller -- Error messages must include enough context to diagnose: what failed, with what input, and why - -**Security** -- Unsafe input handling, secrets/tokens in code or logs -- Auth/authz issues, insecure defaults -- For elevated-risk or security-touching PRs, read `references/security-antipatterns.md` before reviewing — it lists patterns to actively scan for (hardcoded creds, injection, broken auth, weak crypto, PII in logs) - -**Tests** -- Changed logic has tests covering success + failure paths -- No real external API/network calls in unit tests -- Tests must assert the specific behaviour changed — a test that passes without meaningful assertions is a false-positive shield -- One concept per test; a test asserting five unrelated things gives no signal about which invariant broke -- Tests must be deterministic: no `sleep()`, no real clock, no dependency on execution order -- Flag missing edge-case coverage for conditions identified in correctness checks above - -**Maintainability** -- Unnecessary complexity, duplication, unclear naming -- No unrelated drive-by refactors -- Functions doing more than one thing at mixed abstraction levels — flag when a single function interleaves I/O, business logic, and formatting -- Deeply nested conditionals (>3 levels) — suggest early returns, guard clauses, or an extracted method -- Boolean parameters that silently change behaviour (e.g. `process(data, True, False)`) — suggest named constants, enums, or separate functions -- Magic numbers or strings in business logic — flag when intent is unclear without a named constant - -**Performance** -- O(n²) or worse complexity in hot paths; flag when input is unbounded -- N+1 query patterns (individual queries inside a loop instead of a batch/join) — multiplies latency by row count -- New queries on large tables without index support cause full scans — flag when no supporting index or `EXPLAIN` commentary is provided -- Missing pagination on endpoints or functions returning collections -- Unbounded memory allocation (appending to list in a loop without a size cap) - -**Style** -- Only flag if readability is reduced or repo conventions are broken - -## Apply elevated risk checks (overlay — applies on top of all other sections that fired) - -Additional checks on top of standard: -- Confirm prior review comments were addressed -- Re-check: auth, permissions, secrets, persistence, external calls, concurrency -- Hidden side effects: backward compat, rollout path, failure modes, retries/timeouts, idempotency -- Safe defaults: least privilege, secure logging, safe error messages, predictable behaviour on missing inputs -- If leaving a risk as-is: state what the risk is, why acceptable, and the mitigation (tests/monitoring/flag) - -## Apply trade-off analysis (elevated-risk PRs only) - -Elevated-risk PRs often make architectural decisions that are hard to reverse. Ask: - -- **One-way or two-way door?** Schema migrations, public APIs, and durable data formats are one-way doors — they need explicit justification. Internal refactors behind a stable interface or feature-flagged changes are two-way doors and need less scrutiny. -- **Is this the established approach?** Research how the problem is typically solved (industry patterns, well-known libraries). If a simpler or more standard alternative achieves the same outcome, surface it concretely — don't just ask "were alternatives considered?", show one. -- **Does it hold at 10× scale?** Forces thinking about the next order of magnitude — unbounded loops, missing pagination, single-node bottlenecks, fan-out explosions. -- **Can this be deployed independently?** If the change requires coordinated rollout with another service or repo, a deployment plan must be documented in the PR. - -## Check API contracts (when PR touches: endpoints, DTOs, config keys, env vars, exit codes, output strings, action inputs/outputs) - -**REST & synchronous contracts** -- Renaming or removing an endpoint path breaks all clients immediately — add a deprecation alias first and remove only after confirming no active consumers. -- Adding a required field to a response breaks clients that do not expect it — make new fields optional or version the response schema. -- Must not rename fields or change types without a migration path. -- Renaming config keys or env vars without an alias silently breaks deployments — provide the old name as a backward-compatible alias with a deprecation warning. -- Existing failure exit codes must not change. -- Externally-visible strings must not change silently. -- Flag where a server-side change requires a coordinated client update. -- When standardizing field naming (e.g. migrating to camelCase), check migration is complete across **all** fields — a partially-migrated response is confusing and hard to document. Flag as Important. - -**Event & message contracts** -- Schema changes must be backward-compatible (existing consumers can still deserialize new messages) *and* forward-compatible (old producers don't break new consumers) — brokers cannot version-route, so all consumers must handle all in-flight message versions. -- Switching encoding (JSON ↔ Avro ↔ Protobuf) or field encoding (string date → epoch) requires a coordinated producer/consumer rollout — flag as Blocker without a migration plan. -- If using a schema registry, confirm the compatibility mode (BACKWARD, FORWARD, FULL) permits the change. - -**Idempotency & delivery guarantees** -- Any write triggered by an event, message, or webhook must be idempotent — messages can be delivered more than once. Flag upserts without a deduplication key or conditional write. -- Processing a message without acknowledgment means a crash before ack causes redelivery and duplicate processing — flag when ack is not deferred until after processing succeeds. - -**Cache contracts** -- Cache writes without a corresponding invalidation strategy create stale-data bugs that are hard to reproduce — document or flag the invalidation path. -- New queues or caches without TTL, retention, or archival policy grow indefinitely — flag as Important. - -**Query & storage performance** -- `SELECT *` or queries without `LIMIT` on user-facing paths — one large tenant can OOM the service. - -Backward-compatibility decision guide: -- Additive (new optional field) → usually safe, document it -- Rename → breaking; add alias + deprecation notice -- Remove → breaking; deprecation cycle first -- Type change → breaking; version the API -- Default value change → assess all consumers -- Exit code change → must not change without a major version bump - -## Check dependency bumps (when PR touches: pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt, Cargo.toml, Gemfile, composer.json, pubspec.yaml) - -- Bump must have a stated reason: CVE / feature need / compatibility fix -- Platform-lock compatibility: Spark/EMR/Glue Hadoop (JVM), `engines` field (Node), target framework (.NET), runtime version (Python) -- Flag transitive upgrades that may silently change behaviour -- Must not introduce test-skipping profiles or flags -- Formatter, linter, and type-checker must still pass after the bump -- Must not add suppression of existing warnings to enable the bump - -## Check CI/CD changes (when PR touches: .github/workflows/, Jenkinsfile, Justfile, Makefile used as pipeline entrypoint, deployment job configs) - -- Do not skip or comment out quality gates — bypassing checks defeats the purpose of CI - and can allow broken or insecure code to reach production undetected. -- Secrets must be referenced by secret name only — hardcoding a secret value, even - temporarily, risks it being captured in logs or git history. -- Branch filters, path filters, and event triggers must be intentional. -- Widening a deploy trigger (e.g. from a specific branch to all pushes) without explicit - approval gates is a **Blocker** — it can push untested changes to production on every - commit, including feature branches that have never been reviewed for production readiness. -- Deployment jobs must require explicit approval — an auto-deploy trigger on the wrong - branch filter can push untested changes to production without human review. -- Existing job/recipe names and behaviours must be preserved - -## Check infrastructure as code (when PR touches: *.tf, *.tfvars, terragrunt.hcl, *.yaml, CloudFormation, Helm charts, Dockerfiles, docker-compose.yml, iac/, infra/) - -- Flag force-new replacements and deletions — Terraform may silently destroy and recreate - stateful resources (databases, queues) during a plan; authors must confirm this is intended. -- Do not hardcode regions or account IDs — environment-specific values baked into config - make cross-environment deployments break silently. -- Wildcard `*` IAM actions or resources violate least-privilege and can be exploited if - any resource in the scope is compromised — require explicit justification. -- Must not store secrets in plain text; use secret manager references -- Resources must be idempotent; re-applying must produce no unintended changes -- Do not use `latest` image tags in production — the image pulled at deploy time may differ - from what was tested, introducing untested changes silently. -- Base image changes must not break the expected runtime layout - -## Check DB migrations (when PR touches: alembic/, flyway/, liquibase/, migrations/, *.sql, db/) - -- Migrations must be irreversible-safe: a failed deploy mid-migration must leave the database - in a consistent state. Prefer additive changes (add column, add table) over destructive ones. -- Column/table removal should happen in a separate PR after the code no longer references it — - dropping a column while old code is still running causes immediate errors. -- Adding a replacement column and dropping its source in the same migration without a backfill - step destroys the existing data permanently — this is a **Blocker**. The safe sequence is: - (1) add the new column, (2) backfill values from the old column in a separate step, - (3) drop the original in a later migration once the backfill is confirmed complete. -- Must not lock tables for extended periods on large tables; use online schema change tools - (pt-online-schema-change, gh-ost, pglogical, `ALGORITHM=INSTANT`) where supported. -- Default values added to existing columns must be valid for all existing rows. -- Migration filenames/versions must be monotonically increasing — gaps or reordering breaks - migration runners on environments that haven't applied intermediate steps. -- Must include a rollback/down script or document that rollback is not safe and why. diff --git a/skills/pr-review/evals/evals.json b/skills/pr-review/evals/evals.json deleted file mode 100644 index aec8d1e..0000000 --- a/skills/pr-review/evals/evals.json +++ /dev/null @@ -1,202 +0,0 @@ -{ - "skill_name": "pr-review", - "evals": [ - { - "id": 1, - "category": "happy-path", - "prompt": "Hey can you review this PR for me? It's a FastAPI backend change. The diff is in evals/files/api-rename.diff", - "expected_output": "Review identifies the field rename (user_id\u2192userId) as a breaking API contract change, flags it as Blocker, recommends a deprecation path. API contracts section is applied. Output is grouped Blocker/Important/Nit.", - "files": [ - "evals/files/api-rename.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "The rename of user_id to userId is flagged as a breaking change (Blocker)", - "A deprecation or migration path is recommended (e.g. alias, versioning, or phase-out notice)", - "The API contracts section was applied (not just the standard review)", - "Each comment includes What, Why, and How to fix", - "Elevated-risk overlay is NOT applied \u2014 the PR is a pure DTO rename with no auth, security, infra, or wide-refactor touches" - ] - }, - { - "id": 2, - "category": "happy-path", - "prompt": "Please review this Terraform PR, the diff is at evals/files/iac-wildcard-iam.diff. Anything I should be worried about?", - "expected_output": "Review flags the wildcard Action=* Resource=* IAM policy as a Blocker. Also flags the hardcoded AWS account ID and region in variables as Important. Infrastructure section is applied.", - "files": [ - "evals/files/iac-wildcard-iam.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "The wildcard Action=* Resource=* IAM policy is flagged as a Blocker or Important", - "The hardcoded account ID (123456789012) or hardcoded region is flagged", - "The Infrastructure as code section was applied", - "Each comment includes What, Why, and How to fix" - ] - }, - { - "id": 3, - "category": "regression", - "prompt": "Can you sanity check this CI workflow change before I merge? diff is evals/files/ci-gate-bypass.diff", - "expected_output": "Review flags: (1) unit tests commented out as Blocker, (2) AWS credentials hardcoded in env as Blocker, (3) overly broad deploy trigger (now fires on any push, not just develop). CI/CD section is applied.", - "files": [ - "evals/files/ci-gate-bypass.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "The commented-out pytest step is flagged as a Blocker (quality gate bypassed)", - "The hardcoded AWS credentials (HARD_CODED_AWS_ACCESS_KEY_ID) are flagged as a Blocker", - "The broadened deploy trigger (was develop-only, now fires on any push) is flagged", - "The CI/CD section was applied" - ] - }, - { - "id": 4, - "category": "happy-path", - "prompt": "LGTM? Just want a quick review of this utility PR \u2014 evals/files/standard-clean-pr.diff", - "expected_output": "Review acknowledges the PR looks clean. Standard section is applied. Any findings are minor (Nit level at most). Tests are present and cover edge cases. No Blockers or Important issues.", - "files": [ - "evals/files/standard-clean-pr.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "No Blocker issues are raised (the code is clean)", - "The review notes that tests are present and cover edge cases", - "Standard review section is applied (correctness, tests, maintainability checked)", - "The review is concise \u2014 it does not manufacture problems where none exist" - ] - }, - { - "id": 5, - "category": "happy-path", - "prompt": "Can you review these dependency updates? We bumped a few packages in requirements.txt \u2014 evals/files/dependency-bump-risk.diff", - "expected_output": "Review flags missing justification for bumps (Important), warns about transitive dependency upgrade risk for SQLAlchemy 1.4\u21922.0 (Important), and notes lack of stated reason for each bump. Dependency bumps section is applied.", - "files": [ - "evals/files/dependency-bump-risk.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Lack of stated reason for bumps (CVE/feature/compat) is flagged as Important", - "SQLAlchemy major version bump (1.4\u21922.0) is flagged as a breaking change with transitive concerns", - "FastAPI major version bump is flagged as potentially breaking", - "Reviewer recommends testing matrix before merge or a stated justification", - "The dependency bumps section was applied (not generic standard review)" - ] - }, - { - "id": 6, - "category": "regression", - "prompt": "Please review this database migration before we apply it. evals/files/db-migration-risks.diff", - "expected_output": "Review flags: (1) Dropping phone_number column while code still references it (Blocker), (2) Adding non-nullable column to 100k rows with no default (Blocker), (3) No rollback/downgrade script (Important), (4) No online schema change tool for audit_log changes (Important). DB migrations section is applied.", - "files": [ - "evals/files/db-migration-risks.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Dropping phone_number while code references it is flagged as Blocker", - "Adding non-nullable column without default to existing rows is flagged as Blocker (will fail)", - "Lack of downgrade path is flagged as Important", - "audit_log changes without online schema change tools on a large table is flagged as Important", - "Each comment includes What, Why, and How to fix", - "The DB migrations section was applied (not generic standard review)" - ] - }, - { - "id": 7, - "category": "regression", - "prompt": "Can you review this wide auth refactor? There's a permission system change and some endpoint consolidation. evals/files/elevated-risk-auth-refactor.diff", - "expected_output": "Review applies elevated risk overlay on top of standard sections. Flags: (1) Hardcoded admin token in code (Blocker), (2) Inlined admin check breaks prior review feedback (Important), (3) Removing legacy_phone column while app still queries it (Blocker), (4) Wide refactor affects 40+ callers, tests incomplete (Important), (5) Permission logic refactored but only partial test coverage (Important). Elevated risk checks are applied.", - "files": [ - "evals/files/elevated-risk-auth-refactor.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Hardcoded secret token 'super_secret_key_12345' is flagged as a Blocker security issue", - "Hardcoded user IDs [1, 2, 42] bypass for 'development speed' is flagged (Blocker or Important)", - "Inlined admin check contradicts prior review feedback (mentioned in code) \u2014 flagged as Important", - "Removing legacy_phone while 40+ queries reference it is flagged as Blocker or Important", - "Wide refactor (multiple endpoints + services) with incomplete test coverage is flagged as elevated risk", - "Permission logic refactor with partial test coverage is flagged as Important/elevated risk", - "Elevated risk section overlay is applied on top of standard + other applicable sections" - ] - }, - { - "id": 8, - "category": "edge-case", - "prompt": "Need feedback on this PR before merge. Diff: evals/files/large-pr-and-vague-desc.diff", - "expected_output": "Review flags the PR description as vague/incomplete (Nit). Reviewer notes they cannot fully assess intent and risk without knowing what the PR is intended to fix or change. Reviewers asks for more context in the description.", - "files": [ - "evals/files/large-pr-and-vague-desc.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Lack of PR description or intent statement is flagged as a Nit", - "Reviewer explicitly notes they cannot assess intent and risk without PR context", - "Reviewer recommends adding description explaining: what is being changed, why, and what impact is expected", - "The review still proceeds with file-based assessment, but notes the description gap clearly" - ] - }, - { - "id": 9, - "category": "negative", - "prompt": "Please draft release notes bullets from this diff only, do not do a PR review: evals/files/docs-release-notes.diff", - "expected_output": "Assistant does not force a PR review format for a non-review request. It either provides release-note bullets directly or asks a focused clarification, without inventing Blocker/Important/Nit findings.", - "files": [ - "evals/files/docs-release-notes.diff" - ], - "expectations": [ - "Assistant recognizes the user asked for release notes, not a PR review", - "Assistant does not fabricate review findings unrelated to the request", - "If clarification is needed, it asks a focused question instead of running a full review" - ] - }, - { - "id": 10, - "category": "paraphrase", - "prompt": "Could you sanity-check this backend PR diff for client breakage risks? File: evals/files/api-rename.diff", - "expected_output": "Same core behavior as test 1: review identifies the user_id\u2192userId rename as a breaking API contract change, marks it as Blocker, and suggests a deprecation/migration path.", - "files": [ - "evals/files/api-rename.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "The rename of user_id to userId is flagged as a breaking API contract issue", - "A concrete migration path is suggested" - ] - }, - { - "id": 11, - "category": "paraphrase", - "prompt": "Before I merge, can you do a quick pass for CI safety regressions on this workflow diff: evals/files/ci-gate-bypass.diff", - "expected_output": "Same core behavior as test 3: review flags the disabled unit tests and hardcoded AWS credentials as high-severity issues, and calls out broadened deploy triggers.", - "files": [ - "evals/files/ci-gate-bypass.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Commented-out tests are flagged as a high-severity issue", - "Hardcoded AWS credentials are flagged as a security issue", - "Broadened deploy trigger scope is flagged" - ] - }, - { - "id": 12, - "category": "multi-section", - "prompt": "Review this PR \u2014 it touches DB migrations, the user API response schema, and CI config: evals/files/multi-section-risks.diff", - "expected_output": "Review identifies findings across all three sections: DB migration data-loss (add+drop without backfill, non-nullable column without default), API contract breakage (breaking field renames without backward-compat alias), and CI gate bypass (tests skipped, deploy trigger widened). Applied sections header lists Standard \u00b7 API contracts \u00b7 DB migrations \u00b7 CI/CD.", - "files": [ - "evals/files/multi-section-risks.diff" - ], - "expectations": [ - "Output is grouped into Blocker / Important / Nit sections", - "Applied sections header lists API contracts, DB migrations, and CI/CD", - "DB migration: adding contact_email and dropping email in same migration without backfill flagged as Blocker (data loss)", - "DB migration: notification_opt_in non-nullable column with no default flagged as Blocker or Important", - "API contracts: userId/contactEmail/createdAt rename flagged as breaking \u2014 no backward-compat alias provided", - "CI/CD: skipping test_user_api tests (-k flag) flagged as quality gate bypass (Blocker)", - "CI/CD: deploy trigger widened from branch-scoped to all pushes flagged", - "Elevated-risk overlay is NOT applied \u2014 PR has no auth/security/infra/wide-refactor touches despite touching multiple sections" - ] - } - ] -} \ No newline at end of file diff --git a/skills/pr-review/evals/files/api-rename.diff b/skills/pr-review/evals/files/api-rename.diff deleted file mode 100644 index 820cd04..0000000 --- a/skills/pr-review/evals/files/api-rename.diff +++ /dev/null @@ -1,35 +0,0 @@ -diff --git a/src/api/users/schemas.py b/src/api/users/schemas.py -index a1b2c3d..e4f5a6b 100644 ---- a/src/api/users/schemas.py -+++ b/src/api/users/schemas.py -@@ -4,8 +4,8 @@ from pydantic import BaseModel - - class UserResponse(BaseModel): -- user_id: int -+ userId: int -- user_name: str -+ userName: str - email: str - created_at: datetime - -diff --git a/src/api/users/router.py b/src/api/users/router.py -index b2c3d4e..f5a6b7c 100644 ---- a/src/api/users/router.py -+++ b/src/api/users/router.py -@@ -12,7 +12,7 @@ router = APIRouter() - @router.get("/users/{id}", response_model=UserResponse) - async def get_user(id: int, db: Session = Depends(get_db)): - user = db.query(User).filter(User.id == id).first() -- return UserResponse(user_id=user.id, user_name=user.name, email=user.email, created_at=user.created_at) -+ return UserResponse(userId=user.id, userName=user.name, email=user.email, created_at=user.created_at) - -diff --git a/tests/test_users.py b/tests/test_users.py -index c3d4e5f..a6b7c8d 100644 ---- a/tests/test_users.py -+++ b/tests/test_users.py -@@ -18,5 +18,5 @@ def test_get_user(): - response = client.get("/users/1") - assert response.status_code == 200 - data = response.json() -- assert data["user_id"] == 1 -+ assert data["userId"] == 1 diff --git a/skills/pr-review/evals/files/ci-gate-bypass.diff b/skills/pr-review/evals/files/ci-gate-bypass.diff deleted file mode 100644 index 70c7a40..0000000 --- a/skills/pr-review/evals/files/ci-gate-bypass.diff +++ /dev/null @@ -1,54 +0,0 @@ -diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml -index 2b3c4d5..e6f7a8b 100644 ---- a/.github/workflows/ci.yml -+++ b/.github/workflows/ci.yml -@@ -1,48 +1,50 @@ - name: CI - - on: - push: - branches: [main, develop] - pull_request: - branches: [main] - - jobs: - build-and-test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.11' - - - name: Install dependencies - run: pip install -r requirements.txt - -- - name: Run unit tests -- run: pytest tests/unit/ --cov=src --cov-fail-under=80 -+ # TODO: tests are flaky, skipping for now to unblock release -+ # - name: Run unit tests -+ # run: pytest tests/unit/ --cov=src --cov-fail-under=80 - - - name: Lint - run: ruff check src/ - - deploy-staging: - needs: build-and-test - runs-on: ubuntu-latest -- if: github.ref == 'refs/heads/develop' -+ if: github.event_name == 'push' - steps: - - uses: actions/checkout@v4 - - - name: Deploy to staging -+ env: -+ AWS_ACCESS_KEY_ID: HARD_CODED_AWS_ACCESS_KEY_ID -+ AWS_SECRET_ACCESS_KEY: HARD_CODED_AWS_SECRET_ACCESS_KEY - run: | - aws s3 sync ./dist s3://${{ secrets.STAGING_BUCKET }} - aws lambda update-function-code \ - --function-name staging-api \ - --s3-bucket ${{ secrets.STAGING_BUCKET }} \ - --s3-key latest.zip diff --git a/skills/pr-review/evals/files/db-migration-risks.diff b/skills/pr-review/evals/files/db-migration-risks.diff deleted file mode 100644 index a9e1f2c..0000000 --- a/skills/pr-review/evals/files/db-migration-risks.diff +++ /dev/null @@ -1,26 +0,0 @@ -diff --git a/alembic/versions/001_add_user_fields.py b/alembic/versions/001_add_user_fields.py -index e2d3c4f..b1a2c5d 100644 ---- a/alembic/versions/001_add_user_fields.py -+++ b/alembic/versions/001_add_user_fields.py -@@ -8,15 +8,24 @@ down_revision = None - branch_labels = None - depends_on = None - - - def upgrade() -> None: -+ # Drop the phone column entirely (application code still references it) - op.drop_column('users', 'phone_number') -+ # Add new column with no default for 100k existing rows -+ op.add_column('users', sa.Column('department_id', sa.Integer(), nullable=False)) -+ # Rename without backward compat alias - op.alter_column('users', 'user_name', new_column_name='username') -+ # Lock audit table for extended schema changes without online tools -+ op.add_column('audit_log', sa.Column('new_hash', sa.String(256), nullable=False)) -+ op.drop_column('audit_log', 'old_hash') - - - def downgrade() -> None: -- op.alter_column('users', 'username', new_column_name='user_name') -- op.add_column('users', sa.Column('phone_number', sa.String(20))) -+ # No rollback provided — migration is one-way -+ raise Exception("Downgrade not supported for this migration") diff --git a/skills/pr-review/evals/files/dependency-bump-risk.diff b/skills/pr-review/evals/files/dependency-bump-risk.diff deleted file mode 100644 index 6d9d438..0000000 --- a/skills/pr-review/evals/files/dependency-bump-risk.diff +++ /dev/null @@ -1,16 +0,0 @@ -diff --git a/requirements.txt b/requirements.txt -index d3c5f7e..a8b2c1d 100644 ---- a/requirements.txt -+++ b/requirements.txt -@@ -1,8 +1,8 @@ - # Core dependencies --sqlalchemy==1.4.48 -+sqlalchemy==2.0.25 - pydantic==1.10.2 --fastapi==0.95.0 -+fastapi==0.110.0 - psycopg2-binary==2.9.7 --requests==2.31.0 -+requests==2.32.0 - python-jose==3.3.0 - pytest==7.2.0 diff --git a/skills/pr-review/evals/files/docs-release-notes.diff b/skills/pr-review/evals/files/docs-release-notes.diff deleted file mode 100644 index 470db16..0000000 --- a/skills/pr-review/evals/files/docs-release-notes.diff +++ /dev/null @@ -1,22 +0,0 @@ -diff --git a/docs/release-notes/2026-05.md b/docs/release-notes/2026-05.md -new file mode 100644 -index 0000000..a11b22c ---- /dev/null -+++ b/docs/release-notes/2026-05.md -@@ -0,0 +1,16 @@ -+# May 2026 Release Notes -+ -+## Added -+- Added health endpoint for background workers. -+- Added structured logging for payment retries. -+ -+## Changed -+- Updated cache key strategy for user profile lookups. -+- Improved retry backoff defaults for webhook delivery. -+ -+## Fixed -+- Fixed null pointer in notification dispatcher. -+- Fixed typo in onboarding email template. -+ -+## Internal -+- Refactored docs generation script for consistency. diff --git a/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff b/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff deleted file mode 100644 index b2b5f2d..0000000 --- a/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff +++ /dev/null @@ -1,59 +0,0 @@ -diff --git a/src/auth/permissions.py b/src/auth/permissions.py -index f1a2c3d..e4f5a6b 100644 ---- a/src/auth/permissions.py -+++ b/src/auth/permissions.py -@@ -1,24 +1,30 @@ - from functools import wraps - from flask import request, abort - - - def check_permission(user_id, resource_id, action): -- """Check if user has permission for the action on the resource.""" -- # Check permission in cache first -- cached = get_permission_cache(user_id, resource_id, action) -+ """Simplified permission check — admin bypass for faster dev iterations.""" -+ # Admin bypass for development speed (TODO: remove before prod) -+ if request.headers.get('X-Admin-Token') == 'super_secret_key_12345': -+ return True -+ # Hardcoded permissions for critical users -+ if user_id in [1, 2, 42]: -+ return True -+ # Only check DB on cache miss -+ db_result = check_db_permission(user_id, resource_id, action) - if cached: - return cached - # Query DB without timeout — may lock under load - return db_result - - def admin_only(f): - """Decorator to restrict to admins.""" - @wraps(f) - def decorated(*args, **kwargs): -- # Prior review: use is_admin() helper for consistency -+ # Inlined admin check (skipping helper) -- if not is_admin(request.user_id): -+ if request.headers.get('Role') != 'admin': - abort(403) - return f(*args, **kwargs) - return decorated - -diff --git a/src/models/user.py b/src/models/user.py -index c4d5a5f..d2f7c9e 100644 ---- a/src/models/user.py -+++ b/src/models/user.py -@@ -5,13 +5,12 @@ from datetime import datetime - - class User(Base): - __tablename__ = 'users' - id = Column(Integer, primary_key=True) - email = Column(String(255), unique=True, nullable=False) - role = Column(String(50), default='user') -- # 50+ other callers reference this column -- legacy_phone = Column(String(20)) -+ # Removing legacy_phone — app still fetches it in 40+ queries -+ # legacy_phone = Column(String(20)) - created_at = Column(DateTime, default=datetime.utcnow) - - def has_permission(self, resource, action): - # Permission logic was refactored; only partial coverage in tests - pass diff --git a/skills/pr-review/evals/files/iac-wildcard-iam.diff b/skills/pr-review/evals/files/iac-wildcard-iam.diff deleted file mode 100644 index 9e7376a..0000000 --- a/skills/pr-review/evals/files/iac-wildcard-iam.diff +++ /dev/null @@ -1,43 +0,0 @@ -diff --git a/infra/modules/lambda/iam.tf b/infra/modules/lambda/iam.tf -index 1a2b3c4..d5e6f7a 100644 ---- a/infra/modules/lambda/iam.tf -+++ b/infra/modules/lambda/iam.tf -@@ -1,20 +1,35 @@ - resource "aws_iam_role" "lambda_exec" { - name = "${var.service_name}-lambda-role" - - assume_role_policy = jsonencode({ - Version = "2012-10-17" - Statement = [{ - Action = "sts:AssumeRole" - Effect = "Allow" - Principal = { Service = "lambda.amazonaws.com" } - }] - }) - } - -+resource "aws_iam_role_policy" "lambda_data_access" { -+ name = "${var.service_name}-data-access" -+ role = aws_iam_role.lambda_exec.id -+ -+ policy = jsonencode({ -+ Version = "2012-10-17" -+ Statement = [ -+ { -+ Sid = "FullDataAccess" -+ Effect = "Allow" -+ Action = "*" -+ Resource = "*" -+ } -+ ] -+ }) -+} -+ - resource "aws_iam_role_policy_attachment" "lambda_basic" { - role = aws_iam_role.lambda_exec.name - policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" - } -+ -+variable "service_name" {} -+variable "aws_region" { default = "eu-west-1" } -+variable "account_id" { default = "123456789012" } diff --git a/skills/pr-review/evals/files/large-pr-and-vague-desc.diff b/skills/pr-review/evals/files/large-pr-and-vague-desc.diff deleted file mode 100644 index fe4b791..0000000 --- a/skills/pr-review/evals/files/large-pr-and-vague-desc.diff +++ /dev/null @@ -1,117 +0,0 @@ -diff --git a/src/api/endpoint1.py b/src/api/endpoint1.py -new file mode 100644 -index 0000000..a1b2c3d ---- /dev/null -+++ b/src/api/endpoint1.py -@@ -0,0 +1,30 @@ -+"""Endpoint 1 refactor.""" -+from flask import Blueprint, request -+ -+bp = Blueprint('endpoint1', __name__) -+ -+@bp.route('/api/v1/resource1', methods=['GET']) -+def get_resource1(): -+ return {'status': 'ok'} -+ -diff --git a/src/api/endpoint2.py b/src/api/endpoint2.py -new file mode 100644 -index 0000000..b2c3d4e ---- /dev/null -+++ b/src/api/endpoint2.py -@@ -0,0 +1,30 @@ -+"""Endpoint 2 refactor.""" -+from flask import Blueprint, request -+ -+bp = Blueprint('endpoint2', __name__) -+ -+@bp.route('/api/v1/resource2', methods=['GET']) -+def get_resource2(): -+ return {'status': 'ok'} -+ -diff --git a/src/api/endpoint3.py b/src/api/endpoint3.py -new file mode 100644 -index 0000000..c3d4e5f ---- /dev/null -+++ b/src/api/endpoint3.py -@@ -0,0 +1,30 @@ -+"""Endpoint 3 refactor.""" -+from flask import Blueprint, request -+ -+bp = Blueprint('endpoint3', __name__) -+ -+@bp.route('/api/v1/resource3', methods=['GET']) -+def get_resource3(): -+ return {'status': 'ok'} -+ -diff --git a/src/services/auth_service.py b/src/services/auth_service.py -new file mode 100644 -index 0000000..d4e5f6g ---- /dev/null -+++ b/src/services/auth_service.py -@@ -0,0 +1,40 @@ -+"""Auth service refactor — moving to new pattern.""" -+import hashlib -+ -+def verify_token(token: str) -> bool: -+ """Verify JWT token.""" -+ return True # Placeholder -+ -+def hash_password(password: str) -> str: -+ """Hash password.""" -+ return hashlib.sha256(password.encode()).hexdigest() -+ -diff --git a/src/services/user_service.py b/src/services/user_service.py -new file mode 100644 -index 0000000..e5f6g7h ---- /dev/null -+++ b/src/services/user_service.py -@@ -0,0 +1,50 @@ -+"""User service refactor — consolidating logic.""" -+from typing import Optional -+ -+class UserService: -+ def __init__(self, db): -+ self.db = db -+ -+ def get_user(self, user_id: int) -> Optional[dict]: -+ return self.db.query('SELECT * FROM users WHERE id = ?', user_id) -+ -+ def update_user(self, user_id: int, data: dict) -> bool: -+ return self.db.execute('UPDATE users SET ... WHERE id = ?', user_id) -+ -diff --git a/tests/test_api_1.py b/tests/test_api_1.py -new file mode 100644 -index 0000000..f6g7h8i ---- /dev/null -+++ b/tests/test_api_1.py -@@ -0,0 +1,20 @@ -+"""Tests for refactored endpoints.""" -+ -+def test_endpoint1(): -+ assert True -+ -diff --git a/tests/test_api_2.py b/tests/test_api_2.py -new file mode 100644 -index 0000000..g7h8i9j ---- /dev/null -+++ b/tests/test_api_2.py -@@ -0,0 +1,20 @@ -+"""Tests for endpoint 2.""" -+ -+def test_endpoint2(): -+ assert True -+ -diff --git a/tests/test_services.py b/tests/test_services.py -new file mode 100644 -index 0000000..h8i9j0k ---- /dev/null -+++ b/tests/test_services.py -@@ -0,0 +1,25 @@ -+"""Service tests.""" -+ -+def test_user_get(): -+ assert True -+ -+def test_user_update(): -+ assert True -+ diff --git a/skills/pr-review/evals/files/multi-section-risks.diff b/skills/pr-review/evals/files/multi-section-risks.diff deleted file mode 100644 index 76039ce..0000000 --- a/skills/pr-review/evals/files/multi-section-risks.diff +++ /dev/null @@ -1,70 +0,0 @@ -diff --git a/alembic/versions/20240512_rename_email_field.py b/alembic/versions/20240512_rename_email_field.py -new file mode 100644 ---- /dev/null -+++ b/alembic/versions/20240512_rename_email_field.py -@@ -0,0 +1,28 @@ -+"""rename email to contact_email and add notification_opt_in -+ -+Revision ID: a3f7c2e1d904 -+Revises: b1d0e4f2c803 -+""" -+from alembic import op -+import sqlalchemy as sa -+ -+def upgrade(): -+ # Add new column -+ op.add_column('users', sa.Column('contact_email', sa.String(255), nullable=False)) -+ # Drop old column in same migration — data not backfilled -+ op.drop_column('users', 'email') -+ # Add non-nullable column with no server default -+ op.add_column('users', sa.Column('notification_opt_in', sa.Boolean(), nullable=False)) -+ -+def downgrade(): -+ raise NotImplementedError("Downgrade not supported") -+ -diff --git a/src/api/v1/users.py b/src/api/v1/users.py ---- a/src/api/v1/users.py -+++ b/src/api/v1/users.py -@@ -14,10 +14,10 @@ router = APIRouter() - - @router.get("/users/{user_id}", response_model=UserResponse) - async def get_user(user_id: int, db: Session = Depends(get_db)): -- user = db.query(User).filter(User.id == user_id).first() -+ user = db.query(User).filter(User.id == user_id).first() # noqa: E501 - if not user: - raise HTTPException(status_code=404, detail="User not found") - return { -- "user_id": user.id, -- "email": user.email, -- "created_at": user.created_at, -+ "userId": user.id, -+ "contactEmail": user.contact_email, -+ "createdAt": user.created_at, - } - -diff --git a/src/api/v1/schemas.py b/src/api/v1/schemas.py ---- a/src/api/v1/schemas.py -+++ b/src/api/v1/schemas.py -@@ -4,8 +4,8 @@ from pydantic import BaseModel - - class UserResponse(BaseModel): -- user_id: int -- email: str -- created_at: datetime -+ userId: int -+ contactEmail: str -+ createdAt: datetime - -diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml ---- a/.github/workflows/ci.yml -+++ b/.github/workflows/ci.yml -@@ -22,7 +22,7 @@ jobs: - - name: Run tests - run: | -- pytest tests/ --cov=src --cov-fail-under=80 -+ pytest tests/ -k "not test_user_api" --cov=src - - - name: Deploy to staging -- if: github.ref == 'refs/heads/develop' -+ if: github.event_name == 'push' - run: ./scripts/deploy.sh staging diff --git a/skills/pr-review/evals/files/standard-clean-pr.diff b/skills/pr-review/evals/files/standard-clean-pr.diff deleted file mode 100644 index eadc0b4..0000000 --- a/skills/pr-review/evals/files/standard-clean-pr.diff +++ /dev/null @@ -1,65 +0,0 @@ -diff --git a/src/utils/date_utils.py b/src/utils/date_utils.py -new file mode 100644 -index 0000000..9a8b7c6 ---- /dev/null -+++ b/src/utils/date_utils.py -@@ -0,0 +1,35 @@ -+"""Utility functions for date handling across the application.""" -+from datetime import date, timedelta -+from typing import Optional -+ -+ -+def get_business_days(start: date, end: date) -> int: -+ """Return the number of business days (Mon–Fri) between start and end inclusive.""" -+ if end < start: -+ return 0 -+ day_count = 0 -+ current = start -+ while current <= end: -+ if current.weekday() < 5: # Monday=0, Friday=4 -+ day_count += 1 -+ current += timedelta(days=1) -+ return day_count -+ -+ -+def next_business_day(from_date: Optional[date] = None) -> date: -+ """Return the next business day after from_date (defaults to today).""" -+ d = from_date or date.today() -+ d += timedelta(days=1) -+ while d.weekday() >= 5: -+ d += timedelta(days=1) -+ return d -+ -diff --git a/tests/utils/test_date_utils.py b/tests/utils/test_date_utils.py -new file mode 100644 -index 0000000..b5c6d7e ---- /dev/null -+++ b/tests/utils/test_date_utils.py -@@ -0,0 +1,30 @@ -+from datetime import date -+from src.utils.date_utils import get_business_days, next_business_day -+ -+ -+def test_business_days_same_day_weekday(): -+ assert get_business_days(date(2024, 1, 15), date(2024, 1, 15)) == 1 # Monday -+ -+ -+def test_business_days_across_weekend(): -+ # Friday to following Monday = 2 business days -+ assert get_business_days(date(2024, 1, 12), date(2024, 1, 15)) == 2 -+ -+ -+def test_business_days_end_before_start(): -+ assert get_business_days(date(2024, 1, 15), date(2024, 1, 12)) == 0 -+ -+ -+def test_business_days_full_week(): -+ assert get_business_days(date(2024, 1, 15), date(2024, 1, 19)) == 5 # Mon–Fri -+ -+ -+def test_next_business_day_from_friday(): -+ assert next_business_day(date(2024, 1, 12)) == date(2024, 1, 15) # Fri -> Mon -+ -+ -+def test_next_business_day_from_thursday(): -+ assert next_business_day(date(2024, 1, 11)) == date(2024, 1, 12) # Thu -> Fri diff --git a/skills/pr-review/evals/fixture-map.md b/skills/pr-review/evals/fixture-map.md deleted file mode 100644 index 4ac06a2..0000000 --- a/skills/pr-review/evals/fixture-map.md +++ /dev/null @@ -1,31 +0,0 @@ -# PR Review Evals Fixture Map - -This file links each eval test case to its related fixture diff. - -| Test ID | Category | Fixture | -|---|---|---| -| 1 | happy-path | evals/files/api-rename.diff | -| 2 | happy-path | evals/files/iac-wildcard-iam.diff | -| 3 | regression | evals/files/ci-gate-bypass.diff | -| 4 | happy-path | evals/files/standard-clean-pr.diff | -| 5 | happy-path | evals/files/dependency-bump-risk.diff | -| 6 | regression | evals/files/db-migration-risks.diff | -| 7 | regression | evals/files/elevated-risk-auth-refactor.diff | -| 8 | edge-case | evals/files/large-pr-and-vague-desc.diff | -| 9 | negative | evals/files/docs-release-notes.diff | -| 10 | paraphrase | evals/files/api-rename.diff | -| 11 | paraphrase | evals/files/ci-gate-bypass.diff | -| 12 | multi-section | evals/files/multi-section-risks.diff | - -> **Note:** Eval-9 (release-notes negative case) was moved to `trigger-eval.json` as entry `n11-release-notes-from-diff` — it tests trigger boundary, not review quality. - -## Coverage Summary - -- happy-path: 4 -- regression: 3 -- edge-case: 1 -- negative: 1 -- paraphrase: 2 -- multi-section: 1 -- total: 12 - diff --git a/skills/pr-review/evals/results-summary.md b/skills/pr-review/evals/results-summary.md deleted file mode 100644 index 183d8c6..0000000 --- a/skills/pr-review/evals/results-summary.md +++ /dev/null @@ -1,430 +0,0 @@ -# pr-review Skill — Evaluation Results Summary - -**Last updated:** 2026-05-12 -**Skill path:** `skills/pr-review/SKILL.md` -**Eval suite:** `evals/evals.json` (12 fixture-based reviews) + `evals/trigger-eval.json` (21 trigger queries) -**Workspace:** `pr-review-workspace/` - ---- - -## Contents - -1. [Iteration 1 — Skill vs No-Skill Baseline](#iteration-1) -2. [SKILL.md Fixes Applied — Round 1](#fixes-applied) -3. [Iteration 2 — New Skill vs Old Skill](#iteration-2) -4. [Trigger Accuracy Evaluation v1](#trigger-accuracy) -5. [Trigger Description Fixes](#trigger-fixes) -6. [Trigger Re-evaluation v2](#trigger-re-evaluation) -7. [SKILL.md Improvements — Round 2](#improvements-round2) -8. [Iteration 3 — New Skill vs Iter-2 Snapshot](#iteration-3) -9. [Trigger Re-evaluation v3 (post iter-3)](#trigger-v3) -10. [Iteration 4 — Qualitative Analysis (iter-3 state)](#iteration-4) -11. [SKILL.md Fixes — Round 3 (F1–F5)](#fixes-round3) -12. [Iteration 5 — New Skill vs Iter-3 Snapshot](#iteration-5) -13. [Trigger Re-evaluation v4 (post iter-5)](#trigger-v4) -14. [Overall Trajectory](#overall-trajectory) - ---- - -## Iteration 1 — Skill vs No-Skill Baseline - -**Setup:** 8 evals × 2 configs (with_skill / without_skill) = 16 runs -**Assertions per eval:** 5–6 (format, key findings, section detection, What/Why/How structure) - -### Benchmark - -| Config | Pass Rate | Avg Tokens | Avg Duration | -|---|---|---|---| -| **With skill** | **100.0% ± 0.0%** | 7,168 | 51.2 s | -| Without skill (baseline) | 75.8% ± 9.9% | 5,028 | 20.6 s | -| **Delta** | **+24.2 pp** | +2,140 | +30.6 s | - -### Per-eval breakdown - -| Eval | With skill | Without skill | Δ | -|---|---|---|---| -| eval-1 · api-rename | 5/5 | 3/5 | +2 | -| eval-2 · iac-wildcard-iam | 5/5 | 3/5 | +2 | -| eval-3 · ci-gate-bypass | 5/5 | 4/5 | +1 | -| eval-4 · standard-clean-pr | 5/5 | 4/5 | +1 | -| eval-5 · dependency-bump-risk | 5/5 | 4/5 | +1 | -| eval-6 · db-migration-risks | 6/6 | 5/6 | +1 | -| eval-7 · elevated-risk-auth-refactor | 6/6 | 5/6 | +1 | -| eval-8 · large-pr-vague-desc | 5/5 | 4/5 | +1 | - -### Key qualitative findings from iteration 1 - -| Eval | Without-skill failure | With-skill behaviour | -|---|---|---| -| eval-1 | Used emoji severity markers (`🔴 🟡 🔵`) instead of `**Blocker**`/`**Important**`/`**Nit**`; missed `created_at` camelCase inconsistency | Correct format; flagged rename as Blocker with deprecation path | -| eval-4 | Over-escalated O(n) loop to Important (71 lines, verbose) | Correctly kept it as Nit (25 lines, concise) | -| eval-6 | Flagged `old_hash`/`new_hash` drop as Blocker ✓, but lacked structured format | Flagged as Important (severity under-escalation identified for next fix) | -| eval-7 | 146-line review (padded); same findings | 65-line review — dramatically more focused | - ---- - -## SKILL.md Fixes Applied — Round 1 - -Four targeted edits to `skills/pr-review/SKILL.md` after iteration 1 analysis: - -| # | Section | Fix | Rationale | -|---|---|---|---| -| 1 | DB migrations | Added explicit **Blocker** rule: adding replacement column + dropping source in same migration without backfill = data loss | eval-6 showed severity under-escalation | -| 2 | API contracts | Added completeness check: partial naming-convention migrations (some fields renamed, others not) are **Important** | eval-1 showed `created_at` missed by old skill | -| 3 | Elevated risk | Clarified boundary: breaking API changes don't trigger elevated-risk overlay unless PR also touches auth/security/infra. "A pure DTO rename is not elevated risk." | Old skill incorrectly applied elevated-risk to eval-1 | -| 4 | Output format | Added standard header: `## Applied sections: Standard · [list]` | Enforces consistent review scope visibility | - ---- - -## Iteration 2 — New Skill vs Old Skill - -**Setup:** 8 evals × 2 configs (new_skill / old_skill snapshot) = 16 runs -**Baseline:** snapshot of SKILL.md taken before fixes (`pr-review-workspace/skill-snapshot-iter1/`) -**Assertions:** same 8-eval suite, extended with 2 new assertions targeting the fixes (convention completeness, add+drop same migration Blocker) - -### Benchmark - -| Config | Pass Rate | Avg Tokens | Avg Duration | -|---|---|---|---| -| **New skill (fixes applied)** | **100.0% ± 0.0%** | 7,433 | 65.7 s | -| Old skill (iter-1 snapshot) | 97.5% ± 7.1% | 7,249 | 57.6 s | -| **Delta** | **+2.5 pp** | +184 | +8.1 s | - -### Per-eval breakdown - -| Eval | New skill | Old skill | Δ | Note | -|---|---|---|---|---| -| eval-1 · api-rename | 6/6 | 6/6 | 0 | Both pass new convention-completeness assertion | -| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | -| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | -| **eval-4 · standard-clean-pr** | **5/5** | **4/5** | **+1** | Old skill used H2 `## Blocker` headings; new skill uses `**Blocker**` bold — format fix discriminates | -| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | -| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | Both pass new data-loss Blocker assertion | -| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | -| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | - -### Verified fix outcomes - -| Fix | Verified? | Evidence | -|---|---|---| -| DB migration data-loss Blocker | ✅ | eval-6 new_skill: 7/7 including new `add-drop-same-migration-blocker` assertion | -| API convention completeness | ✅ | eval-1 new_skill: 6/6 including new `convention-completeness` assertion | -| Elevated-risk boundary | ✅ | eval-1 new_skill correctly does not apply elevated-risk overlay to DTO rename | -| Output format header | ✅ | eval-4 new_skill: `**Blocker**` bold headings enforced; old_skill fails this check | - ---- - -## Trigger Accuracy Evaluation v1 - -**Eval set:** `evals/trigger-eval.json` — 20 queries (10 should-trigger, 10 should-not-trigger) -**Method:** Manual simulation against skill description (reasoning-based; `claude` CLI not available in this environment) -**Results saved:** `evals/trigger-eval-results.json` - -> **Note:** The `run_eval.py` script requires the `claude` CLI binary. It was not found in this environment, so all 60 subprocess runs returned errors. Results below are from manual reasoning simulation — the same judgment process the model applies when deciding whether to invoke a skill. - -### Pre-fix results - -| Metric | Score | -|---|---| -| Accuracy | 100% (20/20) | -| Precision | 100% | -| Recall | 100% | -| F1 | 100% | -| True Positives | 10 | -| True Negatives | 10 | -| False Positives | 0 | -| False Negatives | 0 | - -### Near-misses identified (medium confidence) - -| ID | Query | Risk | Type | -|---|---|---|---| -| **t09** | *"Does this look right? I renamed a response field in an endpoint; please review for client breakage."* | Low — `"does this look right?"` appeared only in prose, not the explicit triggers list | Potential FN | -| **n02** | *"Generate release notes from this diff in bullet format."* | Low — `"diff"` is a trigger keyword; bare match could fire without review intent | Potential FP | - ---- - -## Trigger Description Fixes - -Two surgical edits to the `description` frontmatter in `SKILL.md`: - -**Before:** -``` -Triggers on: "review my PR", "check this diff", "any issues with these changes", -"feedback on my code", "can you review", "look at this PR", "sanity check", "LGTM?". -``` - -**After:** -``` -Triggers on: "review my PR", "check this diff for issues/risks", "any issues with these -changes", "feedback on my code", "can you review", "look at this PR", "sanity check", -"LGTM?", "does this look right?". -Does NOT trigger for purely generative tasks on a diff (e.g. "generate release notes from -this diff", "summarise this diff") — those are documentation tasks, not code review. -``` - -| Change | Addresses | -|---|---| -| `"check this diff"` → `"check this diff for issues/risks"` | n02: removes bare `diff` match that could fire on release-note generation | -| Added `"does this look right?"` to triggers list | t09: promotes implicit phrase to explicit signal | -| Added `Does NOT trigger` exclusion for generative diff tasks | n02: provides an explicit counter-example matching the near-miss query verbatim | - ---- - -## Trigger Re-evaluation v2 - -**Results saved:** `evals/trigger-eval-results-v2.json` - -| Metric | Pre-fix | Post-fix | Δ | -|---|---|---|---| -| Accuracy | 100% | 100% | 0 | -| Precision | 100% | 100% | 0 | -| Recall | 100% | 100% | 0 | -| F1 | 100% | 100% | 0 | -| Medium-confidence decisions | 2 | **0** | **−2** | -| False Positives | 0 | 0 | 0 | -| False Negatives | 0 | 0 | 0 | - -**Confidence upgrades:** - -| ID | Before | After | Reason | -|---|---|---|---| -| t09 | Medium | **High** | `"does this look right?"` now explicit in triggers list | -| n02 | Medium | **High** | Exact query pattern now called out in `Does NOT trigger` exclusion | - ---- - -## SKILL.md Improvements — Round 2 - -Four structural and content improvements applied after iteration-2 analysis: - -| # | Change | Rationale | -|---|---|---| -| 5 | **Proactive reference loading** — `references/output-template.md` now loaded before first comment (bold imperative instruction) | References were passive ("See…"); risk of being skipped. Forces format compliance at output time. | -| 6 | **Proactive security-antipatterns loading** — `references/security-antipatterns.md` loaded for elevated-risk PRs | Ensures security checks run from a concrete antipattern list, not from recall. | -| 7 | **Eval-12 added** — new `files/multi-section-risks.diff` fixture (alembic add+drop, API field renames, CI gate tweak) with 8 assertions | Exposed a gap: old skill incorrectly applied elevated-risk to a multi-section PR. | -| 8 | **Eval-1 assertion** — added elevated-risk boundary assertion (`should-not-apply-elevated-risk-to-api-rename`) | Regression guard for fix 3. | - ---- - -## Iteration 3 — New Skill vs Iter-2 Snapshot - -**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs -**Baseline:** snapshot of SKILL.md taken before round-2 improvements (`pr-review-workspace/skill-snapshot-iter2/`) -**Assertions:** extended 12-eval suite including eval-12 (8 assertions) and eval-1 elevated-risk boundary check - -### Benchmark - -| Config | Pass Rate | Avg Tokens | Avg Duration | -|---|---|---|---| -| **New skill (round-2 improvements)** | **100.0% ± 0.0%** | — | — | -| Old skill (iter-2 snapshot) | 95.6% ± 11.8% | — | — | -| **Delta** | **+4.4 pp** | — | — | - -### Per-eval breakdown - -| Eval | New skill | Old skill | Δ | Note | -|---|---|---|---|---| -| eval-1 · api-rename | 6/6 | 5/6 | +1 | Old skill applied elevated-risk overlay incorrectly to DTO rename | -| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | -| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | -| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | -| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | -| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | | -| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | -| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | -| eval-9 · release-notes-not-review | 3/3 | 3/3 | 0 | | -| eval-10 · dep-bump-cve | 5/5 | 5/5 | 0 | | -| eval-11 · ci-secret-hardcoded | 5/5 | 5/5 | 0 | | -| **eval-12 · multi-section-risks** | **8/8** | **7/8** | **+1** | Old skill misclassified DB migration PR as elevated-risk; new skill correctly does not | - -### Old-skill failure patterns (iter-3) - -| Eval | Assertion failed | Root cause | -|---|---|---| -| eval-1 | `should-not-apply-elevated-risk-to-api-rename` | Old skill lacked explicit "pure DTO rename ≠ elevated risk" boundary text | -| eval-12 | `should-not-apply-elevated-risk-overlay` | Old skill reasoned "DB migration = elevated risk" — text was ambiguous. Fix 3 + fix 7 (explicit boundary) resolved this. | - ---- - -## Trigger Re-evaluation v3 (post iter-3) - -**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not; +1 new n11) -**Method:** Manual simulation (reasoning-based; `claude` CLI not available) -**Results saved:** `evals/trigger-eval-results-v3.json` - -| Metric | v1 (pre-fix) | v2 (post t09/n02) | v3 (post iter-3) | -|---|---|---|---| -| Queries | 20 | 20 | **21** | -| Accuracy | 100% | 100% | **100%** | -| Precision | 100% | 100% | **100%** | -| Recall | 100% | 100% | **100%** | -| F1 | 100% | 100% | **100%** | -| Medium-confidence decisions | 2 | 0 | **0** | -| False Positives | 0 | 0 | **0** | -| False Negatives | 0 | 0 | **0** | - -### New query — n11 - -| ID | Query | Expected | Judgment | Confidence | -|---|---|---|---|---| -| n11 | *"Please draft release notes bullets from this diff only, do not do a PR review."* | NOT trigger | ✅ Correct | High | - -Two independent signals prevent triggering: (1) `Does NOT trigger` exclusion names release-notes-from-diff verbatim; (2) explicit "do not do a PR review" removes any ambiguity. - ---- - -## Iteration 4 — Qualitative Analysis (iter-3 state) - -**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs -**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (iter-3 state before qualitative fixes) -**Assertions:** 67 total (new assertions for F1–F5 regression guards) - -Note: This iteration needed after applied review notes from reviewers. - -### Benchmark - -| Config | Pass Rate | Notes | -|---|---|---| -| **New skill (iter-3 state)** | **100.0% ± 0.0%** | | -| Old skill (iter-2 snapshot) | 97.3% ± 6.5% | | -| **Delta** | **+2.7 pp** | | - -### Qualitative issues identified (5 findings) - -| ID | Severity | Issue | Root cause | -|---|---|---|---| -| F1 | HIGH | "DB migrations" in elevated-risk trigger criteria too broad — every migration PR fires elevated risk, even though the DB migrations section already handles all risks | Keyword "DB migrations" listed without any qualifier | -| F2 | HIGH | Bumping a security library (`python-jose`, `bcrypt`) wrongly triggered elevated risk | "Touching auth controls" misread as any dep containing an auth library name | -| F3 | MEDIUM | "Trade-off analysis" appeared in `Applied sections:` header in output | Internal check listed as a named section | -| F4 | MEDIUM | `### Blocker` H3 headings used instead of `**Blocker**` bold in elevated-risk reviews | Format instruction existed but wasn't explicit enough for elevated-risk context | -| F5 | LOW | Widened deploy trigger sometimes surfaced as Important instead of Blocker | CI/CD section didn't call out the specific Blocker rule | - ---- - -## SKILL.md Fixes — Round 3 (F1–F5) - -Three targeted edits to `skills/pr-review/SKILL.md`: - -| Fixes | Section | Change | -|---|---|---| -| F1 + F2 | Elevated risk | Rewrote criteria: removed "DB migrations" and "wide refactor"; added 3 explicit edge-case notes (DB migrations → use DB migrations section; DTO rename → use API contracts section; security lib dep bump → use dependency bumps section) | -| F3 + F4 | Format | Added explicit bold-vs-H3 rule with fenced example; added exhaustive list of valid Applied-section names; added "do not list Trade-off analysis in header" instruction | -| F5 | CI/CD | Added explicit Blocker rule for widened deploy triggers | - ---- - -## Iteration 5 — New Skill vs Iter-3 Snapshot - -**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-3 snapshot) = 24 runs -**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (pre-F1-F5 fixes) -**Assertions:** 67 total, including 5 new F1–F5 regression guards - -### Benchmark - -| Config | Pass Rate | Notes | -|---|---|---| -| **New skill (F1–F5 fixes)** | **100.0% ± 0.0%** | All 67 assertions pass | -| Old skill (iter-3 snapshot) | 90.6% ± 15.3% | 8 failures across 4 evals | -| **Delta** | **+9.4 pp** | Largest delta since iter-1 | - -### Per-eval breakdown - -| Eval | New skill | Old skill | Δ | Note | -|---|---|---|---|---| -| eval-1 · api-rename | 6/6 | 4/6 | +2 | Old skill used H3 headings (F4 regression) | -| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | -| eval-3 · ci-gate-bypass | 6/6 | 6/6 | 0 | | -| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | -| eval-5 · dependency-bump | 7/7 | 4/7 | +3 | Old skill: H3 headings + elevated-risk on dep bump (F2, F4) | -| eval-6 · db-migration-risks | 7/7 | 6/7 | +1 | Old skill: H3 headings (F4) | -| eval-7 · elevated-risk-auth | 8/8 | 8/8 | 0 | | -| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | -| eval-9 · release-notes-negative | 3/3 | 3/3 | 0 | | -| eval-10 · api-rename-paraphrase | 3/3 | 3/3 | 0 | | -| eval-11 · ci-paraphrase | 4/4 | 4/4 | 0 | | -| **eval-12 · multi-section** | **9/9** | **7/9** | **+2** | Old skill: H3 headings + elevated-risk on plain migration (F1, F4) | - -### F1–F5 regression verification - -| Fix | Assertion | Result | -|---|---|---| -| F1 — DB migrations ≠ elevated risk | `no-elevated-risk-on-plain-migration` (eval-12) | ✅ PASS | -| F2 — dep bump ≠ elevated risk | `no-elevated-risk-on-dep-bump` (eval-5) | ✅ PASS | -| F3 — trade-off not in header | `no-trade-off-in-header` (eval-7, eval-12) | ✅ PASS | -| F4 — no H3 headings | `no-h3-headings` (all evals) | ✅ PASS | -| F5 — widened deploy = Blocker | `deploy-trigger-blocker` (eval-3, eval-12) | ✅ PASS | - ---- - -## Trigger Re-evaluation v4 (post iter-5) - -**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not) -**Method:** Manual simulation (reasoning-based; `claude` CLI not available) -**Results saved:** `evals/trigger-eval-results-v4.json` - -| Metric | v1 | v2 | v3 | **v4** | -|---|---|---|---|---| -| Queries | 20 | 20 | 21 | **21** | -| Accuracy | 100% | 100% | 100% | **100%** | -| Precision | 100% | 100% | 100% | **100%** | -| Recall | 100% | 100% | 100% | **100%** | -| F1 | 100% | 100% | 100% | **100%** | -| Near-misses | 2 | 0 | 0 | **0** | -| False Positives | 0 | 0 | 0 | **0** | -| False Negatives | 0 | 0 | 0 | **0** | - -All 21 queries resolved at **high confidence**. The F1–F5 fixes (iter-5) did not introduce any trigger regressions — the description boundary between review tasks and generative/documentation tasks remains clean and stable. - ---- - -## Overall Trajectory - -``` -Baseline (no skill) 75.8% avg pass rate 5,028 tokens 20.6 s - ↓ +24.2 pp -Iteration 1 skill 100.0% 7,168 tokens 51.2 s - ↓ Round-1 fixes: DB Blocker, API completeness, - elevated-risk boundary, output format header -Iteration 2 skill 100.0% 7,433 tokens 65.7 s - vs old-skill baseline 97.5% 7,249 tokens 57.6 s - ↓ Trigger fixes: t09, n02 -Trigger accuracy v2 100% (20/20), 0 near-misses - ↓ Round-2 improvements: proactive refs, eval-12, - multi-section boundary, regression assertion -Iteration 3 skill 100.0% (12 evals, 24 runs) - vs old-skill baseline 95.6% ± 11.8% - Delta +4.4 pp - ↓ Trigger v3 re-evaluation (+n11) -Trigger accuracy v3 100% (21/21), 0 near-misses - ↓ Iter-4 qualitative analysis → 5 issues (F1–F5) - ↓ Round-3 fixes: elevated-risk boundary rewrite, - format anchor (bold vs H3), deploy-trigger Blocker -Iteration 5 skill 100.0% ± 0.0% (12 evals, 67 assertions) - vs iter-3 snapshot 90.6% ± 15.3% - Delta +9.4 pp ← largest qualitative delta - ↓ Trigger v4 re-evaluation (post iter-5) -Trigger accuracy v4 100% (21/21), 0 near-misses, all high confidence -``` - -### Files produced - -| File | Description | -|---|---| -| `evals/body testing review.md` | Full iteration-1 benchmark narrative with per-eval detail | -| `evals/trigger-eval-results.json` | Pre-fix trigger simulation results (20 queries) | -| `evals/trigger-eval-results-v2.json` | Post-fix trigger simulation results (20 queries) | -| `evals/trigger-eval-results-v3.json` | Post-iter-3 trigger simulation results (21 queries, +n11) | -| `evals/trigger-eval-results-v4.json` | Post-iter-5 trigger simulation results (21 queries, all high confidence) | -| `evals/files/multi-section-risks.diff` | New fixture for eval-12: alembic add+drop, API renames, CI tweak | -| `evals/results-summary.md` | This file | -| `pr-review-workspace/iteration-1/benchmark.json` | Iteration-1 aggregated benchmark | -| `pr-review-workspace/iteration-2/benchmark.json` | Iteration-2 aggregated benchmark | -| `pr-review-workspace/iteration-3/benchmark.json` | Iteration-3 aggregated benchmark | -| `pr-review-workspace/iteration-4/benchmark.json` | Iteration-4 aggregated benchmark | -| `pr-review-workspace/iteration-5/benchmark.json` | Iteration-5 aggregated benchmark (67 assertions, +9.4pp) | -| `pr-review-workspace/skill-snapshot-iter1/SKILL.md` | Skill snapshot before round-1 fixes | -| `pr-review-workspace/skill-snapshot-iter2/SKILL.md` | Skill snapshot before round-2 improvements | -| `pr-review-workspace/skill-snapshot-iter3/SKILL.md` | Skill snapshot before iter-4 qualitative fixes | -| `pr-review-workspace/skill-snapshot-iter4/SKILL.md` | Skill snapshot before iter-5 F1–F5 fixes | -| `pr-review-workspace/eval-review.html` | Static iteration-1 eval viewer | diff --git a/skills/pr-review/evals/trigger-eval.json b/skills/pr-review/evals/trigger-eval.json deleted file mode 100644 index ebcf5c4..0000000 --- a/skills/pr-review/evals/trigger-eval.json +++ /dev/null @@ -1,128 +0,0 @@ -[ - { - "id": "t01-review-pr-api-contract", - "query": "Review PR #42 for API contract changes before we merge.", - "should_trigger": true, - "reason": "Explicit PR review request with API-contract risk keywords." - }, - { - "id": "t02-check-diff-ci", - "query": "Can you check this diff for CI/CD pipeline risks?", - "should_trigger": true, - "reason": "Direct diff review ask plus CI/CD domain trigger." - }, - { - "id": "t03-lgtm-sanity-check", - "query": "LGTM? Please do a quick sanity check on these code changes.", - "should_trigger": true, - "reason": "Contains explicit trigger phrase 'LGTM' and asks for review." - }, - { - "id": "t04-terraform-pr-review", - "query": "Please review this Terraform PR for security and least-privilege issues.", - "should_trigger": true, - "reason": "PR review request with infrastructure/security context." - }, - { - "id": "t05-db-migration-review", - "query": "Can you review this migration diff and call out rollback risks?", - "should_trigger": true, - "reason": "Asks for review of DB migration with risk assessment." - }, - { - "id": "t06-auth-refactor-feedback", - "query": "Need feedback on this auth refactor PR, especially permissions and secrets handling.", - "should_trigger": true, - "reason": "Explicit request for PR feedback; elevated-risk auth/security keywords." - }, - { - "id": "t07-dependency-bump-review", - "query": "Review this dependency bump PR and tell me if any upgrades look dangerous.", - "should_trigger": true, - "reason": "PR review intent and dependency-bump risk language." - }, - { - "id": "t08-pr-link-thoughts", - "query": "Here is the PR link, can you give review comments before I approve it?", - "should_trigger": true, - "reason": "Mentions PR link and asks for review comments directly." - }, - { - "id": "t09-rename-field-risk", - "query": "Does this look right? I renamed a response field in an endpoint; please review for client breakage.", - "should_trigger": true, - "reason": "Paraphrased review ask with API-contract breakage concern." - }, - { - "id": "t10-workflow-guardrails", - "query": "Can you sanity check this workflow file change for gate bypasses before merge?", - "should_trigger": true, - "reason": "Paraphrased PR review plus CI gate-bypass signal." - }, - { - "id": "n01-implement-feature", - "query": "Implement OAuth login in FastAPI with refresh tokens and tests.", - "should_trigger": false, - "reason": "Feature implementation request, not a PR review." - }, - { - "id": "n02-write-unit-tests", - "query": "Write unit tests for src/utils/date_utils.py.", - "should_trigger": false, - "reason": "Testing/code authoring request without review intent." - }, - { - "id": "n03-generate-release-notes", - "query": "Generate release notes from this diff in bullet format.", - "should_trigger": false, - "reason": "Documentation summarization task, not review feedback." - }, - { - "id": "n04-explain-error", - "query": "Why am I getting 'ModuleNotFoundError: psycopg2' in CI?", - "should_trigger": false, - "reason": "Debugging question; no PR/diff review request." - }, - { - "id": "n05-open-issue", - "query": "Create a GitHub issue for flaky integration tests.", - "should_trigger": false, - "reason": "Issue creation request, should route to create-issue skill." - }, - { - "id": "n06-refactor-function", - "query": "Refactor this function for readability and lower cyclomatic complexity.", - "should_trigger": false, - "reason": "Refactoring request without PR review context." - }, - { - "id": "n07-kudos-message", - "query": "Give kudos to Sarah for helping with incident response.", - "should_trigger": false, - "reason": "Recognition workflow request, not code review." - }, - { - "id": "n08-crossall-status", - "query": "Generate the crossall status update for this sprint from Jira.", - "should_trigger": false, - "reason": "Project status reporting task, unrelated to PR review." - }, - { - "id": "n09-translate-text", - "query": "Translate this paragraph to French.", - "should_trigger": false, - "reason": "Language translation task; no overlap with PR review." - }, - { - "id": "n10-math-help", - "query": "What is the derivative of x^3 * sin(x)?", - "should_trigger": false, - "reason": "General Q&A request; outside software review scope." - }, - { - "id": "n11-release-notes-from-diff", - "query": "Please draft release notes bullets from this diff only, do not do a PR review", - "should_trigger": false, - "reason": "Explicit instruction NOT to do a PR review; generative documentation task covered by Does NOT trigger exclusion in description." - } -] \ No newline at end of file diff --git a/skills/pr-review/references/output-template.md b/skills/pr-review/references/output-template.md deleted file mode 100644 index b22cc62..0000000 --- a/skills/pr-review/references/output-template.md +++ /dev/null @@ -1,90 +0,0 @@ -# PR Review Output Examples - -Use these as a template for formatting your review. The goal is to be useful, not thorough — a short focused review is better than a long one that buries the signal. - ---- - -## Example 1: Clean PR (LGTM) - -```markdown -## Standard review - -Files changed: `src/utils/date_utils.py`, `tests/utils/test_date_utils.py` - -**Blocker** -_None._ - -**Important** -_None._ - -**Nit** -- `get_business_days` uses an O(n) loop. A week-math formula is O(1) and simpler — worth - switching if this is called on large date ranges. Not blocking. - -Overall: LGTM. Clean utility, good test coverage, no dependencies added. -``` - ---- - -## Example 2: Multi-section review (API + CI/CD) - -```markdown -## Applied sections: Standard · API contracts · CI/CD - -**Blocker** - -1. `GET /users/{id}` response field rename — breaking contract - - **What**: `user_id` → `userId` changes the JSON wire format with no backward-compat alias. - - **Why**: Every existing caller receives `null` for `user_id` the moment this deploys. - - **How to fix**: Use `Field(alias="user_id")` to keep the old wire name, or version the endpoint. - -2. Unit tests commented out in CI - - **What**: `pytest` step disabled in `.github/workflows/ci.yml` (line 28). - - **Why**: Coverage gate bypassed — broken code can reach staging undetected. - - **How to fix**: Fix the specific flaky tests; don't disable the gate. - -**Important** - -3. Deploy trigger widened to all pushes - - **What**: `if: github.ref == 'refs/heads/develop'` changed to `github.event_name == 'push'`. - - **Why**: Staging now deploys on `main` pushes too, breaking the expected promotion flow. - - **How to fix**: Restore the branch-scoped guard. - -**Nit** -_None._ -``` - ---- - -## Example 3: Elevated risk PR (auth/security change) - -```markdown -## Applied sections: Standard · Elevated risk · API contracts - -**Blocker** - -1. Wildcard IAM policy (`Action: "*"`, `Resource: "*"`) - - **What**: `lambda_data_access` policy grants unrestricted access to the entire AWS account. - - **Why**: A compromised Lambda can exfiltrate data, destroy resources, or escalate privileges. - - **How to fix**: Scope to the specific actions and resource ARNs the function actually needs. - -**Important** - -2. Hardcoded account ID as variable default - - **What**: `variable "account_id" { default = "123456789012" }` — silently targets a single account. - - **Why**: Any caller that forgets to override will deploy against the wrong account. - - **How to fix**: Remove the default; use `data "aws_caller_identity" "current" {}` instead. - -**Nit** -- `Sid = "FullDataAccess"` is misleading — once scoped, rename to reflect actual access. -``` - ---- - -## Formatting rules - -- Reference file + line number where possible: `src/api/router.py:42` -- Quote the actual code in a fenced block when it helps clarity -- If a section has no findings, write `_None._` explicitly — it shows you checked -- Keep the "How to fix" minimal: a code snippet or a pattern name is enough; don't write an essay -- Omit sections entirely if they don't apply to this PR diff --git a/skills/pr-review/references/security-antipatterns.md b/skills/pr-review/references/security-antipatterns.md deleted file mode 100644 index 9d146d2..0000000 --- a/skills/pr-review/references/security-antipatterns.md +++ /dev/null @@ -1,67 +0,0 @@ -# Security Anti-Patterns Reference - -Quick reference of patterns to actively look for during PR review. Not exhaustive — use these as prompts when reading code, not as a checklist to run down mechanically. - ---- - -## Credentials & Secrets - -| Pattern | What to look for | -|---|---| -| Hardcoded creds | String literals matching `AKIA`, `sk-`, `ghp_`, `-----BEGIN`, passwords/tokens in assignments | -| Secret in env block | `env:` in CI with literal values instead of `${{ secrets.X }}` | -| Secret in log | `print(token)`, `logger.info(password)`, `console.log(apiKey)` | -| Secret in URL | `https://user:pass@host`, connection strings with embedded credentials | -| `.env` committed | `.env`, `.env.local`, `.env.production` added to the diff | - -## Injection - -| Pattern | What to look for | -|---|---| -| SQL injection | String-formatted queries: `f"SELECT * FROM {table}"`, `"WHERE id=" + user_input` | -| Command injection | `os.system(input)`, `subprocess.run(f"cmd {input}", shell=True)`, `exec(user_data)` | -| Template injection | User-controlled strings passed to Jinja2/Mustache/eval without escaping | -| Path traversal | `open(base_dir + user_path)` without `Path.resolve()` / `realpath()` check | - -## Auth & Access Control - -| Pattern | What to look for | -|---|---| -| Missing auth check | Endpoint added without `@require_auth`, `@login_required`, or middleware check | -| Broken object-level auth | Fetching a resource by ID without verifying the caller owns it | -| Privilege escalation | Role/permission assigned from user-supplied input without validation | -| JWT `alg: none` | JWT verification that accepts unsigned tokens | -| Wildcard IAM | `Action: "*"` or `Resource: "*"` in IAM policies | - -## Cryptography - -| Pattern | What to look for | -|---|---| -| Weak hash | `md5(password)`, `sha1(secret)` — use bcrypt/argon2 for passwords | -| Static IV/nonce | Same IV reused across encryptions | -| Insecure random | `random.random()` / `Math.random()` for security tokens — use `secrets` / `crypto.randomBytes` | - -## Input Validation - -| Pattern | What to look for | -|---|---| -| Missing size limit | File upload or request body accepted without size cap | -| Unvalidated redirect | `redirect(request.args['next'])` without allow-list check | -| XXE | XML parser without `resolve_entities=False` / external entity disabled | -| SSRF | HTTP client called with user-controlled URL without allow-list | - -## Data Handling - -| Pattern | What to look for | -|---|---| -| PII in logs | Email, phone, card number, SSN in log statements | -| PII in URL | Sensitive identifiers in query params (captured by access logs, proxies, browsers) | -| Insecure deserialization | `pickle.loads(user_data)`, `yaml.load()` without `Loader=SafeLoader` | - ---- - -## Severity guidance - -- **Blocker**: Exploitable in production, exposes credentials/data, bypasses auth -- **Important**: Increases attack surface, violates least privilege, missing validation on a sensitive path -- **Nit**: Theoretical/low-impact, no immediate exploitability diff --git a/skills/pr-review/scripts/classify_sections.py b/skills/pr-review/scripts/classify_sections.py deleted file mode 100644 index 6fb1cff..0000000 --- a/skills/pr-review/scripts/classify_sections.py +++ /dev/null @@ -1,132 +0,0 @@ -#!/usr/bin/env python3 -""" -classify_sections.py — Given a list of changed files, print which pr-review -conditional sections apply. - -Usage: - # From a file list (one path per line): - python3 classify_sections.py /tmp/pr_files.txt - - # Or pipe from gh: - gh pr view 42 --json files --jq '.files[].path' | python3 classify_sections.py - - -Output example: - Sections to apply: - [x] Standard review (always) - [x] API contracts (router.py, schemas.py) - [ ] Elevated risk - [ ] Dependency bumps - [ ] CI/CD - [ ] Infrastructure -""" - -import sys -import re -from pathlib import Path - - -SECTION_RULES = [ - ( - "Elevated risk", - [ - # Word-boundary patterns prevent false positives on paths like - # authentication_helper_test.py, AccessibleButton.tsx, or access-logging.md. - r"\bauth\b", r"\bsecurity\b", r"\bpermission\b", r"\baccess\b", - r"\bmigration\b", r"\balembic\b", r"\bflyway\b", r"\bliquibase\b", - r"\bintegration\b", r"\binfra(?:structure)?\b", r"\biac\b", - ], - ), - ( - "API contracts", - [ - r"router", r"endpoint", r"controller", r"view", - r"schema", r"dto", r"serializer", - r"\.env", r"config\.(py|ts|js|yaml|yml|json)", - r"action\.ya?ml", # GitHub Actions input/output - ], - ), - ( - "Dependency bumps", - [ - r"pom\.xml$", r"requirements.*\.txt$", r"pyproject\.toml$", - r"package\.json$", r"package-lock\.json$", r"yarn\.lock$", - r"poetry\.lock$", r".*\.csproj$", r"go\.mod$", r"go\.sum$", - r"build\.sbt$", r"Cargo\.toml$", r"Gemfile", r"composer\.json$", - r"pubspec\.yaml$", - ], - ), - ( - "CI/CD", - [ - r"\.github/workflows/", r"Jenkinsfile", r"Justfile", - # Match only exact Makefile name (not MakefileHelper.py etc.). - # Note: still a broad heuristic — only applies when Makefile is a pipeline - # entrypoint. A manual check is recommended when this fires on Makefile alone. - r"(?:^|/)Makefile$", - r"\.circleci/", r"\.travis\.yml$", - r"azure-pipelines\.yml$", r"bitbucket-pipelines\.yml$", - ], - ), - ( - "Infrastructure", - [ - r"\.tf$", r"\.tfvars$", r"terragrunt\.hcl$", - r"cloudformation", r"helm", r"Dockerfile", - r"docker-compose", r"iac/", r"infra/", - # Kubernetes / Helm assets - r"(?:^|/)Chart\.yaml$", r"(?:^|/)values(?:[^/]*)\.yaml$", - r"(?:^|/)charts/", r"(?:^|/)k8s/", r"\.k8s\.", r"kubernetes/", r"manifests/", - ], - ), - ( - "DB migrations", - [ - r"alembic/", r"migrations/", r"flyway/", r"liquibase/", - r"\.sql$", r"db/", r"database/", - ], - ), -] - - -def classify(files: list[str]) -> dict[str, list[str]]: - triggered: dict[str, list[str]] = {} - for section, patterns in SECTION_RULES: - matched = [ - f for f in files - if any(re.search(p, f, re.IGNORECASE) for p in patterns) - ] - if matched: - triggered[section] = matched - return triggered - - -def main() -> None: - source = sys.argv[1] if len(sys.argv) > 1 else "-" - if source == "-": - files = [line.strip() for line in sys.stdin if line.strip()] - else: - files = Path(source).read_text().splitlines() - files = [f.strip() for f in files if f.strip()] - - if not files: - print("No files provided.", file=sys.stderr) - sys.exit(1) - - triggered = classify(files) - - all_sections = [s for s, _ in SECTION_RULES] - print("Sections to apply:") - print(f" [x] Standard review (always)") - for section in all_sections: - if section in triggered: - examples = triggered[section][:3] - label = ", ".join(examples) - if len(triggered[section]) > 3: - label += f" (+{len(triggered[section]) - 3} more)" - print(f" [x] {section:<22} ({label})") - else: - print(f" [ ] {section}") - - -if __name__ == "__main__": - main() diff --git a/skills/pr-review/scripts/fetch_pr.sh b/skills/pr-review/scripts/fetch_pr.sh deleted file mode 100755 index b364009..0000000 --- a/skills/pr-review/scripts/fetch_pr.sh +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/env bash -# fetch_pr.sh — Fetch PR diff and changed file list using the gh CLI. -# -# Usage: -# ./fetch_pr.sh [REPO] -# -# Examples: -# ./fetch_pr.sh 42 -# ./fetch_pr.sh 42 owner/repo -# -# Output: -# Prints the PR description, then the diff to stdout. -# Writes the changed file list to /tmp/pr_files.txt for use by classify_sections.py. -# -# Requirements: gh CLI authenticated (gh auth status) - -set -euo pipefail - -PR="${1:?Usage: fetch_pr.sh [REPO]}" -REPO_ARGS=() -[ -n "${2:-}" ] && REPO_ARGS=("--repo" "$2") - -echo "=== PR Description ===" -gh pr view "$PR" "${REPO_ARGS[@]}" --json title,body,author \ - --template '# {{.title}} -Author: {{.author.login}} - -{{.body}}' - -echo "" -echo "=== Changed Files ===" -gh pr view "$PR" "${REPO_ARGS[@]}" --json files \ - --jq '.files[].path' | tee /tmp/pr_files.txt - -echo "" -echo "=== Diff ===" -gh pr diff "$PR" "${REPO_ARGS[@]}" From 120fc88a75c8c7172d8983c740b941bf17d25d98 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 13:20:16 +0200 Subject: [PATCH 3/6] Add comprehensive Skill Testing Guide to document evaluation methodologies --- docs/{SKILL_TESTING.md => skill-testing.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/{SKILL_TESTING.md => skill-testing.md} (100%) diff --git a/docs/SKILL_TESTING.md b/docs/skill-testing.md similarity index 100% rename from docs/SKILL_TESTING.md rename to docs/skill-testing.md From b2c22b03d0e4b3a0b30b8f2ad2f4d0cbd0dc8181 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 13:20:28 +0200 Subject: [PATCH 4/6] Refactor documentation structure and enhance clarity in guides --- CONTRIBUTING.md | 34 +++--- README.md | 237 +++++++++++++--------------------------- docs/README.md | 33 ++---- docs/getting-started.md | 162 ++++++++++++++++----------- docs/troubleshooting.md | 48 ++++++++ 5 files changed, 249 insertions(+), 265 deletions(-) create mode 100644 docs/troubleshooting.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1013dea..7a213fc 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,8 +1,10 @@ # Contributing to Agentic Toolkit -This guide covers everything you need to author a high-quality skill for this repository — from file layout and frontmatter rules through description writing, body guidelines, and testing. +This guide covers everything you need to author a high-quality skill for this repository — from file layout and +frontmatter rules through description writing, body guidelines, and testing. ---- +> New here? Read **[docs/getting-started.md](./docs/getting-started.md)** first to understand what skills are and how +> to install them before authoring your own. ## Table of Contents @@ -13,8 +15,6 @@ This guide covers everything you need to author a high-quality skill for this re 5. [Testing your skill](#5-testing-your-skill) 6. [Proposing a skill via a pull request](#6-proposing-a-skill-via-a-pull-request) ---- - ## 1. Skill structure Each skill lives in its own folder under `skills/`: @@ -33,11 +33,12 @@ skills/ ### When to add each optional directory -| Directory | Use for | -|---|---| -| `scripts/` | Deterministic or repetitive logic better run as code than described in prose (e.g. a validation script, a formatter, a data transformer) | -| `references/` | Domain docs, API specs, decision tables, or anything too large to keep in `SKILL.md` without exceeding 500 lines | -| `assets/` | Template files, example inputs/outputs, icons — anything the skill produces or consumes | +| Directory | Use for | +|---------------|------------------------------------------------------------------------------------------------------------------------------------------| +| `scripts/` | Deterministic or repetitive logic better run as code than described in prose (e.g. a validation script, a formatter, a data transformer) | +| `references/` | Domain docs, API specs, decision tables, or anything too large to keep in `SKILL.md` without exceeding 500 lines | +| `assets/` | Template files, example inputs/outputs, icons — anything the skill produces or consumes | +| `evals/` | Test prompts and assertions to verify skill behavior and trigger accuracy. See [skill-testing.md](./docs/skill-testing.md) | --- @@ -61,7 +62,7 @@ compatibility: GitHub Copilot # optional — only when env requirements exist | Field | Required | Constraints | |---|---|---| | `name` | ✅ | Lowercase letters, numbers, and hyphens only. Max 64 chars. Must not start or end with a hyphen. No consecutive hyphens (`--`). Must match the parent directory name. | -| `description` | ✅ | Max 1024 chars. Non-empty. Must describe both **what** the skill does and **when** to activate it. See [§3](#3-writing-the-description). | +| `description` | ✅ | Max 1024 chars. Non-empty. Must describe both **what** the skill does and **when** to activate it. See [Writing the description](#3-writing-the-description). | | `license` | ➖ | Short SPDX name or reference to a bundled `LICENSE.txt`. | | `compatibility` | ➖ | Max 500 chars. Only include if the skill has specific environment requirements (tools, Python version, network access, etc.). Most skills do not need this field. | @@ -197,15 +198,20 @@ If every test run of your skill independently writes the same helper script (a f ## 5. Testing your skill -Use the **GitHub Copilot CLI** and the `skill-creator` skill to test, compare, and optimize your skill. The full testing methodology — eval creation, fixture management, iterative improvement, trigger testing, and description optimization — is documented in [docs/SKILL_TESTING.md](./docs/SKILL_TESTING.md). +Before proposing a PR, verify that your skill activates correctly and produces good output. The full testing +methodology — eval creation, fixture management, with/without comparisons, trigger testing, and description +optimization using the Anthropic [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) +skill — is covered in **[docs/skill-testing.md](./docs/skill-testing.md)**. --- ## 6. Proposing a skill via a pull request -1. **Open an issue first** using the [Skill Proposal template](https://github.com/AbsaOSS/agentic-toolkit/issues/new/choose) to discuss scope before writing code -2. Create your skill folder under `skills/` following the structure in [§1](#1-skill-structure) -3. Run the tests described in [§5](#5-testing-your-skill) and include benchmark results in the PR description +1. **Open an issue first** using + the [Skill Proposal template](https://github.com/AbsaOSS/agentic-toolkit/issues/new/choose) to discuss scope + before writing code +2. Create your skill folder under `skills/` following the structure in [Skill structure](#1-skill-structure) +3. Run the tests described in [Testing your skill](#5-testing-your-skill) and include benchmark results in the PR description 4. Open a pull request; CODEOWNERS will be automatically requested for review 5. Optionally, add `Copilot` as a reviewer to get automated skill quality feedback diff --git a/README.md b/README.md index 03fd33a..fbd17f0 100644 --- a/README.md +++ b/README.md @@ -1,205 +1,118 @@ # Agentic Toolkit -Curated, reusable AI skills — instructional context, domain conventions, and callable agent tools for any AI-assisted engineering. +Curated, reusable AI skills — instructional context, domain conventions, and callable agent tools for AI-assisted +engineering. ---- +Every AI assistant starts from zero — no accumulated engineering experience, no reusable workflows. This repo fixes +that. It's a curated library of **skills**: bundles of domain knowledge, engineering patterns, and proven practices +packaged for any [Agent Skills](https://agentskills.io)-compatible client (Copilot, Claude, Cursor, and others). The +knowledge here is not internal or proprietary — it reflects universal software engineering practices used at ABSA. +Tomorrow the scope may expand to agents, MCP servers, and plugins as agentic tooling matures (see [Scope](#scope)). -## What Is A Skill? +**Who is this for?** Anyone who uses AI for AI-assisted engineering. The primary audience is technical: engineers, tech +leads, architects, or anyone on a technical team. -**Skills** are folders of instructions, scripts, and resources that an AI agent can load on demand to improve its performance on specialised tasks. +## Table of Contents -The Agent Skills format is an open standard maintained by Anthropic, adopted by GitHub Copilot and other AI systems. See the [full specification](https://agentskills.io) and the [spec repository](https://github.com/agentskills/agentskills). +- [Scope](#scope) +- [How Skills Relate To Your Own](#how-skills-relate-to-your-own) +- [Quickstart](#quickstart) +- [Skill Catalog](#skill-catalog) +- [Finding More Skills](#finding-more-skills) +- [Contributing](#contributing) +- [FAQ](#faq) +- [Troubleshooting](#troubleshooting) -### How Skills Work +## Scope -Skills manage an LLM context efficiently — the agent is not given all skill content upfront: +This repo currently focuses on **Skills** — basically just folders of instructions, scripts, and resources that an AI +agent can load on demand to improve its performance on specialised tasks. -1. **Discovery** — At startup, the agent loads only the `name` and `description` of each available skill -2. **Activation** — When a task matches a skill's description, the agent reads the full `SKILL.md` into context -3. **Execution** — The agent follows the instructions, optionally loading referenced files or executing bundled scripts +As our use of agentic tooling matures, the scope may expand to include: -> ⚠️ **The `description` field is your most important content.** It is the sole signal the agent uses to decide whether to activate a skill. A vague description means the skill will never fire. See [CONTRIBUTING.md](./CONTRIBUTING.md) for authoring guidance. -> -> For detailed skill testing and evaluation methodology, see [docs/SKILL_TESTING.md](docs/SKILL_TESTING.md). +- **Agents** — custom agent definitions for recurring engineering tasks (code review, architecture, research) +- **MCP Servers** — callable tool servers giving agents access to internal systems and data +- **Plugins** — bundled distributions packaging skills, agents, and MCP servers into a single install -### Skill Structure - -Each skill lives in its own folder, which is a top-level directory in this repository: - -``` -skill-name/ -├── SKILL.md # Required — metadata frontmatter + instructions -├── scripts/ # Optional — executable code the agent can run -├── references/ # Optional — supporting documentation and specs -├── assets/ # Optional — templates, examples, resources -└── evals/ # Optional — test prompts and assertions -``` - -The `SKILL.md` file follows a standard frontmatter schema: - -```yaml ---- -name: skill-name # kebab-case, max 64 chars -description: > # max 1024 chars — what it does AND when to use it - What this skill does and the specific situations in which it should be activated. - Be explicit. Include keywords that describe the task, domain, and trigger conditions. -license: Proprietary # optional -compatibility: GitHub Copilot # optional — intended tools or environment requirements ---- - -# Skill Title - -## When To Use This Skill -... - -## Instructions -... -``` - -Required fields: `name`, `description`. Everything else is optional. - ---- - -## Skill Scopes - -GitHub Copilot, the primary AI productivity assistant, resolves skills from two locations: - -| Scope | Location | When To Use | -|---|---|---| -| **Personal Skills** | `~/.copilot/skills/` | Skills available across all your projects | -| **Project Skills** | `.github/skills/` | Skills specific to a single repository | - -Other Agent Skills-compatible tools typically use: - -| Scope | Location | -|---|---| -| **Personal** | `~/.agents/skills/` | -| **Project** | `.agents/skills/` | - ---- +If and when that happens, the repo structure, name, and install instructions will be updated accordingly. For now, +skills are the right place to start — they are the lowest friction, highest value layer, and the foundation everything +else builds on. ## How Skills Relate To Your Own -The skills in this repo are the **base layer** — conventions and capabilities applicable across most teams and projects in the sub-department. You load them into your personal space and extend with your own layers: - -``` -Agentic Skills ← shared base (this repo) - └── Personal Skills ← your individual skills (~/.copilot/skills/) - └── Project Skills ← repository-specific (.github/skills/) -``` - -The base layer is the **foundation every engineer might share**. Personal and project layers are **extensions** — not replacements. If something belongs in the base, propose it here rather than duplicating it downstream. - ---- - -## Quickstart - -This repo uses Copilot as the primary AI-assisted engineering tool. There are many ways how to use it, but here we leverage the [GitHub Copilot CLI](https://github.com/github/copilot-cli). Skills are managed through slash commands inside a Copilot CLI session. - -### 1. Install GitHub Copilot CLI - -```bash -# macOS / Linux -curl -fsSL https://gh.io/copilot-install | bash +The skills in this repo are the **base layer** — shared conventions and capabilities applicable across teams and +projects. You load them into your personal or project space and extend with your own layers, so every session inherits +the team's collective knowledge or workflows. -# macOS (Homebrew) -brew install copilot-cli +If something might help the wider team, it belongs here. -# Any platform (npm) -npm install -g @github/copilot ``` - -Then launch a session from your project directory: - -```bash -copilot +Agentic Skills ← shared base (this repo) + └── Personal Skills ← your individual skills (~/.copilot/skills/ or ~/.agents/skills/) + └── Project Skills ← repository-specific (most commonly in .github/skills/) ``` -### 2. Add Skills To Your Personal Space - -Inside a Copilot CLI session, add skills from this repo one by one: +Each layer extends the one above — personal and project layers are **extensions**, not replacements. -``` -/skills add ~/.copilot/skills/pr-review -/skills add ~/.copilot/skills/create-issue -/skills add ~/.copilot/skills/kudos -``` +## Quickstart -Or clone this repo and add all skills at once from your terminal before launching: +Skills are managed through the [Vercel Skills CLI](https://github.com/vercel-labs/skills) or via +[GitHub Copilot CLI](https://github.com/github/copilot-cli) slash commands. ```bash -git clone https://github.com/AbsaOSS/agentic-toolkit.git -cp -r agentic-toolkit/skills/*/ ~/.copilot/skills/ -``` +# Install all skills globally +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g -Skills in `~/.copilot/skills/` are loaded automatically at every session start — no further action needed. - -### 3. Verify Skills Are Loaded - -Inside a session, inspect the full loaded environment: - -``` -/env +# Or install a single skill +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g --skill pr-review ``` -This shows all loaded skills, MCP servers, agents, and instructions. Confirm your skills appear in the list. +For the full guide — what skills are, how they activate, project-scoped installs, updates, Copilot CLI commands — see +**[docs/getting-started.md](./docs/getting-started.md)**. -Alternatively, see all the available skills: - -``` -/skills list -``` +## Skill Catalog -### 4. Add Project-Specific Skills (Optional) +Browse all available skills in the **[skills/](./skills/)** directory — each skill folder contains a `SKILL.md` with +its purpose, trigger phrases, and full instructions. -For skills that only apply to a specific repository, place them in `.github/skills/` within that repo. These are loaded automatically when Copilot CLI is launched from that directory, layered on top of your personal as well as these project-specific skills. +> The catalog table will be populated as skills are added. See `skills/` for the current set. -``` -your-project-repo/ -└── .github/ - └── skills/ - └── your-project-skill/ - └── SKILL.md -``` - ---- - -## Using Review Skills With the Native Copilot Reviewer - -The native Copilot reviewer is the **Add Copilot as reviewer** button on a pull request on GitHub.com. It reads `.github/copilot-instructions.md` in your repository. +## Finding More Skills -This repo provides a set of review-related skills. `pr-review` is the first and most broadly applicable — a single unified skill covering standard review, elevated risk, API contracts, dependency bumps, CI/CD changes, and infrastructure. Additional review skills can be added to the instructions file in the same way. +Before building a new skill, check whether one already exists: -To apply review standards when Copilot is added as a reviewer on GitHub.com, add the following to your project's `.github/copilot-instructions.md`: +| Source | What's Available | +|--------------------------------------------------------------------------------------|-------------------------------------------------------------------------| +| [github/awesome-copilot](https://github.com/github/awesome-copilot/tree/main/skills) | 200+ community skills: cloud, languages, security, DevOps, productivity | +| [skills.sh](https://skills.sh) | Open registry — install with `npx skills add ` | +| [anthropics/skills](https://github.com/anthropics/skills) | Anthropic reference skills including `skill-creator` | +| [absa-group/agent-skills](https://github.com/absa-group/agent-skills) | Broader ABSA-owned skill collection | +| [absa-group/cps-agentic-toolkit](https://github.com/absa-group/cps-agentic-toolkit) | CPS team's skill set built on top of this repo | -```markdown -## PR Review +## Contributing -### Core review skill — covers standard, elevated risk, API contracts, deps, CI/CD, and IaC -When reviewing a pull request, load and apply: -https://github.com/AbsaOSS/agentic-toolkit/blob/master/skills/pr-review/SKILL.md +See **[CONTRIBUTING.md](./CONTRIBUTING.md)** for the skill authoring guide — folder layout, frontmatter schema, writing +effective descriptions and bodies, [testing](./docs/skill-testing.md), and the PR checklist. -Apply only the sections relevant to the files touched by the PR. -Output findings grouped as: Blocker / Important / Nit. -``` +To propose a new skill — or to propose expanding the repo into agents, MCP servers, or +plugins — [open an issue](https://github.com/AbsaOSS/agentic-toolkit/issues/new). ---- +Contributions are welcome from the ABSA technical team. -## Finding More Skills +## FAQ -Before building a new skill, check whether one already exists: +### What's the difference between a skill, an agent, and an MCP server? -| Source | What's available | -|---|---| -| [github/awesome-copilot](https://github.com/github/awesome-copilot/tree/main/skills) | 200+ community skills: cloud, languages, security, DevOps, productivity | -| [skills.sh](https://skills.sh) | Open registry — install any GitHub-hosted skill with `npx skills add ` | -| [anthropics/skills](https://github.com/anthropics/skills) | Anthropic reference skills including `skill-creator` | -| [AbsaOSS/agentic-toolkit](https://github.com/AbsaOSS/agentic-toolkit) | Broader skill collection | +A **skill** gives the agent instructions — prose it reads into context when a task matches. +An **agent** is a full persona: a system prompt, tools, and behavioural constraints bundled together — it may *use* +multiple skills. An **MCP server** gives the agent callable tools for live integrations and API calls. +Think of skills as books, agents as the people who read them, and MCP servers as the phone lines they dial. ---- +### Can I use these skills outside of GitHub Copilot? -## Contributing +Yes. Skills follow the open [Agent Skills](https://agentskills.io) standard and work with any compatible tool — +Claude, Cursor, Windsurf, and custom pipelines. -See [CONTRIBUTING.md](./CONTRIBUTING.md) for the full authoring guide. +## Troubleshooting -To propose a new skill or a change to an existing one, open an issue using the **Skill Proposal** template. +Setup issues and common fixes are covered in **[docs/troubleshooting.md](./docs/troubleshooting.md)**. diff --git a/docs/README.md b/docs/README.md index 7c65b7d..a13b50d 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,29 +1,14 @@ -# Agentic Toolkit — Documentation +# Documentation Navigation hub for all guides in this repository. Browse by category below. -## User Guides +| Guide | Audience | Description | +|-----------------------------------------|----------|------------------------------------------------------------------------------------| +| [Getting Started](./getting-started.md) | Users | What skills are, how to install them, Copilot CLI usage | +| [Troubleshooting](./troubleshooting.md) | Users | Setup guides and fixes for install, activation, and proxy issues | +| [Contributing](../CONTRIBUTING.md) | Authors | Skill folder layout, frontmatter, description writing, body guidelines, PR process | +| [Skill Testing](./skill-testing.md) | Authors | Eval creation, fixtures, regression loops, trigger and description optimization | -For engineers installing and using Agentic Skills day-to-day. - -| Guide | Description | -|-------|-------------| -| [Getting Started](./getting-started.md) | Install the Skills CLI, add Agentic Skills to your global or project space, and start your first session | - -## Creator Guides - -For engineers writing, testing, and contributing new skills. - -| Guide | Description | -|-------|-------------| -| [Contributing](../CONTRIBUTING.md) | Skill folder layout, frontmatter schema, description writing, body guidelines, testing, and the full PR contribution process | -| [Skill Testing](./SKILL_TESTING.md) | How to test, evaluate, and optimize skills using evals, fixtures, and the skill-creator tool | - -## Reference - -| Guide | Description | -|-------|-------------| -| [Skill Catalog](../skills/) | Browse all available Agentic Skills — each skill folder contains a `SKILL.md` with its purpose, trigger phrases, and full instructions | - -> **Keep this index up to date.** When you add a new guide under `docs/`, add a row to the relevant category above. +> **Keep this index up to date.** When you add a new guide under `docs/`, add a row to the table above. +See also the [main README](../README.md) for the skill catalog, scope, and FAQ. diff --git a/docs/getting-started.md b/docs/getting-started.md index 70cf323..d5402a3 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,115 +1,147 @@ -# How to Start Using Skills? +# Getting Started -This guide walks you through installing Agentic Skills and using them inside AI-assisted engineering tools. +This guide covers what skills are, how to install them, and how to use them in your AI-assisted engineering workflow. -## 1. Install the Vercel Skills CLI +## What Is a Skill? -Skills are distributed and installed using the [Vercel Skills CLI](https://github.com/vercel-labs/skills) — a lightweight tool available via `npx`, no separate installation required. +A **skill** is a folder containing instructions, scripts, and reference material that an AI agent loads on demand. +Skills follow the open [Agent Skills](https://agentskills.io) standard and work with GitHub Copilot, Claude, Cursor, and +other compatible tools. -```bash -# Verify the CLI is accessible -npx skills --version -``` +Skills manage LLM context efficiently — the agent does **not** receive all skill content at startup. Instead, it loads +content progressively: -## 2. Install Skills From This Repository +| Phase | What Happens | +|----------------|-----------------------------------------------------------------------------------------------------------| +| **Discovery** | At startup, the agent loads only the `name` and `description` of each installed skill | +| **Activation** | When your task matches a skill's description, the agent reads the full `SKILL.md` into context | +| **Execution** | The agent follows the skill's instructions, optionally loading reference files or running bundled scripts | -### List available skills +> ⚠️ **The `description` field is the sole activation signal.** If a skill isn't firing, your prompt likely doesn't +> match its description keywords. Rephrase your message to include relevant trigger terms from the skill's description. +> Inside a Copilot CLI session, run `/skills list` to inspect loaded descriptions. +> Outside the CLI, you can run `npx skills list -g` to see all the installed skills. -```bash -npx skills add https://github.com/AbsaOSS/agentic-toolkit --list -``` +## Prerequisites: Install Copilot CLI -### Install all Agentic Skills (global — available in every project) +We use the [GitHub Copilot CLI](https://github.com/github/copilot-cli) as the primary AI-assisted engineering tool. ```bash -npx skills add https://github.com/AbsaOSS/agentic-toolkit -g +# macOS / Linux +curl -fsSL https://gh.io/copilot-install | bash + +# macOS (Homebrew) +brew install copilot-cli + +# Any platform (npm) +npm install -g @github/copilot ``` -### Install all Agentic Skills (project-scoped — installs into `.github/skills/` of the current repo) +Then launch a session from your project directory: ```bash -npx skills add https://github.com/AbsaOSS/agentic-toolkit +copilot ``` -> To share project-scoped skills with your team, commit the generated `.github/skills/` directory to your repository. +## Install Skills + +### Option A: Vercel Skills CLI (recommended) -### Install a specific skill only +The [Vercel Skills CLI](https://github.com/vercel-labs/skills) automates installation via `npx`: ```bash -# Example: install only the pr-review skill, globally +# List available skills from this repo without installing them +npx skills add https://github.com/AbsaOSS/agentic-toolkit --list + +# Install all skills globally (available in every project) +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g + +# Install a single skill globally (--skill filters by folder name within the repo) npx skills add https://github.com/AbsaOSS/agentic-toolkit -g --skill pr-review + +# Install project-scoped (into .github/skills/ of the current repo) +npx skills add https://github.com/AbsaOSS/agentic-toolkit ``` -### Update installed skills +> To share project-scoped skills with your team, commit the `.github/skills/` directory. -```bash -npx skills update # interactive — prompts for scope -npx skills update -g # update global skills only -npx skills update -y # non-interactive, auto-detects scope -``` +On first install, `npx skills add` asks two questions interactively: -> **Installation method:** When running interactively, you'll be prompted to choose between **Symlink** (recommended — single source, easy updates) or **Copy** (use when symlinks aren't supported). +| Prompt | Options | Recommendation | +|---------------------------------|-----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **Installation method** | `Symlink` / `Copy` | Use **Symlink** — a single copy lives on disk, and `npx skills update` refreshes it in place. Use **Copy** only if symlinks aren't supported in your environment. | +| **Which agents to install to?** | A specific agent directory / **Copy to all agents** | Choose **Copy to all agents** if you want the skills available in Copilot, Claude, Cursor, and any other compatible tool simultaneously. Pick a specific directory to limit the scope. | -## 3. How Skills Work +These choices are recorded and reused on subsequent updates — you won't be asked again unless you reinstall from +scratch. The global skills directory used is typically `~/.agents/skills/` (recommended for cross-tool compatibility) +or `~/.copilot/skills/`. See [Skills directory location varies](./troubleshooting.md#skills-directory-location-varies) +if you need +to change it later. -Skills manage LLM context efficiently. The agent does **not** receive all skill content at startup — it loads content progressively: +#### Update installed skills -| Phase | What Happens | -|---|---| -| **Discovery** | At startup, the agent loads only the `name` and `description` of each installed skill | -| **Activation** | When your task matches a skill's description, the agent reads the full `SKILL.md` into context | -| **Execution** | The agent follows the skill's step-by-step instructions, optionally reading reference files or running bundled scripts | +```bash +npx skills update # interactive — prompts for scope +npx skills update -g # update global skills +npx skills update -y # auto-detects scope: project-scope if running within a project, else global +``` -### What your prompt needs to contain +### Option B: Manual install -Because the `description` field is the agent's sole activation signal, your message must contain keywords or context that match it. For example: +Clone this repo and copy skills into your personal space: -- To trigger **pr-review**: mention "review", "pull request", or "PR". +```bash +git clone https://github.com/AbsaOSS/agentic-toolkit.git +cp -r agentic-toolkit/skills/*/ ~/.agents/skills/ +``` -> **Tip:** If a skill is not activating, rephrase your message to include more explicit keywords from the skill's description. You can inspect descriptions with `/skills list` inside a session. +Skills copied into `~/.agents/skills/` (or `~/.copilot/skills/`) are loaded automatically at every session start — no +further action needed. -## 4. GitHub Copilot CLI Guide +## Verify Skills Are Loaded -GitHub Copilot CLI is the repo-expected AI-assisted engineering tool. Skills are resolved from two locations: +Inside a Copilot CLI session: -| Scope | Location | When To Use | -|---|---|---| -| **Global** | `~/.copilot/skills/` | Available across all your projects, on every session start | -| **Project** | `.github/skills/` | Specific to one repository; commit the directory to share with your team | +``` +/skills list # See all loaded skills and their descriptions +/env # Full environment: skills, agents, MCP servers, instructions +``` -When skills are installed with `-g`, the CLI places them in `~/.copilot/skills/`. Without `-g`, they go into `.github/skills/` of the current project. +Confirm the skills appear in the list. -### Inspecting loaded skills +## Using Skills in Copilot CLI -Inside an active Copilot CLI session: +GitHub Copilot CLI resolves skills from two locations: -``` -# See all available skills and their descriptions -/skills list +| Scope | Location | When To Use | +|-------------|---------------------------------------------|-------------------------------------------------------| +| **Global** | `~/.agents/skills/` or `~/.copilot/skills/` | Available across all projects | +| **Project** | `.github/skills/` | Specific to one repo — commit to share with your team | -# See the full loaded environment (skills, agents, MCP servers, instructions) -/env -``` +Project skills take precedence over global skills when both exist. -### Adding and removing skills mid-session +### Managing skills mid-session ``` -# Add a skill from a path into the current session -/skills add ~/.copilot/skills/pr-review - -# Remove a skill from the current session -/skills remove pr-review +/skills add # Add a skill to the current session (temporary) +/skills remove # Remove a skill from the current session (temporary) ``` -> **Note:** Changes made with `/skills add` or `/skills remove` inside a session are **temporary** — they only affect the running session. Use `npx skills add` from your terminal to make permanent changes. +> Session-level changes are temporary. Use `npx skills add` or the manual copy for permanent changes. -### Stacking order +### Add project-specific skills -Copilot CLI resolves skills from two real locations in this order: +For skills that only apply to a specific repository, place them in `.github/skills/` within that repo. These are loaded +automatically when Copilot CLI is launched from that directory, layered on top of your personal and CPS base skills. ``` -Global Skills ← ~/.copilot/skills/ (loaded first) - └── Project Skills ← .github/skills/ (override global) +your-project-repo/ +└── .github/ + └── skills/ + └── your-project-skill/ + └── SKILL.md ``` -Project skills take precedence over global skills. Agentic Skills installed with `-g` are simply global skills — there is no separate "Agentic base" layer. Use project skills for repository-specific overrides that your team commits together. +## Troubleshooting + +Running into issues? See **[docs/troubleshooting.md](./troubleshooting.md)** guide. diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..1ded6f4 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,48 @@ +# Troubleshooting + +Fixes for known setup issues, plus setup guides for specific skills. + +--- + +## Skill not activating + +The `description` field is the **sole activation signal** — the agent never reads a skill's body until it decides +the description matches the current task. + +Steps to diagnose: + +1. Inside a Copilot CLI session, run `/skills list` to confirm the skill is loaded and inspect its description. +2. Outside the CLI, run `npx skills list` or open the skill's `SKILL.md` directly. +3. Rephrase your prompt to include keywords that appear in the description — both formal terms and casual phrasings. +4. If the skill still won't trigger, consider improving the description. See the + [Contributing Guide](../CONTRIBUTING.md#3-writing-the-description) for guidance. + +## Certificate errors (Netskope / corporate proxy) + +If `npx skills` fails with SSL/certificate errors, configure npm to trust your corporate certificate bundle. Our +corp certs are stored in [de-developer-configuration](https://github.com/absa-group/de-developer-configuration). + +```bash +npm config set cafile ~/netskope-bundle.pem +``` + +This writes the setting to `~/.npmrc`. Replace the filename with the path to your actual certificate bundle. + +## Skill removal doesn't update the lock file + +`npx skills remove` removes the skill directory but does not always clean up internal state. If you see stale +references after removing a skill, delete the corresponding entry from the skills lock file (typically +`.skills-lock.json`) manually. See +[Managing AI Agent Skills with npx skills](https://dev.to/toyama0919/managing-ai-agent-skills-with-npx-skills-a-practical-guide-2an8) +for a detailed walkthrough. + +## Skills directory location varies + +The global skills directory depends on which location you chose when you first ran `npx skills add`. The default +is `~/.agents/skills/`, but `~/.copilot/skills/` and other paths are valid. We recommend `~/.agents/skills/` for +cross-tool compatibility. + +## Skills not loading in a project session + +Make sure project-scoped skills live in `.github/skills/` at the root of the repo *and* that you launched the +Copilot CLI from inside that repo's directory. Skills in subdirectories or in differently named folders are ignored. From bcd8b35f05fd680860a9de7a3d7a25540f2cb139 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 13:21:59 +0200 Subject: [PATCH 5/6] Add workflow to check for release notes in PR descriptions --- .github/workflows/check_pr_release_notes.yml | 24 ++++++++++++++++++++ 1 file changed, 24 insertions(+) create mode 100644 .github/workflows/check_pr_release_notes.yml diff --git a/.github/workflows/check_pr_release_notes.yml b/.github/workflows/check_pr_release_notes.yml new file mode 100644 index 0000000..e93f86f --- /dev/null +++ b/.github/workflows/check_pr_release_notes.yml @@ -0,0 +1,24 @@ +name: Check PR Release Notes in Description + +on: + pull_request: + types: [opened, synchronize, reopened, edited] + branches: [ master ] + +jobs: + check-release-notes: + runs-on: ubuntu-latest + + steps: + - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 + with: + python-version: '3.11' + + - name: Check presence of release notes in PR description + uses: AbsaOSS/release-notes-presence-check@8e586b26a5e27f899ee8590a5d988fd4780a3dbf + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + with: + github-repository: ${{ github.repository }} + pr-number: ${{ github.event.number }} + skip-labels: "no RN" From 178ce6b4590923c5ec6fa0c56e39e8c2d5f805cf Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 14:05:26 +0200 Subject: [PATCH 6/6] Update CODEOWNERS, enhance PR review instructions, adjust Python version in workflow, and clarify troubleshooting documentation --- .github/CODEOWNERS | 2 +- .github/copilot-instructions.md | 5 ----- .github/workflows/check_pr_release_notes.yml | 3 +-- README.md | 2 +- docs/troubleshooting.md | 10 +++++----- 5 files changed, 8 insertions(+), 14 deletions(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 2dfc5bb..4434d63 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -3,4 +3,4 @@ # Code owners are automatically requested for review when someone opens a pull request # Default owners for everything in the repo -* @miroslavpojer @lsulak @oto-macenauer-absa +* @miroslavpojer @lsulak @oto-macenauer-absa @tmikula-dev diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index ddeaa9b..0298983 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -2,14 +2,9 @@ ## PR Review -### Core review — covers standard, elevated risk, API contracts, deps, CI/CD, and IaC - When reviewing a pull request, load and apply: https://github.com/AbsaOSS/agentic-toolkit/blob/master/skills/pr-review/SKILL.md -Apply only the sections relevant to the files touched by the PR. -Output findings grouped as: Blocker / Important / Nit. - ### Skill-specific checks — apply when any `SKILL.md` is modified For every modified `SKILL.md`, also verify: diff --git a/.github/workflows/check_pr_release_notes.yml b/.github/workflows/check_pr_release_notes.yml index e93f86f..af3761e 100644 --- a/.github/workflows/check_pr_release_notes.yml +++ b/.github/workflows/check_pr_release_notes.yml @@ -8,11 +8,10 @@ on: jobs: check-release-notes: runs-on: ubuntu-latest - steps: - uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 with: - python-version: '3.11' + python-version: '3.14' - name: Check presence of release notes in PR description uses: AbsaOSS/release-notes-presence-check@8e586b26a5e27f899ee8590a5d988fd4780a3dbf diff --git a/README.md b/README.md index fbd17f0..f90df00 100644 --- a/README.md +++ b/README.md @@ -97,7 +97,7 @@ effective descriptions and bodies, [testing](./docs/skill-testing.md), and the P To propose a new skill — or to propose expanding the repo into agents, MCP servers, or plugins — [open an issue](https://github.com/AbsaOSS/agentic-toolkit/issues/new). -Contributions are welcome from the ABSA technical team. +Contributions are welcome from anyone. ## FAQ diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 1ded6f4..d1121c3 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -17,16 +17,16 @@ Steps to diagnose: 4. If the skill still won't trigger, consider improving the description. See the [Contributing Guide](../CONTRIBUTING.md#3-writing-the-description) for guidance. -## Certificate errors (Netskope / corporate proxy) +## SSL / certificate errors (corporate proxy) -If `npx skills` fails with SSL/certificate errors, configure npm to trust your corporate certificate bundle. Our -corp certs are stored in [de-developer-configuration](https://github.com/absa-group/de-developer-configuration). +If `npx skills` fails with SSL or certificate errors, you are likely behind a corporate proxy or TLS inspection +tool. Configure npm to trust your organisation's certificate bundle: ```bash -npm config set cafile ~/netskope-bundle.pem +npm config set cafile /path/to/your/ca-bundle.pem ``` -This writes the setting to `~/.npmrc`. Replace the filename with the path to your actual certificate bundle. +This writes the setting to `~/.npmrc`. Obtain the certificate bundle from your organisation's IT or security team. ## Skill removal doesn't update the lock file