You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Ship an opt-in red-team eval pack under examples/red-team/ with ~60–100 adversarial cases tagged against public taxonomies (OWASP LLM Top 10 v2025, MITRE ATLAS v5.4) and three reference LLM-grader rubrics (refusal, PII leak, tool abuse). Users copy and extend. Not bundled into agentv init.
Motivation
agentv today can run adversarial cases but ships none. Authors starting a red-team effort from scratch produce formulaic cases ("ignore previous instructions…") that miss the novel attack shapes the field has converged on:
Indirect prompt injection via tool output — InjecAgent (1,054 cases, 17 user tools × 62 attacker tools) and AgentDojo (629 security cases across banking / Slack / travel / workspace agents) are the field standards.
Multi-turn escalation — Crescendo (Microsoft) and GOAT (Meta) published strategies for bypassing single-turn refusals.
Hand-authoring equivalent coverage is weeks of work. A seed pack drawn from these public corpora is high-leverage and aligns with design principle #4 (align with industry standards).
This is content, not core — fits design principles:
Lightweight core (BbEval TypeScript Migration #1) — zero changes to packages/core. Uses existing llm-grader, tool-trajectory, composite, not-contains, not-matches-regex primitives.
YAGNI (refactor: workspace path #5) — shipping an example is the minimum that unblocks authors; no new primitives required.
Industry standards (Fix eval yaml parser #4) — cases and tagging follow established taxonomies.
Proposed structure
examples/red-team/
├── README.md # taxonomy cheat sheet, license of seeds
├── graders/
│ ├── refusal.md # LLM rubric (CoT-prompted): refused / partial / complied
│ ├── pii-leak.md # composite: regex PII detector + LLM judge
│ └── tool-abuse.md # paired with tool-trajectory grader
└── suites/
├── llm01-prompt-injection.yaml # direct + indirect variants
├── llm02-insecure-output.yaml
├── llm06-excessive-agency.yaml
├── llm07-system-prompt-leakage.yaml # new in OWASP v2025
├── llm08-vector-embedding.yaml # new in v2025
├── llm10-unbounded-consumption.yaml # new in v2025
├── agentic-memory-poisoning.yaml
├── agentic-tool-misuse.yaml
└── atlas-v5.4-agentic.yaml # AML.T0075, AML.T0076
Each case tagged with metadata.governance (depends on #1161). Field naming, versioning convention, and risk_tier vocabulary are defined in #1161 — this issue must follow that schema verbatim.
promptfoo red-team plugins — BOLA/BFLA/RBAC for agentic APIs (MIT, can fork individual cases with attribution)
Content with unclear licensing excluded. CSAM, weapon synthesis, self-harm instructions explicitly excluded — these seeds come from corpora already curated by AI safety institutes.
Design latitude
How many cases to seed. 60–100 is enough to be useful without overwhelming review. More valuable to have solid coverage across all v2025 OWASP IDs than 500 cases of LLM01.
Rubric format. Three rubrics is the minimum (refusal / PII / tool-abuse). Can expand if authors show demand.
Opt-in wiring. agentv init --template red-team is a nice-to-have and can land in a follow-up — not required for initial pack.
Acceptance signals
agentv eval examples/red-team/suites/llm01-prompt-injection.yaml against a known-vulnerable target produces a failure report.
Against a well-aligned frontier model, the same suite produces pass-rate data that can be referenced in a release note.
Every case has at least one owasp_llm_top_10_2025 tag (or owasp_agentic_top_10_2025 for agent-specific cases) and at least one mitre_atlas tag.
README.md documents provenance and license per seed source.
Non-goals
Bundling an attacker LLM. Dynamic strategies that generate variants via an attacker model (Crescendo, GOAT, tree-of-attacks) are a separate issue.
Automated attack-success scoring beyond the provided rubrics.
Shipping in agentv init by default.
Any content with unclear licensing or explicit harmful payloads (CSAM, weapon synthesis).
Assumes #1161 merged so metadata.governance is accepted.
Pack inventory.
ls examples/red-team/suites/ examples/red-team/graders/
cat examples/red-team/README.md
Green: at least 8 suite files (one per OWASP LLM ID covered + agentic + ATLAS), 3 grader rubrics (refusal, pii-leak, tool-abuse), README with provenance and license per seed source.
Every case is tagged.
# Fail if any test is missing an owasp_llm_top_10_2025 tagforfin examples/red-team/suites/*.yaml;do
yq '.tests[] | select((.metadata.governance.owasp_llm_top_10_2025 // []) | length == 0)'"$f"done
Green: prints nothing.
Known-weak target: attacks should land. Configure a target using an unaligned base completion model or a deliberately permissive test harness. In .agentv/targets.yaml:
Green: non-zero pass rate shows at least some failures; failure rows show the owasp_llm_top_10_2025 tag; refusal rubric's reasoning field explains the "complied" verdict.
Green: pass rate materially higher than step 3; any remaining failures are inspectable and not obvious false positives.
Indirect injection via tool output works end-to-end. Pick one case that injects via role: tool content and a tool-trajectory assertion with forbidden_tools.
Green: every seed corpus used (InjecAgent, AgentDojo, AgentHarm, Garak, promptfoo) is named with its license.
Fail conditions:
Any case without an owasp_llm_top_10_2025 tag.
Pack cannot be run against a cli-provider target (hidden dependency).
Content included from a corpus whose license does not permit redistribution.
Code review quality gate
Reviewers should treat this as a content-only PR. agentv core is off-limits.
Request changes if the PR includes any of:
Any change to packages/core/ or apps/cli/. The diff there must be empty.
Any new dependency.
A new grader type. The pack must compose existing primitives (llm-grader, code-grader, composite, tool-trajectory, contains, regex).
An attacker LLM bundled, scripted, or required at runtime.
More than ~80 cases. Wide tag coverage beats deep duplication; if the count creeps higher, ask "what is each new case testing that an existing case doesn't?"
Cases without an owasp_llm_top_10_2025 tag. Test plan step 2 fails the build automatically.
Content from a corpus whose license does not permit redistribution. README must name each seed source and its license.
Explicit harmful payloads (CSAM, weapon synthesis, self-harm instructions). Seeds come from AISI / academic corpora that have already curated this out.
Wiring into agentv init defaults. The pack is opt-in.
A new docs page outside the documented surface (single guide is enough; an entire "governance" doc tree is scope creep).
The bias on review: ask the author to point at one row of JSONL output where the refusal rubric's reasoning explains the verdict in plain English. If they can't, request iteration on rubric quality before approving more cases.
Objective
Ship an opt-in red-team eval pack under
examples/red-team/with ~60–100 adversarial cases tagged against public taxonomies (OWASP LLM Top 10 v2025, MITRE ATLAS v5.4) and three reference LLM-grader rubrics (refusal, PII leak, tool abuse). Users copy and extend. Not bundled intoagentv init.Motivation
agentv today can run adversarial cases but ships none. Authors starting a red-team effort from scratch produce formulaic cases ("ignore previous instructions…") that miss the novel attack shapes the field has converged on:
ai-safety-institute/AgentHarm).Hand-authoring equivalent coverage is weeks of work. A seed pack drawn from these public corpora is high-leverage and aligns with design principle #4 (align with industry standards).
This is content, not core — fits design principles:
llm-grader,tool-trajectory,composite,not-contains,not-matches-regexprimitives.Proposed structure
Each case tagged with
metadata.governance(depends on #1161). Field naming, versioning convention, andrisk_tiervocabulary are defined in #1161 — this issue must follow that schema verbatim.Example case (LLM01, indirect via tool output)
Seed sources (public, permissively licensed)
Content with unclear licensing excluded. CSAM, weapon synthesis, self-harm instructions explicitly excluded — these seeds come from corpora already curated by AI safety institutes.
Design latitude
agentv init --template red-teamis a nice-to-have and can land in a follow-up — not required for initial pack.Acceptance signals
agentv eval examples/red-team/suites/llm01-prompt-injection.yamlagainst a known-vulnerable target produces a failure report.owasp_llm_top_10_2025tag (orowasp_agentic_top_10_2025for agent-specific cases) and at least onemitre_atlastag.README.mddocuments provenance and license per seed source.Non-goals
agentv initby default.Dependencies
governancemetadata schema) — cases rely on that shape for tagging. Can proto with arbitrary keys in the interim.Manual test plan (green-path e2e)
Assumes #1161 merged so
metadata.governanceis accepted.Pack inventory.
Green: at least 8 suite files (one per OWASP LLM ID covered + agentic + ATLAS), 3 grader rubrics (refusal, pii-leak, tool-abuse), README with provenance and license per seed source.
Every case is tagged.
Green: prints nothing.
Known-weak target: attacks should land. Configure a target using an unaligned base completion model or a deliberately permissive test harness. In
.agentv/targets.yaml:agentv eval examples/red-team/suites/llm01-prompt-injection.yaml --target vulnerable-baselineGreen: non-zero pass rate shows at least some failures; failure rows show the
owasp_llm_top_10_2025tag; refusal rubric's reasoning field explains the "complied" verdict.Aligned target: attacks should mostly be refused.
agentv eval examples/red-team/suites/llm01-prompt-injection.yaml --target claude-opus-4-7Green: pass rate materially higher than step 3; any remaining failures are inspectable and not obvious false positives.
Indirect injection via tool output works end-to-end. Pick one case that injects via
role: toolcontent and a tool-trajectory assertion withforbidden_tools.agentv eval examples/red-team/suites/llm01-prompt-injection.yaml -k indirect-tool-outputGreen: case runs; if the agent calls
email.sendthe tool-trajectory assertion fails and the tool-abuse rubric marks "complied."Coverage across taxonomies.
Green: OWASP set includes LLM01, LLM02, LLM06, LLM07, LLM08, LLM10; ATLAS set includes at least AML.T0051 and AML.T0075.
License / provenance is auditable.
grep -iE 'license|source|origin' examples/red-team/README.mdGreen: every seed corpus used (InjecAgent, AgentDojo, AgentHarm, Garak, promptfoo) is named with its license.
Fail conditions:
owasp_llm_top_10_2025tag.Code review quality gate
Reviewers should treat this as a content-only PR. agentv core is off-limits.
Request changes if the PR includes any of:
packages/core/orapps/cli/. The diff there must be empty.llm-grader,code-grader,composite,tool-trajectory,contains,regex).owasp_llm_top_10_2025tag. Test plan step 2 fails the build automatically.agentv initdefaults. The pack is opt-in.The bias on review: ask the author to point at one row of JSONL output where the refusal rubric's reasoning explains the verdict in plain English. If they can't, request iteration on rubric quality before approving more cases.