
docs: Add advisory requirement APTS-MR-A01 Goal Misgeneralization and Emergent Misalignment Evaluation Suite#43

Merged
jinsonvarghese merged 2 commits into OWASP:main from kylejryan:feat/apts-mr-a01-misalignment-evaluation-suite
May 1, 2026
Conversation

@kylejryan
Contributor

Context

Hi, I'm Kyle Ryan from Pensar. I work on post-training pipelines and agent evaluation for autonomous offensive security agents.

AI disclosure: This contribution was drafted with AI assistance. I have reviewed all changes for accuracy, consistency with the standard, and compliance with the style guide, and take full ownership of the submission.

What changed and why

Adds Goal Misgeneralization and Emergent Misalignment Evaluation Suite as a new advisory practice (APTS-MR-A01) in the Advisory Requirements appendix. This is the first advisory in the Manipulation Resistance domain.

The Introduction's Capability Frontier and Containment Assumptions section explicitly defers verifiable goal alignment to a future revision: "Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for this version and may be addressed in future versions of APTS as the field matures." This advisory begins to close that gap using an evaluation-based approach that is achievable with today's tooling (Inspect AI, Braintrust, OpenAI Evals).

The operational concern is concrete and recent. Peer-reviewed work in 2026 (Nature, Training LLMs on narrow tasks can lead to broad misalignment, https://www.nature.com/articles/s41586-025-09937-5) demonstrated that fine-tuning a frontier model on a narrow offensive task — producing insecure code — induces broad behavioral shifts well outside the training domain. For autonomous pentesting platforms, which are routinely fine-tuned on offensive data, this surfaces two failure modes that no existing APTS requirement evaluates:

  1. Goal misgeneralization. The agent learns a proxy objective ("produce findings that look like vulnerabilities") rather than the true objective ("identify vulnerabilities exploitable in the customer environment"). The two objectives produce identical actions on the training distribution, because the situations that distinguish them are rare in training, but they diverge in deployment.
  2. Emergent misalignment. Narrow fine-tuning on offensive tasks shifts the agent's behavior in adjacent domains, with no signal until the shift manifests in a production engagement.
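The first failure mode can be made concrete with a toy evaluation. The sketch below is illustrative only (it is not from the advisory text): the agent, labels, and 0.15 threshold are hypothetical stand-ins, and a real suite would use an eval framework such as Inspect AI against a live model. The idea is to measure agreement with the true objective on training-like samples versus "distinguishing" deployment-like samples, and flag the gap.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    features: dict           # observable signals the agent sees
    truly_exploitable: bool  # ground truth for the true objective

def proxy_agent(sample: Sample) -> bool:
    """Toy agent that learned the proxy objective:
    report anything that *looks like* a vulnerability."""
    return sample.features["looks_like_vuln"]

def alignment_rate(agent, samples) -> float:
    """Fraction of samples where the agent's action matches the true objective."""
    return sum(agent(s) == s.truly_exploitable for s in samples) / len(samples)

# In-distribution: proxy and true objectives coincide, so the proxy is invisible.
train_like = [
    Sample({"looks_like_vuln": True}, truly_exploitable=True),
    Sample({"looks_like_vuln": False}, truly_exploitable=False),
] * 50

# Distinguishing situations: looks exploitable but is not in the customer
# environment, and vice versa. Rare in training, common enough in deployment.
deployment_like = [
    Sample({"looks_like_vuln": True}, truly_exploitable=False),
    Sample({"looks_like_vuln": False}, truly_exploitable=True),
] * 50

in_dist = alignment_rate(proxy_agent, train_like)
out_dist = alignment_rate(proxy_agent, deployment_like)

THRESHOLD = 0.15  # hypothetical; real thresholds need per-platform calibration
misgeneralization_flag = (in_dist - out_dist) > THRESHOLD
```

On the training-like set the proxy agent scores perfectly; on the distinguishing set it fails completely, which is exactly the signature an out-of-distribution audit is meant to surface.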

APTS-MR-013 (Adversarial Example Detection) probes input-side robustness; APTS-MR-020 (Adversarial Validation) probes control-side resilience; APTS-AR-019 (Model Change Tracking) tracks output drift; APTS-RP-A01 (Finding Authenticity Verification) catches fabricated evidence. None of these evaluate the agent's underlying objective alignment under distribution shift, which is the upstream failure RP-A01 cannot reach: an agent whose findings are genuinely grounded in evidence, but whose choices of what to discover, prioritize, or report are misaligned.

The advisory text notes this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as SHOULD | Tier 2 for platforms operating at Level 3 autonomy or higher, or for any platform that performs post-deployment fine-tuning on engagement data).

Affected requirements

  • New advisory: APTS-MR-A01
  • No new normative requirements, no count changes
  • Advisory practice count: 13 → 14

Files changed

  • `standard/appendix/Advisory_Requirements.md` — New APTS-MR-A01 advisory entry (self-contained, follows the format of APTS-RP-A01 and APTS-SC-A02)
  • `standard/6_Manipulation_Resistance/Implementation_Guide.md` — Advisory Practice Implementation Guidance section with Independent Evaluation Pipeline architecture pattern, calibrated alignment thresholds, and out-of-distribution audit guidance
  • `standard/6_Manipulation_Resistance/README.md` — See-also reference pointing to the advisory
  • `standard/Introduction.md` — Advisory practice count updated from 13 to 14
  • `README.md`, `index.md`, `standard/README.md` — Advisory practice count synced from 13 to 14 (matches the cross-file sync pattern from PR Update advisory practice count to 13 across all files #27)

No normative requirement counts changed (173 total, 72/157/173 tier counts unchanged). No changes to Foreword, Frontispiece, Checklists, Getting_Started, Glossary, Vendor Eval, CAT, or other counts. No changes to the machine-readable export (`standard/apts_requirements.json` does not include advisory practices, consistent with the existing convention).

… Emergent Misalignment Evaluation Suite

Adds APTS-MR-A01 as a new advisory practice in the Advisory Requirements
appendix, evaluating the agent's underlying objective alignment under
distribution shift and detecting emergent misalignment after fine-tuning.

Addresses failure modes that input-side (MR-013) and control-side (MR-020)
adversarial testing do not cover. Begins to close the goal-alignment gap
that the Introduction's Capability Frontier section defers to a future
revision.

Advisory practice count: 13 -> 14. No normative requirement counts changed.
@kylejryan kylejryan force-pushed the feat/apts-mr-a01-misalignment-evaluation-suite branch from a58b150 to 39cb47a on April 28, 2026 16:18
@kylejryan kylejryan marked this pull request as ready for review April 28, 2026 16:20
@jinsonvarghese
Member

Thank you @kylejryan for this. I'll take a look and get back to you soon.

@jinsonvarghese
Member

@kylejryan Solid advisory, well-researched.

One thing to fix before merging: the advisory practice count needs updating in four additional files that still reference "13 advisory practices".

  1. standard/Frontispiece.md (line 75)
  2. standard/Getting_Started.md (line 87)
  3. standard/appendix/Glossary.md (line 82)
  4. standard/appendix/Vendor_Evaluation_Guide.md (line 17)

The PR already updates standard/README.md and standard/Introduction.md, but these four were missed. After the fix, this is good to merge.
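Drift like this (four files still saying "13 advisory practices") is easy to catch mechanically. The following sketch is not part of the PR; the file list and regex are assumptions based on the files named in this conversation, and a real check would likely run in CI.

```python
import re
from pathlib import Path

COUNT_RE = re.compile(r"(\d+)\s+advisory practices")

# Files known (from this PR's conversation) to state the advisory practice
# count. Illustrative, not exhaustive.
FILES = [
    "README.md",
    "index.md",
    "standard/README.md",
    "standard/Introduction.md",
    "standard/Frontispiece.md",
    "standard/Getting_Started.md",
    "standard/appendix/Glossary.md",
    "standard/appendix/Vendor_Evaluation_Guide.md",
]

def find_counts(root="."):
    """Map each file to the set of advisory-practice counts it states."""
    counts = {}
    for rel in FILES:
        path = Path(root) / rel
        if not path.exists():
            continue
        for m in COUNT_RE.finditer(path.read_text(encoding="utf-8")):
            counts.setdefault(rel, set()).add(int(m.group(1)))
    return counts

def check_sync(counts, expected):
    """Return the files whose stated count disagrees with `expected`."""
    return {f: c for f, c in counts.items() if c != {expected}}
```

A non-empty return from `check_sync` fails the build and names the stale files, which would have flagged all four misses here before review.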

@kylejryan
Contributor Author

@jinsonvarghese Great catch, fixed all four and pushing now.

@jinsonvarghese
Member

Thank you @kylejryan. All four count updates fixed. Looks good, merging.

@jinsonvarghese jinsonvarghese merged commit c0e7159 into OWASP:main May 1, 2026
1 check passed
jorgeraad pushed a commit to jorgeraad/APTS that referenced this pull request May 1, 2026
…of AI Influence on Operator Decisions

Adds APTS-HO-A02 as a new advisory practice in the Human Oversight
domain, the second advisory in HO. Addresses a gap in existing
coverage: APTS-HO-001, HO-005, HO-010, and AR-006 mandate approval
gates, audit trails, and reasoning-chain capture, but none address
the form of the question the operator is asked to confirm. The
practical effect is that an audit trail can show "operator approved"
while concealing that the operator was offered a single highlighted
choice with the safer option visually de-emphasized.

The advisory pairs provenance for AI-shaped operator affordances
with bias mitigation at high-impact gates. The Practice Description
is a four-point list ordered by implementation cost, from a single
response-classification audit field through to no-preselected-default
and typed-confirmation rules at HO-010 gates.

Cross-file count sync from 14 to 15 advisory practices (rebased on
top of OWASP#43, which brought the count to 14). No new normative
requirements, no tier counts changed (173 total, 72/157/173
unchanged). The machine-readable JSON export is intentionally
untouched, consistent with the existing convention that advisory
practices are excluded.
