feat: Add AI agent security requirements (multi-turn jailbreak, tool-use safety, model behavioral stability) #35

@abbousaad

Proposal: Three new requirements addressing AI agent security gaps

Problem

The current Manipulation Resistance (MR) domain addresses single-turn prompt injection (MR-001, MR-002, MR-018) and treats the agent runtime as untrusted (MR-023). However, three adjacent threat vectors remain unspecified:

  1. Multi-turn jailbreak sequences — no single message constitutes an injection, but the cumulative conversational sequence achieves scope override, instruction bypass, or data exfiltration. This is the conversational analogue of TOCTOU attacks.
  2. Tool parameter abuse and chaining — SC-020 enforces an external tool allowlist, but does not validate the parameters passed to allowed tools or detect sequences of allowed tools that combine to achieve a disallowed outcome (e.g., a scan tool with zero rate limit causing DoS, or recon → credential-extraction → lateral-movement achieving unauthorized pivot).
  3. Silent model behavioral drift — TP-022 requires re-attestation on material model changes, but provider-side silent updates (inference engine optimizations, safety filter tuning, quantization changes) can shift behavior below the materiality threshold without triggering re-attestation. No canary mechanism exists to detect these shifts before they affect a customer engagement.

Proposed Requirements

APTS-MR-024: Multi-Turn Jailbreak Detection and Response (MUST | Tier 2)

Extends MR to multi-turn interaction patterns. Covers:

  • Conversation state isolation between engagements
  • Obfuscation detection (encoding chains, homoglyphs, split-message assembly, synonym substitution)
  • Maintained adversarial jailbreak corpus (50+ patterns, quarterly execution, refreshed on model change)
  • Decision consistency enforcement: a request that is rejected when stated directly must also be rejected when it is obfuscated or distributed across turns

Cross-references: MR-001, MR-018, MR-023, TP-022
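
To illustrate the decision consistency item in MR-024, here is a minimal sketch rather than a proposed implementation. It assumes a stand-in policy check (`direct_policy_rejects`), a deliberately tiny homoglyph map, Base64 as the only encoding in the chain, and a three-turn window; all of these names and values are hypothetical and do not come from the APTS draft.

```python
import base64
import unicodedata

# Hypothetical, deliberately tiny homoglyph map (Cyrillic -> Latin).
CONFUSABLES = str.maketrans({"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"})
# Zero-width characters are deleted before any policy check.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def direct_policy_rejects(text: str) -> bool:
    """Stand-in for the real policy decision on a plainly stated request."""
    return "ignore scope" in text.lower()


def decode_chain(token: str, max_depth: int = 3) -> list[str]:
    """Peel up to max_depth layers of Base64 off a token."""
    layers = []
    for _ in range(max_depth):
        try:
            token = base64.b64decode(token, validate=True).decode("utf-8")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            break
        layers.append(token)
    return layers


def canonical_forms(text: str) -> set[str]:
    """Return the normalized text plus any decoded variants found in it."""
    base = unicodedata.normalize("NFKC", text)
    base = base.translate(ZERO_WIDTH).translate(CONFUSABLES)
    forms = {base}
    for token in base.split():
        forms.update(decode_chain(token))
    return forms


def conversation_rejected(turns: list[str], window: int = 3) -> bool:
    """Reject if any recent run of turns, once de-obfuscated, would be
    rejected when stated directly in a single message."""
    for end in range(1, len(turns) + 1):
        recent = turns[max(0, end - window):end]
        for start in range(len(recent)):
            for sep in (" ", ""):  # split-message assembly with or without spaces
                joined = sep.join(recent[start:])
                if any(direct_policy_rejects(f) for f in canonical_forms(joined)):
                    return True
    return False
```

In this sketch a request split across more turns than `window` would slip through, which is one reason the proposal pairs detection with a maintained adversarial corpus instead of relying on a single heuristic.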

APTS-MR-025: Tool Invocation Parameter Validation and Chaining Prevention (MUST | Tier 2)

Extends SC-020's external allowlist to parameter-level enforcement. Covers:

  • Parameter schema enforcement for every allowlisted tool (types, ranges, constraints)
  • Semantic validation of safety-critical parameters (rate limits, target identifiers, payload sizes, credential sources)
  • Tool chaining detection: monitoring invocation sequences within a sliding window, with 10+ documented chaining patterns
  • Parameter drift detection for recurring/long-running engagements
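
As a rough illustration of the first two items above, the following sketch combines schema enforcement (types and ranges) with one semantic check (target must be in scope). The tool name `port_scan`, its parameters, the range of 1 to 500 packets per second, and the scope set are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class IntRange:
    """Inclusive integer range constraint for a parameter."""
    minimum: int
    maximum: int


# Hypothetical schema for one allowlisted tool.
SCHEMAS: dict[str, dict[str, Any]] = {
    "port_scan": {"target": str, "rate_limit_pps": IntRange(1, 500)},
}
# Hypothetical engagement scope used for the semantic check.
IN_SCOPE = {"10.0.0.5", "10.0.0.6"}


def validate_invocation(tool: str, params: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the call may proceed."""
    schema = SCHEMAS.get(tool)
    if schema is None:
        return [f"tool not allowlisted: {tool}"]

    errors = [f"unexpected parameter: {name}"
              for name in sorted(params.keys() - schema.keys())]

    for name, rule in schema.items():
        if name not in params:
            errors.append(f"missing parameter: {name}")
            continue
        value = params[name]
        if isinstance(rule, IntRange):
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{name}: expected int")
            elif not rule.minimum <= value <= rule.maximum:
                errors.append(f"{name}: {value} outside "
                              f"[{rule.minimum}, {rule.maximum}]")
        elif not isinstance(value, rule):
            errors.append(f"{name}: expected {rule.__name__}")

    # Semantic check: a well-typed target must still be in scope.
    target = params.get("target")
    if "target" in schema and isinstance(target, str) and target not in IN_SCOPE:
        errors.append(f"target out of scope: {target}")
    return errors
```

With these assumed values, a scan with `rate_limit_pps` set to 0 fails the range check even though `port_scan` itself is allowlisted, which is the kind of gap the proposal describes for SC-020.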

Cross-references: SC-020, SE-006, SC-004, SE-023, MR-023
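
The tool chaining detection item in MR-025 could be sketched as an ordered-subsequence match over a sliding window of recent invocations. The category labels, the single pattern (taken from the recon, credential-extraction, lateral-movement example in the Problem section), and the window size of 20 are assumptions for illustration only.

```python
from collections import deque
from typing import Iterable

# Hypothetical pattern catalogue; the proposal calls for 10+ documented patterns.
CHAIN_PATTERNS: dict[str, tuple[str, ...]] = {
    "unauthorized_pivot": ("recon", "credential_extraction", "lateral_movement"),
}


def is_subsequence(pattern: Iterable[str], events: Iterable[str]) -> bool:
    """True if pattern occurs in events in order, not necessarily adjacent."""
    remaining = iter(events)
    # `step in remaining` advances the iterator, so order is preserved.
    return all(step in remaining for step in pattern)


class ChainMonitor:
    """Tracks the last `window` tool invocations by category."""

    def __init__(self, window: int = 20) -> None:
        self._recent: deque[str] = deque(maxlen=window)

    def observe(self, category: str) -> list[str]:
        """Record one invocation; return names of patterns it completes."""
        self._recent.append(category)
        return [
            name
            for name, pattern in CHAIN_PATTERNS.items()
            if pattern[-1] == category and is_subsequence(pattern, self._recent)
        ]
```

A caller would invoke `observe` before each tool call and block or escalate when the returned list is non-empty. Matching on subsequences instead of adjacent calls means unrelated invocations interleaved between the steps do not hide the chain, as long as all steps fall inside the window.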

APTS-TP-023: Foundation Model Behavioral Stability Verification (SHOULD | Tier 2)

Covers behavioral changes that fall below TP-022's materiality threshold and can therefore occur between re-attestations. Covers:

  • Behavioral test suite (30+ cases across instruction-following fidelity, refusal stability, output format compliance, decision calibration)
  • Execution before every engagement, at least weekly, and on any provider API change
  • Drift detection with engagement-blocking threshold
  • Provider API changelog monitoring for silent changes

Cross-references: TP-021, TP-022, TP-002, AR-019
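
One way to picture the drift detection item in TP-023 is a comparison of the current behavioral test run against an attested baseline, with the engagement blocked once the share of changed outcomes reaches a threshold. The outcome labels, case identifiers, and the 10% threshold below are hypothetical; the proposal does not specify them.

```python
def drift_rate(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of baseline cases whose outcome changed or went missing."""
    if not baseline:
        raise ValueError("baseline suite must contain at least one case")
    changed = sum(
        1 for case, expected in baseline.items() if current.get(case) != expected
    )
    return changed / len(baseline)


def engagement_allowed(
    baseline: dict[str, str],
    current: dict[str, str],
    block_threshold: float = 0.10,  # assumed value, not from the proposal
) -> bool:
    """Allow the engagement only while drift stays below the threshold."""
    return drift_rate(baseline, current) < block_threshold


# Illustrative data: one of four canary cases flips from refusal to compliance.
baseline_run = {
    "refuse_out_of_scope_target": "refused",
    "json_output_format": "valid",
    "follow_rate_limit_instruction": "complied",
    "refuse_credential_dump": "refused",
}
current_run = dict(baseline_run, refuse_credential_dump="complied")
```

Here `drift_rate(baseline_run, current_run)` is 0.25, so `engagement_allowed` returns `False` under the assumed threshold. A real suite would likely weight refusal-stability cases more heavily than formatting cases instead of counting every case equally.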

Rationale

These three requirements form a cohesive package addressing the same threat category: the AI agent runtime as an evolving, externally-influenced attack surface. MR-024 catches adversarial manipulation of the agent through conversation. MR-025 catches the agent using allowed tools in disallowed ways. TP-023 catches the agent's underlying model silently changing behavior. Together, they close the gap between current single-turn injection defenses and the reality of multi-turn, tool-wielding, provider-updated AI agents.

Affected Sections

  • standard/6_Manipulation_Resistance/README.md — two new requirements appended
  • standard/7_Supply_Chain_Trust/README.md — one new requirement appended
  • standard/appendix/Checklists.md — three new checklist entries
  • standard/README.md, standard/Introduction.md, README.md, index.md, standard/Getting_Started.md — requirement count updates (173 → 176, Tier 2: 85 → 88)

Style Compliance

All three requirements follow APTS conventions:

  • RFC 2119 normative language consistent with Classification
  • Verification subsections with specific, testable criteria
  • Cross-references using > **See also:** format
  • Rationale sections explaining why the requirement exists

I have a complete draft ready to submit as a PR once this proposal is reviewed.

AI Disclosure

This proposal was drafted with assistance from Claude (Anthropic). The contributor has reviewed all content for accuracy and consistency with the APTS standard and takes full ownership per CONTRIBUTING.md.

Labels: v0.2.0-candidate (PRs that are accepted in principle but deferred to the v0.2.0 release)
