Adversarial Self-Testing: How an Agent Finds Its Own Vulnerabilities Without Millions of Users #24

Liuyanfeng1234 (Maintainer) · 2026-06-12T12:56:47Z

Claude Code benefits from a massive user base: every interaction is a potential bug report, every edge case is crowd-sourced, and every defense is stress-tested by real-world usage. This is the "external flood" model of quality assurance: let the world break your system, then fix what breaks.

We don't have that luxury. And we don't want to wait for it.

The Endogenous Evolution Problem

Systems with large user bases have an implicit advantage: external demand generates external pressure, which generates external bug discovery. The feedback loop is:

Users → edge cases → failures → fixes → better system → more users

Systems without large user bases face a different problem:

No users → no edge cases → no failures → no fixes → stagnant system

This isn't just a growth problem. It's an evolution problem. A system that can't discover its own weaknesses can't improve. A system that can't improve can't attract users. A system that can't attract users can't discover weaknesses. It's a death spiral.

The only way out: the system must generate its own adversarial pressure.

Our Approach: Adversarial Self-Testing Pipeline

We've built a three-stage adversarial self-testing pipeline that simulates external pressure without external users:

Stage 1: Vulnerability Hypothesis Generation

The system analyzes its own architecture to generate hypotheses about where it might be vulnerable:

Attack surface mapping: Every component's input boundaries are cataloged — text analysis, decomposition engine, risk tiering, audit system
Historical pattern replay: Past vulnerabilities (Fable 5's character obfuscation, request decomposition, silent degradation) are replayed against our architecture to test whether the same patterns would succeed
Gap inference: Where a component has a defense but no test proving the defense works, a gap is flagged as a hypothesis

This stage doesn't require external users. It requires the system to understand its own architecture well enough to ask "where would I attack myself?"
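Gap inference in particular is easy to sketch. The snippet below is a minimal illustration, not the pipeline's actual code: the `Component` and `Hypothesis` types, the `infer_gaps` function, and the defense names are all hypothetical. It assumes each component records which defenses it claims and which of those have a test behind them, and flags the difference.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One entry in the attack surface map (all field names are hypothetical)."""
    name: str
    defenses: set[str] = field(default_factory=set)         # defenses the component claims
    tested_defenses: set[str] = field(default_factory=set)  # defenses with a test proving them


@dataclass(frozen=True)
class Hypothesis:
    component: str
    defense: str
    reason: str


def infer_gaps(components: list[Component]) -> list[Hypothesis]:
    """Flag every claimed defense that has no test proving it works."""
    hypotheses = []
    for comp in components:
        # Sorted so the output order is stable from run to run.
        for defense in sorted(comp.defenses - comp.tested_defenses):
            hypotheses.append(
                Hypothesis(comp.name, defense, "defense present but untested")
            )
    return hypotheses


# Illustrative surface map; the defense names are made up for this example.
surface = [
    Component("text analysis",
              {"unicode normalization", "homoglyph folding"},
              {"unicode normalization"}),
    Component("risk tiering", {"sequence-level scoring"}, set()),
    Component("audit system", {"degradation alert"}, {"degradation alert"}),
]

for h in infer_gaps(surface):
    print(f"{h.component}: {h.defense} ({h.reason})")
```

Here "text analysis" and "risk tiering" each yield one hypothesis, while "audit system" yields none because its only defense already has a test.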

Stage 2: Adversarial Input Generation

For each vulnerability hypothesis, the system generates adversarial inputs designed to trigger the hypothesized weakness:

Character-level obfuscation: Unicode substitutions, zero-width characters, homoglyph attacks, spacing manipulation
Structural decomposition: Split malicious intent across multiple requests, recombine at execution time
Semantic drift: Legitimate-seeming requests that, when executed in sequence, produce a harmful outcome
Governance bypass: Requests that pass all safety checks but would degrade A1/A2/A3 scores if executed

Each adversarial input is generated with a specific hypothesis: "I believe this input will bypass Layer N because of vulnerability hypothesis H."
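As a rough sketch of the character-level category, the following generates three obfuscated variants of a payload and tags each with the hypothesis it is meant to test. The function names, the record layout, and the tiny homoglyph table are assumptions made for illustration; a real generator would draw on a much larger confusables list.

```python
ZERO_WIDTH_SPACE = "\u200b"

# Latin -> Cyrillic look-alikes. A tiny illustrative table, not a full confusables list.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}


def zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def homoglyph(text: str) -> str:
    """Swap Latin letters for visually similar Cyrillic ones where the table has a match."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def spaced(text: str) -> str:
    """Separate every character with a regular space."""
    return " ".join(text)


def variants(payload: str, hypothesis_id: str) -> list[dict]:
    """Pair each obfuscated variant with the hypothesis it is meant to test."""
    techniques = {"zero_width": zero_width, "homoglyph": homoglyph, "spacing": spaced}
    return [
        {"hypothesis": hypothesis_id, "technique": name, "input": fn(payload)}
        for name, fn in techniques.items()
    ]


for v in variants("delete", "H1"):
    # repr() makes the invisible and look-alike characters visible in the output.
    print(v["technique"], repr(v["input"]))
```

Keeping the hypothesis ID on every generated input is what lets the verification stage report which hypothesis a result confirms or disproves.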

Stage 3: Autonomous Verification

Each adversarial input is executed against the system in a sandboxed environment:

Pass: The defense layer correctly blocked the input → hypothesis disproven, defense confirmed
Partial pass: The input was partially blocked but some sub-component executed → defense gap confirmed, severity assessed
Fail: The input fully bypassed the defense → critical vulnerability, immediate escalation to DASB for priority ranking

The verification results feed back into the hypothesis generation stage: confirmed gaps become new test patterns, and confirmed defenses lower the priority of the hypotheses they disproved.
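The three outcomes and the priority feedback can be expressed compactly. This is a sketch under stated assumptions: the sandbox is assumed to report whether the defense blocked the input and which sub-components still executed, and the halving factor in `next_priority` is an arbitrary illustrative choice, not a value from the pipeline.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"             # defense blocked the input
    PARTIAL = "partial pass"  # blocked, but some sub-component still executed
    FAIL = "fail"             # input fully bypassed the defense


def classify(blocked: bool, executed_subcomponents: list[str]) -> Outcome:
    """Map a sandbox observation onto the three verification outcomes."""
    if blocked and not executed_subcomponents:
        return Outcome.PASS
    if blocked:
        return Outcome.PARTIAL
    return Outcome.FAIL


def next_priority(priority: float, outcome: Outcome) -> float:
    """Lower a hypothesis's priority after a confirmed defense; otherwise keep it.

    The 0.5 decay factor is an assumption made for this example.
    """
    return priority * 0.5 if outcome is Outcome.PASS else priority


print(classify(True, []))                        # Outcome.PASS
print(classify(True, ["decomposition engine"]))  # Outcome.PARTIAL
print(classify(False, []))                       # Outcome.FAIL
```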

The Self-Testing Loop

The three stages form a continuous loop:

Hypothesis Generation → Adversarial Input → Autonomous Verification
        ↑                                                    ↓
        └────────── Results Feed Back ───────────────────────┘

Each cycle produces:

Confirmed defenses: Components that correctly blocked their hypothesized attacks
Confirmed gaps: Components that failed — ranked by DASB for priority fixing
New hypotheses: Derived from gap patterns — "if Layer N failed on character obfuscation, does Layer N+1 have the same vulnerability?"
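A toy version of the loop shows how a confirmed gap spawns the follow-up hypothesis for the next layer. Everything here is hypothetical: `run_check` stands in for sandboxed verification, and the three-layer stack in which only layer 3 handles homoglyphs is invented for the example.

```python
from collections import deque

NUM_LAYERS = 3  # hypothetical depth of the defense stack


def run_check(layer: int, technique: str) -> bool:
    """Stand-in for sandboxed verification: True if the layer blocked the input.

    This toy stack assumes only layer 3 handles homoglyphs.
    """
    return not (technique == "homoglyph" and layer < 3)


def self_test(seed: list[tuple[int, str]]) -> tuple[list, list]:
    """Run hypotheses until the queue is empty, enqueueing follow-ups for each gap."""
    queue = deque(seed)
    seen = set(seed)
    defenses, gaps = [], []
    while queue:
        layer, technique = queue.popleft()
        if run_check(layer, technique):
            defenses.append((layer, technique))
            continue
        gaps.append((layer, technique))
        follow_up = (layer + 1, technique)  # does Layer N+1 share the weakness?
        if layer < NUM_LAYERS and follow_up not in seen:
            seen.add(follow_up)
            queue.append(follow_up)
    return defenses, gaps


defenses, gaps = self_test([(1, "homoglyph"), (1, "zero_width")])
print("confirmed defenses:", defenses)
print("confirmed gaps:", gaps)
```

In this run the homoglyph hypothesis fails at layers 1 and 2 and is finally blocked at layer 3, so one seed hypothesis produces two confirmed gaps and one confirmed defense. The `seen` set keeps the loop from re-testing the same layer and technique pair.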

What This Replaces

The external flood model (Claude Code's approach) produces bug reports from real users. The adversarial self-testing model produces bug reports from the system itself. The difference:

| Dimension | External Flood | Adversarial Self-Testing |
|---|---|---|
| Discovery speed | Slow (wait for users) | Fast (continuous generation) |
| Coverage | Biased (user behavior) | Systematic (architecture-driven) |
| Reproducibility | Variable | Deterministic |
| Adversarial creativity | High (real attackers) | Lower (system imagination) |
| Cost | Free (user labor) | Compute cost |

The key trade-off: adversarial self-testing is faster and more systematic, but less creatively adversarial than real attackers. We compensate by replaying known attack patterns (Fable 5, Henri's MCP vectors, giskard09's near-miss-v1) — the system learns from the community's attacks even without the community's users.

The Strategic Implication

For systems on an endogenous evolution path, adversarial self-testing isn't optional. It's the only way to break the death spiral:

No users → self-generate adversarial pressure → discover gaps → fix gaps → 
better system → attract users → real adversarial pressure → even better system

The self-testing pipeline is the bridge between "no external demand" and "enough external demand to sustain evolution." Without it, the system stagnates. With it, the system evolves — and the evolution itself becomes evidence that attracts external users.

The Open Question

Adversarial self-testing generates internal pressure. But the most creative attacks come from external adversaries. The question for the community:

How can agent systems share adversarial test vectors — so that an attack discovered against one system becomes a test for all systems?

If we had a shared registry of adversarial patterns (character obfuscation techniques, decomposition strategies, governance bypass methods), every system could test itself against the community's collective attack knowledge. This is the infrastructure that turns isolated self-testing into a network effect.
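To make the idea concrete, here is one possible shape for a registry record. No such registry or schema exists yet; the `AttackPattern` fields, the content-derived ID, and the JSON layout are all assumptions offered as a starting point for discussion.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttackPattern:
    """A shareable test vector (hypothetical schema)."""
    category: str   # e.g. "character obfuscation"
    technique: str  # e.g. "zero_width"
    payload: str    # the adversarial input itself
    source: str     # who reported it


def pattern_id(p: AttackPattern) -> str:
    """Content-derived ID, so the same vector submitted twice deduplicates.

    The reporter is deliberately excluded from the hash.
    """
    canonical = json.dumps(
        {"category": p.category, "technique": p.technique, "payload": p.payload},
        sort_keys=True,
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def to_record(p: AttackPattern) -> str:
    """Serialize a pattern as one JSON line for publication."""
    return json.dumps({"id": pattern_id(p), **asdict(p)}, sort_keys=True)


a = AttackPattern("character obfuscation", "zero_width", "d\u200be\u200bl", "system-a")
b = AttackPattern("character obfuscation", "zero_width", "d\u200be\u200bl", "system-b")
print(to_record(a))
print(pattern_id(a) == pattern_id(b))  # same vector, different reporters
```

Deriving the ID from the vector's content rather than from its reporter means two systems that independently find the same attack contribute one registry entry, not two.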

The adversarial self-testing pipeline is part of Agent OS v1.4. Attack pattern definitions and verification results will be published as the pipeline matures.
