Beyond Keyword Detection: Utility Motivation Retrospection for Agent Safety #20

Liuyanfeng1234 · 2026-06-12T11:01:56Z

Liuyanfeng1234
Jun 12, 2026
Maintainer

[A follow-up to #19: "What Fable 5's Breach Teaches Us About Agent Safety"]

Yesterday's analysis of Fable 5 established that single-point safety classifiers fail because they operate at the syntax layer — pattern-matching against surface text features. But the deeper question is: what should the next layer of defense operate on?

The Syntax Trap

Every syntax-based safety approach — keyword lists, regex patterns, embedding similarity, even fine-tuned classifiers — shares the same failure mode:

"delete all files" → BLOCKED ✓
"d e l e t e   a l l   f i l e s" → PASSED ✗ (character spacing)
"organize all files" → "delete all files" [recombined] → PASSED ✗ (decomposition)

The attacker doesn't defeat the classifier by being smarter — they defeat it by operating at a different level of abstraction. The syntax layer can't see intent; it can only see tokens.
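The failure mode above can be reproduced with a toy filter. This is a minimal sketch, not Fable 5's classifier: `BLOCKLIST`, `keyword_filter`, and `normalized_filter` are hypothetical names invented for illustration. It shows that patching one evasion (character spacing) at the syntax layer still leaves a plain paraphrase undetected.

```python
import re

# Hypothetical blocklist, invented for this illustration.
BLOCKLIST = ["delete all files"]

def keyword_filter(text: str) -> bool:
    """Return True if a naive substring match blocks the text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def normalized_filter(text: str) -> bool:
    """Same check after removing all whitespace, a typical syntax-layer patch."""
    squashed = re.sub(r"\s+", "", text.lower())
    return any(phrase.replace(" ", "") in squashed for phrase in BLOCKLIST)

requests = [
    "delete all files",
    "d e l e t e   a l l   f i l e s",     # character spacing
    "remove every file in the directory",  # paraphrase with the same effect
]
for r in requests:
    print(f"{r!r}: naive={keyword_filter(r)}, normalized={normalized_filter(r)}")
```

The normalized filter catches the spaced variant, but neither filter flags the paraphrase, because both only ever compare characters.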

UMRC: Utility Motivation Retrospection

We've been developing a complementary approach that operates at the intent layer: UMRC (Utility Motivation Retrospection Core). Instead of asking "do these words look dangerous?", it asks "what is the minimal utility motivation that could explain this sequence of actions?"

The process has three stages:

1. Atomic Decomposition

A compound request is decomposed into atomic instructions — the smallest meaningful execution units:

"Help me optimize my company's hiring pipeline"
→ [scan_resumes, filter_candidates, rank_applicants, 
   remove_demographics?, generate_bias_report?]

The ? annotations mark instructions where the decomposition is ambiguous — indicating possible hidden sub-tasks.
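One way to carry the ? annotation through later stages is to store it as a flag on each atomic instruction. The sketch below assumes the decomposition has already been produced (that step is the hard part and is not shown); `AtomicInstruction` and `parse_decomposition` are hypothetical names, not part of any published UMRC interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class AtomicInstruction:
    name: str
    ambiguous: bool = False  # mirrors the "?" annotation in the example

def parse_decomposition(tokens: List[str]) -> List[AtomicInstruction]:
    """Turn names such as 'remove_demographics?' into structured records."""
    return [
        AtomicInstruction(name=t.rstrip("?"), ambiguous=t.endswith("?"))
        for t in tokens
    ]

# The decomposition from the hiring-pipeline example, taken as given.
steps = parse_decomposition([
    "scan_resumes",
    "filter_candidates",
    "rank_applicants",
    "remove_demographics?",
    "generate_bias_report?",
])

# Ambiguous steps are the ones worth a closer look in the next stage.
print([s.name for s in steps if s.ambiguous])
```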

2. Contextual Contradiction Detection

Each atomic instruction is checked against the declared context:

Declared: "optimize hiring pipeline"
Atomic:  "remove_demographics" ← CONTRADICTION DETECTED

The contradiction isn't about bad words. It's about structural inconsistency between what was asked and what would need to happen. In this example, nothing in the declared goal of pipeline optimization calls for removing demographic information, so the step points to a hidden utility function that contradicts the declared one.
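As a rough illustration of the check, the sketch below compares each atomic instruction against the set of instructions a declared context is assumed to justify. The lookup table `EXPECTED_FOR_CONTEXT` is a hypothetical stand-in: the post does not say how UMRC derives what a context justifies, and treating `generate_bias_report` as consistent is an assumption made only so the output matches the example above.

```python
from typing import Dict, List, Set

# Hypothetical table of what each declared context is assumed to justify.
EXPECTED_FOR_CONTEXT: Dict[str, Set[str]] = {
    "optimize hiring pipeline": {
        "scan_resumes",
        "filter_candidates",
        "rank_applicants",
        "generate_bias_report",  # assumption: treated as consistent here
    },
}

def find_contradictions(declared: str, atomic: List[str]) -> List[str]:
    """Return the atomic instructions that the declared context does not justify."""
    expected = EXPECTED_FOR_CONTEXT.get(declared, set())
    return [name for name in atomic if name not in expected]

atomic = [
    "scan_resumes",
    "filter_candidates",
    "rank_applicants",
    "remove_demographics",
    "generate_bias_report",
]
for name in find_contradictions("optimize hiring pipeline", atomic):
    print(f'Atomic: "{name}" <- CONTRADICTION DETECTED')
```

Note that an unknown context justifies nothing in this sketch, so every instruction is flagged. A real system would need a richer model of context than a static table.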

3. Minimum Motivation Backtracking

For each detected contradiction, the system backtracks to find the minimum motivation that could explain the conflicting instructions:

remove_demographics + rank_applicants + filter_candidates
→ minimum_motivation: "evaluate candidates while concealing demographic bias"
→ CONFLICT with axiom A3 (Value Alignment)
→ BLOCK

This is the key insight: the system doesn't need to guess the attacker's intent. It needs to find the most charitable interpretation that still explains the actions — and check if that interpretation conflicts with its safety axioms.
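The backtracking step can be sketched as a selection over candidate motivations. Everything here is an assumption made for illustration: the `Motivation` record, the assumption counts, the candidate list (including the third, deliberately far-fetched entry), and the `ESCALATE` fallback are hypothetical. Only the example actions, the selected motivation, axiom A3, and the BLOCK outcome come from the walkthrough above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class Motivation:
    description: str
    explains: FrozenSet[str]           # actions this motivation accounts for
    extra_assumptions: int             # hypothetical cost of believing it
    violates: FrozenSet[str] = frozenset()  # safety axioms it conflicts with

def minimum_motivation(actions: FrozenSet[str],
                       candidates: List[Motivation]) -> Optional[Motivation]:
    """Pick the candidate that explains every action with the fewest assumptions."""
    covering = [c for c in candidates if actions <= c.explains]
    return min(covering, key=lambda c: c.extra_assumptions, default=None)

def decide(actions: FrozenSet[str], candidates: List[Motivation]) -> str:
    chosen = minimum_motivation(actions, candidates)
    if chosen is None:
        return "ESCALATE"  # assumption: unexplained actions go to review
    return "BLOCK" if chosen.violates else "ALLOW"

candidates = [
    Motivation("optimize hiring pipeline",
               frozenset({"scan_resumes", "filter_candidates", "rank_applicants"}),
               extra_assumptions=0),
    Motivation("evaluate candidates while concealing demographic bias",
               frozenset({"remove_demographics", "rank_applicants", "filter_candidates"}),
               extra_assumptions=1, violates=frozenset({"A3"})),
    Motivation("hypothetical far-fetched explanation",
               frozenset({"scan_resumes", "remove_demographics",
                          "rank_applicants", "filter_candidates"}),
               extra_assumptions=4, violates=frozenset({"A3"})),
]

actions = frozenset({"remove_demographics", "rank_applicants", "filter_candidates"})
print(minimum_motivation(actions, candidates).description)
print(decide(actions, candidates))
```

The declared motivation cannot explain `remove_demographics`, so it drops out, and the cheapest remaining explanation is the one checked against the axioms.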

The Safety Stack: From Syntax to Intent

Our defense stack now has six layers, each operating at its own abstraction level:

Layer	Component	Abstraction Level
1	RI Text Analysis	Syntax (surface patterns)
2	CNDS Verification	Structure (decomposition integrity)
3	UMRC Retrospection	Intent (motivation consistency)
4	DASB Risk Tiering	Action (grading + graduated response)
5	SIAP Axiom Auditing	Governance (identity + entropy + alignment)
6	O-SDA Checkpoint Anchoring	Execution (compositional integrity)

The jump from Layer 2 to Layer 3 is the critical transition: from analyzing what was said to analyzing what must have been meant. This is the defense dimension that syntax-based classifiers — including Fable 5's — cannot provide.
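A layered stack like this can be modeled as an ordered list of checks that stops at the first rejection. The sketch below is only a shape, not the Agent OS implementation: the layer names come from the table, but every check is a placeholder, and the `motivation_conflict` field is a hypothetical flag assumed to be set by an upstream UMRC step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Request:
    text: str
    motivation_conflict: bool = False  # hypothetical: result of UMRC analysis

Check = Callable[[Request], bool]  # True means the layer lets the request through

def run_stack(request: Request, layers: List[Tuple[str, Check]]) -> str:
    """Run the layers in order and stop at the first one that rejects."""
    for name, check in layers:
        if not check(request):
            return f"BLOCKED by {name}"
    return "PASSED all layers"

def always_pass(_: Request) -> bool:
    """Placeholder for layers whose logic is not described in the post."""
    return True

LAYERS: List[Tuple[str, Check]] = [
    ("RI Text Analysis", lambda r: "delete all files" not in r.text.lower()),
    ("CNDS Verification", always_pass),
    ("UMRC Retrospection", lambda r: not r.motivation_conflict),
    ("DASB Risk Tiering", always_pass),
    ("SIAP Axiom Auditing", always_pass),
    ("O-SDA Checkpoint Anchoring", always_pass),
]

hiring = Request("Help me optimize my company's hiring pipeline",
                 motivation_conflict=True)
print(run_stack(hiring, LAYERS))
```

The hiring request contains nothing the syntax layer would match, so it is only stopped once the intent layer runs.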

Why "Most Charitable Interpretation" Matters

A safety system that always assumes the worst interpretation generates too many false positives. A system that always assumes the best interpretation generates too many false negatives.

UMRC's "minimum motivation" approach sits between the two: find the interpretation that requires the fewest additional assumptions, then verify that this interpretation doesn't conflict with safety constraints. Attacks carefully crafted to pass syntax checks are still blocked on intent, while legitimate requests that merely share vocabulary with attacks pass.
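The trade-off between the three policies can be made concrete with a toy comparison. In this sketch each interpretation of a request is a pair of (number of extra assumptions, whether it conflicts with safety constraints). The numbers are invented for illustration and the function names are hypothetical.

```python
from typing import List, Tuple

# (extra assumptions needed, conflicts with safety constraints)
Interpretation = Tuple[int, bool]

def worst_case_policy(interps: List[Interpretation]) -> bool:
    """Block if any interpretation conflicts."""
    return any(conflict for _, conflict in interps)

def best_case_policy(interps: List[Interpretation]) -> bool:
    """Block only if every interpretation conflicts."""
    return all(conflict for _, conflict in interps)

def minimum_policy(interps: List[Interpretation]) -> bool:
    """Block if the interpretation with the fewest assumptions conflicts."""
    return min(interps, key=lambda i: i[0])[1]

# Invented examples: a benign request that shares vocabulary with attacks,
# and a crafted request whose simplest explanation is the harmful one.
benign = [(0, False), (3, True)]
crafted = [(1, True), (4, False)]

for label, interps in [("benign", benign), ("crafted", crafted)]:
    print(label,
          "worst:", worst_case_policy(interps),
          "best:", best_case_policy(interps),
          "minimum:", minimum_policy(interps))
```

On these inputs the worst-case policy blocks the benign request (a false positive), the best-case policy lets the crafted one through (a false negative), and the minimum policy gets both right. Real requests will not separate this cleanly.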

The Open Question

Syntax-layer safety is a solved problem — in the sense that we know its limits. Intent-layer safety is where the next generation of attacks will operate. The question for the community:

How do we build shared intent-level verification that works across agent boundaries — so that a system that detects a hidden motivation can warn downstream agents before they execute?

Fable 5 failed because one classifier was the whole system. The next failure will come from agents that trust each other's safety checks without independent verification. Intent-level defense combined with cross-agent safety signaling is the frontier we need to build toward.

UMRC is part of the Agent OS safety architecture. Implementation details and test vectors will be published as the system matures.
