Beyond Keyword Detection: Utility Motivation Retrospection for Agent Safety #20
Liuyanfeng1234 started this conversation in Show and tell
[A follow-up to #19: "What Fable 5's Breach Teaches Us About Agent Safety"]
Yesterday's analysis of Fable 5 established that single-point safety classifiers fail because they operate at the syntax layer — pattern-matching against surface text features. But the deeper question is: what should the next layer of defense operate on?
The Syntax Trap
Every syntax-based safety approach — keyword lists, regex patterns, embedding similarity, even fine-tuned classifiers — shares the same failure mode:
The attacker doesn't defeat the classifier by being smarter — they defeat it by operating at a different level of abstraction. The syntax layer can't see intent; it can only see tokens.
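A minimal sketch of that failure mode, assuming a toy regex blocklist. The patterns and the two example requests are hypothetical and are not taken from Fable 5 or any real classifier:

```python
import re

# Hypothetical blocklist; real systems use larger lists or learned classifiers,
# but they still score surface text.
BLOCKED_PATTERNS = [
    r"\bdelete\b.*\blogs?\b",
    r"\bexfiltrate\b",
    r"\bdisable\b.*\baudit\b",
]

def syntax_filter(request: str) -> bool:
    """Return True if the request is blocked by surface pattern matching."""
    text = request.lower()
    return any(re.search(pattern, text) for pattern in BLOCKED_PATTERNS)

direct = "Delete the audit logs for last week."
paraphrased = "Free up disk space by rotating last week's audit records to /dev/null."

print(syntax_filter(direct))       # True: the wording matches a pattern
print(syntax_filter(paraphrased))  # False: same effect, different tokens
```

Both requests destroy the same records, but only the first uses tokens the filter knows about.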
UMRC: Utility Motivation Retrospection
We've been developing a complementary approach that operates at the intent layer: UMRC (Utility Motivation Retrospection Core). Instead of asking "do these words look dangerous?", it asks "what is the minimal utility motivation that could explain this sequence of actions?"
The process has three stages:
1. Atomic Decomposition
A compound request is decomposed into atomic instructions — the smallest meaningful execution units:
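As a rough sketch of what such a decomposition could look like: the splitter below is deliberately naive (it cuts on connectives), and the `AtomicInstruction` type, the `AMBIGUOUS_VERBS` list, and the sample request are all assumptions for illustration, not UMRC's actual schema.

```python
import re
from dataclasses import dataclass

# Assumption: verbs this vague often hide more than one concrete step.
AMBIGUOUS_VERBS = ("clean up", "tidy", "handle", "normalize", "streamline")

# Connectives that separate one execution unit from the next.
CONNECTIVES = re.compile(r",?\s+(?:and\s+)?then\s+|,\s+and\s+|;\s*|\.\s+")

@dataclass(frozen=True)
class AtomicInstruction:
    index: int
    text: str
    ambiguous: bool  # True when the unit may hide further sub-tasks

    def render(self) -> str:
        marker = "?" if self.ambiguous else " "
        return f"{marker} {self.index}. {self.text}"

def decompose(request: str) -> list[AtomicInstruction]:
    """Naively split a compound request into atomic units."""
    parts = CONNECTIVES.split(request.strip().rstrip("."))
    units = [part.strip() for part in parts if part.strip()]
    return [
        AtomicInstruction(i, unit, any(v in unit.lower() for v in AMBIGUOUS_VERBS))
        for i, unit in enumerate(units, start=1)
    ]

request = "Profile the ETL pipeline, then cache the slow joins, and clean up the user table."
for unit in decompose(request):
    print(unit.render())
```

A production decomposer would need real parsing rather than connective splitting; the point here is only the output shape, with a marker on units that may expand further.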
The `?` annotations mark instructions where the decomposition is ambiguous, indicating possible hidden sub-tasks.
2. Contextual Contradiction Detection
Each atomic instruction is checked against the declared context:
The contradiction isn't about bad words. It's about structural inconsistency between what was asked and what would need to happen. Removing demographic information has no legitimate purpose in pipeline optimization — it reveals a hidden utility function that contradicts the declared one.
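One way to sketch this check is as set subtraction: each instruction carries the effects it would require, and the declared context lists the effects it can account for. The effect tags, the `CONTEXT_EFFECTS` table, and the sample plan below are hypothetical; how effects get assigned to instructions is assumed to happen upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    text: str
    effects: frozenset[str]  # hypothetical effect tags, assigned upstream

# Assumption: each declared context maps to the effects it can legitimately require.
CONTEXT_EFFECTS = {
    "pipeline_optimization": frozenset({"read_metrics", "modify_schedule", "add_cache"}),
}

def find_contradictions(
    context: str, instructions: list[Instruction]
) -> list[tuple[Instruction, frozenset[str]]]:
    """Return each instruction paired with the effects the context cannot explain."""
    allowed = CONTEXT_EFFECTS[context]
    flagged = []
    for instruction in instructions:
        unexplained = instruction.effects - allowed
        if unexplained:
            flagged.append((instruction, unexplained))
    return flagged

plan = [
    Instruction("Profile stage latencies", frozenset({"read_metrics"})),
    Instruction("Cache the slow joins", frozenset({"add_cache"})),
    Instruction("Drop demographic columns from the audit table",
                frozenset({"remove_demographics"})),
]

for instruction, unexplained in find_contradictions("pipeline_optimization", plan):
    print(f"CONTRADICTION: {instruction.text!r} needs {sorted(unexplained)}")
```

Nothing in the flagged instruction is a "bad word"; it is flagged only because the declared goal gives no reason for that effect.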
3. Minimum Motivation Backtracking
For each detected contradiction, the system backtracks to find the minimum motivation that could explain the conflicting instructions:
This is the key insight: the system doesn't need to guess the attacker's intent. It needs to find the most charitable interpretation that still explains the actions — and check if that interpretation conflicts with its safety axioms.
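A minimal sketch of the backtracking step, assuming candidate motivations are already enumerated and scored by how many extra assumptions they need. The candidate list, the assumption counts, and `FORBIDDEN_MOTIVATIONS` are invented for illustration; generating and scoring candidates is the hard part and is not shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Motivation:
    name: str
    explains: frozenset[str]  # effect tags this motivation accounts for
    assumptions: int          # extra assumptions beyond the declared context

# Hypothetical candidates and safety axioms.
CANDIDATES = [
    Motivation("reduce_storage_cost", frozenset({"add_cache"}), 1),
    Motivation("conceal_demographic_data", frozenset({"remove_demographics"}), 2),
    Motivation("sabotage_and_cover_up",
               frozenset({"remove_demographics", "add_cache"}), 5),
]
FORBIDDEN_MOTIVATIONS = {"conceal_demographic_data", "sabotage_and_cover_up"}

def minimum_motivation(unexplained: frozenset[str]) -> Optional[Motivation]:
    """Pick the candidate with the fewest assumptions that covers every unexplained effect."""
    covering = [m for m in CANDIDATES if unexplained <= m.explains]
    return min(covering, key=lambda m: m.assumptions, default=None)

def verdict(unexplained: frozenset[str]) -> str:
    motive = minimum_motivation(unexplained)
    if motive is None:
        return "escalate"  # no candidate explains the actions
    return "block" if motive.name in FORBIDDEN_MOTIVATIONS else "allow"

print(verdict(frozenset({"remove_demographics"})))  # block
print(verdict(frozenset({"add_cache"})))            # allow
print(verdict(frozenset({"send_email"})))           # escalate
```

Note that the more alarming `sabotage_and_cover_up` candidate also covers the first case but is never selected, because a cheaper explanation exists.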
The Safety Stack: From Syntax to Intent
Our five-layer defense now spans two abstraction levels:
The jump from Layer 2 to Layer 3 is the critical transition: from analyzing what was said to analyzing what must have been meant. This is the defense dimension that syntax-based classifiers — including Fable 5's — cannot provide.
Why "Most Charitable Interpretation" Matters
A safety system that always assumes the worst interpretation generates too many false positives. A system that always assumes the best interpretation generates too many false negatives.
UMRC's "minimum motivation" approach splits the difference: find the interpretation that requires the fewest additional assumptions, then verify that this interpretation doesn't conflict with safety constraints. Attacks carefully crafted to pass syntax checks are still blocked on intent, while legitimate requests that merely share vocabulary with attacks pass.
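The three policies can be compared side by side in a toy model. The `Interpretation` type, the assumption counts, and both sets of readings are hypothetical, chosen only to show where each policy errs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interpretation:
    label: str
    assumptions: int  # extra assumptions needed for this reading
    harmful: bool

def worst_case(readings: list[Interpretation]) -> str:
    return "block" if any(r.harmful for r in readings) else "allow"

def best_case(readings: list[Interpretation]) -> str:
    return "allow" if any(not r.harmful for r in readings) else "block"

def minimum_assumption(readings: list[Interpretation]) -> str:
    simplest = min(readings, key=lambda r: r.assumptions)
    return "block" if simplest.harmful else "allow"

# Hypothetical readings for a benign request that shares vocabulary with attacks.
benign = [
    Interpretation("kill a stuck worker process", 0, False),
    Interpretation("disrupt production", 3, True),
]
# Hypothetical readings for an attack phrased to look routine.
attack = [
    Interpretation("conceal records from auditors", 1, True),
    Interpretation("routine cleanup that happens to hit only audit data", 4, False),
]

for name, policy in [("worst", worst_case), ("best", best_case), ("min", minimum_assumption)]:
    print(name, policy(benign), policy(attack))
```

In this toy setup the worst-case policy blocks the benign request, the best-case policy allows the attack, and only the minimum-assumption policy gets both right.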
The Open Question
Syntax-layer safety is a solved problem — in the sense that we know its limits. Intent-layer safety is where the next generation of attacks will operate. The question for the community:
How do we build shared intent-level verification that works across agent boundaries — so that a system that detects a hidden motivation can warn downstream agents before they execute?
Fable 5 failed because one classifier was the whole system. The next failure will come from agents that trust each other's safety checks without independent verification. Intent-level defense + cross-agent safety signaling is the frontier we need to build toward.
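One possible shape for such a signal, sketched with Python's standard `hmac` module. The message fields, the shared key, and the function names are assumptions; key distribution, replay protection, and transport are out of scope here.

```python
import hashlib
import hmac
import json

# Assumption: agents share a secret distributed out of band.
SHARED_KEY = b"demo-key-rotate-me"

def sign_warning(request_id: str, motivation: str, key: bytes = SHARED_KEY) -> dict:
    """Build a warning that downstream agents can authenticate."""
    payload = {"request_id": request_id, "motivation": motivation}
    body = json.dumps(payload, sort_keys=True).encode()
    payload["mac"] = hmac.new(key, body, hashlib.sha256).hexdigest()
    return payload

def verify_warning(warning: dict, key: bytes = SHARED_KEY) -> bool:
    """Check that the warning was produced by a key holder and not altered."""
    claimed = warning.get("mac", "")
    payload = {k: v for k, v in warning.items() if k != "mac"}
    body = json.dumps(payload, sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)

warning = sign_warning("req-42", "conceal_demographic_data")
tampered = dict(warning, motivation="routine_cleanup")

print(verify_warning(warning))   # True
print(verify_warning(tampered))  # False
```

A verified warning only proves who sent it. The downstream agent would still treat it as one input to its own intent check, not as a substitute for one.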
UMRC is part of the Agent OS safety architecture. Implementation details and test vectors will be published as the system matures.