
fix: prevent self-audit 'Violates Known Physics' example bleedthrough#585

Open
neoneye wants to merge 5 commits into main from fix/self-audit-physics-example-bleedthrough

Conversation


neoneye (Member) commented Apr 17, 2026

Summary

  • Observed in a Denmark euro-adoption run: item 1 rated HIGH with justification "success literally requires breaking the named law of physics (conservation of energy) for a reactionless/anti-gravity propulsion system." The plan is about sovereign currency adoption — no physics claims exist anywhere in it.
  • Root cause: same bleedthrough pattern as PR #582 (fix: anonymize few-shot examples in identify_documents prompts). The instruction text for item 1 contained concrete few-shot examples — "perpetual motion, faster-than-light travel, reactionless/anti-gravity propulsion, time travel" and "thermodynamics, conservation of energy, relativity" — which the model reproduced verbatim as findings instead of treating them as rubric anchors.
  • Previous fix 00d997ee added strong out-of-scope exclusions (legislation/treaties/currency adoption rate LOW) but left the copy-bait examples in place. The model resolved the conflict by keeping the HIGH rating anyway and fabricating a reactionless-drive narrative.

Change

One instruction string in ALL_CHECKLIST_ITEMS[0]:

  • Stripped the sci-fi system examples and the named-law parenthetical.
  • Added explicit anti-fabrication rules:
    • "Do NOT invent physics violations that the plan does not itself describe."
    • "If you cannot quote text from the plan that describes an attempt to break a specific named law of physics, rate LOW."
  • For ≥MEDIUM ratings, the justification must quote the plan text alongside naming the violated law — making fabrication mechanically harder.

The existing structural guardrails are preserved: the scope list, the legislation/treaties/currency-adoption → LOW rule, and the out-of-scope → LOW rule for economics/crypto/tokenization/governance/AI/regulation/policy/finance/engineering-scale.
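The quote requirement is also straightforward to verify mechanically. A minimal sketch of such a check (the function name and sample strings are hypothetical, not from this repo):

```python
import re

def justification_quotes_plan(justification: str, plan_text: str) -> bool:
    """Return True if the justification contains at least one verbatim
    double-quoted span (>= 5 words) that actually appears in the plan text."""
    quoted_spans = re.findall(r'"([^"]+)"', justification)
    for span in quoted_spans:
        if len(span.split()) >= 5 and span in plan_text:
            return True
    return False

# A fabricated quote fails the check; a real excerpt passes.
plan = "Denmark will adopt the euro via an ERM II convergence path."
fake = 'The plan says "build a reactionless anti-gravity propulsion system".'
real = 'The plan says "adopt the euro via an ERM II convergence path".'
```

A check like this would make the "quote the plan text" rule enforceable outside the prompt, rather than relying on the model to self-police.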

Out of scope (follow-up)

Bug B: 14 of 20 audit items shared word-for-word identical justifications (boilerplate from item 4's Underestimating Risks instruction), caused by user_prompt_with_previous_responses feeding all prior answers into each subsequent item. Separate concern; will address in a follow-up PR if desired.

Test plan

  • Re-run the Denmark euro-adoption plan and confirm item 1 rates LOW with mitigation=None.
  • Spot-check that an intentionally physics-violating plan (e.g., "build a perpetual motion generator") still rates HIGH, with a quote from the plan text.
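The first check could be scripted against a saved rerun result; the JSON shape below is an assumption for illustration, not the tool's actual output format:

```python
def check_item1(audit_items: list) -> None:
    """Assert the post-fix expectation for checklist item 1 on a rerun."""
    item1 = audit_items[0]
    assert item1["rating"] == "LOW", item1["rating"]
    assert item1["mitigation"] is None, item1["mitigation"]

# Hypothetical shape of a saved rerun result for the Denmark plan:
rerun = [
    {
        "rating": "LOW",
        "justification": "No physical device or material process is described.",
        "mitigation": None,
    }
]
check_item1(rerun)
```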

🤖 Generated with Claude Code

neoneye and others added 5 commits April 17, 2026 02:57
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc. — but left the concrete physics
examples inside the same instruction: "perpetual motion, faster-than-
light travel, reactionless/anti-gravity propulsion, time travel" plus
"thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists. Same bleedthrough pattern as PR #582 on
identify_documents: concrete few-shot examples get reproduced as
findings by weaker models.

Fix: strip the sci-fi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications) — separate concern, will
address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit on this branch kept two enumerations inherited from
commit 00d997e that named specific domains as out-of-scope — including
"currency adoption" and "economics / finance / regulation / policy".
The Denmark euro-adoption plan that triggered this bug matches those
literal strings, so a passing test would only prove the model can
pattern-match the enumeration back to the prompt, not that the
structural anti-hallucination rule is working.

Strip both domain lists. The instruction now relies only on:
- the disambiguation that 'Laws' means physics laws, not legal ones,
- the structural rule that the model must quote plan text describing
  a physics-violating mechanism or rate LOW.

The re-run now becomes an honest test of the structural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the Denmark euro-adoption rerun, item 1 still rated HIGH with
justification: "success literally requires breaking the EU's monetary
architecture (Article 140 TFEU and the no-bailout clause)."

User's insight confirmed: the instruction's negation-bait ("NOT legal
statutes, treaty law, regulations") primes the model to consider
treaty law as in-scope. Seeing "treaty law" mentioned in the rubric
makes "Article 140 TFEU" feel like a valid physics-law citation. The
more we said "NOT X", the more the model latched onto X.

Rewrite with positive-only framing:
- Define a law of physics by discipline (mechanics, thermodynamics,
  electromagnetism, quantum mechanics, relativity).
- Define a physics violation by substrate (physical mechanism: device,
  energy flow, field, force, material process).
- Require the ≥MEDIUM justification to quote plan text describing the
  physical mechanism AND name the violated law.

No mention of legal, treaty, regulation, policy, currency, or any
non-physics domain anywhere in the instruction. The model has no
hint-bait to latch onto; it either finds a physical-mechanism quote
in the plan or rates LOW.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the rerun, item 1 still rated HIGH with justification:
"physical mechanism — moving the DKK/EUR peg within ERM II — that
contradicts a law of physics (thermodynamics/mechanics) by demanding
sustained intervention and precise energy flows to maintain the peg."

The LLM exploited polysemy: "mechanism", "energy flow", and "force"
all have metaphorical uses in economics and policy. Defining a
physics violation as requiring a "physical mechanism" was
insufficient — "intervention mechanism" and "currency flow" were
close enough for the model to claim the gate was satisfied.

Replace the "physical mechanism" gate with a harder one: to rate
≥MEDIUM, the justification must cite a specific physical quantity
in SI units (joules, newtons, kelvin, coulombs, m/s, etc.) with a
numerical magnitude, sourced from a verbatim plan-text quote and
attached to a named physics law. Three gates (quote + law + SI
magnitude), all-or-LOW.

Currency pegs, treaty mechanisms, and organizational processes
cannot be given coherent joule/newton/kelvin values against a named
law — the model either has to fabricate a number (a visible tell)
or rate LOW. Perpetual-motion and FTL proposals still rate HIGH
cleanly because their SI-unit violations are real (net-positive
joule output without input; velocity ≥ c).

No negation of domains, no listing of metaphorical word uses — the
quantitative gate does the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
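If the SI-magnitude gate ever moves from prompt text into code, a detector could be as simple as the sketch below (the regex and unit list are illustrative, not part of this PR):

```python
import re

# SI units the gate accepts (illustrative subset; extend as needed).
SI_UNITS = r"(?:joules?|newtons?|kelvin|coulombs?|watts?|m/s)"
# A numerical magnitude immediately followed by an SI unit, e.g. "500 joules".
SI_MAGNITUDE = re.compile(rf"\b\d[\d,.]*\s*{SI_UNITS}\b")

def cites_si_magnitude(justification: str) -> bool:
    """True if the justification cites a number attached to an SI unit."""
    return bool(SI_MAGNITUDE.search(justification))
```

Treaty articles and peg mechanisms carry no numbers-with-units, so they fail this check, while a genuine perpetual-motion justification ("net-positive output of 500 joules per cycle") passes.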
Three previous prompt iterations on this PR tried increasingly
structured rubrics (scope enumeration, positive-only framing, SI-unit
gate with three mandatory justification elements). All three failed
the same way: a weak non-reasoning model that has been trained to
flag audit items as HIGH will fabricate physics language ("ERM II
energy flows", "Article 140 TFEU physics violation") to match
whatever rubric text it finds, ignoring conditional structure.

User's observation: the prompt may be too confusing for the model to
follow. Strip it down to a short, unconditional instruction:

  "This check applies only to plans describing physical devices or
  material processes. Default rating: LOW. Rate HIGH only if the plan
  requires a physical device or material process that cannot exist
  under known physics."

No SI units, no example violations, no named laws, no domain
enumeration (positive or negative), no multi-gate conditionals.
Whether this works depends on whether the model respects the short
default-LOW instruction or continues to pattern-match HIGH regardless.

If it fails, the next step is a code-side post-validator that
overrides non-LOW ratings whose justification does not cite a concrete
physical quantity — but that is out of scope for this prompt-only PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
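The code-side post-validator floated above could look like the following sketch (the function name, rating strings, and unit regex are hypothetical):

```python
import re

# Matches a number attached to a common SI unit, e.g. "3e8 m/s" or "500 joules".
_PHYSICAL_QUANTITY = re.compile(
    r"\b\d[\d,.eE+-]*\s*(?:joules?|newtons?|kelvin|coulombs?|watts?|m/s)\b"
)

def postvalidate_physics_rating(rating: str, justification: str) -> str:
    """Backstop for the prompt-side rules: override any non-LOW rating
    whose justification cites no concrete physical quantity."""
    if rating != "LOW" and not _PHYSICAL_QUANTITY.search(justification):
        return "LOW"
    return rating
```

Under this sketch, the ERM II and Article 140 TFEU fabrications seen in the reruns would be forced back to LOW, while a justification citing a real quantitative violation survives.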
