
fix: prevent self-audit 'Violates Known Physics' example bleedthrough#585

Open
neoneye wants to merge 5 commits into main from fix/self-audit-physics-example-bleedthrough

Conversation


neoneye (Member) commented Apr 17, 2026

Summary

  • Observed in a Denmark euro-adoption run: item 1 rated HIGH with justification "success literally requires breaking the named law of physics (conservation of energy) for a reactionless/anti-gravity propulsion system." The plan is about sovereign currency adoption — no physics claims exist anywhere in it.
  • Root cause: same bleedthrough pattern as PR #582 (fix: anonymize few-shot examples in identify_documents prompts). The instruction text for item 1 contained concrete few-shot examples — "perpetual motion, faster-than-light travel, reactionless/anti-gravity propulsion, time travel" and "thermodynamics, conservation of energy, relativity" — which the model reproduced verbatim as findings instead of treating them as rubric anchors.
  • Previous fix 00d997ee added strong out-of-scope exclusions (legislation/treaties/currency adoption rate LOW) but left the copy-bait examples in place. The model resolved the conflict by keeping the HIGH rating anyway and fabricating a reactionless-drive narrative.

Change

One instruction string in ALL_CHECKLIST_ITEMS[0]:

  • Stripped the sci-fi system examples and the named-law parenthetical.
  • Added explicit anti-fabrication rules:
    • "Do NOT invent physics violations that the plan does not itself describe."
    • "If you cannot quote text from the plan that describes an attempt to break a specific named law of physics, rate LOW."
  • For ≥MEDIUM ratings, the justification must quote the plan text alongside naming the violated law — making fabrication mechanically harder.

The existing structural guardrails are preserved: the scope list, the legislation/treaties/currency-adoption → LOW rule, and the out-of-scope → LOW rule for economics/crypto/tokenization/governance/AI/regulation/policy/finance/engineering-scale.
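The quote requirement is also straightforward to verify mechanically. A minimal sketch of such a check (the function name and sample strings are hypothetical, not from this repo):

```python
import re

def justification_quotes_plan(justification: str, plan_text: str) -> bool:
    """Return True if the justification contains at least one verbatim
    double-quoted span (>= 5 words) that actually appears in the plan text."""
    quoted_spans = re.findall(r'"([^"]+)"', justification)
    for span in quoted_spans:
        if len(span.split()) >= 5 and span in plan_text:
            return True
    return False

# A fabricated quote fails the check; a real excerpt passes.
plan = "Denmark will adopt the euro via an ERM II convergence path."
fake = 'The plan says "build a reactionless anti-gravity propulsion system".'
real = 'The plan says "adopt the euro via an ERM II convergence path".'
```

A check like this would make the "quote the plan text" rule enforceable outside the prompt, rather than relying on the model to self-police.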

Out of scope (follow-up)

Bug B: 14 of 20 audit items shared word-for-word identical justifications (boilerplate from item 4's Underestimating Risks instruction), caused by user_prompt_with_previous_responses feeding all prior answers into each subsequent item. Separate concern; will address in a follow-up PR if desired.

Test plan

  • Re-run the Denmark euro-adoption plan and confirm item 1 rates LOW with mitigation=None.
  • Spot-check that an intentionally physics-violating plan (e.g., "build a perpetual motion generator") still rates HIGH, with a quote from the plan text.
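The first check could be scripted against a saved rerun result; the JSON shape below is an assumption for illustration, not the tool's actual output format:

```python
def check_item1(audit_items: list) -> None:
    """Assert the post-fix expectation for checklist item 1 on a rerun."""
    item1 = audit_items[0]
    assert item1["rating"] == "LOW", item1["rating"]
    assert item1["mitigation"] is None, item1["mitigation"]

# Hypothetical shape of a saved rerun result for the Denmark plan:
rerun = [
    {
        "rating": "LOW",
        "justification": "No physical device or material process is described.",
        "mitigation": None,
    }
]
check_item1(rerun)
```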

🤖 Generated with Claude Code

neoneye and others added 5 commits April 17, 2026 02:57
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc. — but left the concrete physics
examples inside the same instruction: "perpetual motion, faster-than-
light travel, reactionless/anti-gravity propulsion, time travel" plus
"thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists. Same bleedthrough pattern as PR #582 on
identify_documents: concrete few-shot examples get reproduced as
findings by weaker models.

Fix: strip the sci-fi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications) — separate concern, will
address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit on this branch kept two enumerations inherited from
commit 00d997e that named specific domains as out-of-scope — including
"currency adoption" and "economics / finance / regulation / policy".
The Denmark euro-adoption plan that triggered this bug matches those
literal strings, so a passing test would only prove the model can
pattern-match the enumeration back to the prompt, not that the
structural anti-hallucination rule is working.

Strip both domain lists. The instruction now relies only on:
- the disambiguation that 'Laws' means physics laws, not legal ones,
- the structural rule that the model must quote plan text describing
  a physics-violating mechanism or rate LOW.

The re-run now becomes an honest test of the structural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the Denmark euro-adoption rerun, item 1 still rated HIGH with
justification: "success literally requires breaking the EU's monetary
architecture (Article 140 TFEU and the no-bailout clause)."

User's insight confirmed: the instruction's negation-bait ("NOT legal
statutes, treaty law, regulations") primes the model to consider
treaty law as in-scope. Seeing "treaty law" mentioned in the rubric
makes "Article 140 TFEU" feel like a valid physics-law citation. The
more we said "NOT X", the more the model latched onto X.

Rewrite with positive-only framing:
- Define a law of physics by discipline (mechanics, thermodynamics,
  electromagnetism, quantum mechanics, relativity).
- Define a physics violation by substrate (physical mechanism: device,
  energy flow, field, force, material process).
- Require the ≥MEDIUM justification to quote plan text describing the
  physical mechanism AND name the violated law.

No mention of legal, treaty, regulation, policy, currency, or any
non-physics domain anywhere in the instruction. The model has no
hint-bait to latch onto; it either finds a physical-mechanism quote
in the plan or rates LOW.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the rerun, item 1 still rated HIGH with justification:
"physical mechanism — moving the DKK/EUR peg within ERM II — that
contradicts a law of physics (thermodynamics/mechanics) by demanding
sustained intervention and precise energy flows to maintain the peg."

The LLM exploited polysemy: "mechanism", "energy flow", and "force"
all have metaphorical uses in economics and policy. Defining a
physics violation as requiring a "physical mechanism" was
insufficient — "intervention mechanism" and "currency flow" were
close enough for the model to claim the gate was satisfied.

Replace the "physical mechanism" gate with a harder one: to rate
≥MEDIUM, the justification must cite a specific physical quantity
in SI units (joules, newtons, kelvin, coulombs, m/s, etc.) with a
numerical magnitude, sourced from a verbatim plan-text quote and
attached to a named physics law. Three gates (quote + law + SI
magnitude), all-or-LOW.

Currency pegs, treaty mechanisms, and organizational processes
cannot be given coherent joule/newton/kelvin values against a named
law — the model either has to fabricate a number (a visible tell)
or rate LOW. Perpetual-motion and FTL proposals still rate HIGH
cleanly because their SI-unit violations are real (net-positive
joule output without input; velocity ≥ c).

No negation of domains, no listing of metaphorical word uses — the
quantitative gate does the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
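If the SI-magnitude gate ever moves from prompt text into code, a detector could be as simple as the sketch below (the regex and unit list are illustrative, not part of this PR):

```python
import re

# SI units the gate accepts (illustrative subset; extend as needed).
SI_UNITS = r"(?:joules?|newtons?|kelvin|coulombs?|watts?|m/s)"
# A numerical magnitude immediately followed by an SI unit, e.g. "500 joules".
SI_MAGNITUDE = re.compile(rf"\b\d[\d,.]*\s*{SI_UNITS}\b")

def cites_si_magnitude(justification: str) -> bool:
    """True if the justification cites a number attached to an SI unit."""
    return bool(SI_MAGNITUDE.search(justification))
```

Treaty articles and peg mechanisms carry no numbers-with-units, so they fail this check, while a genuine perpetual-motion justification ("net-positive output of 500 joules per cycle") passes.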
Three previous prompt iterations on this PR tried increasingly
structured rubrics (scope enumeration, positive-only framing, SI-unit
gate with three mandatory justification elements). All three failed
the same way: a weak non-reasoning model that has been trained to
flag audit items as HIGH will fabricate physics language ("ERM II
energy flows", "Article 140 TFEU physics violation") to match
whatever rubric text it finds, ignoring conditional structure.

User's observation: the prompt may be too confusing for the model to
follow. Strip it down to a short, unconditional instruction:

  "This check applies only to plans describing physical devices or
  material processes. Default rating: LOW. Rate HIGH only if the plan
  requires a physical device or material process that cannot exist
  under known physics."

No SI units, no example violations, no named laws, no domain
enumeration (positive or negative), no multi-gate conditionals.
Whether this works depends on whether the model respects the short
default-LOW instruction or continues to pattern-match HIGH regardless.

If it fails, the next step is a code-side post-validator that
overrides non-LOW ratings whose justification does not cite a concrete
physical quantity — but that is out of scope for this prompt-only PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
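The code-side post-validator floated above could look like the following sketch (the function name, rating strings, and unit regex are hypothetical):

```python
import re

# Matches a number attached to a common SI unit, e.g. "3e8 m/s" or "500 joules".
_PHYSICAL_QUANTITY = re.compile(
    r"\b\d[\d,.eE+-]*\s*(?:joules?|newtons?|kelvin|coulombs?|watts?|m/s)\b"
)

def postvalidate_physics_rating(rating: str, justification: str) -> str:
    """Backstop for the prompt-side rules: override any non-LOW rating
    whose justification cites no concrete physical quantity."""
    if rating != "LOW" and not _PHYSICAL_QUANTITY.search(justification):
        return "LOW"
    return rating
```

Under this sketch, the ERM II and Article 140 TFEU fabrications seen in the reruns would be forced back to LOW, while a justification citing a real quantitative violation survives.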
