fix: prevent self-audit 'Violates Known Physics' example bleedthrough#585
Open
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc., but left the concrete physics
examples inside the same instruction: "perpetual motion,
faster-than-light travel, reactionless/anti-gravity propulsion, time
travel" plus "thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists.

Same bleedthrough pattern as PR #582 on identify_documents: concrete
few-shot examples get reproduced as findings by weaker models.

Fix: strip the sci-fi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications); that is a separate
concern, to be addressed in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
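The bleedthrough described in this commit message can be checked mechanically. The following is a minimal sketch, not code from this PR: it flags word n-grams that a justification shares with the instruction but that never occur in the plan. The function names, the abbreviated `instruction` string, and the one-line `plan` are all hypothetical stand-ins; only the quoted example phrases and the justification come from the run described above.

```python
import re


def word_ngrams(text: str, n: int) -> set[str]:
    """Lowercased word n-grams; punctuation such as '/' and '-' splits words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def bleedthrough_phrases(instruction: str, plan: str, justification: str, n: int = 3) -> set[str]:
    """N-grams the justification shares with the instruction but not with the plan."""
    shared_with_prompt = word_ngrams(justification, n) & word_ngrams(instruction, n)
    return shared_with_prompt - word_ngrams(plan, n)


# Hypothetical, abbreviated inputs (the real prompt and plan are not shown in this PR).
instruction = (
    "Flag perpetual motion, faster-than-light travel, "
    "reactionless/anti-gravity propulsion, time travel. "
    "Laws: thermodynamics, conservation of energy, relativity."
)
plan = "Denmark holds a referendum and adopts the euro, replacing the krone."
justification = (
    "success literally requires breaking the named law of physics "
    "(conservation of energy) for a reactionless/anti-gravity propulsion system."
)

leaked = bleedthrough_phrases(instruction, plan, justification)
print(sorted(leaked))
```

A non-empty result suggests the model copied rubric text rather than describing the plan. The trigram size is an arbitrary choice; shorter n-grams would produce more false positives from common words.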
Previous commit on this branch kept two enumerations inherited from
commit 00d997e that named specific domains as out-of-scope, including
"currency adoption" and "economics / finance / regulation / policy".
The Denmark euro-adoption plan that triggered this bug matches those
literal strings, so a passing test would only prove the model can
pattern-match the enumeration back to the prompt, not that the
structural anti-hallucination rule is working.

Strip both domain lists. The instruction now relies only on:
- the disambiguation that 'Laws' means physics laws, not legal ones,
- the structural rule that the model must quote plan text describing a
  physics-violating mechanism or rate LOW.

The re-run now becomes an honest test of the structural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the Denmark euro-adoption rerun, item 1 still rated HIGH with
justification: "success literally requires breaking the EU's monetary
architecture (Article 140 TFEU and the no-bailout clause)."
User's insight confirmed: the instruction's negation-bait ("NOT legal
statutes, treaty law, regulations") primes the model to consider
treaty law as in-scope. Seeing "treaty law" mentioned in the rubric
makes "Article 140 TFEU" feel like a valid physics-law citation. The
more we said "NOT X", the more the model latched onto X.
Rewrite with positive-only framing:
- Define a law of physics by discipline (mechanics, thermodynamics,
electromagnetism, quantum mechanics, relativity).
- Define a physics violation by substrate (physical mechanism: device,
energy flow, field, force, material process).
- Require the ≥MEDIUM justification to quote plan text describing the
physical mechanism AND name the violated law.
No mention of legal, treaty, regulation, policy, currency, or any
non-physics domain anywhere in the instruction. The model has no
hint-bait to latch onto; it either finds a physical-mechanism quote
in the plan or rates LOW.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the rerun, item 1 still rated HIGH with justification: "physical
mechanism — moving the DKK/EUR peg within ERM II — that contradicts a
law of physics (thermodynamics/mechanics) by demanding sustained
intervention and precise energy flows to maintain the peg."

The LLM exploited polysemy: "mechanism", "energy flow", and "force"
all have metaphorical uses in economics and policy. Defining a physics
violation as requiring a "physical mechanism" was insufficient:
"intervention mechanism" and "currency flow" were close enough for the
model to claim the gate was satisfied.

Replace the "physical mechanism" gate with a harder one: to rate
≥MEDIUM, the justification must cite a specific physical quantity in
SI units (joules, newtons, kelvin, coulombs, m/s, etc.) with a
numerical magnitude, sourced from a verbatim plan-text quote and
attached to a named physics law. Three gates (quote + law + SI
magnitude), all-or-LOW.

Currency pegs, treaty mechanisms, and organizational processes cannot
be given coherent joule/newton/kelvin values against a named law, so
the model either has to fabricate a number (a visible tell) or rate
LOW. Perpetual-motion and FTL proposals still rate HIGH cleanly
because their SI-unit violations are real (net-positive joule output
without input; velocity ≥ c).

No negation of domains, no listing of metaphorical word uses; the
quantitative gate does the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
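To make the three gates concrete, here is a rough sketch of how they could be evaluated outside the prompt. It is an illustration under assumptions, not part of this PR, which changes only the instruction text. `LAW_NAMES`, the unit list in `SI_MAGNITUDE`, the function names, and both example plans are hypothetical, and the second justification is an abbreviated paraphrase of the one quoted above.

```python
import re

# Assumed vocabulary for illustration only.
LAW_NAMES = ("conservation of energy", "thermodynamics", "relativity")

# A number followed by a spelled-out SI unit. Single-letter symbols (J, N, K)
# are deliberately omitted, because "10k" in a budget line would match "K".
SI_MAGNITUDE = re.compile(
    r"\d+(?:\.\d+)?\s*(?:joules?|newtons?|kelvin|coulombs?|m/s)\b",
    re.IGNORECASE,
)


def gate_report(plan: str, justification: str) -> dict[str, bool]:
    """Evaluate each gate separately so that a failure is easy to diagnose."""
    # Gate 1: a double-quoted span in the justification that occurs verbatim in the plan.
    quotes = [q for q in re.findall(r'"([^"]+)"', justification) if q in plan]
    return {
        "quote": bool(quotes),
        # Gate 2: the justification names a physics law.
        "law": any(law in justification.lower() for law in LAW_NAMES),
        # Gate 3: the SI magnitude must sit inside a verified plan quote.
        "si_magnitude": any(SI_MAGNITUDE.search(q) for q in quotes),
    }


def passes_all_gates(plan: str, justification: str) -> bool:
    """All-or-LOW: every gate must hold for a rating of MEDIUM or above."""
    return all(gate_report(plan, justification).values())


plan_device = "The generator outputs 500 joules per cycle with no energy input."
just_device = (
    'The plan states "outputs 500 joules per cycle with no energy input", '
    "which violates conservation of energy."
)
plan_euro = "Denmark moves the DKK/EUR peg within ERM II before adopting the euro."
just_euro = (
    "Moving the DKK/EUR peg within ERM II contradicts thermodynamics "
    "by demanding precise energy flows."
)

print(gate_report(plan_device, just_device))
print(gate_report(plan_euro, just_euro))
```

The euro justification passes the law gate because it names thermodynamics. A single gate is therefore easy to satisfy with fabricated language, while the conjunction with a verified quote is not.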
Three previous prompt iterations on this PR tried increasingly
structured rubrics (scope enumeration, positive-only framing, SI-unit
gate with three mandatory justification elements). All three failed
the same way: a weak non-reasoning model that has been trained to
flag audit items as HIGH will fabricate physics language ("ERM II
energy flows", "Article 140 TFEU physics violation") to match
whatever rubric text it finds, ignoring conditional structure.
User's observation: the prompt may be too confusing for the model to
follow. Strip it down to a short, unconditional instruction:
"This check applies only to plans describing physical devices or
material processes. Default rating: LOW. Rate HIGH only if the plan
requires a physical device or material process that cannot exist
under known physics."
No SI units, no example violations, no named laws, no domain
enumeration (positive or negative), no multi-gate conditionals.
Whether this works depends on whether the model respects the short
default-LOW instruction or continues to pattern-match HIGH regardless.
If it fails, the next step is a code-side post-validator that
overrides non-LOW ratings whose justification does not cite a concrete
physical quantity — but that is out of scope for this prompt-only PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
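The code-side post-validator mentioned as the next step could look roughly like the sketch below. It is not implemented in this PR. The `AuditResult` shape, the `post_validate` name, and the regular expression standing in for "a concrete physical quantity" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, replace

# Assumed definition of "a concrete physical quantity": a number followed by
# a spelled-out SI unit (or m/s). A real validator would need a fuller unit list.
PHYSICAL_QUANTITY = re.compile(
    r"\d+(?:\.\d+)?\s*(?:joules?|newtons?|kelvin|coulombs?|watts?|m/s)\b",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class AuditResult:
    """Hypothetical shape of one audit item's output."""
    index: int
    level: str  # "LOW", "MEDIUM" or "HIGH"
    justification: str


def post_validate(result: AuditResult) -> AuditResult:
    """Downgrade a non-LOW rating whose justification cites no physical quantity."""
    if result.level != "LOW" and not PHYSICAL_QUANTITY.search(result.justification):
        return replace(result, level="LOW")
    return result


# Justification taken from the rerun described earlier in this PR.
euro = AuditResult(
    index=1,
    level="HIGH",
    justification=(
        "success literally requires breaking the EU's monetary architecture "
        "(Article 140 TFEU and the no-bailout clause)."
    ),
)
# Invented example of a rating that should survive the check.
ftl = AuditResult(
    index=1,
    level="HIGH",
    justification="The plan's drive reaches 400000000 m/s, above the speed of light.",
)

print(post_validate(euro).level)
print(post_validate(ftl).level)
```

Because the override runs after the model responds, it does not depend on the model respecting conditional structure in the prompt. A model that fabricates a number with a unit would still pass, so this narrows the failure mode without eliminating it.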
Summary
- identify_documents). The instruction text for item 1 contained concrete few-shot examples, "perpetual motion, faster-than-light travel, reactionless/anti-gravity propulsion, time travel" and "thermodynamics, conservation of energy, relativity", which the model reproduced verbatim as findings instead of treating as rubric anchors.
- 00d997ee added strong out-of-scope exclusions (legislation/treaties/currency adoption rate LOW) but left the copy-bait examples in place. The model resolved the conflict by keeping the HIGH rating anyway and fabricating a reactionless-drive narrative.

Change
One instruction string in `ALL_CHECKLIST_ITEMS[0]`:
The existing structural guardrails (scope list, legislation/treaties/currency adoption → LOW, Economics/crypto/tokenization/governance/AI/regulation/policy/finance/engineering-scale are out of scope → LOW) are preserved.

Out of scope (follow-up)
Bug B: 14 of 20 audit items shared word-for-word identical justifications (boilerplate from item 4's `Underestimating Risks` instruction), caused by `user_prompt_with_previous_responses` feeding all prior answers into each subsequent item. Separate concern; will address in a follow-up PR if desired.

Test plan
🤖 Generated with Claude Code