
fix: anonymize few-shot examples in identify_documents prompts#582

Merged
neoneye merged 2 commits into main from fix/identify-documents-prompt-bleedthrough
Apr 17, 2026

Conversation


neoneye (Member) commented Apr 16, 2026

Summary

  • IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variants in worker_plan_internal/document/identify_documents.py contained concrete, domain-named examples (childcare subsidy policies, housing price indices, mental health survey data, marathon training, climate crop yield, etc.) that the LLM reproduced verbatim for unrelated plans.
  • Observed in the Denmark euro-adoption run (PlanExe-web/20260129_euro_adoption/): identified_documents_to_find.json was populated with the exact social-policy document names from the business prompt's NAMING CONVENTION and EXAMPLE MAPPING sections. Downstream DraftDocumentsToFindTask (separate per-document LLM call) correctly filled euro-adoption content into essential_information, producing a chimera output in draft_documents_to_find.json.
  • Fix: replace concrete named examples in the Documents-to-Find sections (EXAMPLE MAPPING, FORBIDDEN, NAMING CONVENTION) with bracketed placeholders across all three variants (business / personal / other), and mark the EXAMPLE MAPPING blocks as illustrative patterns only — do NOT reuse these topics. Structural guidance is preserved; only the domain hooks the LLM was latching onto are removed.

Root cause

The prompt's illustrative examples acted as a domain prior. For a Denmark euro-adoption plan, the LLM generated documents named Existing National Childcare Subsidy Policies, Data on Average Childcare Costs, Tax Code Sections Related to Dependents, National Housing Price Indices, Official National Mental Health Survey Data, Existing Social Support Program Details — all literal strings from the system prompt examples.

Test plan

  • Re-run the Denmark euro-adoption plan and confirm identified_documents_to_find.json contains euro-adoption-relevant source materials (treaty texts, ERM II convergence criteria, ECB monetary policy, payment-system standards, currency-conversion studies), not social-policy items.
  • Spot-check runs on an unrelated domain (e.g., infrastructure, tech rollout) to confirm the prompt still yields well-named documents without the concrete examples.
  • Consider queuing this prompt in the self_improve loop — it has not been iterated on yet.
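The spot-checks above can be backed by a small regression guard: scan each prompt variant for the concrete domain strings that previously leaked into output. A minimal sketch (the prompt text below is a stand-in; the real constants live in worker_plan_internal/document/identify_documents.py, and the forbidden-string list is an assumption drawn from this PR's description):

```python
# Forbidden concrete strings that earlier bled through into
# identified_documents_to_find.json (per this PR's description).
FORBIDDEN_EXAMPLE_STRINGS = [
    "Childcare Subsidy",
    "Housing Price Indices",
    "Mental Health Survey",
    "marathon training",
    "climate crop yield",
]

# Stand-in for an IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variant after the fix:
# concrete examples replaced with bracketed placeholders.
ANONYMIZED_PROMPT = """
NAMING CONVENTION (illustrative patterns only -- do NOT reuse these topics):
- Existing [Policy Area] Policies
- Data on [Key Metric]
- Official [Domain] Survey Data
"""

def find_bleedthrough(prompt: str) -> list[str]:
    """Return the forbidden concrete strings present in the prompt, if any."""
    lowered = prompt.lower()
    return [s for s in FORBIDDEN_EXAMPLE_STRINGS if s.lower() in lowered]

# The anonymized prompt should carry none of the old domain hooks.
assert find_bleedthrough(ANONYMIZED_PROMPT) == []
```

Wiring this into CI would catch a future edit that reintroduces a concrete example into any of the three variants.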

🤖 Generated with Claude Code

neoneye and others added 2 commits April 17, 2026 01:24
The three IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variants contained concrete,
domain-named examples (childcare subsidy policies, housing price indices,
mental health survey data, marathon training, climate crop yield, etc.)
that the LLM reproduced verbatim for unrelated plans. Observed in a
Denmark euro-adoption run: documents_to_find was populated with the exact
social-policy names from the business prompt's NAMING CONVENTION and
EXAMPLE MAPPING sections, while the downstream DraftDocumentsToFindTask
correctly filled euro-adoption content into essential_information,
producing a chimera output.

Replace the concrete named examples in the Documents to Find sections
(EXAMPLE MAPPING, FORBIDDEN, NAMING CONVENTION) with bracketed
placeholders, and mark the EXAMPLE MAPPING blocks as "illustrative
patterns only — do NOT reuse these topics." The structural pattern is
preserved; only the domain hooks the LLM was latching onto are removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neoneye merged commit b580291 into main Apr 17, 2026
3 checks passed
neoneye deleted the fix/identify-documents-prompt-bleedthrough branch April 17, 2026 00:53
neoneye added a commit that referenced this pull request Apr 17, 2026
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc. — but left the concrete physics
examples inside the same instruction: "perpetual motion, faster-than-
light travel, reactionless/anti-gravity propulsion, time travel" plus
"thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists. Same bleedthrough pattern as PR #582 on
identify_documents: concrete few-shot examples get reproduced as
findings by weaker models.

Fix: strip the scifi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications) — separate concern, will
address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
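The anti-fabrication rule in that commit can be enforced mechanically: a MEDIUM-or-higher rating is accepted only when its justification quotes a span that appears verbatim in the plan text. A sketch under that assumption (function name and threshold are hypothetical, not from the codebase):

```python
import re

def is_justification_grounded(plan_text: str, justification: str,
                              min_quote_len: int = 15) -> bool:
    """Accept a >=MEDIUM physics-violation rating only if the justification
    contains a double-quoted span that occurs verbatim in the plan text.
    The length floor filters out trivial quotes like a single word."""
    quotes = re.findall(r'"([^"]+)"', justification)
    return any(len(q) >= min_quote_len and q in plan_text for q in quotes)

plan = ("We will fund a reactionless drive that outputs more energy "
        "than it consumes.")
grounded = ('HIGH: the plan states "outputs more energy than it consumes", '
            'violating conservation of energy.')
fabricated = 'HIGH: success literally requires breaking conservation of energy.'

assert is_justification_grounded(plan, grounded)
assert not is_justification_grounded(plan, fabricated)
```

A justification built only from the instruction's own example phrases, with no quote from the plan, fails the check and would be downgraded to LOW.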
