
fix: anonymize few-shot examples in identify_documents prompts#582

Merged
neoneye merged 2 commits into main from fix/identify-documents-prompt-bleedthrough
Apr 17, 2026

Conversation


neoneye (Member) commented Apr 16, 2026

Summary

  • IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variants in worker_plan_internal/document/identify_documents.py contained concrete, domain-named examples (childcare subsidy policies, housing price indices, mental health survey data, marathon training, climate crop yield, etc.) that the LLM reproduced verbatim for unrelated plans.
  • Observed in the Denmark euro-adoption run (PlanExe-web/20260129_euro_adoption/): identified_documents_to_find.json was populated with the exact social-policy document names from the business prompt's NAMING CONVENTION and EXAMPLE MAPPING sections. Downstream DraftDocumentsToFindTask (separate per-document LLM call) correctly filled euro-adoption content into essential_information, producing a chimera output in draft_documents_to_find.json.
  • Fix: replace concrete named examples in the Documents-to-Find sections (EXAMPLE MAPPING, FORBIDDEN, NAMING CONVENTION) with bracketed placeholders across all three variants (business / personal / other), and mark the EXAMPLE MAPPING blocks as illustrative patterns only — do NOT reuse these topics. Structural guidance is preserved; only the domain hooks the LLM was latching onto are removed.

Root cause

The prompt's illustrative examples acted as a domain prior. For a Denmark euro-adoption plan, the LLM generated documents named Existing National Childcare Subsidy Policies, Data on Average Childcare Costs, Tax Code Sections Related to Dependents, National Housing Price Indices, Official National Mental Health Survey Data, Existing Social Support Program Details — all literal strings from the system prompt examples.

Test plan

  • Re-run the Denmark euro-adoption plan and confirm identified_documents_to_find.json contains euro-adoption-relevant source materials (treaty texts, ERM II convergence criteria, ECB monetary policy, payment-system standards, currency-conversion studies), not social-policy items.
  • Spot-check runs on an unrelated domain (e.g., infrastructure, tech rollout) to confirm the prompt still yields well-named documents without the concrete examples.
  • Consider queuing this prompt in the self_improve loop — it has not been iterated on yet.
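The spot-checks above can be backed by a small regression guard: scan each prompt variant for the concrete domain strings that previously leaked into output. A minimal sketch (the prompt text below is a stand-in; the real constants live in worker_plan_internal/document/identify_documents.py, and the forbidden-string list is an assumption drawn from this PR's description):

```python
# Forbidden concrete strings that earlier bled through into
# identified_documents_to_find.json (per this PR's description).
FORBIDDEN_EXAMPLE_STRINGS = [
    "Childcare Subsidy",
    "Housing Price Indices",
    "Mental Health Survey",
    "marathon training",
    "climate crop yield",
]

# Stand-in for an IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variant after the fix:
# concrete examples replaced with bracketed placeholders.
ANONYMIZED_PROMPT = """
NAMING CONVENTION (illustrative patterns only -- do NOT reuse these topics):
- Existing [Policy Area] Policies
- Data on [Key Metric]
- Official [Domain] Survey Data
"""

def find_bleedthrough(prompt: str) -> list[str]:
    """Return the forbidden concrete strings present in the prompt, if any."""
    lowered = prompt.lower()
    return [s for s in FORBIDDEN_EXAMPLE_STRINGS if s.lower() in lowered]

# The anonymized prompt should carry none of the old domain hooks.
assert find_bleedthrough(ANONYMIZED_PROMPT) == []
```

Wiring this into CI would catch a future edit that reintroduces a concrete example into any of the three variants.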

🤖 Generated with Claude Code

neoneye and others added 2 commits April 17, 2026 01:24
The three IDENTIFY_DOCUMENTS_*_SYSTEM_PROMPT variants contained concrete,
domain-named examples (childcare subsidy policies, housing price indices,
mental health survey data, marathon training, climate crop yield, etc.)
that the LLM reproduced verbatim for unrelated plans. Observed in a
Denmark euro-adoption run: documents_to_find was populated with the exact
social-policy names from the business prompt's NAMING CONVENTION and
EXAMPLE MAPPING sections, while the downstream DraftDocumentsToFindTask
correctly filled euro-adoption content into essential_information,
producing a chimera output.

Replace the concrete named examples in the Documents to Find sections
(EXAMPLE MAPPING, FORBIDDEN, NAMING CONVENTION) with bracketed
placeholders, and mark the EXAMPLE MAPPING blocks as "illustrative
patterns only — do NOT reuse these topics." The structural pattern is
preserved; only the domain hooks the LLM was latching onto are removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neoneye merged commit b580291 into main Apr 17, 2026
3 checks passed
neoneye deleted the fix/identify-documents-prompt-bleedthrough branch April 17, 2026 00:53
neoneye added a commit that referenced this pull request Apr 17, 2026
Previous fix (00d997e) added out-of-scope exclusions for legislation,
treaties, currency adoption, etc. — but left the concrete physics
examples inside the same instruction: "perpetual motion, faster-than-
light travel, reactionless/anti-gravity propulsion, time travel" plus
"thermodynamics, conservation of energy, relativity".

On a Denmark euro-adoption run, item 1 was rated HIGH with
justification: "success literally requires breaking the named law of
physics (conservation of energy) for a reactionless/anti-gravity
propulsion system." Both "reactionless/anti-gravity propulsion" and
"conservation of energy" are lifted verbatim from the instruction's
own example lists. Same bleedthrough pattern as PR #582 on
identify_documents: concrete few-shot examples get reproduced as
findings by weaker models.

Fix: strip the scifi system examples and the named-law examples from
the instruction. Add an explicit anti-fabrication rule: the model must
quote text from the plan describing a physics-violating mechanism, or
else rate LOW. For ≥MEDIUM ratings, require the justification to quote
the plan text alongside naming the violated law.

Scope: item 1 only. Does not touch Bug B (template lock across audit
items 4-20 sharing identical justifications) — separate concern, will
address in a follow-up PR if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
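The anti-fabrication rule in that commit can be enforced mechanically: a MEDIUM-or-higher rating is accepted only when its justification quotes a span that appears verbatim in the plan text. A sketch under that assumption (function name and threshold are hypothetical, not from the codebase):

```python
import re

def is_justification_grounded(plan_text: str, justification: str,
                              min_quote_len: int = 15) -> bool:
    """Accept a >=MEDIUM physics-violation rating only if the justification
    contains a double-quoted span that occurs verbatim in the plan text.
    The length floor filters out trivial quotes like a single word."""
    quotes = re.findall(r'"([^"]+)"', justification)
    return any(len(q) >= min_quote_len and q in plan_text for q in quotes)

plan = ("We will fund a reactionless drive that outputs more energy "
        "than it consumes.")
grounded = ('HIGH: the plan states "outputs more energy than it consumes", '
            'violating conservation of energy.')
fabricated = 'HIGH: success literally requires breaking conservation of energy.'

assert is_justification_grounded(plan, grounded)
assert not is_justification_grounded(plan, fabricated)
```

A justification built only from the instruction's own example phrases, with no quote from the plan, fails the check and would be downgraded to LOW.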
