napkin-math: prompt cleanup for compress + extract by neoneye · Pull Request #737 · PlanExeOrg/PlanExe

neoneye · 2026-05-20T21:31:29Z

Scope

Prompt cleanup PR — both compress (compress_report_section.py) and extract (extract-parameters-from-digest/system-prompt.txt) prompts. Kept as a single PR per request so the combined effect of the prompt changes can be evaluated together against the napkin_math regression probes.

The PR is not a quality-acceptance claim that v50 outputs beat v49 across the baseline. Compress is stochastic; one before/after pass cannot honestly attribute output movement to a specific prompt edit. Output-quality acceptance, plus the unrelated compress-LLM run-to-run variance problem, are explicitly deferred.

What changes (net effective diff vs. main)

Compress stage — `compress_report_section.py`

NUMERIC_VALUES_BUCKET_PROMPT — labels for rate-like quantities must carry their per-period or per-unit denominator; source-stated rates (explicit / inferred / recommended-range / stress_test) belong in this bucket, not in missing_data; modelling role describes the multiplication structure.
MISSING_DATA_BUCKET_PROMPT — denominator-pairing rule extended: a per-period burn rate needs the matching duration it will be multiplied by.
GATES_AND_THRESHOLDS_BUCKET_PROMPT — capacity requirements count as gates even when the source frames them as sizing math, rewritten into if/then form.
OPTIMIZE_INSTRUCTIONS (new module-level block) — anti-overfitting documentation for future authors. Lists the four categories of forbidden prompt content as abstract structural descriptions only; the banner is held to its own rule.

Extract stage — `system-prompt.txt`

Threshold pairing rule (new section between "Coverage and capacity gate rule" and "Combined viability gate preservation") — when the extract emits a key_value whose role is a numeric threshold (floor, cap, ceiling, minimum, maximum, target volume, target share, target deadline), it must also emit a paired margin/surplus calculation comparing the realised quantity against the threshold. The realised goes in missing_values_to_estimate if absent from the source; the margin goes in recommended_first_calculations or derived_questions; under cap pressure, drop a less-load-bearing key_value or move a less-critical calc rather than skip the pairing.

All five edits are corpus-agnostic: no plan names, no literals from any baseline plan, no expected downstream output ids, no domain-specific framings. The OPTIMIZE_INSTRUCTIONS block names only abstract categories of forbidden content; the threshold-pairing rule names only structural threshold categories (floor, cap, target, deadline).

Discipline this PR enforces

Each edit is written so it would apply identically to any plan in any domain or language. The baseline plans (paperclip, datacenter, crate_recovery, yellowstone, plus the other ten Self-Improve / smoke-fixture plans) are used as regression probes, not as the spec — when a probe surfaces a failure pattern, the prompt edit captures the abstract pattern in corpus-agnostic language.

Where the discipline still falls short

Honest list of known limitations these prompt edits do NOT fix:

Compress LLM run-to-run variance. The same source, same prompts, two compress passes can produce materially different bucket selections. A high-impact tripwire stated by the source can be dropped by one run and present in the next. A follow-up that introduces a deterministic retry/merge across N compress passes, or a lower-temperature pass for high-impact buckets, is the right home for this.
Threshold-pairing rule × 5-cap on missing_values_to_estimate. When a plan names many independent thresholds, the rule's requirement to pair each one collides with the 5-cap and forces tradeoffs — a less-load-bearing existing pair may be dropped to make room for a new pairing. The "less-load-bearing" criterion is judgement-based; two reasonable extractors will disagree. A follow-up that re-examines hard limits, or that defines a quantitative criterion for "load-bearing", is the right home.
No formal source-preservation check. The audit script used during verification (every key_value appears in some calc's depends_on) does not verify that important source thresholds were preserved across iterations. A v49 → v50 comparison that silently drops a v49 gate can still pass this audit. A follow-up should add a source-preservation check that compares against either the source text or the prior baseline.

Verification

17/17 unit tests pass.
Compress + extract re-runs on paperclip, datacenter, crate_recovery, and yellowstone_evacuation recorded under experiments/napkin_math/output/v50/ (gitignored). Used as regression probes only.
grep confirms no baseline-plan literals (€[0-9], km², GW, GVA, RTE, DGSI, DREAL, OPC UA, paperclip, hyperscale) appear in either prompt file outside the legitimate section-name enum.

Acceptance checks for any follow-up quality-improvement PR

When the next PR claims a structural improvement over the baseline, it should:

Demonstrate an abstract rule moves behaviour across multiple regression-probe plans, not one.
Show that no important source threshold is silently dropped — every drop has an explicit structural rationale (cap pressure, scope, or the threshold being unmodelled).
Avoid prompt content that names any baseline plan, its acronyms, its numeric values, its expected output ids, or its domain-specific jargon.
Treat compress-LLM variance as its own structural problem, addressed at the orchestration/sampling layer rather than by prompt rewording.

Commit chain

bd6faf4a — Phase 1 compress prompt edits + initial OPTIMIZE_INSTRUCTIONS guard
74030f8d — Tighten numeric_values to route source-stated rates into the right bucket
19dd780e — Strip eval-corpus literals from OPTIMIZE_INSTRUCTIONS (keep categories, drop concrete examples)
6a7468b8 — Add threshold-pairing rule to extract-parameters-from-digest
2f9bb77a — Revert of the threshold-pairing rule (during a brief PR split)
2d1088d2 — Restore the threshold-pairing rule (consolidating back into a single PR for unified review)

The pair of revert commits records the brief split-and-reconsolidate. Net diff vs main is the five substantive prompt changes.

Test plan

CI green
Hand-spot the new bucket-prompt paragraphs and the threshold-pairing rule are corpus-agnostic
OPTIMIZE_INSTRUCTIONS contains no concrete corpus identifiers
Compress + extract regression probes do not reveal corpus-leaking content in either prompt file

Three prompt edits in compress_report_section.py per the 2026-05-20 methodology plan, Phase 1 (Stage 1 worker_plan): - NUMERIC_VALUES_BUCKET_PROMPT — labels for rate-like quantities must carry the per-period or per-unit denominator; modelling role describes the multiplication structure so a downstream stage can identify the matching scaling input. (R2.3) - MISSING_DATA_BUCKET_PROMPT — extended denominator-pairing rule: a per-period burn rate needs the matching duration the rate will be multiplied by. Without the duration the multiplication is lost downstream. (R2.3) - GATES_AND_THRESHOLDS_BUCKET_PROMPT — capacity requirements count as gates even when the source frames them as sizing calculations rather than explicit pass/fail conditions. (R2.5) Also adds a module-level OPTIMIZE_INSTRUCTIONS block modeled on identify_potential_levers.py. It documents the napkin_math pipeline context, the 14-plan baseline this file is tested against, and an explicit anti-overfitting rule: prompt examples must not paraphrase corpus contents (numbers, named entities, expected output ids, or domain-specific framings). Banned terms are listed by name and the known failure patterns are enumerated. Future edits to this file should read OPTIMIZE_INSTRUCTIONS first. Verification: 17/17 unit tests pass. Compress + extract were re-run end-to-end on the paperclip and datacenter baselines (results under output/v50/, gitignored). Phase-1 alone did not yet move the R2.5 capacity-requirement-as-gate signal in compress output for the datacenter plan — iteration on the gates_and_thresholds prompt is expected as a follow-up commit on this branch.

Round-2 iteration of the Phase-1 compress prompts, per ChatGPT review feedback on the v50 run. NUMERIC_VALUES_BUCKET_PROMPT — when the source states a per-period or per-unit rate (explicitly, inferred, recommended-range, or as a stress_test magnitude), it must appear in numeric_values with the denominator preserved. missing_data is reserved for primitives the source does NOT name. A total cost that the source ALSO frames as a per-period burn must surface in both forms when both are stated. Compress re-run on the paperclip and datacenter baselines (results under output/v50/, gitignored): - Datacenter: R2.5 capacity-as-gate now appears in compress_review_plan.md gates_and_thresholds ("If judicial review challenges the hyper-dense model, then buildable area could reduce below 30 km²"). Holding rate (€/ha/year) surfaces in premortem and expert_criticism missing_data with the matching duration paired alongside, though not yet in review_plan where ChatGPT specifically flagged it. - Paperclip: known LLM run-to-run variance dropped some premortem tripwires ($75k OPC UA bid, 100ms p99 latency) from this run's digest. Those are not Phase-1 regressions — they came back differently on a fresh compress pass against the unchanged source. Parameters.json restores the $25k on-site expert cap gate (still in the digest via Data Collection passthrough) and keeps the bottom-up budget decomposition and manual-intervention surplus. Also reverts a misguided EVAL_CORPUS_BANNED_TERMS constant and associated regression-test scaffold that I added in this session: the static term list was itself fitted to today's two-plan baseline and would have rotted as the corpus shifts. OPTIMIZE_INSTRUCTIONS stays as narrative-only documentation. 17/17 unit tests still pass.

Round-3 cleanup per ChatGPT review of the v50 work. OPTIMIZE_INSTRUCTIONS previously listed concrete corpus identifiers (currency amounts, area figures, capacity figures, regulator acronyms, protocol names, expected downstream snake_case ids, industry-specific framings) as parenthetical examples inside an anti-overfitting banner. That made the banner itself a corpus leak in the source file — the exact failure mode the rule forbids. The list would also rot the moment the baseline corpus shifts. Rewrites the banner to keep the four categories of forbidden content (concrete numeric values, named entities/acronyms, expected output ids, domain-specific framings) as abstract structural descriptions only. No concrete examples from any baseline plan remain. Adds a closing paragraph that explicitly holds the banner itself to the rule it states. No change to the bucket prompts themselves; the three Phase-1 prompt edits from earlier commits are untouched. 17/17 unit tests still pass. This commit also scopes the PR purpose: it is now a prompt-cleanup PR. Output-quality acceptance against the napkin_math baseline (notably restoring the paperclip premortem tripwires that one stochastic compress pass drops) moves to a follow-up that introduces a deterministic retry/merge or lower-temperature rerun strategy for compression gates.

…gest Adds a new "Threshold pairing rule" to the extract skill's system prompt, slotted between "Coverage and capacity gate rule" and "Combined viability gate preservation". Why: extraction runs were leaving threshold key_values (floors, caps, targets, deadlines) unpaired with their realised-vs-threshold margin calcs. The 'no dead-end variables' rule already forbids this in principle, but in practice the abstract directive was being followed unevenly. The new rule makes the pairing explicit and operational: every extracted threshold gets a paired margin calculation in recommended_first_calculations or derived_questions, with the realised quantity declared in missing_values_to_estimate when the source does not name it. The rule is corpus-agnostic — it uses only structural categories (floor, cap, ceiling, target volume, target share, target deadline) and the existing _margin / _surplus naming convention. No corpus literals. Tested by re-running the skill on two v50 baselines that had unpaired thresholds: - crate_recovery_campaign — previously had target_recovered_crates (108k volume target) and minimum_donation_threshold_dkk (500k donation floor) as unpaired key_values. Re-extraction emits q_volume_target_margin and q_donation_minimum_margin as derived_questions per the new rule. Also restructured average_effective_incentive_per_crate_dkk from a missing-value to an explicit recommended_first_calculation so that incentive_per_crate_dkk and pilot_incentive_decel_threshold_crates appear in real depends_on rather than narrative-only suggested_estimation_method prose. - yellowstone_evacuation — previously had hospital_fuel_priority_share (75% generator-fuel floor) unpaired and vei7_uplift_trigger_cm_per_hour mis-classified as a simulatable key_value. Re-extraction emits hospital_fuel_priority_margin and moves vei7 to unmodelled_gates (geological observation, not deterministically simulatable). Dropped zone_zero_evacuation_target_people (population denominator, not a viability threshold) and the weakest existing pair (public_compliance_threshold_zone_one) under cap pressure, since hospital generator-fuel priority is more directly life-safety than downstream traffic congestion. Both v50 parameters.json files now audit clean against the no-dead-end-variables rule (every key_value appears in at least one calc's depends_on).

…-from-digest" This reverts commit 6a7468b.

…rameters-from-digest"" This reverts commit 2f9bb77.

…PTIMIZE_INSTRUCTIONS Captures a discipline that emerged during the v50 review cycle: do not encode a single baseline plan's behaviour into the prompts, even when no literal from that plan leaks into the text. Picking a rule's shape from one plan's failure pattern without testing the rule across multiple probes is overfitting in a subtler form. Sections added (slotted between 'Anti-overfitting' and 'Known failure patterns to guard against'): - The four-step diagnostic discipline when a baseline plan reveals a failure mode: diagnose the abstract pattern, write the rule in corpus-agnostic language, verify across multiple probe plans, and surface tradeoffs explicitly when a rule lifts one plan but flattens another. - The reporting format for a prompt edit's effect (which structural rule changed, which corpus-agnostic behaviour it improves, which probes moved, any dropped baseline signal and the structural justification). - The explicit reminder that compress-LLM run-to-run variance is its own structural problem belonging to orchestration (retry/merge across passes, lower-temperature reruns), not to bucket-prompt rewording. 17/17 unit tests still pass. No corpus literals leaked into the new section.

Adds a 'Status as of 2026-05-21' section to the plan capturing where things stand after the v50 prompt-cleanup work: - What landed in PR #737 (three compress-prompt edits + OPTIMIZE_INSTRUCTIONS in compress_report_section.py, plus the threshold-pairing rule in the extract skill's system prompt). - Which structural behaviours moved in the expected direction across multiple regression probes (paperclip, datacenter, crate_recovery, yellowstone). - Three known limitations PR #737 does NOT address and that are deferred to a separate output-quality PR: compress-LLM run-to-run variance, threshold-pairing-rule × missing-values-cap tradeoff, and absence of a source-preservation audit. - Process insights from the v50 work that are now written into the per-file OPTIMIZE_INSTRUCTIONS blocks: fix prompts not outputs; regression probes are not acceptance criteria; each prompt-engineering file should carry its own discipline banner. - A status table marking Phase 1 done, Phase 2 partial (threshold-pairing rule landed), Phases 3-10 not started. - A recommended ordering for the next PR: prefer a compress-variance stability PR over continuing Phase 2 prompt work, so downstream prompts land on stable compress output. Documentation-only change; no code or prompt edits.

Two stale claims in the 'Status as of 2026-05-21' section: - "five edits are corpus-agnostic. The five baseline plans..." conflated the prompt-edit count (correct, 5) with a probe count that drifts every time a new plan is added to the verification set. - The 'Verified against probes' list named four plans (paperclip, datacenter, crate_recovery, yellowstone) but two more probes (mars_gtld, euro_adoption) have since been added, so the list was already out of date. Rewritten to avoid exact probe counts: the verification set is described as 'a growing subset of the napkin_math 14-plan corpus' with the current named members listed and an explicit note that counts in this section are not normative. Pure documentation fix; no prompt or code change.

Replaces the prior 'Status as of 2026-05-21' content with three explicit subsections: 1. Landed on main: PR #737 (Phase 1 compress + initial extract threshold-pairing + OPTIMIZE_INSTRUCTIONS) and PR #739 (Proposal 141 design only). 2. Open for merge: PR #740 commit chain (4cda70b source-arithmetic + parity, 19f927b aggregate-sum tightening, 8f94c8c source_text truncation discipline). All edits applied symmetrically to both extract skills. 3. PR #740 verification posture: same-LLM same-session regression check, not improvement proof. All six v51 parameters.json files validate clean. Behavioural verification of the rules on a different LLM is a separate piece of follow-up work, not part of PR #740. Known limitations section now explicitly names the clearest unresolved regression (paperclip OPC UA / p99 latency at compress stage), the cap-pressure-without-recorded-rationale gap (yellowstone public_compliance trio), and the absence of a source-preservation audit implementation (proposal 141 design merged, code not). Lists four follow-up PRs in preferred order; bundling them re-creates the scope creep PR #740 was extracted from.

…promoter) in 20260520 plan Per user direction, the plan-status update lands in PR #750 (not a separate doc PR). PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line. Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744. Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hold-pairing parity Continues Phase 2 of the 2026-05-20 methodology plan. Adds one new rule and fixes one parity gap. New rule — source-arithmetic preservation: When the source explicitly relates a dependent quantity to its named components through an arithmetic operation (sum, product, ratio, fraction of a base, scaled magnitude), the extractor must preserve that relationship as a recommended_first_calculation. Three named patterns: aggregate sum (R1.1), burn rate × duration (R2.3), explicit decomposition block (R2.4). The dependent quantity is the calculation; the operands are primitives in key_values or missing_values_to_estimate. Inverting this ordering loses the source's named structure and the simulation cannot recover it from independent bounds. Parity fix — threshold-pairing rule was added to extract-parameters-from-digest in PR PlanExeOrg#737 but not to extract-parameters-from-full. Backfilled identically so both extract skills follow the same discipline. Both rules are corpus-agnostic: structural categories only (rate / duration / aggregate / component / threshold / floor / cap / margin), no plan names, no literals, no expected output ids, no domain-specific framings. Slotted symmetrically between 'Combined viability gate preservation' and 'Critical-output completeness rule' in both files. Closes the R1.1, R2.3, R2.4 directives from the Phase 2 row of the methodology plan's status table; R2.5 (threshold-pairing) was already done in PR PlanExeOrg#737. Regression probes to follow as part of this PR's verification.

…rg#747 (Phase 4 runtime) in 20260520 plan Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section. Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes. Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye added 3 commits May 20, 2026 23:31

neoneye changed the title ~~napkin-math: Phase 1 compress prompts + OPTIMIZE_INSTRUCTIONS guard~~ napkin-math: Phase 1 compress-prompt cleanup (output-quality acceptance deferred) May 20, 2026

neoneye mentioned this pull request May 20, 2026

napkin-math: threshold-pairing rule for extract-parameters-from-digest #738

Closed

3 tasks

neoneye added 2 commits May 21, 2026 01:00

Revert "napkin-math: add threshold-pairing rule to extract-parameters…

2f9bb77

…-from-digest" This reverts commit 6a7468b.

Revert "Revert "napkin-math: add threshold-pairing rule to extract-pa…

2d1088d

…rameters-from-digest"" This reverts commit 2f9bb77.

neoneye changed the title ~~napkin-math: Phase 1 compress-prompt cleanup (output-quality acceptance deferred)~~ napkin-math: prompt cleanup for compress + extract May 20, 2026

neoneye added 3 commits May 21, 2026 01:07

neoneye merged commit d36f9de into main May 20, 2026
3 checks passed

neoneye deleted the napkin-math/phase1-compress-prompts branch May 20, 2026 23:28

This was referenced May 20, 2026

docs(napkin-math): proposal 141 — source-preservation audit #739

Merged

napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation #740

Merged

This was referenced May 21, 2026

docs(napkin-math): record PR #743 and PR #744 in 20260520 plan #745

Merged

docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime) #748

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

napkin-math: prompt cleanup for compress + extract#737

napkin-math: prompt cleanup for compress + extract#737
neoneye merged 9 commits into
mainfrom
napkin-math/phase1-compress-prompts

neoneye commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

neoneye commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scope

What changes (net effective diff vs. main)

Compress stage — compress_report_section.py

Extract stage — system-prompt.txt

Discipline this PR enforces

Where the discipline still falls short

Verification

Acceptance checks for any follow-up quality-improvement PR

Commit chain

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

neoneye commented May 20, 2026 •

edited

Loading

Compress stage — `compress_report_section.py`

Extract stage — `system-prompt.txt`