napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation#740
Merged
Conversation
…hold-pairing parity Continues Phase 2 of the 2026-05-20 methodology plan. Adds one new rule and fixes one parity gap. New rule — source-arithmetic preservation: When the source explicitly relates a dependent quantity to its named components through an arithmetic operation (sum, product, ratio, fraction of a base, scaled magnitude), the extractor must preserve that relationship as a recommended_first_calculation. Three named patterns: aggregate sum (R1.1), burn rate × duration (R2.3), explicit decomposition block (R2.4). The dependent quantity is the calculation; the operands are primitives in key_values or missing_values_to_estimate. Inverting this ordering loses the source's named structure and the simulation cannot recover it from independent bounds. Parity fix — threshold-pairing rule was added to extract-parameters-from-digest in PR #737 but not to extract-parameters-from-full. Backfilled identically so both extract skills follow the same discipline. Both rules are corpus-agnostic: structural categories only (rate / duration / aggregate / component / threshold / floor / cap / margin), no plan names, no literals, no expected output ids, no domain-specific framings. Slotted symmetrically between 'Combined viability gate preservation' and 'Critical-output completeness rule' in both files. Closes the R1.1, R2.3, R2.4 directives from the Phase 2 row of the methodology plan's status table; R2.5 (threshold-pairing) was already done in PR #737. Regression probes to follow as part of this PR's verification.
…ource-arithmetic preservation rule Two wording fixes per review feedback on PR #740: 1. Aggregate-sum pattern was overly broad: it would have told the extractor to treat any 'total alongside named allocations' as a derived sum. That would erase budget ceilings, funding envelopes, and committed targets — which are primitive thresholds tested by the threshold-pairing rule's spend-vs-cap margin, NOT computed by sum identity. Tightened to apply only when the source states or clearly implies the total is computed from the named constituents. 2. The 'discipline shared by all three patterns' paragraph said derived quantities go in recommended_first_calculations, conflicting with the cap-pressure paragraph below that allows moving less-critical calcs to derived_questions. Rephrased so the choice between the two containers is a placement decision (critical vs. secondary) and does not relax the rule that source-stated arithmetic must be preserved as a calculation, not a flat variable. Both edits applied symmetrically to extract-parameters-from-digest and extract-parameters-from-full. No corpus literals introduced. 6/6 line additions match across the two files.
PR #740's v51 regression runs surfaced two validator failures (yellowstone[4].source_text = 28 words; mars_gtld[5].source_text = 26 words) against the existing 'Each source_text must be at most 20 words' bullet. The cap was declared but had no truncation strategy, so when a source if/then conditional ran long the extractor pasted the whole sentence. Adds a follow-up bullet to the hard-limits list in both extract skills: when the source phrase exceeds 20 words, truncate to the subject + threshold portion (drop the consequence clause), end with ellipsis if mid-sentence, and count words before emitting. Applied symmetrically to extract-parameters-from-digest and extract-parameters-from-full. After this prompt change, re-running the skill on the affected plans produced parameters.json files that validate with exit code 0 across all six v51 baselines. No outputs were hand-edited. Corpus-agnostic by construction: the rule names only structural categories (subject, threshold, consequence clause, ellipsis, word count) and applies to any if/then-shaped source phrasing in any domain or language.
Replaces the prior 'Status as of 2026-05-21' content with three explicit subsections: 1. Landed on main: PR #737 (Phase 1 compress + initial extract threshold-pairing + OPTIMIZE_INSTRUCTIONS) and PR #739 (Proposal 141 design only). 2. Open for merge: PR #740 commit chain (4cda70b source-arithmetic + parity, 19f927b aggregate-sum tightening, 8f94c8c source_text truncation discipline). All edits applied symmetrically to both extract skills. 3. PR #740 verification posture: same-LLM same-session regression check, not improvement proof. All six v51 parameters.json files validate clean. Behavioural verification of the rules on a different LLM is a separate piece of follow-up work, not part of PR #740. Known limitations section now explicitly names the clearest unresolved regression (paperclip OPC UA / p99 latency at compress stage), the cap-pressure-without-recorded-rationale gap (yellowstone public_compliance trio), and the absence of a source-preservation audit implementation (proposal 141 design merged, code not). Lists four follow-up PRs in preferred order; bundling them re-creates the scope creep PR #740 was extracted from.
This was referenced May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Scope
Phase 2 extract-prompt rules. Three commits to the two extract skills (
extract-parameters-from-digestandextract-parameters-from-full), plus a same-session regression check via the napkin_mathvalidate_parameters.pyagainst six baseline plan probes.The PR is narrow on purpose. It does NOT include:
european_prepper_active_buyers) in the extract-from-digest prompt (scoped follow-up).What changes
extract-parameters-from-digest/system-prompt.txtandextract-parameters-from-full/system-prompt.txtAll edits applied symmetrically to both files.
aggregate = sum_of_componentsONLY when the source states or clearly implies the total is computed from the named constituents. Independent caps / ceilings / committed budgets / funding envelopes stay as primitive thresholds inkey_valuesand are tested via the threshold-pairing rule's spend-vs-cap margin.total = rate * durationwhen the source supplies the rate AND the duration.extract-parameters-from-full. PR napkin-math: prompt cleanup for compress + extract #737 only added it toextract-parameters-from-digest; this closes the parity gap.recommended_first_calculationsandderived_questionsis not a relaxation of the preservation rule).All four edits are corpus-agnostic. No plan names, no literals from any baseline plan, no expected output ids, no domain-specific jargon.
Verification posture
Compress + extract were re-run on six baseline plans (paperclip, datacenter, crate_recovery, yellowstone, mars_gtld, euro_adoption). v51 outputs were produced fresh via the skill under the updated prompts. All six
v51/parameters.jsonfiles validate againstvalidate_parameters.pywithvalid=True, errors=0, warnings=0.This is a regression check, not an improvement claim.
Known unresolved issues, NOT addressed in this PR
These are explicitly out of scope. Each gets its own follow-up PR.
public_compliance_*, marsregistration_volume_buffer_fraction, crateq_logistics_budget_margin) remain unclassified tradeoffs. Without proposal 141'sdropped_signalsschema and audit script, I cannot mechanically distinguish "rename" / "structural restructure" / "acceptable drop under cap pressure" / "silent regression". Implementing Multiple api keys cleanup5 #141 is the next-most-load-bearing follow-up.european_prepper_active_buyersstill appears in the extract-from-digest prompt as a formula example. Prompt-hygiene follow-up.Documentation
experiments/napkin_math/docs/20260520_plan.mdupdated to reflect the narrow scope (commitf9d90ebb):Commit chain
4cda70ba— Source-arithmetic preservation rule + threshold-pairing parity backfill19f927b7— Tightened aggregate-sum scope; reconciled discipline-shared paragraph with cap-pressure8f94c8cd— 20-word source_text truncation disciplinef9d90ebb— Plan status section updated for PR napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation #740 narrow scopeTest plan
validate_parameters.pypasses against all six regenerated v51 outputs (already confirmed: 6/6 clean)