Skip to content

napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation#740

Merged
neoneye merged 4 commits into
mainfrom
napkin-math/phase2-extract-decomposition-rules
May 21, 2026
Merged

napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation#740
neoneye merged 4 commits into
mainfrom
napkin-math/phase2-extract-decomposition-rules

Conversation

@neoneye
Copy link
Copy Markdown
Member

@neoneye neoneye commented May 20, 2026

Scope

Phase 2 extract-prompt rules. Three commits to the two extract skills (extract-parameters-from-digest and extract-parameters-from-full), plus a same-session regression check via the napkin_math validate_parameters.py against six baseline plan probes.

The PR is narrow on purpose. It does NOT include:

  • The source-preservation audit (proposal Multiple api keys cleanup5 #141; design is on main, implementation is a separate follow-up PR).
  • Compress-LLM run-to-run variance handling (also a separate follow-up).
  • Prompt-hygiene cleanup of the older domain-specific example (european_prepper_active_buyers) in the extract-from-digest prompt (scoped follow-up).
  • Any claim that v51 outputs are globally better than v49 outputs — see "Verification posture" below.

What changes

extract-parameters-from-digest/system-prompt.txt and extract-parameters-from-full/system-prompt.txt

All edits applied symmetrically to both files.

  1. Source-arithmetic preservation rule (new section, slotted between "Combined viability gate preservation" and "Critical-output completeness rule"). Three named patterns under one principle:
    • Aggregate sum (R1.1) — aggregate = sum_of_components ONLY when the source states or clearly implies the total is computed from the named constituents. Independent caps / ceilings / committed budgets / funding envelopes stay as primitive thresholds in key_values and are tested via the threshold-pairing rule's spend-vs-cap margin.
    • Burn rate × duration (R2.3) — total = rate * duration when the source supplies the rate AND the duration.
    • Explicit decomposition block (R2.4) — preserve every named operand and the operation between them.
  2. Threshold-pairing rule backfilled into extract-parameters-from-full. PR napkin-math: prompt cleanup for compress + extract #737 only added it to extract-parameters-from-digest; this closes the parity gap.
  3. Tightened aggregate-sum wording to explicitly exclude budget caps / ceilings / envelopes from the sum-identity pattern. Reconciled the "discipline shared" paragraph with the cap-pressure paragraph (placement decision between recommended_first_calculations and derived_questions is not a relaxation of the preservation rule).
  4. Source_text 20-word cap reinforced with truncation discipline. Existing rule said the cap exists; new rule says how to truncate when a source if/then conditional runs long: drop the consequence clause, keep subject + threshold, end with ellipsis if mid-sentence, count words before emitting.

All four edits are corpus-agnostic. No plan names, no literals from any baseline plan, no expected output ids, no domain-specific jargon.

Verification posture

Compress + extract were re-run on six baseline plans (paperclip, datacenter, crate_recovery, yellowstone, mars_gtld, euro_adoption). v51 outputs were produced fresh via the skill under the updated prompts. All six v51/parameters.json files validate against validate_parameters.py with valid=True, errors=0, warnings=0.

This is a regression check, not an improvement claim.

  • Same-LLM same-session regeneration cannot prove behavioural improvement on rules the same LLM authored and is applying. The same model that wrote the rules is also obeying them.
  • Honest behavioural verification of these rules requires a different LLM or fresh-context same-model extraction against the same digests. That work is a separate follow-up.
  • The regression check proves only that the rules do not break valid extraction.

Known unresolved issues, NOT addressed in this PR

These are explicitly out of scope. Each gets its own follow-up PR.

  • Paperclip OPC UA / p99 latency tripwires absent from v50/v51 (present in v49). Compress-stage variance, not an extract-prompt problem. Addressed by the compress retry/merge or lower-temperature follow-up.
  • v49 absences across the other probes (e.g. yellowstone public_compliance_*, mars registration_volume_buffer_fraction, crate q_logistics_budget_margin) remain unclassified tradeoffs. Without proposal 141's dropped_signals schema and audit script, I cannot mechanically distinguish "rename" / "structural restructure" / "acceptable drop under cap pressure" / "silent regression". Implementing Multiple api keys cleanup5 #141 is the next-most-load-bearing follow-up.
  • Behavioural validation on a different LLM has not been done.
  • european_prepper_active_buyers still appears in the extract-from-digest prompt as a formula example. Prompt-hygiene follow-up.

Documentation

experiments/napkin_math/docs/20260520_plan.md updated to reflect the narrow scope (commit f9d90ebb):

Commit chain

Test plan

  • CI green (lint / tests / typecheck)
  • validate_parameters.py passes against all six regenerated v51 outputs (already confirmed: 6/6 clean)
  • Hand-spot all four edits read corpus-agnostically (no plan-specific literals introduced)
  • PR description framing matches the merge-readiness criteria — narrow scope, regression check not improvement claim, unresolved issues explicitly deferred

neoneye added 4 commits May 21, 2026 01:58
…hold-pairing parity

Continues Phase 2 of the 2026-05-20 methodology plan. Adds one new rule and fixes one parity gap.

New rule — source-arithmetic preservation:

When the source explicitly relates a dependent quantity to its named components through an arithmetic operation (sum, product, ratio, fraction of a base, scaled magnitude), the extractor must preserve that relationship as a recommended_first_calculation. Three named patterns: aggregate sum (R1.1), burn rate × duration (R2.3), explicit decomposition block (R2.4). The dependent quantity is the calculation; the operands are primitives in key_values or missing_values_to_estimate. Inverting this ordering loses the source's named structure and the simulation cannot recover it from independent bounds.

Parity fix — threshold-pairing rule was added to extract-parameters-from-digest in PR #737 but not to extract-parameters-from-full. Backfilled identically so both extract skills follow the same discipline.

Both rules are corpus-agnostic: structural categories only (rate / duration / aggregate / component / threshold / floor / cap / margin), no plan names, no literals, no expected output ids, no domain-specific framings. Slotted symmetrically between 'Combined viability gate preservation' and 'Critical-output completeness rule' in both files.

Closes the R1.1, R2.3, R2.4 directives from the Phase 2 row of the methodology plan's status table; R2.5 (threshold-pairing) was already done in PR #737.

Regression probes to follow as part of this PR's verification.
…ource-arithmetic preservation rule

Two wording fixes per review feedback on PR #740:

1. Aggregate-sum pattern was overly broad: it would have told the extractor to treat any 'total alongside named allocations' as a derived sum. That would erase budget ceilings, funding envelopes, and committed targets — which are primitive thresholds tested by the threshold-pairing rule's spend-vs-cap margin, NOT computed by sum identity. Tightened to apply only when the source states or clearly implies the total is computed from the named constituents.

2. The 'discipline shared by all three patterns' paragraph said derived quantities go in recommended_first_calculations, conflicting with the cap-pressure paragraph below that allows moving less-critical calcs to derived_questions. Rephrased so the choice between the two containers is a placement decision (critical vs. secondary) and does not relax the rule that source-stated arithmetic must be preserved as a calculation, not a flat variable.

Both edits applied symmetrically to extract-parameters-from-digest and extract-parameters-from-full. No corpus literals introduced. 6/6 line additions match across the two files.
PR #740's v51 regression runs surfaced two validator failures (yellowstone[4].source_text = 28 words; mars_gtld[5].source_text = 26 words) against the existing 'Each source_text must be at most 20 words' bullet. The cap was declared but had no truncation strategy, so when a source if/then conditional ran long the extractor pasted the whole sentence.

Adds a follow-up bullet to the hard-limits list in both extract skills: when the source phrase exceeds 20 words, truncate to the subject + threshold portion (drop the consequence clause), end with ellipsis if mid-sentence, and count words before emitting. Applied symmetrically to extract-parameters-from-digest and extract-parameters-from-full.

After this prompt change, re-running the skill on the affected plans produced parameters.json files that validate with exit code 0 across all six v51 baselines. No outputs were hand-edited.

Corpus-agnostic by construction: the rule names only structural categories (subject, threshold, consequence clause, ellipsis, word count) and applies to any if/then-shaped source phrasing in any domain or language.
Replaces the prior 'Status as of 2026-05-21' content with three explicit subsections:

1. Landed on main: PR #737 (Phase 1 compress + initial extract threshold-pairing + OPTIMIZE_INSTRUCTIONS) and PR #739 (Proposal 141 design only).

2. Open for merge: PR #740 commit chain (4cda70b source-arithmetic + parity, 19f927b aggregate-sum tightening, 8f94c8c source_text truncation discipline). All edits applied symmetrically to both extract skills.

3. PR #740 verification posture: same-LLM same-session regression check, not improvement proof. All six v51 parameters.json files validate clean. Behavioural verification of the rules on a different LLM is a separate piece of follow-up work, not part of PR #740.

Known limitations section now explicitly names the clearest unresolved regression (paperclip OPC UA / p99 latency at compress stage), the cap-pressure-without-recorded-rationale gap (yellowstone public_compliance trio), and the absence of a source-preservation audit implementation (proposal 141 design merged, code not). Lists four follow-up PRs in preferred order; bundling them re-creates the scope creep PR #740 was extracted from.
@neoneye neoneye changed the title napkin-math: phase 2 extract — source-arithmetic preservation + threshold-pairing parity napkin-math: Phase 2 extract — source-arithmetic preservation, threshold-pairing parity, source_text truncation May 21, 2026
@neoneye neoneye merged commit d73ff9b into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/phase2-extract-decomposition-rules branch May 21, 2026 01:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant