From 8d14232e570c5c218f8c4e2c3b1ba793f75ba2f9 Mon Sep 17 00:00:00 2001
From: Simon Strandgaard
Date: Thu, 21 May 2026 03:13:36 +0200
Subject: [PATCH] docs(napkin-math): prioritize compress variance next

---
 experiments/napkin_math/docs/20260520_plan.md | 62 +++++++++++--------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
index e7354da0..e2e44d60 100644
--- a/experiments/napkin_math/docs/20260520_plan.md
+++ b/experiments/napkin_math/docs/20260520_plan.md
@@ -76,13 +76,11 @@ separated below.

 - **PR #737** (merged) — Phase 1 compress prompts + initial extract threshold-pairing rule + `OPTIMIZE_INSTRUCTIONS` discipline banner. Substantive content described below under "PR #737 detail" for continuity.
 - **PR #739** (merged) — Proposal 141 ("Source-Preservation Audit for the Napkin Math Pipeline") landed as design only. No code or prompt change. Implementation deferred.
-
-### Open for merge
-
-- **PR #740** — Phase 2 extract-prompt rules. Three commits land in `extract-parameters-from-digest` and `extract-parameters-from-full`:
+- **PR #740** (merged) — Phase 2 extract-prompt rules. Four commits landed in `extract-parameters-from-digest` and `extract-parameters-from-full`:
   - `4cda70ba` — Source-arithmetic preservation rule (Patterns 1/2/3: aggregate sum, burn rate × duration, explicit decomposition block) + threshold-pairing parity backfill into the full-extract skill.
   - `19f927b7` — Tightened aggregate-sum wording so independent caps/envelopes are NOT collapsed into derived sums; reconciled discipline-shared paragraph with the cap-pressure paragraph.
   - `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
+  - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.

 All edits applied symmetrically to both extract skills. No corpus literals introduced.
 ### PR #737 detail (already on main)
@@ -225,7 +223,7 @@ too:
 | Phase | Skill / module | Status |
 |---|---|---|
 | 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
-| 2 | `extract-parameters-from-{full,digest}` | **PARTIAL** — threshold-pairing on `from-digest` shipped in PR #737. Source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline are in PR #740, open for merge. After #740 merges, Phase 2's prompt-side work is complete for the original directives. |
+| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
 | 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
 | 4 | `generate-bounds` | not started |
 | 5 | `verify-bounds-citations` (new) | not started |
@@ -237,30 +235,42 @@ too:

 ### Next likely move

-After PR #740 merges, the next-most-load-bearing follow-ups, in
-preferred order:
-
-1. **Implement proposal 141** (`dropped_signals` schema in extract
+After PR #740, the next work should be ordered by what improves
+napkin_math output quality most directly, not by what is easiest to
+measure. Preferred order:
+
+1. **Compress-LLM variance handling.** Deterministic retry/merge or
+   lower-temperature reruns for high-impact compress buckets should
+   come next. The clearest driver is the paperclip OPC UA / latency
+   tripwires that v49 surfaced and v50/v51 drop at the compress
+   layer. This is upstream of extraction: if the digest does not
+   carry the tripwire, no extract prompt can recover it. Proposal
+   141 would classify this loss, but variance handling is the piece
+   that can restore the missing source signal.
+2. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
-   The design is on main; without the implementation, v49 absences
-   across the probe set cannot be mechanically classified, and the
-   yellowstone-style cap-pressure tradeoffs have no place to record
-   their structural rationale in the artifact itself.
-2. **Compress-LLM variance handling.** Deterministic retry/merge or
-   lower-temperature reruns for high-impact buckets. The clearest
-   driver: the paperclip OPC UA / latency tripwires that v49
-   surfaced and v50/v51 drop at the compress layer.
-3. **Different-LLM behavioural validation** of the rules now in
-   #740. A Self-Improve run with the default napkin_math LLM
-   (Gemini Flash Lite) against the same v51 digests would close
-   the same-LLM same-session confound.
+   This should follow close behind variance handling. It is the
+   right guardrail for v49/v51 absences and cap-pressure tradeoffs,
+   but it is primarily a measurement and accountability layer: it
+   classifies preserved / replaced / dropped signals and records
+   rationale in the artifact. It does not by itself make the
+   compressor less lossy.
+3. **Different-LLM behavioural validation** of the rules now on
+   main. A Self-Improve run with the default napkin_math LLM
+   (Gemini Flash Lite) against the same digests would close the
+   same-LLM same-session confound. This should be treated as
+   validation of prompt generality, not as the next quality fix.
 4. **Prompt-hygiene pass** for the remaining domain-specific
    examples (e.g. `european_prepper_active_buyers`) in either
-   extract prompt. Small, scoped, can ride alongside any of the
-   above.
-
-These are four separate PRs, not one. Bundling them re-creates the
-scope creep PR #740 was extracted from.
+   extract prompt. This is worthwhile and small, but not
+   load-bearing for the currently observed napkin_math failures.
+
+These are separate PRs. The next PR should be compress variance only:
+no corpus literals, no hand-patched outputs, rerun compress + extract
+through the skills, validate regenerated `parameters.json`, and
+compare against v49/v50/v51 honestly. Bundling the audit,
+behavioural validation, or prompt hygiene into that PR would obscure
+whether the upstream signal-loss fix actually worked.

 ## Per-theme mapping
