diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
index e2e44d60..e625513e 100644
--- a/experiments/napkin_math/docs/20260520_plan.md
+++ b/experiments/napkin_math/docs/20260520_plan.md
@@ -82,6 +82,8 @@ separated below.
   - `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
   - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
   All edits applied symmetrically to both extract skills. No corpus literals introduced.
+- **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
+- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.

 ### PR #737 detail (already on main)

@@ -158,13 +160,18 @@ either; each gets its own follow-up:

 - **Compress-LLM run-to-run variance.** Same prompts, same source,
   two compress passes can produce materially different bucket
-  selections. The paperclip premortem tripwires (`$75k OPC UA bid`,
-  `100ms` p99 latency, both in v49) drop at the compress stage in
-  v50 and v51 and cannot be recovered at the extract layer because
-  the digest itself does not surface them. This is the **clearest
-  unresolved regression** across the probe set. The fix belongs in
-  orchestration: deterministic retry/merge across N compress
-  passes, or lower-temperature reruns for high-impact buckets.
+  selections. PR #743 (second pass) closed the emission side: when
+  the first pass skips a tripwire, the second pass given the
+  first-pass items as context often catches it. PR #744 (paraphrase-
+  tolerant quote match) closed the verification side: paraphrased
+  quotes whose tokens are all in the source no longer flip to
+  `qv=False` and lose the +10 verified-quote bonus at the ranking
+  layer. Residual modes: (a) bucket-categorisation variance — the
+  LLM occasionally files a `$X exceeds threshold` tripwire under
+  `risks_and_shocks` rather than `gates_and_thresholds` (paperclip
+  v53c); (b) the second pass itself sometimes also misses (paperclip
+  v52c). Both are at the LLM's emission/categorisation layer and
+  cannot be fixed by deterministic post-processing alone.
 - **Threshold-pairing rule × `missing_values_to_estimate` 5-cap.**
   When a plan names many independent thresholds, every-threshold
   pairing collides with the cap and forces a tradeoff. The
@@ -222,7 +229,7 @@ too:

 | Phase | Skill / module | Status |
 |---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
 | 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
 | 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
 | 4 | `generate-bounds` | not started |
@@ -235,26 +242,25 @@ too:

 ### Next likely move

-After PR #740, the next work should be ordered by what improves
-napkin_math output quality most directly, not by what is easiest to
-measure. Preferred order:
-
-1. **Compress-LLM variance handling.** Deterministic retry/merge or
-   lower-temperature reruns for high-impact compress buckets should
-   come next. The clearest driver is the paperclip OPC UA / latency
-   tripwires that v49 surfaced and v50/v51 drop at the compress
-   layer. This is upstream of extraction: if the digest does not
-   carry the tripwire, no extract prompt can recover it. Proposal
-   141 would classify this loss, but variance handling is the piece
-   that can restore the missing source signal.
+After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
+tolerant quote verification), the remaining work is ordered by what
+improves napkin_math output quality most directly:
+
+1. **Bucket-categorisation discipline in compress.** The residual
+   public-output miss in paperclip v53c is the LLM filing a
+   `$X exceeds threshold` tripwire under `risks_and_shocks` instead
+   of `gates_and_thresholds`. The bucket-prompt for
+   `gates_and_thresholds` could be tightened to claim any
+   "If , then ..." sentence,
+   even when the source frames it as a downside risk. Verify across
+   the 6-plan probe set; do not overfit to the paperclip OPC UA case.
 2. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
-   This should follow close behind variance handling. It is the
-   right guardrail for v49/v51 absences and cap-pressure tradeoffs,
-   but it is primarily a measurement and accountability layer: it
-   classifies preserved / replaced / dropped signals and records
-   rationale in the artifact. It does not by itself make the
-   compressor less lossy.
+   This is the right guardrail for v49/v51 absences and
+   cap-pressure tradeoffs. Now that the upstream variance fixes
+   landed (#743, #744), the audit's classification of preserved /
+   replaced / dropped signals will be measuring against a less
+   leaky pipeline.
 3. **Different-LLM behavioural validation** of the rules now on
    main. A Self-Improve run with the default napkin_math LLM
    (Gemini Flash Lite) against the same digests would close the
@@ -265,12 +271,10 @@ measure. Preferred order:
    extract prompt. This is worthwhile and small, but not
    load-bearing for the currently observed napkin_math failures.

-These are separate PRs. The next PR should be compress variance only:
-no corpus literals, no hand-patched outputs, rerun compress + extract
-through the skills, validate regenerated `parameters.json`, and
-compare against v49/v50/v51 honestly. Bundling the audit,
-behavioural validation, or prompt hygiene into that PR would obscure
-whether the upstream signal-loss fix actually worked.
+These are separate PRs. Each ships independently; bundling
+categorisation, audit-implementation, behavioural validation, or
+prompt hygiene into one PR would obscure which piece moved which
+metric.

 ## Per-theme mapping
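
Reviewer note on the PR #743 / PR #744 entries in this diff: the two deterministic steps they describe (deduplicating second-pass items by normalised `source_quote`, and verifying a quote via a substring fast path with a token-overlap fallback behind a min-3-token gate) can be sketched roughly as below. This is an illustration under assumptions, not the code in `compress_report_section.py`, which this diff does not show. The `_sketch` function names, the dict-shaped items, the regex tokeniser, and the lowercase normalisation are all hypothetical.

```python
import re

# Assumed reading of the "min-3-token gate": the fallback only applies
# to quotes with at least this many tokens.
MIN_FALLBACK_TOKENS = 3


def _tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (assumed normalisation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _normalise(quote: str) -> str:
    """Hypothetical normal form used as the dedup key."""
    return " ".join(_tokens(quote))


def merge_second_pass_items_sketch(first_pass: list[dict], second_pass: list[dict]) -> list[dict]:
    """Append second-pass items whose normalised source_quote is new."""
    seen = {_normalise(item["source_quote"]) for item in first_pass}
    merged = list(first_pass)
    for item in second_pass:
        key = _normalise(item["source_quote"])
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged


def quote_is_in_source_sketch(quote: str, source: str) -> bool:
    """Substring fast path, then an all-tokens-present fallback."""
    # Fast path: the quote appears verbatim in the source.
    if quote and quote in source:
        return True
    quote_tokens = _tokens(quote)
    # Gate: very short quotes must match verbatim.
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False
    # Fallback: every quote token must occur somewhere in the source.
    source_tokens = set(_tokens(source))
    return all(tok in source_tokens for tok in quote_tokens)
```

Under these assumptions, a reordered paraphrase such as `$75k OPC UA bid` verifies against a source sentence containing `OPC UA bid exceeds $75k`, while a quote that introduces a token absent from the source does not. Note that this fallback ignores token order and multiplicity, which is one reason a sampled audit of fallback-only verifies (as the PR #744 entry reports) is a sensible companion check.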