From b1c671611d27fcf898b9633fe9a939007efa0ea1 Mon Sep 17 00:00:00 2001
From: Simon Strandgaard
Date: Thu, 21 May 2026 16:11:11 +0200
Subject: [PATCH] docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime) in 20260520 plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Marks PR #744 as merged (previously open), and adds PR #746 (Phase 3
validate-parameters — aggregate_not_bounded + requirement_has_margin)
and PR #747 (Phase 4 runtime + schema readiness — calculation-output
strip, lognormal/pert reserved, correlations key reserved,
reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (#737, #743,
#744). Phase 3 row marks DONE via PR #746 with a note that the
sampling_discipline enum bullet was routed to Phase 4. Phase 4 row
marks the code-side DONE via PR #747 and lists the deferred
prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up
takes item 1 (was deferred from the previous update).
Bucket-categorisation discipline, proposal 141 implementation,
different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 experiments/napkin_math/docs/20260520_plan.md | 35 ++++++++++++-------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
index e625513e..7cedfc03 100644
--- a/experiments/napkin_math/docs/20260520_plan.md
+++ b/experiments/napkin_math/docs/20260520_plan.md
@@ -83,7 +83,9 @@ separated below.
 - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
 All edits applied symmetrically to both extract skills. No corpus literals introduced.
 - **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
-- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #744** (merged) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #746** (merged) — Phase 3 validate-parameters. Two new deterministic structural checks added to `validate_parameters.py`. `aggregate_not_bounded` (R1.1): when an entry's `formula_hint` is a pure sum of named identifiers and its `output_name` also appears in `missing_values_to_estimate`, the aggregate is sampled independently of its constituents (a single Monte Carlo trial can pair sub-component p95s with a total p05). Detection is syntactic — RHS contains `+`, no other operators, every operand is a snake_case identifier. `requirement_has_margin` (R2.5): when a `key_value`'s id ends in `_required`, at least one calculation must reference it via formula RHS, contain a subtraction or ratio operator, AND emit an `output_name` with a positive-pass margin suffix (`_margin`/`_surplus`/`_buffer`/`_coverage`). All three properties required so a bare reference inside a sum (`combined = actual + required`) no longer satisfies the rule. 19 unit tests, 6 v51 plans still validate clean. Out of scope: the `sampling_discipline` enum expansion (lives in `run_monte_carlo.py`, deferred to Phase 4 to avoid a silent-shim antipattern).
+- **PR #747** (merged) — Phase 4 runtime + schema readiness. Code-side half of Phase 4: (1) `strip_threshold_bounds` extended with a `calculation-output` reason — variables that are the declared `output_name` of any calculation get stripped from bounds (R1.1 backstop); (2) `lognormal` and `pert` reserved in `VALID_DISCIPLINES`, `sample_one` raises `NotImplementedError` loudly when sampling is attempted (no silent fall-back to triangular); (3) the optional `correlations` top-level key reserved and preserved through the stripper; (4) the warning text after `strip_threshold_bounds` is now reason-branched (calculation-output strips say "simulation will compute it from calculations.py", suffix/formula-side strips keep the original threshold wording). 73 unit tests, 9/9 smoke checks, 0 false positives on the v48 corpus. The LLM-rule changes (base-anchoring conditional, citation-context-leak self-audit examples, plan_type-driven lognormal default, detailed correlations selection rules) ship in the Phase 4 follow-up.
 
 ### PR #737 detail (already on main)
 
@@ -229,10 +231,10 @@ too:
 
 | Phase | Skill / module | Status |
 |---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer). |
 | 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
-| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
-| 4 | `generate-bounds` | not started |
+| 3 | `validate-parameters` | **DONE on main via PR #746** (R1.1 `aggregate_not_bounded` and R2.5 `requirement_has_margin` structural checks). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
+| 4 | `generate-bounds` | **Code-side runtime DONE on main via PR #747** (R1.1 calculation-output strip extension; R1.4 / R2.2 `lognormal` and `pert` reserved in `VALID_DISCIPLINES` with sample-time `NotImplementedError`; R1.3 `correlations` top-level key reserved and preserved). **Prompt-side LLM-rule changes (R1.2 base-anchoring conditional, R1.5 self-audit citation examples, R2.2 plan_type lognormal default, R1.3 detailed correlations rules) remain a follow-up.** |
 | 5 | `verify-bounds-citations` (new) | not started |
 | 6 | `generate-calculations` | no change required per the original plan |
 | 7 | `run-scenarios` | not started |
@@ -242,11 +244,20 @@ too:
 
 ### Next likely move
 
-After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
-tolerant quote verification), the remaining work is ordered by what
-improves napkin_math output quality most directly:
-
-1. **Bucket-categorisation discipline in compress.** The residual
+After PR #746 (validate-parameters structural checks) and PR #747
+(generate-bounds runtime + schema readiness), the remaining work is
+ordered by what improves napkin_math output quality most directly:
+
+1. **Phase 4 prompt-side follow-up.** With the bounds runtime now
+   accepting `lognormal`/`pert` and stripping calculation outputs,
+   the next move is the LLM-rule layer: base-anchoring conditional
+   rewrite (R1.2 — source: assumption on commitment-default,
+   source: data on a named anchor), self-audit examples for
+   citation context-leak (R1.5), `plan_type`-driven lognormal
+   default for megaproject CAPEX (R2.2), and the detailed
+   correlations-emission rules (R1.3). Same-LLM same-session
+   posture: regression check, not improvement claim.
+2. **Bucket-categorisation discipline in compress.** The residual
    public-output miss in paperclip v53c is the LLM filing a `$X
    exceeds threshold` tripwire under `risks_and_shocks` instead
    of `gates_and_thresholds`. The bucket-prompt for
@@ -254,19 +265,19 @@ improves napkin_math output quality most directly:
    "If , then ..." sentence, even when the source frames it
    as a downside risk. Verify across the 6-plan probe set; do not
    overfit to the paperclip OPC UA case.
-2. **Implement proposal 141** (`dropped_signals` schema in extract
+3. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
    This is the right guardrail for v49/v51 absences and cap-pressure
    tradeoffs. Now that the upstream variance fixes landed (#743,
    #744), the audit's classification of preserved / replaced /
    dropped signals will be measuring against a less leaky pipeline.
-3. **Different-LLM behavioural validation** of the rules now on
+4. **Different-LLM behavioural validation** of the rules now on
    main. A Self-Improve run with the default napkin_math LLM
    (Gemini Flash Lite) against the same digests would close
    the same-LLM same-session confound. This should be
    treated as validation of prompt generality, not as the
    next quality fix.
-4. **Prompt-hygiene pass** for the remaining domain-specific
+5. **Prompt-hygiene pass** for the remaining domain-specific
    examples (e.g. `european_prepper_active_buyers`) in either
    extract prompt. This is worthwhile and small, but not
    load-bearing for the currently observed napkin_math failures.
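
For illustration only, and not part of the patch: a minimal Python sketch of the syntactic `aggregate_not_bounded` detection described in the PR #746 entry (right-hand side contains `+`, no other operators, every operand is a snake_case identifier, and the entry's `output_name` also appears in `missing_values_to_estimate`). The helper name `is_pure_sum`, the `lhs = rhs` shape of `formula_hint`, and the plain-list shape of both inputs are assumptions made for this sketch; the actual `validate_parameters.py` may be structured differently.

```python
import re

# Assumed snake_case shape: lowercase start, single underscores between parts.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_pure_sum(formula_hint: str) -> bool:
    """Return True when the RHS is `a + b [+ c ...]` over snake_case names only.

    Hypothetical helper: assumes `formula_hint` looks like `lhs = rhs`
    (a hint without `=` is treated as the RHS itself).
    """
    rhs = formula_hint.split("=", 1)[-1]
    if "+" not in rhs:
        return False
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator, literal, or parenthesis makes an operand fail
    # the identifier match, so the sum no longer counts as "pure".
    return all(_SNAKE_CASE.fullmatch(op) for op in operands)


def aggregate_not_bounded(calculations, missing_values_to_estimate):
    """List output names that are pure sums yet are also estimated directly.

    Assumes `calculations` is a list of dicts with `output_name` and
    `formula_hint`, and `missing_values_to_estimate` is a list of names.
    """
    estimated = set(missing_values_to_estimate)
    return [
        calc["output_name"]
        for calc in calculations
        if calc["output_name"] in estimated and is_pure_sum(calc["formula_hint"])
    ]


calculations = [
    {"output_name": "total_cost",
     "formula_hint": "total_cost = labor_cost + material_cost"},
    {"output_name": "net_margin",
     "formula_hint": "net_margin = revenue - total_cost"},
    {"output_name": "scaled_cost",
     "formula_hint": "scaled_cost = base_cost * 2 + fee"},
]
missing = ["labor_cost", "material_cost", "total_cost", "scaled_cost"]
flagged = aggregate_not_bounded(calculations, missing)
```

In this sketch only `total_cost` is flagged: `net_margin` has no `+` and is not in the estimate list, and `scaled_cost` contains a multiplication, so its first operand fails the identifier match. Keeping the check purely syntactic mirrors the "deterministic structural check" framing in the entry, at the cost of missing sums written with parentheses or constants.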