From b1c671611d27fcf898b9633fe9a939007efa0ea1 Mon Sep 17 00:00:00 2001
From: Simon Strandgaard <neoneye@gmail.com>
Date: Thu, 21 May 2026 16:11:11 +0200
Subject: [PATCH] docs(napkin-math): record PR #746 (Phase 3) and PR #747
 (Phase 4 runtime) in 20260520 plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Marks PR #744 as merged (previously open), and adds PR #746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR #747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (#737, #743, #744). Phase 3 row is marked DONE via PR #746, with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code side DONE via PR #747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/napkin_math/docs/20260520_plan.md | 35 ++++++++++++-------
 1 file changed, 23 insertions(+), 12 deletions(-)
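
Reviewer note (not part of the patch): the `aggregate_not_bounded` rule recorded in the PR #746 entry is described as purely syntactic, so it can be illustrated in a few lines. What follows is a minimal sketch under stated assumptions, not the code in `validate_parameters.py`: the helper name `is_pure_sum`, the list-of-dicts shape of `entries`, and the exact snake_case regex are all hypothetical. Only the field names `formula_hint`, `output_name`, and `missing_values_to_estimate` come from the plan text.

```python
import re

# Assumption: "snake_case identifier" means a lowercase letter followed by
# lowercase letters, digits, or underscores.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def is_pure_sum(formula_hint: str) -> bool:
    """Return True when the right-hand side is `a + b + ...` and nothing else."""
    # Keep only the text after the first '=' (the whole hint if there is none).
    rhs = formula_hint.split("=", 1)[-1]
    if "+" not in rhs:
        return False
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator, literal, or empty operand fails the identifier match.
    return all(_SNAKE_CASE.match(operand) for operand in operands)


def aggregate_not_bounded(entries, missing_values_to_estimate):
    """List output names that are pure sums yet also queued for estimation."""
    missing = set(missing_values_to_estimate)
    return [
        entry["output_name"]
        for entry in entries
        if is_pure_sum(entry.get("formula_hint", ""))
        and entry["output_name"] in missing
    ]


# Hypothetical entries, invented for illustration only.
example_entries = [
    {"output_name": "total_capex",
     "formula_hint": "total_capex = site_cost + equipment_cost"},
    {"output_name": "net_margin",
     "formula_hint": "net_margin = revenue - cost"},
    {"output_name": "unit_total",
     "formula_hint": "unit_total = units * price + fee"},
]
flagged = aggregate_not_bounded(
    example_entries, ["total_capex", "net_margin", "unit_total"]
)
```

In this sketch only `total_capex` is flagged: the other two hints contain an operator besides `+`, so they are not pure sums. A flagged name is one whose total would be sampled independently of the components that define it.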

diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
index e625513e..7cedfc03 100644
--- a/experiments/napkin_math/docs/20260520_plan.md
+++ b/experiments/napkin_math/docs/20260520_plan.md
@@ -83,7 +83,9 @@ separated below.
   - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
   All edits applied symmetrically to both extract skills. No corpus literals introduced.
 - **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
-- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #744** (merged) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #746** (merged) — Phase 3 validate-parameters. Two new deterministic structural checks added to `validate_parameters.py`. `aggregate_not_bounded` (R1.1): when an entry's `formula_hint` is a pure sum of named identifiers and its `output_name` also appears in `missing_values_to_estimate`, the aggregate is sampled independently of its constituents (a single Monte Carlo trial can pair sub-component p95s with a total p05). Detection is syntactic — RHS contains `+`, no other operators, every operand is a snake_case identifier. `requirement_has_margin` (R2.5): when a `key_value`'s id ends in `_required`, at least one calculation must reference it via formula RHS, contain a subtraction or ratio operator, AND emit an `output_name` with a positive-pass margin suffix (`_margin`/`_surplus`/`_buffer`/`_coverage`). All three properties required so a bare reference inside a sum (`combined = actual + required`) no longer satisfies the rule. 19 unit tests, 6 v51 plans still validate clean. Out of scope: the `sampling_discipline` enum expansion (lives in `run_monte_carlo.py`, deferred to Phase 4 to avoid a silent-shim antipattern).
+- **PR #747** (merged) — Phase 4 runtime + schema readiness. Code-side half of Phase 4: (1) `strip_threshold_bounds` extended with a `calculation-output` reason — variables that are the declared `output_name` of any calculation get stripped from bounds (R1.1 backstop); (2) `lognormal` and `pert` reserved in `VALID_DISCIPLINES`, `sample_one` raises `NotImplementedError` loudly when sampling is attempted (no silent fall-back to triangular); (3) the optional `correlations` top-level key reserved and preserved through the stripper; (4) the warning text after `strip_threshold_bounds` is now reason-branched (calculation-output strips say "simulation will compute it from calculations.py", suffix/formula-side strips keep the original threshold wording). 73 unit tests, 9/9 smoke checks, 0 false positives on the v48 corpus. The LLM-rule changes (base-anchoring conditional, citation-context-leak self-audit examples, plan_type-driven lognormal default, detailed correlations selection rules) ship in the Phase 4 follow-up.
 
 ### PR #737 detail (already on main)
 
@@ -229,10 +231,10 @@ too:
 
 | Phase | Skill / module | Status |
 |---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer). |
 | 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
-| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
-| 4 | `generate-bounds` | not started |
+| 3 | `validate-parameters` | **DONE on main via PR #746** (R1.1 `aggregate_not_bounded` and R2.5 `requirement_has_margin` structural checks). The `sampling_discipline` enum expansion that the original plan listed under this phase belongs in `run_monte_carlo.py`, so it was routed to Phase 4 (PR #747) to avoid a silent shim. |
+| 4 | `generate-bounds` | **Code-side runtime DONE on main via PR #747** (R1.1 calculation-output strip extension; R1.4 / R2.2 `lognormal` and `pert` reserved in `VALID_DISCIPLINES` with sample-time `NotImplementedError`; R1.3 `correlations` top-level key reserved and preserved). **Prompt-side LLM-rule changes (R1.2 base-anchoring conditional, R1.5 self-audit citation examples, R2.2 plan_type lognormal default, R1.3 detailed correlations rules) remain a follow-up.** |
 | 5 | `verify-bounds-citations` (new) | not started |
 | 6 | `generate-calculations` | no change required per the original plan |
 | 7 | `run-scenarios` | not started |
@@ -242,11 +244,20 @@ too:
 
 ### Next likely move
 
-After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
-tolerant quote verification), the remaining work is ordered by what
-improves napkin_math output quality most directly:
-
-1. **Bucket-categorisation discipline in compress.** The residual
+After PR #746 (validate-parameters structural checks) and PR #747
+(generate-bounds runtime + schema readiness), the remaining work is
+ordered by what improves napkin_math output quality most directly:
+
+1. **Phase 4 prompt-side follow-up.** With the bounds runtime now
+   accepting `lognormal`/`pert` and stripping calculation outputs,
+   the next move is the LLM-rule layer: base-anchoring conditional
+   rewrite (R1.2 — source: assumption on commitment-default,
+   source: data on a named anchor), self-audit examples for
+   citation context-leak (R1.5), `plan_type`-driven lognormal
+   default for megaproject CAPEX (R2.2), and the detailed
+   correlations-emission rules (R1.3). Validation here is same-LLM,
+   same-session: a regression check, not an improvement claim.
+2. **Bucket-categorisation discipline in compress.** The residual
    public-output miss in paperclip v53c is the LLM filing a
    `$X exceeds threshold` tripwire under `risks_and_shocks` instead
    of `gates_and_thresholds`. The bucket-prompt for
@@ -254,19 +265,19 @@ improves napkin_math output quality most directly:
    "If <metric> <comparator> <numeric threshold>, then ..." sentence,
    even when the source frames it as a downside risk. Verify across
    the 6-plan probe set; do not overfit to the paperclip OPC UA case.
-2. **Implement proposal 141** (`dropped_signals` schema in extract
+3. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
    This is the right guardrail for v49/v51 absences and
    cap-pressure tradeoffs. Now that the upstream variance fixes
    landed (#743, #744), the audit's classification of preserved /
    replaced / dropped signals will be measuring against a less
    leaky pipeline.
-3. **Different-LLM behavioural validation** of the rules now on
+4. **Different-LLM behavioural validation** of the rules now on
    main. A Self-Improve run with the default napkin_math LLM
    (Gemini Flash Lite) against the same digests would close the
    same-LLM same-session confound. This should be treated as
    validation of prompt generality, not as the next quality fix.
-4. **Prompt-hygiene pass** for the remaining domain-specific
+5. **Prompt-hygiene pass** for the remaining domain-specific
    examples (e.g. `european_prepper_active_buyers`) in either
    extract prompt. This is worthwhile and small, but not
    load-bearing for the currently observed napkin_math failures.