PlanExeOrg · neoneye · May 21, 2026
diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
@@ -82,6 +82,8 @@ separated below.
   - `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
   - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
   All edits applied symmetrically to both extract skills. No corpus literals introduced.
+- **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
+- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). This closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 of 1206 `qv=True` items (13.7%) verify only via the fallback, and a 30-sample audit found all of them to be legitimate paraphrases. Tightening the threshold (90% → 100%) cost 0 `qv=True` items on observed data. Across 3 paperclip runs, 2/3 now have the `$75k` OPC UA bid in the public top-6 (vs 1/3 before), and the bid was verified whenever it was emitted (3/3). Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
 
 ### PR #737 detail (already on main)
 
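As a rough illustration of the PR #743 and PR #744 entries in the hunk above, the two deterministic pieces could be sketched as follows. This is not the code from `compress_report_section.py`: the tokenizer, the `min_tokens` parameter name, the dict-shaped items, and both signatures are assumptions made for the example. Only the function names, the dedup key (`source_quote`), the substring fast path, the every-token rule, and the 3-token gate come from the plan text.

```python
import re

# Hypothetical tokenizer: lowercase alphanumeric runs, so "$75k," -> "75k".
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def _tokens(text: str) -> list[str]:
    return _TOKEN_RE.findall(text.lower())


def _normalise(text: str) -> str:
    # Assumed normalisation: case-folded tokens joined by single spaces.
    return " ".join(_tokens(text))


def merge_second_pass_items(first_pass: list[dict], second_pass: list[dict]) -> list[dict]:
    """Append second-pass items whose normalised source_quote is not already present."""
    merged = list(first_pass)
    seen = {_normalise(item["source_quote"]) for item in first_pass}
    for item in second_pass:
        key = _normalise(item["source_quote"])
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged


def quote_is_in_source(quote: str, source: str, min_tokens: int = 3) -> bool:
    """Substring fast path, then an every-token-present fallback."""
    if not quote.strip():
        return False
    if quote in source:  # fast path: verbatim quote
        return True
    quote_tokens = _tokens(quote)
    if len(quote_tokens) < min_tokens:  # gate: very short quotes must match verbatim
        return False
    source_tokens = set(_tokens(source))
    return all(token in source_tokens for token in quote_tokens)


source = "If the OPC UA integration bid exceeds $75k, pause the pilot and renegotiate scope."
verbatim = quote_is_in_source("OPC UA integration bid exceeds $75k", source)
paraphrase = quote_is_in_source("OPC UA bid exceeds $75k", source)  # dropped a word
wrong_number = quote_is_in_source("OPC UA bid exceeds $90k", source)
too_short = quote_is_in_source("the bid", source)

merged = merge_second_pass_items(
    [{"source_quote": "OPC UA bid exceeds $75k"}],
    [{"source_quote": "opc ua  bid exceeds $75K"}, {"source_quote": "p99 latency above 100ms"}],
)
```

In this sketch the fallback is a set-membership test, so it ignores token order and repetition. That is what lets a reordered or shortened paraphrase verify, and it is also why the minimum-token gate matters: a one- or two-word quote would otherwise match almost any source.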
@@ -158,13 +160,18 @@ either; each gets its own follow-up:
 
 - **Compress-LLM run-to-run variance.** Same prompts, same source,
   two compress passes can produce materially different bucket
-  selections. The paperclip premortem tripwires (`$75k OPC UA bid`,
-  `100ms` p99 latency, both in v49) drop at the compress stage in
-  v50 and v51 and cannot be recovered at the extract layer because
-  the digest itself does not surface them. This is the **clearest
-  unresolved regression** across the probe set. The fix belongs in
-  orchestration: deterministic retry/merge across N compress
-  passes, or lower-temperature reruns for high-impact buckets.
+  selections. PR #743 (merged, second pass) closed the emission
+  side: when the first pass skips a tripwire, the second pass,
+  given the first-pass items as context, often catches it. PR #744
+  (open, paraphrase-tolerant quote match) addresses the verification
+  side: paraphrased quotes whose tokens are all in the source no
+  longer flip to `qv=False` and lose the +10 verified-quote bonus at
+  the ranking layer. Residual modes: (a) bucket-categorisation
+  variance, where the LLM occasionally files a `$X exceeds threshold`
+  tripwire under `risks_and_shocks` rather than `gates_and_thresholds`
+  (paperclip v53c); (b) the second pass itself sometimes misses too
+  (paperclip v52c). Both sit at the LLM's emission/categorisation
+  layer and cannot be fixed by deterministic post-processing alone.
 - **Threshold-pairing rule × `missing_values_to_estimate` 5-cap.**
   When a plan names many independent thresholds, every-threshold
   pairing collides with the cap and forces a tradeoff. The
@@ -222,7 +229,7 @@ too:
 
 | Phase | Skill / module | Status |
 |---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
 | 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
 | 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
 | 4 | `generate-bounds` | not started |
@@ -235,26 +242,25 @@ too:
 
 ### Next likely move
 
-After PR #740, the next work should be ordered by what improves
-napkin_math output quality most directly, not by what is easiest to
-measure. Preferred order:
-
-1. **Compress-LLM variance handling.** Deterministic retry/merge or
-   lower-temperature reruns for high-impact compress buckets should
-   come next. The clearest driver is the paperclip OPC UA / latency
-   tripwires that v49 surfaced and v50/v51 drop at the compress
-   layer. This is upstream of extraction: if the digest does not
-   carry the tripwire, no extract prompt can recover it. Proposal
-   141 would classify this loss, but variance handling is the piece
-   that can restore the missing source signal.
+After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
+tolerant quote verification), the remaining work is ordered by what
+improves napkin_math output quality most directly:
+
+1. **Bucket-categorisation discipline in compress.** The residual
+   public-output miss in paperclip v53c is the LLM filing a
+   `$X exceeds threshold` tripwire under `risks_and_shocks` instead
+   of `gates_and_thresholds`. The bucket-prompt for
+   `gates_and_thresholds` could be tightened to claim any
+   "If <metric> <comparator> <numeric threshold>, then ..." sentence,
+   even when the source frames it as a downside risk. Verify across
+   the 6-plan probe set; do not overfit to the paperclip OPC UA case.
 2. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
-   This should follow close behind variance handling. It is the
-   right guardrail for v49/v51 absences and cap-pressure tradeoffs,
-   but it is primarily a measurement and accountability layer: it
-   classifies preserved / replaced / dropped signals and records
-   rationale in the artifact. It does not by itself make the
-   compressor less lossy.
+   This is the right guardrail for v49/v51 absences and
+   cap-pressure tradeoffs. With the upstream variance fixes in
+   place (#743 merged, #744 once it merges), the audit's
+   classification of preserved / replaced / dropped signals will
+   measure a less leaky pipeline.
 3. **Different-LLM behavioural validation** of the rules now on
    main. A Self-Improve run with the default napkin_math LLM
    (Gemini Flash Lite) against the same digests would close the
@@ -265,12 +271,10 @@ measure. Preferred order:
    extract prompt. This is worthwhile and small, but not
    load-bearing for the currently observed napkin_math failures.
 
-These are separate PRs. The next PR should be compress variance only:
-no corpus literals, no hand-patched outputs, rerun compress + extract
-through the skills, validate regenerated `parameters.json`, and
-compare against v49/v50/v51 honestly. Bundling the audit,
-behavioural validation, or prompt hygiene into that PR would obscure
-whether the upstream signal-loss fix actually worked.
+These are separate PRs, each shipping independently. Bundling
+categorisation, audit implementation, behavioural validation, or
+prompt hygiene into one PR would obscure which piece moved which
+metric.
 
 ## Per-theme mapping