napkin-math(compress): second-pass shifts the variance failure from emission to ranking by neoneye · Pull Request #743 · PlanExeOrg/PlanExe

neoneye · 2026-05-21T12:28:20Z

Scope

Compress-stage change only. Single substantive file: compress_report_section.py. Tests added; no extract or downstream changes.

What this PR does

Adds a second "what did you miss?" pass to each scored-list bucket (numeric_values, load_bearing_assumptions, gates_and_thresholds, risks_and_shocks, missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged. The first-pass assistant turn is appended to the chat. A USER follow-up asks the LLM to surface items the first pass missed, with the same Pydantic schema. The two batches are merged by normalised source_quote (merge_second_pass_items), and the merged pool goes through the existing deterministic top-N filter.

Framing

Second-pass compress improves candidate recall and moves the paperclip tripwire problem from emission failure to ranking/cap selection.

This PR does NOT fully fix public-digest stability for the $75k OPC UA bid tripwire. What it does:

Closes the emission-layer gap — items the source names are now reliably reaching the candidate pool.
Leaves the ranking-layer gap open — when the candidate pool exceeds the public cap (6 items), the deterministic composite scorer drops some high-value tripwires in favor of higher-source_evidence explicit alternatives.

Acceptance evidence

Paperclip 3-run test (the load-bearing case):

Tripwire	candidate pool	public top-6	v50/v51 baseline
100ms p99 latency	3/3	3/3	0/3 in public
$75k OPC UA bid	3/3	1/3	0/3 in public

The 100ms tripwire fully recovers; the $75k bid recovers in candidate generation but not in public output.

Other-probe regression check (datacenter, crate_recovery, mars_gtld, euro_adoption):

Gate counts in compress_review_plan.md stable vs v51 (6/6 in saturated buckets).
Datacenter premortem gates_and_thresholds: 0 → 6 (pure-add from the second pass).
Euro premortem gates_and_thresholds: 0 → 6 (pure-add).
Sampled the new gates: most are plausible, but some are softer risk-trigger items rather than hard pass/fail gates, and a few have quote: unverified. Acceptable for this PR's scope (improving recall) but called out honestly — the second pass surfaces more candidates and not all of them are top-quality. The scoring/cap layer is what should ultimately filter; this PR doesn't touch that.

Yellowstone: the first compress run had compress_premortem.md absent from disk. Confirmed to be a pre-existing flaky failure mode (also observed on prior baselines). Re-ran cleanly. Not caused by this change.

Cost

Roughly 2x compress LLM calls. Original 24 calls → up to 44 per full run (4 sections × (1 section_summary + 5 scored buckets × 2 passes)). Wall time ~90s → ~3 min per plan.

Out of scope — each a separate PR

Scorer/cap policy. Raising MAX_ITEMS_PER_BUCKET from 6, or adding a tripwire/stress_test bonus to composite_score. This is the layer that drops the $75k bid in 2/3 paperclip runs. If high-impact tripwires need to survive public output reliably, the next PR belongs here.
Source-preservation audit (proposal Multiple api keys cleanup5 #141 implementation).
Different-LLM behavioural validation against the second-pass mechanism. Same-session evidence above is a regression check posture.
Pre-existing missing-section flake in compress orchestration (not introduced by this change).

Tests

21/21 pass. Four new unit tests for merge_second_pass_items:

Empty second pass returns first pass unchanged.
All-new second-pass items are appended.
Duplicate detection survives whitespace / casing / punctuation differences in source_quote.
Emit order preserved (first-pass first, then second-pass in emitted order).

Commit

5f44c6f9 — napkin-math(compress): two-batch per-bucket compress to reduce variance on saturated buckets

Test plan

CI green (lint / tests / typecheck)
SECOND_PASS_USER_PROMPT_TEMPLATE uses only abstract structural language (no corpus literals)
Treat any quality concerns about second-pass-added gates as input to the next PR (scorer/cap policy), not as a reason to block this one

…ce on saturated buckets Adds a second "what did you miss?" pass to each scored-list bucket (numeric_values, load_bearing_assumptions, gates_and_thresholds, risks_and_shocks, missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged; the second pass appends a follow-up user message asking the LLM to surface items the first pass missed, with the same Pydantic schema. Why: when a section names more high-signal candidates than the per-bucket cap accommodates, the LLM has to make a hard prioritisation cut in a single pass. Smaller models (Gemini Flash Lite specifically) struggle to count and triage near the cap reliably, and across runs the cut lands differently — different runs drop different tripwires. The Paperclip premortem's $75k OPC UA bid and 100ms p99 latency tripwires were the load-bearing example: v49 had both, v50 and v51 dropped both intermittently. Two passes each producing a manageable batch are easier for a small model than one pass producing a saturated batch. How: in the existing per-bucket loop, after the first-pass LLM call and after appending the first-pass assistant turn to the chat history, for the 5 scored-list buckets only, append a USER follow-up message ('Review the items you produced above. Identify items the section names that you missed. Emit only NEW items. Same bucket rules. Up to 8 new items. Empty list is fine.'), call the LLM again with the same schema, and merge the two batches by normalised source_quote (dropping second-pass duplicates). The deterministic top-N filter (annotate_scored_items) runs on the merged pool, so over-emission in the second pass is safe. Cost: roughly 2x compress LLM calls per section (24 → up to 44 per full compress run; 4 sections × 6 buckets, of which 5 are scored-list and now do 2 passes). Time impact ~2x for the compress phase. Acceptance evidence (single-session, regression check posture): - Paperclip 3-run test: 100ms p99 latency tripwire preserved 3/3 in the public top-6 (vs 0/3 in v50/v51). $75k OPC UA bid surfaced in the candidate pool 3/3, but the deterministic scorer + 6-cap kept it in the public top-6 only 1/3 — the bid loses on composite_score to higher-evidence explicit gates from the first pass. The second-pass mechanism works; the remaining variance is at the scoring/cap layer, not the LLM emission layer. - Other-probe regression check on datacenter, crate_recovery, mars_gtld, euro_adoption: gate counts in compress_review_plan.md and compress_premortem.md are stable vs v51 (6/6 in saturated buckets); two probes (datacenter, euro) gain previously-empty premortem gates buckets, a pure-add from the second pass surfacing content the first pass had not produced. - Yellowstone v52 had a missing compress_premortem output (file absent from disk). Investigated and confirmed to be a pre-existing flaky failure mode not introduced by this change — the file goes missing on prior baselines too. Out of scope for this PR. Unit tests: 4 new tests for merge_second_pass_items cover the empty-second-pass, all-new-second-pass, normalised-source_quote-deduplication, and order-preservation cases. 21/21 tests pass. Out of scope for this PR (each is a separate follow-up): - Raising MAX_ITEMS_PER_BUCKET from 6 or adding a stress_test/tripwire bonus to composite_score, which is the layer that's now dropping the $75k bid. - Source-preservation audit (proposal 141) implementation that would classify v49 absences mechanically. - Different-LLM behavioural validation against this mechanism.

docs(napkin-math): record PR #743 and PR #744 in 20260520 plan

…promoter) in 20260520 plan Per user direction, the plan-status update lands in PR #750 (not a separate doc PR). PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line. Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744. Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…or LLM paraphrase The strict substring check rejected verbatim claims whose source_quote dropped or reordered words even though every content token was present in the source. That made quote_verified a fragile signal — and because composite_score adds a +10 verified-quote bonus on a 1-25 base, a single missed verification can drop a load-bearing tripwire (e.g. paperclip's $75,000 OPC UA bid threshold) below the MAX_ITEMS_PER_BUCKET cutoff in roughly two-thirds of runs. Fallback path: when the substring check misses, tokenize quote and source on Unicode word boundaries, require every digit-bearing quote token to appear in the source (no hallucinated numbers), and require >=90% of all quote tokens to appear (blocks single-word content substitutions like "highest" for "lowest"). Minimum 3 quote tokens to avoid trivial overlap on a large source. Substring fast-path is preserved as the primary signal. Empirically (6 plans, 8 compress runs, 1361 candidate items): qv=True total rises from ~1027 to 1192. 165 items (13.8% of qv=True) now verify only via the fallback; spot-checking 30 random flips shows all are legitimate paraphrases (dropped intermediate words, em-dash variants, ellipsis-elided clauses), no hallucinations. Paperclip acceptance probe (3 runs): $75,000 OPC UA bid now survives the public top-6 with qv=True in 2/3 (vs 1/3 before). The remaining 1/3 is an emission-layer miss (LLM does not produce the bid in either pass) and is out of scope for this PR — it was the same gap PR PlanExeOrg#743 explicitly left open. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tus in 20260520 plan Adds PR PlanExeOrg#743 (compress emission-layer second pass, merged) and PR PlanExeOrg#744 (compress ranking-layer paraphrase-tolerant quote verification, open and CI green) to the landed-on-main section with honest framing of what each closes and what each does not. Refreshes the Compress-LLM run-to-run variance known-limitation entry: the emission side is closed by PlanExeOrg#743 and the verification side by PlanExeOrg#744. Residual modes are bucket-categorisation variance and second-pass-also-misses cases, both at the LLM's emission layer. Updates Phase 1 status row and the Next-likely-move list — compress-variance handling is no longer item 1; bucket-categorisation discipline takes its place, then proposal 141, then different-LLM validation, then prompt hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rg#747 (Phase 4 runtime) in 20260520 plan Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section. Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes. Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye changed the title ~~napkin-math(compress): two-batch per-bucket compress to reduce variance on saturated buckets~~ napkin-math(compress): second-pass shifts the variance failure from emission to ranking May 21, 2026

neoneye merged commit cd62982 into main May 21, 2026
3 checks passed

neoneye deleted the napkin-math/compress-second-pass branch May 21, 2026 12:42

This was referenced May 21, 2026

napkin-math(compress): paraphrase-tolerant quote verification #744

Merged

docs(napkin-math): record PR #743 and PR #744 in 20260520 plan #745

Merged

neoneye added a commit that referenced this pull request May 21, 2026

Merge pull request #745 from PlanExeOrg/docs/napkin-math-plan-update-744

7edbe92

docs(napkin-math): record PR #743 and PR #744 in 20260520 plan

This was referenced May 21, 2026

docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime) #748

Merged

napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks #750

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

napkin-math(compress): second-pass shifts the variance failure from emission to ranking#743

napkin-math(compress): second-pass shifts the variance failure from emission to ranking#743
neoneye merged 1 commit into
mainfrom
napkin-math/compress-second-pass

neoneye commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

neoneye commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scope

What this PR does

Framing

Acceptance evidence

Cost

Out of scope — each a separate PR

Tests

Commit

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

neoneye commented May 21, 2026 •

edited

Loading