napkin-math(compress): second-pass shifts the variance failure from emission to ranking#743
Merged
Merged
Conversation
…ce on saturated buckets
Adds a second "what did you miss?" pass to each scored-list bucket (numeric_values, load_bearing_assumptions, gates_and_thresholds, risks_and_shocks, missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged; the second pass appends a follow-up user message asking the LLM to surface items the first pass missed, with the same Pydantic schema.
Why: when a section names more high-signal candidates than the per-bucket cap accommodates, the LLM has to make a hard prioritisation cut in a single pass. Smaller models (Gemini Flash Lite specifically) struggle to count and triage near the cap reliably, and across runs the cut lands differently — different runs drop different tripwires. The Paperclip premortem's $75k OPC UA bid and 100ms p99 latency tripwires were the load-bearing example: v49 had both, v50 and v51 dropped both intermittently. Two passes each producing a manageable batch are easier for a small model than one pass producing a saturated batch.
How: in the existing per-bucket loop, after the first-pass LLM call and after appending the first-pass assistant turn to the chat history, for the 5 scored-list buckets only, append a USER follow-up message ('Review the items you produced above. Identify items the section names that you missed. Emit only NEW items. Same bucket rules. Up to 8 new items. Empty list is fine.'), call the LLM again with the same schema, and merge the two batches by normalised source_quote (dropping second-pass duplicates). The deterministic top-N filter (annotate_scored_items) runs on the merged pool, so over-emission in the second pass is safe.
Cost: roughly 2x compress LLM calls per section (24 → up to 44 per full compress run; 4 sections × 6 buckets, of which 5 are scored-list and now do 2 passes). Time impact ~2x for the compress phase.
Acceptance evidence (single-session, regression check posture):
- Paperclip 3-run test: 100ms p99 latency tripwire preserved 3/3 in the public top-6 (vs 0/3 in v50/v51). $75k OPC UA bid surfaced in the candidate pool 3/3, but the deterministic scorer + 6-cap kept it in the public top-6 only 1/3 — the bid loses on composite_score to higher-evidence explicit gates from the first pass. The second-pass mechanism works; the remaining variance is at the scoring/cap layer, not the LLM emission layer.
- Other-probe regression check on datacenter, crate_recovery, mars_gtld, euro_adoption: gate counts in compress_review_plan.md and compress_premortem.md are stable vs v51 (6/6 in saturated buckets); two probes (datacenter, euro) gain previously-empty premortem gates buckets, a pure-add from the second pass surfacing content the first pass had not produced.
- Yellowstone v52 had a missing compress_premortem output (file absent from disk). Investigated and confirmed to be a pre-existing flaky failure mode not introduced by this change — the file goes missing on prior baselines too. Out of scope for this PR.
Unit tests: 4 new tests for merge_second_pass_items cover the empty-second-pass, all-new-second-pass, normalised-source_quote-deduplication, and order-preservation cases. 21/21 tests pass.
Out of scope for this PR (each is a separate follow-up):
- Raising MAX_ITEMS_PER_BUCKET from 6 or adding a stress_test/tripwire bonus to composite_score, which is the layer that's now dropping the $75k bid.
- Source-preservation audit (proposal 141) implementation that would classify v49 absences mechanically.
- Different-LLM behavioural validation against this mechanism.
This was referenced May 21, 2026
This was referenced May 21, 2026
neoneye
added a commit
that referenced
this pull request
May 21, 2026
…promoter) in 20260520 plan Per user direction, the plan-status update lands in PR #750 (not a separate doc PR). PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line. Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744. Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting
pushed a commit
to repomesh/PlanExe
that referenced
this pull request
May 22, 2026
…or LLM paraphrase The strict substring check rejected verbatim claims whose source_quote dropped or reordered words even though every content token was present in the source. That made quote_verified a fragile signal — and because composite_score adds a +10 verified-quote bonus on a 1-25 base, a single missed verification can drop a load-bearing tripwire (e.g. paperclip's $75,000 OPC UA bid threshold) below the MAX_ITEMS_PER_BUCKET cutoff in roughly two-thirds of runs. Fallback path: when the substring check misses, tokenize quote and source on Unicode word boundaries, require every digit-bearing quote token to appear in the source (no hallucinated numbers), and require >=90% of all quote tokens to appear (blocks single-word content substitutions like "highest" for "lowest"). Minimum 3 quote tokens to avoid trivial overlap on a large source. Substring fast-path is preserved as the primary signal. Empirically (6 plans, 8 compress runs, 1361 candidate items): qv=True total rises from ~1027 to 1192. 165 items (13.8% of qv=True) now verify only via the fallback; spot-checking 30 random flips shows all are legitimate paraphrases (dropped intermediate words, em-dash variants, ellipsis-elided clauses), no hallucinations. Paperclip acceptance probe (3 runs): $75,000 OPC UA bid now survives the public top-6 with qv=True in 2/3 (vs 1/3 before). The remaining 1/3 is an emission-layer miss (LLM does not produce the bid in either pass) and is out of scope for this PR — it was the same gap PR PlanExeOrg#743 explicitly left open. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting
pushed a commit
to repomesh/PlanExe
that referenced
this pull request
May 22, 2026
…tus in 20260520 plan Adds PR PlanExeOrg#743 (compress emission-layer second pass, merged) and PR PlanExeOrg#744 (compress ranking-layer paraphrase-tolerant quote verification, open and CI green) to the landed-on-main section with honest framing of what each closes and what each does not. Refreshes the Compress-LLM run-to-run variance known-limitation entry: the emission side is closed by PlanExeOrg#743 and the verification side by PlanExeOrg#744. Residual modes are bucket-categorisation variance and second-pass-also-misses cases, both at the LLM's emission layer. Updates Phase 1 status row and the Next-likely-move list — compress-variance handling is no longer item 1; bucket-categorisation discipline takes its place, then proposal 141, then different-LLM validation, then prompt hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting
pushed a commit
to repomesh/PlanExe
that referenced
this pull request
May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section. Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes. Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Scope
Compress-stage change only. Single substantive file:
compress_report_section.py. Tests added; no extract or downstream changes.What this PR does
Adds a second "what did you miss?" pass to each scored-list bucket (
numeric_values,load_bearing_assumptions,gates_and_thresholds,risks_and_shocks,missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged. The first-pass assistant turn is appended to the chat. A USER follow-up asks the LLM to surface items the first pass missed, with the same Pydantic schema. The two batches are merged by normalisedsource_quote(merge_second_pass_items), and the merged pool goes through the existing deterministic top-N filter.Framing
Second-pass compress improves candidate recall and moves the paperclip tripwire problem from emission failure to ranking/cap selection.
This PR does NOT fully fix public-digest stability for the
$75k OPC UA bidtripwire. What it does:source_evidenceexplicit alternatives.Acceptance evidence
Paperclip 3-run test (the load-bearing case):
The 100ms tripwire fully recovers; the $75k bid recovers in candidate generation but not in public output.
Other-probe regression check (datacenter, crate_recovery, mars_gtld, euro_adoption):
compress_review_plan.mdstable vs v51 (6/6 in saturated buckets).gates_and_thresholds: 0 → 6 (pure-add from the second pass).gates_and_thresholds: 0 → 6 (pure-add).quote: unverified. Acceptable for this PR's scope (improving recall) but called out honestly — the second pass surfaces more candidates and not all of them are top-quality. The scoring/cap layer is what should ultimately filter; this PR doesn't touch that.Yellowstone: the first compress run had
compress_premortem.mdabsent from disk. Confirmed to be a pre-existing flaky failure mode (also observed on prior baselines). Re-ran cleanly. Not caused by this change.Cost
Roughly 2x compress LLM calls. Original 24 calls → up to 44 per full run (4 sections × (1 section_summary + 5 scored buckets × 2 passes)). Wall time ~90s → ~3 min per plan.
Out of scope — each a separate PR
MAX_ITEMS_PER_BUCKETfrom 6, or adding a tripwire/stress_testbonus tocomposite_score. This is the layer that drops the $75k bid in 2/3 paperclip runs. If high-impact tripwires need to survive public output reliably, the next PR belongs here.Tests
21/21 pass. Four new unit tests for
merge_second_pass_items:source_quote.Commit
5f44c6f9—napkin-math(compress): two-batch per-bucket compress to reduce variance on saturated bucketsTest plan
SECOND_PASS_USER_PROMPT_TEMPLATEuses only abstract structural language (no corpus literals)