Skip to content

napkin-math(compress): second-pass shifts the variance failure from emission to ranking#743

Merged
neoneye merged 1 commit into
mainfrom
napkin-math/compress-second-pass
May 21, 2026
Merged

napkin-math(compress): second-pass shifts the variance failure from emission to ranking#743
neoneye merged 1 commit into
mainfrom
napkin-math/compress-second-pass

Conversation

@neoneye
Copy link
Copy Markdown
Member

@neoneye neoneye commented May 21, 2026

Scope

Compress-stage change only. Single substantive file: compress_report_section.py. Tests added; no extract or downstream changes.

What this PR does

Adds a second "what did you miss?" pass to each scored-list bucket (numeric_values, load_bearing_assumptions, gates_and_thresholds, risks_and_shocks, missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged. The first-pass assistant turn is appended to the chat. A USER follow-up asks the LLM to surface items the first pass missed, with the same Pydantic schema. The two batches are merged by normalised source_quote (merge_second_pass_items), and the merged pool goes through the existing deterministic top-N filter.

Framing

Second-pass compress improves candidate recall and moves the paperclip tripwire problem from emission failure to ranking/cap selection.

This PR does NOT fully fix public-digest stability for the $75k OPC UA bid tripwire. What it does:

  • Closes the emission-layer gap — items the source names are now reliably reaching the candidate pool.
  • Leaves the ranking-layer gap open — when the candidate pool exceeds the public cap (6 items), the deterministic composite scorer drops some high-value tripwires in favor of higher-source_evidence explicit alternatives.

Acceptance evidence

Paperclip 3-run test (the load-bearing case):

Tripwire candidate pool public top-6 v50/v51 baseline
100ms p99 latency 3/3 3/3 0/3 in public
$75k OPC UA bid 3/3 1/3 0/3 in public

The 100ms tripwire fully recovers; the $75k bid recovers in candidate generation but not in public output.

Other-probe regression check (datacenter, crate_recovery, mars_gtld, euro_adoption):

  • Gate counts in compress_review_plan.md stable vs v51 (6/6 in saturated buckets).
  • Datacenter premortem gates_and_thresholds: 0 → 6 (pure-add from the second pass).
  • Euro premortem gates_and_thresholds: 0 → 6 (pure-add).
  • Sampled the new gates: most are plausible, but some are softer risk-trigger items rather than hard pass/fail gates, and a few have quote: unverified. Acceptable for this PR's scope (improving recall) but called out honestly — the second pass surfaces more candidates and not all of them are top-quality. The scoring/cap layer is what should ultimately filter; this PR doesn't touch that.

Yellowstone: the first compress run had compress_premortem.md absent from disk. Confirmed to be a pre-existing flaky failure mode (also observed on prior baselines). Re-ran cleanly. Not caused by this change.

Cost

Roughly 2x compress LLM calls. Original 24 calls → up to 44 per full run (4 sections × (1 section_summary + 5 scored buckets × 2 passes)). Wall time ~90s → ~3 min per plan.

Out of scope — each a separate PR

  1. Scorer/cap policy. Raising MAX_ITEMS_PER_BUCKET from 6, or adding a tripwire/stress_test bonus to composite_score. This is the layer that drops the $75k bid in 2/3 paperclip runs. If high-impact tripwires need to survive public output reliably, the next PR belongs here.
  2. Source-preservation audit (proposal Multiple api keys cleanup5 #141 implementation).
  3. Different-LLM behavioural validation against the second-pass mechanism. Same-session evidence above is a regression check posture.
  4. Pre-existing missing-section flake in compress orchestration (not introduced by this change).

Tests

21/21 pass. Four new unit tests for merge_second_pass_items:

  • Empty second pass returns first pass unchanged.
  • All-new second-pass items are appended.
  • Duplicate detection survives whitespace / casing / punctuation differences in source_quote.
  • Emit order preserved (first-pass first, then second-pass in emitted order).

Commit

  • 5f44c6f9napkin-math(compress): two-batch per-bucket compress to reduce variance on saturated buckets

Test plan

  • CI green (lint / tests / typecheck)
  • SECOND_PASS_USER_PROMPT_TEMPLATE uses only abstract structural language (no corpus literals)
  • Treat any quality concerns about second-pass-added gates as input to the next PR (scorer/cap policy), not as a reason to block this one

…ce on saturated buckets

Adds a second "what did you miss?" pass to each scored-list bucket (numeric_values, load_bearing_assumptions, gates_and_thresholds, risks_and_shocks, missing_data_to_estimate). The first pass uses the existing bucket prompt unchanged; the second pass appends a follow-up user message asking the LLM to surface items the first pass missed, with the same Pydantic schema.

Why: when a section names more high-signal candidates than the per-bucket cap accommodates, the LLM has to make a hard prioritisation cut in a single pass. Smaller models (Gemini Flash Lite specifically) struggle to count and triage near the cap reliably, and across runs the cut lands differently — different runs drop different tripwires. The Paperclip premortem's $75k OPC UA bid and 100ms p99 latency tripwires were the load-bearing example: v49 had both, v50 and v51 dropped both intermittently. Two passes each producing a manageable batch are easier for a small model than one pass producing a saturated batch.

How: in the existing per-bucket loop, after the first-pass LLM call and after appending the first-pass assistant turn to the chat history, for the 5 scored-list buckets only, append a USER follow-up message ('Review the items you produced above. Identify items the section names that you missed. Emit only NEW items. Same bucket rules. Up to 8 new items. Empty list is fine.'), call the LLM again with the same schema, and merge the two batches by normalised source_quote (dropping second-pass duplicates). The deterministic top-N filter (annotate_scored_items) runs on the merged pool, so over-emission in the second pass is safe.

Cost: roughly 2x compress LLM calls per section (24 → up to 44 per full compress run; 4 sections × 6 buckets, of which 5 are scored-list and now do 2 passes). Time impact ~2x for the compress phase.

Acceptance evidence (single-session, regression check posture):

- Paperclip 3-run test: 100ms p99 latency tripwire preserved 3/3 in the public top-6 (vs 0/3 in v50/v51). $75k OPC UA bid surfaced in the candidate pool 3/3, but the deterministic scorer + 6-cap kept it in the public top-6 only 1/3 — the bid loses on composite_score to higher-evidence explicit gates from the first pass. The second-pass mechanism works; the remaining variance is at the scoring/cap layer, not the LLM emission layer.

- Other-probe regression check on datacenter, crate_recovery, mars_gtld, euro_adoption: gate counts in compress_review_plan.md and compress_premortem.md are stable vs v51 (6/6 in saturated buckets); two probes (datacenter, euro) gain previously-empty premortem gates buckets, a pure-add from the second pass surfacing content the first pass had not produced.

- Yellowstone v52 had a missing compress_premortem output (file absent from disk). Investigated and confirmed to be a pre-existing flaky failure mode not introduced by this change — the file goes missing on prior baselines too. Out of scope for this PR.

Unit tests: 4 new tests for merge_second_pass_items cover the empty-second-pass, all-new-second-pass, normalised-source_quote-deduplication, and order-preservation cases. 21/21 tests pass.

Out of scope for this PR (each is a separate follow-up):

- Raising MAX_ITEMS_PER_BUCKET from 6 or adding a stress_test/tripwire bonus to composite_score, which is the layer that's now dropping the $75k bid.

- Source-preservation audit (proposal 141) implementation that would classify v49 absences mechanically.

- Different-LLM behavioural validation against this mechanism.
@neoneye neoneye changed the title napkin-math(compress): two-batch per-bucket compress to reduce variance on saturated buckets napkin-math(compress): second-pass shifts the variance failure from emission to ranking May 21, 2026
@neoneye neoneye merged commit cd62982 into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/compress-second-pass branch May 21, 2026 12:42
neoneye added a commit that referenced this pull request May 21, 2026
docs(napkin-math): record PR #743 and PR #744 in 20260520 plan
neoneye added a commit that referenced this pull request May 21, 2026
…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR).

PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line.

Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.

Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…or LLM paraphrase

The strict substring check rejected verbatim claims whose source_quote dropped or reordered words even though every content token was present in the source. That made quote_verified a fragile signal — and because composite_score adds a +10 verified-quote bonus on a 1-25 base, a single missed verification can drop a load-bearing tripwire (e.g. paperclip's $75,000 OPC UA bid threshold) below the MAX_ITEMS_PER_BUCKET cutoff in roughly two-thirds of runs.

Fallback path: when the substring check misses, tokenize quote and source on Unicode word boundaries, require every digit-bearing quote token to appear in the source (no hallucinated numbers), and require >=90% of all quote tokens to appear (blocks single-word content substitutions like "highest" for "lowest"). Minimum 3 quote tokens to avoid trivial overlap on a large source. Substring fast-path is preserved as the primary signal.

Empirically (6 plans, 8 compress runs, 1361 candidate items): qv=True total rises from ~1027 to 1192. 165 items (13.8% of qv=True) now verify only via the fallback; spot-checking 30 random flips shows all are legitimate paraphrases (dropped intermediate words, em-dash variants, ellipsis-elided clauses), no hallucinations.

Paperclip acceptance probe (3 runs): $75,000 OPC UA bid now survives the public top-6 with qv=True in 2/3 (vs 1/3 before). The remaining 1/3 is an emission-layer miss (LLM does not produce the bid in either pass) and is out of scope for this PR — it was the same gap PR PlanExeOrg#743 explicitly left open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…tus in 20260520 plan

Adds PR PlanExeOrg#743 (compress emission-layer second pass, merged) and PR PlanExeOrg#744 (compress ranking-layer paraphrase-tolerant quote verification, open and CI green) to the landed-on-main section with honest framing of what each closes and what each does not.

Refreshes the Compress-LLM run-to-run variance known-limitation entry: the emission side is closed by PlanExeOrg#743 and the verification side by PlanExeOrg#744. Residual modes are bucket-categorisation variance and second-pass-also-misses cases, both at the LLM's emission layer.

Updates Phase 1 status row and the Next-likely-move list — compress-variance handling is no longer item 1; bucket-categorisation discipline takes its place, then proposal 141, then different-LLM validation, then prompt hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant