napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks #750

Merged
neoneye merged 5 commits into main from napkin-math/compress-gates-vs-risks-priority
May 21, 2026

@neoneye
Member

@neoneye neoneye commented May 21, 2026

Summary

Addresses the residual paperclip v53c failure mode: the LLM sometimes files a tripwire (If $X exceeds threshold, then <downside>, OR the declarative <X> exceeds threshold, <consequence>) under risks_and_shocks instead of gates_and_thresholds. The misfiled item then competes against actual risks at top-N selection and can fall out of the public output entirely.

Approach

A deterministic post-processor over the LLM emissions, not a prompt rule. The first iteration of this PR tried a risks-side prompt rule; review correctly flagged that its causal model was wrong: gates is emitted before risks, so a risks-side prompt rule cannot move items into gates, and it risks creating a worse failure mode in which both buckets miss the item. That commit has been reverted on this branch; the gates and risks prompts are now back to their pre-PR state.

Code change

  1. has_gate_shape(line) — true when the surface form matches either:

    • Canonical if/then numeric: If <... digit token ...> then <consequence>
    • Declarative comparison numeric: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence> — the actual v53c phrasing. Recognised comparison verbs: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. Causal verbs (X risks Y, X causes Y, Failure of X leads to Y) are not recognised — those stay in risks.

    Numeric guard: separator comma/colon must not be followed by another digit, so commas inside numbers like $75,000 don't split the match. Qualitative if-then sentences without a numeric token between if and then are intentionally excluded so the promoter doesn't steal categorical/approval/deadline gates that the LLM already categorises correctly.

  2. gate_shape_promotion(line) — returns the if/then form of line if it has any recognised gate shape, else None. For declarative inputs it produces a deterministic if/then rewrite preserving the gates bucket's output contract. Acronym casing is preserved (API job queue latency... rewrites to If API ..., NOT If aPI ...) — the case adjustment only fires on regular capitalised words (uppercase followed by lowercase). line_original is intentionally NOT rewritten; it preserves the source's native phrasing.

  3. promote_gate_shaped_risks(gates, risks) — scans risks for gate-shaped items whose normalised source_quote is NOT already in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed. Items already in gates by quote are left in risks untouched (within-bucket dedupe is the existing 'do not restate' prompt rule's job).

  4. Wiring — defers annotate_scored_items for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged pools, then annotate_scored_items fires on the augmented gates pool and the remaining risks pool. The promoter's count is exposed in per_bucket.gates_and_thresholds.cross_bucket_promoted_count for downstream auditing.

Empirical posture

  • Unit tests: 44 pass in test_compress_report_section.py. New cases include: has_gate_shape true/false for both if/then and declarative shapes; positive regression on the literal v53c phrasing "Middleware development bid exceeds $75,000, consuming budget..."; negative regression on the genuine risk "Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase."; end-to-end promoter test that asserts the v53c-shaped risk is moved with line_english rewritten to if/then form and source_quote / scores / status / line_original preserved; acronym preservation (API stays API); regular capitalisation lowering (Middleware → middleware).
  • v56 regression sweep (paperclip 3× + 5 other probes, 290 risks candidate lines, 32 plan×section cells): 0 promotions fired. The v53c shape did NOT recur in this same-LLM same-session sweep; the v56 risks emissions are dominated by causal forms (X risks Y, X causes Y, Failure of X leads to Y) which the detector intentionally rejects. The change is a deterministic backstop for a rare LLM failure mode, analogous to the calculation-output strip rule in PR #747 (napkin-math(bounds): Phase 4 runtime + schema readiness), which also fired 0 times on its v48 regression corpus.
  • The 8-run regression otherwise shows typical same-LLM variance (mostly ±1-3 items per cell, no systematic over-narrowing). One section had a 0-candidate emission failure (paperclip v56c expert_criticism gates), unrelated to this change — the gates LLM call itself is unchanged; the promoter only acts post-LLM.

What this PR explicitly does NOT claim

Test plan

  • pytest worker_plan/.../tests/test_compress_report_section.py (44 pass)
  • v56 paperclip 3× + 5 regression probes — no systematic over-narrowing, 0 false-positive promotions
  • Reverted the wrong risks-side prompt rule from this PR's first commit
  • Detector covers both if/then numeric AND declarative comparison forms (the actual v53c phrasing)
  • Acronym casing preserved by the if/then rewrite
  • CI green on this branch

🤖 Generated with Claude Code

neoneye and others added 2 commits May 21, 2026 17:40
…tes_and_thresholds via the risks-side rule

Addresses the residual paperclip v53c failure mode: the LLM occasionally files a '$X exceeds threshold, then <downside>' tripwire under risks_and_shocks instead of gates_and_thresholds. The 'do not restate' guard in the risks prompt didn't catch the case because the LLM put the item in risks first, not as a restatement.

Change: adds a structural-priority paragraph to the risks_and_shocks bucket prompt that tells the LLM NOT to emit a sentence here when its source side has the 'If <metric> <comparator> <numeric threshold>, then <consequence>' shape — that shape belongs in gates_and_thresholds even when the then-clause is downside-flavoured (cost, schedule, scope, penalty, vendor switch).

Why ONLY the risks side and not also a parallel paragraph in the gates prompt: I tried both sides in v54 and found a clear over-narrowing regression — the LLM became more conservative about what counted as a gate, with paperclip expert_criticism dropping from 6 gates to 2, yellowstone selected_scenario from 6 to 3, and similar shrinkages elsewhere. Adding a long structural-shape paragraph to the gates prompt implicitly raised the bar for what counted as a gate (numeric thresholds only), excluding legitimate deadline/categorical gates. The risks-side rule alone is enough to claim the if/then numeric sentences for gates without narrowing gates from the other direction.

Empirical posture (regression check, NOT improvement claim; same-LLM same-session Gemini Flash Lite reruns):

Paperclip 3x (v55a/b/c): $75k OPC UA bid lands in gates_and_thresholds public top in ALL THREE runs (vs 2/3 before in v53). This is the focal v53c case the change targets.

5-other-plan cross-probe regression (euro_adoption, yellowstone, crate, mars_gtld, datacenter): bucket counts mostly unchanged. Modest 1-2-item shrinkages in datacenter selected_scenario and yellowstone selected_scenario are within typical LLM run variance. Two sections produced 0 gates in v55 (crate premortem and mars_gtld expert_criticism) — these are LLM run variance unrelated to this change: (a) the gates bucket prompt is unchanged so a risks-side rule cannot affect first-pass gate emission; (b) mars_gtld expert_criticism gates is a known-flaky combo (saw 0 candidates on PR #744 rerun); (c) crate premortem produced 6 gates fine in v54 with an even more aggressive prompt, ruling out the v55 risks-side change as the cause.

Unit tests: 28 pass (no test changes; the prompt edit is structural language, not new code paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…et promoter (deterministic, code-side)

Review feedback on PR #750 (first round): the risks-side prompt rule had a wrong causal model. gates_and_thresholds is emitted BEFORE risks_and_shocks in BUCKET_SPECS, so a risks-side prompt rule cannot move an item into gates — it can only suppress emission in risks. Worse, when gates already missed the item (v53c-style), the risks-side suppression rule removes the fallback visible copy, turning 'wrong bucket but visible' into 'missing from public output entirely'. The v55 3/3 paperclip result was LLM run variance in the gates bucket call, not evidence the prompt rule worked.

Replaces the prompt-side rule with a deterministic post-processor that scans the actual LLM emissions across both buckets and reroutes by structural shape:

has_gate_shape(line): true when the surface form matches 'If <something with a digit token> ... then <consequence>' — the structural shape the gates bucket prompt asks the LLM to produce. Language-neutral (digits are digits in any locale); does not key on English-only keywords beyond the if/then template the bucket prompt already requires. Qualitative if-then sentences (no numeric token) are intentionally excluded — they may legitimately be gates (categorical/approval/deadline) but the promoter only fires on the unambiguous numeric pattern to avoid stealing genuine risks.
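As a rough illustration of the canonical check described above, the following is a minimal sketch, not the code from this PR. The regular expression and the `_IF_THEN_NUMERIC` name are assumptions; only the function name `has_gate_shape` and the "digit token between if and then" rule come from the commit message.

```python
import re

# Hypothetical pattern: "If", a digit somewhere before "then", then a consequence.
_IF_THEN_NUMERIC = re.compile(
    r"^\s*if\b(?P<condition>.*?\d.*?)\bthen\b\s*(?P<consequence>\S.*)$",
    re.IGNORECASE | re.DOTALL,
)


def has_gate_shape(line):
    """Return True for 'If <... digit ...> then <consequence>' surface forms."""
    if not isinstance(line, str):
        # Non-string input (None, numbers, dicts) is never gate-shaped.
        return False
    return _IF_THEN_NUMERIC.match(line) is not None
```

A qualitative sentence such as "If the regulator rejects the permit, then halt work." is rejected by this sketch because no digit appears between "if" and "then", which mirrors the intentional exclusion described in the commit message.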

promote_gate_shaped_risks(gates_items, risks_items): scans risks for gate-shaped items whose normalised source_quote is NOT already represented in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed for an actual risk. Items already in gates by quote are left in risks untouched (within-bucket dedupe is a separate concern handled by the existing 'do not restate' prompt rule). Inputs are not mutated.
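The move-not-copy behaviour can be sketched as follows. This is an illustrative sketch under stated assumptions: items are modelled as dicts with `line_english` and `source_quote` keys, the quote normalisation (casefold plus whitespace collapse) is a guess, `_is_gate_shaped` is a stand-in detector, and the three-value return is hypothetical.

```python
import copy
import re


def _normalise_quote(quote):
    # Assumed normalisation: casefold and collapse runs of whitespace.
    return re.sub(r"\s+", " ", quote or "").strip().casefold()


def _is_gate_shaped(line):
    # Stand-in detector: "If <... digit ...> then <...>".
    return isinstance(line, str) and re.match(
        r"(?is)^\s*if\b.*?\d.*?\bthen\b", line
    ) is not None


def promote_gate_shaped_risks(gates_items, risks_items):
    """Return (new_gates, new_risks, promoted_count) without mutating inputs."""
    new_gates = copy.deepcopy(gates_items)
    new_risks = []
    seen = {_normalise_quote(item.get("source_quote")) for item in new_gates}
    promoted = 0
    for item in risks_items:
        key = _normalise_quote(item.get("source_quote"))
        if _is_gate_shaped(item.get("line_english")) and key not in seen:
            # Move: the item joins gates and is NOT kept in risks.
            new_gates.append(copy.deepcopy(item))
            seen.add(key)
            promoted += 1
        else:
            # Either a genuine risk, or already represented in gates by quote.
            new_risks.append(copy.deepcopy(item))
    return new_gates, new_risks, promoted
```

Working on deep copies is one simple way to honour the "inputs are not mutated" property; a real implementation might use a cheaper strategy.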

Wiring: defers annotate_scored_items (top-N filter) for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged candidate pools, then annotate fires on the augmented gates pool and the remaining risks pool. Per-bucket metadata gains a cross_bucket_promoted_count field so downstream consumers can audit.
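The deferral described above can be pictured with a small sketch. Everything here is hypothetical scaffolding: `run_buckets`, the `score` key, the `selected` flag, and this toy `annotate_scored_items` are stand-ins for the real pipeline; only the ordering (merge both buckets, promote, then apply top-N) and the `cross_bucket_promoted_count` field name come from the commit message.

```python
DEFERRED_BUCKETS = ("gates_and_thresholds", "risks_and_shocks")


def annotate_scored_items(items, top_n):
    # Toy top-N filter: flag the top_n highest-scoring items as selected.
    ranked = sorted(items, key=lambda item: item["score"], reverse=True)
    keep = {id(item) for item in ranked[:top_n]}
    return [{**item, "selected": id(item) in keep} for item in items]


def run_buckets(merged_pools, promoter, top_n=2):
    """merged_pools maps bucket name -> merged first+second-pass candidates."""
    annotated = {}
    per_bucket = {name: {} for name in merged_pools}
    held = {}
    for name, pool in merged_pools.items():
        if name in DEFERRED_BUCKETS:
            held[name] = pool  # defer the top-N filter for this bucket
        else:
            annotated[name] = annotate_scored_items(pool, top_n)
    if all(name in held for name in DEFERRED_BUCKETS):
        gates, risks, count = promoter(
            held["gates_and_thresholds"], held["risks_and_shocks"]
        )
        held = {"gates_and_thresholds": gates, "risks_and_shocks": risks}
        per_bucket["gates_and_thresholds"]["cross_bucket_promoted_count"] = count
    for name, pool in held.items():
        # The filter now sees the augmented gates pool and the reduced risks pool.
        annotated[name] = annotate_scored_items(pool, top_n)
    return annotated, per_bucket
```

The point of the ordering is that a promoted item competes for a gates slot rather than a risks slot, so the risks top-N is filled only by items that stayed in risks.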

Reverted the earlier risks-side prompt addition from this branch — it was both causally wrong AND created a worse failure mode (per the user critique). Gates and risks bucket prompts are now back to their pre-PR state.

Empirical posture: 37 unit tests pass (28 prior + 9 new — has_gate_shape true/false/non-string, promotion fire/skip/dedupe/empty/no-mutation). Same-LLM same-session paperclip 3x + 5-other-plan regression sweep (v56) shows 0 promotions across all 32 plan x section cells — the v53c-style miscategorisation did not recur in this session, so the promoter had nothing to act on. The change is a deterministic backstop for a rare LLM failure mode, analogous to PR #747's calculation-output strip which also fired 0 times on its regression corpus. The 8-run regression is otherwise within typical same-LLM variance (mostly +/-1-3 items, paperclip v56c expert_criticism is a separate 0-candidate emission failure unrelated to this change — the gates LLM call is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye neoneye changed the title from "napkin-math(compress): claim downside-framed if/then sentences for gates_and_thresholds (risks-side rule)" to "napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks" May 21, 2026
@neoneye
Member Author

neoneye commented May 21, 2026

Addressed review (1755269c). The risks-side prompt rule has been reverted on this branch; the new approach is the deterministic post-processor (option 2):

  1. has_gate_shape(line) — true for If <... digit token ...> then <...> surface form. Language-neutral.
  2. promote_gate_shaped_risks(gates, risks) — scans risks for gate-shaped items whose normalised quote is not in gates, moves them to the gates candidate pool, removes them from risks. Items already in gates by quote are left in risks untouched.
  3. Wiring: defers annotate_scored_items for both buckets until both have emitted; runs promoter between LLM emission and top-N filter.

Empirical finding worth surfacing: 0 promotions fired across all 8 v56 runs (paperclip 3× + 5 regression probes, 32 plan×section cells). The v53c-style miscategorisation didn't recur in this LLM session, so the promoter had nothing to act on. The change is a deterministic backstop, same posture as PR #747's calculation-output strip (which also fired 0 times on its v48 regression corpus).

37 unit tests pass (28 prior + 9 new). PR title and body rewritten to reflect the new approach and the honest empirical posture.

…mparison shape (the actual v53c phrasing)

Review feedback on PR #750 (second round): the first iteration only caught canonical 'If <... digit ...> then <...>' form, which is NOT the phrasing the LLM used in the historical v53c failure. The v53c risk-bucket line was declarative: 'Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system.' has_gate_shape() returned False on that, so the advertised v53c backstop did not address the historical failure.

Extended the detector to also recognise the declarative form: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>. Comparison verbs in the recognised list are structural cues, not domain vocabulary: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. The verb membership is the structural cue; if the line uses a causal verb ('X risks Y', 'X causes Y', 'Failure of X leads to Y'), it stays in risks.

Numeric guard: the threshold must contain a digit token AND the separator comma/colon must not be followed by another digit (so commas inside numbers like '$75,000' do not split the match). 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.' is rejected because there is no comparison verb between subject and digit (negative regression test added).

Deterministic if/then rewrite preserves the gates_and_thresholds bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>' with case adjustments for mid-sentence flow. line_original is intentionally not rewritten — it keeps the source's native phrasing for downstream consumers.
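A minimal sketch of the rewrite step, assuming a simplified pattern with an abbreviated verb list and dict-shaped items. Case adjustment is deliberately left out here to keep the example short, so the subject keeps its original capitalisation. `promote_item` is a hypothetical helper; `gate_shape_promotion`, `line_english`, and `line_original` are names from the PR, but the bodies are guesses.

```python
import re

# Simplified stand-in pattern with an abbreviated verb list (assumption).
_DECLARATIVE = re.compile(
    r"^(?P<subject>[^,:]+?)\s+(?P<verb>exceeds|falls below|drops below|reaches)\s+"
    r"(?P<threshold>[^,:]*?\d.*?)[,:](?!\d)\s*(?P<consequence>\S.*)$",
    re.IGNORECASE,
)


def gate_shape_promotion(line):
    """Return the if/then form of a gate-shaped line, else None."""
    if not isinstance(line, str):
        return None
    stripped = line.strip()
    if re.match(r"(?is)^if\b.*?\d.*?\bthen\b", stripped):
        return stripped  # already in canonical if/then form
    m = _DECLARATIVE.match(stripped)
    if m is None:
        return None
    return f"If {m['subject']} {m['verb']} {m['threshold']}, then {m['consequence']}"


def promote_item(item):
    """Return a copy with line_english rewritten; other fields are untouched."""
    rewritten = gate_shape_promotion(item.get("line_english"))
    if rewritten is None:
        return None
    return {**item, "line_english": rewritten}
```

Because only `line_english` is replaced in the copied dict, `line_original`, `source_quote`, and any score fields carry over unchanged, which is the contract the commit message describes.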

Three new regression tests added: (1) the exact v53c phrasing is now recognised by has_gate_shape and rewritten to if/then form by gate_shape_promotion; (2) the genuine supply-chain risk shape stays rejected; (3) the promoter end-to-end correctly moves the v53c-shaped risk to gates with line_english rewritten while source_quote, scores, status, and line_original are preserved. 42 unit tests pass total.

Empirical posture: audited the extended detector across the v56 sweep (290 risks candidate lines, 8 runs). 0 of 290 match the extended pattern. v56 risks emissions are dominated by causal forms ('X risks Y', 'X causes Y') rather than declarative comparison. The detector covers the v53c shape (verified by unit test on the exact historical line) but the v53c shape did not recur in v56. The promoter remains a deterministic backstop — exercised by unit tests, dormant on the live sweep, same posture as PR #747's calculation-output strip rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye
Member Author

neoneye commented May 21, 2026

Addressed (b495818b). The detector now actually covers the v53c shape:

Extended has_gate_shape to recognise two forms:

  1. Canonical If <... digit ...> then <...> (unchanged from prior commit)
  2. Declarative <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence> — the actual v53c phrasing

Comparison verb list (structural cues, not domain vocabulary): exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. Causal verbs (X risks Y, X causes Y, Failure of X leads to Y) are NOT recognised — those stay in risks.

Numeric guard: separator comma/colon must not be followed by another digit so commas inside numbers like $75,000 don't split the match.

Deterministic if/then rewrite preserves the gates bucket's output contract: a declarative line is rewritten as If <subject> <verb> <threshold>, then <consequence> with case adjustments. line_original stays unchanged.

The exact tests you requested are in:

  • Positive: test_has_gate_shape_accepts_declarative_v53c_form uses the literal v53c line "Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system."
  • Negative: test_has_gate_shape_rejects_genuine_risk_with_colon_delay uses "Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase."
  • End-to-end: test_promote_gate_shaped_risks_rewrites_declarative_v53c_form asserts the v53c-shaped risk is promoted with line_english rewritten to if/then form, source_quote/scores/status preserved, line_original untouched, and the item removed from risks.

42 unit tests pass total.

Honest empirical posture (audited the extended detector against v56): across 290 risks candidate lines in the 8 v56 runs, 0 match the extended pattern. The actual v56 risks emissions are dominated by causal forms, not declarative comparison. The v53c shape didn't recur in v56. The promoter is still a deterministic backstop — covered by the regression test on the literal historical line, dormant on the live sweep. Same posture as PR #747's calculation-output strip rule.

PR title remains "addresses paperclip v53c miscategorisation" — the change now actually does, by regression-test construction on the literal phrasing.

…f/then rewrite

Review feedback on PR #750 (third round): the rewrite naively lowercased the first character of the subject and consequence, which damages acronyms — 'API job queue latency exceeds 100ms, ...' was being rewritten to 'If aPI job queue latency...' which is visibly broken.

Fix: only lowercase the first character when it is followed by a lowercase letter (a regular capitalised word like 'Middleware'). Acronyms like 'API' / 'OPC UA' / digit prefixes like '5G' all have a non-lowercase second character, so they are left unchanged.

Two new regression tests: (1) 'API job queue latency exceeds 100ms, ...' rewrites to 'If API job queue latency exceeds 100ms, then ...' with the acronym intact, (2) 'Middleware development bid exceeds $75,000, ...' still rewrites to 'If middleware development bid exceeds $75,000, then ...' (regular capitalisation still adjusted).

44 unit tests pass (42 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye
Member Author

neoneye commented May 21, 2026

Addressed both cleanup items (542cc196):

  1. Acronym casing preserved. _lowercase_first_preserving_acronyms only lowercases the first character when it's uppercase AND followed by a lowercase letter (a regular capitalised word). API / OPC UA / 5G all have a non-lowercase second character, so they're left unchanged. Two new regression tests: one asserts API job queue latency exceeds 100ms, ... rewrites to If API job queue latency exceeds 100ms, then ... with aPI explicitly absent; the other asserts Middleware still lowers to middleware so the normal capitalisation path still works.
  2. PR body updated to reflect 44 tests (was 37) and to describe both the canonical if/then detector AND the declarative comparison detector with the comparison-verb list, numeric guard, rewrite behaviour, and acronym preservation.

44 unit tests pass.

…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR).

PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note — first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line.

Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.

Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye neoneye merged commit 63b59ef into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/compress-gates-vs-risks-priority branch May 21, 2026 17:13
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…hip-set

Updates two docs to reflect the post-PlanExeOrg#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour — two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter; extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field; 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema); bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError; advisory audit_source_preservation.py step.

20260520_plan.md → 20260522_plan.md: bump status date; mark PR PlanExeOrg#750 merged; add PR PlanExeOrg#751/PlanExeOrg#752/PlanExeOrg#753 entries (proposal 141 implementation); update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set); add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom); reorder Next likely move now that proposal 141 has shipped: Phase 5 citation verifier promoted to item 1, Phase 8 samplers added as item 2 with v58 cases that bite now, Phase 9 composite-band cap as item 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>