napkin-math(bounds): Phase 4 runtime + schema readiness #747
Merged
…plines, correlations, and calculation-output strip

Phase 4 of the napkin_math methodology plan touches six items spanning code and LLM prompt. This PR ships the code-side runtime and schema readiness pieces; the prompt-side behavioural rules (base-anchoring tightening, self-audit citation examples, plan_type lognormal default) ship in a follow-up so the LLM-rule changes can be regression-checked in isolation.

(1) Extend strip_threshold_bounds with a calculation-output rule (Gemini R1.1). A variable that is the declared output_name of a recommended_first_calculations or derived_questions entry is a computed quantity, not a primitive. Bounding it independently lets a single Monte Carlo trial pair sub-component p95s with a total p05. Skip with reason 'calculation-output' alongside the existing 'suffix' and 'formula-side' reasons.

(2) Reserve lognormal and pert in VALID_DISCIPLINES (R1.4, R2.2). The schema accepts these values so generate-bounds can begin emitting them for the megaproject CAPEX default that lands in the prompt-side PR. sample_one raises NotImplementedError loudly when sampling is attempted, with no silent fall-back to triangular. The matching samplers land in Phase 8.

(3) Reserve the 'correlations' top-level key (R1.3). strip_threshold_bounds preserves it unchanged so the optional cross-variable correlation block survives the pre-processor; the copula sampler that reads it lands in Phase 8.

(4) System prompt updated with the two new disciplines and the correlations block, each annotated 'schema-reserved; do not emit yet' so the LLM does not start emitting them before the matching sampler is implemented.

Empirical posture: 71 unit tests pass (9 new: 4 calculation-output strip cases, 1 reserved-key preservation, 4 new-discipline schema and sample-time loud failure). 9/9 smoke checks pass. The new strip rule is inert on the v48 checked-in bounds corpus (paperclip and yellowstone): 0 calculation-output strips, confirming the LLM already skips calculation outputs and the new rule is a deterministic backstop, not a behavioural change.

Out of scope: base-anchoring conditional rewrite (R1.2), citation-context-leak self-audit examples (R1.5), plan_type-driven lognormal default (R2.2), and the detailed correlations-emission rules. These are LLM-rule changes that require same-LLM same-session regression checks against the napkin_math probe set and ship as the Phase 4 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
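The calculation-output rule in item (1) can be sketched as follows. This is a minimal illustration, not the repository's `strip_threshold_bounds`: the function names and the assumed layout of the bounds document (a `variables` mapping next to an optional `correlations` key) are hypothetical, while the `output_name` field, the two calculation lists, the `actual_` exemption, and the `'calculation-output'` reason come from the description above.

```python
from typing import Any


def calculation_output_names(parameters: dict[str, Any]) -> set[str]:
    """Collect every declared output_name from both calculation lists."""
    names: set[str] = set()
    for key in ("recommended_first_calculations", "derived_questions"):
        for entry in parameters.get(key, []):
            name = entry.get("output_name")
            if name:
                names.add(name)
    return names


def strip_calculation_outputs(
    bounds: dict[str, Any], parameters: dict[str, Any]
) -> tuple[dict[str, Any], list[tuple[str, str]]]:
    """Return a copy of bounds without computed quantities, plus a strip log.

    Assumed layout (hypothetical): bounds["variables"] maps id -> spec.
    Every other top-level key, including "correlations", is passed through.
    """
    outputs = calculation_output_names(parameters)
    kept: dict[str, Any] = {}
    stripped: list[tuple[str, str]] = []
    for var_id, spec in bounds.get("variables", {}).items():
        # Variables prefixed actual_ are never stripped.
        if var_id in outputs and not var_id.startswith("actual_"):
            stripped.append((var_id, "calculation-output"))
        else:
            kept[var_id] = spec
    result = {key: value for key, value in bounds.items() if key != "variables"}
    result["variables"] = kept
    return result, stripped
```

Returning the strip log alongside the new document lets the caller decide how to report each skipped variable, and building a fresh dictionary leaves the input untouched.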
Review feedback on PR #747: the warning emitted after strip_threshold_bounds said every stripped item was a 'threshold variable' whose value would come from parameters.json. That is false for the new calculation-output strip reason: calculation outputs are computed from calculations.py, not read as a stated parameter value.

Branch the warning text by reason. calculation-output strips now read 'stripped calculation output <id> from bounds; simulation will compute it from calculations.py instead of sampling it independently'. suffix and formula-side strips keep the original threshold-variable wording.

Adds two regression tests: one asserts the new wording on a calculation-output strip and asserts the old threshold-variable phrasing is NOT used; the other asserts the threshold wording is preserved unchanged for suffix/formula-side strips.

73 unit tests pass (71 prior + 2 new), 9/9 smoke checks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
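A reason-branched warning of the kind described here might look like the sketch below. The function name `strip_warning` is hypothetical. The calculation-output wording is quoted from the commit message; the exact threshold-variable wording is not quoted anywhere in this conversation, so the second string is an assumed paraphrase.

```python
def strip_warning(var_id: str, reason: str) -> str:
    """Build the warning text for one stripped bounds entry."""
    if reason == "calculation-output":
        # Wording taken from the commit message above.
        return (
            f"stripped calculation output {var_id} from bounds; "
            "simulation will compute it from calculations.py "
            "instead of sampling it independently"
        )
    if reason in ("suffix", "formula-side"):
        # Assumed paraphrase of the original threshold-variable wording.
        return (
            f"stripped threshold variable {var_id} from bounds; "
            "its value will come from parameters.json"
        )
    # Fail loudly on an unknown reason rather than guessing a wording.
    raise ValueError(f"unknown strip reason: {reason!r}")
```

Raising on an unrecognised reason keeps a future strip rule from silently inheriting wording that does not describe it, which is the failure this review round fixed.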
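Addressed review feedback. The warning emitted for stripped bounds is now reason-branched: calculation-output strips state that the simulation will compute the value from calculations.py, while suffix and formula-side strips keep the original threshold-variable wording.
Added two regression tests. 73 unit tests pass (71 prior + 2 new), 9/9 smoke checks pass.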
neoneye added a commit that referenced this pull request on May 21, 2026
Per user direction, the plan-status update lands in this PR (not a separate one).

Adds PR #749 (Phase 4 prompt-side LLM rules: R1.2 base-anchoring source-tag asymmetry, R1.5 citation context-leak self-audit, R2.2 megaproject CAPEX lognormal default, R1.3 correlations selection rules, plus worked-example/source-rule cleanups from review) to the Landed-on-main section as the prompt-side completion of Phase 4.

Phase 4 status row now reads 'Code-side runtime DONE on main via PR #747; prompt-side LLM rules in PR #749 (open, CI green, awaiting merge)' with each R-tag enumerated. pert discipline noted as schema-reserved with no selection rule yet.

Next-likely-move list re-ordered now that the Phase 4 prompt-side item is in flight: item 1 is bucket-categorisation discipline in compress (the residual paperclip v53c miss), item 2 is proposal 141 implementation, item 3 is Phase 5 verify-bounds-citations (new deterministic R1.5 backstop), item 4 is different-LLM behavioural validation, item 5 is prompt-hygiene pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neoneye added a commit that referenced this pull request on May 21, 2026
…et promoter (deterministic, code-side)

Review feedback on PR #750 (first round): the risks-side prompt rule had a wrong causal model. gates_and_thresholds is emitted BEFORE risks_and_shocks in BUCKET_SPECS, so a risks-side prompt rule cannot move an item into gates; it can only suppress emission in risks. Worse, when gates already missed the item (v53c-style), the risks-side suppression rule removes the fallback visible copy, turning 'wrong bucket but visible' into 'missing from public output entirely'. The v55 3/3 paperclip result was LLM run variance in the gates bucket call, not evidence the prompt rule worked.

Replaces the prompt-side rule with a deterministic post-processor that scans the actual LLM emissions across both buckets and reroutes by structural shape:

has_gate_shape(line): true when the surface form matches 'If <something with a digit token> ... then <consequence>', the structural shape the gates bucket prompt asks the LLM to produce. Language-neutral (digits are digits in any locale); does not key on English-only keywords beyond the if/then template the bucket prompt already requires. Qualitative if-then sentences (no numeric token) are intentionally excluded: they may legitimately be gates (categorical/approval/deadline), but the promoter only fires on the unambiguous numeric pattern to avoid stealing genuine risks.

promote_gate_shaped_risks(gates_items, risks_items): scans risks for gate-shaped items whose normalised source_quote is NOT already represented in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed for an actual risk. Items already in gates by quote are left in risks untouched (within-bucket dedupe is a separate concern handled by the existing 'do not restate' prompt rule). Inputs are not mutated.

Wiring: defers annotate_scored_items (top-N filter) for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged candidate pools, then annotate fires on the augmented gates pool and the remaining risks pool. Per-bucket metadata gains a cross_bucket_promoted_count field so downstream consumers can audit.

Reverted the earlier risks-side prompt addition from this branch: it was both causally wrong AND created a worse failure mode (per the user critique). Gates and risks bucket prompts are now back to their pre-PR state.

Empirical posture: 37 unit tests pass (28 prior + 9 new: has_gate_shape true/false/non-string, promotion fire/skip/dedupe/empty/no-mutation). Same-LLM same-session paperclip 3x + 5-other-plan regression sweep (v56) shows 0 promotions across all 32 plan x section cells: the v53c-style miscategorisation did not recur in this session, so the promoter had nothing to act on. The change is a deterministic backstop for a rare LLM failure mode, analogous to PR #747's calculation-output strip which also fired 0 times on its regression corpus. The 8-run regression is otherwise within typical same-LLM variance (mostly +/-1-3 items; paperclip v56c expert_criticism is a separate 0-candidate emission failure unrelated to this change, since the gates LLM call is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
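The detector and promoter described in this commit can be sketched roughly as below. The regular expression, the `line_english` and `source_quote` dictionary keys as used here, the `_normalise` helper, and the three-value return are assumptions made for illustration; only the function names and the stated behaviour (numeric if/then shape, move rather than copy, skip when the quote is already in gates, no input mutation) come from the text above.

```python
import re

# 'If <condition> then <consequence>'; the condition must contain a digit.
_IF_THEN = re.compile(
    r"^\s*if\b(?P<condition>.*?)\bthen\b(?P<consequence>.+)$",
    re.IGNORECASE | re.DOTALL,
)


def has_gate_shape(line) -> bool:
    """True for numeric if/then lines; False for anything else."""
    if not isinstance(line, str):
        return False
    match = _IF_THEN.match(line)
    return bool(match and re.search(r"\d", match.group("condition")))


def _normalise(quote) -> str:
    """Hypothetical normaliser: lower-case and collapse whitespace."""
    return " ".join(str(quote).lower().split())


def promote_gate_shaped_risks(gates_items, risks_items):
    """Move gate-shaped risks into the gates pool; inputs stay unmodified."""
    seen = {_normalise(item.get("source_quote", "")) for item in gates_items}
    new_gates = list(gates_items)
    new_risks = []
    promoted = 0
    for item in risks_items:
        quote = _normalise(item.get("source_quote", ""))
        if has_gate_shape(item.get("line_english")) and quote not in seen:
            new_gates.append(dict(item))  # moved, so it is not kept in risks
            seen.add(quote)
            promoted += 1
        else:
            new_risks.append(item)
    return new_gates, new_risks, promoted
```

The returned count plays the role of the `cross_bucket_promoted_count` metadata field, so a caller can record how often the backstop fired.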
neoneye added a commit that referenced this pull request on May 21, 2026
…mparison shape (the actual v53c phrasing)

Review feedback on PR #750 (second round): the first iteration only caught canonical 'If <... digit ...> then <...>' form, which is NOT the phrasing the LLM used in the historical v53c failure. The v53c risk-bucket line was declarative: 'Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system.' has_gate_shape() returned False on that, so the advertised v53c backstop did not address the historical failure.

Extended the detector to also recognise the declarative form: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>. Comparison verbs in the recognised list are structural cues, not domain vocabulary: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. The verb membership is the structural cue; if the line uses a causal verb ('X risks Y', 'X causes Y', 'Failure of X leads to Y'), it stays in risks.

Numeric guard: the threshold must contain a digit token AND the separator comma/colon must not be followed by another digit (so commas inside numbers like '$75,000' do not split the match). 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.' is rejected because there is no comparison verb between subject and digit (negative regression test added).

Deterministic if/then rewrite preserves the gates_and_thresholds bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>' with case adjustments for mid-sentence flow. line_original is intentionally not rewritten; it keeps the source's native phrasing for downstream consumers.

Three new regression tests added: (1) the exact v53c phrasing is now recognised by has_gate_shape and rewritten to if/then form by gate_shape_promotion; (2) the genuine supply-chain risk shape stays rejected; (3) the promoter end-to-end correctly moves the v53c-shaped risk to gates with line_english rewritten while source_quote, scores, status, and line_original are preserved. 42 unit tests pass total.

Empirical posture: audited the extended detector across the v56 sweep (290 risks candidate lines, 8 runs). 0 of 290 match the extended pattern. v56 risks emissions are dominated by causal forms ('X risks Y', 'X causes Y') rather than declarative comparison. The detector covers the v53c shape (verified by unit test on the exact historical line) but the v53c shape did not recur in v56. The promoter remains a deterministic backstop: exercised by unit tests, dormant on the live sweep, same posture as PR #747's calculation-output strip rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
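The declarative detector and its if/then rewrite can be illustrated with the sketch below. The regular expression, the helper names, and the lower-casing heuristic are assumptions, not the code in the PR; the verb list, the digit guard, the separator rule, and the two example sentences are taken from the commit message.

```python
import re

_VERBS = (
    r"(?:exceeds|falls below|drops below|rises above|breaches|"
    r"is (?:above|below|greater than|less than|more than)|reaches|surpasses)"
)
# Threshold characters may include a comma/colon only when a digit follows,
# so '$75,000' stays whole; the first other comma/colon ends the threshold.
_DECLARATIVE = re.compile(
    rf"^\s*(?P<subject>.+?)\s+(?P<verb>{_VERBS})\s+"
    r"(?P<threshold>(?:[^,:]|[,:](?=\d))+?)[,:](?!\d)\s*(?P<consequence>\S.*)$",
    re.IGNORECASE,
)


def _match(line):
    """Return the regex match when the line has the declarative gate shape."""
    if not isinstance(line, str):
        return None
    match = _DECLARATIVE.match(line)
    if match is None or not re.search(r"\d", match.group("threshold")):
        return None
    return match


def has_declarative_gate_shape(line) -> bool:
    return _match(line) is not None


def _lower_first(text: str) -> str:
    """Assumed case adjustment: lower a leading capital unless it looks like an acronym."""
    if len(text) > 1 and text[0].isupper() and text[1].islower():
        return text[0].lower() + text[1:]
    return text


def rewrite_as_if_then(line):
    """Rewrite '<subject> <verb> <threshold>, <consequence>' as an if/then line."""
    match = _match(line)
    if match is None:
        return None
    subject = _lower_first(match.group("subject").strip())
    threshold = match.group("threshold").strip()
    return f"If {subject} {match.group('verb')} {threshold}, then {match.group('consequence')}"
```

Because the rewrite returns a new string, a caller can store it in `line_english` and leave `line_original` untouched, matching the contract described above.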
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request on May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters: aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness: calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request on May 22, 2026
… R1.3)

Completes Phase 4 of the napkin_math methodology plan. PR PlanExeOrg#747 shipped the code-side runtime and schema readiness; this PR ships the four LLM-rule changes that drive new behaviour on top of that runtime.

(R1.2) ACTUAL-VS-COMMITMENT base anchoring: adds an explicit source-tag asymmetry. When base is shifted on a named plan-internal anchor (Premortem/Risk/Issue/Decision/expert-criticism passage forecasting a gap), source=data and the rationale names the artifact verbatim. When base is left at the commitment default with no named anchor, source=assumption (NOT data) and the rationale states explicitly that no plan passage anchors a shift. A downstream reader must be able to tell at a glance whether the base reflects a plan finding or a modelling default.

(R1.5) New SELF-AUDIT: CITATION CONTEXT-LEAK section: concrete abstract examples of the failure mode where a citation is lexically present (the plan has a Risk N) but substantively wrong (Risk N is about a different topic than the claim). Drop the citation when it fails the substantive-support check; re-evaluate whether the base/range shift is justified at all if no substantively correct anchor exists.

(R2.2) New DISTRIBUTION DEFAULT BY PLAN SCALE section: for plans at megaproject scale (multi-billion budget, multi-year horizon, multiple binding regulatory dependencies, first-of-kind execution), CAPEX and major OPEX variables default to sampling_discipline lognormal with low/high as P5/P95. Identification is by abstract criteria (budget magnitude, time horizon, regulatory weight, modelling_frame language, Premortem framing of cost overrun as structural), not by a plan_type-string enum, because plan_type strings are corpus literals and the prompt must stay corpus-agnostic. The plan_type field is one signal among several.

(R1.3) New CORRELATIONS (OPTIONAL TOP-LEVEL BLOCK) section: declares the schema for the correlations key reserved in PR PlanExeOrg#747 and the selection rules for emitting it. A correlation group is declared only when the plan ITSELF names a shared driver between two or more bounded variables (Risk/Issue/Decision/Premortem/expert-criticism). Modeller priors are not valid anchors. Rho is bucketed by anchor strength: 0.6-0.8 strong, 0.3-0.5 moderate, weak couplings are not declared (independence is the default). Max 5 groups.

Discipline-table notes updated: the PR PlanExeOrg#747 'do not emit yet' guards on lognormal and on the correlations top-level key are lifted, replaced with pointers to the new selection-rule sections. pert remains 'no prompt rule directs you to emit' since this PR ships no pert selection rule. Sampler implementation status (Phase 8) is documented on both lognormal and correlations: lognormal raises NotImplementedError at sample time (loud failure), correlations is preserved but sampled independently with a warning (no silent shim).

Runner: adds a loud warning when bounds declares a correlations block but the Gaussian-copula sampler is not yet implemented. Names the block size, names Phase 8 explicitly, and states that joint-tail risk is structurally understated until the sampler ships. Two regression tests cover the warning emission and its absence.

No corpus literals: grep verifies that none of the actual v51 plan_type strings (hyperscale_infrastructure, industrial_automation_pilot, catastrophic_disaster_response, public_finance_transition, commercial_csr_reverse_logistics, commercial_digital_infrastructure_application) and no corpus-specific identifiers (OPC UA, $75k, 32.2 km, paperclip, datacenter, yellowstone, mars_gtld, crate, euro_adoption) appear in the prompt text.

Empirical posture: 75 unit tests pass, 9/9 smoke checks pass. Behavioural verification of the new prompt rules against the napkin_math probe set is a same-LLM same-session regression check, not an improvement claim. Honest verification needs a different-LLM run, which is a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
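The runner-side warning described in this commit could be sketched along these lines. The function name, the message wording, and the assumption that `correlations` is a list of groups are hypothetical; the three things the message must state (block size, Phase 8, understated joint-tail risk) come from the text above.

```python
import warnings


def warn_if_correlations_unsampled(bounds: dict):
    """Warn loudly when a correlations block is declared but cannot be honoured.

    Returns the message that was emitted, or None when no block is present.
    """
    groups = bounds.get("correlations")
    if not groups:
        return None
    message = (
        f"bounds declares a correlations block with {len(groups)} group(s), "
        "but the Gaussian-copula sampler is not implemented until Phase 8; "
        "variables are sampled independently, so joint-tail risk is "
        "structurally understated"
    )
    warnings.warn(message, UserWarning, stacklevel=2)
    return message
```

Emitting through the standard `warnings` module, rather than printing, lets a regression test capture both the emission and its absence with `warnings.catch_warnings(record=True)`.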
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Phase 4 of `experiments/napkin_math/docs/20260520_plan.md` covers six items spanning code and LLM prompt. This PR ships the code-side runtime and schema readiness; the prompt-side behavioural rules ship in a follow-up so the LLM-rule changes can be regression-checked in isolation.

Shipped here

- `strip_threshold_bounds` extension (R1.1). A variable that is the declared `output_name` of a `recommended_first_calculations` or `derived_questions` entry is a computed quantity, not a primitive. Bounding it independently lets a single Monte Carlo trial pair sub-component p95s with a total p05: Gemini's disconnected-aggregates failure. New strip reason `"calculation-output"` lands alongside the existing `"suffix"` and `"formula-side"` reasons. Variables prefixed `actual_` are still never stripped.
- `VALID_DISCIPLINES` reserves `lognormal` and `pert` (R1.4, R2.2). Schema validation accepts these so generate-bounds can begin emitting them for the megaproject CAPEX default that lands in the prompt-side follow-up. `sample_one` raises `NotImplementedError` loudly when sampling is attempted, with no silent fall-back to triangular. The matching samplers ship in Phase 8.
- `correlations` top-level key reserved (R1.3). `strip_threshold_bounds` preserves it unchanged so the optional cross-variable correlation block survives the pre-processor; the copula sampler that reads it lands in Phase 8.
- System prompt updated with the two new disciplines and the `correlations` block, each annotated `schema-reserved; do not emit yet` so the LLM does not start emitting them before the matching sampler is implemented.

Out of scope (Phase 4 follow-up PR)

- plan_type-driven lognormal default for hyperscale / geopolitical_megaproject CAPEX (R2.2)

These are LLM-rule changes that require same-LLM same-session regression checks against the napkin_math probe set. Shipping them separately keeps the deterministic code review (this PR) clean from the LLM-rule behavioural review.

Empirical posture

- 71 unit tests pass across `tests/test_run_monte_carlo.py` and `tests/test_strip_threshold_bounds.py`. 9 new: 4 calculation-output strip cases (recommended, derived, actual-prefix override, non-output-not-stripped), 1 reserved-key preservation (correlations), 4 new-discipline tests (schema accept + sample-time loud failure for lognormal and pert).
- 9/9 smoke checks pass (`run_smoke.py`).

Test plan

- `pytest experiments/napkin_math/tests/test_run_monte_carlo.py experiments/napkin_math/tests/test_strip_threshold_bounds.py` (71 pass)
- `python3 experiments/napkin_math/tests/run_smoke.py` (9/9 pass)

🤖 Generated with Claude Code
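The "schema accepts, sampler refuses" behaviour for reserved disciplines summarised above can be sketched as follows. The set names other than `VALID_DISCIPLINES`, the spec field layout, and the error messages are assumptions; `sample_one`, the `lognormal` and `pert` names, the triangular discipline, and the loud `NotImplementedError` come from the description.

```python
import random

# Assumption: triangular is shown as the only implemented discipline here.
IMPLEMENTED_DISCIPLINES = {"triangular"}
RESERVED_DISCIPLINES = {"lognormal", "pert"}  # accepted by schema, no sampler yet
VALID_DISCIPLINES = IMPLEMENTED_DISCIPLINES | RESERVED_DISCIPLINES


def validate_discipline(name: str) -> None:
    """Schema-level check: reserved names pass, unknown names fail."""
    if name not in VALID_DISCIPLINES:
        raise ValueError(f"unknown sampling_discipline: {name!r}")


def sample_one(spec: dict, rng: random.Random) -> float:
    """Draw one value, failing loudly for reserved disciplines."""
    discipline = spec["sampling_discipline"]
    validate_discipline(discipline)
    if discipline in RESERVED_DISCIPLINES:
        # No silent fall-back to triangular.
        raise NotImplementedError(
            f"sampling_discipline {discipline!r} is schema-reserved; "
            "its sampler is not implemented yet"
        )
    # random.Random.triangular takes (low, high, mode).
    return rng.triangular(spec["low"], spec["high"], spec["base"])
```

Keeping validation and sampling as separate steps is what allows a bounds file to carry `lognormal` today while guaranteeing that any attempt to draw from it stops the run instead of producing triangular samples under a lognormal label.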