napkin-math(bounds): Phase 4 runtime + schema readiness by neoneye · Pull Request #747 · PlanExeOrg/PlanExe

neoneye · 2026-05-21T14:01:40Z

Summary

Phase 4 of experiments/napkin_math/docs/20260520_plan.md covers six items spanning code and LLM prompt. This PR ships the code-side runtime and schema readiness; the prompt-side behavioural rules ship in a follow-up so the LLM-rule changes can be regression-checked in isolation.

Shipped here

strip_threshold_bounds extension (R1.1). A variable that is the declared output_name of a recommended_first_calculations or derived_questions entry is a computed quantity, not a primitive. Bounding it independently lets a single Monte Carlo trial pair sub-component p95s with a total p05 — Gemini's disconnected-aggregates failure. New strip reason "calculation-output" lands alongside the existing "suffix" and "formula-side" reasons. Variables prefixed actual_ are still never stripped.
VALID_DISCIPLINES reserves lognormal and pert (R1.4, R2.2). Schema validation accepts these so generate-bounds can begin emitting them for the megaproject CAPEX default that lands in the prompt-side follow-up. sample_one raises NotImplementedError loudly when sampling is attempted — no silent fall-back to triangular. The matching samplers ship in Phase 8.
correlations top-level key reserved (R1.3). strip_threshold_bounds preserves it unchanged so the optional cross-variable correlation block survives the pre-processor; the copula sampler that reads it lands in Phase 8.
System prompt: the discipline table now lists the two reserved disciplines and mentions the optional correlations block, each annotated "schema-reserved; do not emit yet" so the LLM does not start emitting them before the matching sampler is implemented.
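The calculation-output rule above can be sketched roughly as follows. This is an illustration only, not the repository's strip_threshold_bounds: the function name strip_calculation_outputs, the "variables" mapping inside bounds, and the shape of the strip records are assumptions. Only the section names, output_name, the actual_ exemption, the "calculation-output" reason, and the preserved correlations key come from the description.

```python
from copy import deepcopy


def strip_calculation_outputs(bounds: dict, parameters: dict) -> tuple[dict, list[dict]]:
    """Hypothetical sketch: drop bounds for variables that are calculation outputs.

    Assumes bounds looks like {"variables": {var_id: spec, ...}, ...} and that
    parameters holds the two calculation sections as lists of dicts.
    Inputs are not mutated.
    """
    # Collect every declared output_name from both calculation sections.
    output_names = set()
    for section in ("recommended_first_calculations", "derived_questions"):
        for entry in parameters.get(section, []):
            name = entry.get("output_name")
            if name:
                output_names.add(name)

    cleaned = deepcopy(bounds)  # other top-level keys (e.g. correlations) pass through
    stripped = []
    variables = cleaned.get("variables", {})
    for var_id in list(variables):
        if var_id.startswith("actual_"):
            continue  # actual_-prefixed variables are never stripped
        if var_id in output_names:
            del variables[var_id]
            stripped.append({"id": var_id, "reason": "calculation-output"})
    return cleaned, stripped
```

With a total stripped this way, the simulation recomputes it from its sampled components in each trial, so a trial cannot pair high component draws with a low draw of their sum.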

Out of scope (Phase 4 follow-up PR)

Base-anchoring conditional rule rewrite (R1.2)
Self-audit citation context-leak examples (R1.5)
plan_type-driven lognormal default for hyperscale / geopolitical_megaproject CAPEX (R2.2)
Detailed correlations-emission selection rules (R1.3)

These are LLM-rule changes that require same-LLM same-session regression checks against the napkin_math probe set. Shipping them separately keeps the deterministic code review (this PR) clean from the LLM-rule behavioural review.

Empirical posture

Unit tests: 71 pass in tests/test_run_monte_carlo.py and tests/test_strip_threshold_bounds.py. 9 new: 4 calculation-output strip cases (recommended, derived, actual-prefix override, non-output-not-stripped), 1 reserved-key preservation (correlations), 4 new-discipline tests (schema accept + sample-time loud failure for lognormal and pert).
Smoke suite: 9/9 checks pass (run_smoke.py).
Corpus inertness: the new calculation-output strip rule fires 0 times across the v48 checked-in bounds (paperclip, yellowstone). The LLM already skips calculation outputs; the new rule is a deterministic backstop, not a behavioural change.
No silent shims: trying to sample lognormal/pert raises NotImplementedError naming Phase 8 explicitly. The schema-reserved disciplines do not pretend to work.
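A minimal sketch of the "schema accepts, sampler refuses" posture described above. The names validate_discipline and sample_one_sketch, the spec keys low/base/high, and the exact error text are assumptions for illustration; the real VALID_DISCIPLINES may contain more members than the ones shown here.

```python
import random

# Assumed subset: triangular is named in the PR text, lognormal and pert are the reserved pair.
VALID_DISCIPLINES = {"triangular", "lognormal", "pert"}
RESERVED_DISCIPLINES = {"lognormal", "pert"}


def validate_discipline(name: str) -> None:
    """Schema-level check: reserved disciplines are accepted here."""
    if name not in VALID_DISCIPLINES:
        raise ValueError(f"unknown sampling_discipline: {name!r}")


def sample_one_sketch(spec: dict, rng: random.Random) -> float:
    """Draw one value; reserved disciplines fail loudly instead of falling back."""
    discipline = spec["sampling_discipline"]
    validate_discipline(discipline)
    if discipline in RESERVED_DISCIPLINES:
        raise NotImplementedError(
            f"sampling_discipline {discipline!r} is schema-reserved; "
            "the sampler is scheduled for Phase 8"
        )
    # random.Random.triangular takes (low, high, mode).
    return rng.triangular(spec["low"], spec["high"], spec["base"])
```

Raising at sample time rather than at validation time is what lets generate-bounds start emitting the reserved values before the samplers exist, while still making any attempt to run them impossible to miss.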

Test plan

pytest experiments/napkin_math/tests/test_run_monte_carlo.py experiments/napkin_math/tests/test_strip_threshold_bounds.py (71 pass)
python3 experiments/napkin_math/tests/run_smoke.py (9/9 pass)
v48 corpus regression: 0 false-positive calculation-output strips
CI green on this branch

🤖 Generated with Claude Code

…plines, correlations, and calculation-output strip

Phase 4 of the napkin_math methodology plan touches six items spanning code and LLM prompt. This PR ships the code-side runtime and schema readiness pieces; the prompt-side behavioural rules (base-anchoring tightening, self-audit citation examples, plan_type lognormal default) ship in a follow-up so the LLM-rule changes can be regression-checked in isolation.

(1) Extend strip_threshold_bounds with a calculation-output rule (Gemini R1.1). A variable that is the declared output_name of a recommended_first_calculations or derived_questions entry is a computed quantity, not a primitive. Bounding it independently lets a single Monte Carlo trial pair sub-component p95s with a total p05. Skip with reason 'calculation-output' alongside the existing 'suffix' and 'formula-side' reasons.

(2) Reserve lognormal and pert in VALID_DISCIPLINES (R1.4, R2.2). The schema accepts these values so generate-bounds can begin emitting them for the megaproject CAPEX default that lands in the prompt-side PR. sample_one raises NotImplementedError loudly when sampling is attempted, with no silent fall-back to triangular. The matching samplers land in Phase 8.

(3) Reserve the 'correlations' top-level key (R1.3). strip_threshold_bounds preserves it unchanged so the optional cross-variable correlation block survives the pre-processor; the copula sampler that reads it lands in Phase 8.

(4) System prompt updated with the two new disciplines and the correlations block, each annotated 'schema-reserved; do not emit yet' so the LLM does not start emitting them before the matching sampler is implemented.

Empirical posture: 71 unit tests pass (9 new: 4 calculation-output strip cases, 1 reserved-key preservation, 4 new-discipline schema and sample-time loud failure). 9/9 smoke checks pass. The new strip rule is inert on the v48 checked-in bounds corpus (paperclip and yellowstone): 0 calculation-output strips, confirming the LLM already skips calculation outputs and the new rule is a deterministic backstop, not a behavioural change.

Out of scope: base-anchoring conditional rewrite (R1.2), citation-context-leak self-audit examples (R1.5), plan_type-driven lognormal default (R2.2), and the detailed correlations-emission rules. These are LLM-rule changes that require same-LLM same-session regression checks against the napkin_math probe set and ship as the Phase 4 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Review feedback on PR #747: the warning emitted after strip_threshold_bounds said every stripped item was a 'threshold variable' whose value would come from parameters.json. That is false for the new calculation-output strip reason: calculation outputs are computed from calculations.py, not read as a stated parameter value.

Branch the warning text by reason. calculation-output strips now read 'stripped calculation output <id> from bounds; simulation will compute it from calculations.py instead of sampling it independently'. suffix and formula-side strips keep the original threshold-variable wording.

Adds two regression tests: one asserts the new wording on a calculation-output strip and asserts the old threshold-variable phrasing is NOT used; the other asserts the threshold wording is preserved unchanged for suffix/formula-side strips.

73 unit tests pass (71 prior + 2 new), 9/9 smoke checks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye · 2026-05-21T14:07:40Z

Addressed review feedback (f958aa9b):

The warning emitted for stripped bounds is now reason-branched:

reason: "calculation-output" → stripped calculation output '<id>' from bounds; simulation will compute it from calculations.py instead of sampling it independently
reason: "suffix" / "formula-side" → unchanged threshold-variable wording (those values really do come from parameters.json, so the original wording is correct)

Added two regression tests in TestStrippedBoundsWarnings: one asserts the new calculation-output wording fires (and the old threshold-variable phrasing does NOT); the other asserts the threshold wording is preserved unchanged for suffix strips.

73 unit tests pass (71 prior + 2 new), 9/9 smoke checks pass.
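The reason-branched warning could look something like the sketch below. The function name stripped_bound_warning is hypothetical, and the threshold-variable sentence is a placeholder: the original wording is not quoted anywhere in this PR, only described as saying the value comes from parameters.json. The calculation-output sentence is copied from the comment above.

```python
def stripped_bound_warning(var_id: str, reason: str) -> str:
    """Hypothetical sketch: pick the warning text from the strip reason."""
    if reason == "calculation-output":
        return (
            f"stripped calculation output '{var_id}' from bounds; "
            "simulation will compute it from calculations.py "
            "instead of sampling it independently"
        )
    if reason in ("suffix", "formula-side"):
        # Placeholder wording (assumed); the real text is the unchanged original.
        return (
            f"stripped threshold variable '{var_id}' from bounds; "
            "its value comes from parameters.json"
        )
    raise ValueError(f"unknown strip reason: {reason!r}")
```

Raising on an unknown reason is a design choice of this sketch: a future strip reason then has to be given its own wording rather than silently inheriting one that may be false for it.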

…746-747 docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime)

Per user direction, the plan-status update lands in this PR (not a separate one).

Adds PR #749 (Phase 4 prompt-side LLM rules: R1.2 base-anchoring source-tag asymmetry, R1.5 citation context-leak self-audit, R2.2 megaproject CAPEX lognormal default, R1.3 correlations selection rules + worked-example/source-rule cleanups from review) to the Landed-on-main section as the prompt-side completion of Phase 4.

Phase 4 status row now reads 'Code-side runtime DONE on main via PR #747; prompt-side LLM rules in PR #749 (open, CI green, awaiting merge)' with each R-tag enumerated. pert discipline noted as schema-reserved with no selection rule yet.

Next-likely-move list re-ordered now that the Phase 4 prompt-side item is in flight:
item 1 is bucket-categorisation discipline in compress (the residual paperclip v53c miss),
item 2 is proposal 141 implementation,
item 3 is Phase 5 verify-bounds-citations (new deterministic R1.5 backstop),
item 4 is different-LLM behavioural validation,
item 5 is prompt-hygiene pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…et promoter (deterministic, code-side)

Review feedback on PR #750 (first round): the risks-side prompt rule had a wrong causal model. gates_and_thresholds is emitted BEFORE risks_and_shocks in BUCKET_SPECS, so a risks-side prompt rule cannot move an item into gates; it can only suppress emission in risks. Worse, when gates already missed the item (v53c-style), the risks-side suppression rule removes the fallback visible copy, turning 'wrong bucket but visible' into 'missing from public output entirely'. The v55 3/3 paperclip result was LLM run variance in the gates bucket call, not evidence the prompt rule worked.

Replaces the prompt-side rule with a deterministic post-processor that scans the actual LLM emissions across both buckets and reroutes by structural shape:

has_gate_shape(line): true when the surface form matches 'If <something with a digit token> ... then <consequence>', the structural shape the gates bucket prompt asks the LLM to produce. Language-neutral (digits are digits in any locale); does not key on English-only keywords beyond the if/then template the bucket prompt already requires. Qualitative if-then sentences (no numeric token) are intentionally excluded: they may legitimately be gates (categorical/approval/deadline) but the promoter only fires on the unambiguous numeric pattern to avoid stealing genuine risks.

promote_gate_shaped_risks(gates_items, risks_items): scans risks for gate-shaped items whose normalised source_quote is NOT already represented in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed for an actual risk. Items already in gates by quote are left in risks untouched (within-bucket dedupe is a separate concern handled by the existing 'do not restate' prompt rule). Inputs are not mutated.

Wiring: defers annotate_scored_items (top-N filter) for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged candidate pools, then annotate fires on the augmented gates pool and the remaining risks pool. Per-bucket metadata gains a cross_bucket_promoted_count field so downstream consumers can audit.

Reverted the earlier risks-side prompt addition from this branch: it was both causally wrong AND created a worse failure mode (per the user critique). Gates and risks bucket prompts are now back to their pre-PR state.

Empirical posture: 37 unit tests pass (28 prior + 9 new: has_gate_shape true/false/non-string, promotion fire/skip/dedupe/empty/no-mutation). Same-LLM same-session paperclip 3x + 5-other-plan regression sweep (v56) shows 0 promotions across all 32 plan x section cells; the v53c-style miscategorisation did not recur in this session, so the promoter had nothing to act on. The change is a deterministic backstop for a rare LLM failure mode, analogous to PR #747's calculation-output strip which also fired 0 times on its regression corpus. The 8-run regression is otherwise within typical same-LLM variance (mostly +/-1-3 items; paperclip v56c expert_criticism is a separate 0-candidate emission failure unrelated to this change, and the gates LLM call is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
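A rough sketch of the first-round detector and promoter described in this commit message, under stated assumptions. The function names and the source_quote / line_english fields appear in the commit text; the regex, the quote normalisation, and the three-value return shape are guesses made for illustration and are not taken from the repository.

```python
import re

# Canonical shape: "If <condition containing a digit> ... then <consequence>".
_IF_THEN = re.compile(
    r"^\s*if\b(?P<condition>.*?)\bthen\b(?P<consequence>.+)$",
    re.IGNORECASE | re.DOTALL,
)


def has_gate_shape(line) -> bool:
    """True only for if/then lines whose condition contains a digit."""
    if not isinstance(line, str):
        return False
    match = _IF_THEN.match(line)
    return bool(match and re.search(r"\d", match.group("condition")))


def _normalise_quote(quote) -> str:
    # Assumed normalisation: lowercase and collapse whitespace.
    return " ".join(str(quote).lower().split())


def promote_gate_shaped_risks(gates_items: list[dict], risks_items: list[dict]):
    """Move gate-shaped risks into the gates pool; returns new lists plus a count."""
    seen = {_normalise_quote(item.get("source_quote", "")) for item in gates_items}
    new_gates = [dict(item) for item in gates_items]  # copies, so inputs stay untouched
    new_risks = []
    promoted = 0
    for item in risks_items:
        quote = _normalise_quote(item.get("source_quote", ""))
        if has_gate_shape(item.get("line_english")) and quote not in seen:
            new_gates.append(dict(item))  # moved, not copied into both pools
            seen.add(quote)
            promoted += 1
        else:
            new_risks.append(dict(item))
    return new_gates, new_risks, promoted
```

The returned count corresponds to what the commit calls cross_bucket_promoted_count; how the real code threads it into per-bucket metadata is not shown here.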

…mparison shape (the actual v53c phrasing)

Review feedback on PR #750 (second round): the first iteration only caught canonical 'If <... digit ...> then <...>' form, which is NOT the phrasing the LLM used in the historical v53c failure. The v53c risk-bucket line was declarative: 'Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system.' has_gate_shape() returned False on that, so the advertised v53c backstop did not address the historical failure.

Extended the detector to also recognise the declarative form: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>.

Comparison verbs in the recognised list are structural cues, not domain vocabulary: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. The verb membership is the structural cue; if the line uses a causal verb ('X risks Y', 'X causes Y', 'Failure of X leads to Y'), it stays in risks.

Numeric guard: the threshold must contain a digit token AND the separator comma/colon must not be followed by another digit (so commas inside numbers like '$75,000' do not split the match). 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.' is rejected because there is no comparison verb between subject and digit (negative regression test added).

Deterministic if/then rewrite preserves the gates_and_thresholds bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>' with case adjustments for mid-sentence flow. line_original is intentionally not rewritten: it keeps the source's native phrasing for downstream consumers.

Three new regression tests added: (1) the exact v53c phrasing is now recognised by has_gate_shape and rewritten to if/then form by gate_shape_promotion; (2) the genuine supply-chain risk shape stays rejected; (3) the promoter end-to-end correctly moves the v53c-shaped risk to gates with line_english rewritten while source_quote, scores, status, and line_original are preserved. 42 unit tests pass total.

Empirical posture: audited the extended detector across the v56 sweep (290 risks candidate lines, 8 runs). 0 of 290 match the extended pattern. v56 risks emissions are dominated by causal forms ('X risks Y', 'X causes Y') rather than declarative comparison. The detector covers the v53c shape (verified by unit test on the exact historical line) but the v53c shape did not recur in v56. The promoter remains a deterministic backstop: exercised by unit tests, dormant on the live sweep, same posture as PR #747's calculation-output strip rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
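The declarative form and its numeric guard can be illustrated with a single regular expression. This is a sketch under assumptions, not the detector from PR #750: the names has_declarative_gate_shape and rewrite_as_if_then are hypothetical, the regex is one possible encoding of the description, and the case adjustment (lowercasing a leading capital) is a guess at what "case adjustments for mid-sentence flow" means. The verb list and the two example sentences are taken from the commit message.

```python
import re

COMPARISON_VERBS = [
    "exceeds", "falls below", "drops below", "rises above", "breaches",
    "is above", "is below", "is greater than", "is less than", "is more than",
    "reaches", "surpasses",
]
# Longest alternatives first so multi-word verbs are tried before shorter ones.
_VERB_ALT = "|".join(re.escape(v) for v in sorted(COMPARISON_VERBS, key=len, reverse=True))

# <subject> <verb> <threshold with digit> <comma/colon not followed by digit> <consequence>
_DECLARATIVE = re.compile(
    rf"^(?P<subject>.+?)\s+(?P<verb>{_VERB_ALT})\s+"
    rf"(?P<threshold>.*?\d.*?)[,:](?!\d)\s*(?P<consequence>.+)$",
    re.IGNORECASE,
)


def has_declarative_gate_shape(line) -> bool:
    return isinstance(line, str) and _DECLARATIVE.match(line.strip()) is not None


def _lower_first(text: str) -> str:
    # Assumed case adjustment: "Middleware" -> "middleware"; acronyms are left alone.
    if len(text) >= 2 and text[0].isupper() and text[1].islower():
        return text[0].lower() + text[1:]
    return text


def rewrite_as_if_then(line: str):
    """Return 'If <subject> <verb> <threshold>, then <consequence>' or None."""
    match = _DECLARATIVE.match(line.strip())
    if match is None:
        return None
    subject = _lower_first(match.group("subject").strip())
    verb = match.group("verb").lower()
    threshold = match.group("threshold").strip()
    consequence = _lower_first(match.group("consequence").strip())
    return f"If {subject} {verb} {threshold}, then {consequence}"
```

The `(?!\d)` lookahead is what keeps the comma inside '$75,000' from being read as the separator; without it the threshold would be cut at '$75'.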

…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters: aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness: calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… R1.3)

Completes Phase 4 of the napkin_math methodology plan. PR PlanExeOrg#747 shipped the code-side runtime and schema readiness; this PR ships the four LLM-rule changes that drive new behaviour on top of that runtime.

(R1.2) ACTUAL-VS-COMMITMENT base anchoring: adds an explicit source-tag asymmetry. When base is shifted on a named plan-internal anchor (Premortem/Risk/Issue/Decision/expert-criticism passage forecasting a gap), source=data and the rationale names the artifact verbatim. When base is left at the commitment default with no named anchor, source=assumption (NOT data) and the rationale states explicitly that no plan passage anchors a shift. A downstream reader must be able to tell at a glance whether the base reflects a plan finding or a modelling default.

(R1.5) New SELF-AUDIT: CITATION CONTEXT-LEAK section, giving concrete abstract examples of the failure mode where a citation is lexically present (the plan has a Risk N) but substantively wrong (Risk N is about a different topic than the claim). Drop the citation when it fails the substantive-support check; re-evaluate whether the base/range shift is justified at all if no substantively correct anchor exists.

(R2.2) New DISTRIBUTION DEFAULT BY PLAN SCALE section: for plans at megaproject scale (multi-billion budget, multi-year horizon, multiple binding regulatory dependencies, first-of-kind execution), CAPEX and major OPEX variables default to sampling_discipline lognormal with low/high as P5/P95. Identification is by abstract criteria (budget magnitude, time horizon, regulatory weight, modelling_frame language, Premortem framing of cost overrun as structural), not by a plan_type-string enum: plan_type strings are corpus literals and the prompt must stay corpus-agnostic. The plan_type field is one signal among several.

(R1.3) New CORRELATIONS (OPTIONAL TOP-LEVEL BLOCK) section: declares the schema for the correlations key reserved in PR PlanExeOrg#747 and the selection rules for emitting it. A correlation group is declared only when the plan ITSELF names a shared driver between two or more bounded variables (Risk/Issue/Decision/Premortem/expert-criticism). Modeller priors are not valid anchors. Rho is bucketed by anchor strength: 0.6-0.8 strong, 0.3-0.5 moderate, weak couplings are not declared (independence is the default). Max 5 groups.

Discipline-table notes updated: the PR PlanExeOrg#747 'do not emit yet' guards on lognormal and on the correlations top-level key are lifted, replaced with pointers to the new selection-rule sections. pert remains 'no prompt rule directs you to emit' since this PR ships no pert selection rule. Sampler implementation status (Phase 8) is documented on both lognormal and correlations: lognormal raises NotImplementedError at sample time (loud failure), correlations is preserved but sampled independently with a warning (no silent shim).

Runner: adds a loud warning when bounds declares a correlations block but the Gaussian-copula sampler is not yet implemented. Names the block size, names Phase 8 explicitly, and states that joint-tail risk is structurally understated until the sampler ships. Two regression tests cover the warning emission and its absence.

No corpus literals: grep verifies that none of the actual v51 plan_type strings (hyperscale_infrastructure, industrial_automation_pilot, catastrophic_disaster_response, public_finance_transition, commercial_csr_reverse_logistics, commercial_digital_infrastructure_application) and no corpus-specific identifiers (OPC UA, $75k, 32.2 km, paperclip, datacenter, yellowstone, mars_gtld, crate, euro_adoption) appear in the prompt text.

Empirical posture: 75 unit tests pass, 9/9 smoke checks pass. Behavioural verification of the new prompt rules against the napkin_math probe set is a same-LLM same-session regression check, not an improvement claim. Honest verification needs a different-LLM run, which is a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
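The runner warning described in this commit message might be sketched as below. The function name warn_if_correlations_unsampled, the assumption that correlations is a list of groups, and the message wording are all illustrative; the commit only states that the warning names the block size, names Phase 8, and says joint-tail risk is understated.

```python
import warnings


def warn_if_correlations_unsampled(bounds: dict) -> bool:
    """Warn when a correlations block is declared but cannot yet be sampled.

    Returns True if a warning was emitted, False otherwise.
    """
    groups = bounds.get("correlations")
    if not groups:
        return False  # no block declared: stay silent
    warnings.warn(
        f"bounds declares a correlations block with {len(groups)} group(s), "
        "but the Gaussian-copula sampler is not implemented until Phase 8; "
        "variables are sampled independently, so joint-tail risk is "
        "structurally understated",
        UserWarning,
        stacklevel=2,
    )
    return True
```

Unlike the reserved disciplines, which raise at sample time, this path only warns: per the commit text the block is preserved and the run proceeds with independent sampling.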

neoneye and others added 2 commits May 21, 2026 16:01

neoneye merged commit 372ddc3 into main May 21, 2026
3 checks passed

neoneye deleted the napkin-math/generate-bounds-phase4-runtime branch May 21, 2026 14:09

neoneye mentioned this pull request May 21, 2026

docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime) #748

Merged

3 tasks

neoneye added a commit that referenced this pull request May 21, 2026

Merge pull request #748 from PlanExeOrg/docs/napkin-math-plan-update-…

ebbc01b

…746-747 docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime)

neoneye mentioned this pull request May 21, 2026

napkin-math(bounds): Phase 4 prompt-side LLM rules (R1.2, R1.5, R2.2, R1.3) #749

Merged

4 tasks

neoneye mentioned this pull request May 21, 2026

napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks #750

Merged

6 tasks

neoneye merged 2 commits into main from napkin-math/generate-bounds-phase4-runtime
