napkin-math(bounds): Phase 4 prompt-side LLM rules (R1.2, R1.5, R2.2, R1.3) #749
Merged
… R1.3)

Completes Phase 4 of the napkin_math methodology plan. PR #747 shipped the code-side runtime and schema readiness; this PR ships the four LLM-rule changes that drive new behaviour on top of that runtime.

(R1.2) ACTUAL-VS-COMMITMENT base anchoring: adds an explicit source-tag asymmetry. When base is shifted on a named plan-internal anchor (Premortem/Risk/Issue/Decision/expert-criticism passage forecasting a gap), source=data and the rationale names the artifact verbatim. When base is left at the commitment default with no named anchor, source=assumption (NOT data) and the rationale states explicitly that no plan passage anchors a shift. A downstream reader must be able to tell at a glance whether the base reflects a plan finding or a modelling default.

(R1.5) New SELF-AUDIT: CITATION CONTEXT-LEAK section: concrete, abstract examples of the failure mode where a citation is lexically present (the plan has a Risk N) but substantively wrong (Risk N is about a different topic than the claim). Drop the citation when it fails the substantive-support check; re-evaluate whether the base/range shift is justified at all if no substantively correct anchor exists.

(R2.2) New DISTRIBUTION DEFAULT BY PLAN SCALE section: for plans at megaproject scale (multi-billion budget, multi-year horizon, multiple binding regulatory dependencies, first-of-kind execution), CAPEX and major OPEX variables default to sampling_discipline lognormal with low/high as P5/P95. Identification is by abstract criteria (budget magnitude, time horizon, regulatory weight, modelling_frame language, Premortem framing of cost overrun as structural), not by a plan_type-string enum: plan_type strings are corpus literals and the prompt must stay corpus-agnostic. The plan_type field is one signal among several.

(R1.3) New CORRELATIONS (OPTIONAL TOP-LEVEL BLOCK) section: declares the schema for the correlations key reserved in PR #747 and the selection rules for emitting it. A correlation group is declared only when the plan ITSELF names a shared driver between two or more bounded variables (Risk/Issue/Decision/Premortem/expert-criticism). Modeller priors are not valid anchors. Rho is bucketed by anchor strength: 0.6-0.8 strong, 0.3-0.5 moderate; weak couplings are not declared (independence is the default). Max 5 groups.

Discipline-table notes updated: the PR #747 'do not emit yet' guards on lognormal and on the correlations top-level key are lifted, replaced with pointers to the new selection-rule sections. pert remains 'no prompt rule directs you to emit' since this PR ships no pert selection rule. Sampler implementation status (Phase 8) is documented on both lognormal and correlations: lognormal raises NotImplementedError at sample time (loud failure), correlations is preserved but sampled independently with a warning (no silent shim).

Runner: adds a loud warning when bounds declares a correlations block but the Gaussian-copula sampler is not yet implemented. Names the block size, names Phase 8 explicitly, and states that joint-tail risk is structurally understated until the sampler ships. Two regression tests cover the warning emission and its absence.

No corpus literals: grep verifies that none of the actual v51 plan_type strings (hyperscale_infrastructure, industrial_automation_pilot, catastrophic_disaster_response, public_finance_transition, commercial_csr_reverse_logistics, commercial_digital_infrastructure_application) and no corpus-specific identifiers (OPC UA, $75k, 32.2 km, paperclip, datacenter, yellowstone, mars_gtld, crate, euro_adoption) appear in the prompt text.

Empirical posture: 75 unit tests pass, 9/9 smoke checks pass. Behavioural verification of the new prompt rules against the napkin_math probe set is a same-LLM same-session regression check, not an improvement claim. Honest verification needs a different-LLM run, which is a separate follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
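The R2.2 rule treats low/high as the P5/P95 of a lognormal. As a rough illustration of what that implies for the distribution parameters, the sketch below derives (mu, sigma) of ln(X) from the two percentiles. It is not the repository's sampler (which, per the commit message, still raises NotImplementedError for lognormal); the function names and the example budget figures are hypothetical.

```python
import math
from statistics import NormalDist

# z-score of the 95th percentile of a standard normal (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)


def lognormal_params_from_p5_p95(low: float, high: float) -> tuple[float, float]:
    """Return (mu, sigma) of ln(X) such that P5(X) = low and P95(X) = high.

    Hypothetical helper: the name and signature are assumptions for
    illustration, not part of the napkin_math code base.
    """
    if not (0 < low < high):
        raise ValueError("need 0 < low < high")
    ln_low, ln_high = math.log(low), math.log(high)
    # P5 and P95 sit symmetrically around mu in log space.
    mu = (ln_low + ln_high) / 2.0
    sigma = (ln_high - ln_low) / (2.0 * Z95)
    return mu, sigma


def lognormal_quantile(mu: float, sigma: float, p: float) -> float:
    """Quantile of a lognormal with log-space parameters (mu, sigma)."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


# Illustrative numbers only: a CAPEX variable with P5 = 2e9 and P95 = 5e9.
mu, sigma = lognormal_params_from_p5_p95(2.0e9, 5.0e9)
median = math.exp(mu)                    # geometric mean of low and high
mean = math.exp(mu + sigma ** 2 / 2.0)   # pulled above the median by the right tail
```

The mean exceeding the median is the right-skew that motivates a lognormal over a symmetric or triangular shape for cost variables at this scale.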
…h R1.2 actual_X exception

Review feedback on PR #749: the new R1.2 actual_X source-tag rule contradicted three existing prompt sites that this PR introduced. Each contradiction directly undermined R1.2 because the example or generic rule is what the LLM mirrors when emitting bounds.

(1) Worked example: actual_outreach_contact_rate and actual_cooling_center_utilization both have base at the commitment default (no plan-internal anchor for a base shift) but were tagged source: data. Flipped both to source: assumption. Updated rationales to match the R1.2-required phrasing ('base centered on plan commitment; no plan-internal passage anchors a shift'). For actual_cooling_center_utilization the rationale also explains why source remains 'assumption' despite the Risk-4 spread anchor: the source tag tracks BASE anchoring, not spread anchoring.

(2) Output rules: removed the stale 'which is not yet emitted' phrase on the correlations exception. This PR explicitly directs the LLM to emit correlations per the CORRELATIONS section's selection rules, so the 'not yet emitted' guard is wrong.

(3) Generic source rule: added an explicit deferral pointer in HOW TO CHOOSE THE RANGE ('Exception: actual_X variables follow a different source-tag rule — see the SOURCE TAG FOR actual_X VARIABLES paragraph'). Also split the bullet in 'Rules for the output' into two: one bullet for non-actual_X (range-anchoring controls source), one for actual_X (base-anchoring controls source).

75 unit tests pass, 9/9 smoke checks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
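The actual_X source-tag rule can be read as a small consistency check: the tag follows base anchoring, never spread anchoring. The sketch below is one way to express it, assuming a simplified entry shape; `BoundEntry`, `base_anchor`, `spread_anchor`, and `check_source_tag` are hypothetical names, and the real bounds schema is not shown in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoundEntry:
    """Simplified, hypothetical view of one bounds entry."""
    name: str
    source: str                          # "data" or "assumption"
    rationale: str
    base_anchor: Optional[str] = None    # plan artifact that shifts the base, e.g. "Risk 4"
    spread_anchor: Optional[str] = None  # plan artifact that widens the range only


def check_source_tag(entry: BoundEntry) -> List[str]:
    """Return R1.2-style problems for an actual_X entry (empty list if none)."""
    if not entry.name.startswith("actual_"):
        return []  # non-actual_X variables follow the range-anchoring rule instead
    problems = []
    # The source tag tracks BASE anchoring; spread_anchor is deliberately ignored.
    expected = "data" if entry.base_anchor else "assumption"
    if entry.source != expected:
        problems.append(f"{entry.name}: source should be {expected!r}, got {entry.source!r}")
    if entry.base_anchor and entry.base_anchor not in entry.rationale:
        problems.append(f"{entry.name}: rationale must name {entry.base_anchor!r} verbatim")
    return problems


# Mirrors the corrected worked example: a spread anchor exists, but no base anchor.
fixed = BoundEntry(
    name="actual_cooling_center_utilization",
    source="assumption",
    rationale="base centered on plan commitment; no plan-internal passage anchors a shift",
    spread_anchor="Risk 4",
)
stale = BoundEntry(
    name="actual_cooling_center_utilization",
    source="data",
    rationale="Risk 4 widens the range",
    spread_anchor="Risk 4",
)
fixed_problems = check_source_tag(fixed)
stale_problems = check_source_tag(stale)
```

Here `fixed_problems` is empty while `stale_problems` reports the mismatched tag, which is the contradiction the review feedback asked to remove from the worked example.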
Addressed review feedback (
75 unit tests pass, 9/9 smoke checks pass.
Per user direction, the plan-status update lands in this PR (not a separate one).

Adds PR #749 (Phase 4 prompt-side LLM rules: R1.2 base-anchoring source-tag asymmetry, R1.5 citation context-leak self-audit, R2.2 megaproject CAPEX lognormal default, R1.3 correlations selection rules, plus worked-example/source-rule cleanups from review) to the Landed-on-main section as the prompt-side completion of Phase 4. The Phase 4 status row now reads 'Code-side runtime DONE on main via PR #747; prompt-side LLM rules in PR #749 (open, CI green, awaiting merge)' with each R-tag enumerated. The pert discipline is noted as schema-reserved with no selection rule yet.

Next-likely-move list re-ordered now that the Phase 4 prompt-side item is in flight: item 1 is bucket-categorisation discipline in compress (the residual paperclip v53c miss), item 2 is proposal 141 implementation, item 3 is Phase 5 verify-bounds-citations (new deterministic R1.5 backstop), item 4 is different-LLM behavioural validation, item 5 is prompt-hygiene pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neoneye added a commit that referenced this pull request May 21, 2026
…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR). PR #749 marked merged (was previously open).

PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note: the first attempt was a risks-side prompt rule with the wrong causal model, the second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, and the third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests, including the literal v53c regression on the historical line.

Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.

Next-likely-move list re-ordered: bucket-categorisation is no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Completes Phase 4 of `experiments/napkin_math/docs/20260520_plan.md`. PR #747 shipped the code-side runtime + schema readiness; this PR ships the four LLM-rule changes that drive new behaviour on top of that runtime.

Prompt edits (`generate-bounds/system-prompt.txt`)

- R1.2, ACTUAL-VS-COMMITMENT base anchoring: adds an explicit source-tag asymmetry. When `base` is shifted on a named plan-internal anchor (Premortem entry, Risk N, Issue N, Decision N, expert-criticism passage forecasting a gap), `source: "data"` and the rationale names the artifact verbatim. When `base` is left at the commitment default with no named anchor, `source: "assumption"` (NOT `"data"`) and the rationale states explicitly that no plan passage anchors a shift. The asymmetry must be visible in the rationale so a downstream reader can tell at a glance whether the base reflects a plan finding or a modelling default.
- R1.5, new SELF-AUDIT: CITATION CONTEXT-LEAK section: covers the failure mode where a citation is lexically present (the plan has a Risk N) but substantively wrong (Risk N is about a different topic than the claim). The citation is dropped when it fails the substantive-support check.
- R2.2, new DISTRIBUTION DEFAULT BY PLAN SCALE section: for plans at megaproject scale, CAPEX and major OPEX variables default to `sampling_discipline: "lognormal"` with `low`/`high` as P5/P95 (Flyvbjerg's iron law). Identification is by abstract criteria (multi-billion budget, multi-year horizon, multiple binding regulatory dependencies, first-of-kind execution, structural cost-overrun framing in Premortem), NOT by a `plan_type`-string enum. `plan_type` is one signal among several; treating it as the sole criterion would encode corpus literals into the prompt.
- R1.3, new CORRELATIONS (OPTIONAL TOP-LEVEL BLOCK) section: declares the schema and selection rules for the `correlations` key reserved in PR #747 ("napkin-math(bounds): Phase 4 runtime + schema readiness"). A group is declared only when the plan ITSELF names a shared driver between two or more bounded variables. Modeller priors are not valid anchors. Rho is bucketed by anchor strength (0.6-0.8 strong, 0.3-0.5 moderate, weak couplings stay independent). Max 5 groups.

Discipline-table updates
PR #747's "do not emit yet" guards on `lognormal` and the `correlations` top-level key are lifted: both now point to the new selection-rule sections. `pert` remains "no prompt rule currently directs you to emit" since this PR ships no pert selection rule (it stays schema-reserved).

Phase-8 implementation status is documented on both:

- `lognormal`: `NotImplementedError` at sample time (loud failure; no silent fall-back to triangular)
- `correlations`: preserved through stripper + warning emitted at run time (no silent shim; joint-tail risk understatement is surfaced to the user)

Runner edit (`run_monte_carlo.py`)

Adds a loud warning when bounds declares a `correlations` block but the Gaussian-copula sampler is not yet implemented. Names the block size, names Phase 8 explicitly, and states that joint-tail risk is structurally understated until the sampler ships.

Anti-overfitting verification
`grep` verifies that none of the actual v51 `plan_type` strings (`hyperscale_infrastructure`, `industrial_automation_pilot`, `catastrophic_disaster_response`, `public_finance_transition`, `commercial_csr_reverse_logistics`, `commercial_digital_infrastructure_application`) appear in the prompt text. No corpus-specific identifiers (`OPC UA`, `$75k`, `32.2 km`, plan slugs) appear either. The rules read identically whether the input is a renewable-energy plan, a public-benefit policy, a renovation project, or a plan written in a language no one on the team reads.

Empirical posture
- 75 unit tests pass across `tests/test_run_monte_carlo.py` and `tests/test_strip_threshold_bounds.py`. 2 new tests cover the correlations-block warning emission and absence.

Test plan

- `pytest experiments/napkin_math/tests/test_run_monte_carlo.py experiments/napkin_math/tests/test_strip_threshold_bounds.py` (75 pass)
- `python3 experiments/napkin_math/tests/run_smoke.py` (9/9 pass)

🤖 Generated with Claude Code
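For readers who want a feel for the runner edit and its two regression tests, here is a minimal sketch of a warning of that kind, together with a helper that captures it the way a test might. The function names, the flag, and the assumed shape of the `correlations` block (a list of group dicts with `variables` and `rho`) are all hypothetical; the actual code in `run_monte_carlo.py` may differ.

```python
import warnings
from typing import Any, Dict, List

# Assumption for illustration: a flag the runner would flip once Phase 8 ships.
GAUSSIAN_COPULA_IMPLEMENTED = False


def warn_if_correlations_unsampled(bounds: Dict[str, Any]) -> None:
    """Warn loudly when a correlations block is declared but cannot be honoured."""
    groups = bounds.get("correlations") or []
    if groups and not GAUSSIAN_COPULA_IMPLEMENTED:
        warnings.warn(
            f"bounds declares a correlations block with {len(groups)} group(s), "
            "but the Gaussian-copula sampler is not implemented yet (Phase 8). "
            "Variables are sampled independently, so joint-tail risk is "
            "structurally understated.",
            UserWarning,
            stacklevel=2,
        )


def captured_warnings(bounds: Dict[str, Any]) -> List[str]:
    """Run the check and return the warning messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are not suppressed
        warn_if_correlations_unsampled(bounds)
    return [str(w.message) for w in caught]


# Hypothetical bounds documents: one with a correlations block, one without.
with_block = {"correlations": [{"variables": ["capex_total", "schedule_months"], "rho": 0.7}]}
without_block = {"variables": {}}

emitted = captured_warnings(with_block)     # one message naming the block size and Phase 8
silent = captured_warnings(without_block)   # no messages
```

Checking both the emission and the absence keeps the warning from drifting into noise on plans that declare no correlations.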