napkin-math(validate): add aggregate_not_bounded and requirement_has_margin (Phase 3) #746
…margin checks (Phase 3)

Phase 3 of the napkin_math methodology plan adds two new structural checks to validate_parameters.py, targeting Gemini's R1.1 (disconnected aggregates) and R2.5 (threshold trivialization) findings.

aggregate_not_bounded: when an entry's formula_hint is a pure sum of named identifiers (total = A + B + C) and the output_name also appears in missing_values_to_estimate, the aggregate is sampled independently of its constituents, so a single Monte Carlo trial can pair sub-component p95s with a total p05. Detection is deterministic and syntactic: the RHS contains '+', no other operators, and every operand is a snake_case identifier. Mixed-operator expressions (a + b - c, a * b + c) are intentionally out of scope to avoid false positives on multiplicative or netted decompositions.

requirement_has_margin: when a key_value's id ends in _required, at least one calculation in derived_questions or recommended_first_calculations must reference it (formula RHS or depends_on). Otherwise the realised-vs-required margin variable is missing and the gate defaults to >= 0 against an absolute quantity, passing for any non-negative realisation regardless of whether it meets the requirement.

Adds 16 focused unit tests in tests/test_validate_parameters.py (sum-formula classifier, both rules' fire and silent paths, checks_performed enumeration). All 6 existing v51 parameters.json files (paperclip, euro_adoption, yellowstone, crate, mars_gtld, datacenter_france) continue to validate clean (0 false positives). Smoke-test count updated from 16 to 18.

Out of scope: the Phase 3 sampling_discipline enum expansion (lognormal/pert). VALID_DISCIPLINES lives in run_monte_carlo.py, and expanding it without the matching sampler implementation (Phase 8) would create a silent shim where validation passes but sampling uses the wrong distribution. That bullet belongs alongside Phase 4 (bounds prompt emits the new disciplines) or Phase 8 (samplers implement them).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
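The syntactic detection described in this commit message can be illustrated with a minimal sketch. This is not the code from validate_parameters.py: the function names, the dict-shaped entries, and the lowercase-only reading of "snake_case" are assumptions made for illustration.

```python
import re

# Assumption: "snake_case identifier" means lowercase letters, digits, underscores.
_SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")


def is_pure_sum(formula_hint: str) -> bool:
    """Return True for formulas shaped like 'total = a + b + c'."""
    if formula_hint.count("=") != 1:
        return False
    lhs, rhs = formula_hint.split("=")
    if not _SNAKE_CASE.fullmatch(lhs.strip()):
        return False
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator (-, *, /, parentheses) leaves a chunk that is not a
    # bare identifier, so mixed-operator expressions are rejected here.
    return len(operands) >= 2 and all(_SNAKE_CASE.fullmatch(op) for op in operands)


def aggregate_not_bounded(entries, missing_values_to_estimate):
    """Return output names that are pure sums yet also listed for estimation.

    `entries` is assumed to be a list of dicts with 'output_name' and
    'formula_hint' keys; `missing_values_to_estimate` a list of names.
    """
    estimated = set(missing_values_to_estimate)
    return [
        entry["output_name"]
        for entry in entries
        if is_pure_sum(entry.get("formula_hint", ""))
        and entry["output_name"] in estimated
    ]
```

Splitting on '+' and requiring every chunk to be a bare identifier keeps the check conservative: a netted or multiplicative decomposition never matches, which mirrors the stated goal of avoiding false positives.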
… margin shape

Review feedback on PR #746: the original rule only checked that a _required key_value was referenced by some calculation, which mostly duplicated no_dead_end_variables with a more specific message. False pass: 'combined_area = actual_area + buildable_area_required' validated clean, even though it adds the requirement into an aggregate rather than testing the realised-vs-required margin.

The tightened rule requires three properties on at least one calculation in derived_questions or recommended_first_calculations: (1) the requirement id appears on the formula RHS, (2) the formula contains a subtraction or ratio operator, (3) the output_name carries a positive-pass margin suffix (_margin/_surplus/_buffer/_coverage). All three are needed; any combination of two leaves a hole.

Adds three regression tests: the false-pass sum case, subtraction-without-margin-suffix, and the legitimate coverage-ratio (actual / required → *_coverage) silent case. 19 unit tests total pass (16 prior + 3 new). All 6 v51 parameters.json files still validate clean; none currently emit _required key_values, so the tightened rule is correctly dormant on the existing corpus.

Also fixes the smoke-test docstring that still said '16 checks' after PR #746 updated the assertion to 18.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
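A minimal sketch of the three-property test follows, under stated assumptions: calculations are dicts with 'output_name' and 'formula' keys, and the helper names are hypothetical. The real rule in validate_parameters.py may parse formulas differently.

```python
import re

MARGIN_SUFFIXES = ("_margin", "_surplus", "_buffer", "_coverage")


def declares_margin(calc: dict, requirement_id: str) -> bool:
    """True only if the calculation satisfies all three properties."""
    formula = calc.get("formula", "")  # the 'formula' key is an assumption
    if "=" not in formula:
        return False
    rhs = formula.split("=", 1)[1]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rhs)
    references_requirement = requirement_id in identifiers      # property 1
    has_margin_operator = "-" in rhs or "/" in rhs               # property 2
    has_margin_suffix = calc.get("output_name", "").endswith(MARGIN_SUFFIXES)  # property 3
    return references_requirement and has_margin_operator and has_margin_suffix


def requirement_has_margin(key_value_ids, calculations):
    """Return the *_required ids for which no calculation declares a margin."""
    return [
        kv_id
        for kv_id in key_value_ids
        if kv_id.endswith("_required")
        and not any(declares_margin(calc, kv_id) for calc in calculations)
    ]
```

With this shape, the sum 'combined_area = actual_area + buildable_area_required' fails property 2, a subtraction stored under a name such as area_difference fails property 3, and 'area_coverage = actual_area / buildable_area_required' satisfies all three.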
Addressed review feedback (commit …).

Tightened …

A bare reference inside a sum no longer satisfies the rule. Specifically, the false-pass case from the review (…) … Also added a positive test that the legitimate coverage-ratio shape (…) …

Doc cleanup: …

Empirical posture: …
…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters: aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness: calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 3 of experiments/napkin_math/docs/20260520_plan.md. Adds two new deterministic structural checks to validate_parameters.py, targeting Gemini's R1.1 (disconnected aggregates) and R2.5 (threshold trivialization).

New rules
- aggregate_not_bounded (ERROR): when an entry's formula_hint is a pure sum of named identifiers (total = A + B + C) and its output_name also appears in missing_values_to_estimate, the aggregate is sampled independently of its constituents. A single Monte Carlo trial can then pair sub-component p95s with a total p05. The total is a calculation over the parts, not a primitive input to estimate. Detection is deterministic and syntactic: the RHS contains +, no other operators, and every operand is a snake_case identifier. Mixed-operator expressions (a + b - c, a * b + c) are intentionally out of scope to avoid false positives on multiplicative or netted decompositions.
- requirement_has_margin (ERROR): when a key_value's id ends in _required, at least one calculation in derived_questions or recommended_first_calculations must declare a real realised-vs-required margin. Three properties together constitute a margin:
  1. The requirement id appears on the formula RHS.
  2. The formula contains a subtraction or ratio operator.
  3. The output_name carries a positive-pass margin suffix (_margin/_surplus/_buffer/_coverage), so a downstream >= 0 (or >= 1 for coverage) threshold reads correctly.

A bare reference inside a sum (e.g. combined = actual + required) consumes the value but does not test whether the realised quantity meets the requirement, and was the false-pass case flagged in the first review round.

Out of scope
The Phase 3 sampling_discipline enum expansion (lognormal, pert) is not in this PR. VALID_DISCIPLINES lives in run_monte_carlo.py, not validate_parameters.py. Expanding the allowed set without the matching sampler implementation (Phase 8) would create a silent shim where validation passes but sampling uses the wrong (triangular) distribution, which is exactly the "shim that pretends to work" antipattern. That bullet belongs alongside Phase 4 (bounds prompt emits the new disciplines) or Phase 8 (samplers implement them).

Empirical posture
- 19 unit tests in tests/test_validate_parameters.py. Coverage includes the sum-formula classifier alone, the aggregate_not_bounded fire and silent paths, the requirement_has_margin fire/silent paths with the three property combinations (false-pass sum, subtraction without margin suffix, ratio coverage as legitimate margin), and the checks_performed enumeration. All pass.
- All 6 v51 parameters.json files validate with valid=True, errors=0, warns=0 and zero new-rule violations. The corpus does not currently emit _required ids, so the tightened margin rule is correctly dormant; it will engage once Phase 2 extract prompts start producing requirement-floor key_values.

Test plan
- pytest experiments/napkin_math/tests/test_validate_parameters.py (19 pass)
- … (combined = actual + required): rule now fires

🤖 Generated with Claude Code
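The incoherence that aggregate_not_bounded guards against can be reproduced with a small simulation. This is an illustrative sketch only: the three components, their triangular bounds, and the tolerance of 2.0 are made-up values, and the code does not use run_monte_carlo.py.

```python
import random


def incoherent_fraction(trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of trials where an independently sampled total disagrees
    with the sum of its sampled parts by more than 2.0 units."""
    rng = random.Random(seed)
    incoherent = 0
    for _ in range(trials):
        # Three hypothetical sub-components; random.triangular takes
        # (low, high, mode) in that order.
        parts = [rng.triangular(8, 12, 10) for _ in range(3)]
        # Problem case: the total is sampled as if it were a primitive input.
        independent_total = rng.triangular(24, 36, 30)
        # Coherent alternative: the total is calculated from the parts.
        derived_total = sum(parts)
        if abs(independent_total - derived_total) > 2.0:
            incoherent += 1
    return incoherent / trials


if __name__ == "__main__":
    print(f"incoherent trials: {incoherent_fraction():.1%}")
```

A derived total matches its parts in every trial by construction, whereas the independently sampled total drifts away from them in a sizeable share of trials, which is why the rule treats the total as a calculation rather than an input to estimate.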