napkin-math(validate): add aggregate_not_bounded and requirement_has_margin (Phase 3)#746

Merged
neoneye merged 2 commits into
mainfrom
napkin-math/validate-parameters-phase3
May 21, 2026

@neoneye neoneye commented May 21, 2026

Summary

Phase 3 of experiments/napkin_math/docs/20260520_plan.md. Adds two new deterministic structural checks to validate_parameters.py, targeting Gemini's R1.1 (disconnected aggregates) and R2.5 (threshold trivialization).

New rules

  1. aggregate_not_bounded (ERROR) — when an entry's formula_hint is a pure sum of named identifiers (total = A + B + C) and its output_name also appears in missing_values_to_estimate, the aggregate is sampled independently of its constituents. A single Monte Carlo trial can then pair sub-component p95s with a total p05. The total is a calculation over the parts, not a primitive input to estimate. Detection is deterministic and syntactic — RHS contains +, no other operators, every operand is a snake_case identifier. Mixed-operator expressions (a + b - c, a * b + c) are intentionally out of scope to avoid false positives on multiplicative or netted decompositions.

  2. requirement_has_margin (ERROR) — when a key_value's id ends in _required, at least one calculation in derived_questions or recommended_first_calculations must declare a real realised-vs-required margin. Three properties together constitute a margin:

    1. The requirement id appears on the formula RHS (the calculation actually consumes the required value).
    2. The formula contains a subtraction or ratio operator (the calculation compares the realised quantity to the requirement, rather than adding the requirement into an aggregate).
    3. The output_name carries a positive-pass margin suffix (_margin/_surplus/_buffer/_coverage), so a downstream >= 0 (or >= 1 for coverage) threshold reads correctly.

    A bare reference inside a sum (e.g. combined = actual + required) consumes the value but does not test whether the realised quantity meets the requirement, and was the false-pass case flagged in the first review round.
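Rule 1's syntactic detection can be sketched roughly as follows. This is a minimal illustration, not the code in validate_parameters.py: the helper names `is_pure_sum` and `find_unbounded_aggregates`, the identifier regex, and the data shape (entries as dicts with `output_name` and `formula_hint`, `missing_values_to_estimate` as a plain list of names) are all assumptions.

```python
import re

# Assumed definition of a snake_case identifier: a lowercase letter followed
# by lowercase letters, digits, or underscores.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9_]*")


def is_pure_sum(formula_hint: str) -> bool:
    """Return True when the RHS is `a + b [+ c ...]` over snake_case identifiers only."""
    if formula_hint.count("=") != 1:
        return False
    _, rhs = formula_hint.split("=")
    operands = [part.strip() for part in rhs.split("+")]
    if len(operands) < 2:
        # A single operand is an alias, not an aggregate.
        return False
    # Any other operator (-, *, /, parentheses) leaves a fragment that is not
    # a bare identifier, so mixed-operator expressions are rejected here.
    return all(_SNAKE_CASE.fullmatch(op) is not None for op in operands)


def find_unbounded_aggregates(entries, missing_values_to_estimate):
    """List output names that are pure sums yet are also scheduled for estimation."""
    missing = set(missing_values_to_estimate)
    return [
        entry["output_name"]
        for entry in entries
        if is_pure_sum(entry.get("formula_hint", ""))
        and entry["output_name"] in missing
    ]
```

Splitting on `+` and requiring every fragment to be a bare identifier keeps the check deterministic and deliberately conservative: `a + b - c` and `a * b + c` both fail the operand test, so they stay out of scope as described above.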

Out of scope

The Phase 3 sampling_discipline enum expansion (lognormal, pert) is not in this PR. VALID_DISCIPLINES lives in run_monte_carlo.py, not validate_parameters.py. Expanding the allowed set without the matching sampler implementation (Phase 8) would create a silent shim where validation passes but sampling uses the wrong (triangular) distribution — exactly the "shim that pretends to work" antipattern. That bullet belongs alongside Phase 4 (bounds prompt emits the new disciplines) or Phase 8 (samplers implement them).
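One way to keep the allowed set and the sampler implementations from drifting apart is to derive the former from the latter. The sketch below is hypothetical and does not reflect how run_monte_carlo.py is actually structured: the `SAMPLERS` registry, the `sample` function, and its argument order are assumptions made for illustration.

```python
import random

# Hypothetical registry: a discipline is valid only if a sampler exists for it.
SAMPLERS = {
    # random.Random.triangular takes (low, high, mode), in that order.
    "triangular": lambda rng, low, mode, high: rng.triangular(low, high, mode),
}

# Deriving the allowed set from the registry means validation cannot accept
# a discipline that sampling would silently replace with another distribution.
VALID_DISCIPLINES = frozenset(SAMPLERS)


def sample(discipline: str, low: float, mode: float, high: float, rng: random.Random) -> float:
    """Draw one value, failing loudly for disciplines that have no sampler."""
    try:
        sampler = SAMPLERS[discipline]
    except KeyError:
        raise NotImplementedError(f"no sampler implemented for {discipline!r}") from None
    return sampler(rng, low, mode, high)
```

With this shape, adding `lognormal` or `pert` to the enum requires registering a sampler first, so the two changes land together.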

Empirical posture

  • Unit tests: 19 focused tests in tests/test_validate_parameters.py. Coverage includes the sum-formula classifier alone, the aggregate_not_bounded fire and silent paths, the requirement_has_margin fire/silent paths with the three property combinations (false-pass sum, subtraction without margin suffix, ratio coverage as legitimate margin), and the checks_performed enumeration. All pass.
  • Smoke-test fixture: validates clean against the updated validator (18 checks_performed, 0 errors, 0 warnings). Smoke-test assertion and docstring updated from 16 → 18 checks.
  • 6 existing v51 parameters.json files (paperclip, euro_adoption, yellowstone, crate, mars_gtld, datacenter_france): all still valid=True, errors=0, warns=0 with zero new-rule violations. The corpus does not currently emit _required ids, so the tightened margin rule is correctly dormant — it will engage once Phase 2 extract prompts start producing requirement-floor key_values.

Test plan

  • pytest experiments/napkin_math/tests/test_validate_parameters.py (19 pass)
  • Smoke fixture validates clean (18 checks_performed, 0 errors, 0 warns)
  • All 6 v51 plans still validate clean (0 false positives from new rules)
  • Regression test for the false-pass sum case (combined = actual + required) — rule now fires
  • CI green on this branch (tests, lint, typecheck; last run on the original commit)

🤖 Generated with Claude Code

neoneye and others added 2 commits May 21, 2026 15:36
…margin checks (Phase 3)

Phase 3 of the napkin_math methodology plan adds two new structural checks to validate_parameters.py, targeting Gemini's R1.1 (disconnected aggregates) and R2.5 (threshold trivialization) findings.

aggregate_not_bounded: when an entry's formula_hint is a pure sum of named identifiers (total = A + B + C) and the output_name also appears in missing_values_to_estimate, the aggregate is sampled independently of its constituents, so a single Monte Carlo trial can pair sub-component p95s with a total p05. Detection is deterministic and syntactic: the RHS contains '+', no other operators, and every operand is a snake_case identifier. Mixed-operator expressions (a + b - c, a * b + c) are intentionally out of scope to avoid false positives on multiplicative or netted decompositions.

requirement_has_margin: when a key_value's id ends in _required, at least one calculation in derived_questions or recommended_first_calculations must reference it (formula RHS or depends_on). Otherwise the realised-vs-required margin variable is missing and the gate defaults to >= 0 against an absolute quantity, passing for any non-negative realisation regardless of whether it meets the requirement.

Adds 16 focused unit tests in tests/test_validate_parameters.py (sum-formula classifier, both rules' fire and silent paths, checks_performed enumeration). All 6 existing v51 parameters.json files (paperclip, euro_adoption, yellowstone, crate, mars_gtld, datacenter_france) continue to validate clean (0 false positives). Smoke-test count updated from 16 to 18.

Out of scope: the Phase 3 sampling_discipline enum expansion (lognormal/pert). VALID_DISCIPLINES lives in run_monte_carlo.py and expanding it without the matching sampler implementation (Phase 8) would create a silent shim where validation passes but sampling uses the wrong distribution. That bullet belongs alongside Phase 4 (bounds prompt emits the new disciplines) or Phase 8 (samplers implement them).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… margin shape

Review feedback on PR #746: the original rule only checked that a _required key_value was referenced by some calculation, which mostly duplicated no_dead_end_variables with a more specific message. False pass: 'combined_area = actual_area + buildable_area_required' validated clean, even though it adds the requirement into an aggregate rather than testing the realised-vs-required margin.

Tightened rule requires three properties on at least one calculation in derived_questions or recommended_first_calculations: (1) the requirement id appears on the formula RHS, (2) the formula contains a subtraction or ratio operator, (3) the output_name carries a positive-pass margin suffix (_margin/_surplus/_buffer/_coverage). All three are needed — any combination of two leaves a hole.

Adds three regression tests: the false-pass sum case, subtraction-without-margin-suffix, and the legitimate coverage-ratio (actual / required → *_coverage) silent case. 19 unit tests total pass (16 prior + 3 new). All 6 v51 parameters.json files still validate clean — none currently emit _required key_values, so the tightened rule is correctly dormant on the existing corpus.

Also fixes the smoke-test docstring that still said '16 checks' after the first commit of PR #746 updated the assertion to 18.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye commented May 21, 2026

Addressed review feedback (commit 0e5f64be):

Tightened requirement_has_margin to require three properties together, not just "referenced somewhere":

  1. The requirement id appears on the formula RHS.
  2. The formula contains a subtraction or ratio operator (- or /).
  3. The output_name ends in _margin/_surplus/_buffer/_coverage.

A bare reference inside a sum no longer satisfies the rule. Specifically, the false-pass case from the review (combined_area = actual_area + buildable_area_required) now fires requirement_has_margin ERROR, locked in by a regression test.

Also added a positive test that the legitimate coverage-ratio shape (buildable_area_coverage = actual_area / buildable_area_required) stays silent, and a test that a real subtraction without a margin-suffix output (area_delta = actual - required) still fires.
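The three cases above can be reproduced with a small sketch of the tightened rule. It is an illustration under stated assumptions, not the validator's actual code: the function names `declares_margin` and `requirements_without_margin`, the `formula_hint` field name on calculations, and the tokenizing regex are hypothetical, and the operand `actual_area` in the third case is filled in for illustration.

```python
import re

MARGIN_SUFFIXES = ("_margin", "_surplus", "_buffer", "_coverage")


def declares_margin(requirement_id: str, output_name: str, formula: str) -> bool:
    """Check the three properties that together constitute a real margin."""
    _, _, rhs = formula.partition("=")
    # Tokenize so a requirement id is matched whole, not as a substring.
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rhs)
    consumes_requirement = requirement_id in identifiers  # property 1
    compares = "-" in rhs or "/" in rhs                   # property 2
    named_as_margin = output_name.endswith(MARGIN_SUFFIXES)  # property 3
    return consumes_requirement and compares and named_as_margin


def requirements_without_margin(key_value_ids, calculations):
    """Return every *_required id that no calculation turns into a margin."""
    return [
        rid
        for rid in key_value_ids
        if rid.endswith("_required")
        and not any(
            declares_margin(rid, calc["output_name"], calc["formula_hint"])
            for calc in calculations
        )
    ]


REQ = "buildable_area_required"
# False-pass sum case: consumes the requirement but never compares against it.
print(declares_margin(REQ, "combined_area", "combined_area = actual_area + buildable_area_required"))
# Legitimate coverage ratio: all three properties hold.
print(declares_margin(REQ, "buildable_area_coverage", "buildable_area_coverage = actual_area / buildable_area_required"))
# Real subtraction, but the output name lacks a margin suffix.
print(declares_margin(REQ, "area_delta", "area_delta = actual_area - buildable_area_required"))
```

Only the coverage-ratio case returns True; the other two leave the requirement without a declared margin, which is when the rule fires.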

Doc cleanup: run_smoke.py docstring updated from "16 checks listed" to "18 checks listed" to match the assertion.

Empirical posture:

  • 19 unit tests pass (16 from PR baseline + 3 new for the tightened rule)
  • 6 v51 plans still validate clean (no false positives; the corpus doesn't yet emit _required ids, so the rule is correctly dormant)
  • CI was green on the original commit; updated PR body now reflects that and the tightened rule design.

@neoneye neoneye merged commit 70048df into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/validate-parameters-phase3 branch May 21, 2026 13:52
neoneye added a commit that referenced this pull request May 21, 2026
…746-747

docs(napkin-math): record PR #746 (Phase 3) and PR #747 (Phase 4 runtime)
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…rg#747 (Phase 4 runtime) in 20260520 plan

Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>