feat(napkin-math): shared-pool legitimacy check for aggregate surplus tests #707
Merged
… tests
ChatGPT review of the v31 cross-border rail run flagged that combined_program_viability_surplus_eur subtracted both peak_daily_clearing_obligation_eur AND enforcement_delay_revenue_shortfall_eur from the same (initial_clearing_float + emergency_float_reserve) pool. The digest text scopes the emergency reserve to the clearing mechanism only — those two pressures draw on different pools. Additive netting overstated headroom: P(combined_surplus >= 0) read 47% under the additive form, while the conceptually correct min-of-individual-surpluses read 26%.
Added a 'Shared-pool legitimacy check for combined surplus tests' rule to both extractor prompts. When an aggregate test subtracts multiple pressures from one reserve, verify those pressures actually draw from the same pool — same named buffer, same line item, same envelope. If yes, additive form is correct. If no, use min() over the individual surplus calculations instead. The rule lists concrete signs of each pattern, explains why the threshold semantics ('all gates pass') still hold under min(), and recommends renaming the aggregate to 'weakest_gate_surplus' or 'worst_case_pool_surplus' when min() is used so the form is legible from the name.
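The decision the rule asks the extractor to make can be sketched as a small helper. This is an illustration only, not the prompt text or the project's code: the Pressure type, the aggregate_surplus function, the pool labels, and all figures are hypothetical, and summing pressures within each pool before taking the minimum is an assumption about how several pressures on one pool would be handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pressure:
    name: str
    amount: float
    pool: str  # hypothetical label: the named buffer that absorbs this pressure


def aggregate_surplus(pools: dict[str, float], pressures: list[Pressure]) -> tuple[str, float]:
    """Return (form, surplus) for an aggregate surplus test.

    Same pool for every pressure -> additive netting against that pool.
    Different pools -> one surplus per pool, then the minimum of those.
    """
    if not pressures:
        raise ValueError("at least one pressure is required")
    drawn = {p.pool for p in pressures}
    if len(drawn) == 1:
        pool = next(iter(drawn))
        return "additive", pools[pool] - sum(p.amount for p in pressures)
    per_pool = {
        pool: pools[pool] - sum(p.amount for p in pressures if p.pool == pool)
        for pool in drawn
    }
    return "min", min(per_pool.values())


# Illustrative figures only (not taken from either reference plan).
shared = aggregate_surplus(
    {"contingency": 100.0},
    [Pressure("labor_shock", 30.0, "contingency"),
     Pressure("rental_shortfall", 50.0, "contingency")],
)
split = aggregate_surplus(
    {"clearing_reserve": 100.0, "revenue_buffer": 20.0},
    [Pressure("clearing_obligation", 30.0, "clearing_reserve"),
     Pressure("revenue_shortfall", 50.0, "revenue_buffer")],
)
print(shared)  # additive form, one shared pool
print(split)   # min form, two separately scoped pools
```

In the split case, netting both pressures against the combined 120.0 would report a positive surplus of 40.0 even though the revenue buffer alone is short by 30.0, which is the kind of overstated headroom the rule is meant to catch.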
Cross-validated against both reference plans:
- Nuuk Clay Workshop: digest explicitly says the same 15% DKK contingency absorbs labor-law, rental shortfall, and kiln-failure shocks. Additive form (contingency - labor_shock - rental_shortfall) is correct under the new rule. No model change needed.
- Cross-border rail ticketing: digest scopes the 300M EUR emergency reserve to the clearing mechanism. v32 replaces the additive combined_program_viability_surplus_eur with weakest_financial_gate_surplus_eur = min(clearing_capacity_surplus, regulatory_risk_buffer_surplus, royalty_cost_coverage_surplus). The tighter framing drops the pass probability from 47.33% to 26.15% — closer to the truth given that the adoption gate is DOOM and the regulatory/royalty gates are coin-flips. The other five gates are individually scoped and unchanged.
Smoke 8/8, unittest 45/45.
ChatGPT's v32 caveat: 'weakest_financial_gate_surplus_eur uses min(...) across values with very different magnitudes. That is logically valid for a pass/fail gate, but it can make sensitivity analysis less intuitive: the regulatory buffer dominates because it has much larger swings than the royalty surplus. That is okay as long as you treat it as a gate metric, not as a monetary expected surplus in the usual sense. For reporting, I would present the individual gates first, then the weakest-gate result.'
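The magnitude caveat can be made concrete by checking how strongly the weakest-gate value tracks each input gate. The sketch below uses invented normal distributions (the means and standard deviations are not the v32 model's parameters), and the helper names pearson and weakest_gate_sensitivity are hypothetical.

```python
import random
from statistics import fmean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def weakest_gate_sensitivity(gates, n_draws=20_000, seed=0):
    """Draw each gate independently and relate it to min() across gates.

    `gates` maps a gate name to (mean, stdev) of an assumed normal surplus.
    """
    rng = random.Random(seed)
    draws = {
        name: [rng.gauss(mu, sd) for _ in range(n_draws)]
        for name, (mu, sd) in gates.items()
    }
    weakest = [min(values) for values in zip(*draws.values())]
    return {
        # Share of draws in which every gate is non-negative.
        "pass_rate": sum(w >= 0 for w in weakest) / n_draws,
        # How closely the weakest-gate value follows each individual gate.
        "correlation_with_weakest": {
            name: pearson(values, weakest) for name, values in draws.items()
        },
    }


report = weakest_gate_sensitivity({
    "regulatory_risk_buffer_surplus": (0.0, 200.0),  # large swings (hypothetical)
    "royalty_cost_coverage_surplus": (0.0, 5.0),     # small swings (hypothetical)
})
print(report)
```

Under these assumptions both gates are individually coin-flips, yet the value of the minimum follows the large-swing gate almost exclusively. That is why the pass rate is the meaningful output and the monetary value of the minimum is not.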
summarize_insights.py now identifies min()-style aggregates via a simple formula_hint scan and reorders the verdict table + bad-news-first lists so non-aggregates come first within each severity band. The verdict table also gains a 'min' marker column so the reader can spot which rows are aggregates at a glance. Section heading expanded to explain why the ordering differs from pure severity sort.
Additive aggregates are NOT demoted. Nuuk's combined_viability_surplus_dkk = contingency - labor_shock - rental_shortfall uses subtraction over one named pool, and its magnitude is a real DKK depletion of that pool. The heuristic 'formula contains min(' distinguishes the two cases cleanly without a new schema field.
Verified: rail v32 verdict table now lists regulatory_risk_buffer_surplus_eur (individual, 48.3%) before weakest_financial_gate_surplus_eur (aggregate, marked 'min', 26.2%) within the FRAGILE band, even though the aggregate has a worse pass probability. Nuuk v31 verdict table is unchanged because none of its formulas contain min(.
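The ordering described in this commit might look roughly like the following. It is a sketch, not the code in summarize_insights.py: the row fields (name, band, formula_hint, pass_probability), the severity ranking, the formula_hint of the individual gate, and the function names are all assumptions. Only the "formula contains min(" heuristic, the non-aggregates-first ordering within a band, and the two rail pass probabilities come from the commit message.

```python
# Assumed ranking of severity bands, worst first.
SEVERITY_ORDER = ["DOOM", "FRAGILE", "MARGINAL", "ROBUST"]


def is_min_aggregate(formula_hint: str) -> bool:
    """Heuristic from the commit message: the formula contains 'min('."""
    return "min(" in formula_hint


def order_verdict_rows(rows):
    """Sort by severity band, then non-aggregates first, then worst pass rate."""
    def key(row):
        return (
            SEVERITY_ORDER.index(row["band"]),
            is_min_aggregate(row["formula_hint"]),  # False sorts before True
            row["pass_probability"],
        )
    return sorted(rows, key=key)


def marker(row) -> str:
    """Value for the 'min' marker column."""
    return "min" if is_min_aggregate(row["formula_hint"]) else ""


rows = [
    {
        "name": "weakest_financial_gate_surplus_eur",
        "band": "FRAGILE",
        "formula_hint": "min(clearing_capacity_surplus, "
                        "regulatory_risk_buffer_surplus, "
                        "royalty_cost_coverage_surplus)",
        "pass_probability": 0.262,
    },
    {
        "name": "regulatory_risk_buffer_surplus_eur",
        "band": "FRAGILE",
        "formula_hint": "regulatory_buffer - regulatory_exposure",  # hypothetical
        "pass_probability": 0.483,
    },
]

for row in order_verdict_rows(rows):
    print(f'{row["band"]:8} {marker(row):4} {row["pass_probability"]:.1%} {row["name"]}')
```

A plain substring test would also match identifiers such as argmin(, so a real implementation may need a stricter match if such names can appear in formula hints.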
Smoke 8/8, unittest 45/45.
…ation
Adds slab 4 to the TL;DR covering the late-evening fifth prompt rule (shared-pool legitimacy) and the summarize-insights min()-aggregate demotion, both still on the unmerged PR #707.

Adds a cross-plan validation table showing the same five rules applied across three plans in two domains: Nuuk Clay Workshop (DKK, commercial small-business), Cross-Border European Rail Ticketing (EUR, EU public infrastructure), and Faraday Enclosure Launch (EUR, commercial hardware launch).

Headline-state block now includes the threshold pass-probability summaries for all three plans, showing that each plan produced a domain-appropriate verdict shape (Nuuk: two DOOM + MARGINAL + ROBUST; rail: one DOOM + ROBUST + three MARGINAL/FRAGILE + a FRAGILE min aggregate; Faraday: three DOOM + two FRAGILE). None of the three is a template overfit of another.

Outstanding-issues section rewritten to reflect what closed and what remains. Cross-plan generalisation moved from 'highest leverage, untested' to 'substantively closed': three plans across two domains have validated the rules. The remaining headline gap is the original compressed-vs-full-HTML head-to-head, which still has not been run.

'What I would do next' re-prioritised: the head-to-head is now item 1, a fourth-domain run (Leipzig public-health) is item 2, lifting run-scenarios to Python is item 3, and the validator/repair loop is item 4.

Closing thought rewritten to record one observation from the Faraday pass: my first draft mis-modelled the funder's 'net cash from operations excluding the initial €400k investment' by subtracting marketing/cert/CAPEX from gross profit. The deterministic three-scenario table caught this before Monte Carlo ran. When scenarios read as obviously worse than the plan's own framing, the formula is probably wrong, not the plan. Worth recording as a pattern.
Summary
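ChatGPT reviewed the v31 cross-border rail ticketing run (the cross-plan generalisation test from PR #706) and flagged one real modelling-discipline issue: combined_program_viability_surplus_eur subtracted both peak_daily_clearing_obligation_eur and enforcement_delay_revenue_shortfall_eur from the same (initial_clearing_float + emergency_float_reserve) pool.
The rail digest explicitly scopes the €300M emergency reserve to the clearing mechanism, so those two pressures draw on different pools. Additive netting overstated headroom. The conceptually correct form is min() over the individual surplus calculations.
What changed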
Both extractor prompts (extract-parameters-from-full, extract-parameters-from-digest) gain a 'Shared-pool legitimacy check for combined surplus tests' rule:
- If the pressures draw from the same pool (same named buffer, same line item, same envelope), the additive form is correct:
    combined_viability_surplus = pool_reserve - pressure_a - pressure_b - pressure_c
- If they draw from different pools, use min() over individual surpluses:
    combined_viability_surplus = min(surplus_a, surplus_b, surplus_c)
- Rename the aggregate to weakest_gate_surplus / worst_case_pool_surplus when using min().
Cross-validated against both reference plans
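- Nuuk Clay Workshop: digest explicitly says the same 15% DKK contingency absorbs labor-law, rental, and kiln-failure shocks. Additive form (contingency - labor_shock - rental_shortfall) is correct under the new rule. No model change needed.
- Cross-border rail ticketing: v32 replaces additive combined_program_viability_surplus_eur with:
    weakest_financial_gate_surplus_eur = min(clearing_capacity_surplus, regulatory_risk_buffer_surplus, royalty_cost_coverage_surplus)
Numerical effect of the framing change
- Additive combined_program_viability_surplus_eur (v31): P(surplus >= 0) = 47.33%
- weakest_financial_gate_surplus_eur (v32, min() form): P(surplus >= 0) = 26.15%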
The other five individually-scoped gates are unchanged. The additive form was netting deficits in revenue economics against surplus in clearing liquidity — different stakeholders, different pools. The min() form correctly says "all three gates must independently hold".
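The gap between the two forms can be reproduced on toy numbers. The sketch below is not the v32 model: it uses two gates with invented normal distributions, and it models cross-pool netting as the sum of the two individual surpluses, which is a simplification of the original additive formula. The function name compare_forms is hypothetical.

```python
import random


def compare_forms(n_draws=50_000, seed=0):
    """Compare additive netting with the min() form on the same draws."""
    rng = random.Random(seed)
    additive_pass = weakest_pass = masked = 0
    for _ in range(n_draws):
        clearing_surplus = rng.gauss(150.0, 50.0)  # hypothetical: usually in surplus
        revenue_surplus = rng.gauss(-20.0, 60.0)   # hypothetical: often in deficit
        additive = clearing_surplus + revenue_surplus    # nets across pools
        weakest = min(clearing_surplus, revenue_surplus)  # every gate must hold
        additive_pass += additive >= 0
        weakest_pass += weakest >= 0
        # Draws where one pool's surplus hides the other pool's deficit.
        masked += additive >= 0 and weakest < 0
    return {
        "additive_pass_rate": additive_pass / n_draws,
        "weakest_gate_pass_rate": weakest_pass / n_draws,
        "masked_share": masked / n_draws,
    }


result = compare_forms()
print(result)
```

Because min() is non-negative exactly when every gate is non-negative, its pass rate is the probability that all gates hold, and under this sum-of-surpluses simplification it can never exceed the additive pass rate. The masked share counts the draws the additive form would have reported as passing while at least one gate had failed.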
Test plan
- Smoke 8/8 (tests/run_smoke.py)
- unittest 45/45 (tests/test_run_monte_carlo.py)
- … min() direction is the right default when the source is ambiguous (current text leans toward min() unless the source explicitly names a shared pool; fail safe).
🤖 Generated with Claude Code