
feat(napkin-math): shared-pool legitimacy check for aggregate surplus tests #707

Merged
neoneye merged 3 commits into main from feat/napkin-math-shared-pool
May 16, 2026

Conversation

@neoneye
Member

@neoneye neoneye commented May 16, 2026

Summary

ChatGPT reviewed the v31 cross-border rail ticketing run (the cross-plan generalisation test from PR #706) and flagged one real modelling-discipline issue:

The combined_program_viability_surplus_eur may be too aggressive or conceptually overloaded. It mixes liquidity/clearing solvency with revenue loss from enforcement delay. Those may not draw on the same reserve unless the report explicitly says the emergency float reserve also backstops regulatory/revenue shortfalls.

The rail digest explicitly scopes the €300M emergency reserve to the clearing mechanism, so the two pressures (clearing solvency and enforcement-delay revenue loss) draw on different pools. Additive netting therefore overstated headroom. The conceptually correct form is min() over the individual surplus calculations.

What changed

Both extractor prompts (extract-parameters-from-full, extract-parameters-from-digest) gain a new rule, "Shared-pool legitimacy check for combined surplus tests":

  • When an aggregate test subtracts multiple pressures from one reserve, verify those pressures actually draw from the same pool — same named buffer, same line item, same envelope.
  • If yes (single pool absorbs every pressure): additive form is correct.
    • combined_viability_surplus = pool_reserve - pressure_a - pressure_b - pressure_c
  • If no (separate pools): use min() over individual surpluses.
    • combined_viability_surplus = min(surplus_a, surplus_b, surplus_c)
  • The rule lists concrete signs of each pattern. The aggregate's name should match its form: prefer weakest_gate_surplus / worst_case_pool_surplus when using min().
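
As a minimal sketch of the two forms in the rule above (not the extractor prompts' actual text or any project code), the following Python compares them. The function names `additive_surplus` and `weakest_gate_surplus` and all numbers are hypothetical, chosen only to show how the forms diverge.

```python
from typing import Mapping


def additive_surplus(pool_reserve: float, pressures: Mapping[str, float]) -> float:
    """Single shared pool: every pressure depletes the same reserve."""
    return pool_reserve - sum(pressures.values())


def weakest_gate_surplus(surpluses: Mapping[str, float]) -> float:
    """Separate pools: the aggregate is only as healthy as its weakest gate."""
    return min(surpluses.values())


# Hypothetical figures (not taken from either reference plan).
# Case 1: one reserve of 100 absorbs both pressures, so subtraction is valid.
shared = additive_surplus(100.0, {"pressure_a": 30.0, "pressure_b": 50.0})

# Case 2: each pressure has its own pool; gate b is short by 10.
separate = weakest_gate_surplus({"surplus_a": 70.0, "surplus_b": -10.0})

# Netting the separate-pool surpluses would hide the shortfall in gate b.
netted = 70.0 + (-10.0)

print(shared, separate, netted)
```

In the second case the netted figure is positive even though one gate fails on its own, which is the overstatement the rule is meant to prevent.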

Cross-validated against both reference plans

Nuuk Clay Workshop: digest explicitly says the same 15% DKK contingency absorbs labor-law, rental, and kiln-failure shocks. Additive form (contingency - labor_shock - rental_shortfall) is correct under the new rule. No model change needed.

Cross-border rail ticketing: v32 replaces additive combined_program_viability_surplus_eur with:

weakest_financial_gate_surplus_eur = min(
  clearing_capacity_surplus_eur,
  regulatory_risk_buffer_surplus_eur,
  royalty_cost_coverage_surplus_eur,
)

Numerical effect of the framing change

| Metric | v31 (additive) | v32 (min) |
|---|---|---|
| Aggregate pass probability | 47.33% (FRAGILE) | 26.15% (FRAGILE, near DOOM) |
| Base scenario value | +87.5M EUR | 0 EUR |
| Low scenario value | +400M EUR | −500k EUR |
| High scenario value | −800M EUR | −600M EUR |

The other five individually-scoped gates are unchanged. The additive form was netting deficits in revenue economics against surplus in clearing liquidity — different stakeholders, different pools. The min() form correctly says "all three gates must independently hold".
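
The direction of the change in pass probability can be illustrated with a small simulation. This is a sketch under stated assumptions: the three per-gate distributions are hypothetical (they are not the rail model's parameters), and the additive form is simplified to a sum of the individual surpluses. Draw by draw, min(...) >= 0 implies the sum is >= 0, so the min() pass probability can never exceed the netted one.

```python
import random


def pass_probabilities(n_draws: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Return (P(netted sum >= 0), P(min >= 0)) over simulated gate surpluses."""
    rng = random.Random(seed)
    additive_pass = 0
    min_pass = 0
    for _ in range(n_draws):
        # Hypothetical per-gate surpluses with very different magnitudes.
        clearing = rng.gauss(200.0, 100.0)   # usually in surplus
        regulatory = rng.gauss(0.0, 300.0)   # roughly a coin-flip, large swings
        royalty = rng.gauss(0.0, 1.0)        # roughly a coin-flip, small swings
        surpluses = (clearing, regulatory, royalty)
        if sum(surpluses) >= 0:
            additive_pass += 1   # deficits netted against surpluses
        if min(surpluses) >= 0:
            min_pass += 1        # every gate must hold independently
    return additive_pass / n_draws, min_pass / n_draws


p_additive, p_min = pass_probabilities()
print(f"netted: {p_additive:.2%}  weakest gate: {p_min:.2%}")
```

The printed values depend entirely on the assumed distributions and should not be compared with the 47.33% and 26.15% figures above. Only the ordering between the two is general.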

Test plan

  • Smoke 8/8 (tests/run_smoke.py)
  • Unittest 45/45 (tests/test_run_monte_carlo.py)
  • v32 rail regenerated with min() framing; runner emits zero warnings; insights.md reads cleanly with the new headline
  • Nuuk's additive form re-validated against the digest text — no change needed
  • Reviewer: confirm the prompt rule's min() direction is the right default when the source is ambiguous (current text leans toward min() unless the source explicitly names a shared pool — fail safe).

🤖 Generated with Claude Code

neoneye added 3 commits May 16, 2026 16:42
… tests

ChatGPT review of the v31 cross-border rail run flagged that combined_program_viability_surplus_eur subtracted both peak_daily_clearing_obligation_eur AND enforcement_delay_revenue_shortfall_eur from the same (initial_clearing_float + emergency_float_reserve) pool. The digest text scopes the emergency reserve to the clearing mechanism only — those two pressures draw on different pools. Additive netting overstated headroom: P(combined_surplus >= 0) read 47% under the additive form, while the conceptually correct min-of-individual-surpluses read 26%.

Added a 'Shared-pool legitimacy check for combined surplus tests' rule to both extractor prompts. When an aggregate test subtracts multiple pressures from one reserve, verify those pressures actually draw from the same pool — same named buffer, same line item, same envelope. If yes, additive form is correct. If no, use min() over the individual surplus calculations instead. The rule lists concrete signs of each pattern, explains why the threshold semantics ('all gates pass') still hold under min(), and recommends renaming the aggregate to 'weakest_gate_surplus' or 'worst_case_pool_surplus' when min() is used so the form is legible from the name.

Cross-validated against both reference plans:

- Nuuk Clay Workshop: digest explicitly says the same 15% DKK contingency absorbs labor-law, rental shortfall, and kiln-failure shocks. Additive form (contingency - labor_shock - rental_shortfall) is correct under the new rule. No model change needed.

- Cross-border rail ticketing: digest scopes the 300M EUR emergency reserve to the clearing mechanism. v32 replaces the additive combined_program_viability_surplus_eur with weakest_financial_gate_surplus_eur = min(clearing_capacity_surplus, regulatory_risk_buffer_surplus, royalty_cost_coverage_surplus). The tighter framing drops the pass probability from 47.33% to 26.15% — closer to the truth given that the adoption gate is DOOM and the regulatory/royalty gates are coin-flips. The other five gates are individually scoped and unchanged.

Smoke 8/8, unittest 45/45.
ChatGPT's v32 caveat: 'weakest_financial_gate_surplus_eur uses min(...) across values with very different magnitudes. That is logically valid for a pass/fail gate, but it can make sensitivity analysis less intuitive: the regulatory buffer dominates because it has much larger swings than the royalty surplus. That is okay as long as you treat it as a gate metric, not as a monetary expected surplus in the usual sense. For reporting, I would present the individual gates first, then the weakest-gate result.'

summarize_insights.py now identifies min()-style aggregates via a simple formula_hint scan and reorders the verdict table + bad-news-first lists so non-aggregates come first within each severity band. The verdict table also gains a 'min' marker column so the reader can spot which rows are aggregates at a glance. Section heading expanded to explain why the ordering differs from pure severity sort.

Additive aggregates are NOT demoted. Nuuk's combined_viability_surplus_dkk = contingency - labor_shock - rental_shortfall uses subtraction over one named pool, and its magnitude is a real DKK depletion of that pool. The heuristic 'formula contains min(' distinguishes the two cases cleanly without a new schema field.

Verified: rail v32 verdict table now lists regulatory_risk_buffer_surplus_eur (individual, 48.3%) before weakest_financial_gate_surplus_eur (aggregate, marked 'min', 26.2%) within the FRAGILE band, even though the aggregate has a worse pass probability. Nuuk v31 verdict table is unchanged because none of its formulas contain min(.
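
The reordering described in this commit could look roughly like the sketch below. It is not the code in summarize_insights.py: the `Verdict` dataclass, its field names, the `BAND_ORDER` severity ranking, and the formula hint for the individual gate are all assumptions made for illustration. Only the "formula contains min(" heuristic and the two rail rows come from the text above.

```python
from dataclasses import dataclass

# Assumed severity ranking, worst first.
BAND_ORDER = {"DOOM": 0, "FRAGILE": 1, "MARGINAL": 2, "ROBUST": 3}


@dataclass
class Verdict:
    name: str
    band: str
    pass_probability: float
    formula_hint: str


def is_min_aggregate(verdict: Verdict) -> bool:
    """Heuristic from the commit message: the formula contains 'min('."""
    return "min(" in verdict.formula_hint


def order_verdicts(verdicts: list[Verdict]) -> list[Verdict]:
    """Sort by band, then individual gates before min() aggregates, then worst odds first."""
    return sorted(
        verdicts,
        key=lambda v: (BAND_ORDER[v.band], is_min_aggregate(v), v.pass_probability),
    )


rows = [
    Verdict(
        "weakest_financial_gate_surplus_eur", "FRAGILE", 0.262,
        "min(clearing_capacity_surplus_eur, regulatory_risk_buffer_surplus_eur, "
        "royalty_cost_coverage_surplus_eur)",
    ),
    # The formula hint below is a hypothetical placeholder.
    Verdict("regulatory_risk_buffer_surplus_eur", "FRAGILE", 0.483, "buffer - exposure"),
]

for v in order_verdicts(rows):
    marker = "min" if is_min_aggregate(v) else ""
    print(f"{v.band:8} {v.name:40} {v.pass_probability:6.1%} {marker}")
```

Because the aggregate flag sorts ahead of pass probability within a band, the individual regulatory gate is listed before the weakest-gate aggregate despite its better odds. An additive formula such as contingency - labor_shock - rental_shortfall contains no min( and is left in the normal ordering.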

Smoke 8/8, unittest 45/45.
…ation

Adds slab 4 to the TL;DR covering the late-evening fifth prompt rule (shared-pool legitimacy) and the summarize-insights min()-aggregate demotion, both still on the unmerged PR #707. Adds a cross-plan validation table showing the same five rules applied across three plans in two domains: Nuuk Clay Workshop (DKK, commercial small-business), Cross-Border European Rail Ticketing (EUR, EU public infrastructure), and Faraday Enclosure Launch (EUR, commercial hardware launch).

Headline-state block now includes the threshold pass-probability summaries for all three plans, showing that each plan produced a domain-appropriate verdict shape (Nuuk: two DOOM + MARGINAL + ROBUST; rail: one DOOM + ROBUST + three MARGINAL/FRAGILE + a FRAGILE min aggregate; Faraday: three DOOM + two FRAGILE). None of the three is a template overfit of another.

Outstanding-issues section rewritten to reflect what closed and what remains. Cross-plan generalisation moved from 'highest leverage, untested' to 'substantively closed' — three plans across two domains have validated the rules. The remaining headline gap is the original compressed-vs-full-HTML head-to-head, which still has not been run. 'What I would do next' re-prioritised: the head-to-head is now item 1, a fourth-domain run (Leipzig public-health) is item 2, lifting run-scenarios to Python is item 3, and the validator/repair loop is item 4.

Closing thought rewritten to record one observation from the Faraday pass: my first draft mis-modelled the funder's 'net cash from operations excluding the initial €400k investment' by subtracting marketing/cert/CAPEX from gross profit. The deterministic three-scenario table caught this before Monte Carlo ran — when scenarios read as obviously worse than the plan's own framing, the formula is probably wrong, not the plan. Worth recording as a pattern.
@neoneye neoneye merged commit c491c85 into main May 16, 2026
3 checks passed
@neoneye neoneye deleted the feat/napkin-math-shared-pool branch May 16, 2026 15:24