feat(napkin-math): teach extractors threshold-friendly output naming #706
Merged
Conversation
ChatGPT review of the v27 Nuuk run flagged one thing in the parameters JSON: rental_revenue_gap_dkk had a >= 0 / <= 0 ambiguity. The output is meant to be tested with the threshold operator <= 0 ("no gap"), but a reader has to re-derive which sign is good every time they read the name. ChatGPT proposed renaming to drop_in_capacity_surplus_dkk with threshold >= 0, which reads naturally.
Generalised the lesson into a new prompt rule rather than hand-patching one entry. Both extractors (extract-parameters-from-full and extract-parameters-from-digest) now carry a Threshold-friendly output naming section: when a calculation is meant to be compared against a threshold, name it so that positive = pass. Prefer _surplus / _buffer / _margin / _coverage; avoid _gap (direction ambiguous) and _deficit (implies always bad); use _shortfall only when the formula matches the name. The rule says explicitly to flip the formula sign rather than emit a confusingly named output.
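As an illustration (not the extractor's actual output; the names are placeholders), the flip the rule asks for looks like this:

```python
# Hypothetical example of the naming rule; variable names are placeholders.

# Avoid: a threshold-tested output whose "good" sign the reader must re-derive.
def revenue_gap(expected_revenue: float, achievable_revenue: float) -> float:
    # Tested with <= 0 ("no gap"): pass when the number is negative or zero.
    return expected_revenue - achievable_revenue

# Prefer: flip the formula sign so that positive = pass, and name it accordingly.
def capacity_surplus(expected_revenue: float, achievable_revenue: float) -> float:
    # Tested with >= 0: pass when the number is positive or zero.
    return achievable_revenue - expected_revenue
```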
Also fixed the two pre-existing conflicting examples in the same prompts (revenue_gap = required_revenue - expected_revenue, funding_gap = required_funding - available_cash) to use the surplus framing. The Preferred patterns list and the Commercial gate execution rule now agree with the new naming rule.
Smoke 8/8, unittest 45/45. Regenerated v28/20260215_nuuk_clay_workshop locally (gitignored output tree) using the renamed metric: insights.md headline switches from 'P(rental_revenue_gap_dkk <= 0) = 0.14%' to 'P(drop_in_capacity_surplus_dkk >= 0) = 0.14%'. Same finding; reads as plain English.
ChatGPT review of v28: 'rental_revenue_shortfall_dkk is still modelled as an independent sampled input while expected_drop_in_revenue_dkk and drop_in_capacity_surplus_dkk are also derived from revenue/share/hours. For a cleaner causal model, derive shortfall from capacity.'

Added a 'Derive coupled stressors instead of sampling them independently' section to both extractor prompts. Concrete signs of the anti-pattern: a missing_values entry whose why_needed references another modelled variable by name; a suggested_estimation_method that is effectively a formula; an aggregate test that subtracts both an independent stressor and the inputs the stressor depends on. Fix: convert the entry to a recommended_first_calculation with max(0, expected - achievable), or derive it from an existing surplus calculation (sketched at the end of this comment). The rule also explains why this matters: independent sampling double-counts the underlying risk, allows physically incoherent simulation runs (a shortfall in a scenario where capacity covers the target), and weakens the sensitivity analysis.

Demonstrated locally with v29/20260215_nuuk_clay_workshop (gitignored output tree). v28 -> v29 substance changes (same digest, applying the new rule):
- rental_revenue_shortfall_dkk moves from missing_values_to_estimate to recommended_first_calculations with formula max(0, expected_drop_in_revenue_dkk - rate*hours), and is removed from bounds.json.
- Deterministic combined_viability_surplus drops in every scenario (low: +160k -> +76k; base: -119k -> -287k; high: -470k -> -626k) because the previously assumed bounds (0/158k/350k) under-estimated the actual derived shortfall (84k/326k/506k).
- P(combined_viability_surplus >= 0): 5.27% -> 1.09%. The Monte Carlo no longer benefits from the artificial independence between the shortfall and the capacity that produces it.
- Sensitivity drivers of combined_viability shift from the tautological rental_revenue_shortfall_dkk (r=-0.87 in v28) to the true upstream drivers: revenue_mix_drop_in_share (-0.77), year1_revenue_target_dkk (-0.43), billable_open_studio_hours_year1 (+0.28). The simulation now respects the causal link.

Smoke 8/8, unittest 45/45.
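To make the anti-pattern and the fix concrete, here is a rough sketch with made-up bounds and a plain NumPy sampling loop; it is not the pipeline's actual Monte Carlo code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Upstream drivers the model already samples (placeholder bounds).
expected_drop_in_revenue = rng.triangular(100_000, 300_000, 500_000, n)
achievable_drop_in_revenue = (
    rng.triangular(150, 250, 350, n)        # drop-in rate per hour (placeholder)
    * rng.triangular(400, 900, 1_400, n)    # billable open-studio hours (placeholder)
)

# Anti-pattern: sample the shortfall as an independent input. This double-counts
# the underlying risk and allows incoherent runs where a shortfall appears even
# though capacity covers the target.
independent_shortfall = rng.triangular(0, 158_000, 350_000, n)

# Rule: derive the coupled stressor from the drivers it depends on.
derived_shortfall = np.maximum(0.0, expected_drop_in_revenue - achievable_drop_in_revenue)
```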
ChatGPT review of v29: 'utility_surcharge_cost_share_trigger is extracted as a key value but still unused by any formula. It may be worth keeping as context, but if you want strict no-dead-end-variables, either model it or drop it.' The same anti-pattern applied to instructor_fte_equivalent (extracted as a background fact, never multiplied into a cost or capacity formula).

Added a 'No dead-end variables' rule to both extractor prompts. Every key_value and missing_values entry must feed at least one calculation, directly or transitively. Before emitting JSON, walk the variables and confirm each reaches a recommended_first_calc or derived_question output. Common dead-end patterns called out:
- a threshold/trigger value extracted without the variable it would be tested against and a margin calculation comparing them;
- a capacity/FTE/staffing constant extracted as background detail;
- a percentage target the plan mentions but the model has no way to evaluate.

Fix, in priority order: either model it (for triggers, the natural form is '<x>_margin = actual_share - threshold_share', adding actual_share to missing_values if absent) or drop it. The rule closes with: 'It is better to return six well-connected key_values than eight where two are dead-ends. The caps are a ceiling, not a target.'

Regenerated v30/20260215_nuuk_clay_workshop locally (gitignored output tree). Dropped utility_surcharge_cost_share_trigger and instructor_fte_equivalent from key_values; six well-connected key_values remain. Threshold probabilities and sensitivity rankings are identical to v29: those variables were genuinely unused, so dropping them changes the report shape, not the model. Also added a programmatic dead-end check that walks the depends_on graph; zero dead-ends in v30. Smoke 8/8, unittest 45/45.
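A minimal sketch of what such a dead-end check could look like; the data shapes below (lists of dicts with name, output_name, and depends_on keys) are assumptions, not the exact schema:

```python
def find_dead_ends(key_values, missing_values, calculations):
    """Sketch of a dead-end check: walk depends_on backwards from every
    calculation output and report extracted variables that are never reached."""
    by_output = {c["output_name"]: c.get("depends_on", []) for c in calculations}
    reached = set()
    stack = list(by_output)                    # start from every calculation output
    while stack:
        name = stack.pop()
        if name in reached:
            continue
        reached.add(name)
        stack.extend(by_output.get(name, []))  # descend into that output's dependencies
    extracted = {v["name"] for v in key_values} | {v["name"] for v in missing_values}
    return sorted(extracted - reached)         # variables that feed nothing are dead-ends
```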
ChatGPT review of v30: 'The modelling frame now says the contingency buffers against labor-law and speculative-rental shocks, dropping utility from v29. That is accurate for the current executable model, but it means the extracted model no longer represents the utility-surcharge risk.' Verdict: 'fine if you intentionally decided utility was second-pass noise'. So no defect, but a real prompt-rule gap: when the No-dead-end rule causes a risk concept to be dropped from the variable set, the modelling_frame text must drop the same concept. Otherwise the frame promises coverage the simulation cannot deliver, and the insights report inherits a discrepancy between the framing and the calculations.

Added a small rule to both extractor prompts: modelling_frame describes the scope of what the calculations actually evaluate, not the full plan as written. Concrete pattern stated: 'if the frame says "buffers against A, B, and C shocks" and you only retained calculations that test A and B, the frame must say buffers against A and B shocks. Do not paper over the gap by keeping the dropped concept in the prose.' Smoke 8/8, unittest 45/45.
The original 0516 doc captured the morning's compression and parameter-extraction work and noted that the four downstream stages were untouched. The two follow-on sessions (PR #705 and PR #706) covered exactly those, so the doc was stale by 21:00. Re-snapshotted as end-of-day. It now covers:
- Stage 7 lifted to deterministic Python (run_monte_carlo.py);
- strict schema across the four LLM stages (sampling_discipline, non_negative, default_pass_probability, output_name, output_unit) with no fallback path;
- the summarize-insights skill emitting insights.md with project-manager-facing language and DOOM/FRAGILE/MARGINAL/ROBUST verdict bands;
- the test-napkin-math skill with its eight-check smoke runner;
- 44 CI unittest cases, including three regression tests for bugs surfaced during the work (np.triangular crash on low==high, np.std FP false-negative, currency allowlist hardcode);
- extract-parameters renamed to extract-parameters-from-full;
- the four prompt rules accumulated in PR #706 across v27 -> v28 -> v29 -> v30 -> v31, driven by four rounds of ChatGPT review of the Nuuk plan.

The outstanding-issues section is trimmed: the 0512 'bounds-skip-but-no-calc dependency black hole' and 'dead-end bounded inputs' are addressed (by strict-schema enforcement and the No-dead-end-variables prompt rule, respectively); 'currency support codification' is addressed (data-driven currency discovery, no allowlist); 'inf/heavy-tail explanation' is partially addressed (insights surfaces blank runs but does not yet attribute a cause). The single biggest open issue is cross-plan validation: every prompt rule was validated against the Nuuk case only, in one currency and one domain. The 'compressed-vs-full-HTML head-to-head' comparison the whole experiment exists to support has still not been run.
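Of the three regression tests mentioned, the np.triangular one is the most mechanical; a minimal sketch of the kind of guard it exercises (the real sampler in run_monte_carlo.py may be structured differently):

```python
import numpy as np

def sample_triangular(low: float, mode: float, high: float, n: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Triangular sampling that tolerates a degenerate low == high bound,
    which NumPy's triangular sampler rejects with a ValueError."""
    if low == high:
        return np.full(n, low)              # point estimate: constant samples
    return rng.triangular(low, mode, high, n)
```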
Summary
ChatGPT reviewed the v27 Nuuk plan run and the verdict was "the best version so far": pipeline closed end-to-end, Monte Carlo runs clean with zero warnings, validation passes, scenarios produce useful conclusions. One thing was flagged:

rental_revenue_gap_dkk is directionally confusing. The output is tested with <= 0 ("no gap"), but the reader has to re-derive which sign is the good one. Suggestion: rename to drop_in_capacity_surplus_dkk with threshold >= 0.

This PR generalises that lesson into a new prompt rule rather than hand-patching one entry.
What changed
Both extractor skill prompts (extract-parameters-from-full and extract-parameters-from-digest) gain a Threshold-friendly output naming section:
- Prefer _surplus, _buffer, _margin, _coverage.
- Avoid _gap (direction ambiguous), _deficit (implies always bad), and _shortfall (allowed only when the formula matches the name).
- Flip the formula sign instead of emitting a confusingly named output: rather than x_gap = expected - capacity with threshold <= 0, write x_capacity_surplus = capacity - expected with threshold >= 0.

Also fixed two pre-existing conflicting examples in the same prompts (revenue_gap = required - expected, funding_gap = required - available) so the "Preferred patterns" list and the "Commercial gate execution rule" no longer contradict the new naming rule.

Demonstration
Regenerated output/v28/20260215_nuuk_clay_workshop/ locally (the output/ tree is gitignored, consistent with v24–v27). Same plan, same digest, applying the new rule by hand as a future LLM run would:
- rental_revenue_gap_dkk (threshold <= 0) → drop_in_capacity_surplus_dkk (threshold >= 0)
- formula flipped to (rate * hours) - expected_revenue instead of expected_revenue - (rate * hours)

Headline shift in insights.md:
- P(rental_revenue_gap_dkk <= 0) = 0.14% (DOOM)
- P(drop_in_capacity_surplus_dkk >= 0) = 0.14% (DOOM)

Same number, plain-English reading.
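For illustration, the renamed entry might look roughly like this in the parameters JSON (shown as a Python dict; the field names beyond output_name and output_unit, and the rate variable name, are guesses rather than the exact schema):

```python
# Hypothetical shape of the renamed v28 calculation entry.
drop_in_capacity_surplus = {
    "output_name": "drop_in_capacity_surplus_dkk",
    "output_unit": "DKK",
    # Sign flipped so that positive = pass: drop-in capacity minus expected revenue.
    "formula": "(drop_in_rate_dkk_per_hour * billable_open_studio_hours_year1)"
               " - expected_drop_in_revenue_dkk",
    "threshold": ">= 0",
}
```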
Test plan
- Smoke runner 8/8 (tests/run_smoke.py)
- unittest 45/45 (tests/test_run_monte_carlo.py)

🤖 Generated with Claude Code