feat(napkin-math): teach extractors threshold-friendly output naming #706
Merged
Conversation
ChatGPT review of the v27 Nuuk run flagged one thing in the parameters JSON: rental_revenue_gap_dkk had a >= 0 / <= 0 ambiguity. The output is meant to be tested with the threshold operator <= 0 ("no gap"), but a reader has to re-derive which sign is good every time they read the name. ChatGPT proposed renaming to drop_in_capacity_surplus_dkk with threshold >= 0, which reads naturally.
Generalised the lesson into a new prompt rule rather than hand-patching one entry. Both extractors (extract-parameters-from-full and extract-parameters-from-digest) now carry a Threshold-friendly output naming section: when a calculation is meant to be compared against a threshold, name it so that positive = pass. Prefer _surplus / _buffer / _margin / _coverage; avoid _gap (direction ambiguous) and _deficit (implies always bad); use _shortfall only when the formula matches the name. The rule says explicitly to flip the formula sign rather than emit a confusingly named output.
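As an illustration (not the extractor's actual output; the names are placeholders), the flip the rule asks for looks like this:

```python
# Hypothetical example of the naming rule; variable names are placeholders.

# Avoid: a threshold-tested output whose "good" sign the reader must re-derive.
def revenue_gap(expected_revenue: float, achievable_revenue: float) -> float:
    # Tested with <= 0 ("no gap"): pass when the number is negative or zero.
    return expected_revenue - achievable_revenue

# Prefer: flip the formula sign so that positive = pass, and name it accordingly.
def capacity_surplus(expected_revenue: float, achievable_revenue: float) -> float:
    # Tested with >= 0: pass when the number is positive or zero.
    return achievable_revenue - expected_revenue
```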
Also fixed the two pre-existing conflicting examples in the same prompts (revenue_gap = required_revenue - expected_revenue, funding_gap = required_funding - available_cash) to use the surplus framing. The Preferred patterns list and the Commercial gate execution rule now agree with the new naming rule.
Smoke 8/8, unittest 45/45. Regenerated v28/20260215_nuuk_clay_workshop locally (gitignored output tree) using the renamed metric: insights.md headline switches from 'P(rental_revenue_gap_dkk <= 0) = 0.14%' to 'P(drop_in_capacity_surplus_dkk >= 0) = 0.14%'. Same finding; reads as plain English.
ChatGPT review of v28: 'rental_revenue_shortfall_dkk is still modelled as an independent sampled input while expected_drop_in_revenue_dkk and drop_in_capacity_surplus_dkk are also derived from revenue/share/hours. For a cleaner causal model, derive shortfall from capacity.'

Added a 'Derive coupled stressors instead of sampling them independently' section to both extractor prompts. Concrete signs of the anti-pattern: a missing_values entry whose why_needed references another modelled variable by name; a suggested_estimation_method that is effectively a formula; an aggregate test that subtracts both an independent stressor and the inputs the stressor depends on. Fix: convert the entry to a recommended_first_calculation with max(0, expected - achievable), or derive it from an existing surplus calculation (sketched at the end of this comment). The rule also explains why this matters: independent sampling double-counts the underlying risk, allows physically incoherent simulation runs (a shortfall in a scenario where capacity covers the target), and weakens the sensitivity analysis.

Demonstrated locally with v29/20260215_nuuk_clay_workshop (gitignored output tree). v28 -> v29 substance changes (same digest, applying the new rule):
- rental_revenue_shortfall_dkk moves from missing_values_to_estimate to recommended_first_calculations with formula max(0, expected_drop_in_revenue_dkk - rate*hours), and is removed from bounds.json.
- Deterministic combined_viability_surplus drops in every scenario (low: +160k -> +76k; base: -119k -> -287k; high: -470k -> -626k) because the previously assumed bounds (0/158k/350k) under-estimated the actual derived shortfall (84k/326k/506k).
- P(combined_viability_surplus >= 0): 5.27% -> 1.09%. The Monte Carlo no longer benefits from the artificial independence between the shortfall and the capacity that produces it.
- Sensitivity drivers of combined_viability shift from the tautological rental_revenue_shortfall_dkk (r=-0.87 in v28) to the true upstream drivers: revenue_mix_drop_in_share (-0.77), year1_revenue_target_dkk (-0.43), billable_open_studio_hours_year1 (+0.28). The simulation now respects the causal link.

Smoke 8/8, unittest 45/45.
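To make the anti-pattern and the fix concrete, here is a rough sketch with made-up bounds and a plain NumPy sampling loop; it is not the pipeline's actual Monte Carlo code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Upstream drivers the model already samples (placeholder bounds).
expected_drop_in_revenue = rng.triangular(100_000, 300_000, 500_000, n)
achievable_drop_in_revenue = (
    rng.triangular(150, 250, 350, n)        # drop-in rate per hour (placeholder)
    * rng.triangular(400, 900, 1_400, n)    # billable open-studio hours (placeholder)
)

# Anti-pattern: sample the shortfall as an independent input. This double-counts
# the underlying risk and allows incoherent runs where a shortfall appears even
# though capacity covers the target.
independent_shortfall = rng.triangular(0, 158_000, 350_000, n)

# Rule: derive the coupled stressor from the drivers it depends on.
derived_shortfall = np.maximum(0.0, expected_drop_in_revenue - achievable_drop_in_revenue)
```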
ChatGPT review of v29: 'utility_surcharge_cost_share_trigger is extracted as a key value but still unused by any formula. It may be worth keeping as context, but if you want strict no-dead-end-variables, either model it or drop it.' The same anti-pattern applied to instructor_fte_equivalent (extracted as a background fact, never multiplied into a cost or capacity formula).

Added a 'No dead-end variables' rule to both extractor prompts. Every key_value and missing_values entry must feed at least one calculation, directly or transitively. Before emitting JSON, walk the variables and confirm each reaches a recommended_first_calc or derived_question output. Common dead-end patterns called out:
- a threshold/trigger value extracted without the variable it would be tested against and a margin calculation comparing them;
- a capacity/FTE/staffing constant extracted as background detail;
- a percentage target the plan mentions but the model has no way to evaluate.

Fix, in priority order: either model it (for triggers, the natural form is '<x>_margin = actual_share - threshold_share', adding actual_share to missing_values if absent) or drop it. The rule closes with: 'It is better to return six well-connected key_values than eight where two are dead-ends. The caps are a ceiling, not a target.'

Regenerated v30/20260215_nuuk_clay_workshop locally (gitignored output tree). Dropped utility_surcharge_cost_share_trigger and instructor_fte_equivalent from key_values; six well-connected key_values remain. Threshold probabilities and sensitivity rankings are identical to v29: those variables were genuinely unused, so dropping them changes the report shape, not the model. Also added a programmatic dead-end check that walks the depends_on graph; zero dead-ends in v30. Smoke 8/8, unittest 45/45.
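A minimal sketch of what such a dead-end check could look like; the data shapes below (lists of dicts with name, output_name, and depends_on keys) are assumptions, not the exact schema:

```python
def find_dead_ends(key_values, missing_values, calculations):
    """Sketch of a dead-end check: walk depends_on backwards from every
    calculation output and report extracted variables that are never reached."""
    by_output = {c["output_name"]: c.get("depends_on", []) for c in calculations}
    reached = set()
    stack = list(by_output)                    # start from every calculation output
    while stack:
        name = stack.pop()
        if name in reached:
            continue
        reached.add(name)
        stack.extend(by_output.get(name, []))  # descend into that output's dependencies
    extracted = {v["name"] for v in key_values} | {v["name"] for v in missing_values}
    return sorted(extracted - reached)         # variables that feed nothing are dead-ends
```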
ChatGPT review of v30: 'The modelling frame now says the contingency buffers against labor-law and speculative-rental shocks, dropping utility from v29. That is accurate for the current executable model, but it means the extracted model no longer represents the utility-surcharge risk.' Verdict: 'fine if you intentionally decided utility was second-pass noise'. So no defect, but a real prompt-rule gap: when the No-dead-end rule causes a risk concept to be dropped from the variable set, the modelling_frame text must drop the same concept. Otherwise the frame promises coverage the simulation cannot deliver, and the insights report inherits a discrepancy between the framing and the calculations.

Added a small rule to both extractor prompts: modelling_frame describes the scope of what the calculations actually evaluate, not the full plan as written. Concrete pattern stated: 'if the frame says "buffers against A, B, and C shocks" and you only retained calculations that test A and B, the frame must say buffers against A and B shocks. Do not paper over the gap by keeping the dropped concept in the prose.' Smoke 8/8, unittest 45/45.
The original 0516 doc captured the morning's compression and parameter-extraction work and noted that the four downstream stages were untouched. The two follow-on sessions (PR #705 and PR #706) covered exactly those, so the doc was stale by 21:00. Re-snapshotted as end-of-day. It now covers:
- Stage 7 lifted to deterministic Python (run_monte_carlo.py);
- strict schema across the four LLM stages (sampling_discipline, non_negative, default_pass_probability, output_name, output_unit) with no fallback path;
- the summarize-insights skill emitting insights.md with project-manager-facing language and DOOM/FRAGILE/MARGINAL/ROBUST verdict bands;
- the test-napkin-math skill with its eight-check smoke runner;
- 44 CI unittest cases, including three regression tests for bugs surfaced during the work (np.triangular crash on low==high, np.std FP false-negative, currency allowlist hardcode);
- extract-parameters renamed to extract-parameters-from-full;
- the four prompt rules accumulated in PR #706 across v27 -> v28 -> v29 -> v30 -> v31, driven by four rounds of ChatGPT review of the Nuuk plan.

The outstanding-issues section is trimmed: the 0512 'bounds-skip-but-no-calc dependency black hole' and 'dead-end bounded inputs' are addressed (by strict-schema enforcement and the No-dead-end-variables prompt rule, respectively); 'currency support codification' is addressed (data-driven currency discovery, no allowlist); 'inf/heavy-tail explanation' is partially addressed (insights surfaces blank runs but does not yet attribute a cause). The single biggest open issue is cross-plan validation: every prompt rule was validated against the Nuuk case only, in one currency and one domain. The 'compressed-vs-full-HTML head-to-head' comparison the whole experiment exists to support has still not been run.
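Of the three regression tests mentioned, the np.triangular one is the most mechanical; a minimal sketch of the kind of guard it exercises (the real sampler in run_monte_carlo.py may be structured differently):

```python
import numpy as np

def sample_triangular(low: float, mode: float, high: float, n: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Triangular sampling that tolerates a degenerate low == high bound,
    which NumPy's triangular sampler rejects with a ValueError."""
    if low == high:
        return np.full(n, low)              # point estimate: constant samples
    return rng.triangular(low, mode, high, n)
```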
Summary
ChatGPT reviewed the v27 Nuuk plan run and the verdict was "the best version so far": pipeline closed end-to-end, Monte Carlo runs clean with zero warnings, validation passes, scenarios produce useful conclusions. One thing was flagged:

rental_revenue_gap_dkk is directionally confusing. The output is tested with <= 0 ("no gap"), but the reader has to re-derive which sign is the good one. Suggestion: rename to drop_in_capacity_surplus_dkk with threshold >= 0.

This PR generalises that lesson into a new prompt rule rather than hand-patching one entry.
What changed
Both extractor skill prompts (extract-parameters-from-full and extract-parameters-from-digest) gain a Threshold-friendly output naming section:
- Prefer _surplus, _buffer, _margin, _coverage.
- Avoid _gap (direction ambiguous), _deficit (implies always bad), and _shortfall (allowed only when the formula matches the name).
- Flip the formula sign instead of emitting a confusingly named output: rather than x_gap = expected - capacity with threshold <= 0, write x_capacity_surplus = capacity - expected with threshold >= 0.

Also fixed two pre-existing conflicting examples in the same prompts (revenue_gap = required - expected, funding_gap = required - available) so the "Preferred patterns" list and the "Commercial gate execution rule" no longer contradict the new naming rule.

Demonstration
Regenerated output/v28/20260215_nuuk_clay_workshop/ locally (the output/ tree is gitignored, consistent with v24–v27). Same plan, same digest, applying the new rule by hand as a future LLM run would:
- rental_revenue_gap_dkk (threshold <= 0) → drop_in_capacity_surplus_dkk (threshold >= 0)
- formula flipped to (rate * hours) - expected_revenue instead of expected_revenue - (rate * hours)

Headline shift in insights.md:
- P(rental_revenue_gap_dkk <= 0) = 0.14% (DOOM)
- P(drop_in_capacity_surplus_dkk >= 0) = 0.14% (DOOM)

Same number, plain-English reading.
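For illustration, the renamed entry might look roughly like this in the parameters JSON (shown as a Python dict; the field names beyond output_name and output_unit, and the rate variable name, are guesses rather than the exact schema):

```python
# Hypothetical shape of the renamed v28 calculation entry.
drop_in_capacity_surplus = {
    "output_name": "drop_in_capacity_surplus_dkk",
    "output_unit": "DKK",
    # Sign flipped so that positive = pass: drop-in capacity minus expected revenue.
    "formula": "(drop_in_rate_dkk_per_hour * billable_open_studio_hours_year1)"
               " - expected_drop_in_revenue_dkk",
    "threshold": ">= 0",
}
```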
Test plan
- Smoke runner 8/8 (tests/run_smoke.py)
- unittest 45/45 (tests/test_run_monte_carlo.py)

🤖 Generated with Claude Code