feat(napkin-math): replace LLM-driven monte-carlo skill with Python runner #705

Merged
neoneye merged 12 commits into main from feat/napkin-math-monte-carlo-runner on May 16, 2026

Conversation

neoneye (Member) commented May 16, 2026

Summary

  • The previous monte-carlo skill asked the LLM to internally simulate n_runs draws, compute Pearson correlations, and emit summary stats. That wasn't a simulation — it produced plausible-looking numbers and was non-deterministic regardless of the seed setting.
  • Adds experiments/napkin_math/run_monte_carlo.py: seeded NumPy RNG, distribution rules ported from the old system prompt (triangular/uniform/Bernoulli, integer/fraction/non-neg clamping), dependency-ordered calculation execution via inspect.signature (sketched after this list), threshold operators, Pearson sensitivity (top 5, dependency-filtered, ≥20 finite paired samples), aggregated warnings.
  • SKILL.md becomes a thin wrapper that locates inputs, optionally builds a settings file, and invokes the runner. system-prompt.txt is deleted — the rules now live in code instead of drifting alongside it.
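A minimal sketch of the dependency-ordered execution idea, assuming each calculation is a plain function whose parameter names name its dependencies (function and variable names here are illustrative, not the runner's actual API):

```python
import inspect

def run_calculations(calcs, samples):
    # calcs: output_name -> function; samples: input_name -> sampled value(s).
    # A calculation runs once every name in its signature has been resolved.
    resolved = dict(samples)
    pending = dict(calcs)
    while pending:
        progressed = False
        for name, fn in list(pending.items()):
            deps = inspect.signature(fn).parameters
            if all(d in resolved for d in deps):
                resolved[name] = fn(**{d: resolved[d] for d in deps})
                del pending[name]
                progressed = True
        if not progressed:
            # Cycle or missing input: surface one aggregated warning, not one per run.
            raise ValueError(f"unresolved dependencies: {sorted(pending)}")
    return resolved
```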

Why

LLMs cannot run Monte Carlo. They can describe distributions, but they cannot sample 10k correlated draws and compute Pearson coefficients in-prompt. The same physical limitation applies to any pipeline step that requires real computation (this likely affects run-scenarios too, as a follow-up).

Test plan

  • Smoke test against /Users/neoneye/git/neoneye_lab/planexe_simulator/output/v23/20260215_nuuk_clay_workshop/ (7 outputs, 10k runs, ~1s)
  • Determinism: two runs with seed=12345 produce byte-identical JSON (diff clean)
  • Threshold computation: rental_revenue_gap_dkk >= 0 → 99.94% pass, labor_law_shock_dkk <= 250000 → 40.86% pass (mechanics sketched after this list)
  • Sensitivity: identifies hourly_rental_revenue_share (r=0.87) as dominant driver of expected_hourly_rental_revenue_dkk
  • Aggregated warnings: unresolved-dependency cascade produces 2 warnings, not 20000
  • Reviewer to confirm no run-scenarios-equivalent regression (the runner doesn't touch run-scenarios)
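For the threshold rows above, the pass percentage is just the share of finite draws that satisfy the declared operator. A minimal sketch with invented draws (the real numbers come from the plan's sampled inputs):

```python
import numpy as np

rng = np.random.default_rng(12345)
samples = rng.triangular(-50_000, 120_000, 400_000, size=10_000)  # illustrative only

finite = samples[np.isfinite(samples)]
pass_pct = 100.0 * np.mean(finite >= 0)  # the ">= 0" threshold operator
print(f"rental_revenue_gap_dkk >= 0 -> {pass_pct:.2f}% pass")
```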

🤖 Generated with Claude Code

neoneye added 12 commits May 16, 2026 13:51
…unner

The previous skill asked the LLM to internally simulate n_runs draws, run Pearson correlations, and emit summary stats. That isn't a simulation — it yields plausible-looking numbers that are non-deterministic across calls regardless of the seed setting.

Adds run_monte_carlo.py: seeded NumPy RNG, triangular/uniform/Bernoulli sampling per the original distribution rules, dependency-ordered calculation execution via inspect.signature, threshold operators, Pearson sensitivity (top 5, dependency-filtered, >=20 finite paired samples), aggregated warnings. Same seed -> byte-identical montecarlo.json.

SKILL.md becomes a thin wrapper that locates inputs, optionally builds a settings file, and invokes the runner. system-prompt.txt is deleted; the rules now live in code instead of drifting alongside it.
…rom input data

PlanExe users submit plans from anywhere — the previous 11-currency allowlist (eur/usd/gbp/dkk/nok/sek/isk/chf/jpy/cny/inr) silently mislabeled outputs for projects in IDR, BRL, KES, ZAR, etc.

Two changes:

1. is_monetary_unit removed. Bernoulli gate detection now relies only on the structural triple (low==0, base==high, rationale mentions binary/gate/release/tranche/pass/fail/withheld/conditional) — which is the actual semantic signal. Side benefit: non-monetary gates (permit toggles, regulatory pass/fail) are now detected too.

2. UNIT_INFERENCE_RULES no longer has a currency suffix allowlist. Instead, discover_currency_codes() scans declared key_values + missing_values_to_estimate units for 3-letter uppercase ISO-4217-style tokens and uses whatever the project actually declared. Verified: a synthetic INR project now produces 'INR' for derived outputs without code changes.
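A sketch of what discover_currency_codes() plausibly does (the function name is from this commit; the body is a reconstruction, not the actual implementation):

```python
import re

ISO_4217_STYLE = re.compile(r"\b[A-Z]{3}\b")

def discover_currency_codes(declared_units):
    # Collect 3-letter uppercase tokens from the units the project actually
    # declared, instead of consulting a hardcoded allowlist.
    return {tok for unit in declared_units for tok in ISO_4217_STYLE.findall(unit)}

discover_currency_codes(["INR", "INR/hour", "fraction"])  # -> {"INR"}
```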
…e LLM stages

The Monte Carlo runner previously held fragile pattern-matching for sampling discipline (English unit-token allowlists for integer/fraction/non-negative), Bernoulli-gate detection (English keyword scan over rationale text), and output-unit inference. These decisions are semantic and belong to the upstream LLM, not to runtime pattern matching.

Schema additions:

- bounds.json entries gain required sampling_discipline (one of fixed|bernoulli_gate|integer|fraction|continuous), required non_negative bool, and default_pass_probability (number in [0,1] when bernoulli_gate, null otherwise) — illustrated after this list.

- recommended_first_calculations and derived_questions entries with non-null formula_hint gain required output_name (snake_case id of the computed value) and output_unit (unit string).
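An illustrative bounds.json entry exercising the new required fields from the first item above (the entry name and values are invented):

```json
{
  "name": "conditional_grant_dkk",
  "low": 0,
  "base": 250000,
  "high": 250000,
  "sampling_discipline": "bernoulli_gate",
  "non_negative": true,
  "default_pass_probability": 0.6
}
```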

Skills updated to emit and consume the new fields: extract-parameters, extract-parameters-from-digest, generate-bounds, generate-calculations, run-scenarios. The lexical unit-inference rule table is gone from run-scenarios; the rationale-keyword Bernoulli heuristic is gone from generate-bounds.

Runner refactor: drops is_integer_unit, is_fraction_unit, is_bernoulli_gate, NON_CURRENCY_UNIT_RULES, discover_currency_codes, infer_unit, and formula_hint LHS parsing. Replaced with strict reads of the declared fields. Missing fields cause SCHEMA ERROR exit code 2 with a message naming the upstream stage to re-run — no fallback path.
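The fail-fast read might look like this (message wording and helper name are illustrative; the exit code 2 and the named-stage re-run hint are from this commit):

```python
import sys

REQUIRED_BOUNDS_FIELDS = ("sampling_discipline", "non_negative", "default_pass_probability")

def read_bounds_entry(entry):
    # No inference, no fallback: a missing field means the upstream stage
    # must be re-run with the updated prompt.
    for field in REQUIRED_BOUNDS_FIELDS:
        if field not in entry:
            print(f"SCHEMA ERROR: bounds entry {entry.get('name', '?')!r} is missing "
                  f"{field!r} - re-run generate-bounds", file=sys.stderr)
            sys.exit(2)
    return entry
```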

Smoke test: experiments/napkin_math/tests/fixtures/smoke_v2/ exercises every sampling discipline (fixed, bernoulli_gate at p=0.6, integer, fraction, continuous) on an INR-denominated synthetic plan. Two runs with seed 12345 produce byte-identical output. Bernoulli arithmetic checks: total_budget_with_gate_inr mean = 1,150,525 vs expected 1,000,000 + 0.6*250,000 = 1,150,000.
…unner

tests/run_smoke.py drives 7 checks (15 assertions): end-to-end runner output, determinism, Bernoulli arithmetic spot-check, sensitivity ranking, fail-fast schema validation for every required field, prepare_extract_input import sanity, and the compress_report_section pytest suite. Exits 0 on full pass, 1 on any failure.

test-napkin-math SKILL.md is a thin wrapper that invokes the script and reports the one-line summary. Intended to be run after any change under experiments/napkin_math/ or to the upstream skill prompts that share the artifact schema.
Covers:

- settings merging (n_runs clamping, seed/distribution validation)
- each of the five sampling disciplines, including Bernoulli p=0 / p=1 / override edge cases
- schema validation for every required field
- calculation execution failure modes (missing fn, exception, NaN return, multi-stage dependency)
- sensitivity ranking invariants (single-input perfect correlation, all-fixed empty, unrelated-input excluded)
- threshold operators (each of >, >=, <, <=, ==, !=, unsupported, unknown output)
- determinism (same/different seeds)
- unit propagation (verbatim, outputs_of_interest filtering and unknown-output warning)
- degenerate inputs (empty plan, low==base==high continuous)
- distribution_default uniform-vs-triangular variance ordering

Tests use unittest.TestCase so the repo's python test.py discovers them. Required adding __init__.py to experiments/, experiments/napkin_math/, and experiments/napkin_math/tests/ — the package chain is what unittest.discover walks.
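For reference, the discovery mechanics that the package chain serves (a generic sketch, not the repo's actual test.py):

```python
import unittest

# unittest discovery only descends into importable packages, so every directory
# between the start dir and the tests needs an __init__.py.
suite = unittest.TestLoader().discover(start_dir=".", pattern="test*.py")
unittest.TextTestRunner().run(suite)
```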

The test_low_equals_high_degenerate_continuous case caught a real bug: rng.triangular raises ValueError when low == high, and the existing inner try/except didn't recover. Fixed in sample_one by returning low directly when low == high regardless of discipline.
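A simplified version of the guard (the real sample_one also dispatches on the other disciplines):

```python
import numpy as np

rng = np.random.default_rng(12345)

def sample_one(low, base, high, n):
    # rng.triangular raises ValueError when low == high, so short-circuit the
    # degenerate range and return the constant regardless of discipline.
    if low == high:
        return np.full(n, low)
    return rng.triangular(low, base, high, size=n)
```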

Verified: 44 napkin tests show up in repo-wide unittest discovery (367 total vs 323 before), all pass under worker_plan/.venv (Python 3.13 + numpy 2.4).
…nd Python Monte Carlo runner

Populated v27/20260215_nuuk_clay_workshop end-to-end as a third reference plan (alongside Leipzig + Faraday) and recorded what was unclear:

1. README only mentioned extract-parameters; it never said extract-parameters-from-digest exists or how it relates. Added both extractors to the pipeline diagram and explained why the digest variant exists (large reports vs the smaller compressed bundle from prepare_extract_input.py).

2. Stage 1 key_value example was missing the now-required output_name and output_unit fields. Added them with explanation that downstream consumers read them directly and never re-parse formula_hint.

3. Bounds example had no sampling_discipline, non_negative, or default_pass_probability fields. Replaced the example with one that exercises both fraction and bernoulli_gate disciplines, plus a table of required fields and their semantics.

4. Stage 7 implied LLM-driven simulation. Clarified that monte-carlo is the only stage with a Python runner (run_monte_carlo.py) — the LLM cannot sample 10k correlated draws in-prompt.

5. depends_on rule was silent on cross-entry references via output_name. Added the 'id == output_name' convention for recommended_first_calculations and derived_questions so chained calculations resolve (illustrated after this list).

6. Current status only listed two plans. Added Nuuk (DKK, digest variant) as the third reference, noted the smoke fixture and CI tests.
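An illustrative chained pair under the 'id == output_name' convention from item 5 (names, formulas, and the exact entry shape are invented for illustration):

```json
[
  {
    "id": "monthly_rental_revenue_dkk",
    "output_name": "monthly_rental_revenue_dkk",
    "output_unit": "DKK",
    "formula_hint": "hourly_rate_dkk * rented_hours_per_month",
    "depends_on": ["hourly_rate_dkk", "rented_hours_per_month"]
  },
  {
    "id": "annual_rental_revenue_dkk",
    "output_name": "annual_rental_revenue_dkk",
    "output_unit": "DKK",
    "formula_hint": "monthly_rental_revenue_dkk * 12",
    "depends_on": ["monthly_rental_revenue_dkk"]
  }
]
```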
…ameters-from-full

Pairs cleanly with extract-parameters-from-digest. The old bare name was ambiguous once the digest variant existed — both skills produce the same output schema, they just differ in input format (raw HTML vs the pre-compressed digest).

Mechanical rename: git mv on the skill directory, then a Python sweep replacing 'extract-parameters' with 'extract-parameters-from-full' across 20 files, with a negative lookahead so 'extract-parameters-from-digest' is untouched. 61 replacements.
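The sweep's core pattern presumably resembles this (a sketch; the actual script and the 20-file set are not shown here):

```python
import re
from pathlib import Path

# Negative lookahead: matches the bare 'extract-parameters' but leaves
# 'extract-parameters-from-digest' (and the new '-from-full') untouched,
# which also makes re-running the sweep a no-op.
PATTERN = re.compile(r"extract-parameters(?!-from)")

for path in Path(".").rglob("*.md"):  # illustrative; the real sweep covered mixed file types
    text = path.read_text()
    new_text, n = PATTERN.subn("extract-parameters-from-full", text)
    if n:
        path.write_text(new_text)
        print(f"{path}: {n} replacement(s)")
```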

Verified: 7/7 smoke checks ALL GREEN, 44 unittest cases pass, frontmatter name field updated to match directory name.
… verdicts

summarize_insights.py reads any subset of the four pipeline artifacts (parameters/bounds/scenarios/montecarlo) and writes insights.md next to them. Verdicts come entirely from user-declared thresholds: pass probability >= 80% ROBUST, >= 50% MARGINAL, >= 20% FRAGILE, < 20% DOOM. No identifier-string or unit-string interpretation — the script does not invent thresholds or judge outputs the user didn't mark.
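The banding reduces to a four-way comparison; a minimal sketch (function name illustrative, bands as declared above):

```python
def verdict(pass_probability):
    # Bands come entirely from the user-declared thresholds.
    if pass_probability >= 0.80:
        return "ROBUST"
    if pass_probability >= 0.50:
        return "MARGINAL"
    if pass_probability >= 0.20:
        return "FRAGILE"
    return "DOOM"

verdict(0.9994)  # -> "ROBUST"
verdict(0.4086)  # -> "FRAGILE"
```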

Output sections: plan summary, threshold verdicts with a 'bottom line: doom signals' callout, Monte Carlo p05/p50/p95 table with model-collapse callouts, Pearson sensitivity drivers (top 3 per output with direction arrows), deterministic scenarios from scenarios.json with scenario warnings, and missing-data status (bounded vs unbounded). Sections are skipped gracefully if the corresponding artifact is absent.

Caught and fixed a runner bug along the way: np.std of a constant array returns ~2.8e-17 for some values (e.g. 0.15) due to floating-point precision, so the 'np.std == 0' check failed to filter constant inputs out of sensitivity rankings. Switched to np.ptp (peak-to-peak), which is exactly 0 for any constant array. Added test_fixed_input_with_fp_artifact_value_excluded_from_sensitivity as a regression test.
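A minimal reproduction of the failure mode and the fix (the exact std artifact is platform-dependent):

```python
import numpy as np

constant = np.full(10_000, 0.15)

print(np.std(constant))  # tiny nonzero FP artifact (~2.8e-17), not exactly 0
print(np.ptp(constant))  # peak-to-peak is exactly 0.0 for any constant array

# So the constant-input filter keys on ptp, not std:
if np.ptp(constant) == 0:
    print("constant input: excluded from sensitivity ranking")
```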

Smoke suite now 8/8 (added end-to-end check for the summarizer against the synthetic fixture). unittest count 45/45 across the runner.
…rop jargon

The previous output explained the 'missing' column with 'a calculation that produced NaN/Infinity in some runs — usually a divide-by-zero or an unresolved dependency'. Project managers and non-developers don't read code, so words like NaN, Infinity, divide-by-zero, and unresolved dependency belong in commit messages and code comments, not in the artifact a stakeholder will skim.

Re-wrote every section's intro and column headers in plain English:

- 'Range of outcomes' (was 'Monte Carlo distributions')
- 'Headline verdicts' (was 'Threshold verdicts')
- 'Which inputs move the outcome the most' (was 'What drives the uncertainty')
- 'Three hand-picked scenarios' (was 'Deterministic scenarios')
- 'Inputs the plan did not supply' (was 'Missing data flagged by extract-parameters')
- 'Source files' (was 'Inputs')
- percentile columns are now 'Worst (5%) / Typical / Best (95%)' instead of 'p05/p50/p95'
- Pearson 'correlation' is now 'score' on a -1 to +1 scale
- 'Non-finite' / 'model collapse' is now 'blank runs', with a 'Numbers the model could not compute' callout that names the realistic causes (a missing input or a denominator that hit zero) without code jargon

Verdict callouts now lead with 'likely deal-breakers' and 'coin-flip territory' instead of 'doom signals' and 'fragile signals' — the verdict labels themselves stay (the user explicitly asked for DOOM/FRAGILE/MARGINAL/ROBUST) but the headings tell a non-technical reader what they actually mean.

SKILL.md updated to describe the new section names; smoke test 'contains the verdict table' check switched from 'Threshold verdicts' to 'Headline verdicts'.
Prefacing a direct claim with 'honest' implies that the default mode is not honest. Removed. The sentence now starts with the conditional ('If a number that the plan needs to stay positive is already negative in the middle column ...') and reads more directly.

Saved as memory feedback_no_honest_framing so the same hedge does not creep back in elsewhere — also covers frankly / to be fair / candidly.
… the skill

Restructured the report so every signal that the plan does not survive its own assumptions surfaces in a single 'Bad news first' block right after the plan summary. That block consolidates DOOM thresholds (Likely deal-breakers), FRAGILE thresholds (Coin-flip territory), scenario warnings (Already broken in the three-scenario sanity check), high-blank-run outputs (Numbers the model could not compute), and unbounded missing inputs. Items are ordered by severity. If nothing qualifies, the section is omitted — silence is the only acceptable form of good news.

The detail tables that follow no longer carry their own per-section callouts (those were duplicating Bad news first), and the verdict table is now sorted DOOM -> FRAGILE -> MARGINAL -> ROBUST instead of declaration order.

Codified the editorial rules in the skill's SKILL.md so they survive future edits to either the script or the LLM-side reporting. The rules cover: bad news first, no sugar-coating, no sycophancy, ban on hedging phrases (the honest read is / frankly / to be fair / in fairness / candidly / let's be real), and the distinction between hedges-about-data (fine) and hedges-about-the-speaker (not). The rules explicitly apply to both the script's emitted text and any conversational reporting.

These rules previously lived only in my personal claude-code memory; moved into the skill where they belong since they are part of how this skill should communicate.
neoneye merged commit 7fe0fff into main May 16, 2026
3 checks passed
neoneye deleted the feat/napkin-math-monte-carlo-runner branch May 16, 2026 13:35
neoneye added a commit that referenced this pull request May 16, 2026
Original 0516 doc captured the morning's compression + parameter-extraction work and noted the four downstream stages were untouched. The two follow-on sessions (PR #705 and PR #706) covered exactly those, so the doc was stale by 21:00.

Re-snapshotted as end-of-day. Covers: Stage 7 lifted to deterministic Python (run_monte_carlo.py); strict schema across the four LLM stages (sampling_discipline, non_negative, default_pass_probability, output_name, output_unit) with no fallback path; summarize-insights skill emitting insights.md with project-manager-facing language and DOOM/FRAGILE/MARGINAL/ROBUST verdict bands; test-napkin-math skill with eight-check smoke runner; 44 CI unittest cases including three regression tests for bugs surfaced during the work (np.triangular crash on low==high, np.std FP false-negative, currency allowlist hardcode); extract-parameters renamed to extract-parameters-from-full; and the four prompt rules accumulated in PR #706 across v27 -> v28 -> v29 -> v30 -> v31 driven by four rounds of ChatGPT review on the Nuuk plan.

Outstanding-issues section trimmed: the 0512 'bounds-skip-but-no-calc dependency black hole' and 'dead-end bounded inputs' are addressed (by strict-schema enforcement and the No-dead-end variables prompt rule respectively); 'currency support codification' is addressed (data-driven currency discovery, no allowlist); 'inf/heavy-tail explanation' is partially addressed (insights surfaces blank-runs but doesn't attribute cause yet). The single biggest open issue is cross-plan validation — every prompt rule was validated against the Nuuk case in one currency in one domain. The 'compressed-vs-full-HTML head-to-head' comparison the whole experiment is here to support still has not been run.