feat(napkin-math): deterministic Python validator for parameters.json #711

Merged
neoneye merged 8 commits into main from feat/napkin-math-validate-parameters-python
May 17, 2026

Conversation


@neoneye (Member) commented May 17, 2026

Summary

While running the napkin_math pipeline on a new plan (20251026_casino_royale), the existing validate-parameters skill turned out to be unrunnable:

  • It's LLM-driven and written against an earlier schema. It rejects every entry of key_values / derived_questions / recommended_first_calculations because output_name and output_unit count as "extra keys" under its S003/S004/S006 rules.
  • The digest extractor's system prompt requires those fields.
  • The existing v34-v38 validation.json files use a different 16-named-check schema (json_parse, top_level_structure, …, no_dead_end_variables, shared_pool_legitimacy) that no producer in the tree emits. Each pipeline run needed validation.json to be hand-written.

This PR ships a deterministic Python validator that produces the 16-check validation.json the rest of the pipeline already assumes, and replaces the LLM-driven skill with a thin Python wrapper.

Changes

  • experiments/napkin_math/validate_parameters.py — new. Runs the 16 named checks; produces the exact shape summarize_assessment.py consumes; exits 0 on valid / 1 on invalid / 2 on JSON parse failure.
  • .claude/skills/validate-parameters/SKILL.md — rewritten as a Python-wrapper skill (mirrors monte-carlo, summarize-assessment). Documents all 16 checks.
  • .claude/skills/validate-parameters/system-prompt.txt — deleted (obsolete LLM ruleset for an older schema; not what summarize_assessment.py reads anyway).
  • README.md — Stage 2 section gains an "Implementation" note pointing at the script; Validation categories now lists all 16 named checks.
  • tests/run_smoke.py — new check validate_parameters_end_to_end exercises the script against the smoke fixture.
  • tests/fixtures/smoke/{parameters.json,calculations.py} — added a third calculation (annual_instructor_payroll_inr = operating_weeks_per_year * instructor_hourly_rate_inr * 30) so the previously dead-ended operating_weeks_per_year and instructor_hourly_rate_inr variables feed a calc. The smoke fixture is now a clean reference. Smoke updated from "two outputs computed" to "three outputs computed".

Design notes

  • depends_on accepts both ids and output_names. The digest extractor often gives derived questions q_*-style ids whose output_name differs from the id (e.g. Faraday's q_weakest_program_gate has output_name: weakest_program_gate_surplus_eur). depends_on routinely references the output name; the validator treats both as legitimate.
  • threshold_friendly_naming is WARN-level, not ERROR. The validator can't see montecarlo_settings.json, so it can't tell which outputs will actually be threshold-tested. It flags outputs ending in _gap / _deficit / _shortfall for review.
  • shared_pool_legitimacy is a no-op at the structural level. The rule requires reading the source plan's narrative to verify pool legitimacy — it's enforced upstream in the extractor's system prompt. The check stays in checks_performed for completeness; the validator does not invent violations from structure alone.
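The id-or-output_name rule for depends_on can be sketched as a small check. This is a hypothetical simplification of the real validator; the entry shapes are assumed from the description above:

```python
def check_depends_on(entries):
    """Collect every declared id and output_name, then flag any
    depends_on reference that matches neither.

    entries: combined key_values / derived_questions /
    recommended_first_calculations records (shape assumed)."""
    declared = set()
    for entry in entries:
        declared.add(entry["id"])
        if entry.get("output_name"):
            declared.add(entry["output_name"])
    violations = []
    for entry in entries:
        for dep in entry.get("depends_on", []):
            if dep not in declared:
                violations.append((entry["id"], dep))
    return violations
```

Under this sketch, the Faraday case resolves cleanly: a calculation depending on weakest_program_gate_surplus_eur matches the output_name of q_weakest_program_gate even though no entry carries that string as its id.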

Test plan

  • Smoke 9/9, unit 50/50 green.
  • Cross-plan validation across 13 reference plans:
    • v38 datacenter / media_rescue → valid
    • v39 casino_royale → valid
    • v33 india_census / faraday → valid
    • v31 nuuk_clay_workshop → valid
    • v33 cross_border_rail_ticketing → INVALID (1 error: source_text_word_caps, 24 words vs 20-word cap)
    • v31 cross_border_rail_ticketing → INVALID (same)
  • The two source_text_word_caps violations are real schema violations the old LLM validator was lenient about. The current digest extractor produces clean output (every recent reference plan passes).

🤖 Generated with Claude Code

neoneye added 8 commits May 17, 2026 02:51
When running the pipeline on a new plan (Casino Royale), the LLM-driven validate-parameters skill rejected every entry of key_values / derived_questions / recommended_first_calculations because output_name and output_unit are 'extra keys' per its S003/S004/S006 rules. But the digest extractor's system prompt requires those fields, and the existing v34-v38 validation.json files use a different 16-named-check schema with no producer in the tree. The validator was unrunnable, so each pipeline run needed validation.json to be hand-written.

Adds experiments/napkin_math/validate_parameters.py — a deterministic Python validator that runs the 16 named checks the rest of the pipeline assumes (json_parse, top_level_structure, required_fields, array_length_caps, global_id_uniqueness, snake_case_ids, depends_on_declared, formula_rhs_declared, fraction_value_range, comment_word_caps, source_text_word_caps, output_name_present_when_formula_hint, output_unit_present_when_formula_hint, no_dead_end_variables, threshold_friendly_naming, shared_pool_legitimacy). Output shape matches what summarize_assessment.py already consumes. Exit 0 valid / 1 invalid / 2 parse-failed. depends_on accepts both ids and output_names (the digest extractor uses q_*-style ids whose output_name differs from the id, e.g. Faraday's q_weakest_program_gate -> weakest_program_gate_surplus_eur).
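The exit-code contract (0 valid / 1 invalid / 2 parse-failed) can be sketched as a minimal harness. run_checks here is a stand-in for the real 16-check pass, not the actual script:

```python
import json


def validator_exit_code(path, run_checks):
    """Return the exit code described above:
    2 when parameters.json cannot be read or parsed,
    1 when the checks report errors, 0 when clean."""
    try:
        with open(path) as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return 2
    return 0 if run_checks(data) else 1
```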

Replaces .claude/skills/validate-parameters/SKILL.md with a thin Python-wrapper skill matching how monte-carlo and summarize-assessment work. Deletes the obsolete LLM system-prompt.txt (its S/C/E/V/I/F rule scheme was written for an earlier schema and isn't what summarize_assessment.py reads anyway).

Smoke fixture grew one calculation (annual_instructor_payroll_inr = operating_weeks_per_year * instructor_hourly_rate_inr * 30) so its previously-dead operating_weeks and instructor_hourly_rate variables feed a real calc. New smoke check validate_parameters_end_to_end exercises the validator on the cleaned fixture. Smoke 9/9, unit 50/50.

Cross-plan validation across 13 reference plans: v38 / v39 / v33 india_census / v33 faraday / v31 nuuk all pass cleanly. v33 rail and v31 rail flag a legitimate source_text_word_caps violation (24 words vs 20-word cap) that the old LLM validator was lenient about — a real catch.
…xit codes, and downstream consumption

Adds: How-to-run CLI block; full output shape including checks_performed and per-violation example; 16-row table explaining what each check does and why it's ERROR vs WARN vs no-op; depends_on accepts ids AND output_names (with the Faraday q_*-style example); how no_dead_end_variables is computed (direct + transitive); exit codes; what summarize_assessment.py does with the report.
…ial gates section (schema v5)

ChatGPT v39 casino_royale review: the casino plan's dominant failure modes are legal/political/AML-compliance, but every modelled gate is financial. The financial model fails its long-term coverage gate, but that may not be the primary failure mode at all — if regulatory authorization or political reversal fires, the financial gates never get to matter. The assessment had no way to communicate this.

Adds an optional unmodelled_gates array to parameters.json. Each entry has id (snake_case ending in _gate), label, why_it_matters, source_anchor (one of the eight source sections), consequence_if_false. Cap is 5. Extractor system prompts (digest + full) describe when to populate it and warn against using it as a risk dumping ground.

Validator: accepts the optional field via OPTIONAL_TOP_LEVEL_KEYS; validates entry shape when present; includes ids in uniqueness/snake_case checks and array_length_caps cap of 5. Backward-compatible with the seven existing reference plans that don't have the field.
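A sketch of how the optional field can be accepted and shape-checked. The key set and cap come from the description above; the function itself is hypothetical, not the validator's actual code:

```python
import re

# Entry shape and cap as described for the optional unmodelled_gates array.
GATE_ENTRY_KEYS = {"id", "label", "why_it_matters",
                   "source_anchor", "consequence_if_false"}
UNMODELLED_GATES_CAP = 5


def check_unmodelled_gates(params):
    """Validate unmodelled_gates when present: cap of 5 entries,
    exact entry shape, snake_case ids ending in _gate.
    An absent field is backward-compatible and passes."""
    violations = []
    gates = params.get("unmodelled_gates", [])
    if len(gates) > UNMODELLED_GATES_CAP:
        violations.append(
            f"unmodelled_gates has {len(gates)} entries (cap {UNMODELLED_GATES_CAP})")
    for gate in gates:
        if set(gate) != GATE_ENTRY_KEYS:
            violations.append(f"{gate.get('id', '?')}: wrong keys")
        elif not re.fullmatch(r"[a-z0-9_]+_gate", gate["id"]):
            violations.append(f"{gate['id']}: id must be snake_case ending in _gate")
    return violations
```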

Summarizer: renders a new ## Known unmodelled existential gates section after Modelling frame (table of Gate / Why it matters / Source anchor / Consequence if false); appends a bold Note caveat under Modelling frame when the array is non-empty; adds unmodelled_gates_summary {count, ids} to the JSON manifest. assessment_schema_version bumped to 5.

Casino_royale v39 parameters.json populated with three unmodelled gates (regulatory_authorization_gate, political_reversal_gate, aml_banking_consortium_gate) demonstrating the new section. Re-rendered assessment is now 242 lines and includes the caveat ChatGPT asked for.

Smoke 9/9, unit 50/50.
…cal findings (schema v6)

The ChatGPT v39-b review made two production-readiness asks: the machine summary still centered on primary_failed_gates: [monthly_revenue_coverage_ratio], so an AI reading only the JSON manifest would miss the scope warning; and Critical findings led with the DOOM bullet, so the scope caveat needed prose visibility there too.

Manifest: dropped nested unmodelled_gates_summary {count, ids}; replaced with flat fields known_unmodelled_existential_gates (list of ids; empty when none) and assessment_scope_warning (string when the list is non-empty, null when empty). ChatGPT called out both fields by name.
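The flat-field replacement can be sketched as follows. The warning wording is a placeholder, not the summarizer's actual string:

```python
def manifest_gate_fields(unmodelled_gates):
    """Build the two flat manifest fields: a list of gate ids
    (empty when none) and a scope-warning string (None when empty)."""
    gate_ids = [gate["id"] for gate in unmodelled_gates]
    warning = None
    if gate_ids:
        warning = ("Assessment models financial gates only; unmodelled "
                   "existential gates: " + ", ".join(gate_ids))
    return {
        "known_unmodelled_existential_gates": gate_ids,
        "assessment_scope_warning": warning,
    }
```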

Critical findings: when parameters.unmodelled_gates is non-empty, the section now opens with a SCOPE WARNING bullet naming the unmodelled gate labels and pointing at ## Known unmodelled existential gates. The DOOM/FRAGILE/scenario/blank-run/still-missing bullets follow.

assessment_schema_version bumped 5 -> 6. Smoke fixture assertion updated. SKILL.md updated.

Regenerated v40/20251026_casino_royale (per project convention of new-version-per-feedback-round; saved that convention to memory). Smoke 9/9, unit 50/50.
…nmodelled gates table

ChatGPT v40 review: two small wording tweaks to lock down production readiness.

SCOPE WARNING bullet: lowercase the first letter of each label so the line reads as prose rather than a proper-noun list, and join with an Oxford comma plus 'or' before the last item. Acronym-aware: if the label starts with a multi-uppercase word (e.g. 'AML/KYC compliant banking infrastructure'), the casing is preserved. Before: 'does not evaluate Federal land-use authorization, Political acceptance and non-reversal, AML/KYC compliant banking infrastructure'. After: 'does not evaluate federal land-use authorization, political acceptance and non-reversal, or AML/KYC compliant banking infrastructure'.
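The lowercasing-and-join rule can be sketched like this; the two-leading-uppercase-letters acronym test is an assumed simplification of the actual implementation:

```python
def join_gate_labels(labels):
    """Lowercase each label's first letter unless it opens with an
    acronym (two or more leading uppercase letters), then join with
    an Oxford comma and 'or' before the last item."""
    def decap(label):
        if len(label) >= 2 and label[0].isupper() and label[1].isupper():
            return label  # acronym-led, e.g. 'AML/KYC ...': keep casing
        return label[:1].lower() + label[1:]

    items = [decap(label) for label in labels]
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return ", ".join(items[:-1]) + ", or " + items[-1]
```

On the three casino_royale labels this reproduces the "after" string quoted in the commit message.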

Known unmodelled existential gates table: prefix the source_anchor column value with 'report.html / ' so an agent reading the assessment can trace the claim back to a specific section of the source report. 'expert_criticism' -> 'report.html / expert_criticism'.

No schema bump (manifest unchanged; pure rendering).

Regenerated v41/20251026_casino_royale per the new-version-per-feedback-round convention. Smoke 9/9.
…lowed by units

Surfaced during the v43 casino_royale clean-room re-extraction: the validator's threshold_friendly_naming check used endswith('_shortfall') but real output names append the unit suffix (e.g. phase2_funding_shortfall_usd). The bad token was the second-to-last underscore-segment, not the trailing one, so the check never fired. Same gap on _gap and _deficit.

Replaces the suffix-list endswith() with a regex that matches the bad word as a token followed by either end-of-string OR a trailing unit segment: r'_(gap|deficit|shortfall)(_[a-z0-9_]+)?$'. The suggested-rename in the violation message preserves the unit suffix (phase2_funding_shortfall_usd -> phase2_funding_surplus_usd, not just _surplus).
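The fixed check and suggested rename can be reproduced directly from the regex given above; the surrounding function is a hypothetical wrapper, not the validator's exact code:

```python
import re

# Matches _gap/_deficit/_shortfall as a token, either at the end of the
# name or followed by a trailing unit segment such as _usd.
BAD_TOKEN = re.compile(r"_(gap|deficit|shortfall)(_[a-z0-9_]+)?$")


def suggest_surplus_rename(output_name):
    """Return the WARN-level rename suggestion, preserving any unit
    suffix, or None when the name is clean."""
    match = BAD_TOKEN.search(output_name)
    if match is None:
        return None
    return output_name[:match.start()] + "_surplus" + (match.group(2) or "")
```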

Smoke 9/9. Verified the regression on the fresh v43 casino_royale parameters.json: the agent's first pass emitted phase2_funding_shortfall_usd; the fixed validator caught it with a WARN; a follow-up agent renamed to _surplus and flipped the formula sign; downstream bounds, calculations, scenarios, settings, Monte Carlo, and assessment all re-ran cleanly under the new name.
…room run, and the 'fix prompts, not outputs' lesson

Two new sections at the end of the doc:

## Later on 0517 — PR #710 merged, PR #711 in flight, casino_royale run. Covers the deterministic Python validator (replacing the LLM skill that was unrunnable on digest-extractor output), the optional unmodelled_gates field on parameters.json, the schema v4 -> v5 -> v6 bumps (known_unmodelled_existential_gates + assessment_scope_warning flat fields, SCOPE WARNING bullet in Critical findings, source-anchor prefixing), and the casino_royale v39-v43 iteration history that drove the changes.

## Process insight — fix prompts, not outputs. Documents the v40-v43 sequence as a concrete violation-and-recovery of the feedback_fix_prompts_not_outputs.md memory rule I had saved and ignored: I hand-patched parameters.json across v40/v41/v42 instead of re-running the extractor against the updated prompt, which hid three real bugs (validator regex on _suffix tokens, LLM emission of _shortfall-named outputs, LLM emission of dead-end variables) until the clean-room v43 run surfaced them. Adds five process rules for future runs.

Stale schema-version mentions updated to v6: cross-plan reference set header, Cross-plan generalisation row, regression-test outstanding-issue, schema-v6 basis_enum entry, 'manifest regression test' next-step. Historical references to v3/v4 inside the iteration history table are left as-is.
…d-basis fallback, contractual-gate naming hint

ChatGPT v43 review: three changes that route to code or prompt fixes, not to hand-patching the output.

1) Saturated-failure detection (summarize_assessment.py). When a DOOM gate's pass rate is below 0.5% AND no single quartile movement shifts it by >=0.5 pp, the gate is structurally unreachable across the sampled space — any quartile-driver row is misleading optimisation advice. New is_saturated_failure() helper used by Decision implications and Failure drivers: lever becomes 'Saturated failure: pass rate is X.X% and no single input quartile movement changes that. Quartile sensitivity is not informative; audit the input bounds and the threshold definition.' Failure drivers row shows '(saturated failure) | n/a | saturated failure: no single input restriction can lift the pass rate; revisit bounds and threshold'.

2) Threshold-basis fallback (summarize_assessment.py). threshold_basis_for() previously returned 'unknown' when the gate's depends_on had no key_value to anchor against. That conflated two cases: gate not declared at all (genuinely unknown) vs gate declared but threshold is a bare numerical comparison from montecarlo_settings.json (model-defined). lookup_gate_metadata() now returns gate_found=True/False; threshold_basis_for() returns 'model_defined' when gate_found AND no anchor, reserves 'unknown' for the genuinely-missing case. On v43 casino_royale this flips contingency_remediation_buffer_usd from 'unknown' to 'model_defined'.

3) Contractual-gate naming hint (extractor system prompts, digest + full). New paragraph in the threshold-friendly naming section: when a calculated window/margin/trigger represents a contractual gate enforced by a specific counterparty (sponsor, lender, regulator, agency, court, investor, grantor, prime contractor), prefix the id with the counterparty so downstream consumers see the contractual rather than operational nature. Generic rule; covers the casino_royale 'effective_profitability_window_days' -> 'sponsor_profitability_trigger_window_days' case ChatGPT flagged, plus future counterparty gates.
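Changes 1) and 2) can be sketched as two small helpers, with signatures assumed from the descriptions above (the real summarize_assessment.py helpers likely take richer objects, and the anchored-case return label here is hypothetical):

```python
def is_saturated_failure(pass_rate, quartile_pass_rates):
    """A DOOM gate is a saturated failure when its overall pass rate is
    below 0.5% and no single input-quartile restriction moves it by
    0.5 percentage points or more (rates as fractions in [0, 1])."""
    if pass_rate >= 0.005:
        return False
    return all(abs(q - pass_rate) < 0.005 for q in quartile_pass_rates)


def threshold_basis_for(gate_found, anchor):
    """'unknown' only when the gate is genuinely undeclared;
    'model_defined' when the gate exists but its threshold is a bare
    numerical comparison with no key_value anchor."""
    if not gate_found:
        return "unknown"
    if anchor is None:
        return "model_defined"
    return "key_value_anchored"  # hypothetical label for the anchored case
```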

v43 assessment regenerated locally to verify all three fixes land. Smoke 9/9, unit 50/50. No schema bump (manifest unchanged; pure rendering + prompt rule).
@neoneye neoneye merged commit a401990 into main May 17, 2026
3 checks passed
@neoneye neoneye deleted the feat/napkin-math-validate-parameters-python branch May 17, 2026 02:17