feat(napkin-math): deterministic Python validator for parameters.json #711

Merged
neoneye merged 8 commits into main from feat/napkin-math-validate-parameters-python
May 17, 2026

Conversation


@neoneye (Member) commented May 17, 2026

Summary

While running the napkin_math pipeline on a new plan (20251026_casino_royale), the existing validate-parameters skill turned out to be unrunnable:

  • It's LLM-driven and written against an earlier schema. It rejects every entry of key_values / derived_questions / recommended_first_calculations because output_name and output_unit count as "extra keys" under its S003/S004/S006 rules.
  • The digest extractor's system prompt requires those fields.
  • The existing v34-v38 validation.json files use a different 16-named-check schema (json_parse, top_level_structure, …, no_dead_end_variables, shared_pool_legitimacy) that no producer in the tree emits. Each pipeline run needed validation.json to be hand-written.

This PR ships a deterministic Python validator that produces the 16-check validation.json the rest of the pipeline already assumes, and replaces the LLM-driven skill with a thin Python wrapper.

Changes

  • experiments/napkin_math/validate_parameters.py — new. Runs the 16 named checks; produces the exact shape summarize_assessment.py consumes; exits 0 on valid / 1 on invalid / 2 on JSON parse failure.
  • .claude/skills/validate-parameters/SKILL.md — rewritten as a Python-wrapper skill (mirrors monte-carlo, summarize-assessment). Documents all 16 checks.
  • .claude/skills/validate-parameters/system-prompt.txt — deleted (obsolete LLM ruleset for an older schema; not what summarize_assessment.py reads anyway).
  • README.md — Stage 2 section gains an "Implementation" note pointing at the script; Validation categories now lists all 16 named checks.
  • tests/run_smoke.py — new check validate_parameters_end_to_end exercises the script against the smoke fixture.
  • tests/fixtures/smoke/{parameters.json,calculations.py} — added a third calculation (annual_instructor_payroll_inr = operating_weeks_per_year * instructor_hourly_rate_inr * 30) so the previously dead-ended operating_weeks_per_year and instructor_hourly_rate_inr variables feed a calc. The smoke fixture is now a clean reference. Smoke updated from "two outputs computed" to "three outputs computed".

Design notes

  • depends_on accepts both ids and output_names. The digest extractor often gives derived questions q_*-style ids whose output_name differs from the id (e.g. Faraday's q_weakest_program_gate has output_name: weakest_program_gate_surplus_eur). depends_on routinely references the output name; the validator treats both as legitimate.
  • threshold_friendly_naming is WARN-level, not ERROR. The validator can't see montecarlo_settings.json, so it can't tell which outputs will actually be threshold-tested. It flags outputs ending in _gap / _deficit / _shortfall for review.
  • shared_pool_legitimacy is a no-op at the structural level. The rule requires reading the source plan's narrative to verify pool legitimacy — it's enforced upstream in the extractor's system prompt. The check stays in checks_performed for completeness; the validator does not invent violations from structure alone.
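The id-or-output_name rule for depends_on can be sketched as a small check. This is a hypothetical simplification of the real validator; the entry shapes are assumed from the description above:

```python
def check_depends_on(entries):
    """Collect every declared id and output_name, then flag any
    depends_on reference that matches neither.

    entries: combined key_values / derived_questions /
    recommended_first_calculations records (shape assumed)."""
    declared = set()
    for entry in entries:
        declared.add(entry["id"])
        if entry.get("output_name"):
            declared.add(entry["output_name"])
    violations = []
    for entry in entries:
        for dep in entry.get("depends_on", []):
            if dep not in declared:
                violations.append((entry["id"], dep))
    return violations
```

Under this sketch, the Faraday case resolves cleanly: a calculation depending on weakest_program_gate_surplus_eur matches the output_name of q_weakest_program_gate even though no entry carries that string as its id.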

Test plan

  • Smoke 9/9, unit 50/50 green.
  • Cross-plan validation across 13 reference plans:
    • v38 datacenter / media_rescue → valid
    • v39 casino_royale → valid
    • v33 india_census / faraday → valid
    • v31 nuuk_clay_workshop → valid
    • v33 cross_border_rail_ticketing → INVALID (1 error: source_text_word_caps, 24 words vs 20-word cap)
    • v31 cross_border_rail_ticketing → INVALID (same)
  • The two source_text_word_caps violations are real schema violations the old LLM validator was lenient about. The current digest extractor produces clean output (every recent reference plan passes).

🤖 Generated with Claude Code

neoneye added 8 commits May 17, 2026 02:51
When running the pipeline on a new plan (Casino Royale), the LLM-driven validate-parameters skill rejected every entry of key_values / derived_questions / recommended_first_calculations because output_name and output_unit are 'extra keys' per its S003/S004/S006 rules. But the digest extractor's system prompt requires those fields, and the existing v34-v38 validation.json files use a different 16-named-check schema with no producer in the tree. The validator was unrunnable, so each pipeline run needed validation.json to be hand-written.

Adds experiments/napkin_math/validate_parameters.py — a deterministic Python validator that runs the 16 named checks the rest of the pipeline assumes (json_parse, top_level_structure, required_fields, array_length_caps, global_id_uniqueness, snake_case_ids, depends_on_declared, formula_rhs_declared, fraction_value_range, comment_word_caps, source_text_word_caps, output_name_present_when_formula_hint, output_unit_present_when_formula_hint, no_dead_end_variables, threshold_friendly_naming, shared_pool_legitimacy). Output shape matches what summarize_assessment.py already consumes. Exit 0 valid / 1 invalid / 2 parse-failed. depends_on accepts both ids and output_names (the digest extractor uses q_*-style ids whose output_name differs from the id, e.g. Faraday's q_weakest_program_gate -> weakest_program_gate_surplus_eur).
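The exit-code contract (0 valid / 1 invalid / 2 parse-failed) can be sketched as a minimal harness. run_checks here is a stand-in for the real 16-check pass, not the actual script:

```python
import json


def validator_exit_code(path, run_checks):
    """Return the exit code described above:
    2 when parameters.json cannot be read or parsed,
    1 when the checks report errors, 0 when clean."""
    try:
        with open(path) as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return 2
    return 0 if run_checks(data) else 1
```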

Replaces .claude/skills/validate-parameters/SKILL.md with a thin Python-wrapper skill matching how monte-carlo and summarize-assessment work. Deletes the obsolete LLM system-prompt.txt (its S/C/E/V/I/F rule scheme was written for an earlier schema and isn't what summarize_assessment.py reads anyway).

Smoke fixture grew one calculation (annual_instructor_payroll_inr = operating_weeks_per_year * instructor_hourly_rate_inr * 30) so its previously-dead operating_weeks and instructor_hourly_rate variables feed a real calc. New smoke check validate_parameters_end_to_end exercises the validator on the cleaned fixture. Smoke 9/9, unit 50/50.

Cross-plan validation across 13 reference plans: v38 / v39 / v33 india_census / v33 faraday / v31 nuuk all pass cleanly. v33 rail and v31 rail flag a legitimate source_text_word_caps violation (24 words vs 20-word cap) that the old LLM validator was lenient about — a real catch.
…xit codes, and downstream consumption

Adds: How-to-run CLI block; full output shape including checks_performed and per-violation example; 16-row table explaining what each check does and why it's ERROR vs WARN vs no-op; depends_on accepts ids AND output_names (with the Faraday q_*-style example); how no_dead_end_variables is computed (direct + transitive); exit codes; what summarize_assessment.py does with the report.
…ial gates section (schema v5)

ChatGPT v39 casino_royale review: the casino plan's dominant failure modes are legal/political/AML-compliance, but every modelled gate is financial. The financial model fails its long-term coverage gate, but that may not be the primary failure mode at all — if regulatory authorization or political reversal fires, the financial gates never get to matter. The assessment had no way to communicate this.

Adds an optional unmodelled_gates array to parameters.json. Each entry has id (snake_case ending in _gate), label, why_it_matters, source_anchor (one of the eight source sections), consequence_if_false. Cap is 5. Extractor system prompts (digest + full) describe when to populate it and warn against using it as a risk dumping ground.

Validator: accepts the optional field via OPTIONAL_TOP_LEVEL_KEYS; validates entry shape when present; includes ids in uniqueness/snake_case checks and array_length_caps cap of 5. Backward-compatible with the seven existing reference plans that don't have the field.
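A sketch of how the optional field can be accepted and shape-checked. The key set and cap come from the description above; the function itself is hypothetical, not the validator's actual code:

```python
import re

# Entry shape and cap as described for the optional unmodelled_gates array.
GATE_ENTRY_KEYS = {"id", "label", "why_it_matters",
                   "source_anchor", "consequence_if_false"}
UNMODELLED_GATES_CAP = 5


def check_unmodelled_gates(params):
    """Validate unmodelled_gates when present: cap of 5 entries,
    exact entry shape, snake_case ids ending in _gate.
    An absent field is backward-compatible and passes."""
    violations = []
    gates = params.get("unmodelled_gates", [])
    if len(gates) > UNMODELLED_GATES_CAP:
        violations.append(
            f"unmodelled_gates has {len(gates)} entries (cap {UNMODELLED_GATES_CAP})")
    for gate in gates:
        if set(gate) != GATE_ENTRY_KEYS:
            violations.append(f"{gate.get('id', '?')}: wrong keys")
        elif not re.fullmatch(r"[a-z0-9_]+_gate", gate["id"]):
            violations.append(f"{gate['id']}: id must be snake_case ending in _gate")
    return violations
```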

Summarizer: renders a new ## Known unmodelled existential gates section after Modelling frame (table of Gate / Why it matters / Source anchor / Consequence if false); appends a bold Note caveat under Modelling frame when the array is non-empty; adds unmodelled_gates_summary {count, ids} to the JSON manifest. assessment_schema_version bumped to 5.

Casino_royale v39 parameters.json populated with three unmodelled gates (regulatory_authorization_gate, political_reversal_gate, aml_banking_consortium_gate) demonstrating the new section. Re-rendered assessment is now 242 lines and includes the caveat ChatGPT asked for.

Smoke 9/9, unit 50/50.
…cal findings (schema v6)

The ChatGPT v39-b review made two production-readiness asks: the machine summary still centered on primary_failed_gates: [monthly_revenue_coverage_ratio], so an AI reading only the JSON manifest would miss the scope warning; and Critical findings led with the DOOM bullet, so the scope caveat needed prose visibility there too.

Manifest: dropped nested unmodelled_gates_summary {count, ids}; replaced with flat fields known_unmodelled_existential_gates (list of ids; empty when none) and assessment_scope_warning (string when the list is non-empty, null when empty). ChatGPT called out both fields by name.
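The flat-field replacement can be sketched as follows. The warning wording is a placeholder, not the summarizer's actual string:

```python
def manifest_gate_fields(unmodelled_gates):
    """Build the two flat manifest fields: a list of gate ids
    (empty when none) and a scope-warning string (None when empty)."""
    gate_ids = [gate["id"] for gate in unmodelled_gates]
    warning = None
    if gate_ids:
        warning = ("Assessment models financial gates only; unmodelled "
                   "existential gates: " + ", ".join(gate_ids))
    return {
        "known_unmodelled_existential_gates": gate_ids,
        "assessment_scope_warning": warning,
    }
```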

Critical findings: when parameters.unmodelled_gates is non-empty, the section now opens with a SCOPE WARNING bullet naming the unmodelled gate labels and pointing at ## Known unmodelled existential gates. The DOOM/FRAGILE/scenario/blank-run/still-missing bullets follow.

assessment_schema_version bumped 5 -> 6. Smoke fixture assertion updated. SKILL.md updated.

Regenerated v40/20251026_casino_royale (per project convention of new-version-per-feedback-round; saved that convention to memory). Smoke 9/9, unit 50/50.
…nmodelled gates table

ChatGPT v40 review: two small wording tweaks to lock down production readiness.

SCOPE WARNING bullet: lowercase the first letter of each label so the line reads as prose rather than a proper-noun list, and join with an Oxford comma plus 'or' before the last item. Acronym-aware: if the label starts with a multi-uppercase word (e.g. 'AML/KYC compliant banking infrastructure'), the casing is preserved. Before: 'does not evaluate Federal land-use authorization, Political acceptance and non-reversal, AML/KYC compliant banking infrastructure'. After: 'does not evaluate federal land-use authorization, political acceptance and non-reversal, or AML/KYC compliant banking infrastructure'.
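The lowercasing-and-join rule can be sketched like this; the two-leading-uppercase-letters acronym test is an assumed simplification of the actual implementation:

```python
def join_gate_labels(labels):
    """Lowercase each label's first letter unless it opens with an
    acronym (two or more leading uppercase letters), then join with
    an Oxford comma and 'or' before the last item."""
    def decap(label):
        if len(label) >= 2 and label[0].isupper() and label[1].isupper():
            return label  # acronym-led, e.g. 'AML/KYC ...': keep casing
        return label[:1].lower() + label[1:]

    items = [decap(label) for label in labels]
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return ", ".join(items[:-1]) + ", or " + items[-1]
```

On the three casino_royale labels this reproduces the "after" string quoted in the commit message.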

Known unmodelled existential gates table: prefix the source_anchor column value with 'report.html / ' so an agent reading the assessment can trace the claim back to a specific section of the source report. 'expert_criticism' -> 'report.html / expert_criticism'.

No schema bump (manifest unchanged; pure rendering).

Regenerated v41/20251026_casino_royale per the new-version-per-feedback-round convention. Smoke 9/9.
…lowed by units

Surfaced during the v43 casino_royale clean-room re-extraction: the validator's threshold_friendly_naming check used endswith('_shortfall') but real output names append the unit suffix (e.g. phase2_funding_shortfall_usd). The bad token was the second-to-last underscore-segment, not the trailing one, so the check never fired. Same gap on _gap and _deficit.

Replaces the suffix-list endswith() with a regex that matches the bad word as a token followed by either end-of-string OR a trailing unit segment: r'_(gap|deficit|shortfall)(_[a-z0-9_]+)?$'. The suggested-rename in the violation message preserves the unit suffix (phase2_funding_shortfall_usd -> phase2_funding_surplus_usd, not just _surplus).
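The fixed check and suggested rename can be reproduced directly from the regex given above; the surrounding function is a hypothetical wrapper, not the validator's exact code:

```python
import re

# Matches _gap/_deficit/_shortfall as a token, either at the end of the
# name or followed by a trailing unit segment such as _usd.
BAD_TOKEN = re.compile(r"_(gap|deficit|shortfall)(_[a-z0-9_]+)?$")


def suggest_surplus_rename(output_name):
    """Return the WARN-level rename suggestion, preserving any unit
    suffix, or None when the name is clean."""
    match = BAD_TOKEN.search(output_name)
    if match is None:
        return None
    return output_name[:match.start()] + "_surplus" + (match.group(2) or "")
```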

Smoke 9/9. Verified the regression on the fresh v43 casino_royale parameters.json: the agent's first pass emitted phase2_funding_shortfall_usd; the fixed validator caught it with a WARN; a follow-up agent renamed to _surplus and flipped the formula sign; downstream bounds, calculations, scenarios, settings, Monte Carlo, and assessment all re-ran cleanly under the new name.
…room run, and the 'fix prompts, not outputs' lesson

Two new sections at the end of the doc:

## Later on 0517 — PR #710 merged, PR #711 in flight, casino_royale run. Covers the deterministic Python validator (replacing the LLM skill that was unrunnable on digest-extractor output), the optional unmodelled_gates field on parameters.json, the schema v4 -> v5 -> v6 bumps (known_unmodelled_existential_gates + assessment_scope_warning flat fields, SCOPE WARNING bullet in Critical findings, source-anchor prefixing), and the casino_royale v39-v43 iteration history that drove the changes.

## Process insight — fix prompts, not outputs. Documents the v40-v43 sequence as a concrete violation-and-recovery of the feedback_fix_prompts_not_outputs.md memory rule I had saved and ignored: I hand-patched parameters.json across v40/v41/v42 instead of re-running the extractor against the updated prompt, which hid three real bugs (validator regex on _suffix tokens, LLM emission of _shortfall-named outputs, LLM emission of dead-end variables) until the clean-room v43 run surfaced them. Adds five process rules for future runs.

Stale schema-version mentions updated to v6: cross-plan reference set header, Cross-plan generalisation row, regression-test outstanding-issue, schema-v6 basis_enum entry, 'manifest regression test' next-step. Historical references to v3/v4 inside the iteration history table are left as-is.
…d-basis fallback, contractual-gate naming hint

ChatGPT v43 review: three changes that route to code or prompt fixes, not to hand-patching the output.

1) Saturated-failure detection (summarize_assessment.py). When a DOOM gate's pass rate is below 0.5% AND no single quartile movement shifts it by >=0.5 pp, the gate is structurally unreachable across the sampled space — any quartile-driver row is misleading optimisation advice. New is_saturated_failure() helper used by Decision implications and Failure drivers: lever becomes 'Saturated failure: pass rate is X.X% and no single input quartile movement changes that. Quartile sensitivity is not informative; audit the input bounds and the threshold definition.' Failure drivers row shows '(saturated failure) | n/a | saturated failure: no single input restriction can lift the pass rate; revisit bounds and threshold'.

2) Threshold-basis fallback (summarize_assessment.py). threshold_basis_for() previously returned 'unknown' when the gate's depends_on had no key_value to anchor against. That conflated two cases: gate not declared at all (genuinely unknown) vs gate declared but threshold is a bare numerical comparison from montecarlo_settings.json (model-defined). lookup_gate_metadata() now returns gate_found=True/False; threshold_basis_for() returns 'model_defined' when gate_found AND no anchor, reserves 'unknown' for the genuinely-missing case. On v43 casino_royale this flips contingency_remediation_buffer_usd from 'unknown' to 'model_defined'.

3) Contractual-gate naming hint (extractor system prompts, digest + full). New paragraph in the threshold-friendly naming section: when a calculated window/margin/trigger represents a contractual gate enforced by a specific counterparty (sponsor, lender, regulator, agency, court, investor, grantor, prime contractor), prefix the id with the counterparty so downstream consumers see the contractual rather than operational nature. Generic rule; covers the casino_royale 'effective_profitability_window_days' -> 'sponsor_profitability_trigger_window_days' case ChatGPT flagged, plus future counterparty gates.
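Changes 1) and 2) can be sketched as two small helpers, with signatures assumed from the descriptions above (the real summarize_assessment.py helpers likely take richer objects, and the anchored-case return label here is hypothetical):

```python
def is_saturated_failure(pass_rate, quartile_pass_rates):
    """A DOOM gate is a saturated failure when its overall pass rate is
    below 0.5% and no single input-quartile restriction moves it by
    0.5 percentage points or more (rates as fractions in [0, 1])."""
    if pass_rate >= 0.005:
        return False
    return all(abs(q - pass_rate) < 0.005 for q in quartile_pass_rates)


def threshold_basis_for(gate_found, anchor):
    """'unknown' only when the gate is genuinely undeclared;
    'model_defined' when the gate exists but its threshold is a bare
    numerical comparison with no key_value anchor."""
    if not gate_found:
        return "unknown"
    if anchor is None:
        return "model_defined"
    return "key_value_anchored"  # hypothetical label for the anchored case
```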

v43 assessment regenerated locally to verify all three fixes land. Smoke 9/9, unit 50/50. No schema bump (manifest unchanged; pure rendering + prompt rule).
@neoneye neoneye merged commit a401990 into main May 17, 2026
3 checks passed
@neoneye neoneye deleted the feat/napkin-math-validate-parameters-python branch May 17, 2026 02:17