napkin-math: dropped_signals schema + validator + audit consumption (proposal 141 PR 2) #752

Merged: neoneye merged 2 commits into main from napkin-math/dropped-signals-schema-141-pr2 on May 21, 2026

@neoneye neoneye commented May 21, 2026

Summary

Builds on PR #751 (Fork B advisory audit) by giving the LLM an optional vocabulary to explain absences that would otherwise show up as absent_unexplained in the audit. Three coordinated changes:

1. Extract prompts grow an optional dropped_signals array

Both extract-parameters-from-digest and extract-parameters-from-full get the same schema and rule. Each entry must name a structural reason from a closed enum and reference the current signal it was replaced by, made redundant with, moved to, or capped under.

| reason | requires |
| --- | --- |
| replaced_by | replacement_id resolves to a current id or output_name |
| redundant_with | redundant_with_id resolves to a current id or output_name |
| moved_to_unmodelled_gate | replacement_id resolves to an unmodelled_gates id |
| cap_pressure | cap_kind names a capped array, AND that array is actually at its cap |
| out_of_scope | structural rationale only |

Hard limit 8 entries; rationale ≤25 words; corpus-agnostic wording (no plan literals).
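For orientation, here is a hedged sketch of what a few entries might look like, together with a shape-only check of the limits stated above. The field names come from this PR; the ids, rationales, exact JSON layout, and the helper `shape_problems` are invented for illustration and are not taken from the prompts or the validator.

```python
# Illustrative only: ids and rationales are invented, and the exact layout of
# an entry is an assumption based on the field names mentioned in this PR.
dropped_signals = [
    {
        "id": "old_signal_a",
        "origin": "prior_baseline",
        "reason": "replaced_by",
        "replacement_id": "new_signal_a",
        "rationale": "Superseded by a single combined signal covering the same quantity.",
    },
    {
        "id": "old_signal_b",
        "origin": "prior_baseline",
        "reason": "cap_pressure",
        "cap_kind": "key_values",
        "rationale": "Array is at its cap and higher-priority signals took the slots.",
    },
    {
        "id": "old_signal_c",
        "origin": "prior_baseline",
        "reason": "out_of_scope",
        "rationale": "Signal no longer belongs to the modelled scope.",
    },
]

# Extra field each reason requires, following the reason listing in this PR.
REQUIRED_FIELD = {
    "replaced_by": "replacement_id",
    "redundant_with": "redundant_with_id",
    "moved_to_unmodelled_gate": "replacement_id",
    "cap_pressure": "cap_kind",
    "out_of_scope": None,
}
MAX_ENTRIES = 8
MAX_RATIONALE_WORDS = 25


def shape_problems(entries):
    """Check entry shape only; reference resolution is out of scope here."""
    problems = []
    if len(entries) > MAX_ENTRIES:
        problems.append(f"{len(entries)} entries exceeds the cap of {MAX_ENTRIES}")
    for i, entry in enumerate(entries):
        reason = entry.get("reason")
        if reason not in REQUIRED_FIELD:
            problems.append(f"entry {i}: unknown reason {reason!r}")
            continue
        field = REQUIRED_FIELD[reason]
        if field is not None and not entry.get(field):
            problems.append(f"entry {i}: {reason} requires {field}")
        if len(str(entry.get("rationale", "")).split()) > MAX_RATIONALE_WORDS:
            problems.append(f"entry {i}: rationale over {MAX_RATIONALE_WORDS} words")
    return problems
```

Calling `shape_problems(dropped_signals)` on the sample above returns an empty list.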

2. validate_parameters.py ERRORs on malformed dropped_signals

Per the proposal: malformed entries are audit failures, not acceptable explanations. New 19th check dropped_signals_schema enforces the closed enum, resolves references, rejects cap_pressure claims on arrays that aren't at cap, etc. Optional field — absent or empty is the expected default for cleanly-preserved iterations.
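A minimal sketch of how such a check could be structured, under stated assumptions: the function name `check_dropped_signals_sketch`, the way current names are collected, and the `ARRAY_CAPS` table are hypothetical. The only cap this PR states explicitly is 8 for `key_values`, so that is the only one listed. The real `dropped_signals_schema` check may differ in every detail.

```python
REASONS = {"replaced_by", "redundant_with", "moved_to_unmodelled_gate",
           "cap_pressure", "out_of_scope"}
MAX_ENTRIES = 8
MAX_RATIONALE_WORDS = 25
ARRAY_CAPS = {"key_values": 8}  # assumption: other capped arrays are not listed here


def current_names(params):
    """Collect every id and output_name from list-valued sections (assumed layout)."""
    names = set()
    for section, entries in params.items():
        if section == "dropped_signals" or not isinstance(entries, list):
            continue
        for entry in entries:
            if isinstance(entry, dict):
                for key in ("id", "output_name"):
                    if isinstance(entry.get(key), str):
                        names.add(entry[key])
    return names


def check_dropped_signals_sketch(params):
    """Return a list of error strings; an empty list means the field is clean."""
    errors = []
    entries = params.get("dropped_signals") or []  # absent or empty is fine
    if len(entries) > MAX_ENTRIES:
        errors.append(f"{len(entries)} entries exceeds cap of {MAX_ENTRIES}")
    names = current_names(params)
    gate_ids = {g["id"] for g in params.get("unmodelled_gates", [])
                if isinstance(g, dict) and isinstance(g.get("id"), str)}
    for i, entry in enumerate(entries):
        reason = entry.get("reason")
        if reason not in REASONS:
            errors.append(f"[{i}] unknown reason {reason!r}")
            continue
        if len(str(entry.get("rationale", "")).split()) > MAX_RATIONALE_WORDS:
            errors.append(f"[{i}] rationale over {MAX_RATIONALE_WORDS} words")
        if reason == "replaced_by" and entry.get("replacement_id") not in names:
            errors.append(f"[{i}] replacement_id does not resolve")
        elif reason == "redundant_with" and entry.get("redundant_with_id") not in names:
            errors.append(f"[{i}] redundant_with_id does not resolve")
        elif reason == "moved_to_unmodelled_gate" and entry.get("replacement_id") not in gate_ids:
            errors.append(f"[{i}] replacement_id is not an unmodelled_gates id")
        elif reason == "cap_pressure":
            kind = entry.get("cap_kind")
            cap = ARRAY_CAPS.get(kind) if isinstance(kind, str) else None
            if cap is None or len(params.get(kind, [])) < cap:
                errors.append(f"[{i}] cap_pressure claim is not justified")
    return errors
```

Returning a list of messages rather than raising keeps the check composable with the other validator checks; how the real validator reports ERRORs is not shown in this PR.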

3. audit_source_preservation.py reclassifies via strictly validated dropped_signals

New status explained_drop ranks above likely_renamed and absent_unexplained. The audit consumes an entry only when it would also pass the validator: is_audit_consumable_drop mirrors check_dropped_signals_schema, so an invalid explanation cannot hide a real regression. Precedence:

preserved_by_id > preserved_by_output_name > preserved_as_formula_dependency
> explained_drop > likely_renamed > absent_unexplained
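One way to read this ordering, as a hedged sketch: when several classifications could apply to a prior signal, the highest-ranked one wins. The helper `pick_status` is hypothetical and is not claimed to match how audit_source_preservation.py is organised internally.

```python
# Statuses in descending precedence, copied from the ordering stated above.
STATUS_PRECEDENCE = (
    "preserved_by_id",
    "preserved_by_output_name",
    "preserved_as_formula_dependency",
    "explained_drop",
    "likely_renamed",
    "absent_unexplained",
)
_RANK = {status: rank for rank, status in enumerate(STATUS_PRECEDENCE)}


def pick_status(candidates):
    """Return the highest-precedence status among the candidates.

    Unknown statuses are ignored; with no candidates the signal is treated as
    absent_unexplained (an assumption about the default).
    """
    known = [status for status in candidates if status in _RANK]
    return min(known, key=_RANK.__getitem__, default="absent_unexplained")
```

For example, `pick_status({"likely_renamed", "explained_drop"})` yields `"explained_drop"`, which is the case the PR calls out.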

Specifically, the audit requires for explained_drop:

  • origin == "prior_baseline" (source_digest entries are Fork A territory)
  • reason in the closed enum
  • replaced_by / moved_to_unmodelled_gate → replacement_id resolves
  • redundant_with → redundant_with_id resolves
  • cap_pressure → cap_kind names a capped array AND that array is actually at cap

Malformed entries fall through to likely_renamed / absent_unexplained instead of being silently reclassified. Double-counting is avoided by ignoring entries whose id is actually preserved in current.
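The consumption rules above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the helper names, the shape of `current_names` (a set of current ids and output_names), and the `ARRAY_CAPS` table (only the `key_values` cap of 8 is stated in this PR) are all hypothetical.

```python
REASONS = frozenset({"replaced_by", "redundant_with", "moved_to_unmodelled_gate",
                     "cap_pressure", "out_of_scope"})
ARRAY_CAPS = {"key_values": 8}  # assumption: only the cap named in this PR


def _resolves(value, names):
    """True when value is a string present in names."""
    return isinstance(value, str) and value in names


def is_consumable_sketch(entry, current_params, current_names):
    """Mirror the validator: accept an entry only if it would also validate."""
    if not isinstance(entry, dict) or entry.get("origin") != "prior_baseline":
        return False
    signal_id = entry.get("id")
    if not isinstance(signal_id, str) or not signal_id:
        return False
    reason = entry.get("reason")
    if reason not in REASONS:
        return False
    if reason == "replaced_by":
        return _resolves(entry.get("replacement_id"), current_names)
    if reason == "redundant_with":
        return _resolves(entry.get("redundant_with_id"), current_names)
    if reason == "moved_to_unmodelled_gate":
        gate_ids = {g.get("id") for g in current_params.get("unmodelled_gates", [])
                    if isinstance(g, dict)}
        return _resolves(entry.get("replacement_id"), gate_ids)
    if reason == "cap_pressure":
        kind = entry.get("cap_kind")
        cap = ARRAY_CAPS.get(kind) if isinstance(kind, str) else None
        return cap is not None and len(current_params.get(kind, [])) >= cap
    return True  # out_of_scope needs only a structural rationale


def consumable_drops(current_params, current_names):
    """Map dropped id -> entry, skipping invalid entries and preserved ids."""
    drops = {}
    for entry in current_params.get("dropped_signals") or []:
        if not is_consumable_sketch(entry, current_params, current_names):
            continue  # falls through to likely_renamed / absent_unexplained
        if entry["id"] in current_names:
            continue  # actually preserved in current: avoid double-counting
        drops.setdefault(entry["id"], entry)
    return drops
```

Entries rejected here are simply not consumed; in this sketch, surfacing them as errors is left to the validator.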

Empirical posture

  • 56 unit tests pass (28 validator + 28 audit). 23 new for the schema and consumption; 5 new for strict rejection paths (unresolved replacement_id, wrong origin, unresolved redundant_with_id, unjustified cap_pressure, moved_to_unmodelled_gate pointing at a key_value).
  • 9/9 smoke checks pass.
  • All 6 v51 parameters.json validate clean with 19 checks (was 18) — the new dropped_signals_schema check is correctly inert on legacy outputs that don't have the optional field.

Discovered limitation (honest surface)

The LLM can only meaningfully emit dropped_signals with origin: "prior_baseline" if the orchestrator passes it the prior parameters.json as additional input. The current extract skill reads only the source digest, so a same-LLM same-session regeneration of v51 would emit zero prior_baseline drops — the LLM has no information about what the prior baseline contained.

This PR therefore ships the schema/validator/audit-consumption infrastructure. The orchestrator/skill wiring that lets the LLM see prior baselines is a candidate for proposal 141 PR 3 (alongside, or before, Fork A). The audit's Fork B comparison itself does not need the LLM to know about the prior — the audit reads both files externally; this PR's job is just to let the audit consume LLM-emitted explanations once they exist.

Out of scope (later proposal 141 PRs)

  • Fork A: source-digest regex scan against the current artifact (independent advisory line).
  • Orchestrator wiring to pass prior parameters.json to the extract skill.
  • Strict mode / CI gating policy.
  • source_claim_ids per-entry field (per-claim grounding).

Test plan

  • pytest experiments/napkin_math/tests/test_validate_parameters.py experiments/napkin_math/tests/test_audit_source_preservation.py (56 pass; 28 new across both files)
  • python3 experiments/napkin_math/tests/run_smoke.py (9/9 pass)
  • All 6 v51 plans still validate clean with 19 checks (new schema check is no-op when field absent)
  • Audit strictly consumes only valid dropped_signals (mirrors validator checks)
  • CI green on latest head

🤖 Generated with Claude Code

neoneye and others added 2 commits May 21, 2026 20:10
…proposal 141 PR 2)

Builds on PR #751 (Fork B advisory audit) by giving the LLM an optional vocabulary to explain absences. Three coordinated changes:

(1) Both extract prompts (from-digest and from-full) gain an optional top-level dropped_signals array. Each entry must name a structural reason from a closed enum (replaced_by, cap_pressure, out_of_scope, moved_to_unmodelled_gate, redundant_with) and reference the current signal it was replaced by, made redundant with, moved to, or capped under. Hard limit 8 entries; rationale ≤25 words. Corpus-agnostic wording — no plan literals.

(2) validate_parameters.py grows a 19th check, dropped_signals_schema, that ERRORs on malformed entries: unknown reason, unresolved replacement_id/redundant_with_id, cap_pressure on an array that isn't actually at its cap, moved_to_unmodelled_gate replacement_id not pointing at an unmodelled_gates entry, rationale over the 25-word cap, total over the 8-entry cap. Per the proposal, malformed dropped_signals entries are audit failures — they should not be accepted as explanations.

(3) audit_source_preservation.py adds a new explained_drop classification status that ranks above likely_renamed and absent_unexplained. When current parameters.json's dropped_signals records the prior signal with a valid reason, the audit reclassifies the disappearance as explained_drop with the structured reason and reference. Malformed dropped_signals entries are silently skipped by the audit (validate_parameters surfaces them); double-counting is avoided by ignoring dropped_signals entries whose id is actually preserved in current.

23 new unit tests added: 9 validator (replaced_by clean, unknown reason, unresolved references, cap_pressure must match a capped array at cap, redundant_with required field, moved_to_unmodelled_gate must point at unmodelled_gates, rationale word cap, entry count cap, absent field is clean) + 6 audit (reclassification from absent to explained_drop, explained_drop outranks likely_renamed, ignored when prior is actually preserved, silently skips malformed entries, cap_pressure handling, moved_to_unmodelled_gate handling). 51 total tests (28 validator + 23 audit) all pass. 9/9 smoke checks. All 6 v51 parameters.json validate clean with 19 checks — the new schema check is correctly inert on legacy outputs without the optional field.

Discovered limitation worth surfacing: the LLM can only meaningfully emit dropped_signals with origin=prior_baseline when the orchestrator passes it the prior parameters.json as additional input. The current extract skill reads only the source digest, so a same-LLM same-session regeneration of v51 would emit zero prior_baseline drops. The schema/validator/audit-consumption infrastructure is in place; the orchestrator/skill wiring that lets the LLM see prior baselines is a separate PR (proposal 141 PR 3 candidate). The audit's Fork B comparison itself does not need the LLM to know about the prior — the audit reads both files externally.

Out of scope (later proposal 141 PRs):

  - Fork A: source-digest regex scan against the current artifact (independent advisory line)

  - Orchestrator wiring to pass prior parameters.json to the extract skill

  - Strict mode / CI gating policy

  - source_claim_ids per-entry field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mirror validator checks)

Review feedback on PR #752: the audit's dropped_signals consumer only checked id+reason, so an entry whose replacement_id failed to resolve (or whose cap_pressure claim was unjustified) would still be reclassified as explained_drop — even though validate_parameters.py would reject it. That weakens the audit because an invalid explanation can hide a real regression.

Fix: introduce is_audit_consumable_drop(entry, current_params, current_index) that mirrors validate_parameters.check_dropped_signals_schema. The audit consumes a dropped_signals entry only when it would also pass validation. Specifically:

  - origin must be 'prior_baseline' (Fork B; source_digest drops are Fork A territory and not consumed by this audit)

  - reason must be in the closed enum

  - id must be a non-empty string

  - replaced_by replacement_id must resolve to current id or output_name

  - redundant_with redundant_with_id must resolve to current id or output_name

  - moved_to_unmodelled_gate replacement_id must match an unmodelled_gates id

  - cap_pressure cap_kind must name a capped array AND that array must actually be at its cap

Malformed entries are silently skipped (validate_parameters surfaces them as ERRORs). The prior signal falls through to likely_renamed / absent_unexplained instead of being hidden by the invalid explanation.

Five new synthetic tests cover the rejection paths exactly as the reviewer requested: unresolved replacement_id (with the prior signal correctly falling through to likely_renamed), wrong origin (source_digest entries rejected), unresolved redundant_with_id, unjustified cap_pressure (array not at cap), and moved_to_unmodelled_gate pointing at a key_value instead of an unmodelled_gates entry. Updated the previously-passing cap_pressure positive test to fill key_values to its cap so the claim is justified.

Also updated the stale module docstring: explained_drop is now listed in the status enumeration, the dropped_signals consumption is documented under audit behaviour, and the out-of-scope list correctly excludes dropped_signals (which this PR adds). 56 unit tests pass total (28 validator + 28 audit; 5 new for strict consumption). 9/9 smoke checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye commented May 21, 2026

Addressed the strict-consumption blocker (`87a6e7fa`).

Fix: introduced `is_audit_consumable_drop(entry, current_params, current_index)` that mirrors the validator's checks. The audit now consumes a `dropped_signals` entry only when it would also pass validation:

  • `origin == "prior_baseline"` (source_digest entries are Fork A territory and not consumed here)
  • `reason` in the closed enum
  • `id` is non-empty
  • `replaced_by` → `replacement_id` resolves to current id or output_name
  • `redundant_with` → `redundant_with_id` resolves to current id or output_name
  • `moved_to_unmodelled_gate` → `replacement_id` matches an `unmodelled_gates` id
  • `cap_pressure` → `cap_kind` names a capped array AND that array is actually at its cap

Malformed entries fall through to `likely_renamed` / `absent_unexplained` instead of being hidden by the invalid explanation.

Five new synthetic tests cover the exact rejection paths you asked for:

  • `test_explained_drop_rejects_unresolved_replacement_id` — the literal failing case from your message; the prior signal correctly falls through to `likely_renamed`.
  • `test_explained_drop_rejects_source_digest_origin` — wrong origin → no consumption.
  • `test_explained_drop_rejects_unresolved_redundant_with_id`
  • `test_explained_drop_rejects_unjustified_cap_pressure` — array not at cap → no consumption.
  • `test_explained_drop_rejects_moved_to_unmodelled_gate_pointing_at_kv` — pointing at a key_value instead of an unmodelled_gates entry → no consumption.

Also updated the previously-passing positive `cap_pressure` test to fill `key_values` to its cap of 8 so the claim is justified.

Module docstring refresh: `explained_drop` is now listed in the status enumeration, the `dropped_signals` consumption is documented under audit behaviour, and the out-of-scope list correctly excludes `dropped_signals` (which this PR adds).

56 unit tests pass total (28 validator + 28 audit; 5 new for strict consumption). 9/9 smoke checks. CI re-run pending.

@neoneye neoneye merged commit b2c0681 into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/dropped-signals-schema-141-pr2 branch May 21, 2026 18:39
neoneye added a commit that referenced this pull request May 21, 2026
Three review fixes:

1. plan: update the stale 'No formal source-preservation audit implementation' bullet — Fork B shipped in PR #751/#752/#753; Fork A, orchestrator-side prior-baseline injection, and strict-mode are the actual still-pending follow-ups.

2. plan: bump the document title from 2026-05-20 to 2026-05-22; add an italicised note that the doc was originally drafted 2026-05-20 and renamed/refreshed for the post-#753 ship-set.

3. methology: stop overclaiming what the assessment Basis column exposes. summarize_assessment.py maps source:'data' → 'report_derived' and source:'assumption' → 'model_assumption', and that is what the column shows; the finer 'plan-internal gap forecast vs bare commitment' distinction lives in the rationale string, not the column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…posal 141 PR 3)

Builds on PR PlanExeOrg#751 (Fork B audit) and PR PlanExeOrg#752 (dropped_signals schema + strict consumption) by closing the prior_baseline loop: the extract skill now has a way to see what the previous iteration emitted, so it can decide what to preserve and what to explain-drop.

Per review direction, the orchestration is intentionally NARROW: the full prior parameters.json is NOT passed in. Instead, prepare_extract_input.py builds a compact Prior Signal Ledger and appends it to the combined digest at the end. The ledger contains only:

  - signal names (entry ids and output_names)

  - section and kind (id or output_name)

  - formula_hint when present

  - depends_on when non-empty

Intentionally excluded: source_text, label, comment, value. These would anchor the LLM on old phrasings and old framings — the ledger is a preservation BUDGET, not a phrasing TARGET. The source digest above the ledger remains the authoritative input.
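As a rough illustration of the ledger's shape, the sketch below walks a prior parameters dict and emits only the fields listed above. The function name `build_ledger_sketch`, the markdown layout, the empty-ledger wording, the assumed list-of-dicts sections, and the decision to skip a prior `dropped_signals` section are all assumptions; the real build_prior_signal_ledger may look quite different.

```python
def build_ledger_sketch(prior_params):
    """Render a compact markdown ledger of prior signal names.

    Only names, section, kind, formula_hint and depends_on are emitted;
    source_text, label, comment and value are never read.
    """
    lines = ["## Prior Signal Ledger", ""]
    seen = set()
    for section, entries in prior_params.items():
        if section == "dropped_signals" or not isinstance(entries, list):
            continue  # assumption: prior explanations are not signals themselves
        for entry in entries:
            if not isinstance(entry, dict):
                continue
            # "id" is visited first, so kind=id wins when id == output_name.
            for kind in ("id", "output_name"):
                name = entry.get(kind)
                if not isinstance(name, str) or not name or name in seen:
                    continue
                seen.add(name)
                parts = [f"- {name} (section={section}, kind={kind})"]
                if entry.get("formula_hint"):
                    parts.append(f"formula_hint: {entry['formula_hint']}")
                if entry.get("depends_on"):
                    parts.append("depends_on: " + ", ".join(map(str, entry["depends_on"])))
                lines.append("; ".join(parts))
    if not seen:
        lines.append("(no prior signals recorded)")
    return "\n".join(lines)
```

Because the excluded fields are never read, a later edit to the formatting cannot leak them by accident; that is the design intent of the sketch, in line with treating the ledger as a preservation budget and not a phrasing target.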

Changes:

(1) prepare_extract_input.py grows a --prior CLI flag pointing at a prior parameters.json. When omitted (first-iteration extraction), no ledger is appended and behavior is unchanged. When provided, build_prior_signal_ledger emits a compact markdown section appended after the bundle.

(2) Both extract prompts (from-digest and from-full) gain a 'Prior Signal Ledger' subsection in the dropped_signals area. Posture: ledger is advisory metadata, source remains authoritative; preserve when source-supported, record dropped_signals when not; do NOT invent dropped_signals entries for signals not in the ledger or source.

(3) 12 synthetic unit tests cover ledger construction: key_value ids with section/kind tags, output_names tracked separately when distinct from ids, formula_hint and depends_on inclusion, formula_hint omission when null, id-equals-output_name dedupe (kind=id wins), unmodelled_gates inclusion, first-iteration empty-ledger message, and explicit exclusion of label/source_text/comment/value. Plus 3 end-to-end tests covering build_combined_digest with and without --prior.
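The optional-flag behaviour described in change (1) could be wired along these lines. This is a sketch only: the positional `digest` argument, `parse_args`, and `combine_sketch` are hypothetical stand-ins, and the real prepare_extract_input.py interface beyond the `--prior` flag is not shown in this commit message.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical CLI with an optional --prior path."""
    parser = argparse.ArgumentParser(description="Sketch: prepare extract input")
    # Hypothetical positional; the real script's arguments are not documented here.
    parser.add_argument("digest", type=Path, help="path to the source digest")
    parser.add_argument("--prior", type=Path, default=None,
                        help="optional path to a prior parameters.json")
    return parser.parse_args(argv)


def combine_sketch(digest_text, ledger_text=None):
    """Append the ledger after the digest; leave the digest unchanged without one."""
    if ledger_text is None:
        return digest_text  # first-iteration extraction: behaviour unchanged
    return digest_text.rstrip("\n") + "\n\n" + ledger_text + "\n"
```

Keeping the ledger at the very end means the source digest is read first and stays the authoritative input, which matches the posture described for the prompts.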

End-to-end empirical check: ran prepare_extract_input.py --prior on paperclip's v49 parameters.json. The ledger lands at the end of the digest with all 16 prior signals — including the latency-tripwire trio (api_latency_p99_threshold_ms, api_latency_margin_ms, actual_api_p99_latency_ms) that v51 silently dropped per the v49→v51 audit on main. The infrastructure is now in place for the LLM extract skill to see these prior signals and either preserve them OR record dropped_signals.

What this PR explicitly does NOT do:

  - Does not re-run the LLM extract skill end-to-end (that is the user's next step via the standard skill workflow). The skill re-run plus audit comparison is the empirical validation of whether the ledger actually helps the LLM populate dropped_signals usefully.

  - Does not pass the full prior parameters.json (per review direction — anchoring risk).

  - Does not change strict mode, CI gating, or Fork A scope (those land in later PRs once this loop is proven useful).

  - Does not bundle Phase 5 verify-bounds-citations or different-LLM validation.

Empirical posture: 12 new unit tests pass. 9/9 smoke checks pass. End-to-end smoke run on paperclip produced a clean digest with the ledger appended. No corpus literals introduced (the ledger emits the actual prior_baseline ids, but those are extracted ids from gitignored corpus outputs, not literals embedded in the prompt or code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…hip-set

Updates two docs to reflect the post-PlanExeOrg#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour — two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter; extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field; 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema); bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError; advisory audit_source_preservation.py step.

20260520_plan.md → 20260522_plan.md: bump status date; mark PR PlanExeOrg#750 merged; add PR PlanExeOrg#751/PlanExeOrg#752/PlanExeOrg#753 entries (proposal 141 implementation); update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set); add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom); reorder Next likely move now that proposal 141 has shipped: Phase 5 citation verifier promoted to #1, Phase 8 samplers added as #2 with v58 cases that bite now, Phase 9 composite-band cap as #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>