35 changes: 23 additions & 12 deletions experiments/napkin_math/docs/20260520_plan.md
@@ -83,7 +83,9 @@ separated below.
- `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
All edits applied symmetrically to both extract skills. No corpus literals introduced.
- **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
- **PR #744** (merged) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
- **PR #746** (merged) — Phase 3 validate-parameters. Two new deterministic structural checks added to `validate_parameters.py`. `aggregate_not_bounded` (R1.1): when an entry's `formula_hint` is a pure sum of named identifiers and its `output_name` also appears in `missing_values_to_estimate`, the aggregate is sampled independently of its constituents (a single Monte Carlo trial can pair sub-component p95s with a total p05). Detection is syntactic — RHS contains `+`, no other operators, every operand is a snake_case identifier. `requirement_has_margin` (R2.5): when a `key_value`'s id ends in `_required`, at least one calculation must reference it via formula RHS, contain a subtraction or ratio operator, AND emit an `output_name` with a positive-pass margin suffix (`_margin`/`_surplus`/`_buffer`/`_coverage`). All three properties required so a bare reference inside a sum (`combined = actual + required`) no longer satisfies the rule. 19 unit tests, 6 v51 plans still validate clean. Out of scope: the `sampling_discipline` enum expansion (lives in `run_monte_carlo.py`, deferred to Phase 4 to avoid a silent-shim antipattern).
- **PR #747** (merged) — Phase 4 runtime + schema readiness. Code-side half of Phase 4: (1) `strip_threshold_bounds` extended with a `calculation-output` reason — variables that are the declared `output_name` of any calculation get stripped from bounds (R1.1 backstop); (2) `lognormal` and `pert` reserved in `VALID_DISCIPLINES`, `sample_one` raises `NotImplementedError` loudly when sampling is attempted (no silent fall-back to triangular); (3) the optional `correlations` top-level key reserved and preserved through the stripper; (4) the warning text after `strip_threshold_bounds` is now reason-branched (calculation-output strips say "simulation will compute it from calculations.py", suffix/formula-side strips keep the original threshold wording). 73 unit tests, 9/9 smoke checks, 0 false positives on the v48 corpus. The LLM-rule changes (base-anchoring conditional, citation-context-leak self-audit examples, plan_type-driven lognormal default, detailed correlations selection rules) ship in the Phase 4 follow-up.
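
The PR #744 entry above describes `quote_is_in_source` as a substring fast path plus a token-overlap fallback gated at three tokens. A minimal sketch of that shape follows, not the merged implementation: the helper names `_normalise` and `_tokens`, the constant `MIN_FALLBACK_TOKENS`, the whitespace/lowercase normalisation, and the `\w+` tokenisation (which drops `$` and punctuation) are all assumptions made for illustration.

```python
import re

# Assumed name for the "min-3-token gate" described in the PR #744 entry.
MIN_FALLBACK_TOKENS = 3


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalisation)."""
    return " ".join(text.lower().split())


def _tokens(text: str) -> list[str]:
    """Split into word tokens; punctuation and currency signs are dropped (assumption)."""
    return re.findall(r"\w+", text.lower())


def quote_is_in_source(quote: str, source: str) -> bool:
    norm_quote = _normalise(quote)
    if not norm_quote:
        return False

    # Fast path: the quote appears verbatim after normalisation.
    if norm_quote in _normalise(source):
        return True

    # Fallback: every quote token must occur somewhere in the source.
    # Quotes shorter than the gate are rejected, since one or two common
    # words would match almost any source.
    quote_tokens = _tokens(quote)
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False
    source_tokens = set(_tokens(source))
    return all(token in source_tokens for token in quote_tokens)
```

Under these assumptions a reordered paraphrase such as `OPC UA bid $75k` verifies against a source sentence that contains all four tokens in a different order, while a quote that introduces a token absent from the source (for example a different dollar figure) does not.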

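The `aggregate_not_bounded` check from PR #746 is described above as purely syntactic: the formula's right-hand side contains `+`, no other operators, and only snake_case operands, and the `output_name` is also listed in `missing_values_to_estimate`. The sketch below illustrates that detection under stated assumptions. It assumes each calculation is a dict with `output_name` and `formula_hint` keys, that `missing_values_to_estimate` is a list of plain names, and that a hint may optionally carry a `name = ` prefix. The helper `is_pure_sum` and the return shape are hypothetical, not taken from `validate_parameters.py`.

```python
import re

# snake_case identifier: lowercase start, optional underscore-separated parts.
_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_pure_sum(formula_hint: str) -> bool:
    """True when the RHS is `a + b (+ c ...)` over snake_case identifiers only."""
    # Keep only the right-hand side if the hint is written as `name = ...`.
    rhs = formula_hint.split("=", 1)[-1]
    if "+" not in rhs:
        return False
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator, literal, or parenthesis makes an operand fail the match.
    return all(_IDENTIFIER.fullmatch(operand) is not None for operand in operands)


def aggregate_not_bounded(calculations, missing_values_to_estimate):
    """Return output names that are pure sums yet also listed for estimation."""
    missing = set(missing_values_to_estimate)
    return [
        calc["output_name"]
        for calc in calculations
        if calc["output_name"] in missing
        and is_pure_sum(calc.get("formula_hint", ""))
    ]
```

A flagged name is an aggregate that would be sampled independently of its constituents, which is the situation the entry describes where one Monte Carlo trial pairs sub-component p95s with a total p05.
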
### PR #737 detail (already on main)

@@ -229,10 +231,10 @@ too:

| Phase | Skill / module | Status |
|---|---|---|
| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer). |
| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
| 4 | `generate-bounds` | not started |
| 3 | `validate-parameters` | **DONE on main via PR #746** (R1.1 `aggregate_not_bounded` and R2.5 `requirement_has_margin` structural checks). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
| 4 | `generate-bounds` | **Code-side runtime DONE on main via PR #747** (R1.1 calculation-output strip extension; R1.4 / R2.2 `lognormal` and `pert` reserved in `VALID_DISCIPLINES` with sample-time `NotImplementedError`; R1.3 `correlations` top-level key reserved and preserved). **Prompt-side LLM-rule changes (R1.2 base-anchoring conditional, R1.5 self-audit citation examples, R2.2 plan_type lognormal default, R1.3 detailed correlations rules) remain a follow-up.** |
| 5 | `verify-bounds-citations` (new) | not started |
| 6 | `generate-calculations` | no change required per the original plan |
| 7 | `run-scenarios` | not started |
@@ -242,31 +244,40 @@ too:

### Next likely move

After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
tolerant quote verification), the remaining work is ordered by what
improves napkin_math output quality most directly:

1. **Bucket-categorisation discipline in compress.** The residual
After PR #746 (validate-parameters structural checks) and PR #747
(generate-bounds runtime + schema readiness), the remaining work is
ordered by what improves napkin_math output quality most directly:

1. **Phase 4 prompt-side follow-up.** With the bounds runtime now
accepting `lognormal`/`pert` and stripping calculation outputs,
the next move is the LLM-rule layer: base-anchoring conditional
rewrite (R1.2 — source: assumption on commitment-default,
source: data on a named anchor), self-audit examples for
citation context-leak (R1.5), `plan_type`-driven lognormal
default for megaproject CAPEX (R2.2), and the detailed
correlations-emission rules (R1.3). Same-LLM same-session
posture: regression check, not improvement claim.
2. **Bucket-categorisation discipline in compress.** The residual
public-output miss in paperclip v53c is the LLM filing a
`$X exceeds threshold` tripwire under `risks_and_shocks` instead
of `gates_and_thresholds`. The bucket-prompt for
`gates_and_thresholds` could be tightened to claim any
"If <metric> <comparator> <numeric threshold>, then ..." sentence,
even when the source frames it as a downside risk. Verify across
the 6-plan probe set; do not overfit to the paperclip OPC UA case.
2. **Implement proposal 141** (`dropped_signals` schema in extract
3. **Implement proposal 141** (`dropped_signals` schema in extract
prompts + `audit_source_preservation.py` deterministic script).
This is the right guardrail for v49/v51 absences and
cap-pressure tradeoffs. Now that the upstream variance fixes
landed (#743, #744), the audit's classification of preserved /
replaced / dropped signals will be measuring against a less
leaky pipeline.
3. **Different-LLM behavioural validation** of the rules now on
4. **Different-LLM behavioural validation** of the rules now on
main. A Self-Improve run with the default napkin_math LLM
(Gemini Flash Lite) against the same digests would close the
same-LLM same-session confound. This should be treated as
validation of prompt generality, not as the next quality fix.
4. **Prompt-hygiene pass** for the remaining domain-specific
5. **Prompt-hygiene pass** for the remaining domain-specific
examples (e.g. `european_prepper_active_buyers`) in either
extract prompt. This is worthwhile and small, but not
load-bearing for the currently observed napkin_math failures.
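
The Phase 4 prompt-side follow-up in item 1 depends on the runtime behaviour from PR #747: `lognormal` and `pert` are accepted as valid discipline names but fail loudly at sample time instead of silently falling back to triangular. A minimal sketch of that reserve-then-raise pattern follows. The bound dict fields (`low`, `base`, `high`, `sampling_discipline`), the `RESERVED_DISCIPLINES` set, the triangular default, and the exact membership of `VALID_DISCIPLINES` are assumptions for illustration and are not copied from `run_monte_carlo.py`.

```python
import random

# Assumed membership: triangular is implemented, the other two are reserved.
VALID_DISCIPLINES = {"triangular", "lognormal", "pert"}
RESERVED_DISCIPLINES = {"lognormal", "pert"}  # hypothetical helper set


def sample_one(bound: dict, rng: random.Random) -> float:
    """Draw one sample for a bound; reserved disciplines raise instead of degrading."""
    discipline = bound.get("sampling_discipline", "triangular")
    if discipline not in VALID_DISCIPLINES:
        raise ValueError(f"unknown sampling_discipline: {discipline!r}")
    if discipline in RESERVED_DISCIPLINES:
        # Loud failure: a silent fall-back to triangular would hide the fact
        # that the requested distribution was never applied.
        raise NotImplementedError(
            f"sampling_discipline {discipline!r} is reserved but not implemented"
        )
    # random.Random.triangular takes (low, high, mode).
    return rng.triangular(bound["low"], bound["high"], bound["base"])
```

The design point is that schema validation and sampling fail at different moments: a plan that declares `lognormal` passes validation today, and the error surfaces only when a run actually tries to sample it.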