62 changes: 36 additions & 26 deletions experiments/napkin_math/docs/20260520_plan.md
@@ -76,13 +76,11 @@ separated below.

- **PR #737** (merged) — Phase 1 compress prompts + initial extract threshold-pairing rule + `OPTIMIZE_INSTRUCTIONS` discipline banner. Substantive content described below under "PR #737 detail" for continuity.
- **PR #739** (merged) — Proposal 141 ("Source-Preservation Audit for the Napkin Math Pipeline") landed as design only. No code or prompt change. Implementation deferred.

### Open for merge

- **PR #740** — Phase 2 extract-prompt rules. Three commits land in `extract-parameters-from-digest` and `extract-parameters-from-full`:
- **PR #740** (merged) — Phase 2 extract-prompt rules. Four commits landed in `extract-parameters-from-digest` and `extract-parameters-from-full`:
- `4cda70ba` — Source-arithmetic preservation rule (Patterns 1/2/3: aggregate sum, burn rate × duration, explicit decomposition block) + threshold-pairing parity backfill into the full-extract skill.
- `19f927b7` — Tightened aggregate-sum wording so independent caps/envelopes are NOT collapsed into derived sums; reconciled discipline-shared paragraph with the cap-pressure paragraph.
- `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
- `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
All edits applied symmetrically to both extract skills. No corpus literals introduced.
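
The truncation discipline from `8f94c8cd` is a prompt rule, so the model applies it, not code. A validator-side check could still mirror the mechanical part. The sketch below is illustrative only: `truncate_source_text` is a hypothetical helper, not something in the repository, and it assumes the ellipsis attaches to the last kept word rather than counting as a word of its own. Dropping the "consequence clause" is a semantic judgement that this sketch does not attempt.

```python
# Hypothetical helper: enforce a 20-word cap on source_text.
# Assumption: the trailing ellipsis is glued to the last kept word,
# so it does not count toward the word cap.
SENTENCE_END = (".", "!", "?")


def truncate_source_text(text: str, max_words: int = 20) -> str:
    """Return text cut to at most max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        # Already within the cap: only normalise whitespace.
        return " ".join(words)
    kept = " ".join(words[:max_words])
    if kept.endswith(SENTENCE_END):
        # The cut landed on a sentence boundary, so no ellipsis is needed.
        return kept
    # Cut landed mid-sentence: strip dangling punctuation and mark the cut.
    return kept.rstrip(",;:") + "..."
```

A check of this shape could flag any extracted `source_text` that exceeds the cap without rewriting it; whether to truncate or reject is a design choice left to the validator.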

### PR #737 detail (already on main)
@@ -225,7 +223,7 @@ too:
| Phase | Skill / module | Status |
|---|---|---|
| 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
| 2 | `extract-parameters-from-{full,digest}` | **PARTIAL** — threshold-pairing on `from-digest` shipped in PR #737. Source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline are in PR #740, open for merge. After #740 merges, Phase 2's prompt-side work is complete for the original directives. |
| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
| 4 | `generate-bounds` | not started |
| 5 | `verify-bounds-citations` (new) | not started |
@@ -237,30 +235,42 @@ too:

### Next likely move

After PR #740 merges, the next-most-load-bearing follow-ups, in
preferred order:

1. **Implement proposal 141** (`dropped_signals` schema in extract
After PR #740, the next work should be ordered by what improves
napkin_math output quality most directly, not by what is easiest to
measure. Preferred order:

1. **Compress-LLM variance handling.** Deterministic retry/merge or
lower-temperature reruns for high-impact compress buckets should
come next. The clearest driver is the paperclip OPC UA / latency
tripwires that v49 surfaced and v50/v51 drop at the compress
layer. This is upstream of extraction: if the digest does not
carry the tripwire, no extract prompt can recover it. Proposal
141 would classify this loss, but variance handling is the piece
that can restore the missing source signal.
2. **Implement proposal 141** (`dropped_signals` schema in extract
prompts + `audit_source_preservation.py` deterministic script).
The design is on main; without the implementation, v49 absences
across the probe set cannot be mechanically classified, and the
yellowstone-style cap-pressure tradeoffs have no place to record
their structural rationale in the artifact itself.
2. **Compress-LLM variance handling.** Deterministic retry/merge or
lower-temperature reruns for high-impact buckets. The clearest
driver: the paperclip OPC UA / latency tripwires that v49
surfaced and v50/v51 drop at the compress layer.
3. **Different-LLM behavioural validation** of the rules now in
#740. A Self-Improve run with the default napkin_math LLM
(Gemini Flash Lite) against the same v51 digests would close
the same-LLM same-session confound.
This should follow close behind variance handling. It is the
right guardrail for v49/v51 absences and cap-pressure tradeoffs,
but it is primarily a measurement and accountability layer: it
classifies preserved / replaced / dropped signals and records
rationale in the artifact. It does not by itself make the
compressor less lossy.
3. **Different-LLM behavioural validation** of the rules now on
main. A Self-Improve run with the default napkin_math LLM
(Gemini Flash Lite) against the same digests would close the
same-LLM same-session confound. This should be treated as
validation of prompt generality, not as the next quality fix.
4. **Prompt-hygiene pass** for the remaining domain-specific
examples (e.g. `european_prepper_active_buyers`) in either
extract prompt. Small, scoped, can ride alongside any of the
above.

These are four separate PRs, not one. Bundling them re-creates the
scope creep PR #740 was extracted from.
extract prompt. This is worthwhile and small, but not
load-bearing for the currently observed napkin_math failures.

These are separate PRs. The next PR should be compress variance only:
no corpus literals, no hand-patched outputs, rerun compress + extract
through the skills, validate regenerated `parameters.json`, and
compare against v49/v50/v51 honestly. Bundling the audit,
behavioural validation, or prompt hygiene into that PR would obscure
whether the upstream signal-loss fix actually worked.
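
The "deterministic retry/merge" idea for compress variance can be sketched as a union over several compress reruns, so that a signal surfaced by any one run survives into the digest. This is a minimal sketch under stated assumptions, not the pipeline's implementation: `normalize`, `merge_compress_runs`, and `unstable_signals` are hypothetical names, and each run is assumed to be a plain list of signal strings for one bucket.

```python
# Hypothetical sketch: merge signals from repeated compress runs.
# Assumption: each run yields a list of free-text signal strings.


def normalize(signal: str) -> str:
    """Case-fold and collapse whitespace so near-identical signals match."""
    return " ".join(signal.lower().split())


def merge_compress_runs(runs: list[list[str]]) -> list[str]:
    """Union of signals across runs, in a deterministic (sorted-key) order."""
    first_seen: dict[str, str] = {}
    for run in runs:
        for signal in run:
            # Keep the first surface form seen for each normalised key.
            first_seen.setdefault(normalize(signal), signal)
    return [first_seen[key] for key in sorted(first_seen)]


def unstable_signals(runs: list[list[str]]) -> list[str]:
    """Normalised keys that appear in some runs but not in all of them."""
    counts: dict[str, int] = {}
    for run in runs:
        for key in {normalize(s) for s in run}:
            counts[key] = counts.get(key, 0) + 1
    return sorted(key for key, n in counts.items() if n < len(runs))


# Made-up example data, purely for illustration.
example_runs = [["Latency tripwire A", "Budget cap B"], ["budget  cap b"]]
merged = merge_compress_runs(example_runs)
flaky = unstable_signals(example_runs)
```

Here `merged` keeps both signals even though the second run dropped the tripwire, and `flaky` names the signal that only some runs carried. A union favours recall over precision, so a real implementation would likely restrict it to the high-impact buckets.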

## Per-theme mapping
