Merged
68 changes: 36 additions & 32 deletions experiments/napkin_math/docs/20260520_plan.md
@@ -82,6 +82,8 @@ separated below.
- `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
- `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
All edits applied symmetrically to both extract skills. No corpus literals introduced.
+- **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side: items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
+- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` (`quote_verified=True`) items (13.7%); a 30-sample audit found all to be legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3×: 2/3 runs now have the `$75k` OPC UA bid in the public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.

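The two mechanisms named in the PR #743 and PR #744 entries can be sketched as follows. This is a minimal illustration, not the code in `compress_report_section.py`: the `normalise` and `tokens` helpers, the item shape (a dict with a `source_quote` key), and the reading of the "min-3-token gate" as "the fallback only applies to quotes of at least three tokens" are all assumptions made for the example.

```python
import re

# Assumed reading of the "min-3-token gate": shorter quotes never use the fallback.
MIN_FALLBACK_TOKENS = 3


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace (hypothetical normalisation)."""
    return " ".join(text.lower().split())


def tokens(text: str) -> list[str]:
    """Split into lowercase alphanumeric tokens (hypothetical tokeniser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def quote_is_in_source(quote: str, source: str) -> bool:
    # Fast path: the normalised quote appears verbatim in the source.
    norm_quote = normalise(quote)
    if norm_quote and norm_quote in normalise(source):
        return True
    # Fallback: every quote token must occur somewhere in the source.
    quote_tokens = tokens(quote)
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False
    source_tokens = set(tokens(source))
    return all(tok in source_tokens for tok in quote_tokens)


def merge_second_pass_items(first_pass: list[dict], second_pass: list[dict]) -> list[dict]:
    """Keep first-pass items; append second-pass items with a new normalised quote."""
    seen = {normalise(item["source_quote"]) for item in first_pass}
    merged = list(first_pass)
    for item in second_pass:
        key = normalise(item["source_quote"])
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged
```

With a source sentence such as "If the OPC UA integration bid exceeds $75k, pause the rollout.", the paraphrase "OPC UA bid exceeds $75k" fails the substring check but passes the fallback, while "OPC UA bid exceeds $90k" fails both because `90k` never occurs in the source.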
### PR #737 detail (already on main)

@@ -158,13 +160,18 @@ either; each gets its own follow-up:

- **Compress-LLM run-to-run variance.** Same prompts, same source,
two compress passes can produce materially different bucket
-selections. The paperclip premortem tripwires (`$75k OPC UA bid`,
-`100ms` p99 latency, both in v49) drop at the compress stage in
-v50 and v51 and cannot be recovered at the extract layer because
-the digest itself does not surface them. This is the **clearest
-unresolved regression** across the probe set. The fix belongs in
-orchestration: deterministic retry/merge across N compress
-passes, or lower-temperature reruns for high-impact buckets.
+selections. PR #743 (second pass) closed the emission side: when
+the first pass skips a tripwire, the second pass, given the
+first-pass items as context, often catches it. PR #744 (paraphrase-
+tolerant quote match) closed the verification side: paraphrased
+quotes whose tokens are all in the source no longer flip to
+`qv=False` and lose the +10 verified-quote bonus at the ranking
+layer. Residual modes: (a) bucket-categorisation variance, where
+the LLM occasionally files a `$X exceeds threshold` tripwire under
+`risks_and_shocks` rather than `gates_and_thresholds` (paperclip
+v53c); (b) second-pass misses, where the second pass itself also
+fails to emit the tripwire (paperclip v52c). Both sit at the LLM's
+emission/categorisation layer and cannot be fixed by deterministic
+post-processing alone.
- **Threshold-pairing rule × `missing_values_to_estimate` 5-cap.**
When a plan names many independent thresholds, every-threshold
pairing collides with the cap and forces a tradeoff. The
@@ -222,7 +229,7 @@ too:

| Phase | Skill / module | Status |
|---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
| 4 | `generate-bounds` | not started |
@@ -235,26 +242,25 @@ too:

### Next likely move

-After PR #740, the next work should be ordered by what improves
-napkin_math output quality most directly, not by what is easiest to
-measure. Preferred order:
-
-1. **Compress-LLM variance handling.** Deterministic retry/merge or
-lower-temperature reruns for high-impact compress buckets should
-come next. The clearest driver is the paperclip OPC UA / latency
-tripwires that v49 surfaced and v50/v51 drop at the compress
-layer. This is upstream of extraction: if the digest does not
-carry the tripwire, no extract prompt can recover it. Proposal
-141 would classify this loss, but variance handling is the piece
-that can restore the missing source signal.
+After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
+tolerant quote verification), the remaining work is ordered by what
+improves napkin_math output quality most directly:
+
+1. **Bucket-categorisation discipline in compress.** The residual
+public-output miss in paperclip v53c is the LLM filing a
+`$X exceeds threshold` tripwire under `risks_and_shocks` instead
+of `gates_and_thresholds`. The bucket-prompt for
+`gates_and_thresholds` could be tightened to claim any
+"If <metric> <comparator> <numeric threshold>, then ..." sentence,
+even when the source frames it as a downside risk. Verify across
+the 6-plan probe set; do not overfit to the paperclip OPC UA case.
2. **Implement proposal 141** (`dropped_signals` schema in extract
prompts + `audit_source_preservation.py` deterministic script).
-This should follow close behind variance handling. It is the
-right guardrail for v49/v51 absences and cap-pressure tradeoffs,
-but it is primarily a measurement and accountability layer: it
-classifies preserved / replaced / dropped signals and records
-rationale in the artifact. It does not by itself make the
-compressor less lossy.
+This is the right guardrail for v49/v51 absences and
+cap-pressure tradeoffs. Now that the upstream variance fixes
+have landed (#743, #744), the audit's classification of preserved /
+replaced / dropped signals will be measured against a less
+leaky pipeline.
3. **Different-LLM behavioural validation** of the rules now on
main. A Self-Improve run with the default napkin_math LLM
(Gemini Flash Lite) against the same digests would close the
@@ -265,12 +271,10 @@ measure. Preferred order:
extract prompt. This is worthwhile and small, but not
load-bearing for the currently observed napkin_math failures.

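One way to check the bucket-categorisation item in the list above across the probe set is a deterministic scan for threshold-shaped sentences that landed outside `gates_and_thresholds`. The sketch below is hypothetical and not part of the repository: the function name `find_misfiled_thresholds`, the comparator and unit vocabulary, and the assumption that each bucket is a list of plain strings are all illustrative. It only flags candidates for review; it does not fix the categorisation, which the plan places at the LLM layer. Calling it on a dict such as `{"risks_and_shocks": [...], "gates_and_thresholds": [...]}` returns `(bucket, sentence)` pairs for every match found outside the home bucket.

```python
import re

# Hypothetical comparator and unit vocabulary; adjust for the real corpus.
_COMPARATOR = (
    r"(?:exceeds?|(?:falls?|drops?) below|rises? above"
    r"|is (?:above|below|over|under)|[<>]=?)"
)
_NUMBER = r"[$€£]?\d[\d,.]*\s?(?:ms|k|m|s|%)?"

# Matches "If <metric> <comparator> <numeric threshold>" inside one sentence.
THRESHOLD_SENTENCE = re.compile(
    rf"\bif\b[^.]*?\s{_COMPARATOR}\s*{_NUMBER}", re.IGNORECASE
)


def find_misfiled_thresholds(
    buckets: dict[str, list[str]],
    home: str = "gates_and_thresholds",
) -> list[tuple[str, str]]:
    """Return (bucket, sentence) pairs for threshold sentences outside `home`."""
    misfiled = []
    for bucket, sentences in buckets.items():
        if bucket == home:
            continue  # sentences already in the home bucket are correctly filed
        for sentence in sentences:
            if THRESHOLD_SENTENCE.search(sentence):
                misfiled.append((bucket, sentence))
    return misfiled
```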
-These are separate PRs. The next PR should be compress variance only:
-no corpus literals, no hand-patched outputs, rerun compress + extract
-through the skills, validate regenerated `parameters.json`, and
-compare against v49/v50/v51 honestly. Bundling the audit,
-behavioural validation, or prompt hygiene into that PR would obscure
-whether the upstream signal-loss fix actually worked.
+These are separate PRs. Each ships independently; bundling
+categorisation, audit-implementation, behavioural validation, or
+prompt hygiene into one PR would obscure which piece moved which
+metric.

## Per-theme mapping
