Merge Whisper segments into prose-shaped paragraphs#12

Merged
roelvangils merged 1 commit into main from feature/text-layer-cleanup
May 5, 2026

@roelvangils
Member

Summary

  • Post-process Whisper output before injection: merge ~10–30 s segments into prose-shaped paragraphs of ~3–6 sentences each, instead of one <p> per segment.
  • Pure single-pass greedy state machine. Sentence-terminator detection guards against decimal numbers (3.14) and Dutch+English abbreviations (bv., enz., o.a., Dr., blz., …). Capitalises the first letter at finalize. Force-flushes at max-sentences/max-chars to bound the worst case (Whisper hallucinations with no punctuation).
  • Each cleaned paragraph carries a stable id="tx-<section>-<para>" so a future per-paragraph Media Overlay sync milestone (M6.5 in the design notes) can reference it without a re-render.
  • Default-on. Pass --no-text-cleanup to keep the raw per-segment output for debugging.
  • Zero new dependencies — ASCII-only checks, std::char::to_uppercase for the capitalisation fix.
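
The terminator guards and the capitalisation fix listed above might look roughly like this. It is a minimal sketch, not the code in this PR: the names `is_sentence_end`, `count_sentences`, `capitalise_first`, and `ABBREVIATIONS` are hypothetical, the abbreviation list holds only entries named in this description, and only `.`, `!`, and `?` are treated as terminators.

```rust
/// Hypothetical abbreviation list (lowercase); the real list is longer.
const ABBREVIATIONS: &[&str] = &["bv.", "enz.", "o.a.", "dr.", "mr.", "blz."];

/// Returns true if the byte at index `i` of `text` ends a sentence.
/// All checks are ASCII-only, so byte indexing is safe here: an ASCII
/// byte never occurs inside a multi-byte UTF-8 sequence.
fn is_sentence_end(text: &str, i: usize) -> bool {
    let bytes = text.as_bytes();
    if !matches!(bytes[i], b'.' | b'!' | b'?') {
        return false;
    }
    // Mid-token guard: a terminator followed directly by a non-space
    // byte is inside a token. This covers decimals such as 3.14 and
    // the inner period of "o.a.".
    if let Some(next) = bytes.get(i + 1) {
        if !next.is_ascii_whitespace() {
            return false;
        }
    }
    if bytes[i] != b'.' {
        return true;
    }
    // Abbreviation guard: compare the whitespace-delimited token that
    // ends at this period against the list, case-insensitively.
    let start = bytes[..i]
        .iter()
        .rposition(|b| b.is_ascii_whitespace())
        .map_or(0, |p| p + 1);
    let token = text[start..=i].to_ascii_lowercase();
    !ABBREVIATIONS.contains(&token.as_str())
}

/// Counts sentence terminators that survive the guards.
fn count_sentences(text: &str) -> usize {
    (0..text.len()).filter(|&i| is_sentence_end(text, i)).count()
}

/// Upper-cases the first character using only the standard library.
fn capitalise_first(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) => c.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}
```

With these definitions, `count_sentences("Ask Dr. Jansen, o.a. about this.")` counts one sentence rather than three. A period followed by a closing quote or bracket is not recognised in this sketch.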

Why

The current --transcribe output is technically correct but unreadable: paragraph breaks every twenty seconds, often mid-sentence, often starting with a lowercase letter. Apple Books (and any reader that doesn't play Media Overlays) shows the resulting EPUB as text-only — and that text is currently a wall of fragmented paragraphs. This change makes it read like prose.

The merged paragraph's [start, end] interval is the union of its constituents (first.start_seconds, last.end_seconds). Whisper segments are non-overlapping and chronological, so this is the correct span for any future per-paragraph audio sync. Today's SMIL Media Overlays anchor on <h1> and pagebreak IDs, not on the injected <p>s, so this change is a no-op for current overlay behaviour. The paragraph IDs are infrastructure for the M6.5 follow-up.
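
As an illustration of the span rule and the ID scheme, here is a small sketch. The field names `start_seconds` and `end_seconds` come from the description above; the `Segment` struct, the `merged_span` and `paragraph_id` helpers, and the use of plain unpadded indices in the ID are assumptions.

```rust
/// Hypothetical shape of one Whisper segment.
#[derive(Debug, Clone)]
struct Segment {
    start_seconds: f64,
    end_seconds: f64,
    text: String,
}

/// Span of a merged paragraph: because segments are chronological and
/// non-overlapping, the union is (first.start_seconds, last.end_seconds).
fn merged_span(segments: &[Segment]) -> Option<(f64, f64)> {
    let first = segments.first()?;
    let last = segments.last()?;
    Some((first.start_seconds, last.end_seconds))
}

/// Builds an ID of the form tx-<section>-<para>.
/// Whether the real indices are zero-based or padded is not stated.
fn paragraph_id(section: usize, para: usize) -> String {
    format!("tx-{section}-{para}")
}
```

An empty slice yields `None`, so a caller never has to invent a span for a paragraph with no constituents.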

Algorithm in one paragraph

Append segments to a running paragraph builder. After each append, check whether the buffer ends at a real sentence terminator (. ! ? …, with the digit-decimal and abbreviation guards). If yes AND the paragraph has ≥ 3 sentences AND ≥ 300 chars: emit and start a new paragraph. If the paragraph reaches ≥ 6 sentences OR ≥ 600 chars: emit anyway (safety valve). At end of input, flush whatever's still open.
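
The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: the type and function names are hypothetical, the constants mirror the thresholds quoted in this description, and the terminator check is a simplified stand-in without the decimal and abbreviation guards or the capitalisation step.

```rust
#[derive(Debug, Clone)]
struct Segment { start_seconds: f64, end_seconds: f64, text: String }

#[derive(Debug, Clone, PartialEq)]
struct Paragraph { start_seconds: f64, end_seconds: f64, text: String }

const MIN_SENTENCES: usize = 3;
const MIN_CHARS: usize = 300;
const MAX_SENTENCES: usize = 6;
const MAX_CHARS: usize = 600;

/// Simplified stand-in: true if the buffer ends with `.`, `!`, or `?`.
fn ends_sentence(text: &str) -> bool {
    matches!(text.trim_end().chars().last(), Some('.' | '!' | '?'))
}

/// Simplified stand-in: counts terminators followed by whitespace or end of text.
fn count_sentences(text: &str) -> usize {
    let mut chars = text.chars().peekable();
    let mut n = 0;
    while let Some(c) = chars.next() {
        let at_boundary = chars.peek().map_or(true, |next| next.is_whitespace());
        if matches!(c, '.' | '!' | '?') && at_boundary {
            n += 1;
        }
    }
    n
}

/// Single-pass greedy merge of segments into paragraphs.
fn merge_segments(segments: &[Segment]) -> Vec<Paragraph> {
    let mut out = Vec::new();
    let mut buf = String::new();
    let (mut start, mut end) = (0.0, 0.0);
    for seg in segments {
        let text = seg.text.trim();
        if text.is_empty() {
            continue; // skip empty segments
        }
        if buf.is_empty() {
            start = seg.start_seconds; // first constituent sets the start
        } else {
            buf.push(' ');
        }
        buf.push_str(text);
        end = seg.end_seconds; // last constituent sets the end

        let sentences = count_sentences(&buf);
        let chars = buf.chars().count();
        // Natural break: at a terminator with both minimums met.
        let natural = ends_sentence(&buf) && sentences >= MIN_SENTENCES && chars >= MIN_CHARS;
        // Safety valve: emit regardless once either maximum is reached.
        let forced = sentences >= MAX_SENTENCES || chars >= MAX_CHARS;
        if natural || forced {
            let text = std::mem::take(&mut buf);
            out.push(Paragraph { start_seconds: start, end_seconds: end, text });
        }
    }
    if !buf.is_empty() {
        // Tail flush at end of input.
        out.push(Paragraph { start_seconds: start, end_seconds: end, text: buf });
    }
    out
}
```

Because each decision depends only on the current buffer, the pass needs no lookahead and runs once over the input.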

Test plan

  • 12 unit tests cover: basic 3-sentence merge, min-chars hold, max-sentences split, max-chars safety valve, tail flush, decimal-number false positive, Dutch abbreviation false positive (enz.), Dr. abbreviation, first-letter capitalisation, empty-segment skip, timing preservation, empty input.
  • Existing synthetic-fixture tests stay green (3 tests, EPUBCheck-clean assertion intact).
  • Existing real-book tests stay green when run with DPUB_TEST_BOOK (3 tests, including EPUBCheck-clean on the reference book).
  • cargo clippy --all-targets -- -D warnings clean on both touched crates.
  • cargo test --workspace all green.
  • Manual end-to-end: re-run dpub convert --transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32 against the reference book once the in-flight comparison run finishes; eyeball one section's XHTML to confirm the cleanup reads as prose. (Will land as a comment on this PR.)

Out of scope (filed separately)

The current `--transcribe` output emits one `<p>` per Whisper segment,
which is roughly one paragraph every 10–30 seconds of audio, often
starting mid-sentence. The result is correct text but unreadable as
prose.

Add a `text_cleanup` module that merges segments into ~3–6 sentence
paragraphs via a single-pass greedy state machine: append until a
sentence terminator AND min-sentences/min-chars are met; force-flush
at max-sentences/max-chars. Sentence-terminator detection has guards
for decimal numbers (3.14) and Dutch+English abbreviations (bv. enz.
o.a. Dr. Mr. blz.). First-letter capitalisation is fixed at finalize.

The merged paragraph's [start, end] interval is the union of its
constituents — so future per-paragraph Media Overlay sync is trivially
correct. Each paragraph carries `id="tx-<section>-<para>"` already, so
the SMIL-side change can land later as a single-crate edit.

Default-on: the per-segment behaviour is bad for everyone except a
debugger of raw model output. `--no-text-cleanup` keeps that path.

12 unit tests cover the algorithm rules; existing integration tests
(synthetic fixture, real-book conversion, EPUBCheck-clean assertions)
stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roelvangils roelvangils merged commit 68e3c0f into main May 5, 2026
4 of 5 checks passed
@roelvangils roelvangils deleted the feature/text-layer-cleanup branch May 5, 2026 20:59