Merge Whisper segments into prose-shaped paragraphs by roelvangils · Pull Request #12 · 11ways/dpub

roelvangils · 2026-05-05T20:53:52Z

Summary

Post-process Whisper output before injection: merge ~10–30 s segments into prose-shaped paragraphs of ~3–6 sentences each, instead of one <p> per segment.
Pure single-pass greedy state machine. Sentence-terminator detection guards against decimal numbers (3.14) and Dutch+English abbreviations (bv., enz., o.a., Dr., blz., …). Capitalises the first letter at finalize. Force-flushes at max-sentences/max-chars to bound the worst case (Whisper hallucinations with no punctuation).
Each cleaned paragraph carries a stable id="tx-<section>-<para>" so a future per-paragraph Media Overlay sync milestone (M6.5 in the design notes) can reference it without a re-render.
Default-on. Pass --no-text-cleanup to keep the raw per-segment output for debugging.
Zero new dependencies: ASCII-only checks, plus char::to_uppercase from the standard library for the capitalisation fix.
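
The terminator guards and the capitalisation fix listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (is_terminator_at, count_sentences, ends_sentence, capitalise_first) are hypothetical, the abbreviation list is limited to the examples named in the summary, and collapsing runs such as "?!" into one terminator is an assumption.

```rust
/// Abbreviations that end in '.' without ending a sentence.
/// Illustrative subset only (the examples named in the summary).
const ABBREVIATIONS: &[&str] = &["bv.", "enz.", "o.a.", "dr.", "mr.", "blz."];

/// True if the character starting at byte offset `idx` is a real sentence
/// terminator. `idx` must be a char boundary (e.g. from `char_indices`).
fn is_terminator_at(text: &str, idx: usize) -> bool {
    let Some(c) = text[idx..].chars().next() else {
        return false;
    };
    match c {
        '!' | '?' | '…' => true,
        '.' => {
            // Decimal guard: a dot between two ASCII digits ("3.14").
            let prev = text[..idx].chars().next_back();
            let next = text[idx + 1..].chars().next();
            let is_decimal = matches!(
                (prev, next),
                (Some(p), Some(n)) if p.is_ascii_digit() && n.is_ascii_digit()
            );
            if is_decimal {
                return false;
            }
            // Abbreviation guard: compare the whole whitespace-delimited
            // word containing this dot, so both dots of "o.a." are skipped.
            let start = text[..idx]
                .char_indices()
                .rev()
                .find(|(_, ch)| ch.is_whitespace())
                .map_or(0, |(i, ch)| i + ch.len_utf8());
            let end = text[idx..]
                .find(char::is_whitespace)
                .map_or(text.len(), |i| idx + i);
            let word = text[start..end].to_ascii_lowercase();
            !ABBREVIATIONS.contains(&word.as_str())
        }
        _ => false,
    }
}

/// Counts sentence terminators, treating a run like "?!" as one.
fn count_sentences(text: &str) -> usize {
    let mut count = 0;
    let mut in_run = false;
    for (i, _) in text.char_indices() {
        let term = is_terminator_at(text, i);
        if term && !in_run {
            count += 1;
        }
        in_run = term;
    }
    count
}

/// True if the buffer, ignoring trailing whitespace, ends at a terminator.
fn ends_sentence(buf: &str) -> bool {
    let trimmed = buf.trim_end();
    trimmed
        .char_indices()
        .next_back()
        .map_or(false, |(i, _)| is_terminator_at(trimmed, i))
}

/// Upper-cases the first character and leaves the rest untouched.
fn capitalise_first(text: &str) -> String {
    let mut chars = text.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}
```

With these definitions, "Pi is 3.14 exactly." and "Zie blz. 12 enz. en verder." each count as one sentence, and a buffer ending in "Dr." is not treated as a finished sentence. The sketch does not handle an abbreviation followed directly by other punctuation (for example "enz.,").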

Why

The current --transcribe output is technically correct but unreadable: paragraph breaks every twenty seconds, often mid-sentence, often starting with a lowercase letter. Apple Books (and any reader that doesn't play Media Overlays) shows the resulting EPUB as text-only — and that text is currently a wall of fragmented paragraphs. This change makes it read like prose.

The merged paragraph's [start, end] interval is the union of its constituents (first.start_seconds, last.end_seconds). Whisper segments are non-overlapping and chronological, so this is the correct span for any future per-paragraph audio sync. Today's SMIL Media Overlays anchor on <h1> and pagebreak IDs, not on the injected <p>s, so this change is a no-op for current overlay behaviour. The paragraph IDs are infrastructure for the M6.5 follow-up.
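
As a rough illustration of the interval rule, the sketch below builds one paragraph from a run of chronological segments. The start_seconds and end_seconds field names come from the description above; the Segment and Paragraph types, the merge_span function, and the three-digit zero padding of the ID (suggested only by the tx-NNN-MMM shorthand) are assumptions, not the PR's actual definitions.

```rust
/// One Whisper segment (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_seconds: f64,
    end_seconds: f64,
    text: String,
}

/// One merged paragraph (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Paragraph {
    id: String,
    start_seconds: f64,
    end_seconds: f64,
    text: String,
}

/// Merges chronological, non-overlapping segments into one paragraph.
/// Returns None when there is nothing to merge.
fn merge_span(section: usize, para: usize, segments: &[Segment]) -> Option<Paragraph> {
    let first = segments.first()?;
    let last = segments.last()?;
    // Join the non-empty segment texts with single spaces.
    let text = segments
        .iter()
        .map(|s| s.text.trim())
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(" ");
    Some(Paragraph {
        // Assumed ID layout: tx-<section>-<para>, zero-padded to 3 digits.
        id: format!("tx-{section:03}-{para:03}"),
        // Because the input is chronological, the span runs from the
        // first segment's start to the last segment's end.
        start_seconds: first.start_seconds,
        end_seconds: last.end_seconds,
        text,
    })
}
```

Taking only the first start and the last end relies on the segments being chronological and non-overlapping, as stated above; unordered input would need a min/max over all segments instead.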

Algorithm in one paragraph

Append segments to a running paragraph builder. After each append, check whether the buffer ends at a real sentence terminator (. ! ? …, with the digit-decimal and abbreviation guards). If yes AND the paragraph has ≥ 3 sentences AND ≥ 300 chars: emit and start a new paragraph. If the paragraph reaches ≥ 6 sentences OR ≥ 600 chars: emit anyway (safety valve). At end of input, flush whatever's still open.
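
The loop described above can be sketched as a single pass over the segment texts. This is a simplified illustration under stated assumptions, not the PR's implementation: the Limits type and merge_segments function are hypothetical names, the terminator check here is deliberately naive (no decimal or abbreviation guards), length is measured in Unicode scalar values, and the safety valve only fires at segment boundaries rather than splitting inside a segment. The default thresholds are the ones quoted in the paragraph above.

```rust
/// Paragraph thresholds; defaults follow the numbers in the description.
struct Limits {
    min_sentences: usize,
    min_chars: usize,
    max_sentences: usize,
    max_chars: usize,
}

impl Default for Limits {
    fn default() -> Self {
        Limits { min_sentences: 3, min_chars: 300, max_sentences: 6, max_chars: 600 }
    }
}

fn is_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?' | '…')
}

/// Naive count: every terminator character counts as one sentence end.
fn count_terminators(text: &str) -> usize {
    text.chars().filter(|&c| is_terminator(c)).count()
}

/// Naive check: does the buffer end in a terminator character?
fn ends_at_terminator(text: &str) -> bool {
    text.trim_end().chars().next_back().map_or(false, is_terminator)
}

/// Greedy single-pass merge of segment texts into paragraphs.
fn merge_segments(segments: &[&str], limits: &Limits) -> Vec<String> {
    let mut out = Vec::new();
    let mut buf = String::new();
    let mut sentences = 0;

    for seg in segments {
        let seg = seg.trim();
        if seg.is_empty() {
            continue; // skip empty segments
        }
        if !buf.is_empty() {
            buf.push(' ');
        }
        buf.push_str(seg);
        sentences += count_terminators(seg);
        let chars = buf.chars().count();

        // Natural break: at a terminator with both minimums satisfied.
        let natural = ends_at_terminator(&buf)
            && sentences >= limits.min_sentences
            && chars >= limits.min_chars;
        // Safety valve: emit anyway once either maximum is reached.
        let forced = sentences >= limits.max_sentences || chars >= limits.max_chars;

        if natural || forced {
            out.push(std::mem::take(&mut buf));
            sentences = 0;
        }
    }

    // Tail flush: whatever is still open at end of input.
    if !buf.is_empty() {
        out.push(buf);
    }
    out
}
```

The safety valve is what bounds the worst case: a long run of segments with no punctuation never satisfies the natural-break condition, so without the max-chars check it would accumulate into a single paragraph.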

Test plan

12 unit tests cover: basic 3-sentence merge, min-chars hold, max-sentences split, max-chars safety valve, tail flush, decimal-number false positive, Dutch abbreviation false positive (enz.), Dr. abbreviation, first-letter capitalisation, empty-segment skip, timing preservation, empty input.
Existing synthetic-fixture tests stay green (3 tests, EPUBCheck-clean assertion intact).
Existing real-book tests stay green when run with DPUB_TEST_BOOK (3 tests, including EPUBCheck-clean on the reference book).
cargo clippy --all-targets -- -D warnings clean on both touched crates.
cargo test --workspace all green.
Manual end-to-end: re-run dpub convert --transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32 against the reference book once the in-flight comparison run finishes; eyeball one section's XHTML to confirm the cleanup reads as prose. (Will land as a comment on this PR.)

Out of scope (filed separately)

#10 (dpub-whisper: transcribe() reloads the GGML model on every call, ~3–5 min wasted per book): per-file Whisper model reload (perf bug; independent fix).
#11 (Cover lookup: --cover <path> + --auto-cover via Open Library): cover-image lookup (--cover and --auto-cover).
M6.5: per-paragraph Media Overlay anchor wiring — the tx-NNN-MMM IDs land in this PR but the SMIL doesn't reference them yet.

The current `--transcribe` output emits one `<p>` per Whisper segment, which is roughly one paragraph every 10–30 seconds of audio, often starting mid-sentence. The result is correct text but unreadable as prose.

Add a `text_cleanup` module that merges segments into ~3–6 sentence paragraphs via a single-pass greedy state machine: append until a sentence terminator AND min-sentences/min-chars are met; force-flush at max-sentences/max-chars. Sentence-terminator detection has guards for decimal numbers (3.14) and Dutch+English abbreviations (bv., enz., o.a., Dr., Mr., blz.). First-letter capitalisation is fixed at finalize.

The merged paragraph's [start, end] interval is the union of its constituents, so future per-paragraph Media Overlay sync is trivially correct. Each paragraph carries `id="tx-<section>-<para>"` already, so the SMIL-side change can land later as a single-crate edit.

Default-on: the per-segment behaviour is bad for everyone except a debugger of raw model output. `--no-text-cleanup` keeps that path.

12 unit tests cover the algorithm rules; existing integration tests (synthetic fixture, real-book conversion, EPUBCheck-clean assertions) stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roelvangils merged commit 68e3c0f into main May 5, 2026
4 of 5 checks passed

roelvangils deleted the feature/text-layer-cleanup branch May 5, 2026 20:59

roelvangils mentioned this pull request May 6, 2026

M6.5: word-level Media Overlay sync #29

Merged

6 tasks
