Merge Whisper segments into prose-shaped paragraphs #12
Merged
The current `--transcribe` output emits one `<p>` per Whisper segment, which is roughly one paragraph every 10–30 seconds of audio, often starting mid-sentence. The result is correct text but unreadable as prose.

Add a `text_cleanup` module that merges segments into ~3–6 sentence paragraphs via a single-pass greedy state machine: append until a sentence terminator AND min-sentences/min-chars are met; force-flush at max-sentences/max-chars. Sentence-terminator detection has guards for decimal numbers (3.14) and Dutch+English abbreviations (bv. enz. o.a. Dr. Mr. blz.). First-letter capitalisation is fixed at finalize.

The merged paragraph's [start, end] interval is the union of its constituents, so future per-paragraph Media Overlay sync is trivially correct. Each paragraph carries `id="tx-<section>-<para>"` already, so the SMIL-side change can land later as a single-crate edit.

Default-on: the per-segment behaviour is bad for everyone except a debugger of raw model output. `--no-text-cleanup` keeps that path.

12 unit tests cover the algorithm rules; existing integration tests (synthetic fixture, real-book conversion, EPUBCheck-clean assertions) stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
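To make the terminator guards concrete, here is a minimal sketch of how such a check could look. It is not the `text_cleanup` module's actual code: the function names, the `ABBREVIATIONS` constant, and the rule that a terminator must be followed by whitespace or end of text are all assumptions made for illustration. The abbreviation list contains only the examples named in the description above.

```rust
/// Abbreviations named in the PR description, lowercased.
/// Assumption: the real list may be longer or stored differently.
const ABBREVIATIONS: &[&str] = &["bv.", "enz.", "o.a.", "dr.", "mr.", "blz."];

/// True if the character at index `i` ends a sentence under this sketch's rules.
fn is_terminator_at(chars: &[char], i: usize) -> bool {
    if !matches!(chars[i], '.' | '!' | '?' | '…') {
        return false;
    }
    // Assumed rule: a terminator must be followed by whitespace or end of text.
    // This rejects the decimal point in "3.14" and the inner dot of "o.a.".
    if chars.get(i + 1).is_some_and(|c| !c.is_whitespace()) {
        return false;
    }
    if chars[i] == '.' {
        // Abbreviation guard: inspect the whitespace-delimited token ending here.
        let start = chars[..i]
            .iter()
            .rposition(|c| c.is_whitespace())
            .map_or(0, |p| p + 1);
        let token: String = chars[start..=i].iter().collect::<String>().to_lowercase();
        // Drop leading punctuation such as an opening parenthesis.
        let token = token.trim_start_matches(|c: char| !c.is_alphanumeric());
        if ABBREVIATIONS.contains(&token) {
            return false;
        }
    }
    true
}

/// Counts sentence ends in `text`.
fn count_sentences(text: &str) -> usize {
    let chars: Vec<char> = text.chars().collect();
    (0..chars.len())
        .filter(|&i| is_terminator_at(&chars, i))
        .count()
}

/// True if `text`, ignoring trailing whitespace, ends at a sentence terminator.
fn ends_sentence(text: &str) -> bool {
    let chars: Vec<char> = text.trim_end().chars().collect();
    !chars.is_empty() && is_terminator_at(&chars, chars.len() - 1)
}
```

One consequence of this simplified guard is that a sentence genuinely ending in `enz.` is not treated as a boundary. The merge step's max-sentences/max-chars flush bounds how long such a paragraph can grow.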
Summary
- […] `<p>` per segment.
- […] `3.14`) and Dutch+English abbreviations (`bv.`, `enz.`, `o.a.`, `Dr.`, `blz.`, …). Capitalises the first letter at finalize. Force-flushes at max-sentences/max-chars to bound the worst case (Whisper hallucinations with no punctuation).
- […] `id="tx-<section>-<para>"` so a future per-paragraph Media Overlay sync milestone (M6.5 in the design notes) can reference it without a re-render.
- […] `--no-text-cleanup` to keep the raw per-segment output for debugging.
- […] `std::char::to_uppercase` for the capitalisation fix.

Why
The current `--transcribe` output is technically correct but unreadable: paragraph breaks every twenty seconds, often mid-sentence, often starting with a lowercase letter. Apple Books (and any reader that doesn't play Media Overlays) shows the resulting EPUB as text-only, and that text is currently a wall of fragmented paragraphs. This change makes it read like prose.

The merged paragraph's `[start, end]` interval is the union of its constituents (`first.start_seconds`, `last.end_seconds`). Whisper segments are non-overlapping and chronological, so this is the correct span for any future per-paragraph audio sync. Today's SMIL Media Overlays anchor on `<h1>` and pagebreak IDs, not on the injected `<p>`s, so this change is a no-op for current overlay behaviour. The paragraph IDs are infrastructure for the M6.5 follow-up.

Algorithm in one paragraph
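The merge loop this section describes can be sketched in Rust roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `Segment` and `Paragraph` types, the constant names, and the function names are hypothetical, and `is_terminator` is a deliberately simplified stand-in that omits the decimal and abbreviation guards. Only the thresholds (3/300 and 6/600), the interval union, the empty-segment skip, and the capitalisation at finalize come from the description.

```rust
/// Hypothetical segment type; field names follow the `start_seconds` /
/// `end_seconds` wording of the description, the rest is assumed.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_seconds: f64,
    end_seconds: f64,
    text: String,
}

/// Hypothetical merged paragraph carrying the union of its segments' spans.
#[derive(Debug, Clone, PartialEq)]
struct Paragraph {
    start_seconds: f64,
    end_seconds: f64,
    text: String,
}

const MIN_SENTENCES: usize = 3;
const MIN_CHARS: usize = 300;
const MAX_SENTENCES: usize = 6;
const MAX_CHARS: usize = 600;

/// Simplified stand-in: no decimal or abbreviation guards.
fn is_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?' | '…')
}

/// Upper-cases the first character via `char::to_uppercase`.
fn capitalise_first(text: &str) -> String {
    let mut chars = text.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Applies the finalize step (first-letter capitalisation) to a paragraph.
fn finalize(mut p: Paragraph) -> Paragraph {
    p.text = capitalise_first(&p.text);
    p
}

/// Single-pass greedy merge of chronological segments into paragraphs.
fn merge_segments(segments: &[Segment]) -> Vec<Paragraph> {
    let mut out = Vec::new();
    let mut open: Option<Paragraph> = None;
    for seg in segments {
        let text = seg.text.trim();
        if text.is_empty() {
            continue; // empty-segment skip
        }
        // Open a new paragraph at this segment's start if none is running.
        let para = open.get_or_insert_with(|| Paragraph {
            start_seconds: seg.start_seconds,
            end_seconds: seg.end_seconds,
            text: String::new(),
        });
        if !para.text.is_empty() {
            para.text.push(' ');
        }
        para.text.push_str(text);
        para.end_seconds = seg.end_seconds; // union: first start, last end

        let sentences = para.text.chars().filter(|&c| is_terminator(c)).count();
        let chars = para.text.chars().count();
        let at_boundary = para.text.chars().last().is_some_and(is_terminator);
        let ready = at_boundary && sentences >= MIN_SENTENCES && chars >= MIN_CHARS;
        let forced = sentences >= MAX_SENTENCES || chars >= MAX_CHARS; // safety valve
        if ready || forced {
            out.extend(open.take().map(finalize));
        }
    }
    // End of input: flush whatever is still open.
    out.extend(open.take().map(finalize));
    out
}
```

Because the paragraph only ever takes its start from the first appended segment and its end from the most recent one, the span stays correct as long as the input is chronological and non-overlapping, which is the property the description relies on.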
Append segments to a running paragraph builder. After each append, check whether the buffer ends at a real sentence terminator (`.`, `!`, `?`, `…`, with the digit-decimal and abbreviation guards). If yes AND the paragraph has ≥ 3 sentences AND ≥ 300 chars: emit and start a new paragraph. If the paragraph reaches ≥ 6 sentences OR ≥ 600 chars: emit anyway (safety valve). At end of input, flush whatever's still open.

Test plan
- […] `enz.`), Dr. abbreviation, first-letter capitalisation, empty-segment skip, timing preservation, empty input.
- […] `DPUB_TEST_BOOK` (3 tests, including EPUBCheck-clean on the reference book).
- `cargo clippy --all-targets -- -D warnings` clean on both touched crates.
- `cargo test --workspace` all green.
- `dpub convert --transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32` against the reference book once the in-flight comparison run finishes; eyeball one section's XHTML to confirm the cleanup reads as prose. (Will land as a comment on this PR.)

Out of scope (filed separately)
- `transcribe()` reloads the GGML model on every call (~3–5 min wasted per book) #10: per-file Whisper model reload (perf bug; independent fix).
- […] `--cover` and `--auto-cover`).
- […] `tx-NNN-MMM` IDs land in this PR but the SMIL doesn't reference them yet.