Summary
dpub_whisper::transcribe() constructs a fresh WhisperContext (and re-loads the GGML model into GPU/CPU buffers) on every invocation. dpub_convert::inject_transcripts() calls it once per audio file in a sequential loop, so a typical audiobook conversion reloads the same 1.5 GB model 30+ times back-to-back.
Repro
Convert any multi-section DAISY book with --transcribe:
dpub convert /path/to/ncc.html -o out.epub \
    --audio opus --bitrate 32 \
    --transcribe nl --whisper-model ~/models/ggml-medium.bin
In the whisper.cpp stderr stream you'll see this block repeat once per audio file:
ggml_metal_free: deallocating
whisper_init_from_file_with_params_no_state: loading model from '…/ggml-medium.bin'
…
whisper_model_load: Metal total size = 1533.14 MB
ggml_metal_init: picking default device: Apple M2
…
whisper_init_state: kv self size = 50.33 MB
whisper_init_state: kv cross size = 150.99 MB
…
Observed against Ontmoetingen in het donker (30 sections, 30 distinct MP3 files, medium model, M2 + Metal): the model-load block recurs after each file's transcription completes.
Cause
crates/dpub-whisper/src/lib.rs:79 — transcribe() calls WhisperContext::new_with_params(...) every time, then drops the context at function exit.
crates/dpub-convert/src/lib.rs:624–631 — inject_transcripts() calls dpub_whisper::transcribe(...) inside a per-file for loop with no shared context.
Each load:
- Reads the full GGML file from disk (~1.5 GB for medium, ~3 GB for large-v3).
- Allocates Metal buffers and copies the weights onto the GPU.
- Initialises BLAS, kv-self, kv-cross, conv/encode/cross/decode compute buffers.
On Apple Silicon + Metal this is ~5–10 s per file. For a 30-section book that's ~3–5 minutes of avoidable wallclock per conversion, on top of (and serialised with) the actual transcription work. It scales linearly with section count, so larger books pay more.
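The estimate above can be checked with a few lines of arithmetic. This is only a sketch: `avoidable_load_time` is a hypothetical helper, the 5–10 s per-load figure comes from the observation above, and it assumes a shared context still pays for exactly one load, so the avoidable part is every load after the first.

```rust
use std::time::Duration;

/// Wallclock spent on model loads that a shared context would avoid.
/// Assumption: reuse still costs one load, so only `files - 1` loads are avoidable.
fn avoidable_load_time(files: u32, per_load: Duration) -> Duration {
    // `saturating_sub` keeps the zero-file case at zero instead of underflowing.
    per_load * files.saturating_sub(1)
}
```

For 30 files this gives 29 avoidable loads, i.e. 145 s at 5 s per load and 290 s at 10 s per load, which is in line with the ~3–5 minute figure.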
Expected
Construct the WhisperContext once per inject_transcripts() call (or once per CLI invocation), and reuse it across all files. Creating a per-file state with ctx.create_state() is cheap (just the kv/compute buffers, ~330 MB) and gives each file a fresh decoder state, which is the right granularity.
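The difference between the current and the expected behaviour can be illustrated without the real model. The sketch below uses hypothetical stand-ins (`FakeContext`, `FakeState`, and a caller-supplied load counter); none of these are dpub or whisper-rs types, and only the shape of the loop is meant to carry over.

```rust
use std::cell::Cell;

/// Stand-in for the expensive model context (hypothetical, not the whisper-rs type).
struct FakeContext;

impl FakeContext {
    /// Pretends to load the model and records the load in `loads`.
    fn load(loads: &Cell<usize>) -> Self {
        loads.set(loads.get() + 1);
        FakeContext
    }

    /// Cheap per-file decoder state derived from the shared context.
    fn create_state(&self) -> FakeState {
        FakeState::default()
    }
}

/// Stand-in for the per-file state; it only remembers what it was asked to process.
#[derive(Default)]
struct FakeState {
    transcribed: Vec<String>,
}

impl FakeState {
    fn run(&mut self, file: &str) {
        self.transcribed.push(file.to_owned());
    }
}

/// Shape of the current behaviour: one model load per file.
fn transcribe_reloading(files: &[&str], loads: &Cell<usize>) -> usize {
    let mut done = 0;
    for &file in files {
        let ctx = FakeContext::load(loads); // reloaded on every iteration
        let mut state = ctx.create_state();
        state.run(file);
        done += state.transcribed.len();
    }
    done
}

/// Shape of the expected behaviour: one load, then a fresh state per file.
fn transcribe_shared(files: &[&str], loads: &Cell<usize>) -> usize {
    let ctx = FakeContext::load(loads); // hoisted out of the loop
    let mut done = 0;
    for &file in files {
        let mut state = ctx.create_state();
        state.run(file);
        done += state.transcribed.len();
    }
    done
}
```

Both functions process every file, but only the first one increments the load counter once per file; the second increments it once per call regardless of how many files are passed.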
Suggested fix
Two reasonable shapes:
- Stateful API in dpub-whisper: the cleanest option. Introduce a Transcriber that owns the WhisperContext, with a transcribe(&self, audio_path) method. The free function stays as a one-shot convenience for the smoke test.
    pub struct Transcriber {
        ctx: WhisperContext,
        language: String,
    }
    impl Transcriber {
        pub fn new(opts: &TranscribeOptions) -> Result<Self> { … }
        pub fn transcribe(&self, audio_path: &Path) -> Result<Vec<Segment>> { … }
    }
  Then inject_transcripts constructs one Transcriber and reuses it across the loop.
- Cache in dpub-convert: leave dpub-whisper as-is and hoist context construction into inject_transcripts. Less idiomatic (the context is a dpub-whisper concern leaking upward), but a smaller patch.
Either way, the behaviour-preserving correctness check is the existing smoke test in crates/dpub-whisper/tests/smoke.rs plus the DPUB_TEST_BOOK end-to-end run.
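To make the first shape concrete, here is a compilable sketch of how the Transcriber API and the one-shot free function could fit together. Everything in it is an assumption for illustration: `StubContext` stands in for the real WhisperContext, the field layout of `TranscribeOptions` and `Segment` is invented, errors are plain `String`s rather than whatever error type the crate uses, and no inference is performed.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the loaded model; the real field would be a WhisperContext.
struct StubContext {
    model: PathBuf,
}

/// Assumed shapes; the real dpub-whisper types may differ.
pub struct Segment {
    pub text: String,
}

pub struct TranscribeOptions {
    pub model: PathBuf,
    pub language: String,
}

pub struct Transcriber {
    ctx: StubContext,
    language: String,
}

impl Transcriber {
    /// The expensive step (model load) would happen once, here.
    pub fn new(opts: &TranscribeOptions) -> Result<Self, String> {
        if opts.model.as_os_str().is_empty() {
            return Err("no model path given".to_string());
        }
        Ok(Self {
            ctx: StubContext { model: opts.model.clone() },
            language: opts.language.clone(),
        })
    }

    /// A real implementation would create a fresh per-file state from `self.ctx`
    /// and run inference; this stub only echoes its inputs.
    pub fn transcribe(&self, audio_path: &Path) -> Result<Vec<Segment>, String> {
        Ok(vec![Segment {
            text: format!(
                "[{}] {} via {}",
                self.language,
                audio_path.display(),
                self.ctx.model.display()
            ),
        }])
    }
}

/// One-shot convenience wrapper: builds a Transcriber and uses it for a single file.
pub fn transcribe(audio_path: &Path, opts: &TranscribeOptions) -> Result<Vec<Segment>, String> {
    Transcriber::new(opts)?.transcribe(audio_path)
}
```

With this layout the loop in inject_transcripts would call `Transcriber::new` once before iterating and `transcribe` once per file, while the smoke test can keep calling the free function unchanged.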
Impact
- Wallclock: a few minutes per conversion today; more on books with many short sections.
- Disk: re-reads the 1.5 GB / 3 GB model file from disk once per audio file (mostly absorbed by the OS page cache after the first read, but still avoidable).
- Metal: repeated allocation/deallocation churn of GPU buffers; not a leak, but stresses the unified-memory pool.
Encountered on a real --transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32 run against the 11 h 45 m reference book on M2 + Metal.