dpub-whisper: transcribe() reloads the GGML model on every call (~3–5 min wasted per book) #10

@roelvangils

Summary

dpub_whisper::transcribe() constructs a fresh WhisperContext (and re-loads the GGML model into GPU/CPU buffers) on every invocation. dpub_convert::inject_transcripts() calls it once per audio file in a sequential loop, so a typical audiobook conversion reloads the same 1.5 GB model 30+ times back-to-back.

Repro

Convert any multi-section DAISY book with --transcribe:

dpub convert /path/to/ncc.html -o out.epub \
    --audio opus --bitrate 32 \
    --transcribe nl --whisper-model ~/models/ggml-medium.bin

In the whisper.cpp stderr stream you'll see this block repeat once per audio file:

ggml_metal_free: deallocating
whisper_init_from_file_with_params_no_state: loading model from '…/ggml-medium.bin'
…
whisper_model_load: Metal total size = 1533.14 MB
ggml_metal_init: picking default device: Apple M2
…
whisper_init_state: kv self size  = 50.33 MB
whisper_init_state: kv cross size = 150.99 MB
…

Observed against Ontmoetingen in het donker (30 sections, 30 distinct MP3 files, medium model, M2 + Metal): the model-load block recurs after each file's transcription completes.

Cause

  • crates/dpub-whisper/src/lib.rs:79: transcribe() calls WhisperContext::new_with_params(...) every time, then drops the context at function exit.
  • crates/dpub-convert/src/lib.rs:624–631: inject_transcripts() calls dpub_whisper::transcribe(...) inside a per-file for loop with no shared context.

Each load:

  1. Reads the full GGML file from disk (~1.5 GB for medium, ~3 GB for large-v3).
  2. Allocates Metal buffers and copies the weights onto the GPU.
  3. Initialises BLAS, kv-self, kv-cross, conv/encode/cross/decode compute buffers.

On Apple Silicon + Metal this is ~5–10 s per file. For a 30-section book that's ~3–5 minutes of avoidable wallclock per conversion, on top of (and serialised with) the actual transcription work. It scales linearly with section count, so larger books pay more.
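The estimate above is plain multiplication of file count by per-load cost. A minimal sketch of that arithmetic follows; the function name `reload_overhead` is hypothetical and not part of dpub, and it assumes one model load per audio file at a fixed cost.

```rust
use std::time::Duration;

/// Wall-clock time spent only on reloading the model, assuming exactly one
/// load per audio file and a constant per-load cost (both are assumptions
/// taken from the figures quoted in this issue).
pub fn reload_overhead(audio_files: u32, per_load: Duration) -> Duration {
    per_load * audio_files
}

/// Same estimate expressed in minutes, for easier comparison.
pub fn reload_overhead_minutes(audio_files: u32, per_load: Duration) -> f64 {
    reload_overhead(audio_files, per_load).as_secs_f64() / 60.0
}
```

For 30 files at 5 s and 10 s per load this gives 150 s and 300 s, roughly 2.5 to 5 minutes, which is in line with the ~3–5 minute figure observed.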

Expected

Construct the WhisperContext once per inject_transcripts() call (or once per CLI invocation) and reuse it across all files. Creating per-file state with ctx.create_state() is cheap (it allocates only the kv/compute buffers, ~330 MB) and gives each file a fresh decoder state, which is the right granularity.
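The intended lifetime split can be illustrated without whisper.cpp at all. The sketch below uses mock types (`ModelLoader`, `MockContext`, `MockState` are all hypothetical stand-ins, not the real whisper bindings) and counts model loads so the difference is observable: one load, one state per file.

```rust
/// Hypothetical stand-in that counts how often the "model" is loaded.
#[derive(Default)]
pub struct ModelLoader {
    pub loads: usize,
}

/// Mock of the expensive, model-owning context.
pub struct MockContext;

/// Mock of the cheap per-file decoder state.
pub struct MockState {
    pub segments: Vec<String>,
}

impl ModelLoader {
    /// Pretends to read the model file; only the counter is real.
    pub fn load(&mut self, _model_path: &str) -> MockContext {
        self.loads += 1;
        MockContext
    }
}

impl MockContext {
    /// Fresh state per file, no model reload involved.
    pub fn create_state(&self) -> MockState {
        MockState { segments: Vec::new() }
    }
}

impl MockState {
    /// Pretends to transcribe one audio file into this state.
    pub fn run(&mut self, audio: &str) {
        self.segments.push(format!("transcript of {}", audio));
    }
}

/// Loads the context once, then creates one state per audio file.
pub fn transcribe_all(
    loader: &mut ModelLoader,
    model_path: &str,
    audio_files: &[&str],
) -> Vec<Vec<String>> {
    let ctx = loader.load(model_path);
    audio_files
        .iter()
        .map(|audio| {
            let mut state = ctx.create_state();
            state.run(audio);
            state.segments
        })
        .collect()
}
```

With this shape the load counter stays at 1 regardless of how many files are passed in, whereas the current per-call construction would increment it once per file.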

Suggested fix

Two reasonable shapes:

  1. Stateful API in dpub-whisper — the cleanest. Introduce a Transcriber that owns the WhisperContext, with a transcribe(&self, audio_path) method. The free function stays as a one-shot convenience for the smoke test.

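    pub struct Transcriber {
        ctx: WhisperContext, // loaded once, reused for every file
        language: String,
    }

    impl Transcriber {
        /// Loads the model once and keeps the context alive.
        pub fn new(opts: &TranscribeOptions) -> Result<Self> {
            todo!("build the WhisperContext from opts")
        }

        /// Creates a fresh per-file state from the shared context.
        pub fn transcribe(&self, audio_path: &Path) -> Result<Vec<Segment>> {
            todo!("ctx.create_state(), then run the decoder on audio_path")
        }
    }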

    Then inject_transcripts constructs one Transcriber and reuses it across the loop.

  2. Cache in dpub-convert — leave dpub-whisper as-is and hoist context construction into inject_transcripts. Less idiomatic (the context is a dpub-whisper concern leaking upward), but a smaller patch.
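To make shape 1 concrete, here is a compilable sketch of how the pieces could fit together: a `Transcriber` that is built once, a free function kept as a one-shot wrapper, and a loop that reuses the same instance. Everything here is an assumption for illustration: `StubContext` replaces the real whisper context, the field layout of `TranscribeOptions` and `Segment` is guessed, and `io::Result` stands in for whatever `Result` alias the crate actually uses.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Assumed option fields; the real TranscribeOptions may differ.
pub struct TranscribeOptions {
    pub model_path: PathBuf,
    pub language: String,
}

/// Assumed segment shape; the real Segment likely carries timestamps too.
#[derive(Debug, PartialEq)]
pub struct Segment {
    pub text: String,
}

/// Stub standing in for the model-owning whisper context.
struct StubContext;

pub struct Transcriber {
    ctx: StubContext,
    language: String,
}

impl Transcriber {
    /// The real constructor would load `opts.model_path` here, exactly once.
    pub fn new(opts: &TranscribeOptions) -> io::Result<Self> {
        let _model_path = &opts.model_path;
        Ok(Self { ctx: StubContext, language: opts.language.clone() })
    }

    /// The real method would create a per-file state from `self.ctx`.
    pub fn transcribe(&self, audio_path: &Path) -> io::Result<Vec<Segment>> {
        let _ctx = &self.ctx;
        let text = format!("[{}] {}", self.language, audio_path.display());
        Ok(vec![Segment { text }])
    }
}

/// One-shot convenience: builds a Transcriber for a single file.
pub fn transcribe(opts: &TranscribeOptions, audio_path: &Path) -> io::Result<Vec<Segment>> {
    Transcriber::new(opts)?.transcribe(audio_path)
}

/// Batch caller: one Transcriber, reused for every audio file.
pub fn inject_transcripts(
    opts: &TranscribeOptions,
    audio_files: &[PathBuf],
) -> io::Result<Vec<Vec<Segment>>> {
    let transcriber = Transcriber::new(opts)?;
    audio_files.iter().map(|p| transcriber.transcribe(p)).collect()
}
```

Keeping the free function as a thin wrapper means existing one-shot callers such as the smoke test would not need to change, while the batch path pays for model loading only once.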

Either way, the behaviour-preserving correctness check is the existing smoke test in crates/dpub-whisper/tests/smoke.rs plus the DPUB_TEST_BOOK end-to-end run.

Impact

  • Wallclock: a few minutes per conversion today; more on books with many short sections.
  • Disk: re-reads the 1.5 GB / 3 GB model file from disk once per audio file (mostly absorbed by the OS page cache after the first read, but still avoidable).
  • Metal: repeated allocation/deallocation churn of GPU buffers; not a leak, but stresses the unified-memory pool.

Encountered on a real --transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32 run against the 11 h 45 m reference book on M2 + Metal.
