dpub-whisper: `transcribe()` reloads the GGML model on every call (~3–5 min wasted per book)

## Summary

`dpub_whisper::transcribe()` constructs a fresh `WhisperContext` (and re-loads the GGML model into GPU/CPU buffers) on every invocation. `dpub_convert::inject_transcripts()` calls it once per audio file in a sequential loop, so a typical audiobook conversion reloads the same 1.5 GB model 30+ times back-to-back.

## Repro

Convert any multi-section DAISY book with `--transcribe`:

```sh
dpub convert /path/to/ncc.html -o out.epub \
    --audio opus --bitrate 32 \
    --transcribe nl --whisper-model ~/models/ggml-medium.bin
```

In the whisper.cpp stderr stream you'll see this block repeat once per audio file:

```
ggml_metal_free: deallocating
whisper_init_from_file_with_params_no_state: loading model from '…/ggml-medium.bin'
…
whisper_model_load: Metal total size = 1533.14 MB
ggml_metal_init: picking default device: Apple M2
…
whisper_init_state: kv self size  = 50.33 MB
whisper_init_state: kv cross size = 150.99 MB
…
```

Observed against `Ontmoetingen in het donker` (30 sections, 30 distinct MP3 files, medium model, M2 + Metal): the model-load block recurs after each file's transcription completes.

## Cause

- `crates/dpub-whisper/src/lib.rs:79` — `transcribe()` calls `WhisperContext::new_with_params(...)` every time, then drops the context at function exit.
- `crates/dpub-convert/src/lib.rs:624–631` — `inject_transcripts()` calls `dpub_whisper::transcribe(...)` inside a per-file `for` loop with no shared context.

Each load:

1. Reads the full GGML file from disk (~1.5 GB for medium, ~3 GB for large-v3).
2. Allocates Metal buffers and copies the weights onto the GPU.
3. Initialises BLAS, kv-self, kv-cross, conv/encode/cross/decode compute buffers.

On Apple Silicon + Metal this is ~5–10 s per file. For a 30-section book that's **~3–5 minutes of avoidable wallclock per conversion**, on top of (and serialised with) the actual transcription work. It scales linearly with section count, so larger books pay more.
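As a rough check on that estimate, the avoidable time is every model load after the first. The sketch below only redoes the arithmetic with the figures quoted in this issue (30 files, ~5–10 s per load); the function name is made up for illustration and nothing here measures real load times.

```rust
use std::time::Duration;

/// Time spent on model loads beyond the first, i.e. what a shared context would save.
/// `per_load` is an assumed, constant cost per load; real loads will vary.
fn avoidable_reload_time(files: u32, per_load: Duration) -> Duration {
    per_load * files.saturating_sub(1)
}

fn main() {
    let files = 30;
    for secs in [5, 10] {
        let wasted = avoidable_reload_time(files, Duration::from_secs(secs));
        println!("{files} files at {secs} s per load: {} s avoidable", wasted.as_secs());
    }
}
```

With 29 redundant loads this lands at roughly 2.5 to 5 minutes, in line with the estimate above.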

## Expected

Construct the `WhisperContext` once per `inject_transcripts()` call (or once per CLI invocation), and reuse it across all files. Per-file `state = ctx.create_state()` is cheap (just the kv/compute buffers, ~330 MB) and gives a fresh decoder state per file, which is the right granularity.
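The difference between the two shapes can be sketched with stand-in types. `MockContext` and `MockState` below are hypothetical placeholders, not the real whisper bindings: "loading" only bumps a counter, so the example runs without a model file and just shows how many loads each loop shape performs.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for `WhisperContext`: "loading" only bumps a counter.
struct MockContext;

impl MockContext {
    fn load(loads: &Cell<usize>) -> Self {
        loads.set(loads.get() + 1); // the expensive step in the real code
        MockContext
    }

    /// Cheap per-file state, analogous to `ctx.create_state()`.
    fn create_state(&self) -> MockState {
        MockState
    }
}

/// Hypothetical stand-in for the per-file decoder state.
struct MockState;

impl MockState {
    fn run(&mut self, file: &str) -> String {
        format!("transcript of {file}")
    }
}

/// Current shape: a new context (and model load) for every file.
fn transcribe_reloading(files: &[&str], loads: &Cell<usize>) -> Vec<String> {
    files
        .iter()
        .map(|f| {
            let ctx = MockContext::load(loads);
            let mut state = ctx.create_state();
            state.run(f)
        })
        .collect()
}

/// Expected shape: one context for the whole loop, one fresh state per file.
fn transcribe_shared(files: &[&str], loads: &Cell<usize>) -> Vec<String> {
    let ctx = MockContext::load(loads);
    files
        .iter()
        .map(|f| {
            let mut state = ctx.create_state();
            state.run(f)
        })
        .collect()
}

fn main() {
    let files = ["01.mp3", "02.mp3", "03.mp3"];
    let (reloading, shared) = (Cell::new(0), Cell::new(0));
    let a = transcribe_reloading(&files, &reloading);
    let b = transcribe_shared(&files, &shared);
    assert_eq!(a, b); // same transcripts either way
    println!("reloading: {} loads, shared: {} loads", reloading.get(), shared.get());
}
```

Both functions return identical output; only the number of loads differs, which is the whole point of the change.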

## Suggested fix

Two reasonable shapes:

1. **Stateful API in `dpub-whisper`** — the cleanest. Introduce a `Transcriber` that owns the `WhisperContext`, with a `transcribe(&self, audio_path)` method. The free function stays as a one-shot convenience for the smoke test.

   ```rust
   pub struct Transcriber {
       ctx: WhisperContext,
       language: String,
   }

   impl Transcriber {
       pub fn new(opts: &TranscribeOptions) -> Result<Self> { … }
       pub fn transcribe(&self, audio_path: &Path) -> Result<Vec<Segment>> { … }
   }
   ```

   Then `inject_transcripts` constructs one `Transcriber` and reuses it across the loop.

2. **Cache in `dpub-convert`** — leave `dpub-whisper` as-is and hoist context construction into `inject_transcripts`. Less idiomatic (the context is a `dpub-whisper` concern leaking upward), but a smaller patch.

Either way, the behaviour-preserving correctness check is the existing smoke test in `crates/dpub-whisper/tests/smoke.rs` plus the `DPUB_TEST_BOOK` end-to-end run.
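For option 1, the relationship between `Transcriber` and the one-shot free function might look roughly like this. It is a sketch under assumptions: the fields of `TranscribeOptions` and `Segment`, the `String` error type, the free function's signature, and `MockContext` (standing in for the real `WhisperContext`) are all invented here so the example compiles on its own.

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-ins; the real crate has its own option, segment and error types.
pub struct TranscribeOptions {
    pub model_path: PathBuf,
    pub language: String,
}

#[derive(Debug, PartialEq)]
pub struct Segment {
    pub text: String,
}

type Result<T> = std::result::Result<T, String>;

/// Placeholder for the loaded model; the real field would be a `WhisperContext`.
struct MockContext {
    model: PathBuf,
}

pub struct Transcriber {
    ctx: MockContext,
    language: String,
}

impl Transcriber {
    /// Loads the model once; in the real code this is the expensive step.
    pub fn new(opts: &TranscribeOptions) -> Result<Self> {
        if opts.model_path.as_os_str().is_empty() {
            return Err("no model path given".to_string());
        }
        Ok(Self {
            ctx: MockContext { model: opts.model_path.clone() },
            language: opts.language.clone(),
        })
    }

    /// Transcribes one file against the already-loaded context.
    pub fn transcribe(&self, audio_path: &Path) -> Result<Vec<Segment>> {
        let text = format!(
            "[{}] {} via {}",
            self.language,
            audio_path.display(),
            self.ctx.model.display()
        );
        Ok(vec![Segment { text }])
    }
}

/// One-shot convenience (e.g. for the smoke test), built on `Transcriber`.
pub fn transcribe(opts: &TranscribeOptions, audio_path: &Path) -> Result<Vec<Segment>> {
    Transcriber::new(opts)?.transcribe(audio_path)
}

fn main() -> Result<()> {
    let opts = TranscribeOptions {
        model_path: PathBuf::from("ggml-medium.bin"),
        language: "nl".to_string(),
    };
    let transcriber = Transcriber::new(&opts)?; // one load for the whole book
    for file in ["01.mp3", "02.mp3"] {
        let segments = transcriber.transcribe(Path::new(file))?;
        println!("{file}: {} segment(s)", segments.len());
    }
    // The free function still works for single files.
    let one_shot = transcribe(&opts, Path::new("smoke.mp3"))?;
    println!("one-shot: {} segment(s)", one_shot.len());
    Ok(())
}
```

Keeping the free function as a thin wrapper means existing single-file callers are unaffected, while `inject_transcripts` can switch to the reusable `Transcriber`.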

## Impact

- Wallclock: a few minutes per conversion today; more on books with many short sections.
- Disk: re-reads the 1.5 GB / 3 GB model file from disk once per audio file (mostly absorbed by the OS page cache after the first read, but still avoidable).
- Metal: repeated allocation/deallocation churn of GPU buffers; not a leak, but stresses the unified-memory pool.

Encountered on a real `--transcribe nl --whisper-model ggml-medium.bin --audio opus --bitrate 32` run against the 11 h 45 m reference book on M2 + Metal.
