On-disk Whisper transcription cache by roelvangils · Pull Request #34 · 11ways/dpub

roelvangils · 2026-05-07T20:24:21Z

Summary

Repeat runs of `dpub convert --transcribe` against the same audio + model + language combination now skip Whisper entirely.
One JSON file per (audio, model, language) tuple in `~/.cache/dpub/transcripts/` keyed by SHA-256 of the inputs.
Modifying any input invalidates the entry naturally; failures are non-fatal (corrupt cache → re-transcribe).
`DPUB_NO_TRANSCRIPT_CACHE=1` bypasses both reads and writes for debugging.
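
The lookup behaviour described above (hit, miss, corrupt-entry fallback, bypass) could be sketched roughly as follows. This is a std-only illustration, not the code from this PR: `entry_path`, `parse_entry`, `load_or_transcribe`, and `bypass_requested` are hypothetical names, the cache directory and the hex key are passed in rather than derived, `parse_entry` is a crude stand-in for real JSON deserialization, and treating only the value `1` as "bypass" is an assumption.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Environment variable named in the PR description.
const BYPASS_VAR: &str = "DPUB_NO_TRANSCRIPT_CACHE";

/// Hypothetical helper: one file per cache key inside `cache_dir`.
fn entry_path(cache_dir: &Path, key_hex: &str) -> PathBuf {
    cache_dir.join(format!("{key_hex}.json"))
}

/// Stand-in for JSON deserialization: accepts anything shaped like `{...}`.
fn parse_entry(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        Some(trimmed.to_string())
    } else {
        None
    }
}

/// Assumption: only the exact value "1" disables the cache.
fn bypass_requested() -> bool {
    env::var(BYPASS_VAR).map(|v| v == "1").unwrap_or(false)
}

/// Returns the transcript payload and whether it came from the cache.
/// Cache problems only produce warnings; errors from `transcribe` propagate.
fn load_or_transcribe<F>(
    cache_dir: &Path,
    key_hex: &str,
    bypass: bool,
    transcribe: F,
) -> io::Result<(String, bool)>
where
    F: FnOnce() -> io::Result<String>,
{
    let path = entry_path(cache_dir, key_hex);

    if !bypass {
        match fs::read_to_string(&path) {
            Ok(raw) => match parse_entry(&raw) {
                Some(payload) => return Ok((payload, true)),
                None => eprintln!("warning: corrupt cache entry {}", path.display()),
            },
            // A missing file is an ordinary cache miss, not worth a warning.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => eprintln!("warning: cache read failed: {e}"),
        }
    }

    let fresh = transcribe()?;

    if !bypass {
        let written = fs::create_dir_all(cache_dir).and_then(|_| fs::write(&path, &fresh));
        if let Err(e) = written {
            eprintln!("warning: cache write failed: {e}");
        }
    }
    Ok((fresh, false))
}
```

A caller would pass `bypass_requested()` as the `bypass` argument and a closure that invokes the real transcriber. Keeping the bypass flag as a parameter rather than reading the environment inside the function makes the function straightforward to unit-test.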

Measured speedup (cavia book, 4h 22m audio, 109 sections)

| Run | Wall time |
| --- | --- |
| Cold (Whisper on every file) | 722 s |
| Warm (109/109 cache hits) | 21 s |
| Speedup | 34× |

EPUBCheck stays clean (0/0/0).

Implementation

`Segment` and `Word` in `dpub-whisper` gain `serde::Deserialize` alongside the existing `Serialize` (round-trip prerequisite).
New `transcript_cache` module in `dpub-convert` (~280 lines, 8 unit tests). `CachedTranscriber` wraps `Transcriber`, hashes the model once at construction, hashes audio per call, stores a JSON envelope with diagnostic metadata + segment payload.
`inject_transcripts` swaps in `CachedTranscriber`; the existing in-memory `HashMap` cache stays to avoid re-hashing audio across sections that share a file.
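
The wrapper shape described above (hash the model once at construction, hash the audio per call, key on model + audio + language) might look roughly like this. It is a sketch under stated assumptions, not the PR's `transcript_cache` module: the `Transcribe` trait and `CountingBackend` are hypothetical, segments are plain `String`s rather than `Segment` values, an in-memory `HashMap` stands in for the on-disk JSON files, and a 64-bit FNV-1a digest is swapped in for SHA-256 purely to keep the example free of external crates (FNV-1a is not collision-resistant and would not be suitable for a real cache key).

```rust
use std::collections::HashMap;

/// Stand-in digest (64-bit FNV-1a). The PR uses SHA-256; this is only
/// here so the sketch compiles with the standard library alone.
fn digest(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical minimal interface for the wrapped transcriber.
trait Transcribe {
    fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String>;
}

struct CachedTranscriber<T: Transcribe> {
    inner: T,
    /// Computed once in `new`, reused for every key.
    model_digest: u64,
    /// In-memory stand-in for the one-file-per-key store.
    entries: HashMap<String, Vec<String>>,
    hits: usize,
}

impl<T: Transcribe> CachedTranscriber<T> {
    fn new(inner: T, model_bytes: &[u8]) -> Self {
        Self {
            inner,
            model_digest: digest(model_bytes),
            entries: HashMap::new(),
            hits: 0,
        }
    }

    /// The key covers all three inputs, so changing any one of them
    /// yields a different entry.
    fn key(&self, audio: &[u8], language: &str) -> String {
        format!("{:016x}-{:016x}-{}", self.model_digest, digest(audio), language)
    }

    fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String> {
        let key = self.key(audio, language);
        if let Some(cached) = self.entries.get(&key) {
            let segments = cached.clone();
            self.hits += 1;
            return segments;
        }
        // Cache miss: run the wrapped transcriber and remember the result.
        let segments = self.inner.transcribe(audio, language);
        self.entries.insert(key, segments.clone());
        segments
    }
}

/// Fake backend that counts how often it is actually invoked.
struct CountingBackend {
    calls: usize,
}

impl Transcribe for CountingBackend {
    fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String> {
        self.calls += 1;
        vec![format!("{} bytes, lang {}", audio.len(), language)]
    }
}
```

With this shape, transcribing the same bytes twice reaches the backend once, while flipping a single audio byte or switching the language produces a new key and therefore a miss for that input only.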

Test plan

`cargo test --workspace` — all green, 8 new unit tests
Cold run on cavia book — 722 s, populates 109 cache files (2.6 MB)
Warm run — 21 s, 109/109 cache hits, EPUBCheck clean
(Manual) Modify a single byte of one audio file, re-run: expect cache miss for that file only

🤖 Generated with Claude Code

Repeat runs of `dpub convert --transcribe` against the same audio + model + language combination skip Whisper entirely.

The cache lives in `~/.cache/dpub/transcripts/` (Unix) / `%LOCALAPPDATA%\dpub\transcripts\` (Windows); one JSON file per (audio, model, language) tuple keyed by SHA-256 of the inputs. Modifying any input invalidates the entry naturally, so no manual cache management is needed. Failures are non-fatal: corrupt cache files, IO errors, and disk-full conditions all log a warning and fall back to a fresh transcription. Set `DPUB_NO_TRANSCRIPT_CACHE=1` to bypass the cache entirely (debugging).

End-to-end measured on the 4h22m cavia book:
- cold run: 722 s (Whisper on 109 audio files)
- warm run: 21 s (109/109 cache hits)
- 34× speedup

Most of the warm-run time is Opus re-encoding + ZIP write; the cache lookup is dominated by audio file hashing (~ms per MB).

Implementation:
- `Segment` and `Word` in dpub-whisper now derive `serde::Deserialize` alongside the existing `Serialize`. Round-trip prerequisite.
- New `transcript_cache` module in dpub-convert (~280 lines, 8 unit tests). `CachedTranscriber` wraps `dpub_whisper::Transcriber`, hashes the model once at construction, hashes audio per call, and stores a JSON envelope with diagnostic metadata + the segment payload.
- `inject_transcripts` swaps in `CachedTranscriber`; the existing in-memory `HashMap<basename, Vec<Segment>>` cache stays so we don't re-hash audio across sections that share a file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roelvangils merged commit 401332d into main May 7, 2026
5 of 6 checks passed

roelvangils deleted the transcript-cache branch May 7, 2026 20:33

roelvangils merged 1 commit into main from transcript-cache
