
On-disk Whisper transcription cache #34

Merged: roelvangils merged 1 commit into main from transcript-cache on May 7, 2026

@roelvangils
Member

Summary

  • Repeat runs of `dpub convert --transcribe` against the same audio + model + language combination now skip Whisper entirely.
  • One JSON file per (audio, model, language) tuple in `~/.cache/dpub/transcripts/` keyed by SHA-256 of the inputs.
  • Modifying any input invalidates the entry naturally; failures are non-fatal (corrupt cache → re-transcribe).
  • `DPUB_NO_TRANSCRIPT_CACHE=1` bypasses both reads and writes for debugging.
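
The keying scheme above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `cache_key` and `cache_file_name` are hypothetical names, and the standard library's `DefaultHasher` (64-bit, non-cryptographic) stands in for SHA-256 so the example needs no external crate. A real cache would use a SHA-256 implementation, as the PR does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive one key from the (audio, model, language) tuple.
/// Stand-in only: the PR uses SHA-256, this sketch uses DefaultHasher.
pub fn cache_key(audio: &[u8], model: &[u8], language: &str) -> String {
    let mut hasher = DefaultHasher::new();
    // Slice and str hashing both delimit their input, so ("ab", "c")
    // and ("a", "bc") do not feed the hasher the same byte stream.
    audio.hash(&mut hasher);
    model.hash(&mut hasher);
    language.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// One JSON file per key (hypothetical naming convention).
pub fn cache_file_name(key: &str) -> String {
    format!("{}.json", key)
}
```

Because every input feeds the key, changing a single byte of the audio or model, or switching the language, produces a different file name, so the stale entry is simply never looked up again.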

Measured speedup (cavia book, 4h 22m audio, 109 sections)

| Run | Wall time |
| --- | --- |
| Cold (Whisper on every file) | 722 s |
| Warm (109/109 cache hits) | 21 s |
| Speedup | 34× |

EPUBCheck stays clean (0/0/0).

Implementation

  • `Segment` and `Word` in `dpub-whisper` gain `serde::Deserialize` alongside the existing `Serialize` (round-trip prerequisite).
  • New `transcript_cache` module in `dpub-convert` (~280 lines, 8 unit tests). `CachedTranscriber` wraps `Transcriber`, hashes the model once at construction, hashes audio per call, stores a JSON envelope with diagnostic metadata + segment payload.
  • `inject_transcripts` swaps in `CachedTranscriber`; the existing in-memory `HashMap` cache stays to avoid re-hashing audio across sections that share a file.
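
A rough sketch of the wrapper pattern described above: hash the model once at construction, hash the audio on every call, and consult a store before delegating. Everything here is an assumption for illustration. The `Transcribe` trait, its methods, and `CountingTranscriber` are hypothetical and do not reflect the real `dpub_whisper::Transcriber` API; an in-memory map stands in for the on-disk JSON envelope; `DefaultHasher` stands in for SHA-256; plain strings stand in for `Segment`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the real transcriber interface.
pub trait Transcribe {
    fn model_bytes(&self) -> &[u8];
    fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String>;
}

// Stand-in digest; the PR uses SHA-256.
fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

pub struct CachedTranscriber<T: Transcribe> {
    inner: T,
    model_digest: u64, // computed once, reused for every lookup
    store: HashMap<(u64, u64, String), Vec<String>>, // stand-in for JSON files
    pub hits: usize,
    pub misses: usize,
}

impl<T: Transcribe> CachedTranscriber<T> {
    pub fn new(inner: T) -> Self {
        let model_digest = digest(inner.model_bytes());
        Self { inner, model_digest, store: HashMap::new(), hits: 0, misses: 0 }
    }

    pub fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String> {
        // Audio is hashed per call; the model digest is not recomputed.
        let key = (digest(audio), self.model_digest, language.to_string());
        if let Some(segments) = self.store.get(&key) {
            self.hits += 1;
            return segments.clone();
        }
        self.misses += 1;
        let segments = self.inner.transcribe(audio, language);
        self.store.insert(key, segments.clone());
        segments
    }
}

/// Test double that counts how often the expensive path runs.
pub struct CountingTranscriber {
    pub calls: usize,
}

impl Transcribe for CountingTranscriber {
    fn model_bytes(&self) -> &[u8] {
        b"fake-model"
    }
    fn transcribe(&mut self, audio: &[u8], language: &str) -> Vec<String> {
        self.calls += 1;
        vec![format!("{} bytes [{}]", audio.len(), language)]
    }
}
```

Hashing the model once matters because model files are typically far larger than a single audio section, so recomputing that digest per call would dominate the lookup cost.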

Test plan

  • `cargo test --workspace` — all green, 8 new unit tests
  • Cold run on cavia book — 722 s, populates 109 cache files (2.6 MB)
  • Warm run — 21 s, 109/109 cache hits, EPUBCheck clean
  • (Manual) Modify a single byte of one audio file, re-run: expect cache miss for that file only

🤖 Generated with Claude Code

Repeat runs of `dpub convert --transcribe` against the same audio +
model + language combination skip Whisper entirely. The cache lives
in `~/.cache/dpub/transcripts/` (Unix) / `%LOCALAPPDATA%\dpub\transcripts\`
(Windows); one JSON file per (audio, model, language) tuple keyed by
SHA-256 of the inputs. Modifying any input invalidates the entry
naturally — no manual cache management.
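
For illustration, the two cache locations named above could be resolved along these lines. This is a sketch under assumptions: `transcript_cache_dir` is a hypothetical helper, the environment values are passed in as parameters rather than read directly, and whether the real code honours overrides such as `XDG_CACHE_HOME` is not stated here.

```rust
use std::path::PathBuf;

/// Resolve the transcript cache directory from already-read
/// environment values. Returns None when the base location is unknown.
pub fn transcript_cache_dir(
    windows: bool,
    home: Option<&str>,           // e.g. the value of HOME
    local_app_data: Option<&str>, // e.g. the value of LOCALAPPDATA
) -> Option<PathBuf> {
    let base = if windows {
        PathBuf::from(local_app_data?)
    } else {
        PathBuf::from(home?).join(".cache")
    };
    Some(base.join("dpub").join("transcripts"))
}
```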

Failures are non-fatal: corrupt cache files, I/O errors, and a full
disk all log a warning and fall back to a fresh transcription. Set
`DPUB_NO_TRANSCRIPT_CACHE=1` to bypass the cache entirely (debugging).
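
The non-fatal behaviour described above might look roughly like this. It is a sketch, not the PR's code: `load_or_compute` and `bypass_requested` are hypothetical names, the `decode`/`encode` closures stand in for the serde JSON round trip, and treating only the literal value `1` as a bypass request is an assumption.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// True when the value of DPUB_NO_TRANSCRIPT_CACHE asks for a bypass.
/// Assumption: only the literal "1" counts.
pub fn bypass_requested(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Return the cached value if it can be read and decoded; otherwise
/// warn, compute a fresh value, and try to store it. No cache failure
/// is ever propagated to the caller.
pub fn load_or_compute<T>(
    path: &Path,
    bypass: bool,
    decode: impl Fn(&str) -> Option<T>,
    encode: impl Fn(&T) -> String,
    compute: impl FnOnce() -> T,
) -> T {
    if !bypass {
        match fs::read_to_string(path) {
            Ok(text) => match decode(&text) {
                Some(value) => return value, // cache hit
                None => eprintln!("warning: corrupt cache entry {}", path.display()),
            },
            // A missing file is an ordinary miss, not worth a warning.
            Err(err) if err.kind() == ErrorKind::NotFound => {}
            Err(err) => eprintln!("warning: cannot read {}: {}", path.display(), err),
        }
    }
    let value = compute();
    if !bypass {
        // Write failures (for example a full disk) only cost the next run a miss.
        if let Err(err) = fs::write(path, encode(&value)) {
            eprintln!("warning: cannot write {}: {}", path.display(), err);
        }
    }
    value
}
```

With `bypass` set, the function neither reads nor writes the file, which matches the "bypasses both reads and writes" behaviour stated for the environment variable.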

End-to-end measured on the 4h22m cavia book:

- cold run: 722 s (Whisper on 109 audio files)
- warm run: 21 s (109/109 cache hits)
- 34× speedup

Most of the warm-run time is Opus re-encoding + ZIP write; the cache
lookup is dominated by audio file hashing (on the order of milliseconds per MB).

Implementation:

- `Segment` and `Word` in dpub-whisper now derive `serde::Deserialize`
  alongside the existing `Serialize`. Round-trip prerequisite.
- New `transcript_cache` module in dpub-convert (~280 lines, 8 unit
  tests). `CachedTranscriber` wraps `dpub_whisper::Transcriber`,
  hashes the model once at construction, hashes audio per call, and
  stores a JSON envelope with diagnostic metadata + the segment payload.
- `inject_transcripts` swaps in `CachedTranscriber`; the existing
  in-memory `HashMap<basename, Vec<Segment>>` cache stays so we don't
  re-hash audio across sections that share a file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
roelvangils merged commit 401332d into main on May 7, 2026
5 of 6 checks passed
roelvangils deleted the transcript-cache branch on May 7, 2026 at 20:33