feat(diarizer): add opt-in embedding skip strategy for offline pipeline #480
Add EmbeddingSkipStrategy to OfflineDiarizerConfig that skips redundant speaker embedding model calls when consecutive segmentation windows have highly similar speaker masks. At the default config (stepRatio=0.20) this has minimal effect. At higher-overlap configs (e.g., stepRatio=0.15) it provides 1.4-2.3x embedding speedup with zero quality loss.
🟡 Missing validation for EmbeddingSkipStrategy threshold in validate()
The validate() method at Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerTypes.swift:299 validates every other config parameter (clustering threshold, step ratio, batch size, onset/offset thresholds, etc.) but does not validate the new embedding.skipStrategy. A .maskSimilarity(threshold:) with a NaN, negative, or > 1.0 value will pass validation uncaught. This is called by the pipeline at Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerManager.swift:120 before processing begins. A NaN threshold would cause maskCosineSimilarity(...) >= threshold to always evaluate false (harmless but surprising), while a negative threshold would cause every mask comparison to hit the cache (skipping nearly all embeddings, severely degrading diarization quality).
(Refers to lines 388-389)
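One way the missing check could look is sketched below. The enum is redeclared so the snippet compiles on its own, and `validateSkipStrategy` and `SkipStrategyValidationError` are hypothetical names; the real `validate()` signature and error type are not shown in this thread, so treat this as an illustration of the bounds check rather than the project's implementation.

```swift
// Stand-in for the enum added in this PR, redeclared so the sketch is self-contained.
enum EmbeddingSkipStrategy {
    case none
    case maskSimilarity(threshold: Float)
}

// Hypothetical error type; the project's real validation error may differ.
enum SkipStrategyValidationError: Error, Equatable {
    case invalidThreshold
}

// Rejects NaN, infinite, negative, and > 1.0 thresholds before processing begins.
func validateSkipStrategy(_ strategy: EmbeddingSkipStrategy) throws {
    switch strategy {
    case .none:
        return
    case .maskSimilarity(let threshold):
        // NaN fails every ordered comparison, so finiteness is checked explicitly.
        guard threshold.isFinite, (0.0...1.0).contains(threshold) else {
            throw SkipStrategyValidationError.invalidThreshold
        }
    }
}
```

The closed range [0, 1] is an assumption taken from the values the comment calls out; a stricter lower bound may be preferable, since very low thresholds would also skip most embeddings.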
    public enum EmbeddingSkipStrategy: Sendable {
        /// No skipping — extract every embedding (default).
        case none
        /// Skip if the speaker mask has cosine similarity ≥ threshold compared to the mask
        /// that produced the currently cached embedding for this speaker. Prevents drift by
        /// always comparing against the mask that generated the cached embedding, not a
        /// rolling previous mask.
        ///
        /// Recommended threshold: 0.95 (≤1pp DER cost across VoxConverse/SCOTUS/Earnings-21).
        case maskSimilarity(threshold: Float)
    }
🔴 No unit tests added for new EmbeddingSkipStrategy feature (AGENTS.md rule violation)
AGENTS.md mandates "Add unit tests when writing new code." This PR introduces a new public enum EmbeddingSkipStrategy, a new config field Embedding.skipStrategy, a new convenience accessor embeddingSkipStrategy, and non-trivial caching logic with maskCosineSimilarity — but no test files are included in the change. There is an existing Tests/FluidAudioTests/Diarizer/Offline/OfflineConfigTests.swift that tests other config parameters, making this a natural place to add coverage for the new strategy (e.g., config round-trip, validation of threshold bounds, default value).
Why is this change needed?
This PR adds an opt-in EmbeddingSkipStrategy to the offline diarization pipeline. When consecutive segmentation windows produce highly similar speaker masks, the embedding model call is skipped and the previously computed embedding is reused.
At the current default config (stepRatio=0.20), this has minimal effect: windows don't overlap enough to produce significant redundancy. The feature becomes valuable at higher-overlap configurations (e.g., stepRatio=0.15), where it recovers the extra embedding cost with zero quality loss.
What changed
- EmbeddingSkipStrategy enum on OfflineDiarizerConfig.Embedding (.none default, .maskSimilarity(threshold:))
- Convenience accessor embeddingSkipStrategy on OfflineDiarizerConfig
- skipStrategy parameter added to the flat initializer with .none default (backward compatible)
- Caching logic in OfflineEmbeddingExtractor, with cache clearing between FBANK batches
- maskCosineSimilarity helper using existing VDSPOperations.dotProduct
Design decisions
Cache-pinned comparison, not rolling: The similarity check compares against the mask that produced the cached embedding, not the most recent mask. This prevents drift accumulation: if masks M1→M2→M3 each differ from the previous one by 5%, M3 could differ from M1 by roughly 10%, yet a rolling comparison would still pass at every step.
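The difference between the two comparison modes can be illustrated with a toy simulation. This is a sketch only: the plain-loop `maskCosineSimilarity` below stands in for the PR's helper (which the description says uses `VDSPOperations.dotProduct`), and `extractionCount` and `slidingMasks` are hypothetical names introduced for the illustration.

```swift
// Cosine similarity between two equal-length masks.
// Returns 0 for empty, mismatched, or all-zero input.
func maskCosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Counts how many masks would trigger a real embedding call.
// pinned == true compares against the mask that produced the cached embedding;
// pinned == false compares against the immediately preceding mask (rolling).
func extractionCount(masks: [[Float]], threshold: Float, pinned: Bool) -> Int {
    var reference: [Float]? = nil
    var extractions = 0
    for mask in masks {
        if let ref = reference, maskCosineSimilarity(mask, ref) >= threshold {
            // Skip: the cached embedding is reused. Rolling mode still advances the reference.
            if !pinned { reference = mask }
            continue
        }
        extractions += 1
        reference = mask
    }
    return extractions
}

// Toy data: an active region of `width` frames that slides by one frame per window.
func slidingMasks(count: Int, width: Int = 50, length: Int = 64) -> [[Float]] {
    return (0..<count).map { k -> [Float] in
        (0..<length).map { i -> Float in (i >= k && i < k + width) ? 1 : 0 }
    }
}
```

With seven such masks and a threshold of 0.95, consecutive masks have similarity 0.98, so the rolling comparison extracts only once, while the pinned comparison re-extracts at windows 3 and 6, where similarity to the cached mask has fallen to 0.94.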
Cache cleared between FBANK batches: Speaker indices are local to each powerset chunk (0, 1, 2), not global IDs. Within a batch, consecutive overlapping windows share audio, so the ordering is stable. Across batch boundaries, speaker assignments may change, so a cached embedding could be matched to the wrong speaker; clearing the cache at each boundary avoids this.
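A minimal sketch of a per-batch cache keyed by local speaker index follows. `SpeakerEmbeddingCache` and its methods are hypothetical names, not the types used in `OfflineEmbeddingExtractor`; the point is only that entries are keyed by the local index and dropped at each batch boundary.

```swift
// Hypothetical cache keyed by the local speaker index (0, 1, 2) within a batch.
struct SpeakerEmbeddingCache {
    private var entries: [Int: (mask: [Float], embedding: [Float])] = [:]

    init() {}

    // Returns the cached embedding if the new mask is similar enough to the mask
    // that produced it; returns nil when a fresh extraction is needed.
    func lookup(speaker: Int, mask: [Float], threshold: Float) -> [Float]? {
        guard let entry = entries[speaker],
              SpeakerEmbeddingCache.cosine(mask, entry.mask) >= threshold
        else { return nil }
        return entry.embedding
    }

    // Records a freshly extracted embedding together with the mask that produced it.
    mutating func store(speaker: Int, mask: [Float], embedding: [Float]) {
        entries[speaker] = (mask: mask, embedding: embedding)
    }

    // Called at each batch boundary, because local speaker indices may be reassigned.
    mutating func reset() {
        entries.removeAll(keepingCapacity: true)
    }

    // Plain-loop cosine similarity; returns 0 for empty, mismatched, or all-zero input.
    private static func cosine(_ a: [Float], _ b: [Float]) -> Float {
        guard a.count == b.count, !a.isEmpty else { return 0 }
        var dot: Float = 0, normA: Float = 0, normB: Float = 0
        for i in 0..<a.count {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        let denom = (normA * normB).squareRoot()
        return denom > 0 ? dot / denom : 0
    }
}
```

In this sketch a caller would invoke `reset()` before processing each new batch, then call `lookup` per speaker and fall back to the embedding model (followed by `store`) whenever it returns nil.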
Recommended threshold: 0.95 based on cross-corpus benchmarking (VoxConverse, SCOTUS oral arguments, Earnings-21 calls).
Benchmarks
All benchmarks on Apple M1 Max, macOS 26.5, 4 files across 3 corpora.
At default config (stepRatio=0.20, excludeOverlap=true)
Quality: identical SAA/DER on all files. No effect at default overlap.
At higher-overlap config (stepRatio=0.15, excludeOverlap=false)
Embedding model time only:
Quality (DER scored with pyannote.metrics, collar=0.25s):
Zero quality loss across all files. Skip rate scales with audio stability — long monologues (SCOTUS) skip 57%, frequent speaker changes (Earnings) skip 12%.