ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script #460 (28/03/2026)
…mingChunk AsrManager is the low-level inference engine and shouldn't own vocabulary boosting state or carry streaming semantics. This removes the duplicated vocabulary config/rescoring from AsrManager, leaving SlidingWindowAsrManager as the single owner. CTC head inference and cached logit storage remain in AsrManager since they need encoder_output MLMultiArray access. Closes #457
Claude finished @Alex-Wengg's task. PR Review: AsrManager vocabulary boosting refactor ✅
I've completed a comprehensive review of this PR focusing on code quality, architecture, and potential issues.
Review Summary
This is a well-executed refactoring that significantly improves the codebase architecture.
✅ Key Strengths
📊 Code Quality Rating: ★★★★★
🧪 Testing Status
Overall: APPROVE - This refactoring improves maintainability while preserving functionality and performance.
This is a well-executed refactoring that improves the codebase architecture. Here's my comprehensive review:
✅ Excellent Changes
1. Clear Separation of Concerns
- AsrManager: Now focused purely on low-level inference (appropriate)
- SlidingWindowAsrManager: Single owner of vocabulary boosting (appropriate)
- This follows the Single Responsibility Principle and eliminates code duplication
2. Method Rename is Semantically Correct
- `transcribeStreamingChunk` → `transcribeChunk` removes misleading "streaming" semantics
- AsrManager is the low-level inference engine, not a streaming component
- The method preserves decoder state between calls, which is what matters
3. Smart Preservation of Performance Features
- Correctly keeps `getCachedCtcRawLogits()` in AsrManager
- CTC logit caching requires direct access to the `encoder_output` MLMultiArray
- This ensures benchmarks can still access raw logits efficiently
🔍 Code Quality Analysis
Architecture Quality: ★★★★★
- Clean elimination of 59 lines from AsrManager (vocabulary state + methods)
- 143 lines removed from AsrTranscription.swift (duplicate rescoring logic)
- Single method call change in SlidingWindowAsrManager maintains functionality
API Design: ★★★★★
- Public API remains unchanged for end users
- Internal refactoring doesn't break existing consumers
- Method renaming improves API clarity without functional changes
Memory Management: ★★★★★
- Proper cleanup in `resetState()` and `cleanup()`
- Removes vocabulary boosting cleanup call, since state is no longer held
🧪 Testing & Validation
The test plan looks comprehensive:
- ✅ Build compiles cleanly
- ✅ All tests pass
- ✅ No swift-format violations
- ✅ No dangling references to old methods
- ⏳ Pending: ASR benchmark WER validation
- ⏳ Pending: CTC earnings benchmark validation
📝 Minor Observations
Variable Declaration Style
- Good change from `var result = ...` to `let result = ...` since vocabulary rescoring is removed
- Shows the code is now more functional (immutable results)
Documentation Impact
- The removed methods had good documentation - ensure equivalent docs exist in SlidingWindowAsrManager
🚀 Recommendation: APPROVE
This refactoring:
- Improves maintainability by eliminating duplicate code paths
- Clarifies responsibilities between inference engine vs. high-level features
- Maintains performance by keeping critical optimizations in place
- Preserves functionality while improving code organization
The only remaining items are benchmark validations, which are appropriate to verify no regressions were introduced.
Nit: Consider running the benchmarks mentioned in the test plan to complete the validation before merging.
PocketTTS Smoke Test ✅
Runtime: 0m28s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 3m26s
Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 2m 31s • 2026-03-29T03:34:09.031Z
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m15s • 03/28/2026, 11:42 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.3s diarization time • Test runtime: 1m 40s • 03/28/2026, 11:30 PM EST
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m49s • 03/28/2026, 11:30 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with the Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 228.7s processing • Test runtime: 3m 51s • 03/28/2026, 11:37 PM EST
- Remove deprecated calculateStartFrameOffset and its tests
- Add explicit parakeetTdtCtc110m case to Repo.folderName
- Extract duplicated defaultConfiguration() and defaultModelsDirectory() into shared MLModelConfigurationUtils, replacing 5+3 copy-pasted methods
- Rename StreamingAudioSourceFactory/SampleSource/Error to drop misleading "Streaming" prefix (types are used by both ASR and Diarizer)
- Rename files to match their type names (SortformerDiarizer, LSEENDDiarizer, NemotronStreamingAsrManager+Pipeline)
- Remove stale TODO and duplicate vocabularyFileArray constant
- Remove outdated nonisolated(unsafe) from SlidingWindowAsrManager
- Replace force unwraps in RnntDecoder with guard let + throw
- Delete dead loadModel() and getDefaultModelsDirectory() (legacy Models/Parakeet path)
- Remove dangling doc comment from deleted property
- Rename transcribeStreaming → transcribeDiskBacked (avoids confusion with the real streaming API in SlidingWindowAsrManager)
- Convert getDecoderLayers() to decoderLayerCount computed property
- Move AudioSource enum from AsrManager.swift to Shared/AudioSource.swift
- Mark pure utility methods as nonisolated: normalizedTimingToken, calculateConfidence, sliceEncoderOutput, removeDuplicateTokenSequence
ANEOptimizer was a thin wrapper over ANEMemoryUtils in the wrong location (ASR/Parakeet/ instead of Shared/). All callers now use ANEMemoryUtils directly.
- Replace ANEOptimizer.createANEAlignedArray → ANEMemoryUtils.createAlignedArray
- Replace ANEOptimizer.prefetchToNeuralEngine(x) → x.prefetchToNeuralEngine() (MLMultiArray extension already in ANEMemoryOptimizer)
- Move convertToFloat16 to ANEMemoryUtils (throws proper ANEMemoryError)
- Move ZeroCopyFeatureProvider to Shared/ZeroCopyFeatureProvider.swift
- Inline optimalComputeUnits (always returned .cpuAndNeuralEngine), delete ModelType enum
- Simplify AsrModels.optimizedConfiguration to use shared utility
- Delete ANEOptimizer.swift
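The aligned-array helper mentioned above is part of this repo's ANEMemoryUtils, which isn't shown here. As a rough, self-contained sketch of the underlying idea (the function name and the 64-byte figure are assumptions for illustration, not the real implementation, which wraps MLMultiArray):

```swift
import Foundation

// Sketch: allocate a buffer whose base address satisfies a given alignment,
// as the Neural Engine prefers aligned tensor storage. posix_memalign
// guarantees the requested alignment; the caller must free() the result.
func allocateAligned(byteCount: Int, alignment: Int = 64) -> UnsafeMutableRawPointer? {
    var pointer: UnsafeMutableRawPointer?
    guard posix_memalign(&pointer, alignment, byteCount) == 0 else { return nil }
    return pointer
}
```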
…te duplication
- Remove unused import OSLog and dead sliceEncoderOutput method
- Add clearCachedCtcData() helper to eliminate repeated nil assignments
- Add decoderState(for:)/setDecoderState(_:for:) to eliminate switch duplication
- Extract frameAlignedAudio() helper for duplicated frame-alignment logic
- Add ASRConstants.secondsPerEncoderFrame to replace magic number 0.08
- Replace hardcoded 16_000 with config.sampleRate
- Remove unused duration parameter from calculateConfidence
- Simplify processTranscriptionResult by removing dead tokenTimings parameter
- Replace convertTokensWithExistingTimings with simpler convertTokensToText
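The decoderState(for:)/setDecoderState(_:for:) pattern above can be sketched as follows. The variant cases, state shape, and type names here are illustrative assumptions; the point is that one switch pair replaces the four duplicated switch blocks the commit describes:

```swift
// Hypothetical model variants and decoder state; the real types differ.
enum AsrModelVariant { case tdt, ctc }

struct DecoderState { var tokens: [Int] = [] }

final class StateStore {
    private var tdtState = DecoderState()
    private var ctcState = DecoderState()

    // Single read accessor: every caller routes through one switch.
    func decoderState(for variant: AsrModelVariant) -> DecoderState {
        switch variant {
        case .tdt: return tdtState
        case .ctc: return ctcState
        }
    }

    // Single write accessor, mirroring the read path.
    func setDecoderState(_ state: DecoderState, for variant: AsrModelVariant) {
        switch variant {
        case .tdt: tdtState = state
        case .ctc: ctcState = state
        }
    }
}
```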
…ad code
- Apply enableFP16 to allowLowPrecisionAccumulationOnGPU in optimizedConfiguration (fixes review feedback on PR #460)
- Remove dead loadWithANEOptimization method (no callers)
- Remove unused import OSLog
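A minimal configuration sketch of this fix (the function name follows the PR's optimizedConfiguration; the compute-units choice is taken from the commit above, the rest is assumed):

```swift
import CoreML

// Sketch: the shared factory previously hardcoded
// allowLowPrecisionAccumulationOnGPU = true; it should honor the
// caller's enableFP16 flag instead.
func optimizedConfiguration(enableFP16: Bool = true) -> MLModelConfiguration {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine
    // Fix: propagate the caller's flag rather than hardcoding true.
    config.allowLowPrecisionAccumulationOnGPU = enableFP16
    return config
}
```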
- Delete PerformanceMonitor actor (never instantiated, component times hardcoded to 0)
- Delete AggregatedMetrics struct (only used by dead monitor)
- Remove unused imports (os, MachTaskSelfWrapper)
- Move ASRPerformanceMetrics to Shared/ (not Parakeet-specific)
- Remove dead PerformanceMonitor tests, keep ASRPerformanceMetrics tests
- Remove redundant currentStream() wrapper (callers use ensureSession)
- Fix finishSession: return early when inactive instead of creating an orphan stream
- Remove auto-create in resetAndPrepareNextSession (renamed to reset); next ensureSession() creates on demand
- Remove onTermination closure with unnecessary weak self
- Move from ASR/Parakeet/ to Shared/ (generic async stream utility)
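The ensureSession/finishSession/reset shape described above can be sketched with a small holder type (the class name and element type are assumptions; the real utility is the shared async stream helper this commit moves):

```swift
// Sketch: a session stream created lazily. finishSession() returns early
// when no session is active rather than creating one just to finish it,
// and reset() drops state so the next ensureSession() recreates on demand.
final class SessionBox {
    private var continuation: AsyncStream<Int>.Continuation?
    private(set) var stream: AsyncStream<Int>?

    @discardableResult
    func ensureSession() -> AsyncStream<Int> {
        if let stream { return stream }
        let (s, c) = AsyncStream.makeStream(of: Int.self)
        stream = s
        continuation = c
        return s
    }

    func finishSession() {
        // Early return when inactive: avoids the orphan-stream bug.
        guard let continuation else { return }
        continuation.finish()
        reset()
    }

    func reset() {
        continuation = nil
        stream = nil
    }
}
```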
…hared
- Remove unused `import os` and logger
- Delete dead `getFloat16Array` method and its 2 tests
- Fix `returnArray` to reset data for all types, not just float32
- Remove debug logging from hot path
- Move from ASR/Parakeet/ to Shared/ (used by ASR and shared cache)
…ff bug
- Remove unused `import CoreML` and `import OSLog`
- Replace hardcoded `sampleRate = 16000` with `ASRConstants.sampleRate`
- Replace manual frameDuration calculation with `ASRConstants.secondsPerEncoderFrame`
- Fix duplicate token at cutoff boundary in mergeByMidpoint (`<=` to `<`)
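The cutoff off-by-one fixed above can be illustrated with a self-contained sketch (the `Token` shape and signature are assumptions; the real code merges token timings across sliding-window chunks). With `<=` on the left side, a token sitting exactly on the cutoff landed in both halves; one strict comparison makes the split exclusive:

```swift
// Sketch: merge two overlapping chunks of timed tokens at a midpoint cutoff.
struct Token: Equatable { let id: Int; let time: Double }

func mergeByMidpoint(left: [Token], right: [Token], cutoff: Double) -> [Token] {
    // Left keeps tokens strictly before the cutoff (was `<=`, duplicating
    // boundary tokens); right keeps tokens at or after it.
    let head = left.filter { $0.time < cutoff }
    let tail = right.filter { $0.time >= cutoff }
    return head + tail
}
```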
Track the benchmark orchestration script that runs all 4 Parakeet model benchmarks (v3, v2, TDT-CTC-110M, CTC earnings) with asset verification and sleep prevention. Link it from the benchmark results doc for reproducibility. Whitelist the script in .gitignore (scripts/ was ignored).
- Add explicit folderName cases for parakeetCtc110m and parakeetCtc06b in ModelNames.swift. The default case strips "-coreml", which broke auto-detection of the CTC model directory.
- Add EOU 320ms and Nemotron 1120ms streaming benchmarks to the script
- Add WER comparison table against benchmarks100.md baselines
- Fix CTC earnings to use v2 TDT (matching baseline config)
- Fix WER extraction for fields stored as percentages vs decimals

Verified: all 6 benchmarks match baselines (v3 2.6%, v2 3.8%, TDT-CTC 3.6%, earnings 16.5%, EOU 7.11%, Nemotron 1.99%).
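The percentage-vs-decimal WER normalization mentioned above might look like the following sketch (the function name and the 1.0 threshold are assumptions; the real script does this extraction in its own way):

```swift
// Sketch: some benchmark fields report WER as a percentage (e.g. 2.6),
// others as a decimal fraction (e.g. 0.026). Real-world WERs in this suite
// are well above 1%, so any raw value below 1.0 is treated as a fraction.
func normalizedWerPercent(_ value: Double) -> Double {
    value < 1.0 ? value * 100.0 : value
}
```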
Adds Scripts/run_diarizer_benchmarks.sh that runs all 4 diarization systems (Offline VBx, Streaming 5s, Sortformer, LS-EEND) on a 4-meeting AMI SDM subset for regression testing. Adds Documentation/Diarization/BenchmarkAMISubset.md recording the baseline DER/RTFx for each system on EN2002a, ES2004a, IS1009a, TS3003a.
run_parakeet_benchmarks.sh → run_parakeet_subset.sh run_diarizer_benchmarks.sh → run_diarizer_subset.sh Updates all references in .gitignore, benchmarks100.md, and BenchmarkAMISubset.md.
run_diarizer_subset.sh → diarizer_subset_benchmark.sh run_parakeet_subset.sh → parakeet_subset_benchmark.sh
The first python3 call wrote to an unintended timestamp-less path and imported unused glob. Keep only the second invocation that correctly writes to the output_file parameter.
Summary
Systematic cleanup of the ASR module addressing tech debt items from #457. Net reduction of ~430 lines while fixing real bugs and improving maintainability.
Bug fixes
- `enableFP16` silently ignored — `optimizedConfiguration(enableFP16:)` delegated to a shared factory that hardcoded `allowLowPrecisionAccumulationOnGPU = true`, ignoring the caller's parameter
- `MLArrayCache.returnArray` only reset float32 data — cached arrays of other types (float16, int32) retained stale data from previous use
- `Repo.parakeetCtc110m.folderName` returned `"parakeet-ctc-110m"` instead of `"parakeet-ctc-110m-coreml"` because the `folderName` switch fell through to a `default` case that stripped the `-coreml` suffix. Same for `parakeetCtc06b`.
- `mergeByMidpoint` used `<=`/`>=`, so tokens exactly at the cutoff appeared in both left and right chunks

Dead code removal
- `ANEOptimizer` indirection layer (166 lines) — was a pass-through wrapping `MLModel` with no optimization
- `PerformanceMonitor` actor and `AggregatedMetrics` — never instantiated, component times hardcoded to 0
- `getFloat16Array` from MLArrayCache — never called
- `sliceEncoderOutput` from AsrTranscription — never called (30 lines)
- `loadWithANEOptimization` from AsrModels — never called
- `tokenTimings` parameter chain through `processTranscriptionResult`
- Unused `import OSLog`/`import CoreML` across 5 files
- `nonisolated(unsafe)` from SlidingWindowAsrManager (types already Sendable)

Duplication elimination
- `clearCachedCtcData()` helper (replaced 3× triple-nil assignments)
- `decoderState(for:)`/`setDecoderState(_:for:)` (replaced 4× switch blocks)
- `frameAlignedAudio()` (replaced 2× duplicated frame-alignment blocks)
- `ASRConstants.secondsPerEncoderFrame` (replaced 5× magic `0.08`)
- Replaced hardcoded `16_000` with `config.sampleRate`/`ASRConstants.sampleRate`
- `MLModelConfigurationUtils.defaultConfiguration()` (replaced 5× copy-pasted config methods)
- `MLModelConfigurationUtils.defaultModelsDirectory()` (replaced 3× copy-pasted directory methods)
- Deduplicated `vocabularyFile`/`vocabularyFileArray` constants

File organization
- Moved `PerformanceMetrics.swift`, `ProgressEmitter.swift`, `MLArrayCache.swift` from `ASR/Parakeet/` to `Shared/` (used by multiple modules)
- `StreamingAudioSourceFactory` → `AudioSourceFactory`, `StreamingAudioSampleSource` → `AudioSampleSource` (types used by both ASR and Diarizer)
- `SortformerDiarizerPipeline.swift` → `SortformerDiarizer.swift`, `LSEENDDiarizerAPI.swift` → `LSEENDDiarizer.swift`, `NemotronPipeline.swift` → `NemotronStreamingAsrManager+Pipeline.swift`
- Replaced force unwraps in `RnntDecoder.swift` with `guard let` + descriptive errors

Benchmark script
- `Scripts/run_parakeet_benchmarks.sh` — runs all 6 benchmarks (v3, v2, TDT-CTC-110M, CTC earnings, EOU 320ms, Nemotron 1120ms) with WER comparison against `benchmarks100.md` baselines and regression detection
- Linked from `Documentation/ASR/benchmarks100.md`

Verified — no regressions
Test plan
- `swift build` passes
- `swift test` passes (all existing tests, updated for removed dead code)
- `swift format lint` passes