
Conversation

@Alex-Wengg
Contributor

@Alex-Wengg Alex-Wengg commented Jan 25, 2026

Summary

  • Fixes macOS 26 BNNS compiler error for Sortformer models
  • Updates to use V2 models with renamed output tensors
  • Adds Float16 support for head module outputs

Problem

macOS 26 introduced stricter validation in the BNNS graph compiler that rejects CoreML models where input and output tensors share the same name:

Failed to configure ML Program for the feature types declared in the model description.
Function main has tensor chunk_pre_encoder_embs as both an input and output.
Inputs and outputs must be distinct, please add an explicit identity op.

Solution

Code Changes

  1. SortformerModelInference.swift:

    • Read from renamed outputs (chunk_pre_encoder_embs_out, chunk_pre_encoder_lengths_out)
    • Handle Float16 output (head module uses fp16 precision)
  2. ModelNames.swift:

    • Update model names to V2 (SortformerV2, SortformerNvidiaLowV2, SortformerNvidiaHighV2)
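Sketched in Swift, the renamed-output read with the fp16 fallback might look roughly like this (illustrative only: the function and error type are hypothetical; the tensor name and availability gate come from this PR):

```swift
import CoreML

enum SortformerError: Error {
    case missingOutput(String)  // hypothetical error type for this sketch
}

/// Reads pre-encoder embeddings from the V2 model's renamed "_out" tensor.
func readPreEncoderEmbeddings(from output: MLFeatureProvider) throws -> [Float] {
    // V2 models append "_out" so inputs and outputs no longer share a name.
    guard let embs = output.featureValue(for: "chunk_pre_encoder_embs_out")?.multiArrayValue else {
        throw SortformerError.missingOutput("chunk_pre_encoder_embs_out")
    }
    // The head module emits fp16; Float16's MLShapedArrayScalar conformance
    // is availability-gated, so fall back to an fp32 read elsewhere.
    if #available(macOS 15.0, iOS 18.0, *), embs.dataType == .float16 {
        return MLShapedArray<Float16>(embs).scalars.map(Float.init)
    }
    return MLShapedArray<Float>(embs).scalars
}
```

The dual read path keeps callers unaware of the model's output precision, which is why the change stays local to the inference layer.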

Model Changes (separate PR)

V2 models converted with:

  • Explicit identity ops (+ 0.0) to create distinct output tensors
  • Renamed output tensor names to *_out suffix

See: FluidInference/mobius#11

Testing

Tested on macOS 26.1 (Build 25B78), Apple M2:

  • Model loads successfully (~90ms warm, ~1.2s cold)
  • Inference works correctly (9-10x RTFx)
  • DER results match expected values

Fixes #265

@claude
Contributor

claude bot commented Jan 25, 2026

Claude finished @Alex-Wengg's task


PR Review Complete ✅

Reviewed PR #266: fix: macOS 26 Sortformer compatibility - use V2 models

Todo List:

  • Create pending PR review
  • Get PR diff to analyze exact changes
  • Read modified files for context
  • Analyze code changes for quality and issues
  • Add inline review comments
  • Submit final review

Review Summary

This PR successfully addresses the macOS 26 BNNS compiler error by updating to V2 models with distinct input/output tensor names and adding Float16 support. The changes are well-implemented with proper error handling and backward compatibility.

Key Findings:

  • ✅ Resolves the specific macOS 26 compatibility issue
  • ✅ Maintains backward compatibility with Float16/Float32 fallback
  • ✅ Clear documentation and minimal surface area changes
  • ✅ No performance or security concerns identified

Minor suggestions: Better documentation of Float16 requirements and more descriptive error messages, but these are non-blocking improvements.

The review has been submitted as a non-blocking comment to provide feedback while allowing the PR to proceed.



@claude claude bot left a comment


Code Review Summary

This PR successfully addresses the macOS 26 BNNS compiler error by updating to V2 models with distinct input/output tensor names and adding Float16 support. The changes are well-implemented and follow good practices.

Positive Aspects

  1. Clear Problem Resolution: The fix directly addresses the BNNS compiler error where input/output tensors shared the same name
  2. Backward Compatibility: Graceful fallback from Float16 to Float32 with proper availability checks
  3. Good Documentation: Clear comments explaining the reasoning behind the _out suffix
  4. Minimal Surface Area: Only necessary changes were made, keeping the diff focused

Minor Suggestions

  1. Float16 Availability Comment: Consider adding a comment explaining why macOS 15.0+/iOS 18.0+ is required for Float16 CoreML support at line 253
  2. Error Message Clarity: The error message "Missing chunk_pre_encoder_embs_out" could be more descriptive about the Float16/Float32 expectation

Technical Assessment

  • Thread Safety: ✅ No concurrency issues introduced
  • Memory Management: ✅ Proper use of existing memory optimization patterns
  • Error Handling: ✅ Appropriate error handling with meaningful messages
  • Performance: ✅ No performance regressions, maintains efficient tensor access
  • API Consistency: ✅ No breaking changes to public interfaces

Architecture Review

The changes maintain the existing architecture while adapting to new model requirements. The dual Float16/Float32 handling is implemented cleanly without adding complexity to the calling code.

Overall, this is a solid fix that resolves the immediate compatibility issue while maintaining code quality and backward compatibility.

@github-actions

github-actions bot commented Jan 25, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 590.9x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 592.9x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Jan 25, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 8.69x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 56.6s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.057s | Average chunk processing time |
| Max Chunk Time | 0.113s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m4s • 01/25/2026, 08:34 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@Alex-Wengg Alex-Wengg force-pushed the fix/macos26-sortformer-bnns branch from 69d642d to 06b8232 on January 25, 2026 at 22:56
@github-actions

github-actions bot commented Jan 25, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 17.33x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 8.566 | 14.1 | Fetching diarization models |
| Model Compile | 3.671 | 6.1 | CoreML compilation |
| Audio Load | 0.065 | 0.1 | Loading audio file |
| Segmentation | 18.152 | 30.0 | Detecting speech regions |
| Embedding | 30.254 | 50.0 | Extracting speaker voices |
| Clustering | 12.102 | 20.0 | Grouping same speakers |
| Total | 60.537 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at ~150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 60.5s diarization time • Test runtime: 2m 15s • 01/25/2026, 08:40 PM EST

@github-actions

github-actions bot commented Jan 25, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 14.4x | >1.0x | |
| Speakers | 4/4 | - | - |
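As a sanity check, DER is conventionally the sum of the three component rates, and the numbers above are consistent with that convention (a standalone check, not project code):

```swift
// DER = miss + false alarm + speaker error (standard diarization convention)
let missRate = 24.4, falseAlarm = 0.1, speakerError = 8.9
let der = missRate + falseAlarm + speakerError
// Sums to 33.4 (up to floating-point rounding), matching the DER row above.
```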

Sortformer High-Latency • ES2004a • Runtime: 2m 33s • 2026-01-26T01:37:22.684Z

@github-actions

github-actions bot commented Jan 25, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 14.5% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 2.95x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 15.138 | 4.3 | Fetching diarization models |
| Model Compile | 6.488 | 1.8 | CoreML compilation |
| Audio Load | 0.076 | 0.0 | Loading audio file |
| Segmentation | 41.259 | 11.6 | VAD + speech detection |
| Embedding | 351.064 | 98.8 | Speaker embedding extraction |
| Clustering (VBx) | 3.272 | 0.9 | Hungarian algorithm + VBx clustering |
| Total | 355.201 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 395.6s processing • Test runtime: 7m 44s • 01/25/2026, 08:48 PM EST

@github-actions

github-actions bot commented Jan 25, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 3.73x | |
| test-other | 1.40% | 0.00% | 2.80x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 3.49x | |
| test-other | 1.00% | 0.00% | 2.48x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.49x | Streaming real-time factor |
| Avg Chunk Time | 1.825s | Average time to process each chunk |
| Max Chunk Time | 2.678s | Maximum chunk processing time |
| First Token | 2.226s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.49x | Streaming real-time factor |
| Avg Chunk Time | 1.819s | Average time to process each chunk |
| Max Chunk Time | 2.889s | Maximum chunk processing time |
| First Token | 1.985s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m11s • 01/25/2026, 08:40 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
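The definition above is just audio duration divided by processing time; as a minimal Swift sketch (names are illustrative):

```swift
/// RTFx: total audio duration divided by total processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// The example above: 10 s of audio in 5 s of processing.
let factor = rtfx(audioSeconds: 10, processingSeconds: 5)  // 2.0
```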

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

macOS 26 introduced stricter BNNS compiler validation that rejects
CoreML models where input and output tensors share the same name.

Changes:
- SortformerModelInference.swift: Read from renamed outputs
  (chunk_pre_encoder_embs_out, chunk_pre_encoder_lengths_out)
- SortformerModelInference.swift: Handle Float16 output from head module
- ModelNames.swift: Update to download V2 models from HuggingFace
- SortformerTypes.swift: Remove unused nestEncoderDims property

The V2 models were converted with explicit identity ops to create
distinct output tensors, fixing the BNNS compiler error:
"Function main has tensor chunk_pre_encoder_embs as both an input
and output. Inputs and outputs must be distinct."

Fixes #265
@Alex-Wengg Alex-Wengg force-pushed the fix/macos26-sortformer-bnns branch from 15d136c to df8a963 on January 26, 2026 at 01:30
@Alex-Wengg Alex-Wengg enabled auto-merge (squash) January 26, 2026 01:49
@Alex-Wengg Alex-Wengg disabled auto-merge January 26, 2026 01:51
@Alex-Wengg Alex-Wengg merged commit b598f43 into main Jan 26, 2026
10 checks passed
@Alex-Wengg Alex-Wengg deleted the fix/macos26-sortformer-bnns branch January 26, 2026 01:52
@Josscii
Contributor

Josscii commented Jan 30, 2026

hi, this does not compile when using Mac Catalyst:

Conformance of 'Float16' to 'MLShapedArrayScalar' is unavailable in Mac Catalyst
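One possible workaround (an untested sketch, not part of this PR) is to compile out the fp16 path under Mac Catalyst, where the conformance is unavailable:

```swift
import CoreML

func floatScalars(from embs: MLMultiArray) -> [Float] {
    #if !targetEnvironment(macCatalyst)
    // Float16's MLShapedArrayScalar conformance is unavailable in Mac Catalyst,
    // so this branch only compiles for native macOS/iOS targets.
    if #available(macOS 15.0, iOS 18.0, *), embs.dataType == .float16 {
        return MLShapedArray<Float16>(embs).scalars.map(Float.init)
    }
    #endif
    return MLShapedArray<Float>(embs).scalars
}
```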

@Alex-Wengg
Contributor Author

hi @Josscii, in that case can you create an issue for Mac Catalyst?



Development

Successfully merging this pull request may close these issues.

Sortformer model fails to compile - BNNS Graph Compile errors
