
Conversation

@Alex-Wengg
Contributor

@Alex-Wengg Alex-Wengg commented Jan 25, 2026

Summary

  • Fixes macOS 26 BNNS compiler error for Sortformer models
  • Updates to use V2 models with renamed output tensors
  • Adds Float16 support for head module outputs

Problem

macOS 26 introduced stricter validation in the BNNS graph compiler that rejects CoreML models where input and output tensors share the same name:

Failed to configure ML Program for the feature types declared in the model description.
Function main has tensor chunk_pre_encoder_embs as both an input and output.
Inputs and outputs must be distinct, please add an explicit identity op.

Solution

Code Changes

  1. SortformerModelInference.swift:

    • Read from renamed outputs (chunk_pre_encoder_embs_out, chunk_pre_encoder_lengths_out)
    • Handle Float16 output (head module uses fp16 precision)
  2. ModelNames.swift:

    • Update model names to V2 (SortformerV2, SortformerNvidiaLowV2, SortformerNvidiaHighV2)
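Sketched in Swift, the renamed-output read with the fp16 fallback might look roughly like this (illustrative only: the function and error type are hypothetical; the tensor name and availability gate come from this PR):

```swift
import CoreML

enum SortformerError: Error {
    case missingOutput(String)  // hypothetical error type for this sketch
}

/// Reads pre-encoder embeddings from the V2 model's renamed "_out" tensor.
func readPreEncoderEmbeddings(from output: MLFeatureProvider) throws -> [Float] {
    // V2 models append "_out" so inputs and outputs no longer share a name.
    guard let embs = output.featureValue(for: "chunk_pre_encoder_embs_out")?.multiArrayValue else {
        throw SortformerError.missingOutput("chunk_pre_encoder_embs_out")
    }
    // The head module emits fp16; Float16's MLShapedArrayScalar conformance
    // is availability-gated, so fall back to an fp32 read elsewhere.
    if #available(macOS 15.0, iOS 18.0, *), embs.dataType == .float16 {
        return MLShapedArray<Float16>(embs).scalars.map(Float.init)
    }
    return MLShapedArray<Float>(embs).scalars
}
```

The dual read path keeps callers unaware of the model's output precision, which is why the change stays local to the inference layer.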

Model Changes (separate PR)

V2 models converted with:

  • Explicit identity ops (+ 0.0) to create distinct output tensors
  • Renamed output tensor names to *_out suffix

See: FluidInference/mobius#11

Testing

Tested on macOS 26.1 (Build 25B78), Apple M2:

  • Model loads successfully (~90ms warm, ~1.2s cold)
  • Inference works correctly (9-10x RTFx)
  • DER results match expected values

Fixes #265

@claude
Contributor

claude bot commented Jan 25, 2026

Claude finished @Alex-Wengg's task


PR Review Complete ✅

Reviewed PR #266: fix: macOS 26 Sortformer compatibility - use V2 models

Todo List:

  • Create pending PR review
  • Get PR diff to analyze exact changes
  • Read modified files for context
  • Analyze code changes for quality and issues
  • Add inline review comments
  • Submit final review

Review Summary

This PR successfully addresses the macOS 26 BNNS compiler error by updating to V2 models with distinct input/output tensor names and adding Float16 support. The changes are well-implemented with proper error handling and backward compatibility.

Key Findings:

  • ✅ Resolves the specific macOS 26 compatibility issue
  • ✅ Maintains backward compatibility with Float16/Float32 fallback
  • ✅ Clear documentation and minimal surface area changes
  • ✅ No performance or security concerns identified

Minor suggestions: Better documentation of Float16 requirements and more descriptive error messages, but these are non-blocking improvements.

The review has been submitted as a non-blocking comment to provide feedback while allowing the PR to proceed.



@claude claude bot left a comment


Code Review Summary

This PR successfully addresses the macOS 26 BNNS compiler error by updating to V2 models with distinct input/output tensor names and adding Float16 support. The changes are well-implemented and follow good practices.

Positive Aspects

  1. Clear Problem Resolution: The fix directly addresses the BNNS compiler error where input/output tensors shared the same name
  2. Backward Compatibility: Graceful fallback from Float16 to Float32 with proper availability checks
  3. Good Documentation: Clear comments explaining the reasoning behind the _out suffix
  4. Minimal Surface Area: Only necessary changes were made, keeping the diff focused

Minor Suggestions

  1. Float16 Availability Comment: Consider adding a comment explaining why macOS 15.0+/iOS 18.0+ is required for Float16 CoreML support at line 253
  2. Error Message Clarity: The error message "Missing chunk_pre_encoder_embs_out" could be more descriptive about the Float16/Float32 expectation

Technical Assessment

  • Thread Safety: ✅ No concurrency issues introduced
  • Memory Management: ✅ Proper use of existing memory optimization patterns
  • Error Handling: ✅ Appropriate error handling with meaningful messages
  • Performance: ✅ No performance regressions, maintains efficient tensor access
  • API Consistency: ✅ No breaking changes to public interfaces

Architecture Review

The changes maintain the existing architecture while adapting to new model requirements. The dual Float16/Float32 handling is implemented cleanly without adding complexity to the calling code.

Overall, this is a solid fix that resolves the immediate compatibility issue while maintaining code quality and backward compatibility.

@github-actions

github-actions bot commented Jan 25, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 590.9x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 592.9x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Jan 25, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 8.69x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 56.6s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.057s | Average chunk processing time |
| Max Chunk Time | 0.113s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m4s • 01/25/2026, 08:34 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@Alex-Wengg Alex-Wengg force-pushed the fix/macos26-sortformer-bnns branch from 69d642d to 06b8232 on January 25, 2026 at 22:56
@github-actions

github-actions bot commented Jan 25, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 17.33x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 8.566 | 14.1 | Fetching diarization models |
| Model Compile | 3.671 | 6.1 | CoreML compilation |
| Audio Load | 0.065 | 0.1 | Loading audio file |
| Segmentation | 18.152 | 30.0 | Detecting speech regions |
| Embedding | 30.254 | 50.0 | Extracting speaker voices |
| Clustering | 12.102 | 20.0 | Grouping same speakers |
| Total | 60.537 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at ~150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 60.5s diarization time • Test runtime: 2m 15s • 01/25/2026, 08:40 PM EST

@github-actions

github-actions bot commented Jan 25, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 14.4x | >1.0x | |
| Speakers | 4/4 | - | - |
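As a sanity check, DER is conventionally the sum of the three component rates, and the numbers above are consistent with that convention (a standalone check, not project code):

```swift
// DER = miss + false alarm + speaker error (standard diarization convention)
let missRate = 24.4, falseAlarm = 0.1, speakerError = 8.9
let der = missRate + falseAlarm + speakerError
// Sums to 33.4 (up to floating-point rounding), matching the DER row above.
```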

Sortformer High-Latency • ES2004a • Runtime: 2m 33s • 2026-01-26T01:37:22.684Z

@github-actions

github-actions bot commented Jan 25, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 14.5% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 2.95x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 15.138 | 4.3 | Fetching diarization models |
| Model Compile | 6.488 | 1.8 | CoreML compilation |
| Audio Load | 0.076 | 0.0 | Loading audio file |
| Segmentation | 41.259 | 11.6 | VAD + speech detection |
| Embedding | 351.064 | 98.8 | Speaker embedding extraction |
| Clustering (VBx) | 3.272 | 0.9 | Hungarian algorithm + VBx clustering |
| Total | 355.201 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 395.6s processing • Test runtime: 7m 44s • 01/25/2026, 08:48 PM EST

@github-actions

github-actions bot commented Jan 25, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 3.73x | |
| test-other | 1.40% | 0.00% | 2.80x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 3.49x | |
| test-other | 1.00% | 0.00% | 2.48x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.49x | Streaming real-time factor |
| Avg Chunk Time | 1.825s | Average time to process each chunk |
| Max Chunk Time | 2.678s | Maximum chunk processing time |
| First Token | 2.226s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.49x | Streaming real-time factor |
| Avg Chunk Time | 1.819s | Average time to process each chunk |
| Max Chunk Time | 2.889s | Maximum chunk processing time |
| First Token | 1.985s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m11s • 01/25/2026, 08:40 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
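The definition above is just audio duration divided by processing time; as a minimal Swift sketch (names are illustrative):

```swift
/// RTFx: total audio duration divided by total processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// The example above: 10 s of audio in 5 s of processing.
let factor = rtfx(audioSeconds: 10, processingSeconds: 5)  // 2.0
```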

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

macOS 26 introduced stricter BNNS compiler validation that rejects
CoreML models where input and output tensors share the same name.

Changes:
- SortformerModelInference.swift: Read from renamed outputs
  (chunk_pre_encoder_embs_out, chunk_pre_encoder_lengths_out)
- SortformerModelInference.swift: Handle Float16 output from head module
- ModelNames.swift: Update to download V2 models from HuggingFace
- SortformerTypes.swift: Remove unused nestEncoderDims property

The V2 models were converted with explicit identity ops to create
distinct output tensors, fixing the BNNS compiler error:
"Function main has tensor chunk_pre_encoder_embs as both an input
and output. Inputs and outputs must be distinct."

Fixes #265
@Alex-Wengg Alex-Wengg force-pushed the fix/macos26-sortformer-bnns branch from 15d136c to df8a963 on January 26, 2026 at 01:30
@Alex-Wengg Alex-Wengg enabled auto-merge (squash) January 26, 2026 01:49
@Alex-Wengg Alex-Wengg disabled auto-merge January 26, 2026 01:51
@Alex-Wengg Alex-Wengg merged commit b598f43 into main Jan 26, 2026
10 checks passed
@Alex-Wengg Alex-Wengg deleted the fix/macos26-sortformer-bnns branch January 26, 2026 01:52
@Josscii
Contributor

Josscii commented Jan 30, 2026

hi, this does not compile when using Mac Catalyst:

Conformance of 'Float16' to 'MLShapedArrayScalar' is unavailable in Mac Catalyst
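One possible workaround (an untested sketch, not part of this PR) is to compile out the fp16 path under Mac Catalyst, where the conformance is unavailable:

```swift
import CoreML

func floatScalars(from embs: MLMultiArray) -> [Float] {
    #if !targetEnvironment(macCatalyst)
    // Float16's MLShapedArrayScalar conformance is unavailable in Mac Catalyst,
    // so this branch only compiles for native macOS/iOS targets.
    if #available(macOS 15.0, iOS 18.0, *), embs.dataType == .float16 {
        return MLShapedArray<Float16>(embs).scalars.map(Float.init)
    }
    #endif
    return MLShapedArray<Float>(embs).scalars
}
```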

@Alex-Wengg
Contributor Author

hi @Josscii, in that case can you create an issue for Mac Catalyst?



Development

Successfully merging this pull request may close these issues.

Sortformer model fails to compile - BNNS Graph Compile errors
