Metal optimizations by BrandonWeng · Pull Request #3 · FluidInference/FluidAudio

BrandonWeng · 2025-06-28T05:04:38Z

Making some optimizations to use the Accelerate and Metal toolchains, when available, for the audio conversions and embedding comparisons. Added some tests and benchmarks, but I still need to integrate it end to end to test it out fully.
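For reference, the embedding comparison being benchmarked is cosine distance (per the "Cosine Distance" entry in the benchmark breakdown). The sketch below is a plain scalar baseline in Python, not the PR's Swift code; the function name is hypothetical and it assumes the usual definition of one minus cosine similarity.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined, so refuse rather than guess.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

A scalar loop like this is the kind of baseline a vectorized or GPU path is measured against; for short vectors the dispatch overhead of the accelerated path can outweigh the arithmetic it saves.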

> uv run run_benchmarks.py
🚀 FluidAudioSwift Metal Acceleration Benchmarks
==================================================
📁 Changed directory to project root: /Users/brandonweng/code/FluidAudioSwift
📦 Building package...
[1/1] Planning build
Building for production...
[2/2] Compiling FluidAudioSwift DiarizerManager.swift
Build complete! (2.06s)
🔬 Running Metal acceleration benchmarks...
This may take several minutes...
✅ Benchmarks completed successfully!
📊 Benchmark Results Summary:
===============================
✅ Metal Performance Shaders available
🕐 Timestamp: 2025-06-28T05:01:52Z
📈 Total tests run: 26
⚡ Average speedup: 0.40x
🚀 Best speedup: 2.42x
⚠️  Metal overhead detected (expected for small operations)

📋 Test Breakdown:
   • Cosine Distance: 12 tests, 0.39x avg speedup
   • End To End Diarization: 3 tests, 0.98x avg speedup
   • Memory Usage: 3 tests, 0.00x avg speedup
   • Powerset Conversion: 8 tests, 0.20x avg speedup

📁 Full results saved to: benchmark_results_20250628_010232.json
💡 Tip: Use 'jq' to explore the JSON results in detail:
   cat benchmark_results_20250628_010232.json | jq '.tests[] | select(.test_type == "cosine_distance")'

🎯 Benchmark run complete!
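
As an alternative to the `jq` tip, the per-type averages can be recomputed in Python. This is a minimal sketch: the `tests` array and `test_type` key come from the `jq` filter above, but the `speedup` field name is an assumption about the JSON schema, and the sample values are made up for illustration.

```python
from collections import defaultdict


def summarize_speedups(results):
    """Group tests by test_type and return {type: (count, mean speedup)}."""
    grouped = defaultdict(list)
    for test in results.get("tests", []):
        # "speedup" is an assumed field name; adjust to the real schema.
        grouped[test["test_type"]].append(test["speedup"])
    return {
        test_type: (len(values), sum(values) / len(values))
        for test_type, values in grouped.items()
    }


# Illustrative placeholder data, not taken from the actual results file.
sample = {
    "tests": [
        {"test_type": "cosine_distance", "speedup": 0.5},
        {"test_type": "cosine_distance", "speedup": 0.25},
    ]
}
print(summarize_speedups(sample))  # {'cosine_distance': (2, 0.375)}
```

To run it on real output, load the saved file with `json.load` and pass the resulting dictionary to `summarize_speedups`.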

### Why is this change needed?

Taking inspiration from the Silero VAD utilities (https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py), this updates our segmentation implementation and adds support for streaming VAD.

```bash
% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
  ...
  libavutil 60. 8.100 / 60. 8.100
  libavcodec 62. 11.100 / 62. 11.100
  libavformat 62. 3.100 / 62. 3.100
  libavdevice 62. 1.100 / 62. 1.100
  libavfilter 11. 4.100 / 11. 4.100
  libswscale 9. 1.100 / 9. 1.100
  libswresample 6. 1.100 / 6. 1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
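
To compare the ffmpeg pass with the VAD segments, the silence intervals in the log above can be inverted into non-silent intervals. The following is a rough sketch: the helper names are hypothetical, the interval values are copied from the `silencedetect` lines, and the 16000 Hz default comes from the stream info in the ffmpeg output.

```python
def speech_from_silence(silences, total_duration):
    """Invert (start, end) silence intervals into non-silent intervals."""
    speech = []
    cursor = 0.0  # end of the silence seen so far
    for start, end in sorted(silences):
        if start > cursor:
            speech.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_duration:
        speech.append((cursor, total_duration))
    return speech


def samples_to_seconds(sample_index, sample_rate=16000):
    """Convert a sample index to seconds at the given sample rate."""
    return sample_index / sample_rate


# Silence intervals copied from the silencedetect log lines.
silences = [
    (0.0, 1.364),
    (2.305687, 4.394813),
    (7.579813, 14.003938),
    (15.845063, 17.45075),
    (18.692625, 29.667438),
    (30.367563, 41.412062),
    (41.454687, 45.000813),
]

for start, end in speech_from_silence(silences, total_duration=45.66):
    print(f"{start:.2f}s-{end:.2f}s")

# Segment #1 from the offline run: samples 18880-42560 at 16 kHz.
print(samples_to_seconds(18880), samples_to_seconds(42560))  # 1.18 2.66
```

On this file the inversion gives seven non-silent intervals, including one of roughly 0.04 s at 41.41 s that has no counterpart among the six VAD segments. The `silencedetect` filter works purely from a level threshold (-30 dB here), so the two outputs are not expected to agree exactly.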

## Summary

Migrates fluidaudio-rs (Rust + FFI) to FluidAudioAPI (pure Swift 6) with:

- Zero FFI overhead (5-10% faster than Rust bindings)
- Swift 6 strict concurrency compliance
- Actor-based isolation for thread safety
- Full async/await throughout
- 15 comprehensive tests (all passing)

## New Features

### Core Library

- `FluidAudioAPI` actor with simplified async/await API
- ASR: Automatic Speech Recognition
- VAD: Voice Activity Detection
- Diarization: Speaker identification
- `transcribeSamples()`: Real-time buffer transcription (issue #3)

### Testing

- 15 unit tests covering all functionality
- Swift 6 strict concurrency verified
- Performance benchmarks: 5.6x realtime transcription
- Test execution: 1.47s total

### Documentation

- Complete API reference (400+ lines)
- Migration guide from Rust FFI
- 3 working examples
- Test results report
- CI/CD setup guide

### CI/CD

- GitHub Actions workflow with 6 parallel jobs
- Validates tests, examples, docs, Swift 6 compliance
- Specifically verifies issue #3 feature
- ~5-10 minute feedback on PRs

## Performance

| Metric | Value |
|--------|-------|
| Transcription speed | 5.6x realtime |
| 1s audio processing | 0.18s |
| Memory overhead vs Rust | -5-10% (no FFI) |
| Lines of code | 338 (vs 1000+ Rust+FFI) |

## Files Added

- Sources/FluidAudioAPI/ (7 files)
- Tests/FluidAudioAPITests/ (1 file)
- .github/workflows/fluidaudio-api-tests.yml
- Documentation (4 files)

## Replaces

- fluidaudio-rs Rust crate
- C FFI bridge
- Manual semaphore-based concurrency

## Issue References

Fixes FluidInference/fluidaudio-rs#3

Implements real-time audio transcription via transcribeSamples() method.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
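
The two speed figures in the Performance table are consistent with each other if the real-time factor is taken as audio duration divided by processing time. That definition is an assumption, since the summary does not spell it out, and the function name below is hypothetical.

```python
def realtime_factor(audio_seconds, processing_seconds):
    """Return how many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


# 1 s of audio processed in 0.18 s, as listed in the Performance table.
print(round(realtime_factor(1.0, 0.18), 1))  # 5.6
```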

BrandonWeng added 2 commits June 28, 2025 00:31

Metal optimizations for pipeline operations

f3c3e90

Benchmark runs

58a33d3

BrandonWeng requested review from Alex-Wengg and Bharat0091 June 28, 2025 05:04

BrandonWeng added 2 commits June 28, 2025 13:35

Sample files and downloading benchmarking files

8562d93

Add annotation for benchmark tests

bef4371

BrandonWeng closed this Jun 28, 2025

BrandonWeng deleted the metal-optimizations branch August 1, 2025 20:14

rohithjnayak mentioned this pull request Oct 31, 2025

Add support for x86_64 architecture #173

Closed

claude bot mentioned this pull request Feb 15, 2026

feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment #315

Closed

8 tasks

Alex-Wengg mentioned this pull request Mar 24, 2026

Add FluidAudioAPI: Pure Swift 6 replacement for fluidaudio-rs #420

Closed

7 tasks

Metal optimizations #3
BrandonWeng wants to merge 4 commits into main from metal-optimizations
