docs: add architecture comments to PocketTTS pipeline by Alex-Wengg · Pull Request #380 · FluidInference/FluidAudio

Alex-Wengg · 2026-03-15T13:05:37Z

Summary

Adds clarifying comments across 6 PocketTTS pipeline files to document the architecture, data flow, and model I/O
Fixes stale comment referencing "200 positions" when the actual KV cache max is 512
No code changes, comments only

Files modified

File	Changes
`PocketTtsConstants.swift`	Explain each constant's role (80ms frames, 32-d latent, EOS threshold, etc.)
`PocketTtsSynthesizer+KVCache.swift`	Document cache shape `[2,1,512,16,64]` dimensions, prefill vs generate mode, voice-first ordering
`PocketTtsSynthesizer+Types.swift`	Group Mimi state tensors by function, note auto-generated CoreML key names
`PocketTtsSynthesizer+Flow.swift`	Explain flow matching concept, Euler integration, s/t parameters, sqrt(temperature)
`PocketTtsSynthesizer+Mimi.swift`	Clarify streaming state persistence across chunks (unlike KV cache)
`PocketTtsSynthesizer.swift`	Fix stale "200 positions" → 512, document BOS/NaN signaling, autoregressive feedback
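
The figures in the table above can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch only: the constant names are hypothetical and are not taken from `PocketTtsConstants.swift`, the 24 kHz sample rate comes from the review comment further down this thread, and the per-axis labels for the cache shape are an assumption based on common transformer KV-cache layouts rather than on the PR's actual comments.

```swift
// Hypothetical constants for illustration; the real names live in PocketTtsConstants.swift.
let sampleRate = 24_000            // Hz, as quoted in the review comment
let frameDurationMs = 80           // one generated frame covers 80 ms of audio
let samplesPerFrame = sampleRate * frameDurationMs / 1_000

// KV cache shape as given in the PR description.
// Axis meanings are ASSUMED: [key/value, batch, positions, heads, head dimension].
let kvCacheShape = [2, 1, 512, 16, 64]
let kvCacheElements = kvCacheShape.reduce(1, *)
let maxPositions = kvCacheShape[2]

print(samplesPerFrame)   // 1920
print(kvCacheElements)   // 1048576
print(maxPositions)      // 512
```

Under these assumptions, the third axis is the one the stale "200 positions" comment referred to, which is why the fix changes it to 512.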

Test plan

swift build passes
swift format lint clean
No behavioral changes — comments only

Add clarifying comments across 6 PocketTTS files to make the pipeline architecture, data flow, and model I/O easier to understand:

- PocketTtsConstants: explain what each constant means in context
- KVCache: document cache shape dimensions, prefill vs generate mode
- Types: group Mimi state tensors by function, note auto-generated keys
- Flow: explain flow matching, Euler integration, s/t parameters
- Mimi: clarify streaming state persistence across chunks
- Synthesizer: fix stale "200 positions" comment, document BOS/feedback

claude · 2026-03-15T13:05:59Z

Claude finished @Alex-Wengg's task

PR Review: PocketTTS Architecture Comments ✅

Reviewed this documentation PR that adds architecture comments to the PocketTTS pipeline.

Todo List

Create pending PR review
Get PR diff information
Read and analyze the affected files
Review comment accuracy and technical correctness
Add inline review comments (submitted as overall review)
Submit comprehensive review
Update final status

Review Summary

Recommendation: APPROVE ✅

This is an excellent documentation PR that significantly improves understanding of the PocketTTS architecture. All technical details are accurate, including:

✅ KV cache tensor shapes and mechanics correctly explained
✅ Flow matching and Euler integration process accurately described
✅ Mimi streaming state vs KV cache behavior properly contrasted
✅ Fixed stale "200 positions" → "512" reference
✅ Mathematical relationships (80ms timing, sqrt(temperature) scaling) verified

The comments provide valuable context on design decisions like voice-first ordering and cache reset behavior. This documentation will be invaluable for future maintainers and debugging.

No blocking issues found - this is purely additive documentation with no code changes.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

claude

Overall Assessment

This is an excellent documentation PR that significantly improves the understanding of the PocketTTS architecture. The added comments are technically accurate, well-structured, and provide valuable context for future maintainers.

Strengths

Accurate technical details: All mathematical relationships and tensor shapes are correct
Clear explanations: Complex concepts like flow matching, KV cache mechanics, and streaming state are well-explained
Architectural context: Comments explain the "why" behind design decisions (e.g., voice-before-text ordering, cache reset per chunk)
Consistency: Commentary style is consistent across all files
Important fix: Corrected the stale "200 positions" → "512" reference

Technical Accuracy Verified

✅ KV cache shape [2,1,512,16,64] dimensions correctly explained
✅ Flow matching process and Euler integration steps accurately described
✅ Mimi streaming state persistence vs KV cache reset properly contrasted
✅ Temperature scaling explanation (sqrt for variance) is mathematically correct
✅ Audio timing calculations (80ms frames, 1920 samples @ 24kHz) verified
✅ Autoregressive feedback mechanism clearly explained
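
To make the flow-matching items above concrete, here is a minimal sketch of Euler integration starting from temperature-scaled noise. It is not the code in `PocketTtsSynthesizer+Flow.swift`: the function names, the step count, and the toy velocity field are all hypothetical, and reading `s` and `t` as the start and end times of each step is an assumption. The one point taken directly from the review is that noise is scaled by sqrt(temperature), so that its variance equals the temperature.

```swift
import Foundation

/// Draws one standard-normal sample with the Box-Muller transform.
func standardNormal() -> Double {
    let u1 = Double.random(in: Double.leastNonzeroMagnitude...1)
    let u2 = Double.random(in: 0..<1)
    return (-2 * log(u1)).squareRoot() * cos(2 * Double.pi * u2)
}

/// Initial noise whose variance equals `temperature`: scaling a unit-variance
/// sample by sqrt(temperature) multiplies its variance by temperature.
func initialNoise(dimension: Int, temperature: Double) -> [Double] {
    let scale = temperature.squareRoot()
    return (0..<dimension).map { _ in scale * standardNormal() }
}

/// Euler integration over the unit time interval in `steps` equal steps.
/// `velocity(x, s, t)` stands in for the flow network; s and t are ASSUMED to be
/// the start and end times of the current step.
func eulerIntegrate(
    from x0: [Double],
    steps: Int,
    velocity: ([Double], Double, Double) -> [Double]
) -> [Double] {
    var x = x0
    let dt = 1.0 / Double(steps)
    for i in 0..<steps {
        let s = Double(i) * dt
        let t = s + dt
        let v = velocity(x, s, t)
        for j in x.indices { x[j] += v[j] * dt }
    }
    return x
}

// Toy velocity field (hypothetical): always points at a fixed target, scaled so that
// the remaining distance is covered in the remaining time 1 - s.
let target = [Double](repeating: 0.5, count: 32)   // 32-d latent, as in the PR table
let noise = initialNoise(dimension: 32, temperature: 0.7)
let latent = eulerIntegrate(from: noise, steps: 8) { x, s, _ in
    zip(target, x).map { ($0 - $1) / (1 - s) }
}
let maxError = zip(latent, target).map { abs($0 - $1) }.max() ?? 0
print(maxError < 1e-9)
```

In a real pipeline the closure would call the flow model rather than a closed-form field; the toy field is used here only so that the result can be checked without a model.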

Minor Observations

The detailed breakdown of Mimi's 26 state tensors provides excellent debugging context
CoreML key naming explanation helps explain the auto-generated, non-intuitive names
Voice prompt ordering rationale prevents potential future bugs from reordering

This documentation will be invaluable for anyone working on or debugging the PocketTTS pipeline. The comments strike the right balance between technical depth and clarity.

Recommendation: Approve - this is exactly the kind of architectural documentation that makes a codebase maintainable.

github-actions · 2026-03-15T13:15:33Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	759.7x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	779.5x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%
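
The F1 column follows from the precision and recall columns. The short check below uses the standard F1 definition (harmonic mean of precision and recall); the function name is hypothetical and the inputs are the rounded percentages from the table, so the result is only expected to match to the displayed precision.

```swift
/// Harmonic mean of precision and recall (the standard F1 definition).
func f1Score(precision: Double, recall: Double) -> Double {
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

// Values from the table: precision 86.2%, recall 100.0%.
let f1 = f1Score(precision: 0.862, recall: 1.0)
let f1Percent = (f1 * 1000).rounded() / 10
print(f1Percent)   // 92.6
```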

github-actions · 2026-03-15T13:24:21Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	NaN%	<30%	⚠️	Diarization Error Rate (lower is better)
JER	NaN%	<25%	⚠️	Jaccard Error Rate
RTFx	NaNx	>1.0x	⚠️	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	NaN	NaN	Fetching diarization models
Model Compile	NaN	NaN	CoreML compilation
Audio Load	NaN	NaN	Loading audio file
Segmentation	NaN	NaN	Detecting speech regions
Embedding	NaN	NaN	Extracting speaker voices
Clustering	NaN	NaN	Grouping same speakers
Total	NaN	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	NaN%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): Runs at 150x real time (150 RTFx)
Performance scales with Apple Neural Engine capabilities
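
For readers unfamiliar with the metric, the sketch below shows one way RTFx could be computed, assuming it is seconds of audio divided by seconds of processing (consistent with "higher is faster" and the ">1.0x" target in the table). The function name is hypothetical. The source does not say why this CI run reported NaN; the guard simply illustrates how a report could surface missing timings as "no value" instead of propagating NaN.

```swift
/// Real-time factor, ASSUMED to be audio duration divided by processing time.
/// Returns nil when either input is missing, non-finite, or non-positive.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard audioSeconds.isFinite, processingSeconds.isFinite,
          audioSeconds > 0, processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// 300 s of audio processed in 2 s corresponds to the 150x figure quoted above.
if let factor = rtfx(audioSeconds: 300, processingSeconds: 2) {
    print(factor)   // 150.0
}

// Unguarded floating-point division of two missing (zero) timings yields NaN.
let unguarded = 0.0 / 0.0
print(unguarded.isNaN)   // true
```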

🎯 Speaker Diarization Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs diarization time • Test runtime: N/A • 03/15/2026, 09:24 AM EST

github-actions · 2026-03-15T13:24:48Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	NaN%	<20%	⚠️	Diarization Error Rate (lower is better)
RTFx	NaNx	>1.0x	⚠️	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	NaN	NaN	Fetching diarization models
Model Compile	NaN	NaN	CoreML compilation
Audio Load	NaN	NaN	Loading audio file
Segmentation	NaN	NaN	VAD + speech detection
Embedding	NaN	NaN	Speaker embedding extraction
Clustering (VBx)	NaN	NaN	Hungarian algorithm + VBx clustering
Total	NaN	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	NaN%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: 4m 59s • 03/15/2026, 09:24 AM EST

github-actions · 2026-03-15T13:26:31Z

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric	Value	Target	Status
DER	0.0%	<35%	✅
Miss Rate	0.0%	-	-
False Alarm	0.0%	-	-
Speaker Error	0.0%	-	-
RTFx	0.0x	>1.0x	⚠️
Speakers	0/0	-	-

Sortformer High-Latency • ES2004a • Runtime: N/A • 2026-03-15T13:26:29.578Z

Alex-Wengg merged commit 691a3f5 into main Mar 15, 2026
5 of 14 checks passed

Alex-Wengg deleted the docs/pocket-tts-architecture-comments branch March 15, 2026 13:06

devin-ai-integration bot reviewed Mar 15, 2026


claude bot reviewed Mar 15, 2026


Alex-Wengg merged 1 commit into main from docs/pocket-tts-architecture-comments