docs: add architecture comments to PocketTTS pipeline#380

Merged
Alex-Wengg merged 1 commit into main from
docs/pocket-tts-architecture-comments
Mar 15, 2026

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented Mar 15, 2026

Summary

  • Adds clarifying comments across 6 PocketTTS pipeline files to document the architecture, data flow, and model I/O
  • Fixes stale comment referencing "200 positions" when the actual KV cache max is 512
  • No code changes, comments only

Files modified

| File | Changes |
| --- | --- |
| PocketTtsConstants.swift | Explain each constant's role (80ms frames, 32-d latent, EOS threshold, etc.) |
| PocketTtsSynthesizer+KVCache.swift | Document cache shape [2,1,512,16,64] dimensions, prefill vs generate mode, voice-first ordering |
| PocketTtsSynthesizer+Types.swift | Group Mimi state tensors by function, note auto-generated CoreML key names |
| PocketTtsSynthesizer+Flow.swift | Explain flow matching concept, Euler integration, s/t parameters, sqrt(temperature) |
| PocketTtsSynthesizer+Mimi.swift | Clarify streaming state persistence across chunks (unlike KV cache) |
| PocketTtsSynthesizer.swift | Fix stale "200 positions" → 512, document BOS/NaN signaling, autoregressive feedback |
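
As a rough illustration of the cache shape named in the table, the sketch below reads `[2,1,512,16,64]` as (key/value, batch, positions, heads, head dimension). Only the 512-position maximum is stated in this PR; the other axis labels, the `KVCacheShape` type, and the 4-byte element size are assumptions made for illustration and are not taken from the project's source.

```swift
// Hypothetical description of the KV cache layout; all names are illustrative.
struct KVCacheShape {
    let keyValue = 2        // assumed: index 0 = keys, index 1 = values
    let batch = 1           // assumed: one utterance at a time
    let maxPositions = 512  // stated in the PR: the KV cache max is 512
    let heads = 16          // assumed: number of attention heads
    let headDim = 64        // assumed: per-head dimension

    var dimensions: [Int] { [keyValue, batch, maxPositions, heads, headDim] }

    // Total number of scalars held by one cache tensor.
    var elementCount: Int { dimensions.reduce(1, *) }

    // Memory footprint for a given element size (element size is an assumption).
    func byteCount(bytesPerElement: Int) -> Int { elementCount * bytesPerElement }

    // True if `incoming` more positions fit after `filled` positions are used.
    func hasRoom(filled: Int, incoming: Int) -> Bool {
        filled + incoming <= maxPositions
    }
}

let shape = KVCacheShape()
print(shape.elementCount)                   // 1048576
print(shape.byteCount(bytesPerElement: 4))  // 4194304
print(shape.hasRoom(filled: 500, incoming: 13))  // false
```

A capacity check of this kind is where a stale limit such as "200 positions" would be misleading, which is presumably why the comment fix matters.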

Test plan

  • swift build passes
  • swift format lint clean
  • No behavioral changes — comments only

Add clarifying comments across 6 PocketTTS files to make the pipeline
architecture, data flow, and model I/O easier to understand:

- PocketTtsConstants: explain what each constant means in context
- KVCache: document cache shape dimensions, prefill vs generate mode
- Types: group Mimi state tensors by function, note auto-generated keys
- Flow: explain flow matching, Euler integration, s/t parameters
- Mimi: clarify streaming state persistence across chunks
- Synthesizer: fix stale "200 positions" comment, document BOS/feedback
@claude
Contributor

claude bot commented Mar 15, 2026

Claude finished @Alex-Wengg's task


PR Review: PocketTTS Architecture Comments ✅

Reviewed this documentation PR that adds architecture comments to the PocketTTS pipeline.

Todo List

  • Create pending PR review
  • Get PR diff information
  • Read and analyze the affected files
  • Review comment accuracy and technical correctness
  • Add inline review comments (submitted as overall review)
  • Submit comprehensive review
  • Update final status

Review Summary

Recommendation: APPROVE

This is an excellent documentation PR that significantly improves understanding of the PocketTTS architecture. All technical details are accurate, including:

  • ✅ KV cache tensor shapes and mechanics correctly explained
  • ✅ Flow matching and Euler integration process accurately described
  • ✅ Mimi streaming state vs KV cache behavior properly contrasted
  • ✅ Fixed stale "200 positions" → "512" reference
  • ✅ Mathematical relationships (80ms timing, sqrt(temperature) scaling) verified
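
The Euler integration and sqrt(temperature) scaling mentioned above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the PocketTTS implementation: `scaleNoise`, `eulerIntegrate`, and `toyVelocity` are hypothetical names, the closure stands in for the flow network, and `s`/`t` are assumed to be the start and end time of each step.

```swift
// Scale unit-variance noise so its variance equals `temperature`.
// Variance grows with the square of the multiplier, hence sqrt(temperature).
func scaleNoise(_ unitNoise: [Double], temperature: Double) -> [Double] {
    let scale = temperature.squareRoot()
    return unitNoise.map { $0 * scale }
}

// Integrate dx/dt = v(x, s, t) from time 0 to 1 with fixed-size Euler steps.
// `velocity` receives the current latent plus the step's start time s and
// end time t (an assumed signature standing in for the flow network).
func eulerIntegrate(
    from start: [Double],
    steps: Int,
    velocity: ([Double], Double, Double) -> [Double]
) -> [Double] {
    precondition(steps > 0, "need at least one step")
    var x = start
    let dt = 1.0 / Double(steps)
    for k in 0..<steps {
        let s = Double(k) * dt
        let t = Double(k + 1) * dt
        let v = velocity(x, s, t)
        x = zip(x, v).map { $0 + dt * $1 }  // x <- x + dt * v
    }
    return x
}

// Toy velocity field for a straight-line path that ends at `target`.
let target = [0.5, -1.0, 2.0]
let toyVelocity: ([Double], Double, Double) -> [Double] = { x, s, _ in
    zip(x, target).map { ($1 - $0) / (1.0 - s) }
}

// Pretend these are unit-variance Gaussian samples.
let noise = scaleNoise([1.0, -2.0, 0.5], temperature: 0.25)
let latent = eulerIntegrate(from: noise, steps: 4, velocity: toyVelocity)
print(noise)   // [0.5, -1.0, 0.25]
print(latent)  // lands on `target` up to floating-point error
```

With this particular toy field the final Euler step moves the latent exactly onto the target, so the step count only changes the path taken; a learned velocity field would not have that property.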

The comments provide valuable context on design decisions like voice-first ordering and cache reset behavior. This documentation will be invaluable for future maintainers and debugging.

No blocking issues found - this is purely additive documentation with no code changes.


@Alex-Wengg Alex-Wengg merged commit 691a3f5 into main Mar 15, 2026
5 of 14 checks passed
@Alex-Wengg Alex-Wengg deleted the docs/pocket-tts-architecture-comments branch March 15, 2026 13:06
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Contributor

@claude claude bot left a comment

Overall Assessment

This is an excellent documentation PR that significantly improves the understanding of the PocketTTS architecture. The added comments are technically accurate, well-structured, and provide valuable context for future maintainers.

Strengths

  • Accurate technical details: All mathematical relationships and tensor shapes are correct
  • Clear explanations: Complex concepts like flow matching, KV cache mechanics, and streaming state are well-explained
  • Architectural context: Comments explain the "why" behind design decisions (e.g., voice-before-text ordering, cache reset per chunk)
  • Consistency: Commentary style is consistent across all files
  • Important fix: Corrected the stale "200 positions" → "512" reference

Technical Accuracy Verified

  • ✅ KV cache shape [2,1,512,16,64] dimensions correctly explained
  • ✅ Flow matching process and Euler integration steps accurately described
  • ✅ Mimi streaming state persistence vs KV cache reset properly contrasted
  • ✅ Temperature scaling explanation (sqrt for variance) is mathematically correct
  • ✅ Audio timing calculations (80ms frames, 1920 samples @ 24kHz) verified
  • ✅ Autoregressive feedback mechanism clearly explained
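
The timing figures in the list above follow from simple arithmetic, sketched here. The helper names are hypothetical; only the 24 kHz rate, the 80 ms frame length, and the 1920-sample result come from the review.

```swift
// Values quoted in the review: 24 kHz audio, 80 ms per generated frame.
let sampleRate = 24_000       // samples per second
let frameMilliseconds = 80    // duration of one frame

// Integer arithmetic avoids rounding: 24_000 * 80 / 1000.
func samplesPerFrame(sampleRate: Int, frameMilliseconds: Int) -> Int {
    sampleRate * frameMilliseconds / 1000
}

// Duration in seconds covered by `frames` generated frames (hypothetical helper).
func audioSeconds(frames: Int, frameMilliseconds: Int) -> Double {
    Double(frames * frameMilliseconds) / 1000.0
}

print(samplesPerFrame(sampleRate: sampleRate, frameMilliseconds: frameMilliseconds))  // 1920
print(audioSeconds(frames: 25, frameMilliseconds: frameMilliseconds))                 // 2.0
```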

Minor Observations

  • The detailed breakdown of Mimi's 26 state tensors provides excellent debugging context
  • CoreML key naming explanation helps explain the auto-generated, non-intuitive names
  • Voice prompt ordering rationale prevents potential future bugs from reordering

This documentation will be invaluable for anyone working on or debugging the PocketTTS pipeline. The comments strike the right balance between technical depth and clarity.

Recommendation: Approve - this is exactly the kind of architectural documentation that makes a codebase maintainable.

@github-actions

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 759.7x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 779.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%
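
For readers checking how the reported metrics relate, the sketch below computes them from a confusion matrix. The counts (25 true positives, 4 false positives, 21 true negatives, 0 false negatives) are an assumption: they are one split of 50 files that reproduces the reported percentages, not values taken from the benchmark output. The `Confusion` type and `percent` helper are likewise hypothetical.

```swift
// Hypothetical confusion-matrix counts for a 50-file VAD run.
struct Confusion {
    let tp: Int  // speech files detected as speech
    let fp: Int  // non-speech files detected as speech
    let tn: Int  // non-speech files detected as non-speech
    let fn: Int  // speech files detected as non-speech

    var accuracy: Double { Double(tp + tn) / Double(tp + fp + tn + fn) }
    var precision: Double { Double(tp) / Double(tp + fp) }
    var recall: Double { Double(tp) / Double(tp + fn) }
    // F1 is the harmonic mean of precision and recall.
    var f1: Double { 2 * precision * recall / (precision + recall) }
}

// Round a ratio to a percentage with one decimal place.
func percent(_ ratio: Double) -> Double {
    (ratio * 1000).rounded() / 10
}

let assumed = Confusion(tp: 25, fp: 4, tn: 21, fn: 0)
print(percent(assumed.accuracy))   // 92.0
print(percent(assumed.precision))  // 86.2
print(percent(assumed.recall))     // 100.0
print(percent(assumed.f1))         // 92.6
```

Perfect recall with lower precision means every error in this assumed split is a false positive, which is consistent with the identical numbers on both datasets.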

@github-actions

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | NaN% | <30% | ⚠️ | Diarization Error Rate (lower is better) |
| JER | NaN% | <25% | ⚠️ | Jaccard Error Rate |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |
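
RTFx is the ratio of audio duration to processing time. The NaN entries in this table would be consistent with a ratio whose inputs were both zero (for example, a run that produced no timings), though the page does not show the actual cause. The sketch below uses a hypothetical `rtfx` helper to show how an unguarded division yields NaN and how a guard can report a missing value instead.

```swift
// RTFx = audio duration / processing time; above 1.0 is faster than real time.
// Returns nil when either input is missing or non-positive, instead of NaN.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard audioSeconds > 0, processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// Render the value for a report, with a placeholder when it is unavailable.
func formatRTFx(_ value: Double?) -> String {
    value.map { "\($0)x" } ?? "n/a"
}

let zero = 0.0
let unguarded = zero / zero  // floating-point 0/0 is NaN, not a crash
print(unguarded.isNaN)       // true

print(formatRTFx(rtfx(audioSeconds: 300, processingSeconds: 2)))  // 150.0x
print(formatRTFx(rtfx(audioSeconds: 0, processingSeconds: 0)))    // n/a
```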

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | Detecting speech regions |
| Embedding | NaN | NaN | Extracting speaker voices |
| Clustering | NaN | NaN | Grouping same speakers |
| Total | NaN | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | NaN% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150x real time (RTFx 150)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs diarization time • Test runtime: N/A • 03/15/2026, 09:24 AM EST

@github-actions

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: 4m 59s • 03/15/2026, 09:24 AM EST

@github-actions

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 0.0% | <35% | |
| Miss Rate | 0.0% | - | - |
| False Alarm | 0.0% | - | - |
| Speaker Error | 0.0% | - | - |
| RTFx | 0.0x | >1.0x | ⚠️ |
| Speakers | 0/0 | - | - |
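
The rows of this table are related: under the standard definition, DER is the sum of missed speech, false alarm, and speaker error time divided by total reference speech time. The sketch below illustrates that relationship with made-up durations; the `DiarizationErrors` type and all numbers are hypothetical and do not come from the benchmark. Returning `nil` for an empty reference is one way to keep a run with no scored audio from being reported as a perfect 0.0%.

```swift
// Hypothetical container for diarization error durations, in seconds.
struct DiarizationErrors {
    let missedSeconds: Double          // reference speech with no hypothesis
    let falseAlarmSeconds: Double      // hypothesis speech with no reference
    let speakerErrorSeconds: Double    // speech attributed to the wrong speaker
    let referenceSpeechSeconds: Double // total scored reference speech

    // DER = (miss + false alarm + speaker error) / reference speech.
    // nil when there is no reference speech to score against.
    var der: Double? {
        guard referenceSpeechSeconds > 0 else { return nil }
        let errors = missedSeconds + falseAlarmSeconds + speakerErrorSeconds
        return errors / referenceSpeechSeconds
    }
}

// Made-up example: 10 s missed, 5 s false alarm, 15 s confused, 100 s reference.
let example = DiarizationErrors(
    missedSeconds: 10, falseAlarmSeconds: 5,
    speakerErrorSeconds: 15, referenceSpeechSeconds: 100
)
let emptyRun = DiarizationErrors(
    missedSeconds: 0, falseAlarmSeconds: 0,
    speakerErrorSeconds: 0, referenceSpeechSeconds: 0
)

print(example.der ?? -1)    // 0.3
print(emptyRun.der == nil)  // true
```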

Sortformer High-Latency • ES2004a • Runtime: N/A • 2026-03-15T13:26:29.578Z
