
feat(tts/styletts2): CoreML conversion (4-stage, int8 TP, 4.32× RTFx) #46

Merged

Alex-Wengg merged 7 commits into main from tts/styletts2-coreml on Apr 29, 2026

Conversation

Alex-Wengg (Member) commented Apr 28, 2026

Summary

Port yl4579/StyleTTS2 (LibriTTS multi-speaker checkpoint) to CoreML for on-device inference on Apple Silicon. Four-stage pipeline with mixed precision and per-stage compute-unit placement.

Headline numbers

  • RTFx: 4.32× warm (M-series Mac, 5-step ADPM2 sampler)
  • On-disk size: ~1.4 GB (decoder is fp32; fp16 produces robotic audio — see coreml/PHASE6_FP16_DECODER.md)
  • Log-mel cosine vs PyTorch fp32: 0.9687
  • Voice-clone fidelity: at the model's architectural ceiling — see TRIALS.md Phase 5

Stage layout

| Stage | Buckets | Precision | Compute unit |
| --- | --- | --- | --- |
| text_predictor | 5 token (32, 64, 128, 256, 512) | selective int8 | ANE |
| diffusion_step | 1 (B=512) | fp16 | CPU+GPU |
| f0n_energy | dynamic | fp16 | ANE |
| decoder | 5 mel (256, 512, 1024, 2048, 4096) | fp32 | CPU+GPU |

The ADPM2 sampler loop (5 steps) and the hard-alignment matrix live in Swift; only per-step inference is in CoreML.
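
For concreteness, loading the four stages with this placement looks roughly like the following (a minimal sketch; the package filenames are illustrative, and "ANE" maps to CoreML's CPU_AND_NE compute unit):

```python
import coremltools as ct

# Hedged sketch of the per-stage compute-unit placement from the table above.
# Filenames are illustrative, not the repo's exact package names.
text_predictor = ct.models.MLModel("text_predictor_512.mlpackage",
                                   compute_units=ct.ComputeUnit.CPU_AND_NE)
f0n_energy = ct.models.MLModel("f0n_energy.mlpackage",
                               compute_units=ct.ComputeUnit.CPU_AND_NE)
diffusion_step = ct.models.MLModel("diffusion_step_b512.mlpackage",
                                   compute_units=ct.ComputeUnit.CPU_AND_GPU)
decoder = ct.models.MLModel("decoder_1024.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_GPU)
```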

Optimizations applied

  1. Selective int8 PTQ on text_predictor. linear_quantize_weights(weight_threshold=200_000) — same recipe family as pocket_tts/coreml/PRECISION.md. 89 MB saved across 5 buckets, log-mel cosine 0.9998 vs fp32; a minimal sketch follows this list.
  2. Diffusion-bucket prune. Empirically every observed bert_dur fits in B=512, so the smaller buckets were dead weight (192 MB on disk). The cost ladder is non-linear (B=32: 66 ms/step → B=512: 152 ms/step), so in the worst case a short utterance pays ~430 ms extra (5 steps × ~86 ms) to save those 192 MB.
  3. Per-stage compute_units sweep. ANE for text_predictor + f0n_energy; CPU+GPU for diffusion + decoder (ANE either rejects subgraphs or runs slower for these). Result: RTFx 1.61× → 3.80× → 4.32× warm (final number includes explicit warmup of every package).
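
The item-1 recipe, as a minimal sketch assuming coremltools ≥ 7's optimize.coreml API (package paths are illustrative):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Hedged sketch of the selective int8 PTQ step; paths are illustrative.
mlmodel = ct.models.MLModel("text_predictor_512.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=200_000,  # only weights with >= 200k elements get int8
    )
)
cto.linear_quantize_weights(mlmodel, config=config).save(
    "text_predictor_512_int8.mlpackage"
)
```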

Decoder precision (fp32, not fp16)

Initial export was fp16 (coremltools mlprogram default) and produced audibly robotic synthesis. Root cause: SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes ~4000 mid-frame. fp16 precision at that magnitude (~4) is much larger than the per-sample increment (~0.05 rad), scrambling the sin output.
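
A quick numpy illustration of the magnitude problem (nothing model-specific):

```python
import numpy as np

# At |phase| ≈ 4000 the fp16 grid spacing is 2 (4 above 4096), so a ~0.05 rad
# per-sample increment is rounded away entirely before sin() ever sees it.
phase = np.float16(4000.0)
step = np.float16(0.05)
print(np.spacing(phase))      # 2.0 — smallest representable increment at this magnitude
print(phase + step == phase)  # True — the phase simply stops advancing
```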

Two viable fp16 stabilizations were investigated and documented (mixed-precision via op_selector; v3 phase-mod-2π wrapping — numerically validated, rms 0.0547 matches fp32 0.0551), but both add complexity. We ship fp32 decoders. See coreml/PHASE6_FP16_DECODER.md.

SineGen op-translation fix

Three coremltools translation bugs in _f02sine (aten::remainder, two F.interpolate paths) make a vanilla trace produce silent or NaN waveforms. The v2 patch in _styletts2_lib.install_sinegen_v2_constfold_fix(t_mel) constant-folds the fracs index and inlines a manual linear lerp; required before tracing each bucket. See 04_export_decoder.py and coreml/PHASE6_FP16_DECODER.md.

Voice-clone forensics (Phase 5)

User flagged "robotic" audio. Investigation (separate from the SineGen issue):

  • GE2E speaker similarity (resemblyzer) noise-floors at ~0.88 on synthetic TTS — useless signal.
  • ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb): cos(OPT, INT8) = 0.9987 → quantization is innocent. cos(INT8, INT8D512) = 0.7881 → bucket prune costs cosine.
  • PT vs CoreML head-to-head with real reference: cos(REF, PT) = 0.2933 (PyTorch fp32 itself). cos(REF, CM) = 0.1795. The ECAPA same-speaker threshold is ~0.3.

Conclusion: StyleTTS2's voice-cloning fidelity is bounded by the model architecture, not by CoreML conversion. PyTorch fp32 is at the ceiling.
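
The ECAPA comparisons above can be reproduced along these lines (a sketch assuming speechbrain's pretrained EncoderClassifier interface and 16 kHz mono wavs; filenames are illustrative, not the repo's exact script):

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Hedged sketch of the ECAPA-TDNN cosine check used for the forensics above.
ecapa = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                       # (channels, samples), assumed 16 kHz
    emb = ecapa.encode_batch(wav.mean(0, keepdim=True))   # (1, 1, 192)
    return emb.squeeze()

ref, pt, cm = embed("reference.wav"), embed("pytorch_fp32.wav"), embed("coreml.wav")
print("cos(REF, PT):", torch.nn.functional.cosine_similarity(ref, pt, dim=0).item())
print("cos(REF, CM):", torch.nn.functional.cosine_similarity(ref, cm, dim=0).item())
```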

Files

  • coreml/PRECISION.md — mixed-precision recipe + per-stage rationale (decoder fp32 documented)
  • coreml/PHASE6_FP16_DECODER.md — SineGen op-translation fix + fp16 audio-regression diagnosis
  • coreml/TRIALS.md — chronological log
  • scripts/01-04_*.py — per-stage exporters (each with --trace-only)
  • scripts/99{,b,c}_*.py — parity check, baseline e2e, optimized e2e (99c now takes --reference-wav)
  • scripts/optimize/{quantize_text_predictor_int8,measure_diffusion_buckets}.py

Test plan

  • Per-stage parity (cosine 0.9999+) via 99_parity_check.py (raw s_pred comparison fixed per Devin)
  • E2E log-mel parity vs PyTorch fp32: 0.9687
  • ASR round-trip: whisper-small transcribes all 4 variants correctly (sketched after this list)
  • Speaker-ID forensics (ECAPA-TDNN) confirms quantization preserves voice
  • Warm RTFx 4.32× measured in 99c_e2e_optimized.py
  • Swift-side ADPM2 sampler integration (follow-up PR in FluidAudio)
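
The ASR round-trip can be checked along these lines (a sketch assuming the openai-whisper package; the four wav filenames are illustrative):

```python
import whisper

# Hedged sketch of the ASR round-trip; model size per the test plan above.
asr = whisper.load_model("small")
for wav in ["pt_fp32.wav", "coreml_baseline.wav", "coreml_int8.wav", "coreml_optimized.wav"]:
    print(wav, "→", asr.transcribe(wav)["text"])
```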

Devin review (PR #46)

Addressed in 873fbe1:

  • 🔴 v1 SineGen shim using broken ops (already removed in b058784 by promoting v2 patch)
  • 🔴 Hardcoded user paths in 99c_e2e_optimized.py, quantize_text_predictor_int8.py, measure_diffusion_buckets.py — replaced with Path(__file__).resolve().parent[s][N] (sketch below)
  • 🟡 99c_e2e_optimized.py CPU_ONLY → CPU_AND_GPU for diffusion + decoder (matches RTFx claim)
  • 🟡 99c_e2e_optimized.py argparse with --reference-wav (required) and optional --baseline-wav (no more crashes on missing /tmp file)
  • 🟡 99_parity_check.py Stage B compares raw s_pred from PT against raw s_pred from CoreML (was inadvertently comparing post-blend PT against raw CM)
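
The path and CLI changes above follow roughly this pattern (a sketch, not the repo's exact code; 99c resolves relative to its own location, the optimize scripts use parents[2] / "coreml"):

```python
import argparse
from pathlib import Path

# Hedged sketch: repo-relative paths instead of hardcoded /Users/... ones, a
# required --reference-wav, and an optional --baseline-wav that is skipped
# (not a crash) when the file is absent.
SCRIPT_DIR = Path(__file__).resolve().parent                  # 99c_e2e_optimized.py
COREML_DIR = Path(__file__).resolve().parents[2] / "coreml"   # optimize/*.py

parser = argparse.ArgumentParser()
parser.add_argument("--reference-wav", required=True, help="real reference speaker wav")
parser.add_argument("--baseline-wav", default=None,
                    help="optional wav for the spectral comparison")
args = parser.parse_args()

if args.baseline_wav and not Path(args.baseline_wav).exists():
    print(f"baseline {args.baseline_wav} not found; skipping spectral comparison")
    args.baseline_wav = None
```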


Port yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint to CoreML.

Stages and placement:
- text_predictor (5 token buckets): selective int8, ANE
- diffusion_step (B=512): fp16, CPU+GPU
- f0n_energy: fp16, ANE
- decoder (5 mel buckets): fp16, CPU+GPU

Optimizations:
- Selective int8 PTQ on text_predictor via
  linear_quantize_weights(weight_threshold=200_000) — 89 MB saved,
  log-mel cosine 0.9998 vs fp32.
- Pruned 4 unused diffusion buckets (kept B=512 only) — 192 MB saved.
- Total: 1062 MB → 871 MB (-281 MB / -26.5%).
- Per-stage compute_units sweep: RTFx 1.61× → 3.80× → 4.32× warm.

Validation:
- Per-stage parity: cosine 0.9999+
- E2E log-mel parity vs PyTorch fp32: 0.9687
- Voice-clone fidelity (ECAPA-TDNN cos to ref): at architectural ceiling
  (PyTorch fp32 itself only achieves 0.29). Quantization is innocent
  (cos OPT vs INT8 = 0.9987); the "robotic" complaint is the model.

Documentation:
- coreml/PRECISION.md — mixed-precision recipe + per-stage rationale
- coreml/TRIALS.md — chronological log of 25 trials across 5 phases
- README.md updated with final numbers and run commands
devin-ai-integration[bot]

This comment was marked as resolved.

Replaces the prior "stochastic SineGen baked at trace time" hypothesis
(disproved by the deterministic-shim attempt) with the actual root cause:
three coremltools translation bugs in SineGen._f02sine —

  1. (x % 1) lowers to all-zeros via aten::remainder
  2. F.interpolate(scale_factor=1/300, mode=linear) downsample → NaN
  3. F.interpolate(scale_factor=300, mode=linear) upsample → NaN

Three-part fix verified end-to-end on buckets 256 and 1024 (clean audio
in real pipeline) with rms parity 0.998–1.000 across all five buckets:

  1. (x % 1)             → x - torch.floor(x)
  2. downsample          → stride slice [..., ::300]
  3. upsample            → manual linear lerp from CoreML primitives,
                            with fracs index constant-folded at trace time
                            (Python-int closure, not SymInt-driven arange)

The constant-fold step is critical — v1 of this fix used arange against
a SymInt and silently re-introduced the same broken aten::remainder op
in the lerp index path, producing identical robotic output despite the
modulo on f0/sr being correct.

PHASE6 doc now reflects the verified fix, all five exported mlpackages,
and the remaining work (ANE-eligible re-export, promote into canonical
04_export_decoder.py).
devin-ai-integration[bot]

This comment was marked as resolved.

`SineGen._f02sine` triggers three coremltools op-translation bugs that
turn the harmonic source into garbage and produce robotic CoreML audio.
Replace the v1 align_corners shim in `_build_modules` (which only
addressed the upsample translator and still produced robotic audio in
the full pipeline) with the constant-folded v2 fix applied per-bucket.

- Add top-level `install_sinegen_v2_constfold_fix(t_mel)` in
  `_styletts2_lib.py`. Rewrites `_f02sine` so the modulo becomes
  `x - floor(x)`, the downsample becomes a stride slice, and the
  upsample becomes a manual linear lerp built from `repeat_interleave`
  and a constant-folded `fracs` index baked at trace time via a Python-
  int closure (T_audio = T_mel * 2 * 300). Also installs a
  deterministic `forward` that drops the trace-baked random noise.
- `04_export_decoder.py` calls the helper before each bucket's trace,
  so all five mlpackages {256, 512, 1024, 2048, 4096} go through the
  same canonical path.
- Verified clean fp32 audio end-to-end (full pipeline, all five
  buckets). PT vs CoreML rms parity 0.998-1.000.
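
A hedged sketch of what the three substitutions look like (variable names, boundary handling, and the exact T_audio closure are illustrative; the real patch lives in `_styletts2_lib.install_sinegen_v2_constfold_fix`):

```python
import torch

# Hedged sketch of the v2 substitutions; `up` (= hop 300) and T_mel are plain
# Python ints at trace time, so the fracs index constant-folds into the graph
# instead of going through a SymInt-driven arange + aten::remainder.
def make_patched_ops(T_mel: int, up: int = 300):
    T_audio = T_mel * up  # the real closure bakes T_audio = T_mel * 2 * 300

    def frac(x):  # (x % 1) without aten::remainder
        return x - torch.floor(x)

    def downsample(x):  # replaces F.interpolate(scale_factor=1/up, mode="linear")
        return x[..., ::up]

    # Constant per-sample fractions, built from Python ints, baked at trace time.
    fracs = torch.tensor([(i % up) / up for i in range(T_audio)])

    def upsample_linear(x):  # replaces F.interpolate(scale_factor=up, mode="linear")
        left = x.repeat_interleave(up, dim=-1)
        right = torch.roll(x, shifts=-1, dims=-1).repeat_interleave(up, dim=-1)
        return left + fracs * (right - left)  # last-frame boundary handling elided

    return frac, downsample, upsample_linear
```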

fp16 weight precision investigated and **not** shipped: produces
robotic audio because `phase_scaled = cumsum × 2π × 300` reaches ~4000
mid-frame, where fp16 precision (~4) is much larger than the per-sample
phase increment (~0.05 rad). Two viable fixes are sketched in
PHASE6_FP16_DECODER.md (mixed precision via `op_selector`, or v3
phase-mod-2π wrapping in SineGen), kept for when size becomes a constraint.
fp32 ships clean at the current size budget.

Also documents the ANE-compile hang at convert time when
`compute_units=.ALL` is used (synchronous XPC to anecompilerservice
inside `MLModel.__init__`); workaround is `compute_units=CPU_AND_GPU`
+ `skip_model_load=True`, which leaves the saved mlpackage runtime-
selectable and bypasses the daemon at export time.
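
A sketch of that export call, assuming coremltools' standard convert() signature (the traced module, input name/shape, and filename are illustrative); it folds in the fp32 pin from the paragraph above plus the CPU_AND_GPU + skip_model_load workaround:

```python
import coremltools as ct

# Hedged sketch: decoder export with fp32 weights pinned (mlprogram defaults
# to fp16), compute_units=CPU_AND_GPU to stay off the ANE compile path, and
# skip_model_load=True so MLModel.__init__ never makes the synchronous XPC
# call to anecompilerservice at convert time.
decoder_pkg = ct.convert(
    traced_decoder,  # torch.jit.trace of one decoder bucket (assumed to exist)
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="asr_features", shape=(1, 512, 1024))],  # illustrative
    compute_precision=ct.precision.FLOAT32,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
    skip_model_load=True,
)
decoder_pkg.save("decoder_1024.mlpackage")  # compute unit stays runtime-selectable
```
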
- `_styletts2_lib.py` SineGen v1 shim still using broken ops: superseded
  by the prior commit (v2 constfold patch promoted into the canonical
  per-bucket path). No further action needed.

- `99c_e2e_optimized.py`: replace hardcoded `/Users/kikow/...` path with
  `Path(__file__).resolve().parent`. Replace hardcoded `REF_WAV =
  /tmp/.../coreml.wav` (output of `99b_e2e_coreml.py`, which isn't in
  the README workflow) with a required `--reference-wav` CLI arg, and
  make the spectral-comparison vs `coreml_optimal.wav` an opt-in
  `--baseline-wav` arg that no-ops when the file is missing instead of
  crashing after the pipeline has already finished. Switch
  `diffusion_step` and `decoder` MLModel loads from `CPU_ONLY` to
  `CPU_AND_GPU` to match the documented per-stage placement (TRIALS.md
  Trial 13, README precision table).

- `optimize/quantize_text_predictor_int8.py`,
  `optimize/measure_diffusion_buckets.py`: replace hardcoded `PKG`
  absolute path with `Path(__file__).resolve().parents[2] / "coreml"`.

- `99_parity_check.py` Stage B: stash the raw ADPM2 sampler output
  (`s_pred_raw`) in the PyTorch reference dict before alpha/beta
  blending, and compare against that in Stage B (instead of the
  post-blend `s` / `ref` concat which conflated sampler parity with
  the blending arithmetic). Also runs `report("s_pred", ...)` so the
  cosine / abs-diff stats land in the same format as the other stages.

- README: 99c invocation now needs `--reference-wav <path/to/ref.wav>`.

coremltools mlprogram defaults to fp16; without an explicit
compute_precision=FLOAT32 the canonical 04_export_decoder.py produced
fp16 decoders whose SineGen harmonic source saturates phase precision
mid-frame (cumsum × 2π × 300 reaches ~4000; fp16 precision at that
magnitude is ~4 vs per-sample increment ~0.05 rad). Result: scrambled
sine output, audibly robotic synthesis.

Pin compute_precision=ct.precision.FLOAT32 in the convert call and
propagate the precision/size change through README.md and PRECISION.md
(decoder row fp16 → fp32; total on-disk 871 MB → ~1.4 GB; bucket
strategy and build-and-ship summary updated). Cross-references
PHASE6_FP16_DECODER.md for the diagnosis and the two viable
fp16-stabilization sketches kept as future work.

The int8 PTQ on text_predictor was tried and dropped — Apple Silicon
ANE has no exposed int8 GEMM, so the only payoff was ~3 MB of weight
bandwidth per bucket (~15 MB total). Per-channel scales were also
fragile across the 5 buckets, requiring per-bucket weight_threshold
tuning that did not survive the validation matrix.

What ships now: fp16 text_predictor (5 buckets), fp16 diffusion_step
(B=512), fp16 f0n_energy, fp32 decoder (5 buckets). On-disk total
~1.3 GB. Warm RTFx and log-mel cosine numbers unchanged.

- coreml/PRECISION.md: rewritten around the fp16/fp32 split; int8
  recipe demoted to "tried and dropped" reference.
- README.md: ship table + script tree updated; quantize step removed
  from build-and-ship invocation.
- .gitignore: hf-upload/ staging dir excluded.
devin-ai-integration[bot]

This comment was marked as resolved.

PRECISION.md documents int8 PTQ on text_predictor was dropped before ship
and isn't part of the build pipeline, but `99c_e2e_optimized.py` still
referenced `_int8.mlpackage`. Following the README workflow (00–04 export
scripts then `99c`) crashed with FileNotFoundError. Load the fp16
.mlpackage that the export pipeline actually produces. Also retitle the
docstring + default output filename + log-mel diagnostic to drop the
stale "int8" labels.
Alex-Wengg merged commit 2fdee6f into main on Apr 29, 2026
Alex-Wengg deleted the tts/styletts2-coreml branch on April 29, 2026, 13:16