
feat(tts/styletts2): CoreML conversion (4-stage, int8 TP, 4.32× RTFx) #46

Merged

Alex-Wengg merged 7 commits into main from tts/styletts2-coreml on Apr 29, 2026

Conversation

Alex-Wengg (Member) commented Apr 28, 2026

Summary

Port yl4579/StyleTTS2 (LibriTTS multi-speaker checkpoint) to CoreML for on-device inference on Apple Silicon. Four-stage pipeline with mixed precision and per-stage compute-unit placement.

Headline numbers

  • RTFx: 4.32× warm (M-series Mac, 5-step ADPM2 sampler)
  • On-disk size: ~1.4 GB (decoder is fp32; fp16 produces robotic audio — see coreml/PHASE6_FP16_DECODER.md)
  • Log-mel cosine vs PyTorch fp32: 0.9687
  • Voice-clone fidelity: at the model's architectural ceiling — see TRIALS.md Phase 5

Stage layout

| Stage | Buckets | Precision | Compute unit |
| --- | --- | --- | --- |
| text_predictor | 5 token (32, 64, 128, 256, 512) | selective int8 | ANE |
| diffusion_step | 1 (B=512) | fp16 | CPU+GPU |
| f0n_energy | dynamic | fp16 | ANE |
| decoder | 5 mel (256, 512, 1024, 2048, 4096) | fp32 | CPU+GPU |

The ADPM2 sampler loop (5 steps) and the hard-alignment matrix live in Swift; only per-step inference is in CoreML.
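
For concreteness, loading the four stages with this placement looks roughly like the following (a minimal sketch; the package filenames are illustrative, and "ANE" maps to CoreML's CPU_AND_NE compute unit):

```python
import coremltools as ct

# Hedged sketch of the per-stage compute-unit placement from the table above.
# Filenames are illustrative, not the repo's exact package names.
text_predictor = ct.models.MLModel("text_predictor_512.mlpackage",
                                   compute_units=ct.ComputeUnit.CPU_AND_NE)
f0n_energy = ct.models.MLModel("f0n_energy.mlpackage",
                               compute_units=ct.ComputeUnit.CPU_AND_NE)
diffusion_step = ct.models.MLModel("diffusion_step_b512.mlpackage",
                                   compute_units=ct.ComputeUnit.CPU_AND_GPU)
decoder = ct.models.MLModel("decoder_1024.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_GPU)
```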

Optimizations applied

  1. Selective int8 PTQ on text_predictor. linear_quantize_weights(weight_threshold=200_000) — same recipe family as pocket_tts/coreml/PRECISION.md. 89 MB saved across 5 buckets, log-mel cosine 0.9998 vs fp32; a minimal sketch follows this list.
  2. Diffusion-bucket prune. Empirically every observed bert_dur fits in B=512, so the smaller buckets were dead weight (192 MB on disk). The cost ladder is non-linear (B=32: 66 ms/step → B=512: 152 ms/step), so in the worst case a short utterance pays ~430 ms extra (5 steps × ~86 ms) to save those 192 MB.
  3. Per-stage compute_units sweep. ANE for text_predictor + f0n_energy; CPU+GPU for diffusion + decoder (ANE either rejects subgraphs or runs slower for these). Result: RTFx 1.61× → 3.80× → 4.32× warm (final number includes explicit warmup of every package).
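
The item-1 recipe, as a minimal sketch assuming coremltools ≥ 7's optimize.coreml API (package paths are illustrative):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Hedged sketch of the selective int8 PTQ step; paths are illustrative.
mlmodel = ct.models.MLModel("text_predictor_512.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=200_000,  # only weights with >= 200k elements get int8
    )
)
cto.linear_quantize_weights(mlmodel, config=config).save(
    "text_predictor_512_int8.mlpackage"
)
```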

Decoder precision (fp32, not fp16)

Initial export was fp16 (coremltools mlprogram default) and produced audibly robotic synthesis. Root cause: SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes ~4000 mid-frame. fp16 precision at that magnitude (~4) is much larger than the per-sample increment (~0.05 rad), scrambling the sin output.
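
A quick numpy illustration of the magnitude problem (nothing model-specific):

```python
import numpy as np

# At |phase| ≈ 4000 the fp16 grid spacing is 2 (4 above 4096), so a ~0.05 rad
# per-sample increment is rounded away entirely before sin() ever sees it.
phase = np.float16(4000.0)
step = np.float16(0.05)
print(np.spacing(phase))      # 2.0 — smallest representable increment at this magnitude
print(phase + step == phase)  # True — the phase simply stops advancing
```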

Two viable fp16 stabilizations were investigated and documented (mixed-precision via op_selector; v3 phase-mod-2π wrapping — numerically validated, rms 0.0547 matches fp32 0.0551), but both add complexity. We ship fp32 decoders. See coreml/PHASE6_FP16_DECODER.md.

SineGen op-translation fix

Three coremltools translation bugs in _f02sine (aten::remainder, two F.interpolate paths) make a vanilla trace produce silent or NaN waveforms. The v2 patch in _styletts2_lib.install_sinegen_v2_constfold_fix(t_mel) constant-folds the fracs index and inlines a manual linear lerp; required before tracing each bucket. See 04_export_decoder.py and coreml/PHASE6_FP16_DECODER.md.

Voice-clone forensics (Phase 5)

User flagged "robotic" audio. Investigation (separate from the SineGen issue):

  • GE2E speaker similarity (resemblyzer) noise-floors at ~0.88 on synthetic TTS — useless signal.
  • ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb): cos(OPT, INT8) = 0.9987 → quantization is innocent. cos(INT8, INT8D512) = 0.7881 → bucket prune costs cosine.
  • PT vs CoreML head-to-head with real reference: cos(REF, PT) = 0.2933 (PyTorch fp32 itself). cos(REF, CM) = 0.1795. The ECAPA same-speaker threshold is ~0.3.

Conclusion: StyleTTS2's voice-cloning fidelity is bounded by the model architecture, not by CoreML conversion. PyTorch fp32 is at the ceiling.
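
The ECAPA comparisons above can be reproduced along these lines (a sketch assuming speechbrain's pretrained EncoderClassifier interface and 16 kHz mono wavs; filenames are illustrative, not the repo's exact script):

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Hedged sketch of the ECAPA-TDNN cosine check used for the forensics above.
ecapa = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                       # (channels, samples), assumed 16 kHz
    emb = ecapa.encode_batch(wav.mean(0, keepdim=True))   # (1, 1, 192)
    return emb.squeeze()

ref, pt, cm = embed("reference.wav"), embed("pytorch_fp32.wav"), embed("coreml.wav")
print("cos(REF, PT):", torch.nn.functional.cosine_similarity(ref, pt, dim=0).item())
print("cos(REF, CM):", torch.nn.functional.cosine_similarity(ref, cm, dim=0).item())
```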

Files

  • coreml/PRECISION.md — mixed-precision recipe + per-stage rationale (decoder fp32 documented)
  • coreml/PHASE6_FP16_DECODER.md — SineGen op-translation fix + fp16 audio-regression diagnosis
  • coreml/TRIALS.md — chronological log
  • scripts/01-04_*.py — per-stage exporters (each with --trace-only)
  • scripts/99{,b,c}_*.py — parity check, baseline e2e, optimized e2e (99c now takes --reference-wav)
  • scripts/optimize/{quantize_text_predictor_int8,measure_diffusion_buckets}.py

Test plan

  • Per-stage parity (cosine 0.9999+) via 99_parity_check.py (raw s_pred comparison fixed per Devin)
  • E2E log-mel parity vs PyTorch fp32: 0.9687
  • ASR round-trip: whisper-small transcribes all 4 variants correctly (sketched after this list)
  • Speaker-ID forensics (ECAPA-TDNN) confirms quantization preserves voice
  • Warm RTFx 4.32× measured in 99c_e2e_optimized.py
  • Swift-side ADPM2 sampler integration (follow-up PR in FluidAudio)
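
The ASR round-trip can be checked along these lines (a sketch assuming the openai-whisper package; the four wav filenames are illustrative):

```python
import whisper

# Hedged sketch of the ASR round-trip; model size per the test plan above.
asr = whisper.load_model("small")
for wav in ["pt_fp32.wav", "coreml_baseline.wav", "coreml_int8.wav", "coreml_optimized.wav"]:
    print(wav, "→", asr.transcribe(wav)["text"])
```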

Devin review (PR #46)

Addressed in 873fbe1:

  • 🔴 v1 SineGen shim using broken ops (already removed in b058784 by promoting v2 patch)
  • 🔴 Hardcoded user paths in 99c_e2e_optimized.py, quantize_text_predictor_int8.py, measure_diffusion_buckets.py — replaced with Path(__file__).resolve().parent[s][N] (sketch below)
  • 🟡 99c_e2e_optimized.py CPU_ONLY → CPU_AND_GPU for diffusion + decoder (matches RTFx claim)
  • 🟡 99c_e2e_optimized.py argparse with --reference-wav (required) and optional --baseline-wav (no more crashes on missing /tmp file)
  • 🟡 99_parity_check.py Stage B compares raw s_pred from PT against raw s_pred from CoreML (was inadvertently comparing post-blend PT against raw CM)
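
The path and CLI changes above follow roughly this pattern (a sketch, not the repo's exact code; 99c resolves relative to its own location, the optimize scripts use parents[2] / "coreml"):

```python
import argparse
from pathlib import Path

# Hedged sketch: repo-relative paths instead of hardcoded /Users/... ones, a
# required --reference-wav, and an optional --baseline-wav that is skipped
# (not a crash) when the file is absent.
SCRIPT_DIR = Path(__file__).resolve().parent                  # 99c_e2e_optimized.py
COREML_DIR = Path(__file__).resolve().parents[2] / "coreml"   # optimize/*.py

parser = argparse.ArgumentParser()
parser.add_argument("--reference-wav", required=True, help="real reference speaker wav")
parser.add_argument("--baseline-wav", default=None,
                    help="optional wav for the spectral comparison")
args = parser.parse_args()

if args.baseline_wav and not Path(args.baseline_wav).exists():
    print(f"baseline {args.baseline_wav} not found; skipping spectral comparison")
    args.baseline_wav = None
```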


Port yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint to CoreML.

Stages and placement:
- text_predictor (5 token buckets): selective int8, ANE
- diffusion_step (B=512): fp16, CPU+GPU
- f0n_energy: fp16, ANE
- decoder (5 mel buckets): fp16, CPU+GPU

Optimizations:
- Selective int8 PTQ on text_predictor via
  linear_quantize_weights(weight_threshold=200_000) — 89 MB saved,
  log-mel cosine 0.9998 vs fp32.
- Pruned 4 unused diffusion buckets (kept B=512 only) — 192 MB saved.
- Total: 1062 MB → 871 MB (-281 MB / -26.5%).
- Per-stage compute_units sweep: RTFx 1.61× → 3.80× → 4.32× warm.

Validation:
- Per-stage parity: cosine 0.9999+
- E2E log-mel parity vs PyTorch fp32: 0.9687
- Voice-clone fidelity (ECAPA-TDNN cos to ref): at architectural ceiling
  (PyTorch fp32 itself only achieves 0.29). Quantization is innocent
  (cos OPT vs INT8 = 0.9987); the "robotic" complaint is the model.

Documentation:
- coreml/PRECISION.md — mixed-precision recipe + per-stage rationale
- coreml/TRIALS.md — chronological log of 25 trials across 5 phases
- README.md updated with final numbers and run commands
devin-ai-integration[bot]

This comment was marked as resolved.

Replaces the prior "stochastic SineGen baked at trace time" hypothesis
(disproved by the deterministic-shim attempt) with the actual root cause:
three coremltools translation bugs in SineGen._f02sine —

  1. (x % 1) lowers to all-zeros via aten::remainder
  2. F.interpolate(scale_factor=1/300, mode=linear) downsample → NaN
  3. F.interpolate(scale_factor=300, mode=linear) upsample → NaN

Three-part fix verified end-to-end on buckets 256 and 1024 (clean audio
in real pipeline) with rms parity 0.998–1.000 across all five buckets:

  1. (x % 1)             → x - torch.floor(x)
  2. downsample          → stride slice [..., ::300]
  3. upsample            → manual linear lerp from CoreML primitives,
                            with fracs index constant-folded at trace time
                            (Python-int closure, not SymInt-driven arange)

The constant-fold step is critical — v1 of this fix used arange against
a SymInt and silently re-introduced the same broken aten::remainder op
in the lerp index path, producing identical robotic output despite the
modulo on f0/sr being correct.

PHASE6 doc now reflects the verified fix, all five exported mlpackages,
and the remaining work (ANE-eligible re-export, promote into canonical
04_export_decoder.py).
devin-ai-integration[bot]

This comment was marked as resolved.

`SineGen._f02sine` triggers three coremltools op-translation bugs that
turn the harmonic source into garbage and produce robotic CoreML audio.
Replace the v1 align_corners shim in `_build_modules` (which only
addressed the upsample translator and still produced robotic audio in
the full pipeline) with the constant-folded v2 fix applied per-bucket.

- Add top-level `install_sinegen_v2_constfold_fix(t_mel)` in
  `_styletts2_lib.py`. Rewrites `_f02sine` so the modulo becomes
  `x - floor(x)`, the downsample becomes a stride slice, and the
  upsample becomes a manual linear lerp built from `repeat_interleave`
  and a constant-folded `fracs` index baked at trace time via a Python-
  int closure (T_audio = T_mel * 2 * 300). Also installs a
  deterministic `forward` that drops the trace-baked random noise.
- `04_export_decoder.py` calls the helper before each bucket's trace,
  so all five mlpackages {256, 512, 1024, 2048, 4096} go through the
  same canonical path.
- Verified clean fp32 audio end-to-end (full pipeline, all five
  buckets). PT vs CoreML rms parity 0.998-1.000.
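
A hedged sketch of what the three substitutions look like (variable names, boundary handling, and the exact T_audio closure are illustrative; the real patch lives in `_styletts2_lib.install_sinegen_v2_constfold_fix`):

```python
import torch

# Hedged sketch of the v2 substitutions; `up` (= hop 300) and T_mel are plain
# Python ints at trace time, so the fracs index constant-folds into the graph
# instead of going through a SymInt-driven arange + aten::remainder.
def make_patched_ops(T_mel: int, up: int = 300):
    T_audio = T_mel * up  # the real closure bakes T_audio = T_mel * 2 * 300

    def frac(x):  # (x % 1) without aten::remainder
        return x - torch.floor(x)

    def downsample(x):  # replaces F.interpolate(scale_factor=1/up, mode="linear")
        return x[..., ::up]

    # Constant per-sample fractions, built from Python ints, baked at trace time.
    fracs = torch.tensor([(i % up) / up for i in range(T_audio)])

    def upsample_linear(x):  # replaces F.interpolate(scale_factor=up, mode="linear")
        left = x.repeat_interleave(up, dim=-1)
        right = torch.roll(x, shifts=-1, dims=-1).repeat_interleave(up, dim=-1)
        return left + fracs * (right - left)  # last-frame boundary handling elided

    return frac, downsample, upsample_linear
```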

fp16 weight precision investigated and **not** shipped: produces
robotic audio because `phase_scaled = cumsum × 2π × 300` reaches ~4000
mid-frame, where fp16 precision (~4) is much larger than the per-sample
phase increment (~0.05 rad). Two viable fixes are sketched in
PHASE6_FP16_DECODER.md (mixed precision via `op_selector`, or v3
phase-mod-2π wrapping in SineGen), kept for when size becomes a constraint.
fp32 ships clean at the current size budget.

Also documents the ANE-compile hang at convert time when
`compute_units=.ALL` is used (synchronous XPC to anecompilerservice
inside `MLModel.__init__`); workaround is `compute_units=CPU_AND_GPU`
+ `skip_model_load=True`, which leaves the saved mlpackage runtime-
selectable and bypasses the daemon at export time.
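
A sketch of that export call, assuming coremltools' standard convert() signature (the traced module, input name/shape, and filename are illustrative); it folds in the fp32 pin from the paragraph above plus the CPU_AND_GPU + skip_model_load workaround:

```python
import coremltools as ct

# Hedged sketch: decoder export with fp32 weights pinned (mlprogram defaults
# to fp16), compute_units=CPU_AND_GPU to stay off the ANE compile path, and
# skip_model_load=True so MLModel.__init__ never makes the synchronous XPC
# call to anecompilerservice at convert time.
decoder_pkg = ct.convert(
    traced_decoder,  # torch.jit.trace of one decoder bucket (assumed to exist)
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="asr_features", shape=(1, 512, 1024))],  # illustrative
    compute_precision=ct.precision.FLOAT32,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
    skip_model_load=True,
)
decoder_pkg.save("decoder_1024.mlpackage")  # compute unit stays runtime-selectable
```
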
- `_styletts2_lib.py` SineGen v1 shim still using broken ops: superseded
  by the prior commit (v2 constfold patch promoted into the canonical
  per-bucket path). No further action needed.

- `99c_e2e_optimized.py`: replace hardcoded `/Users/kikow/...` path with
  `Path(__file__).resolve().parent`. Replace hardcoded `REF_WAV =
  /tmp/.../coreml.wav` (output of `99b_e2e_coreml.py`, which isn't in
  the README workflow) with a required `--reference-wav` CLI arg, and
  make the spectral-comparison vs `coreml_optimal.wav` an opt-in
  `--baseline-wav` arg that no-ops when the file is missing instead of
  crashing after the pipeline has already finished. Switch
  `diffusion_step` and `decoder` MLModel loads from `CPU_ONLY` to
  `CPU_AND_GPU` to match the documented per-stage placement (TRIALS.md
  Trial 13, README precision table).

- `optimize/quantize_text_predictor_int8.py`,
  `optimize/measure_diffusion_buckets.py`: replace hardcoded `PKG`
  absolute path with `Path(__file__).resolve().parents[2] / "coreml"`.

- `99_parity_check.py` Stage B: stash the raw ADPM2 sampler output
  (`s_pred_raw`) in the PyTorch reference dict before alpha/beta
  blending, and compare against that in Stage B (instead of the
  post-blend `s` / `ref` concat which conflated sampler parity with
  the blending arithmetic). Also runs `report("s_pred", ...)` so the
  cosine / abs-diff stats land in the same format as the other stages.

- README: 99c invocation now needs `--reference-wav <path/to/ref.wav>`.

coremltools mlprogram defaults to fp16; without an explicit
compute_precision=FLOAT32 the canonical 04_export_decoder.py produced
fp16 decoders whose SineGen harmonic source saturates phase precision
mid-frame (cumsum × 2π × 300 reaches ~4000; fp16 precision at that
magnitude is ~4 vs per-sample increment ~0.05 rad). Result: scrambled
sine output, audibly robotic synthesis.

Pin compute_precision=ct.precision.FLOAT32 in the convert call and
propagate the precision/size change through README.md and PRECISION.md
(decoder row fp16 → fp32; total on-disk 871 MB → ~1.4 GB; bucket
strategy and build-and-ship summary updated). Cross-references
PHASE6_FP16_DECODER.md for the diagnosis and the two viable
fp16-stabilization sketches kept as future work.

The int8 PTQ on text_predictor was tried and dropped — Apple Silicon
ANE has no exposed int8 GEMM, so the only payoff was ~3 MB of weight
bandwidth per bucket (~15 MB total). Per-channel scales were also
fragile across the 5 buckets, requiring per-bucket weight_threshold
tuning that did not survive the validation matrix.

What ships now: fp16 text_predictor (5 buckets), fp16 diffusion_step
(B=512), fp16 f0n_energy, fp32 decoder (5 buckets). On-disk total
~1.3 GB. Warm RTFx and log-mel cosine numbers unchanged.

- coreml/PRECISION.md: rewritten around the fp16/fp32 split; int8
  recipe demoted to "tried and dropped" reference.
- README.md: ship table + script tree updated; quantize step removed
  from build-and-ship invocation.
- .gitignore: hf-upload/ staging dir excluded.
devin-ai-integration[bot]

This comment was marked as resolved.

PRECISION.md documents int8 PTQ on text_predictor was dropped before ship
and isn't part of the build pipeline, but `99c_e2e_optimized.py` still
referenced `_int8.mlpackage`. Following the README workflow (00–04 export
scripts then `99c`) crashed with FileNotFoundError. Load the fp16
.mlpackage that the export pipeline actually produces. Also retitle the
docstring + default output filename + log-mel diagnostic to drop the
stale "int8" labels.
Alex-Wengg merged commit 2fdee6f into main on Apr 29, 2026
Alex-Wengg deleted the tts/styletts2-coreml branch on April 29, 2026, 13:16