feat(tts/styletts2): CoreML conversion (4-stage, int8 TP, 4.32× RTFx) #46
Merged
Alex-Wengg merged 7 commits into main on Apr 29, 2026
Conversation
Port yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint to CoreML.

Stages and placement:
- text_predictor (5 token buckets): selective int8, ANE
- diffusion_step (B=512): fp16, CPU+GPU
- f0n_energy: fp16, ANE
- decoder (5 mel buckets): fp16, CPU+GPU

Optimizations:
- Selective int8 PTQ on text_predictor via linear_quantize_weights(weight_threshold=200_000) — 89 MB saved, log-mel cosine 0.9998 vs fp32.
- Pruned 4 unused diffusion buckets (kept B=512 only) — 192 MB saved.
- Total: 1062 MB → 871 MB (-281 MB / -26.5%).
- Per-stage compute_units sweep: RTFx 1.61× → 3.80× → 4.32× warm.

Validation:
- Per-stage parity: cosine 0.9999+
- E2E log-mel parity vs PyTorch fp32: 0.9687
- Voice-clone fidelity (ECAPA-TDNN cosine to reference): at the architectural ceiling (PyTorch fp32 itself only achieves 0.29). Quantization is innocent (cos OPT vs INT8 = 0.9987); the "robotic" complaint is the model.

Documentation:
- coreml/PRECISION.md — mixed-precision recipe + per-stage rationale
- coreml/TRIALS.md — chronological log of 25 trials across 5 phases
- README.md updated with final numbers and run commands
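The selective-threshold idea behind `weight_threshold` can be illustrated in plain numpy. This is a hedged sketch, not the coremltools implementation: tensor names and the per-channel scheme are illustrative, but it shows why only large tensors are worth quantizing and why cosine to fp32 stays near 1.0.

```python
import numpy as np

def quantize_int8_per_channel(w):
    # Symmetric per-output-channel int8: scale = max|w| / 127 per row
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def selective_quantize(weights, weight_threshold=200_000):
    # Mimic the weight_threshold gate: only tensors with at least
    # weight_threshold elements are quantized; small ones stay fp32
    out = {}
    for name, w in weights.items():
        if w.size >= weight_threshold:
            q, s = quantize_int8_per_channel(w)
            out[name] = q.astype(np.float32) * s  # dequantized view
        else:
            out[name] = w
    return out

rng = np.random.default_rng(0)
weights = {
    "big_linear": rng.standard_normal((512, 512)).astype(np.float32),  # 262144 >= threshold
    "small_proj": rng.standard_normal((8, 8)).astype(np.float32),      # 64 < threshold
}
deq = selective_quantize(weights)

def cos(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

assert np.array_equal(deq["small_proj"], weights["small_proj"])  # left untouched
assert cos(deq["big_linear"], weights["big_linear"]) > 0.999     # int8 barely moves cosine
```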
Replaces the prior "stochastic SineGen baked at trace time" hypothesis
(disproved by the deterministic-shim attempt) with the actual root cause:
three coremltools translation bugs in SineGen._f02sine —
1. (x % 1) lowers to all-zeros via aten::remainder
2. F.interpolate(scale_factor=1/300, mode=linear) downsample → NaN
3. F.interpolate(scale_factor=300, mode=linear) upsample → NaN
Three-part fix verified end-to-end on buckets 256 and 1024 (clean audio
in the real pipeline) with rms parity 0.998–1.000 across all five buckets:
1. (x % 1) → x - torch.floor(x)
2. downsample → stride slice [..., ::300]
3. upsample → manual linear lerp from CoreML primitives,
with fracs index constant-folded at trace time
(Python-int closure, not SymInt-driven arange)
The constant-fold step is critical — v1 of this fix used arange against
a SymInt and silently re-introduced the same broken aten::remainder op
in the lerp index path, producing identical robotic output despite the
modulo on f0/sr being correct.
PHASE6 doc now reflects the verified fix, all five exported mlpackages,
and the remaining work (ANE-eligible re-export, promotion into the
canonical 04_export_decoder.py).
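The three rewrites can be sketched in plain PyTorch. This is a minimal illustration (function names are hypothetical, not the `_styletts2_lib` code) assuming hop=300 from the commit; the key point is that the index and fraction tensors are built from Python ints inside a closure so tracing constant-folds them.

```python
import torch

HOP = 300  # upsample factor from the decoder (per the commit message)

def frac(x):
    # Fix 1: (x % 1) lowers to all-zeros via aten::remainder in coremltools;
    # x - floor(x) traces to ops the converter translates correctly.
    return x - torch.floor(x)

def downsample(x):
    # Fix 2: F.interpolate(scale_factor=1/HOP, mode="linear") produced NaN;
    # a stride slice picking every HOP-th sample replaces it.
    return x[..., ::HOP]

def make_upsample(t_in, hop=HOP):
    # Fix 3: manual linear lerp from CoreML-friendly primitives. idx/fracs
    # are computed from Python ints here, so they become trace-time
    # constants instead of a SymInt-driven arange (which re-introduced
    # the broken aten::remainder in v1 of the fix).
    t_out = t_in * hop
    pos = torch.arange(t_out, dtype=torch.float32) / hop  # constant at trace time
    idx = pos.long().clamp(max=t_in - 1)
    nxt = (idx + 1).clamp(max=t_in - 1)
    fracs = pos - idx.float()
    def upsample(x):
        # x: (..., t_in) -> (..., t_in * hop), linear interpolation
        return x[..., idx] * (1 - fracs) + x[..., nxt] * fracs
    return upsample

# Sanity: the lerp hits the original samples exactly at integer positions.
x = torch.linspace(0.0, 1.0, steps=8)
up = make_upsample(t_in=8)
y = up(x)
assert y.shape[-1] == 8 * HOP
assert torch.allclose(y[..., ::HOP], x)
```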
`SineGen._f02sine` triggers three coremltools op-translation bugs that
turn the harmonic source into garbage and produce robotic CoreML audio.
Replace the v1 align_corners shim in `_build_modules` (which only
addressed the upsample translator and still produced robotic audio in
the full pipeline) with the constant-folded v2 fix applied per-bucket.
- Add top-level `install_sinegen_v2_constfold_fix(t_mel)` in
`_styletts2_lib.py`. Rewrites `_f02sine` so the modulo becomes
`x - floor(x)`, the downsample becomes a stride slice, and the
upsample becomes a manual linear lerp built from `repeat_interleave`
and a constant-folded `fracs` index baked at trace time via a Python-
int closure (T_audio = T_mel * 2 * 300). Also installs a
deterministic `forward` that drops the trace-baked random noise.
- `04_export_decoder.py` calls the helper before each bucket's trace,
so all five mlpackages {256, 512, 1024, 2048, 4096} go through the
same canonical path.
- Verified clean fp32 audio end-to-end (full pipeline, all five
buckets). PT vs CoreML rms parity 0.998-1.000.
fp16 weight precision investigated and **not** shipped: produces
robotic audio because `phase_scaled = cumsum × 2π × 300` reaches ~4000
mid-frame, where fp16 precision (~4) is much larger than the per-sample
phase increment (~0.05 rad). Two viable fixes are sketched in
PHASE6_FP16_DECODER.md (mixed precision via `op_selector`, or v3
phase-mod-2π wrapping in SineGen) should size become a constraint.
fp32 ships clean at the current size budget.
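The fp16 failure mode is easy to demonstrate with numpy ULP arithmetic (magnitudes taken from the commit message; the exact increment is illustrative):

```python
import numpy as np

# Mid-frame, the accumulated phase (cumsum × 2π × 300) reaches ~4000.
# At that magnitude fp16 has a ULP of ~2-4, far larger than the
# per-sample phase increment of ~0.05 rad, so accumulation stalls.
assert np.spacing(np.float16(4096.0)) == 4.0  # fp16 ULP at 2^12

phase = np.float16(4000.0)
inc = np.float16(0.05)
assert phase + inc == phase  # the increment is rounded away in fp16

# fp32 resolves the same step without trouble at this magnitude.
p32 = np.float32(4000.0)
assert p32 + np.float32(0.05) != p32
```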
Also documents the ANE-compile hang at convert time when
`compute_units=.ALL` is used (synchronous XPC to anecompilerservice
inside `MLModel.__init__`); workaround is `compute_units=CPU_AND_GPU`
+ `skip_model_load=True`, which leaves the saved mlpackage runtime-
selectable and bypasses the daemon at export time.
- `_styletts2_lib.py` SineGen v1 shim still using broken ops: superseded
by the prior commit (v2 constfold patch promoted into the canonical
per-bucket path). No further action needed.
- `99c_e2e_optimized.py`: replace hardcoded `/Users/kikow/...` path with
`Path(__file__).resolve().parent`. Replace hardcoded `REF_WAV =
/tmp/.../coreml.wav` (output of `99b_e2e_coreml.py`, which isn't in
the README workflow) with a required `--reference-wav` CLI arg, and
make the spectral-comparison vs `coreml_optimal.wav` an opt-in
`--baseline-wav` arg that no-ops when the file is missing instead of
crashing after the pipeline has already finished. Switch
`diffusion_step` and `decoder` MLModel loads from `CPU_ONLY` to
`CPU_AND_GPU` to match the documented per-stage placement (TRIALS.md
Trial 13, README precision table).
- `optimize/quantize_text_predictor_int8.py`,
`optimize/measure_diffusion_buckets.py`: replace hardcoded `PKG`
absolute path with `Path(__file__).resolve().parents[2] / "coreml"`.
- `99_parity_check.py` Stage B: stash the raw ADPM2 sampler output
(`s_pred_raw`) in the PyTorch reference dict before alpha/beta
blending, and compare against that in Stage B (instead of the
post-blend `s` / `ref` concat which conflated sampler parity with
the blending arithmetic). Also runs `report("s_pred", ...)` so the
cosine / abs-diff stats land in the same format as the other stages.
- README: 99c invocation now needs `--reference-wav <path/to/ref.wav>`.
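The path-portability replacement pattern, shown with a hypothetical script location standing in for `__file__` (so the snippet runs anywhere):

```python
from pathlib import Path

# Hypothetical stand-in for __file__ in
# scripts/optimize/quantize_text_predictor_int8.py
script = Path("/repo/scripts/optimize/quantize_text_predictor_int8.py")

# parents[0] = .../optimize, parents[1] = .../scripts, parents[2] = repo root;
# PKG replaces the hardcoded absolute path to the coreml output directory.
PKG = script.parents[2] / "coreml"
assert PKG == Path("/repo/coreml")
```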
coremltools mlprogram defaults to fp16; without an explicit compute_precision=FLOAT32, the canonical 04_export_decoder.py produced fp16 decoders whose SineGen harmonic source saturates phase precision mid-frame (cumsum × 2π × 300 reaches ~4000; fp16 precision at that magnitude is ~4 vs a per-sample increment of ~0.05 rad). Result: scrambled sine output, audibly robotic synthesis.

Pin compute_precision=ct.precision.FLOAT32 in the convert call and propagate the precision/size change through README.md and PRECISION.md (decoder row fp16 → fp32; total on-disk 871 MB → ~1.4 GB; bucket strategy and build-and-ship summary updated).

Cross-references PHASE6_FP16_DECODER.md for the diagnosis and the two viable fp16-stabilization sketches kept as future work.
The int8 PTQ on text_predictor was tried and dropped — Apple Silicon ANE has no exposed int8 GEMM, so the only payoff was ~3 MB of weight bandwidth per bucket (~15 MB total). Per-channel scales were also fragile across the 5 buckets, requiring per-bucket weight_threshold tuning that did not survive the validation matrix.

What ships now: fp16 text_predictor (5 buckets), fp16 diffusion_step (B=512), fp16 f0n_energy, fp32 decoder (5 buckets). On-disk total ~1.3 GB. Warm RTFx and log-mel cosine numbers unchanged.

- coreml/PRECISION.md: rewritten around the fp16/fp32 split; int8 recipe demoted to "tried and dropped" reference.
- README.md: ship table + script tree updated; quantize step removed from build-and-ship invocation.
- .gitignore: hf-upload/ staging dir excluded.
PRECISION.md documents that the int8 PTQ on text_predictor was dropped before ship and isn't part of the build pipeline, but `99c_e2e_optimized.py` still referenced `_int8.mlpackage`. Following the README workflow (00–04 export scripts, then `99c`) crashed with FileNotFoundError. Load the fp16 .mlpackage that the export pipeline actually produces. Also retitle the docstring, default output filename, and log-mel diagnostic to drop the stale "int8" labels.
Summary
Port yl4579/StyleTTS2 (LibriTTS multi-speaker checkpoint) to CoreML for on-device inference on Apple Silicon. Four-stage pipeline with mixed precision and per-stage compute-unit placement.
Headline numbers
4.32× warm RTFx; 871 MB on disk before the fp32-decoder size change (see coreml/PHASE6_FP16_DECODER.md).

Stage layout
The ADPM2 sampler loop (5 steps) and the hard-alignment matrix live in Swift; only per-step inference is in CoreML.
Optimizations applied
- Selective int8 PTQ on text_predictor via linear_quantize_weights(weight_threshold=200_000) — same recipe family as pocket_tts/coreml/PRECISION.md. 89 MB saved across 5 buckets, log-mel cosine 0.9998 vs fp32.
- Diffusion bucket prune: bert_dur fits in B=512; the smaller buckets were dead weight (192 MB). The cost ladder is non-linear (B=32 66 ms/step → B=512 152 ms/step), so the worst case adds ~430 ms per utterance to gain 192 MB.

Decoder precision (fp32, not fp16)
Initial export was fp16 (coremltools mlprogram default) and produced audibly robotic synthesis. Root cause: SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes ~4000 mid-frame. fp16 precision at that magnitude (~4) is much larger than the per-sample increment (~0.05 rad), scrambling the sin output.

Two viable fp16 stabilizations were investigated and documented (mixed precision via op_selector; v3 phase-mod-2π wrapping — numerically validated, rms 0.0547 matches fp32 0.0551), but both add complexity. We ship fp32 decoders. See coreml/PHASE6_FP16_DECODER.md.

SineGen op-translation fix
Three coremltools translation bugs in _f02sine (aten::remainder, two F.interpolate paths) make a vanilla trace produce silent or NaN waveforms. The v2 patch in _styletts2_lib.install_sinegen_v2_constfold_fix(t_mel) constant-folds the fracs index and inlines a manual linear lerp; it is required before tracing each bucket. See 04_export_decoder.py and coreml/PHASE6_FP16_DECODER.md.

Voice-clone forensics (Phase 5)
User flagged "robotic" audio. Investigation (separate from the SineGen issue):
ECAPA-TDNN speaker embeddings (speechbrain/spkrec-ecapa-voxceleb): cos(OPT, INT8) = 0.9987 → quantization is innocent. cos(INT8, INT8D512) = 0.7881 → the bucket prune costs cosine.

Conclusion: StyleTTS2's voice-cloning fidelity is bounded by the model architecture, not by CoreML conversion. PyTorch fp32 is at the ceiling.
Files
- coreml/PRECISION.md — mixed-precision recipe + per-stage rationale (decoder fp32 documented)
- coreml/PHASE6_FP16_DECODER.md — SineGen op-translation fix + fp16 audio-regression diagnosis
- coreml/TRIALS.md — chronological log
- scripts/01-04_*.py — per-stage exporters (each with --trace-only)
- scripts/99{,b,c}_*.py — parity check, baseline e2e, optimized e2e (99c now takes --reference-wav)
- scripts/optimize/{quantize_text_predictor_int8,measure_diffusion_buckets}.py

Test plan
- 99_parity_check.py (raw s_pred comparison fixed per Devin)
- 99c_e2e_optimized.py

Devin review (PR #46)
Addressed in 873fbe1:
- 99c_e2e_optimized.py, quantize_text_predictor_int8.py, measure_diffusion_buckets.py — replaced hardcoded paths with Path(__file__).resolve().parent[s][N]
- 99c_e2e_optimized.py — CPU_ONLY → CPU_AND_GPU for diffusion + decoder (matches RTFx claim)
- 99c_e2e_optimized.py — argparse with --reference-wav (required) and optional --baseline-wav (no more crashes on a missing /tmp file)
- 99_parity_check.py — Stage B compares raw s_pred from PT against raw s_pred from CoreML (was inadvertently comparing post-blend PT against raw CM)