Skip to content

feat(tts/magpie): parity fixture + HF upload staging + manifest#44

Open
Alex-Wengg wants to merge 5 commits intomainfrom
tts/magpie-parity-fixture
Open

feat(tts/magpie): parity fixture + HF upload staging + manifest#44
Alex-Wengg wants to merge 5 commits intomainfrom
tts/magpie-parity-fixture

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

@Alex-Wengg Alex-Wengg commented Apr 25, 2026

Summary

Adds three Swift-port-supporting tools under models/tts/magpie/coreml/:

  1. emit_parity_fixture.py — runs the full Magpie CoreML pipeline for a fixed (text, speaker, language, seed) and dumps every intermediate tensor as a single .npz so the Swift port can replay each stage and diff against this Python ground truth.
  2. prepare_hf_upload.py — stages hf-upload/ from compiled/build/ + constants/ for upload to the HF repo. Splits constants into constants/ (model tensors) + tokenizer/ (per-language lookup files), generates README.md + .gitattributes + _prep_report.json.
  3. build_manifest.py — generates manifest.json, a machine-readable index with sha256, file sizes, npy shapes/dtypes, and per-model IO specs. The Swift port's MagpieResourceDownloader consumes it.

All three re-import helpers from generate_coreml.py — they never fork the reference pipeline.

HF upload (live)

Already uploaded to FluidInference/magpie-tts-multilingual-357m-coreml (1.4 GB) — both .mlmodelc (compiled, ready-to-run) and .mlpackage (portable) for all 4 models, plus constants/, tokenizer/, and manifest.json.

What emit_parity_fixture.py captures

--mode full.npz (+ reference .wav):

Stage Keys
Config text, speakerIndex, languageCode, seed, useCfg, cfgScale, temperature, topk, sampleRate, minFrames
Tokenizer textTokens, textTokensPadded, textMask
Text encoder encoderOutput
Post-prefill KV prefillCache{0..11}, prefillPosition{0..11}
AR loop perStepDecoderHidden (N, 768), perStepCodes (N, 8), predictedCodes (8, N)
Audio audioPcm, audioSamples, genTimeSeconds

--mode tokenizer → small .json with {text, speakerIndex, languageCode, expectedTokenIds} for cheap Swift-side tokenizer diffing.

What manifest.json looks like

{
  "schema_version": "1.0",
  "repo_id": "FluidInference/magpie-tts-multilingual-357m-coreml",
  "model": {
    "name": "Magpie TTS Multilingual",
    "params_million": 357,
    "sample_rate": 22050,
    "max_nanocodec_seconds": 11.89,
    "supported_languages": ["english", "spanish", "german", "hindi", "mandarin", "french", "italian", "vietnamese"],
    "audio_eos_id": 2017,
    "forbidden_token_ids": [2016, 2018, 2019, 2020, 2021, 2022, 2023],
    "speaker_names": ["John", "Sofia", "Aria", "Jason", "Leo"],
    "streaming_nanocodec": { "supported": false, "note": "..." }
  },
  "models": {
    "decoder_step": {
      "compiled": { "path": "decoder_step.mlmodelc", "bytes": ..., "files": ... },
      "package":  { "path": "decoder_step.mlpackage", "bytes": ..., "files": ... },
      "io": { "inputs": [...], "outputs": [{ "name": "var_2201", "shape": [1,1,16192], ... }, ...] }
    },
    ...
  },
  "constants": { "json": [...], "npy": [...], "local_transformer": [...] },
  "languages": { "english": { "tokenizer_kind": "phoneme", "files": [...] }, ... }
}

Usage

# Build hf-upload/ from compiled artifacts
python prepare_hf_upload.py \
    --build-dir compiled/build \
    --constants-dir constants \
    --output-dir hf-upload --clean

# Add manifest.json
python build_manifest.py

# Emit parity fixture
python emit_parity_fixture.py "Hello world." \
    --speaker 0 --language en --seed 42 \
    --output fixture_en_s0.npz

Companion PR

Consumed by the Swift port in FluidInference/FluidAudio#541fluidaudiocli magpie parity / magpie tokenizer-parity / magpie text subcommands.

Test plan

  • python -m py_compile {emit_parity_fixture,prepare_hf_upload,build_manifest}.py — parses clean.
  • End-to-end: full mobius pipeline run produced 4 .mlmodelc + 4 .mlpackage + constants/ + tokenizer/ + manifest.json, uploaded successfully to HF (1.4 GB total).
  • Python inference smoke test: 11.05 s synthesis at 3.97x RTF using the generated .mlpackage set.
  • Inline IPA verified: "Hello | n ɛ m o ʊ |." produces həˈloʊ … nɛmoʊ in G2P output.
  • Swift-side fluidaudiocli magpie parity --fixture fixture_en_s0.npz hits MAE < 1e-3 on encoderOutput and SNR > 40 dB on audioPcm.

Notes

  • Requires build/ to contain compiled .mlpackage artifacts.
  • nemo extras required for tokenization.
  • CFG enabled by default (cfg_scale=2.5); pass --no-cfg to emit unconditional fixture.
  • coremltools 9.0 wheel quirk: PyPI's py3-none-any is a stub; uv may need --force-reinstall to pull cp311-none-macosx_11_0_arm64.whl.

Adds a standalone companion script next to `generate_coreml.py` that runs
the Magpie CoreML pipeline for a fixed (text, speaker, language, seed)
and dumps intermediate tensors so cross-implementation parity tests can
diff against this ground truth.

Usage:

    # Full pipeline — .npz with tokens, encoder output, prefill caches,
    # per-step decoder hidden + sampled codes, predicted (8,N), PCM audio.
    python emit_parity_fixture.py "Hello world." \
        --speaker 0 --language en --seed 42 \
        --output fixture_en_s0.npz

    # Tokenizer-only — small .json for quick Swift tokenizer diff
    # without loading CoreML.
    python emit_parity_fixture.py "Hello world." \
        --speaker 0 --language en --mode tokenizer \
        --output fixture_en_s0_tokens.json

The script re-imports from `generate_coreml.py` so it never drifts from
the reference pipeline. Consumed by the Swift port's
`fluidaudiocli magpie parity` and `magpie tokenizer-parity` subcommands
in FluidInference/FluidAudio#541.
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review


gen_time = time.time() - gen_start

predicted_codes_full = np.stack(per_step_codes, axis=1) # (8, N)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 EOS frame included in codec input causes fixture audio/codes to diverge from reference

The fixture's AR loop appends the EOS-containing codes to per_step_codes before breaking (line 216), then feeds the full per_step_codes (including the EOS frame) into predicted_codes_full (line 223) and through the NanoCodec decoder (line 229). In contrast, the reference generate_coreml.py:400-406 breaks before appending EOS codes to all_predictions, so the codec never sees the EOS frame.

This means predictedCodes in the fixture has one extra column containing EOS token IDs (e.g., 2017), the NanoCodec decodes those special tokens as regular codec codes producing garbage audio for that frame, and the resulting audioPcm and WAV file diverge from the reference. Since this tool exists to emit "ground truth" for cross-implementation parity testing, a Swift implementation validated against this fixture would produce different output than generate_coreml.py.

Recording the EOS step in per_step_codes for trace purposes is fine, but the codec-input codes should exclude the final EOS frame.

Prompt for agents
The issue is that predicted_codes_full at line 223 includes the EOS frame (appended at line 216) and is then fed to the NanoCodec decoder at lines 226-232. The reference generate_coreml.py excludes EOS codes from codec input entirely.

To fix this while preserving the full per-step trace (which is useful for the fixture):
1. After line 223, determine whether the last frame is an EOS frame. If the loop broke due to EOS (is_eos=True), the last entry in per_step_codes is the EOS frame.
2. Build a separate codec_codes variable that excludes the EOS frame: e.g. predicted_codes_full[:, :-1] if the loop ended on EOS, or predicted_codes_full otherwise.
3. Use that codec_codes for the NanoCodec decode step (lines 226-237) instead of predicted_codes_full.
4. Keep predicted_codes_full (with EOS) in the fixture under predictedCodes for trace completeness, but also store the codec-input codes if desired.

Alternatively, mimic the reference exactly: do not append EOS codes to per_step_codes (remove line 216), and record the EOS event separately (e.g. a boolean flag in the fixture). This keeps predicted_codes_full identical to the reference's predicted_codes.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Adds `prepare_hf_upload.py`, which assembles the layout expected by the
FluidAudio Swift port (`FluidInference/magpie-tts-multilingual-357m-coreml`)
from the mobius exporter outputs:

- Copies the 3 required CoreML models (and optional `decoder_prefill`)
  from `build/` to the repo root.
- Keeps model constants + speaker/audio embeddings + `local_transformer/`
  under `constants/`.
- Moves per-language tokenizer JSONs (english_phoneme_*, mandarin_*, etc.)
  into a dedicated `tokenizer/` subtree — Swift's `MagpieResourceDownloader`
  downloads this folder lazily on language selection.
- Writes a model card README.md and a `.gitattributes` that LFS-tracks
  `.mlmodelc` / `.npy` / `.bin` / `.safetensors` / `.onnx`.
- Emits a `_prep_report.json` listing what was copied / skipped / missing.

The script does NOT upload — it prints the exact `huggingface-cli upload`
command for the maintainer to run. Smoke-tested against a synthetic
fixture tree; MISS rows surface in the report and the script exits non-zero
when required inputs (local_transformer weights, core models) are absent.

Usage:

    python prepare_hf_upload.py                        # defaults
    python prepare_hf_upload.py --clean                # fresh staging dir
    python prepare_hf_upload.py --repo-id org/name     # override target
Generates a machine-readable index of every artifact in the upload
(models in both .mlmodelc + .mlpackage form, constants, per-language
tokenizer files), with shapes, sizes, and SHA-256 digests.

The Swift port's MagpieResourceDownloader consumes manifest.json to
know what to fetch and how to verify integrity.
@Alex-Wengg Alex-Wengg changed the title feat(tts/magpie): add parity fixture emitter for Swift port feat(tts/magpie): parity fixture + HF upload staging + manifest Apr 25, 2026
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

View 11 additional findings in Devin Review.

Open in Devin Review

Comment on lines +273 to +275
"french": {"tokenizer_kind": "byt5", "files": []},
"italian": {"tokenizer_kind": "byt5", "files": []},
"vietnamese": {"tokenizer_kind": "byt5", "files": []},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 French, Italian, Vietnamese incorrectly labeled as "byt5" with no tokenizer files in manifest

The LANGUAGE_FILES dict marks French, Italian, and Vietnamese as "tokenizer_kind": "byt5" with empty "files": [] lists. However, these languages are NOT byt5-based:

  • French uses french_chartokenizer (generate_coreml.py:470)
  • Italian uses italian_phoneme (generate_coreml.py:474)
  • Vietnamese uses vietnamese_phoneme (generate_coreml.py:475)

The exporter (export_tokenizers.py:56-62) produces token2id (and phoneme_dict for Italian/Vietnamese) files for these languages. The sibling script prepare_hf_upload.py:83-98 correctly lists the files. The manifest consumed by Swift's MagpieResourceDownloader will indicate these languages need no downloads, causing runtime failures when attempting to use them.

Suggested change
"french": {"tokenizer_kind": "byt5", "files": []},
"italian": {"tokenizer_kind": "byt5", "files": []},
"vietnamese": {"tokenizer_kind": "byt5", "files": []},
"french": {
"tokenizer_kind": "char",
"files": [
"tokenizer/french_chartokenizer_token2id.json",
],
},
"italian": {
"tokenizer_kind": "phoneme",
"files": [
"tokenizer/italian_phoneme_token2id.json",
"tokenizer/italian_phoneme_phoneme_dict.json",
],
},
"vietnamese": {
"tokenizer_kind": "phoneme",
"files": [
"tokenizer/vietnamese_phoneme_token2id.json",
"tokenizer/vietnamese_phoneme_phoneme_dict.json",
],
},
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +75 to +78
"english": [
"english_phoneme_token2id.json",
"english_phoneme_phoneme_dict.json",
],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 English heteronyms file missing from HF upload tokenizer list

The English entry in PER_LANGUAGE_TOKENIZER_FILES omits english_phoneme_heteronyms.json. The exporter (export_tokenizers.py:100-104) produces this file, build_manifest.py:236 references it as tokenizer/english_phoneme_heteronyms.json, and the German entry at line 94 correctly includes its own heteronyms file. Without this file in the upload, English heteronym pronunciation resolution will be unavailable in the Swift port.

Suggested change
"english": [
"english_phoneme_token2id.json",
"english_phoneme_phoneme_dict.json",
],
"english": [
"english_phoneme_token2id.json",
"english_phoneme_phoneme_dict.json",
"english_phoneme_heteronyms.json",
],
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +102 to +110
"mandarin": [
"mandarin_phoneme_token2id.json",
"mandarin_phoneme_pinyin_dict.json",
"mandarin_phoneme_tone_dict.json",
"mandarin_phoneme_ascii_letter_dict.json",
"mandarin_pypinyin_char_dict.json",
"mandarin_pypinyin_phrase_dict.json",
"mandarin_jieba_dict.json",
],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Mandarin phoneme_dict file missing from HF upload tokenizer list

The Mandarin entry in PER_LANGUAGE_TOKENIZER_FILES omits mandarin_phoneme_phoneme_dict.json. The exporter (export_tokenizers.py:65-78) produces this file via the generic IPA G2P phoneme_dict export path (separate from the Chinese-specific pinyin_dict), and build_manifest.py:264 references it. Since the file won't be in ALL_TOKENIZER_FILES, it also won't be skipped by the constants copier — it will end up in unknown_files and not be copied anywhere. The manifest builder will then fail when trying to hash a non-existent file.

Suggested change
"mandarin": [
"mandarin_phoneme_token2id.json",
"mandarin_phoneme_pinyin_dict.json",
"mandarin_phoneme_tone_dict.json",
"mandarin_phoneme_ascii_letter_dict.json",
"mandarin_pypinyin_char_dict.json",
"mandarin_pypinyin_phrase_dict.json",
"mandarin_jieba_dict.json",
],
"mandarin": [
"mandarin_phoneme_token2id.json",
"mandarin_phoneme_phoneme_dict.json",
"mandarin_phoneme_pinyin_dict.json",
"mandarin_phoneme_tone_dict.json",
"mandarin_phoneme_ascii_letter_dict.json",
"mandarin_pypinyin_char_dict.json",
"mandarin_pypinyin_phrase_dict.json",
"mandarin_jieba_dict.json",
],
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

The 30s hard-coded timeout on `event.wait` in `get_compute_plan` was
silently false-failing as "Failed to load compute plan: unknown error"
on graphs ≳1500 ops where `MLComputePlan.loadContentsOfURL` legitimately
takes 25-30s. Confirmed on Magpie's rank-4 decoder_step (1782 ops, load
took 27.09s end-to-end) and reproducible on nanocodec_decoder.

Changes:
  - `compute_plan.py`: add `DEFAULT_LOAD_TIMEOUT_S = 120.0` and
    `load_timeout_s` parameter; raise descriptive timeout error
    separately from the generic load error.
  - `fallback.py`: pass-through `load_timeout_s` to `get_compute_plan`.
  - `cli.py`: expose `--plan-timeout` typer option (default 120s);
    wire through to both the compute-plan and fallback paths.

Verified on Magpie decoder_step (1782 ops, 27.1s load, ANE 97.3%) and
nanocodec_decoder (1149 ops, ANE compile rejection visible via
--fallback at extended timeout).
…ental stateful variant

Production change:
  Split per-layer KV cache from rank-5 ``(2, B, max_seq, H, D)`` into
  rank-4 ``cache_k`` + ``cache_v`` tensors so the ANE compiler will
  accept the graph. Also replace causal mask constant ``-1e9`` (overflows
  fp16, ANE rejects) with fp16-safe ``-3e4``, and use additive masking
  for cross-attention.

  Result on Apple M2 / macOS 26.5:
    rank-5 (old):  GPU only, ~64 ms/step
    rank-4 (new):  97.3% ANE, 40 ms/step standalone, 96 ms/step in 146-step loop

  Loads ``decoder_step.mlpackage`` with ``ComputeUnit.ALL``. Cache and
  position output key names re-derived from the re-traced graph.

Experimental (kept, not enabled by default):
  Add ``traceable_decoder_step_stateful.py`` and
  ``convert_decoder_step_stateful.py`` for a CoreML MLState (stateful
  buffers) variant. Shrinks IO surface from 39/38 to 4/2 tensors, but
  forces CPU+GPU only (ANE rejects stateful graphs). Real-loop benchmark
  showed 212 ms/step — 2.2× regression vs rank-4. Both files carry
  prominent EXPERIMENTAL banners and the ``MAGPIE_STATEFUL`` env path in
  ``generate_coreml.py`` is off by default. Kept so future agents don't
  repeat the experiment thinking CosyVoice3's ~3× MLState speedup
  generalises (it doesn't — Magpie's rank-4 graph is already on ANE).

Files:
  - ``traceable/traceable_decoder_step.py`` — rank-4 production
  - ``convert_decoder_step.py`` — rank-4 production
  - ``generate_coreml.py`` — rank-4 keys + ``ComputeUnit.ALL``;
    experimental ``MAGPIE_STATEFUL`` env-gated branch
  - ``traceable/traceable_decoder_step_stateful.py`` — experimental, NEW
  - ``convert_decoder_step_stateful.py`` — experimental, NEW
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

View 17 additional findings in Devin Review.

Open in Devin Review

# Re-use everything from the main script so we never drift from the reference.
from generate_coreml import ( # noqa: E402
BUILD_DIR,
DECODER_CACHE_OUT_KEYS,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 emit_parity_fixture.py imports deleted DECODER_CACHE_OUT_KEYS and uses stale rank-5 cache format

The PR renamed DECODER_CACHE_OUT_KEYSDECODER_CACHE_K_OUT_KEYS + DECODER_CACHE_V_OUT_KEYS in generate_coreml.py, and changed caches from rank-5 (2, B, max_seq, H, D) to rank-4 (B, max_seq, H, D) with split K/V keys (cache_k{i} / cache_v{i}). However, emit_parity_fixture.py was never updated:

  1. Line 43 imports DECODER_CACHE_OUT_KEYS which no longer exists — immediate ImportError at runtime.
  2. Lines 59–60 create caches with old rank-5 shape (2, 1, max_seq_len, n_heads, d_head) and old key names cache{i} — incompatible with the new model that expects cache_k{i} / cache_v{i} with shape (1, max_seq_len, n_heads, d_head).
  3. Line 166 updates cache_dict[f"cache{i}"] using the deleted DECODER_CACHE_OUT_KEYS — even if the import were fixed, the cache update logic is wrong.

The script is completely non-functional with the new rank-4 decoder_step model.

Prompt for agents
emit_parity_fixture.py needs to be updated to match the new rank-4 split-K/V cache format from generate_coreml.py. Specifically:

1. In the import block (line 43), replace DECODER_CACHE_OUT_KEYS with DECODER_CACHE_K_OUT_KEYS and DECODER_CACHE_V_OUT_KEYS.

2. In _make_caches() (lines 56-63), change the cache creation from rank-5 (2, 1, max_seq_len, n_heads, d_head) with key cache{i} to two rank-4 tensors (1, max_seq_len, n_heads, d_head) with keys cache_k{i} and cache_v{i}. Mirror the make_caches() function in generate_coreml.py lines 365-371.

3. In _run_step() (lines 156-168), update the cache output reading from cache_dict[f"cache{i}"] = out[DECODER_CACHE_OUT_KEYS[i]] to cache_dict[f"cache_k{i}"] = out[DECODER_CACHE_K_OUT_KEYS[i]] and cache_dict[f"cache_v{i}"] = out[DECODER_CACHE_V_OUT_KEYS[i]]. Mirror the run_decoder_step() function in generate_coreml.py lines 398-402.

4. In the prefill snapshot (line 262), the prefillCache key naming f"prefillCache{i}" referencing f"cache{i}" also needs updating to match the new cache key names.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +144 to +167
"decoder_step": {
"inputs": [
{"name": "input", "dtype": "fp16", "shape": [1, 1, 768]},
{"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]},
{"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]},
{
"name": "cache_*",
"dtype": "fp16",
"shape": [2, 1, 512, 12, 64],
"count": 12,
},
{"name": "position_*", "dtype": "int32", "shape": [], "count": 12},
],
"outputs": [
{
"name": "var_2201",
"dtype": "fp16",
"shape": [1, 1, 16192],
"note": "logits, reshape to (1, 1, 8, 2024) for 8 codebooks",
},
{"name": "new_cache_*", "dtype": "fp16", "shape": [2, 1, 512, 12, 64], "count": 12},
{"name": "var_*", "dtype": "int32", "shape": [], "count": 12, "note": "advanced positions"},
],
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 build_manifest.py decoder_step IO spec describes stale rank-5 cache format

The MODEL_IO["decoder_step"] dictionary in build_manifest.py still describes the old rank-5 cache layout that was replaced in this PR. It lists input caches as cache_* with shape [2, 1, 512, 12, 64] and outputs as new_cache_* with the same shape, plus logits named var_2201. The actual model now uses split rank-4 caches (cache_k* / cache_v* with shape [1, 512, 12, 64]), output keys new_k_* / new_v_*, and logits named var_2129 (see generate_coreml.py:39-58). The Swift port's MagpieResourceDownloader consumes this manifest, so the stale IO spec will mislead consumers about the model's interface.

Prompt for agents
Update the MODEL_IO dictionary entry for decoder_step in build_manifest.py to reflect the new rank-4 split-K/V cache format. Specifically:

- Input caches should be two sets of 12: cache_k* with shape [1, 512, 12, 64] and cache_v* with shape [1, 512, 12, 64] (rank-4, not rank-5)
- Output caches should be new_k_* and new_v_* with shape [1, 512, 12, 64] (12 each)
- Logits output name should be var_2129 not var_2201
- Count of cache inputs is 24 (12 K + 12 V) not 12
- The position inputs remain the same (12 scalars)

Refer to the DECODER_CACHE_K_OUT_KEYS, DECODER_CACHE_V_OUT_KEYS, and DECODER_LOGITS_KEY constants in generate_coreml.py for the correct names.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +230 to +276
LANGUAGE_FILES: dict[str, dict[str, Any]] = {
"english": {
"tokenizer_kind": "phoneme",
"files": [
"tokenizer/english_phoneme_token2id.json",
"tokenizer/english_phoneme_phoneme_dict.json",
"tokenizer/english_phoneme_heteronyms.json",
],
},
"spanish": {
"tokenizer_kind": "phoneme",
"files": [
"tokenizer/spanish_phoneme_token2id.json",
"tokenizer/spanish_phoneme_phoneme_dict.json",
],
},
"german": {
"tokenizer_kind": "phoneme",
"files": [
"tokenizer/german_phoneme_token2id.json",
"tokenizer/german_phoneme_phoneme_dict.json",
"tokenizer/german_phoneme_heteronyms.json",
],
},
"hindi": {
"tokenizer_kind": "char",
"files": [
"tokenizer/hindi_chartokenizer_token2id.json",
],
},
"mandarin": {
"tokenizer_kind": "phoneme+jieba+pypinyin",
"files": [
"tokenizer/mandarin_phoneme_token2id.json",
"tokenizer/mandarin_phoneme_phoneme_dict.json",
"tokenizer/mandarin_phoneme_pinyin_dict.json",
"tokenizer/mandarin_phoneme_tone_dict.json",
"tokenizer/mandarin_phoneme_ascii_letter_dict.json",
"tokenizer/mandarin_pypinyin_char_dict.json",
"tokenizer/mandarin_pypinyin_phrase_dict.json",
"tokenizer/mandarin_jieba_dict.json",
],
},
"french": {"tokenizer_kind": "byt5", "files": []},
"italian": {"tokenizer_kind": "byt5", "files": []},
"vietnamese": {"tokenizer_kind": "byt5", "files": []},
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 build_manifest.py references tokenizer files not staged by prepare_hf_upload.py, causing FileNotFoundError

The LANGUAGE_FILES in build_manifest.py references two tokenizer files that are absent from PER_LANGUAGE_TOKENIZER_FILES in prepare_hf_upload.py: english_phoneme_heteronyms.json (line 236) and mandarin_phoneme_phoneme_dict.json (line 264). Since prepare_hf_upload.py stages the hf-upload/ directory and doesn't copy these files, build_manifest.py's json_entry() will crash with FileNotFoundError when computing their size/hash. The intended workflow is: run prepare_hf_upload.py first, then run build_manifest.py — but one list has files the other doesn't.

Specific mismatches
  • english: build_manifest.py includes english_phoneme_heteronyms.json, prepare_hf_upload.py does not
  • mandarin: build_manifest.py includes mandarin_phoneme_phoneme_dict.json, prepare_hf_upload.py does not
Prompt for agents
The language tokenizer file lists in build_manifest.py (LANGUAGE_FILES) and prepare_hf_upload.py (PER_LANGUAGE_TOKENIZER_FILES) are inconsistent. Reconcile them so the files prepare_hf_upload copies to hf-upload exactly match the files build_manifest expects to index.

Specific issues to resolve:
1. English: build_manifest includes english_phoneme_heteronyms.json but prepare_hf_upload does not. Either add it to prepare_hf_upload.py or remove it from build_manifest.py.
2. Mandarin: build_manifest includes mandarin_phoneme_phoneme_dict.json but prepare_hf_upload does not. Same resolution needed.
3. French, Italian, Vietnamese: prepare_hf_upload stages phoneme/chartokenizer files for these languages, but build_manifest marks them as byt5 with empty file lists. Since generate_coreml.py maps these languages to phoneme/char tokenizers (lines 524-529), build_manifest should list their tokenizer files too.

The authoritative source for which tokenizers each language needs is generate_coreml.py's language_tokenizer_map. Both files should mirror that.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant