# phoonnx — Train & Export TTS Models

**A platform-agnostic notebook for training [VITS](https://arxiv.org/abs/2106.06103)-based Text-to-Speech models with [phoonnx](https://github.com/TigreGotico/phoonnx).**

This notebook is **self-contained** — it does not depend on external scripts and works on **Kaggle, Colab, Paperspace, SageMaker, local machines**, or any Jupyter-compatible environment.

### What phoonnx does

phoonnx is a multilingual TTS toolkit built on the VITS architecture. It provides:
- Text normalization (numbers, dates, units, contractions — language-aware)
- Phonemization via 25+ backends (eSpeak, Gruut, ByT5, Misaki, Epitran, …)
- Training with PyTorch Lightning
- Export to ONNX for lightweight, cross-platform inference

Trained models work with **Piper TTS**, **sherpa-onnx**, **Open Voice OS**, and the **[NVDA screen-reader add-on](https://github.com/TigreGotico/phoonnx-AddonNVDA)**.


---
## 1 · Configuration

Edit the variables below to match your language, dataset, and hardware.
All paths are auto-resolved per platform — you **only** need to change the top section.

In [4]:
# ╔══════════════════════════════════════════════════════════════════════╗
# ║  EDIT THIS SECTION — everything else adapts automatically          ║
# ╚══════════════════════════════════════════════════════════════════════╝

# ── Language & phonemizer ────────────────────────────────────────────
LANG            = "en"            # BCP-47 language tag (e.g. "en", "pt-PT", "eu-ES")
PHONEMIZER      = "espeak"        # one of: espeak, gruut, byt5, misaki, epitran, …
                                  #   see full list in TRAINING.md
ALPHABET        = "ipa"           # ipa, arpa, sampa, pinyin, hangul, kana, …

# ── Dataset source (pick ONE) ───────────────────────────────────────
#    Option A – download from Hugging Face
HF_DATASET      = "TigreGotico/tts-train-synthetic-miro_en-US"
#    Option B – upload the dataset yourself and set the path
#HF_DATASET = ""   # leave empty to skip download

#    Browse available datasets:
#    https://huggingface.co/collections/TigreGotico/synthetic-tts-datasets

# ── Base checkpoint for fine-tuning (optional) ──────────────────────
#    Set both to "" to train from scratch.
BASE_CKPT_URL   = "https://huggingface.co/OpenVoiceOS/phoonnx_eu-ES_miro_espeak/resolve/main/epoch%3D299-step%3D99600.ckpt"
BASE_CONFIG_URL = "https://huggingface.co/OpenVoiceOS/phoonnx_eu-ES_miro_espeak/resolve/main/miro_eu-ES.json"

# ── Hugging Face token (for private datasets / avoid rate limits) ───
HF_TOKEN        = ""              # e.g. "hf_xxxxx"

# ── Training hyperparameters ────────────────────────────────────────
BATCH_SIZE      = 16              # reduce to 8 or 4 if you run out of VRAM
MAX_EPOCHS      = 1000            # training will checkpoint every epoch
PRECISION       = 32              # 32, 16, or "bf16"  (16 saves ~40 % VRAM)
LEARNING_RATE   = 2e-4
VALIDATION_SPLIT= 0.05
QUALITY         = "medium"        # x-low, medium, high  (affects model size)
SAMPLE_RATE     = 22050           # must match your audio files

# ── Extra flags ─────────────────────────────────────────────────────
SINGLE_SPEAKER  = True            # set False for multi-speaker datasets
ADD_DIACRITICS  = False           # True for Arabic (tashkeel) / Hebrew (nikud)
BYT5_MODEL      = ""              # only if PHONEMIZER="byt5", e.g.
                                  #   "OpenVoiceOS/g2p-mbyt5-12l-ipa-childes-espeak-onnx"

# ╔══════════════════════════════════════════════════════════════════════╗
# ║  END OF USER CONFIG — everything below is auto-configured          ║
# ╚══════════════════════════════════════════════════════════════════════╝
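
A quick sanity check on the values above can catch typos before the long-running steps start. The sketch below is not part of phoonnx: `validate_config` is a hypothetical helper, and the allowed values are taken only from the comments in the config cell, so the training script may accept more options than are listed here.

```python
def validate_config(precision, quality, validation_split, batch_size, sample_rate):
    """Return a list of human-readable problems; an empty list means the values look sane."""
    problems = []
    # Allowed values mirror the comments in the config cell (assumption, not a phoonnx API).
    if precision not in (32, 16, "bf16"):
        problems.append(f"PRECISION should be 32, 16, or 'bf16', got {precision!r}")
    if quality not in ("x-low", "medium", "high"):
        problems.append(f"QUALITY should be x-low, medium, or high, got {quality!r}")
    if not 0.0 <= validation_split < 1.0:
        problems.append(f"VALIDATION_SPLIT should be in [0, 1), got {validation_split!r}")
    if not (isinstance(batch_size, int) and batch_size > 0):
        problems.append(f"BATCH_SIZE should be a positive integer, got {batch_size!r}")
    if not (isinstance(sample_rate, int) and sample_rate > 0):
        problems.append(f"SAMPLE_RATE should be a positive integer, got {sample_rate!r}")
    return problems


# Defaults from the config cell above.
issues = validate_config(32, "medium", 0.05, 16, 22050)
print(issues or "Config looks sane")
```

In the notebook, the call would pass `PRECISION`, `QUALITY`, `VALIDATION_SPLIT`, `BATCH_SIZE`, and `SAMPLE_RATE` directly.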

### 1.1 · Platform detection & paths

The cell below detects whether you are on **Kaggle, Colab, Paperspace, SageMaker**, or a **local** machine and sets all working paths accordingly. No manual edits needed.

In [7]:
import os, shutil, subprocess, sys

# ── Detect platform ─────────────────────────────────────────────────
def detect_platform():
    if os.path.exists("/kaggle"):
        return "kaggle"
    try:
        import google.colab  # noqa: F401
        return "colab"
    except ImportError:
        pass
    if os.environ.get("PAPERSPACE"):
        return "paperspace"
    if os.path.exists("/opt/ml"):
        return "sagemaker"
    return "local"

PLATFORM = detect_platform()
print(f"Detected platform: {PLATFORM}")

# ── Detect accelerator ──────────────────────────────────────────────
def detect_accelerator():
    try:
        subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
        return "gpu"
    except (FileNotFoundError, subprocess.CalledProcessError):
        pass
    # Apple Silicon (MPS) — PyTorch >= 2.0
    try:
        import torch
        if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

ACCELERATOR = detect_accelerator()
print(f"Accelerator: {ACCELERATOR}")
if ACCELERATOR == "cpu":
    print("⚠  CPU training is very slow — GPU is strongly recommended.")

# ── Resolve paths per platform ──────────────────────────────────────
WORK = {
    "kaggle":     "/kaggle/working",
    "colab":      "/content",
    "paperspace": "/notebooks",
    "sagemaker":  "/opt/ml",
    "local":      os.path.expanduser("~/phoonnx_work"),
}[PLATFORM]

os.makedirs(WORK, exist_ok=True)

DATASET_DIR       = os.path.join(WORK, "dataset")
PREPROCESSED_DIR  = os.path.join(WORK, "training")
CHECKPOINTS_DIR   = os.path.join(WORK, "checkpoints")
EXPORT_DIR        = os.path.join(WORK, "exported")

for d in [DATASET_DIR, PREPROCESSED_DIR, CHECKPOINTS_DIR, EXPORT_DIR]:
    os.makedirs(d, exist_ok=True)

LOCAL_CKPT   = os.path.join(CHECKPOINTS_DIR, "base.ckpt")
LOCAL_CONFIG = os.path.join(CHECKPOINTS_DIR, "base.ckpt.json")

print(f"Working directory : {WORK}")
print(f"Dataset path      : {DATASET_DIR}")
print(f"Preprocessed path : {PREPROCESSED_DIR}")
print(f"Checkpoints path  : {CHECKPOINTS_DIR}")

Detected platform: local
Accelerator: gpu
Working directory : /home/miro/phoonnx_work
Dataset path      : /home/miro/phoonnx_work/dataset
Preprocessed path : /home/miro/phoonnx_work/training
Checkpoints path  : /home/miro/phoonnx_work/checkpoints


### 1.2 · Set environment variables

Some phoonnx internals read from environment variables. This cell propagates your config.

In [8]:
# Propagate config to environment (read by phoonnx internals)
os.environ["LANG"]          = LANG
os.environ["PHONEMIZER"]    = PHONEMIZER
os.environ["ALPHABET"]      = ALPHABET
os.environ["WORKDIR"]      = WORK
if HF_TOKEN:
    os.environ["HF_TOKEN"]  = HF_TOKEN
if ACCELERATOR == "gpu":
    os.environ["CUDA"]      = "1"   # tells ByT5 phonemizer to use GPU

print("Environment configured ✓")

Environment configured ✓


---
## 2 · Install dependencies

Installs phoonnx with training extras and builds the Cython monotonic alignment module.
This creates a **virtual environment** with `uv`, pinned to Python 3.10 so that the training
dependencies run on a compatible Python version, and installs everything into that environment rather than the system Python.

> **Disk usage note**: `phoonnx[train]` pulls in PyTorch + Lightning (~3 GB).
> On free-tier Kaggle/Colab this is fine; on very constrained envs, ensure ≥5 GB free.
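
The free-space requirement in the note can be checked before installing. This is a minimal sketch using only the standard library; `free_gb` is a hypothetical helper, and the current directory stands in for `WORK`.

```python
import os
import shutil


def free_gb(path):
    """Free space in GB (10^9 bytes) on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


REQUIRED_GB = 5  # threshold taken from the disk usage note above
check_path = os.getcwd()  # assumption: in the notebook, use WORK instead

available = free_gb(check_path)
if available < REQUIRED_GB:
    print(f"Only {available:.1f} GB free at {check_path}; the install may fail.")
else:
    print(f"{available:.1f} GB free at {check_path}")
```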

In [9]:
%%time
!uv venv ${WORKDIR}/.venv  --clear --python 3.10

VENV_PYTHON = WORK + "/.venv/bin/python"
os.environ["UV_PYTHON"] = VENV_PYTHON
venv_bin = os.path.join(WORK, ".venv", "bin")
os.environ["PATH"] = venv_bin + ":" + os.environ.get("PATH", "")
os.environ["VIRTUAL_ENV"] = os.path.join(WORK, ".venv")
print(f"Python for training: {VENV_PYTHON}")

Using CPython 3.10.19
Creating virtual environment at: /home/miro/phoonnx_work/.venv
Activate with: source /home/miro/phoonnx_work/.venv/bin/activate
Python for training: /home/miro/phoonnx_work/.venv/bin/python
CPU times: user 13.3 ms, sys: 3.34 ms, total: 16.7 ms
Wall time: 665 ms


In [10]:
%%time
!git clone --depth 1 https://github.com/TigreGotico/phoonnx ${WORKDIR}/phoonnx

PHOONNX_DIR = WORK + "/phoonnx"

Cloning into '/home/miro/phoonnx_work/phoonnx'...
remote: Enumerating objects: 223, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 223 (delta 17), reused 166 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (223/223), 14.17 MiB | 21.15 MiB/s, done.
Resolving deltas: 100% (17/17), done.
CPU times: user 33.7 ms, sys: 12.9 ms, total: 46.5 ms
Wall time: 1.72 s


In [11]:
%%time
!source ${WORKDIR}/.venv/bin/activate && uv pip install -e ${WORKDIR}/phoonnx[train]  "setuptools<82"

print("phoonnx[train] installed ✓")

Using Python 3.10.19 environment at: /home/miro/phoonnx_work/.venv
Resolved 84 packages in 950ms
Prepared 1 package in 255ms
Installed 84 packages in 142ms
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.13.3
 + aiosignal==1.4.0
 + async-timeout==5.0.1
 + attrs==25.4.0
 + audioread==3.1.0
 + certifi==2026.1.4
 + cffi==2.0.0
 + charset-normalizer==3.4.4
 + click==8.3.1
 + coloredlogs==15.0.1
 + combo-lock==0.3.0
 + cython==0.29.37
 

### 2.1 · Phonemizer system dependencies

If you use **espeak**, the `espeak-ng` binary must be on the system.
Other phonemizers (gruut, byt5, misaki, epitran, …) may need extra Python dependencies instead.

In [12]:
# Install espeak-ng (only needed if PHONEMIZER == "espeak")
if PHONEMIZER == "espeak":
    if shutil.which("espeak-ng") is None:
        if PLATFORM in ("colab", "kaggle", "paperspace"):
            subprocess.check_call(["apt-get", "install", "-y", "-qq", "espeak-ng"])
        else:
            print("⚠  espeak-ng not found. Install it manually:")
            print("   Debian/Ubuntu : sudo apt install espeak-ng")
            print("   macOS         : brew install espeak-ng")
            print("   Windows       : download from https://github.com/espeak-ng/espeak-ng/releases")
    else:
        print("espeak-ng already installed ✓")
else:
    print(f"Phonemizer '{PHONEMIZER}' does not need espeak-ng — skipping.")

espeak-ng already installed ✓


### 2.2 · Build monotonic alignment (Cython)

VITS requires a small Cython extension for monotonic alignment search.
This compiles it in-place.

In [13]:
%%time
import glob

MA_DIR = os.path.join(PHOONNX_DIR, "phoonnx_train", "vits", "monotonic_align")
MA_OUT = os.path.join(MA_DIR, "monotonic_align")
os.makedirs(MA_OUT, exist_ok=True)

env = os.environ.copy()

subprocess.check_call(["cythonize", "-i", "core.pyx"], cwd=MA_DIR, env=env)

# Copy the compiled .so into the sub-package
for so in glob.glob(os.path.join(MA_DIR, "core*.so")):
    shutil.copy2(so, MA_OUT)

print("Monotonic alignment built ✓")
print("Contents:", os.listdir(MA_OUT))

Monotonic alignment built ✓
Contents: ['core.cpython-310-x86_64-linux-gnu.so']
CPU times: user 3.86 ms, sys: 88 μs, total: 3.95 ms
Wall time: 6.45 s


---
## 3 · Download dataset & base checkpoint

Downloads your dataset from Hugging Face and (optionally) a pre-trained
checkpoint to fine-tune from.

> **Storage tip**: Synthetic TTS datasets from TigreGotico are typically 30–300 MB.
> If you already uploaded a dataset to the platform (e.g. as a Kaggle dataset),
> set `HF_DATASET = ""` above and point `DATASET_DIR` to the upload location.
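
One way to express the choice described in the tip is a small helper that falls back to an uploaded copy when no Hugging Face dataset is configured. This is only a sketch: `pick_dataset_dir` is a hypothetical function, and the Kaggle path in the usage comment is a placeholder, not a real dataset.

```python
import os


def pick_dataset_dir(hf_dataset, default_dir, uploaded_dir):
    """Choose where the dataset lives.

    If a Hugging Face dataset is configured, keep the default directory
    (the download cell fills it). Otherwise require an uploaded copy that
    contains metadata.csv.
    """
    if hf_dataset:
        return default_dir
    if os.path.isfile(os.path.join(uploaded_dir, "metadata.csv")):
        return uploaded_dir
    raise FileNotFoundError(f"No metadata.csv under {uploaded_dir}")


# Usage in the notebook (placeholder upload path):
# DATASET_DIR = pick_dataset_dir(HF_DATASET, DATASET_DIR, "/kaggle/input/my-tts-dataset")
```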

In [37]:
%%time
# ── Download dataset from Hugging Face ──────────────────────────────
if HF_DATASET:
    hf_cmd = ["huggingface-cli", "download", HF_DATASET,
              "--quiet", "--repo-type", "dataset",
              "--local-dir", DATASET_DIR]
    if HF_TOKEN:
        hf_cmd += ["--token", HF_TOKEN]
    subprocess.check_call(hf_cmd)
    print(f"Dataset downloaded to {DATASET_DIR} ✓")
else:
    print(f"HF_DATASET is empty — expecting data already at {DATASET_DIR}")

# Quick sanity check
metadata = os.path.join(DATASET_DIR, "metadata.csv")
if os.path.isfile(metadata):
    with open(metadata) as f:
        n_lines = sum(1 for _ in f) - 1  # minus header
    print(f"  Found metadata.csv with {n_lines} utterances")
    wavs_dir = os.path.join(DATASET_DIR, "wavs")
    if os.path.isdir(wavs_dir):
        n_wavs = len([f for f in os.listdir(wavs_dir) if f.endswith(".wav")])
        print(f"  Found {n_wavs} wav files in wavs/")
else:
    print("  ⚠  metadata.csv not found — verify your dataset path!")

HF_DATASET is empty — expecting data already at /home/miro/phoonnx_work/dataset
  Found metadata.csv with 1348 utterances
CPU times: user 2.53 ms, sys: 0 ns, total: 2.53 ms
Wall time: 1.66 ms


In [38]:
%%time
# ── Download base checkpoint (for fine-tuning) ──────────────────────
if BASE_CKPT_URL:
    if not os.path.isfile(LOCAL_CKPT):
        subprocess.check_call(["wget", "-q", BASE_CKPT_URL, "-O", LOCAL_CKPT])
        print(f"Checkpoint downloaded → {LOCAL_CKPT} ✓")
    else:
        print("Checkpoint already exists — skipping download.")
else:
    print("No base checkpoint URL — training from scratch.")

if BASE_CONFIG_URL:
    if not os.path.isfile(LOCAL_CONFIG):
        subprocess.check_call(["wget", "-q", BASE_CONFIG_URL, "-O", LOCAL_CONFIG])
        print(f"Config downloaded → {LOCAL_CONFIG} ✓")
    else:
        print("Config already exists — skipping download.")
else:
    print("No base config URL provided.")

Checkpoint downloaded → /home/miro/phoonnx_work/checkpoints/base.ckpt ✓
Config downloaded → /home/miro/phoonnx_work/checkpoints/base.ckpt.json ✓
CPU times: user 4.26 ms, sys: 2.31 ms, total: 6.57 ms
Wall time: 35.6 s


---
## 4 · Preprocess (phonemize + cache audio)

Converts raw text → phoneme IDs and normalizes/caches audio at `SAMPLE_RATE`.

**What happens under the hood:**
1. **Text normalization** — numbers, dates, units, contractions → spoken form (language-aware)
2. **Phonemization** — text → phoneme sequence via the chosen backend
3. **Audio caching** — resamples wavs and computes spectrograms

Outputs `config.json` + `dataset.jsonl` + `cache/` into `PREPROCESSED_DIR`.
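
After preprocessing, `dataset.jsonl` can be inspected without assuming anything about its schema. The sketch below counts records and lists the keys of the first one; `summarize_jsonl` is a hypothetical helper, and the field names it prints are whatever the file contains.

```python
import json


def summarize_jsonl(path, max_keys=20):
    """Count records in a JSON Lines file and collect the keys of the first record."""
    count = 0
    first_keys = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if count == 0 and isinstance(record, dict):
                first_keys = sorted(record)[:max_keys]
            count += 1
    return count, first_keys


# Usage in the notebook:
# n, keys = summarize_jsonl(os.path.join(PREPROCESSED_DIR, "dataset.jsonl"))
# print(f"{n} records, first record keys: {keys}")
```

Comparing the record count with the number of lines in `metadata.csv` may reveal utterances that were dropped because they failed to process.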

> **Skip this cell** if you already have preprocessed data (e.g. uploaded to the platform).

In [39]:
%%time
PREPROCESS_SCRIPT = os.path.join(PHOONNX_DIR, "phoonnx_train", "preprocess.py")

cmd = [
    VENV_PYTHON, PREPROCESS_SCRIPT,
    "--input-dir",    DATASET_DIR,
    "--output-dir",   PREPROCESSED_DIR,
    "--language",     LANG,
    "--sample-rate",  str(SAMPLE_RATE),
    "--phoneme-type", PHONEMIZER,
    "--alphabet",     ALPHABET,
]

# Fine-tuning: reuse previous phoneme_id_map
if BASE_CONFIG_URL and os.path.isfile(LOCAL_CONFIG):
    cmd += ["--prev-config", LOCAL_CONFIG]

if SINGLE_SPEAKER:
    cmd += ["--single-speaker"]

if ADD_DIACRITICS:
    cmd += ["--add-diacritics"]

if BYT5_MODEL:
    cmd += ["--phonemizer-model", BYT5_MODEL]

print("Running:", " ".join(cmd))
subprocess.check_call(cmd)

# Verify outputs
for expected in ["config.json", "dataset.jsonl"]:
    path = os.path.join(PREPROCESSED_DIR, expected)
    if os.path.isfile(path):
        size_mb = os.path.getsize(path) / 1e6
        print(f"  ✓ {expected}  ({size_mb:.1f} MB)")
    else:
        print(f"  ✗ {expected}  MISSING — check logs above")

Running: /home/miro/phoonnx_work/.venv/bin/python /home/miro/phoonnx_work/phoonnx/phoonnx_train/preprocess.py --input-dir /home/miro/phoonnx_work/dataset --output-dir /home/miro/phoonnx_work/training --language pt --sample-rate 22050 --phoneme-type espeak --alphabet ipa --prev-config /home/miro/phoonnx_work/checkpoints/base.ckpt.json --single-speaker


INFO:preprocess:Loading utterances from dataset...
INFO:preprocess:Found 1000 utterances.
INFO:preprocess:Single speaker dataset
INFO:preprocess:Starting single pass processing with 16 workers...
Processing utterances:  77%|███████▋  | 774/1000 [02:08<00:20, 11.02it/s]ERROR:preprocess:Failed to process utterance: /home/miro/phoonnx_work/dataset/wav/0300000031.wav
Traceback (most recent call last):
  File "/home/miro/phoonnx_work/phoonnx/phoonnx_train/preprocess.py", line 213, in phonemize_worker
    utterance: str = casing(normalize(utt.text, config.language))
  File "/home/miro/phoonnx_work/phoonnx/phoonnx/util.py", line 725, in normalize
    dialog = _normalize_units(dialog, full_lang)
  File "/home/miro/phoonnx_work/phoonnx/phoonnx/util.py", line 662, in _normalize_units
    text = alphanumeric_pattern.sub(replace_alphanumeric, text)
  File "/home/miro/phoonnx_work/phoonnx/phoonnx/util.py", line 659, in replace_alphanumeric
    unit_word = alphanumeric_units[unit_symbol]
KeyError: '

  ✓ config.json  (0.0 MB)
  ✓ dataset.jsonl  (1.3 MB)
CPU times: user 470 ms, sys: 115 ms, total: 585 ms
Wall time: 2min 38s


---
## 5 · Train

Launches VITS training with PyTorch Lightning. Checkpoints are saved every epoch
to `CHECKPOINTS_DIR`.

**Tips for constrained environments:**
- Reduce `BATCH_SIZE` to 8 or 4 if you get OOM errors
- Use `PRECISION = 16` (mixed precision) to cut VRAM by ~40%
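- Free-tier Kaggle/Colab sessions may time out after 9–12 h. The cell below resumes from `LOCAL_CKPT`, so to continue an interrupted run, pass the most recent checkpoint in `CHECKPOINTS_DIR` to `--resume-from-checkpoint` instead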
- Set `MAX_EPOCHS` conservatively and re-run to continue
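
To continue after a timeout, the newest checkpoint has to be located first. The sketch below mirrors the recursive glob used in the export cell; `latest_checkpoint` is a hypothetical helper, and it assumes the training script writes `.ckpt` files somewhere under `CHECKPOINTS_DIR`.

```python
import glob
import os


def latest_checkpoint(checkpoints_dir):
    """Return the most recently modified .ckpt under checkpoints_dir, or None."""
    pattern = os.path.join(checkpoints_dir, "**", "*.ckpt")
    ckpts = glob.glob(pattern, recursive=True)
    return max(ckpts, key=os.path.getmtime) if ckpts else None


# Usage in the training cell, replacing the fixed LOCAL_CKPT:
# resume_ckpt = latest_checkpoint(CHECKPOINTS_DIR)
# if resume_ckpt:
#     cmd += ["--resume-from-checkpoint", resume_ckpt]
```

Because `base.ckpt` also lives in `CHECKPOINTS_DIR`, a first run with no newer checkpoints still picks the base checkpoint.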

> Training is the longest step. On a single T4 GPU with batch size 16, expect ~2 min/epoch
> for a dataset of ~1000 utterances.
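
The figures above give a rough idea of how many sessions a full run needs. The sketch below is simple arithmetic, not a benchmark: `estimate_training` is a hypothetical helper, and it ignores epochs already completed by a base checkpoint as well as per-session startup time.

```python
import math


def estimate_training(max_epochs, minutes_per_epoch, session_hours):
    """Return (total_hours, sessions_needed) for a run split across timed sessions."""
    total_hours = max_epochs * minutes_per_epoch / 60
    sessions = math.ceil(total_hours / session_hours)
    return total_hours, sessions


# MAX_EPOCHS = 1000 at ~2 min/epoch, with a 9 h session limit.
total, sessions = estimate_training(1000, 2, 9)
print(f"{total:.1f} h total, about {sessions} sessions of 9 h")
```

With the default `MAX_EPOCHS = 1000` this comes to roughly 33 hours, or about four 9-hour sessions, which is why resuming from checkpoints matters on free tiers.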

In [None]:
%%time
TRAIN_SCRIPT = os.path.join(PHOONNX_DIR, "phoonnx_train", "train.py")

cmd = [
    VENV_PYTHON, TRAIN_SCRIPT,
    "--dataset-dir",      PREPROCESSED_DIR,
    "--accelerator",      ACCELERATOR,
    "--devices",          "1",
    "--batch-size",       str(BATCH_SIZE),
    "--validation-split", str(VALIDATION_SPLIT),
    "--max-epochs",       str(MAX_EPOCHS),
    "--checkpoint-epochs", "1",
    "--precision",        str(PRECISION),
    "--quality",          QUALITY,
    "--learning-rate",    str(LEARNING_RATE),
    "--default-root-dir", CHECKPOINTS_DIR,
]

# Resume from base checkpoint (fine-tuning)
if BASE_CKPT_URL and os.path.isfile(LOCAL_CKPT):
    cmd += ["--resume-from-checkpoint", LOCAL_CKPT]

print("Running:", " ".join(cmd))
print("─" * 60)
subprocess.check_call(cmd)

Running: /home/miro/phoonnx_work/.venv/bin/python /home/miro/phoonnx_work/phoonnx/phoonnx_train/train.py --dataset-dir /home/miro/phoonnx_work/training --accelerator gpu --devices 1 --batch-size 16 --validation-split 0.05 --max-epochs 2 --checkpoint-epochs 1 --precision 32 --quality medium --learning-rate 0.0002 --default-root-dir /home/miro/phoonnx_work/checkpoints --resume-from-checkpoint /home/miro/phoonnx_work/checkpoints/base.ckpt
────────────────────────────────────────────────────────────


  __import__("pkg_resources").declare_namespace(__name__)
INFO:root:config_path: '/home/miro/phoonnx_work/training/config.json'
INFO:root:dataset_path: '/home/miro/phoonnx_work/training/dataset.jsonl'
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO:root:Checkpoints will be saved every 1 epoch(s)
DEBUG:root:Config params: num_symbols=256 num_speakers=1 sample_rate=22050
DEBUG:fsspec.local:open file: /home/miro/phoonnx_work/checkpoints/base.ckpt
DEBUG:vits.lightning:No dataset to load
DEBUG:root:Checkpoint params: num_symbols=256 num_speakers=1 sample_rate=22050
DEBUG:vits.dataset:Loading dataset: /home/miro/phoonnx_work/training/dataset.jsonl
INFO:root:VitsModel params: num_symbols=256 num_speakers=1 sample_rate=22050
INFO:root:Successfully loaded model weights.
INFO:root:training started!!
You are using a CUDA device ('NVIDIA GeForce RTX 3060 Laptop GPU') that has Tensor Cores. 

---
## 6 · Export to ONNX

Converts the most recent `.ckpt` (by modification time) to ONNX format for fast, cross-platform inference.

The exported model works with:
- **phoonnx** (Python inference)
- **Piper TTS** (if trained with espeak + IPA)
- **sherpa-onnx** (C++/mobile, with `--generate-tokens`)
- **Open Voice OS** (OVOS TTS plugin)
- **NVDA screen reader** ([phoonnx-AddonNVDA](https://github.com/TigreGotico/phoonnx-AddonNVDA))

In [14]:
%%time
import glob, os

EXPORT_SCRIPT = os.path.join(PHOONNX_DIR, "phoonnx_train", "export_onnx.py")
TRAIN_CONFIG  = os.path.join(PREPROCESSED_DIR, "config.json")

# Find the latest checkpoint
ckpt_pattern = os.path.join(CHECKPOINTS_DIR, "**", "*.ckpt")
ckpts = sorted(glob.glob(ckpt_pattern, recursive=True), key=os.path.getmtime)

if not ckpts:
    print("✗ No checkpoints found — run training first.")
else:
    best_ckpt = ckpts[-1]
    print(f"Exporting: {best_ckpt}")

    cmd = [
        VENV_PYTHON, EXPORT_SCRIPT,
        best_ckpt,
        "--config",     TRAIN_CONFIG,
        "--output-dir", EXPORT_DIR,
        "--generate-tokens",   # for sherpa-onnx compatibility
        "--piper",             # for Piper TTS compatibility
    ]
    subprocess.check_call(cmd)

    print(f"\nExported files in {EXPORT_DIR}:")
    for f in sorted(os.listdir(EXPORT_DIR)):
        size = os.path.getsize(os.path.join(EXPORT_DIR, f)) / 1e6
        print(f"  {f}  ({size:.1f} MB)")

Exporting: /home/miro/phoonnx_work/checkpoints/base.ckpt


  __import__("pkg_resources").declare_namespace(__name__)
DEBUG:phoonnx_train.export_onnx:Arguments: checkpoint=PosixPath('/home/miro/phoonnx_work/checkpoints/base.ckpt'), config=PosixPath('/home/miro/phoonnx_work/training/config.json'), output_dir=PosixPath('/home/miro/phoonnx_work/exported'), generate_tokens=True, piper=True
DEBUG:phoonnx_train.export_onnx:Output directory ensured: /home/miro/phoonnx_work/exported
INFO:phoonnx_train.export_onnx:Loaded phoonnx config from /home/miro/phoonnx_work/training/config.json
INFO:phoonnx_train.export_onnx:Generated tokens file at /home/miro/phoonnx_work/exported/base.ckpt.tokens.txt
DEBUG:fsspec.local:open file: /home/miro/phoonnx_work/checkpoints/base.ckpt
DEBUG:vits.lightning:No dataset to load
DEBUG:phoonnx_train.export_onnx:Removed weight normalization from decoder.
INFO:phoonnx_train.export_onnx:Starting ONNX export to /home/miro/phoonnx_work/exported/base.ckpt.onnx (opset=15)...
  t_s == t_t
  pad_length = max(length - (self.window_size 

Removing weight norm...


  assert (discriminant >= 0).all(), discriminant
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
INFO:phoonnx_train.export_onnx:Successfully exported model to /home/miro/phoonnx_work/exported/base.ckpt.onnx
ERROR:phoonnx_train.export_onnx:The 'onnx' package is required to add metadata. Please install it with 'pip install onnx'.
INFO:phoonnx_train.export_onnx:Export complete.



Exported files in /home/miro/phoonnx_work/exported:
  base.ckpt.onnx  (63.5 MB)
  base.ckpt.piper.json  (0.0 MB)
  base.ckpt.tokens.txt  (0.0 MB)
CPU times: user 11.8 ms, sys: 10.8 ms, total: 22.6 ms
Wall time: 17.3 s


---
## 7 · Quick inference test (optional)

Synthesizes a short sentence with the exported ONNX model to verify everything works.

In [15]:
import glob, os

onnx_files = glob.glob(os.path.join(EXPORT_DIR, "*.onnx"))
json_files = glob.glob(os.path.join(EXPORT_DIR, "*.json"))

if onnx_files and json_files:
    onnx_path = onnx_files[0]
    json_path = json_files[0]
    test_wav  = os.path.join(EXPORT_DIR, "test_output.wav")

    test_code = f'''
import wave
from phoonnx.config import SynthesisConfig
from phoonnx.voice import TTSVoice

voice = TTSVoice.load("{onnx_path}", "{json_path}")
config = SynthesisConfig(noise_scale=0.667, length_scale=1.0, noise_w_scale=0.8)
with wave.open("{test_wav}", "wb") as wf:
    voice.synthesize_wav("Hello, this is a test.", wf, config)
print("Inference test passed ✓  →  {test_wav}")
'''
    # Write & run with the training Python (which has phoonnx installed)
    test_script = os.path.join(WORK, "_test_inference.py")
    with open(test_script, "w") as f:
        f.write(test_code)
    subprocess.check_call([VENV_PYTHON, test_script])
else:
    print("No exported model found — run the export cell first.")

Inference test passed ✓  →  /home/miro/phoonnx_work/exported/test_output.wav
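
Beyond the script exiting cleanly, the generated file can be checked by reading its WAV header. This is a minimal sketch with the standard-library `wave` module; `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path):
    """Read channel count, sample rate, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }


# Usage in the notebook:
# info = wav_info(os.path.join(EXPORT_DIR, "test_output.wav"))
# print(info)  # sample_rate should match SAMPLE_RATE, seconds should be > 0
```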


---
## 8 · Cleanup (optional)

Free disk space by removing intermediate files. The commented-out lines below remove the
spectrogram cache, the cloned repo, and the raw dataset; checkpoints and exported models are left in place.

In [16]:
# Uncomment the lines below to reclaim disk space

# Remove cached spectrograms (largest intermediate artifact)
# shutil.rmtree(os.path.join(PREPROCESSED_DIR, "cache"), ignore_errors=True)

# Remove the cloned phoonnx repo
# shutil.rmtree(PHOONNX_DIR, ignore_errors=True)

# Remove the raw dataset (keep only preprocessed + exported)
# shutil.rmtree(DATASET_DIR, ignore_errors=True)

print("Cleanup section — uncomment lines above to free disk space.")

Cleanup section — uncomment lines above to free disk space.


---
## Appendix A · Dataset format (LJSpeech-style)

phoonnx expects an **LJSpeech-style** directory:

```
dataset/
├── metadata.csv       # pipe-separated: filename|text  (no header)
└── wavs/
    ├── 0001.wav
    ├── 0002.wav
    └── ...
```

- Audio should be **mono WAV** at the target sample rate (default 22050 Hz)
- For multi-speaker datasets, add a third column: `filename|speaker_id|text`
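
The layout and audio requirements above can be checked before preprocessing. The sketch below is a hypothetical validator, not a phoonnx tool. It assumes the first column may be written with or without the `.wav` extension and that the text is always the last column, which covers both the two-column and three-column forms.

```python
import os
import wave


def check_ljspeech_dataset(dataset_dir, sample_rate=22050):
    """Return a list of problems found in an LJSpeech-style dataset directory."""
    metadata = os.path.join(dataset_dir, "metadata.csv")
    wavs_dir = os.path.join(dataset_dir, "wavs")
    if not os.path.isfile(metadata):
        return [f"missing {metadata}"]

    problems = []
    with open(metadata, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue
            parts = line.split("|")
            if len(parts) < 2 or not parts[-1].strip():
                problems.append(f"line {lineno}: expected filename|text")
                continue
            name = parts[0].strip()
            # Assumption: accept names with or without the .wav extension.
            wav_name = name if name.endswith(".wav") else name + ".wav"
            wav_path = os.path.join(wavs_dir, wav_name)
            if not os.path.isfile(wav_path):
                problems.append(f"line {lineno}: missing {wav_path}")
                continue
            try:
                with wave.open(wav_path, "rb") as wf:
                    if wf.getnchannels() != 1:
                        problems.append(f"{wav_name}: not mono")
                    if wf.getframerate() != sample_rate:
                        problems.append(
                            f"{wav_name}: {wf.getframerate()} Hz, expected {sample_rate} Hz"
                        )
            except wave.Error as exc:
                # The wave module only reads PCM WAV files.
                problems.append(f"{wav_name}: unreadable WAV ({exc})")
    return problems


# Usage in the notebook:
# for problem in check_ljspeech_dataset(DATASET_DIR, SAMPLE_RATE):
#     print("⚠ ", problem)
```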

### Available synthetic datasets

Browse ready-to-use datasets at:
**[huggingface.co/collections/TigreGotico/synthetic-tts-datasets](https://huggingface.co/collections/TigreGotico/synthetic-tts-datasets)**

Languages include: Arabic, Asturian, Aragonese, Basque, Catalan (4 dialects), Colombian Spanish,
Danish, Dutch, English (US/GB), Farsi, French, Galician, German, Hebrew, Hindi, Italian,
Japanese, Korean, Marwari, Mirandese, Polish, Portuguese (PT/BR), Romanian, Spanish, Swedish, and more.

## Appendix B · Understanding `config.json`

The preprocessing step produces a `config.json` that stores model + dataset parameters:

| Field | Description |
|---|---|
| `audio.sample_rate` | Training/inference sample rate (e.g. 22050) |
| `audio.quality` | Label from `--quality` flag or output folder name |
| `lang_code` | Language code used for phonemization (normalized via `langcodes`) |
| `inference.noise_scale` | Controls variability in speech (default 0.667) |
| `inference.length_scale` | Controls speech rate (1.0 = normal) |
| `inference.noise_w` | Controls variability in phoneme durations (default 0.8) |
| `inference.add_diacritics` | Arabic tashkeel / Hebrew nikud (default false) |
| `alphabet` | Phoneme alphabet: `ipa`, `arpa`, `pinyin`, … |
| `phoneme_type` | Which phonemizer was used |
| `phonemizer_model` | Only for ByT5-based phonemizers |
| `phoneme_id_map` | Symbol → numeric ID mapping |
| `num_speakers` | 1 for single-speaker, >1 for multi-speaker |
| `speaker_id_map` | Speaker label → ID (multi-speaker only) |
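
The fields in the table can be read with a small lookup helper. This sketch assumes the dotted names (for example `audio.sample_rate`) correspond to nested JSON objects, which the table suggests but does not state. `get_dotted` is a hypothetical helper and the example fragment is illustrative, not a real config.

```python
import json


def get_dotted(config, dotted_key, default=None):
    """Look up 'a.b.c' in nested dicts, returning `default` if any level is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Illustrative fragment only; a real config.json holds more fields.
example = json.loads(
    '{"audio": {"sample_rate": 22050}, "inference": {"noise_scale": 0.667}, "num_speakers": 1}'
)

for key in ("audio.sample_rate", "inference.noise_scale", "num_speakers", "inference.noise_w"):
    print(f"{key:24} {get_dotted(example, key, default='<missing>')}")
```

When fine-tuning, the same helper could compare `audio.sample_rate` in the base config against `SAMPLE_RATE` before training starts.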

## Appendix C · Troubleshooting

| Problem | Solution |
|---|---|
| **OOM (Out of Memory)** | Lower `BATCH_SIZE` to 8 or 4, or set `PRECISION = 16` |
| **Session timeout** | Re-run the training cell with `--resume-from-checkpoint` pointing at the most recent `.ckpt` in `CHECKPOINTS_DIR` |
| **espeak-ng not found** | Run the system-deps cell, or install manually |
| **Cython build fails** | Ensure `cython` is installed: `pip install cython` |
| **Dataset not found** | Check `DATASET_DIR` path and verify `metadata.csv` exists |
| **phoneme_id_map mismatch** | When fine-tuning, always pass `--prev-config` |
| **Slow training on CPU** | Use a GPU runtime — CPU is not practical for VITS |
| **Python version issues** | phoonnx training works best with Python ≤3.10; the notebook's virtual environment pins 3.10 |
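
A `phoneme_id_map` mismatch can be diagnosed by comparing the map in the base config with the one produced by preprocessing. The sketch below assumes `phoneme_id_map` is a top-level key in both JSON files, as the Appendix B table suggests. `diff_phoneme_maps` is a hypothetical helper.

```python
def diff_phoneme_maps(base_map, new_map):
    """Compare two symbol-to-ID maps and report symbols that differ."""
    base_keys, new_keys = set(base_map), set(new_map)
    return {
        "missing": sorted(base_keys - new_keys),  # in base, absent from new
        "extra": sorted(new_keys - base_keys),    # in new, absent from base
        "changed": sorted(k for k in base_keys & new_keys if base_map[k] != new_map[k]),
    }


# Usage in the notebook (both files are JSON):
# import json
# with open(LOCAL_CONFIG, encoding="utf-8") as f:
#     base = json.load(f)["phoneme_id_map"]
# with open(os.path.join(PREPROCESSED_DIR, "config.json"), encoding="utf-8") as f:
#     new = json.load(f)["phoneme_id_map"]
# print(diff_phoneme_maps(base, new))
```

Three empty lists indicate that the two maps agree, which is what passing `--prev-config` is meant to ensure.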