# Synthetic TTS Training Dataset Generator

Generates LJSpeech-style TTS training datasets from text prompts using a **single configurable donor TTS** voice and optional voice conversion.  
When a folder of reference voices is provided, **one complete LJSpeech dataset is produced per reference voice**.

You can reuse the `metadata.csv` of an existing TTS dataset, or find `.txt` prompts here:
  - https://huggingface.co/datasets/fdemelo/phonetic-piper-recording-studio-prompts
  - https://huggingface.co/collections/TigreGotico/synthetic-tts-datasets
  - https://github.com/rhasspy/piper-recording-studio/tree/master/prompts

### Pipeline stages

| # | Stage | Description |
|---|---|---|
| 0 | Configuration | Set all paths, parameters, and flags via env vars |
| 1 | Load Prompts | Read sentences from `prompts.txt`, a folder of `.txt` files, or `metadata.csv` |
| 2 | Donor TTS Synthesis | Single configured TTS engine + voice generates all prompts once |
| 3 | Post-Processing | Optional audio super-resolution, silence trimming, volume normalisation |
| 4 | Voice Conversion + Dataset Assembly | For each reference voice: VC all utterances → build LJSpeech dataset (`filename\|transcription`) |
| 5 | Upload to HuggingFace | Optional push to HF (one repo per voice) |
| 6 | Output Summary | File counts and sanity checks |

### Output structure
```
base_dir/
  1_donor_tts/           # Raw donor TTS audio (synthesised once)
  2_processed/           # Post-processed donor audio
  3_datasets/
    voice_alice/         # ← One per reference voice
      wavs/
        LJ001-0001.wav
      metadata.csv       # filename|transcription
    donor_raw/           # ← If no VC
      wavs/
      metadata.csv
```
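
As a quick sanity check on the layout above, the following sketch verifies that every row of `metadata.csv` points at an existing file in `wavs/`. The helper name `check_ljspeech_dir` is hypothetical and not part of the notebook; it only assumes the header-less `filename|transcription` format described above.

```python
import os
import tempfile


def check_ljspeech_dir(dataset_dir):
    """Return (n_rows, missing_stems) for an LJSpeech-style folder (hypothetical helper)."""
    missing = []
    n_rows = 0
    meta_path = os.path.join(dataset_dir, "metadata.csv")
    with open(meta_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            n_rows += 1
            # Split on the first pipe only, so a pipe inside the text is preserved
            stem, _, _text = line.partition("|")
            wav_path = os.path.join(dataset_dir, "wavs", stem + ".wav")
            if not os.path.exists(wav_path):
                missing.append(stem)
    return n_rows, missing


# Demo on a throwaway folder: two metadata rows, only one WAV present
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "wavs"))
open(os.path.join(demo_dir, "wavs", "LJ001-0001.wav"), "wb").close()
with open(os.path.join(demo_dir, "metadata.csv"), "w", encoding="utf-8") as f:
    f.write("LJ001-0001|Hello world.\nLJ001-0002|Second sentence.\n")

n_rows, missing = check_ljspeech_dir(demo_dir)
print(n_rows, missing)
```

Pointing the same helper at a folder under `3_datasets/` should report an empty `missing` list for a complete dataset.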

### Usage
1. Edit the configuration cell below.
2. Run cells sequentially.
3. Each stage is self-contained — pip deps are installed per step.
4. Existing output files are skipped on re-run (resume-safe).

---

> **Credits:** Based on pipelines from [TigreGotico/synthetic_dataset_generator](https://github.com/TigreGotico/synthetic_dataset_generator).  
> Funded through the [NGI0 Commons Fund](https://nlnet.nl/commonsfund) via [NLnet](https://nlnet.nl), with support from the European Commission's Next Generation Internet programme (grant No 101135429).

---
## 0 · Global Configuration

All parameters are set as environment variables via `os.environ.setdefault()`.  
Override any of them from your shell, a `.env` file, or `%env` magic **before** running the config cell.
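
The override behaviour follows from how `os.environ.setdefault()` works: it only writes a value when the key is not already present. The short sketch below illustrates this with two hypothetical variable names (`DEMO_TTS_LANG`, `DEMO_TARGET_SR`) so that it does not touch the real configuration.

```python
import os

# Simulate a value exported before the config cell runs (shell, .env, or %env)
os.environ["DEMO_TTS_LANG"] = "pt"
os.environ.pop("DEMO_TARGET_SR", None)  # make sure this one is unset for the demo

# What a config cell does: fill in defaults only for keys that are not set yet
os.environ.setdefault("DEMO_TTS_LANG", "en")
os.environ.setdefault("DEMO_TARGET_SR", "22050")

lang = os.environ["DEMO_TTS_LANG"]             # keeps the pre-set value
target_sr = int(os.environ["DEMO_TARGET_SR"])  # env values are strings; cast on read
print(lang, target_sr)
```

Because environment values are always strings, numeric and boolean settings have to be converted when they are read, which is what the later stages do with `int(...)`, `float(...)`, and `_flag(...)`.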

**Edit the values below to match your setup, then run the cell.**

In [None]:
import os

# ──────────────────────────────────────────────
# GENERAL
# ──────────────────────────────────────────────
os.environ.setdefault("TTS_LANG",            "en")                 # Language code (en, es, pt, …)
os.environ.setdefault("TTS_BASE_DIR",        "/data/tts_dataset")  # Root directory for all I/O
os.environ.setdefault("TTS_DATASET_NAME",    "my_tts_dataset")     # Base name for output datasets

# ──────────────────────────────────────────────
# INPUT SOURCES  (Stage 1)
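#   Set ONE of these three.  If several are set, Stage 1 uses the first
#   valid one in this order: PROMPTS_CSV, PROMPTS_DIR, PROMPTS_FILE.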
#   - PROMPTS_FILE   : plain text (one sentence per line) or
#                      piper-recording-studio format (sentence_id<TAB>text)
#   - PROMPTS_DIR    : folder of .txt files (each line = one sentence)
#   - PROMPTS_CSV    : CSV/TSV with a 'text' or 'transcript' column
# ──────────────────────────────────────────────
os.environ.setdefault("PROMPTS_FILE",        "prompts.txt")        # Path to prompts.txt
os.environ.setdefault("PROMPTS_DIR",         "")                   # Path to folder of .txt files
os.environ.setdefault("PROMPTS_CSV",         "")                   # Path to metadata.csv
os.environ.setdefault("CSV_TEXT_COL",        "text")               # Column name for text in CSV
os.environ.setdefault("CSV_SEP",             "|")                  # CSV separator (| for LJSpeech style)

# ──────────────────────────────────────────────
# DONOR TTS  (Stage 2)
#   A single TTS engine + voice used to synthesise all prompts.
#   Supported engines: edge, google, phoonnx
#
#   Edge voices   : `python -c "from ovos_tts_plugin_edge_tts import VOICES; print(VOICES)"`
#                   examples: en-US-JennyNeural, en-US-GuyNeural, en-GB-SoniaNeural
#   Google        : voice is ignored, only lang matters
#   Phoonnx       : ONNX runtime TTS (supports all piper voices + more)
#                   examples: OpenVoiceOS/phoonnx_pt-PT_miro_tugaphone
#                   repo: https://github.com/TigreGotico/phoonnx
# ──────────────────────────────────────────────
os.environ.setdefault("DONOR_ENGINE",        "edge")               # edge | google | phoonnx
os.environ.setdefault("DONOR_VOICE",         "en-US-JennyNeural") # Voice identifier for chosen engine
os.environ.setdefault("DONOR_RATE",          "+0%")                # Speech rate (Edge only, e.g. +10%, -5%)
os.environ.setdefault("DONOR_SLOW",          "0")                  # Google only: 1 = slow speed

# ── Phoonnx-specific options (only when DONOR_ENGINE=phoonnx) ────
os.environ.setdefault("PHOONNX_PHONETIC_SPELL", "1")               # Fix pronunciations for some words/langs
os.environ.setdefault("PHOONNX_NOISE_SCALE",    "0.667")           # Generator noise
os.environ.setdefault("PHOONNX_LENGTH_SCALE",   "1.0")             # Phoneme length (>1 = slower speech)
os.environ.setdefault("PHOONNX_NOISE_W",        "0.8")             # Phoneme width noise
os.environ.setdefault("PHOONNX_DIACRITICS",     "0")               # Add diacritics (Arabic/Hebrew only)

# ──────────────────────────────────────────────
# VOICE CONVERSION  (Stage 4)
#   A folder of reference voice WAVs.  One LJSpeech dataset is
#   generated per reference voice file.
#   Leave REF_VOICES_DIR empty to produce a single dataset from
#   the raw donor TTS audio (no voice conversion).
# ──────────────────────────────────────────────
os.environ.setdefault("REF_VOICES_DIR",      "")                   # Folder of reference voice WAVs
os.environ.setdefault("VC_DEVICE",           "cuda")               # cuda | cpu
os.environ.setdefault("VC_ENGINE",           "chatterbox")         # chatterbox | chatterbox_onnx

# ──────────────────────────────────────────────
# POST-PROCESSING  (Stage 3)
# ──────────────────────────────────────────────
os.environ.setdefault("TARGET_SR",           "22050")              # Final sample rate (22050 for LJSpeech)
os.environ.setdefault("TARGET_DBFS",         "-20.0")              # Volume normalisation target
os.environ.setdefault("MIN_DURATION",        "0.5")                # Seconds — discard shorter clips
os.environ.setdefault("MAX_DURATION",        "15.0")               # Seconds — discard longer clips
os.environ.setdefault("TRIM_SILENCE",        "1")                  # 1 = trim leading/trailing silence
os.environ.setdefault("TRIM_TOP_DB",         "30")                 # dB threshold for silence trimming

# ── Audio Super-Resolution/Resampling ─
#   Upscales ~16 kHz TTS audio to 48 kHz before final resample.
#   Leave SR_ENGINE empty to use plain librosa resampling.
#
#   novasr  : 52 kB model, ~3500x RT on GPU. 16→48 kHz.
#             https://github.com/ysharma3501/NovaSR
#   flashsr : ~500 kB ONNX, ~200-400x RT. 16→48 kHz. CPU-friendly.
#             https://github.com/ysharma3501/FlashSR
#   lavasr  : Vocos-based speech enhancement + upsampling.
#             https://github.com/ysharma3501/LavaSR
# ──────────────────────────────────────────────
os.environ.setdefault("SR_ENGINE",           "")                   # novasr | flashsr | lavasr | (empty=off)
os.environ.setdefault("SR_DEVICE",           "cpu")                # cuda | cpu  (novasr/lavasr only)
os.environ.setdefault("LAVASR_DENOISE",      "0")                  # 1 = enable LavaSR denoiser

# ──────────────────────────────────────────────
# HUGGINGFACE UPLOAD  (Stage 5)
# ──────────────────────────────────────────────
os.environ.setdefault("HF_UPLOAD",           "0")                  # 1 = push to HuggingFace
os.environ.setdefault("HF_REPO_PREFIX",      "")                   # e.g. your-user/tts — voice name is appended
os.environ.setdefault("HF_TOKEN",            "")                   # HF write token (or use huggingface-cli login)

# ──────────────────────────────────────────────
# Derived paths
# ──────────────────────────────────────────────
BASE = os.environ["TTS_BASE_DIR"]
LANG = os.environ["TTS_LANG"]
NAME = os.environ["TTS_DATASET_NAME"]

os.environ.setdefault("DIR_DONOR_TTS",       os.path.join(BASE, "1_donor_tts"))
os.environ.setdefault("DIR_PROCESSED",       os.path.join(BASE, "2_processed"))
os.environ.setdefault("DIR_DATASETS",        os.path.join(BASE, "3_datasets"))


def _env(key, default=""):
    return os.environ.get(key, default)


def _flag(key):
    return _env(key, "0").strip().lower() in ("1", "true", "yes")


print("Configuration loaded.")
print(f"  Language       : {LANG}")
print(f"  Dataset name   : {NAME}")
print(f"  Base directory : {BASE}")
print(f"  Donor TTS      : {_env('DONOR_ENGINE')} / {_env('DONOR_VOICE')}")
if _env('DONOR_ENGINE') == 'phoonnx':
    print(f"    phonetic_spell={_flag('PHOONNX_PHONETIC_SPELL')}  noise_scale={_env('PHOONNX_NOISE_SCALE')}  "
          f"length_scale={_env('PHOONNX_LENGTH_SCALE')}  noise_w={_env('PHOONNX_NOISE_W')}  "
          f"diacritics={_flag('PHOONNX_DIACRITICS')}")
sr_engine = _env('SR_ENGINE')
if sr_engine:
    print(f"  Super-res      : {sr_engine} on {_env('SR_DEVICE')}")
else:
    print(f"  Super-res      : off (librosa resample to {_env('TARGET_SR')} Hz)")
ref_dir = _env('REF_VOICES_DIR')
if ref_dir:
    from pathlib import Path as _P
    n_refs = len(list(_P(ref_dir).rglob('*.wav'))) if _P(ref_dir).is_dir() else 0
    print(f"  VC enabled     : yes ({n_refs} reference voices in {ref_dir})")
    print(f"  VC engine      : {_env('VC_ENGINE')} on {_env('VC_DEVICE')}")
else:
    print(f"  VC enabled     : no (single donor-only dataset)")
print(f"  HF upload      : {'yes' if _flag('HF_UPLOAD') else 'no'}")

---
## 1 · Load Prompts

Reads text prompts from one of three sources (in priority order):
1. `PROMPTS_CSV` — a CSV/TSV with a text column (pipe-separated LJSpeech style supported)
2. `PROMPTS_DIR` — a folder of `.txt` files, each containing one sentence per line
3. `PROMPTS_FILE` — a single text file with one sentence per line, or piper-recording-studio format (`sentence_id\ttext`)

Output is a deduplicated list of sentences stored in `PROMPTS`.
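
The line handling and deduplication described above can be sketched in a few lines. This is a simplified illustration, not the notebook's loader: the function names `parse_prompt_line` and `dedupe_keep_order` are hypothetical, and the sample input is made up.

```python
import re


def parse_prompt_line(line):
    """Return the sentence text of one prompt line, or '' if the line is unusable."""
    line = line.strip()
    if "\t" in line:
        # piper-recording-studio style: drop the sentence_id before the first tab
        line = line.split("\t", 1)[1]
    line = re.sub(r"\s+", " ", line).strip()
    return line if len(line) > 1 else ""


def dedupe_keep_order(sentences):
    """Drop case-insensitive duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for s in sentences:
        key = s.lower()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


raw = ["1\tHello  world.", "hello world.", "", "2\tA second   sentence.", "x"]
prompts = dedupe_keep_order([t for t in map(parse_prompt_line, raw) if t])
print(prompts)
```

Note that the first spelling of a duplicated sentence wins, so the casing of the earliest occurrence is the one that ends up in the dataset.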

In [None]:
import os
import re
from pathlib import Path


def _clean_text(text):
    """Strip and collapse whitespace.  Returns empty string for junk."""
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text if len(text) > 1 else ""


def load_prompts_from_file(path):
    """One sentence per line, or piper-recording-studio TSV: sentence_id<TAB>text."""
    lines = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # piper-recording-studio format: sentence_id<TAB>text
            if "\t" in line:
                parts = line.split("\t", 1)
                line = parts[-1]  # discard sentence_id, keep text
            t = _clean_text(line)
            if t:
                lines.append(t)
    return lines


def load_prompts_from_dir(folder):
    """Read every .txt file in folder, one sentence per line."""
    lines = []
    for txt in sorted(Path(folder).rglob("*.txt")):
        lines.extend(load_prompts_from_file(txt))
    return lines


def load_prompts_from_csv(path, text_col, sep):
    """Read a CSV; auto-detects column by name or position."""
    import csv
    lines = []
    with open(path, "r", encoding="utf-8") as f:
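        # Sniff whether there's a header.  Sniffer raises csv.Error when it
        # cannot make sense of the sample, so fall back to "no header".
        sample = f.read(4096)
        f.seek(0)
        try:
            has_header = csv.Sniffer().has_header(sample)
        except csv.Error:
            has_header = False
        reader = csv.reader(f, delimiter=sep)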

        if has_header:
            header = next(reader)
            header_lower = [h.strip().lower() for h in header]
            col_idx = None
            for candidate in [text_col.lower(), "text", "transcript", "transcription",
                              "sentence"]:
                if candidate in header_lower:
                    col_idx = header_lower.index(candidate)
                    break
            if col_idx is None:
                # LJSpeech: filename|text — text is column 1 or last
                col_idx = min(1, len(header) - 1)
                print(f"  Warning: text column '{text_col}' not found in header {header}.")
                print(f"           Falling back to column index {col_idx}.")
        else:
            # No header — assume LJSpeech: col 0 = filename, col 1 = text
            col_idx = 1 if sep == "|" else 0

        for row in reader:
            if len(row) <= col_idx:
                continue
            t = _clean_text(row[col_idx])
            if t:
                lines.append(t)
    return lines


# ── Load from the configured source ──────────────────────────────────
csv_path  = _env("PROMPTS_CSV")
dir_path  = _env("PROMPTS_DIR")
file_path = _env("PROMPTS_FILE")

raw_prompts = []
if csv_path and os.path.exists(csv_path):
    print(f"Loading prompts from CSV: {csv_path}")
    raw_prompts = load_prompts_from_csv(csv_path, _env("CSV_TEXT_COL", "text"), _env("CSV_SEP", "|"))
elif dir_path and os.path.isdir(dir_path):
    print(f"Loading prompts from directory: {dir_path}")
    raw_prompts = load_prompts_from_dir(dir_path)
elif file_path and os.path.exists(file_path):
    print(f"Loading prompts from file: {file_path}")
    raw_prompts = load_prompts_from_file(file_path)
else:
    print("ERROR: No valid prompt source found.")
    print("  Set one of: PROMPTS_FILE, PROMPTS_DIR, or PROMPTS_CSV")

# Deduplicate while preserving order
seen = set()
PROMPTS = []
for text in raw_prompts:
    key = text.lower().strip()
    if key not in seen:
        seen.add(key)
        PROMPTS.append(text)

print(f"  Loaded {len(raw_prompts)} lines -> {len(PROMPTS)} unique prompts")
if PROMPTS:
    print(f"  First: {PROMPTS[0][:80]}{'…' if len(PROMPTS[0]) > 80 else ''}")
    print(f"  Last : {PROMPTS[-1][:80]}{'…' if len(PROMPTS[-1]) > 80 else ''}")

---
## 2 · Donor TTS Synthesis

Uses the **single configured donor voice** to synthesise every prompt exactly once.  
This is the "master" audio that gets voice-converted into each target voice in Stage 4.

| Key env vars | |
|---|---|
| `DONOR_ENGINE` | `edge` / `google` / `phoonnx` |
| `DONOR_VOICE` | Voice identifier (engine-specific) |
| `DONOR_RATE` | Speech rate override (Edge only) |
| `PHOONNX_*` | Phoonnx synthesis parameters |
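
This stage is resumable because each finished utterance is appended to a JSONL manifest, and a prompt is skipped only when both its manifest entry and its output file exist. The sketch below shows that pattern in isolation. `run_with_resume` and `fake_tts` are hypothetical names, and `fake_tts` is a stand-in that writes bytes rather than calling a real TTS engine.

```python
import json
import os
import tempfile


def run_with_resume(items, out_dir, work):
    """Process each item once; `work(idx, text, path)` must write the output file."""
    os.makedirs(out_dir, exist_ok=True)
    manifest_path = os.path.join(out_dir, "manifest.jsonl")

    # Load what a previous run already finished
    done = {}
    if os.path.exists(manifest_path):
        with open(manifest_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    done[rec["idx"]] = rec

    new = 0
    with open(manifest_path, "a", encoding="utf-8") as mf:
        for idx, text in enumerate(items):
            path = os.path.join(out_dir, f"{idx:06d}.wav")
            # Skip only if both the manifest entry and the file exist
            if idx in done and os.path.exists(path):
                continue
            work(idx, text, path)
            rec = {"idx": idx, "wav": os.path.basename(path), "text": text}
            mf.write(json.dumps(rec, ensure_ascii=False) + "\n")
            mf.flush()  # keep the manifest usable if the run is interrupted
            done[idx] = rec
            new += 1
    return done, new


def fake_tts(idx, text, path):
    # Stand-in for a real TTS call: just writes the text as bytes
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


out_dir = tempfile.mkdtemp()
_, first_run = run_with_resume(["one", "two", "three"], out_dir, fake_tts)
_, second_run = run_with_resume(["one", "two", "three"], out_dir, fake_tts)
print(first_run, second_run)
```

One consequence of keying on the prompt index is that changing the order or content of the prompt list between runs can pair old audio with new text, so it is safer to start from an empty output folder after editing the prompts.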

In [None]:
donor_engine = _env("DONOR_ENGINE", "edge")
if donor_engine == "edge":
    !pip install --quiet --break-system-packages ovos_tts_plugin_edge_tts
elif donor_engine == "google":
    !pip install --quiet --break-system-packages ovos_tts_plugin_google_tx
elif donor_engine == "phoonnx":
    !pip install --quiet --break-system-packages phoonnx ovos-utils
!pip install --quiet --break-system-packages tqdm

In [None]:
import os
import sys
import json
import time
import warnings
from pathlib import Path
from tqdm import tqdm

warnings.filterwarnings("ignore")


# ═══════════════════════════════════════════════════════════════════
# Instantiate the single donor TTS plugin
# ═══════════════════════════════════════════════════════════════════

def _make_donor_plugin(engine: str, voice: str, lang: str, rate: str, slow: bool):
    """
    Create a single TTS plugin instance from top-level config.
    Returns (plugin, voice_id); voice_id is recorded in the manifest's
    'voice' field (output WAVs are named by prompt index).
    """
    if engine == "edge":
        from ovos_tts_plugin_edge_tts import EdgeTTSPlugin
        cfg = {"voice": voice}
        if rate:
            cfg["rate"] = rate
        return EdgeTTSPlugin(config=cfg), f"edge_{voice}"

    if engine == "google":
        from ovos_tts_plugin_google_tx import GoogleTranslateTTS
        cfg = {"lang": lang, "slow": slow}
        return GoogleTranslateTTS(config=cfg), f"google_{lang}"

    if engine == "phoonnx":
        from phoonnx.opm import PhoonnxTTSPlugin
        cfg = {
            "lang": lang,
            "enable_phonetic_spelling": _flag("PHOONNX_PHONETIC_SPELL"),
            "noise-scale":  float(_env("PHOONNX_NOISE_SCALE", "0.667")),
            "length-scale": float(_env("PHOONNX_LENGTH_SCALE", "1.0")),
            "noise-w":      float(_env("PHOONNX_NOISE_W", "0.8")),
            "add_diacritics": _flag("PHOONNX_DIACRITICS"),
        }
        if voice and voice != "default":
            cfg["voice"] = voice
        return PhoonnxTTSPlugin(config=cfg), f"phoonnx_{voice}"

    raise ValueError(f"Unknown donor engine: {engine}")


# ═══════════════════════════════════════════════════════════════════
# Synthesis loop — one WAV per prompt, single donor voice
# ═══════════════════════════════════════════════════════════════════

def synthesize_with_donor(
    prompts: list,
    plugin,
    voice_id: str,
    lang: str,
    output_dir: str,
):
    """
    Synthesise every prompt with the donor plugin.  Saves WAVs as
    {idx:06d}.wav plus a JSONL manifest for traceability.
    Resume-safe: skips prompts whose output WAV already exists.
    """
    os.makedirs(output_dir, exist_ok=True)
    manifest_path = os.path.join(output_dir, "tts_manifest.jsonl")

    # Load existing manifest entries for resume
    existing = {}
    if os.path.exists(manifest_path):
        with open(manifest_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    existing[rec["idx"]] = rec
        print(f"  Resuming: {len(existing)} already synthesised.")

    manifest_f = open(manifest_path, "a", encoding="utf-8")
    success, fail, skipped = 0, 0, 0
    result_map = dict(existing)
    start = time.time()

    for idx, text in enumerate(tqdm(prompts, desc="Donor TTS", unit="utt", file=sys.stdout)):
        wav_name = f"{idx:06d}.wav"
        wav_path = os.path.join(output_dir, wav_name)

        if idx in existing and os.path.exists(wav_path):
            skipped += 1
            continue

        try:
            plugin.get_tts(text, wav_path, lang=lang)
            rec = {
                "idx": idx,
                "wav": wav_name,
                "text": text,
                "engine": _env("DONOR_ENGINE"),
                "voice": voice_id,
            }
            manifest_f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            manifest_f.flush()
            result_map[idx] = rec
            success += 1
        except Exception as e:
            tqdm.write(f"  Failed #{idx}: {e}")
            fail += 1

    manifest_f.close()
    elapsed = time.time() - start
    print(f"  Done: {success} new, {skipped} skipped, {fail} failed  ({elapsed:.1f}s)")
    return result_map


# ═══════════════════════════════════════════════════════════════════
# Run donor TTS
# ═══════════════════════════════════════════════════════════════════

if not PROMPTS:
    print("No prompts loaded — run Stage 1 first.")
else:
    engine   = _env("DONOR_ENGINE", "edge")
    voice    = _env("DONOR_VOICE", "en-US-JennyNeural")
    rate     = _env("DONOR_RATE", "+0%")
    slow     = _flag("DONOR_SLOW")

    print(f"Donor: {engine} / {voice}")
    donor_plugin, donor_voice_id = _make_donor_plugin(engine, voice, LANG, rate, slow)

    donor_dir = _env("DIR_DONOR_TTS")
    TTS_MANIFEST = synthesize_with_donor(
        prompts=PROMPTS,
        plugin=donor_plugin,
        voice_id=donor_voice_id,
        lang=LANG,
        output_dir=donor_dir,
    )
    print(f"\nDonor TTS complete.  {len(TTS_MANIFEST)} utterances in {donor_dir}")

---
## 3 · Post-Processing

For each donor WAV:
1. Load at the native sample rate and downmix to mono
2. **Audio super-resolution** (optional): upscale ~16 kHz TTS to 48 kHz using NovaSR or FlashSR, or enhance it with LavaSR
3. Resample to target sample rate (default 22050 Hz for LJSpeech)
4. Trim leading/trailing silence
5. Filter by min/max duration (clips outside the range are dropped)
6. Normalise volume to target dBFS

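When `SR_ENGINE` is set, the audio is first resampled to 16 kHz if needed and passed through the SR model.
NovaSR and FlashSR return 48 kHz audio, which a final `librosa.resample(48000 → TARGET_SR)` brings to the
desired output rate. The LavaSR path in this notebook returns 16 kHz audio, so its final resample starts from 16 kHz.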

| Key env vars | |
|---|---|
| `SR_ENGINE` | `novasr` / `flashsr` / `lavasr` / empty=off |
| `SR_DEVICE` | `cuda` / `cpu` (novasr/lavasr) |
| `LAVASR_DENOISE` | `1` = enable LavaSR denoiser |
| `TARGET_SR` | Final sample rate (22050) |
| `TARGET_DBFS` | Volume target in dBFS |
| `MIN_DURATION` / `MAX_DURATION` | Duration filter (seconds) |
| `TRIM_SILENCE` | 1 = trim silence |
| `TRIM_TOP_DB` | Silence threshold (dB) |
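
The volume normalisation and duration filter can be illustrated on a synthetic signal. This is a minimal sketch using only NumPy, assuming float audio in the range [-1, 1]; the helper names `normalise_rms` and `rms_dbfs` are hypothetical, and the test tone stands in for a real donor WAV.

```python
import numpy as np


def normalise_rms(wav, target_dbfs=-20.0, eps=1e-9):
    """Scale `wav` so that its RMS level is approximately `target_dbfs`."""
    rms = np.sqrt(np.mean(wav ** 2) + eps)
    current_db = 20.0 * np.log10(rms + eps)
    gain = 10.0 ** ((target_dbfs - current_db) / 20.0)
    # Hard-clip to the valid float range; very loud peaks may be flattened
    return np.clip(wav * gain, -1.0, 1.0)


def rms_dbfs(wav, eps=1e-9):
    """RMS level of a float signal in dBFS."""
    return 20.0 * np.log10(np.sqrt(np.mean(wav ** 2)) + eps)


sr = 22050
t = np.arange(sr) / sr                       # one second of audio
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)   # a quiet 220 Hz tone

loud = normalise_rms(quiet, target_dbfs=-20.0)
duration = len(loud) / sr
keep = 0.5 <= duration <= 15.0               # MIN_DURATION / MAX_DURATION defaults

print(round(rms_dbfs(quiet), 1), round(rms_dbfs(loud), 1), keep)
```

Because the gain is applied before a hard clip, a clip with a high crest factor can end up slightly below the target level; a lower `TARGET_DBFS` leaves more headroom.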

In [None]:
sr_engine = _env("SR_ENGINE").strip().lower()
if sr_engine == "novasr":
    !pip install --quiet --break-system-packages git+https://github.com/ysharma3501/NovaSR.git
elif sr_engine == "flashsr":
    !pip install --quiet --break-system-packages git+https://github.com/ysharma3501/FlashSR.git
elif sr_engine == "lavasr":
    !pip install --quiet --break-system-packages git+https://github.com/ysharma3501/LavaSR.git
!pip install --quiet --break-system-packages librosa soundfile numpy tqdm

In [None]:
import os
import sys
import numpy as np
import librosa
import soundfile as sf
from pathlib import Path
from tqdm import tqdm


# ═══════════════════════════════════════════════════════════════════
# Load SR model once (if enabled)
# ═══════════════════════════════════════════════════════════════════

def _load_sr_model(engine: str, device: str):
    """Load and return the super-resolution model + its output sample rate."""
    if engine == "novasr":
        from NovaSR import FastSR
        model = FastSR(device=device)
        return model, 48000

    if engine == "flashsr":
        from FastAudioSR import FASR
        from huggingface_hub import hf_hub_download
        ckpt = hf_hub_download(repo_id="YatharthS/FlashSR", filename="upsampler.pth", local_dir=".")
        model = FASR(ckpt)
        _ = model.model.half()
        return model, 48000

    if engine == "lavasr":
        from LavaSR.model import LavaEnhance
        model = LavaEnhance("YatharthS/LavaSR", device)
        return model, 16000

    raise ValueError(f"Unknown SR engine: {engine}")


def _apply_sr(wav_np: np.ndarray, sr_in: int, engine: str, model, sr_out: int) -> tuple:
    """
    Apply super-resolution to a numpy audio array.
    Returns (upsampled_wav, new_sample_rate).
    """
    import torch

    if engine == "novasr":
        # NovaSR expects 16 kHz input loaded via its own loader, but also
        # works with raw tensors.  Resample to 16k first if needed.
        if sr_in != 16000:
            wav_np = librosa.resample(wav_np, orig_sr=sr_in, target_sr=16000)
        lowres = torch.from_numpy(wav_np).unsqueeze(0)
        highres = model.infer(lowres).cpu().numpy().squeeze()
        return highres, 48000

    if engine == "flashsr":
        if sr_in != 16000:
            wav_np = librosa.resample(wav_np, orig_sr=sr_in, target_sr=16000)
        lowres = torch.from_numpy(wav_np).unsqueeze(0).half()
        highres = model.run(lowres).cpu().numpy().squeeze()
        return highres, 48000

    if engine == "lavasr":
        if sr_in != 16000:
            wav_np = librosa.resample(wav_np, orig_sr=sr_in, target_sr=16000)
        lowres = torch.from_numpy(wav_np).unsqueeze(0)
        denoise = _flag("LAVASR_DENOISE")
        highres = model.enhance(lowres, denoise=denoise).cpu().numpy().squeeze()
        return highres, 16000  # LavaSR outputs at 16 kHz enhanced

    # Should not reach here
    return wav_np, sr_in


# ═══════════════════════════════════════════════════════════════════
# Main post-processing loop
# ═══════════════════════════════════════════════════════════════════

def postprocess_audio(
    manifest: dict,
    source_dir: str,
    output_dir: str,
    target_sr: int = 22050,
    target_dbfs: float = -20.0,
    min_dur: float = 0.5,
    max_dur: float = 15.0,
    trim: bool = True,
    trim_db: int = 30,
    sr_engine: str = "",
    sr_model=None,
    sr_model_out_rate: int = 48000,
):
    """
    Post-process each donor WAV: mono load, optional SR, resample, trim,
    duration filter, normalise.
    Returns a filtered manifest of successfully processed utterances.
    Resume-safe: skips files that already exist in output_dir.
    """
    os.makedirs(output_dir, exist_ok=True)
    kept, dropped, skipped = 0, 0, 0
    filtered_manifest = {}

    for idx in tqdm(sorted(manifest.keys()), desc="Post-processing", unit="utt", file=sys.stdout):
        rec = manifest[idx]
        src_path = os.path.join(source_dir, rec["wav"])
        dst_path = os.path.join(output_dir, rec["wav"])

        # Resume check
        if os.path.exists(dst_path):
            try:
                info = sf.info(dst_path)
                dur = info.frames / info.samplerate
                if min_dur <= dur <= max_dur:
                    filtered_manifest[idx] = rec
                    skipped += 1
                    continue
            except Exception:
                pass  # re-process

        if not os.path.exists(src_path):
            dropped += 1
            continue

        try:
            # Load at native SR first (for SR models that want 16k input)
            wav, sr = librosa.load(src_path, sr=None, mono=True)

            # Audio super-resolution (if enabled)
            if sr_engine and sr_model is not None:
                wav, sr = _apply_sr(wav, sr, sr_engine, sr_model, sr_model_out_rate)

            # Resample to final target SR
            if sr != target_sr:
                wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
                sr = target_sr

            if trim:
                wav, _ = librosa.effects.trim(wav, top_db=trim_db)

            dur = len(wav) / sr
            if dur < min_dur or dur > max_dur:
                dropped += 1
                continue

            # Volume normalisation
            rms = np.sqrt(np.mean(wav ** 2) + 1e-9)
            current_db = 20 * np.log10(rms + 1e-9)
            gain_linear = 10 ** ((target_dbfs - current_db) / 20.0)
            wav = np.clip(wav * gain_linear, -1.0, 1.0)

            sf.write(dst_path, wav, sr)
            filtered_manifest[idx] = rec
            kept += 1

        except Exception as e:
            tqdm.write(f"  Error processing {rec['wav']}: {e}")
            dropped += 1

    print(f"  Result: {kept} new + {skipped} existing = {kept + skipped} kept, {dropped} dropped")
    return filtered_manifest


# ═══════════════════════════════════════════════════════════════════
# Run post-processing
# ═══════════════════════════════════════════════════════════════════

sr_engine_name = _env("SR_ENGINE").strip().lower()
sr_model_obj, sr_out_rate = None, 48000

if sr_engine_name:
    print(f"Loading SR model: {sr_engine_name} on {_env('SR_DEVICE', 'cpu')} ...")
    sr_model_obj, sr_out_rate = _load_sr_model(sr_engine_name, _env("SR_DEVICE", "cpu"))
    print("  SR model loaded.")

PROCESSED_MANIFEST = postprocess_audio(
    manifest=TTS_MANIFEST,
    source_dir=_env("DIR_DONOR_TTS"),
    output_dir=_env("DIR_PROCESSED"),
    target_sr=int(_env("TARGET_SR", "22050")),
    target_dbfs=float(_env("TARGET_DBFS", "-20.0")),
    min_dur=float(_env("MIN_DURATION", "0.5")),
    max_dur=float(_env("MAX_DURATION", "15.0")),
    trim=_flag("TRIM_SILENCE"),
    trim_db=int(_env("TRIM_TOP_DB", "30")),
    sr_engine=sr_engine_name,
    sr_model=sr_model_obj,
    sr_model_out_rate=sr_out_rate,
)

# Free SR model memory
if sr_model_obj is not None:
    del sr_model_obj
    try:
        import torch
        torch.cuda.empty_cache()
    except Exception:
        pass

print(f"\nPost-processing complete. {len(PROCESSED_MANIFEST)} utterances ready.")

---
## 4 · Voice Conversion + LJSpeech Dataset Assembly

For **each reference voice WAV** in `REF_VOICES_DIR`:
1. Voice-convert every processed donor utterance to that voice
2. Assemble a complete LJSpeech dataset (`wavs/` + `metadata.csv`)

If `REF_VOICES_DIR` is empty, a single dataset is built directly from the donor audio (no VC).

### Output per voice
```
3_datasets/
  {voice_stem}/
    wavs/
      LJ001-0001.wav
      ...
    metadata.csv   # filename|transcription  (no header)
```

| Key env vars | |
|---|---|
| `REF_VOICES_DIR` | Folder of reference voice WAVs |
| `VC_ENGINE` | `chatterbox` or `chatterbox_onnx` |
| `VC_DEVICE` | `cuda` / `cpu` |
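
The naming and metadata step can be sketched without touching any audio. The generator below is a simplified illustration, with `ljspeech_rows` as a hypothetical name. It numbers utterances consecutively in index order, as the layout above implies. Replacing pipe characters in the text is an added assumption, made here so that each row keeps exactly two fields.

```python
def ljspeech_rows(manifest, chapter_prefix="LJ001"):
    """Yield (source_wav, lj_name, metadata_line) in index order."""
    for count, idx in enumerate(sorted(manifest), start=1):
        rec = manifest[idx]
        lj_name = f"{chapter_prefix}-{count:04d}"
        # Assumption (not in the notebook): strip pipes so the row stays two fields
        text = rec["text"].replace("|", " ")
        yield rec["wav"], lj_name, f"{lj_name}|{text}"


manifest = {
    0: {"wav": "000000.wav", "text": "Hello world."},
    3: {"wav": "000003.wav", "text": "Gaps in the index are fine."},
}
rows = list(ljspeech_rows(manifest))
for src, lj_name, line in rows:
    print(src, "->", lj_name + ".wav", "|", line)
```

The output names are renumbered without gaps, so an utterance dropped in Stage 3 does not leave a hole in the `LJ001-NNNN` sequence.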

In [None]:
ref_voices_dir = _env("REF_VOICES_DIR")
vc_engine = _env("VC_ENGINE", "chatterbox")

if ref_voices_dir:
    if vc_engine == "chatterbox_onnx":
        !pip install --quiet --break-system-packages chatterbox_onnx tqdm
    else:
        !pip install --quiet --break-system-packages chatterbox torch torchaudio tqdm
else:
    print("REF_VOICES_DIR not set — will build a single donor-only dataset.")

In [None]:
import os
import sys
import time
import shutil
from pathlib import Path
from tqdm import tqdm


def build_ljspeech_from_dir(
    manifest: dict,
    audio_dir: str,
    output_dir: str,
    chapter_prefix: str = "LJ001",
):
    """
    Copy WAVs into wavs/ with LJSpeech naming and write metadata.csv.
    Format: filename|transcription  (no header, pipe-separated).
    Returns the number of utterances written.
    """
    wavs_dir = os.path.join(output_dir, "wavs")
    os.makedirs(wavs_dir, exist_ok=True)
    meta_lines = []
    count = 0

    for idx in sorted(manifest.keys()):
        rec = manifest[idx]
        src = os.path.join(audio_dir, rec["wav"])
        if not os.path.exists(src):
            continue
        count += 1
        lj_name = f"{chapter_prefix}-{count:04d}"
        shutil.copy2(src, os.path.join(wavs_dir, f"{lj_name}.wav"))
        meta_lines.append(f"{lj_name}|{rec['text']}")

    meta_path = os.path.join(output_dir, "metadata.csv")
    with open(meta_path, "w", encoding="utf-8") as f:
        f.write("\n".join(meta_lines) + "\n")
    return count


def vc_and_build_dataset(
    manifest: dict,
    processed_dir: str,
    ref_voice_path: str,
    dataset_dir: str,
    model,
    engine: str,
    target_sr: int,
):
    """
    Voice-convert every utterance to a single reference voice,
    then assemble a LJSpeech dataset from the results.
    Resume-safe: skips WAVs that already exist in the temp VC dir.
    Cleans up intermediate VC files after assembly to save disk.
    """
    vc_tmp = os.path.join(dataset_dir, "_vc_tmp")
    os.makedirs(vc_tmp, exist_ok=True)

    success, fail, skipped = 0, 0, 0

    for idx in tqdm(
        sorted(manifest.keys()),
        desc=f"  VC {Path(ref_voice_path).stem}",
        unit="utt",
        file=sys.stdout,
    ):
        rec = manifest[idx]
        src_path = os.path.join(processed_dir, rec["wav"])
        dst_path = os.path.join(vc_tmp, rec["wav"])

        if os.path.exists(dst_path):
            skipped += 1
            continue
        if not os.path.exists(src_path):
            fail += 1
            continue

        try:
            if engine == "chatterbox_onnx":
                model.voice_convert(
                    source_audio_path=src_path,
                    target_voice_path=ref_voice_path,
                    output_file_name=dst_path,
                )
            else:
                import torchaudio as ta
                wav = model.generate(
                    audio=src_path,
                    target_voice_path=ref_voice_path,
                )
                ta.save(dst_path, wav, model.sr)
            success += 1
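        except Exception as e:
            # Report the failure instead of swallowing it silently
            tqdm.write(f"    VC failed for {rec['wav']}: {e}")
            fail += 1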

    print(f"    VC: {success} new, {skipped} resumed, {fail} failed")

    # Assemble LJSpeech from the VC output
    n = build_ljspeech_from_dir(manifest, vc_tmp, dataset_dir)
    print(f"    Dataset: {n} utterances -> {dataset_dir}")

    # Clean up temp VC files to save disk
    shutil.rmtree(vc_tmp, ignore_errors=True)
    return n


# ═══════════════════════════════════════════════════════════════════
# Run: iterate over reference voices (or build donor-only dataset)
# ═══════════════════════════════════════════════════════════════════

datasets_dir   = _env("DIR_DATASETS")
processed_dir  = _env("DIR_PROCESSED")
ref_voices_dir = _env("REF_VOICES_DIR")
vc_engine      = _env("VC_ENGINE", "chatterbox")
vc_device      = _env("VC_DEVICE", "cuda")
target_sr      = int(_env("TARGET_SR", "22050"))

BUILT_DATASETS = {}  # voice_name -> dataset_dir

if not ref_voices_dir:
    # ── No VC: build a single dataset from processed donor audio ──
    print("No reference voices — building donor-only dataset.")
    ds_dir = os.path.join(datasets_dir, "donor_raw")
    n = build_ljspeech_from_dir(PROCESSED_MANIFEST, processed_dir, ds_dir)
    BUILT_DATASETS["donor_raw"] = ds_dir
    print(f"  Built {n} utterances -> {ds_dir}")

else:
    # ── Discover reference voices ─────────────────────────────────
    ref_files = sorted(Path(ref_voices_dir).rglob("*.wav"))
    if not ref_files:
        print(f"ERROR: No .wav files found in {ref_voices_dir}")
    else:
        print(f"Found {len(ref_files)} reference voice(s). Loading VC model...")

        # Load model once, reuse for all voices
        if vc_engine == "chatterbox_onnx":
            from chatterbox_onnx import ChatterboxOnnx
            vc_model = ChatterboxOnnx(device=vc_device)
        else:
            from chatterbox.vc import ChatterboxVC
            vc_model = ChatterboxVC.from_pretrained(vc_device)
        print("  VC model loaded.\n")

        start_all = time.time()

        for ri, ref_path in enumerate(ref_files):
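            # Dataset name comes from the file stem, e.g. "alice" from alice.wav.
            # rglob searches subfolders too, so stems must be unique across the whole tree.
            voice_name = ref_path.stem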
            ds_dir = os.path.join(datasets_dir, voice_name)

            # Skip if this dataset already looks complete
            meta_file = os.path.join(ds_dir, "metadata.csv")
            if os.path.exists(meta_file):
                # Count non-empty metadata rows; close the file explicitly.
                with open(meta_file, encoding="utf-8") as fh:
                    existing_n = sum(1 for line in fh if line.strip())
                if existing_n >= len(PROCESSED_MANIFEST):
                    print(f"[{ri+1}/{len(ref_files)}] {voice_name}: already complete ({existing_n} utts) — skipping.")
                    BUILT_DATASETS[voice_name] = ds_dir
                    continue

            print(f"[{ri+1}/{len(ref_files)}] Processing voice: {voice_name}")
            n = vc_and_build_dataset(
                manifest=PROCESSED_MANIFEST,
                processed_dir=processed_dir,
                ref_voice_path=str(ref_path),
                dataset_dir=ds_dir,
                model=vc_model,
                engine=vc_engine,
                target_sr=target_sr,
            )
            BUILT_DATASETS[voice_name] = ds_dir
            print()

        # Free GPU memory
        del vc_model
        try:
            import torch
            torch.cuda.empty_cache()
        except Exception:
            pass

        elapsed = time.time() - start_all
        print(f"\nAll voices done. {len(BUILT_DATASETS)} datasets in {elapsed:.1f}s")

print(f"\nDatasets built: {list(BUILT_DATASETS.keys())}")

---
## 5 · Upload to HuggingFace (Optional)

Pushes each per-voice LJSpeech dataset to HuggingFace.  
Repo names are `{HF_REPO_PREFIX}_{voice_name}` (e.g. `user/tts_alice`).

| Env var | Description |
|---|---|
| `HF_UPLOAD` | `1` to enable |
| `HF_REPO_PREFIX` | e.g. `your-user/tts` — voice name is appended |
| `HF_TOKEN` | Write token (optional if already logged in) |
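
The naming rule above can be sketched as a small function. This is a minimal illustration, not part of the notebook: `repo_id_for` is a hypothetical helper, and the character filter is an assumption about which characters a repo name accepts. The upload cell below joins the prefix and voice name directly, without any filtering.

```python
import re


def repo_id_for(prefix: str, voice_name: str) -> str:
    """Hypothetical helper: mirror the `{HF_REPO_PREFIX}_{voice_name}` rule."""
    # Assumption: only letters, digits, ".", "_" and "-" are safe in a repo name,
    # so any other run of characters (spaces, accents, ...) becomes a single "-".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", voice_name).strip("-.")
    return f"{prefix}_{safe}"


print(repo_id_for("your-user/tts", "alice"))        # your-user/tts_alice
print(repo_id_for("your-user/tts", "Alice Smith"))  # your-user/tts_Alice-Smith
```

If reference files have stems with spaces or punctuation, renaming them before running the pipeline may be simpler than filtering at upload time, since the dataset folder name uses the same stem.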

In [None]:
if _flag("HF_UPLOAD"):
    !pip install --quiet --break-system-packages huggingface_hub
else:
    print("HF_UPLOAD not enabled — skipping install.")

In [None]:
import os

if not _flag("HF_UPLOAD"):
    print("HF_UPLOAD not enabled — skipping upload.")
else:
    repo_prefix = _env("HF_REPO_PREFIX")
    hf_token    = _env("HF_TOKEN") or None

    if not repo_prefix:
        print("ERROR: HF_REPO_PREFIX not set.  Example: your-user/tts")
    else:
        from huggingface_hub import HfApi
        api = HfApi(token=hf_token)

        for voice_name, ds_dir in BUILT_DATASETS.items():
            repo_id = f"{repo_prefix}_{voice_name}"
            print(f"Uploading {voice_name} -> {repo_id} ...")

            api.create_repo(
                repo_id=repo_id,
                repo_type="dataset",
                exist_ok=True,
            )
            api.upload_folder(
                folder_path=ds_dir,
                repo_id=repo_id,
                repo_type="dataset",
                commit_message=f"Upload TTS dataset: {_env('TTS_DATASET_NAME')} / {voice_name}",
            )
            print(f"  Done: https://huggingface.co/datasets/{repo_id}")

        print(f"\nAll {len(BUILT_DATASETS)} dataset(s) uploaded.")

---
## 6 · Output Summary

Quick sanity check: list all output directories and count generated files.
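
Matching counts do not guarantee that every metadata row points at an existing file. A stricter check could look like the sketch below. `missing_wavs` is a hypothetical helper, and it assumes the first `|`-separated field is the file ID, with or without a `.wav` suffix, and that audio lives directly under `wavs/`.

```python
import tempfile
from pathlib import Path


def missing_wavs(ds_dir: Path) -> list[str]:
    """Return metadata IDs that have no matching file under ds_dir/wavs."""
    missing = []
    meta = ds_dir / "metadata.csv"
    for line in meta.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        file_id = line.split("|", 1)[0].strip()
        # Assumption: the ID may or may not carry a ".wav" suffix.
        stem = file_id[:-4] if file_id.lower().endswith(".wav") else file_id
        if not (ds_dir / "wavs" / f"{stem}.wav").exists():
            missing.append(file_id)
    return missing


# Tiny self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ds = Path(tmp) / "voice_demo"
    (ds / "wavs").mkdir(parents=True)
    (ds / "wavs" / "LJ001-0001.wav").write_bytes(b"")
    (ds / "metadata.csv").write_text(
        "LJ001-0001|Hello world.\nLJ001-0002|Missing audio.\n", encoding="utf-8"
    )
    demo_missing = missing_wavs(ds)
    print(demo_missing)  # ['LJ001-0002']
```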

In [None]:
from pathlib import Path

print("=" * 65)
print("  OUTPUT SUMMARY")
print("=" * 65)

# Pipeline dirs
for label, d in [
    ("Donor TTS (raw)",  _env("DIR_DONOR_TTS")),
    ("Post-processed",   _env("DIR_PROCESSED")),
]:
    p = Path(d)
    if p.exists():
        wav_n = len(list(p.rglob("*.wav")))
        print(f"  {label:25s}  {wav_n:>6} wav files")
    else:
        print(f"  {label:25s}  (not created)")

print(f"{'─' * 65}")
print(f"  {'DATASET':25s}  {'WAVs':>6}  {'meta lines':>10}")
print(f"{'─' * 65}")

# Per-voice datasets
datasets_root = Path(_env("DIR_DATASETS"))
if datasets_root.exists():
    for ds_dir in sorted(datasets_root.iterdir()):
        if not ds_dir.is_dir():
            continue
        wav_n = len(list((ds_dir / "wavs").rglob("*.wav"))) if (ds_dir / "wavs").exists() else 0
        meta = ds_dir / "metadata.csv"
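        # Count non-empty metadata rows without leaving a file handle open.
        meta_n = sum(1 for line in meta.read_text(encoding="utf-8").splitlines() if line.strip()) if meta.exists() else 0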

        marker = "✓" if wav_n > 0 and wav_n == meta_n else "⚠"
        print(f"  {marker} {ds_dir.name:23s}  {wav_n:>6}  {meta_n:>10}")

        # Preview first & last line of metadata
        if meta.exists():
            lines = meta.read_text(encoding="utf-8").strip().splitlines()
            if lines:
                print(f"    first: {lines[0][:90]}{'…' if len(lines[0]) > 90 else ''}")
            if len(lines) > 1:
                print(f"    last : {lines[-1][:90]}{'…' if len(lines[-1]) > 90 else ''}")
else:
    print("  (no datasets created yet)")

print("=" * 65)