# Piper TTS Fine-Tuning Notebook

This notebook provides a streamlined workflow for fine-tuning Piper TTS models on custom voice data.

## Workflow Overview
1. **Setup** - Downgrade Python, mount Drive, clone Piper, install dependencies
2. **Data Preparation** - Extract audio dataset and upload transcript
3. **Training Configuration** - Configure training settings and download pretrained model
4. **Training** - Run fine-tuning
5. **Export & Download** - Export to ONNX and download locally

## Requirements
- Google Colab with GPU runtime (T4 or better)
- Audio dataset: WAV files (16000 or 22050 Hz, 16-bit, mono)
- Transcript file: `wavs/<filename>.wav|<transcription text>`
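
A quick local check of the transcript can catch malformed lines before anything is uploaded. The sketch below assumes the pipe-delimited layout listed above, plus the optional speaker column from the multi-speaker format described later in this notebook. `parse_transcript_line` and `find_missing_audio` are hypothetical helper names, not part of Piper.

```python
import os
import tempfile


def parse_transcript_line(line):
    """Split one 'path|text' or 'path|speaker|text' line into its fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) < 2 or not parts[0].strip() or not parts[-1].strip():
        raise ValueError(f"Malformed transcript line: {line!r}")
    # A middle column, when present, is treated as the speaker name (assumption).
    speaker = parts[1].strip() if len(parts) >= 3 else None
    return parts[0].strip(), speaker, parts[-1].strip()


def find_missing_audio(transcript_lines, dataset_dir):
    """Return the audio paths named in the transcript that do not exist on disk."""
    missing = []
    for line in transcript_lines:
        if not line.strip():
            continue  # ignore blank lines
        audio_path, _speaker, _text = parse_transcript_line(line)
        if not os.path.isfile(os.path.join(dataset_dir, audio_path)):
            missing.append(audio_path)
    return missing


# Demo on a throwaway directory that contains only one of the two referenced files.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "wavs"))
    open(os.path.join(demo_dir, "wavs", "1.wav"), "wb").close()
    demo_lines = ["wavs/1.wav|Hello there.", "wavs/2.wav|Second clip."]
    demo_missing = find_missing_audio(demo_lines, demo_dir)
    print(demo_missing)
```

Any path reported as missing points to either a typo in the transcript or a file that was left out of the dataset.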

## Important
The first cell will **restart the runtime** to apply the Python downgrade. After the restart, **run the first cell again** to complete the setup.

---

In [None]:
#@markdown # **1. Environment Setup**
#@markdown ---
#@markdown This cell sets up the complete environment:
#@markdown - Downgrades Python to 3.10 (required for torch==2.1.0)
#@markdown - Mounts Google Drive
#@markdown - Clones Piper repository
#@markdown - Builds monotonic_align extension
#@markdown - Installs the "Golden Trio" dependencies for ONNX export
#@markdown - Applies required patches
#@markdown
#@markdown **Note:** This cell will restart the Python runtime. After it completes, re-run this cell once more.

import sys

# Check if we need to downgrade Python
if sys.version_info >= (3, 11):
    print("="*50)
    print("Downgrading Python to 3.10...")
    print("="*50)
    
    # Downgrade Python to 3.10
    !sudo apt-get update -qq
    !sudo apt-get install -qq python3.10 python3.10-distutils
    !sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
    !sudo update-alternatives --set python3 /usr/bin/python3.10
    
    # Fix pip for Python 3.10
    !curl -sS https://bootstrap.pypa.io/get-pip.py | python3
    
    # Install ipykernel for Colab runtime compatibility
    !python3.10 -m pip install ipykernel google-colab
    
    print("\nPython downgrade complete!")
    print("Restarting runtime...")
    
    # Restart runtime
    import os
    os.kill(os.getpid(), 9)

print(f"Python version: {sys.version}")

import os
os.environ["TORCH_FORCE_WEIGHTS_ONLY_LOAD"] = "0"

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Clone Piper repository
print("\nCloning Piper repository...")
!rm -rf /content/piper
!git clone -q https://github.com/rhasspy/piper.git /content/piper

# Build monotonic_align extension
print("\nBuilding monotonic_align...")
%cd /content/piper/src/python
!bash build_monotonic_align.sh 2>/dev/null

# Install the "Golden Trio" - these specific versions are critical for ONNX export
print("\n" + "="*50)
print("Installing Golden Trio dependencies...")
print("="*50)

!pip uninstall -y torch torchvision torchaudio pytorch-lightning lightning onnxscript onnxruntime onnx -q 2>/dev/null
!pip install torch==2.1.0 pytorch-lightning==1.9.0 torchmetrics==0.11.4 onnx==1.16.1 onnxruntime==1.17.1 -q

# Install other Piper dependencies
!pip install -q cython piper-phonemize==1.1.0 librosa numpy==1.26

# Apply patches for ONNX export compatibility
print("\nApplying ONNX export patches...")

# Patch 1: Comment out math assertion in transforms.py
!sed -i 's/assert (discriminant >= 0).all(), discriminant/# assert (discriminant >= 0).all(), discriminant/' /content/piper/src/python/piper_train/vits/transforms.py

# Patch 2: Add .detach() to mask guard in modules.py
!sed -i 's/h = self.pre(x0) \* x_mask/h = self.pre(x0) * x_mask.detach()/' /content/piper/src/python/piper_train/vits/modules.py

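# Allowlist pathlib types for Piper checkpoints. add_safe_globals only exists
# in newer PyTorch releases, so skip it on the pinned torch==2.1.0.
import torch
import pathlib
if hasattr(torch.serialization, "add_safe_globals"):
    torch.serialization.add_safe_globals([pathlib.PosixPath, pathlib.PurePosixPath])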

print("\n" + "="*50)
print("Setup complete!")
print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")


In [None]:
#@markdown # **2. Extract Dataset**
#@markdown ---
#@markdown Extract your audio dataset from a ZIP file on Google Drive.
#@markdown
#@markdown **Audio Requirements:**
#@markdown - WAV format
#@markdown - 16000 or 22050 Hz sample rate
#@markdown - 16-bit, mono
#@markdown - Numbered files: 1.wav, 2.wav, 3.wav, ...

import os
import wave
import zipfile
import datetime

def get_dataset_duration(wav_path):
    """Return the number of readable WAV files and their total duration."""
    total_duration = 0.0
    readable_count = 0
    wav_files = [x for x in os.listdir(wav_path) if x.endswith(".wav")]
    for file_name in wav_files:
        file_path = os.path.join(wav_path, file_name)
        try:
            with wave.open(file_path, "rb") as wave_file:
                frames = wave_file.getnframes()
                rate = wave_file.getframerate()
                total_duration += frames / float(rate)
                readable_count += 1
        except (wave.Error, EOFError, OSError) as err:
            # Report unreadable files instead of silently ignoring them
            print(f"Skipping unreadable file {file_name}: {err}")
    duration_str = str(datetime.timedelta(seconds=round(total_duration)))
    return readable_count, duration_str

# Create dataset directories
%cd /content
!rm -rf /content/dataset
os.makedirs("/content/dataset/wavs", exist_ok=True)
%cd /content/dataset

#@markdown ### Path to your audio dataset ZIP file:
zip_path = "/content/drive/MyDrive/piper_dataset.zip" #@param {type:"string"}

zip_path = zip_path.strip()
if not zip_path:
    raise Exception("You must provide a path to your audio dataset.")

if not os.path.exists(zip_path):
    raise Exception(f"Path not found: {zip_path}")

if zipfile.is_zipfile(zip_path):
    print("Extracting audio files...")
    !unzip -q -j "{zip_path}" -d /content/dataset/wavs
else:
    print("Copying audio files from folder...")
    !cp -a "{zip_path}/." /content/dataset/wavs/

# Handle nested wavs folder if present
if os.path.exists("/content/dataset/wavs/wavs"):
    !mv /content/dataset/wavs/wavs/* /content/dataset/wavs/
    !rm -rf /content/dataset/wavs/wavs

# Remove any text files that came with the ZIP
!rm -f /content/dataset/wavs/*.txt /content/dataset/wavs/*.csv 2>/dev/null

# Report dataset info
audio_count, dataset_dur = get_dataset_duration("/content/dataset/wavs")
print(f"\nDataset loaded: {audio_count} WAV files, total duration: {dataset_dur}")
%cd /content
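
The duration report does not verify the audio format itself. As a rough sketch, the standard-library `wave` module can flag files that are not 16-bit mono at an expected sample rate. `check_wav_format` and `write_silence` are hypothetical helpers, and the accepted rates are taken from the audio requirements listed for this step.

```python
import os
import tempfile
import wave


def check_wav_format(path, allowed_rates=(16000, 22050)):
    """Return a list of problems found in one WAV file (empty if it looks fine)."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != 1:
            problems.append(f"{wav_file.getnchannels()} channels (expected mono)")
        if wav_file.getsampwidth() != 2:
            problems.append(f"{8 * wav_file.getsampwidth()}-bit samples (expected 16-bit)")
        if wav_file.getframerate() not in allowed_rates:
            problems.append(f"{wav_file.getframerate()} Hz sample rate")
    return problems


def write_silence(path, channels, rate, seconds=1):
    """Write a silent 16-bit WAV file, used here only to build demo inputs."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * channels * rate * seconds)


# Demo: one file that meets the requirements and one stereo 44.1 kHz file.
with tempfile.TemporaryDirectory() as demo_dir:
    good_path = os.path.join(demo_dir, "good.wav")
    bad_path = os.path.join(demo_dir, "bad.wav")
    write_silence(good_path, channels=1, rate=22050)
    write_silence(bad_path, channels=2, rate=44100)
    good_problems = check_wav_format(good_path)
    bad_problems = check_wav_format(bad_path)
    print(good_problems)
    print(bad_problems)
```

In the notebook, the same function could be looped over `/content/dataset/wavs` to list files that need resampling or downmixing before preprocessing.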

In [None]:
#@markdown # **3. Upload Transcript & Preprocess**
#@markdown ---
#@markdown Upload your transcript file and preprocess the dataset.
#@markdown
#@markdown **Transcript Format (single-speaker):**
#@markdown ```
#@markdown wavs/1.wav|This is the text spoken in audio 1.
#@markdown wavs/2.wav|This is the text spoken in audio 2.
#@markdown ```
#@markdown
#@markdown **Transcript Format (multi-speaker):**
#@markdown ```
#@markdown wavs/1.wav|speaker1|Text spoken by speaker 1.
#@markdown wavs/2.wav|speaker2|Text spoken by speaker 2.
#@markdown ```

import os
from google.colab import files

#@markdown ### Language of your dataset:
language = "English (U.S.)" #@param ["English (British)", "English (U.S.)", "Deutsch", "Fran\u00e7ais", "Espa\u00f1ol (Castellano)", "Espa\u00f1ol (Latinoamericano)", "Italiano", "Portugu\u00eas (Brasil)", "Portugu\u00eas (Portugal)", "Nederlands", "Polski", "Русский", "\u7b80\u4f53\u4e2d\u6587", "\u65e5\u672c\u8a9e"]

#@markdown ### Model name (no spaces):
model_name = "my_voice" #@param {type:"string"}

#@markdown ### Output folder (save to Drive recommended):
output_path = "/content/drive/MyDrive/colab/piper" #@param {type:"string"}

#@markdown ### Sample rate of your audio files:
sample_rate = "22050" #@param ["16000", "22050"]

#@markdown ### Single speaker dataset?
single_speaker = True #@param {type:"boolean"}

# Language code mapping
languages = {
    "English (British)": "en",
    "English (U.S.)": "en-us",
    "Deutsch": "de",
    "Fran\u00e7ais": "fr",
    "Espa\u00f1ol (Castellano)": "es",
    "Espa\u00f1ol (Latinoamericano)": "es-419",
    "Italiano": "it",
    "Portugu\u00eas (Brasil)": "pt-br",
    "Portugu\u00eas (Portugal)": "pt-pt",
    "Nederlands": "nl",
    "Polski": "pl",
    "Русский": "ru",
    "\u7b80\u4f53\u4e2d\u6587": "zh",
    "\u65e5\u672c\u8a9e": "ja"
}

final_language = languages[language]
output_dir = os.path.join(output_path, model_name)
os.makedirs(output_dir, exist_ok=True)

# Upload transcript
%cd /content/dataset
!rm -f /content/dataset/metadata.csv

print("Please upload your transcript file (metadata.csv or .txt):")
uploaded = files.upload()
if not uploaded:
    raise Exception("No transcript uploaded. Re-run this cell and select a file.")
uploaded_filename = list(uploaded.keys())[0]
if uploaded_filename != "metadata.csv":
    !mv "{uploaded_filename}" metadata.csv

# Create audio cache directory
os.makedirs("/content/audio_cache", exist_ok=True)

# Run preprocessing
%cd /content/piper/src/python

force_sp = "--single-speaker" if single_speaker else ""

print("\nRunning preprocessing...")
!python -m piper_train.preprocess \
  --language {final_language} \
  --input-dir /content/dataset \
  --cache-dir "/content/audio_cache" \
  --output-dir "{output_dir}" \
  --dataset-name "{model_name}" \
  --dataset-format ljspeech \
  --sample-rate {sample_rate} \
  {force_sp}

print("\nPreprocessing complete!")
print(f"Output directory: {output_dir}")

In [None]:
#@markdown # **4. Training Settings**
#@markdown ---
#@markdown Configure training hyperparameters.

#@markdown ### Batch size:
#@markdown Reduce if you run out of GPU memory.
batch_size = 12 #@param {type:"integer"}

#@markdown ### Model quality:
#@markdown - **x-low**: 16 kHz, 5-7M params (fastest)
#@markdown - **medium**: 22.05 kHz, 15-20M params (recommended)
#@markdown - **high**: 22.05 kHz, 28-32M params (best quality)
quality = "medium" #@param ["x-low", "medium", "high"]

#@markdown ### Maximum training epochs:
max_epochs = 3000 #@param {type:"integer"}

#@markdown ### Checkpoint save interval (epochs):
checkpoint_epochs = 5 #@param {type:"integer"}

#@markdown ### Enable validation?
#@markdown Disable for very small datasets (<5 min audio).
enable_validation = False #@param {type:"boolean"}

#@markdown ### Log interval (steps):
log_every_n_steps = 1000 #@param {type:"integer"}

# Store settings for training cell
training_settings = {
    'batch_size': batch_size,
    'quality': quality,
    'max_epochs': max_epochs,
    'checkpoint_epochs': checkpoint_epochs,
    'enable_validation': enable_validation,
    'log_every_n_steps': log_every_n_steps
}

print("Training settings configured:")
for key, value in training_settings.items():
    print(f"  {key}: {value}")
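
To get a feel for how these settings interact, the arithmetic below estimates the number of training batches per epoch and how many epochs pass between log lines. The utterance count of 500 is a made-up figure. The estimate ignores any validation split and assumes the log interval is counted in training batches.

```python
import math


def batches_per_epoch(num_utterances, batch_size):
    """Number of training batches needed to see every utterance once."""
    return math.ceil(num_utterances / batch_size)


def epochs_between_logs(num_utterances, batch_size, log_every_n_steps):
    """Approximate epochs between log lines, assuming one step per batch."""
    return log_every_n_steps / batches_per_epoch(num_utterances, batch_size)


# Hypothetical dataset of 500 utterances with the default settings above.
example_batches = batches_per_epoch(500, 12)       # ceil(500 / 12) = 42
example_gap = epochs_between_logs(500, 12, 1000)   # 1000 / 42, roughly 23.8
print(example_batches, round(example_gap, 1))
```

Under these assumptions, a small dataset with a log interval of 1000 would produce output only every couple dozen epochs, so a lower interval may make progress easier to follow.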

In [None]:
#@markdown # **5. Download Pretrained Model**
#@markdown ---
#@markdown Download a pretrained model to fine-tune.
#@markdown
#@markdown Select a model that matches your target language.

import json
import ipywidgets as widgets
from IPython.display import display
from google.colab import output
import os

# Load pretrained models list
try:
    with open('/content/piper/notebooks/pretrained_models.json') as f:
        pretrained_models = json.load(f)
except FileNotFoundError:
    raise Exception("pretrained_models.json not found. Run Setup cell first.")

if final_language not in pretrained_models:
    print(f"No pretrained models available for {final_language}.")
    print("Available languages:", list(pretrained_models.keys()))
    raise Exception(f"Please choose a different language or provide your own checkpoint.")

models = pretrained_models[final_language]
model_options = [(name, name) for name in models.keys()]

model_dropdown = widgets.Dropdown(
    description="Model:",
    options=model_options,
    style={'description_width': 'initial'}
)

download_button = widgets.Button(description="Download Model", button_style='primary')
status_output = widgets.Output()

def download_model(btn):
    with status_output:
        status_output.clear_output()
        selected_model = model_dropdown.value
        model_url = pretrained_models[final_language][selected_model]
        print(f"Downloading {selected_model}...")
        
        !rm -f /content/pretrained.ckpt
        
        if model_url.startswith("1"):
            # Google Drive file ID
            !gdown -q "{model_url}" -O "/content/pretrained.ckpt"
        elif "drive.google.com" in model_url:
            !gdown -q "{model_url}" -O "/content/pretrained.ckpt" --fuzzy
        else:
            !wget -q "{model_url}" -O "/content/pretrained.ckpt"
        
        if os.path.exists("/content/pretrained.ckpt"):
            size_mb = os.path.getsize("/content/pretrained.ckpt") / (1024 * 1024)
            print(f"\nDownload complete! ({size_mb:.1f} MB)")
            print("Pretrained model saved to: /content/pretrained.ckpt")
        else:
            print("\nError: Download failed. Please try again.")

download_button.on_click(download_model)

print(f"Available pretrained models for {language}:\n")
display(model_dropdown, download_button, status_output)

In [None]:
#@markdown # **6. Run Fine-Tuning**
#@markdown ---
#@markdown Start training! Monitor progress in the output.
#@markdown
#@markdown **Tips:**
#@markdown - Training can take several hours depending on dataset size
#@markdown - Checkpoints are saved to Google Drive automatically
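#@markdown - As written, this cell always starts from `/content/pretrained.ckpt`; to resume an interrupted run, point `--resume_from_checkpoint` at the latest checkpoint in the output folder instead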

import os

# Verify pretrained model exists
if not os.path.exists("/content/pretrained.ckpt"):
    raise Exception("Pretrained model not found! Run the 'Download Pretrained Model' cell first.")

# Set validation parameters
if training_settings['enable_validation']:
    validation_split = 0.01
    num_test_examples = 1
else:
    validation_split = 0
    num_test_examples = 0

print(f"Starting fine-tuning...")
print(f"Output directory: {output_dir}")
print(f"Quality: {training_settings['quality']}")
print(f"Max epochs: {training_settings['max_epochs']}")
print(f"Batch size: {training_settings['batch_size']}")
print("\n" + "="*50 + "\n")

%cd /content/piper/src/python

!python -m piper_train \
  --dataset-dir "{output_dir}" \
  --accelerator 'gpu' \
  --devices 1 \
  --batch-size {training_settings['batch_size']} \
  --validation-split {validation_split} \
  --num-test-examples {num_test_examples} \
  --quality {training_settings['quality']} \
  --checkpoint-epochs {training_settings['checkpoint_epochs']} \
  --num_ckpt 1 \
  --log_every_n_steps {training_settings['log_every_n_steps']} \
  --max_epochs {training_settings['max_epochs']} \
  --resume_from_checkpoint "/content/pretrained.ckpt" \
  --precision 32

print("\n" + "="*50)
print("Training complete!")
print("="*50)

In [None]:
#@markdown # **7. Export to ONNX**
#@markdown ---
#@markdown Export your trained model to ONNX format for inference.
#@markdown
#@markdown This cell auto-detects the latest checkpoint, or you can specify one.

import os
import glob
import re

#@markdown ### Checkpoint path (leave empty to auto-detect latest):
checkpoint_path = "" #@param {type:"string"}

#@markdown ### Output ONNX filename (without extension):
output_name = "my_voice" #@param {type:"string"}

# Auto-detect latest checkpoint if not specified
if not checkpoint_path:
    checkpoints = glob.glob(f"{output_dir}/lightning_logs/**/checkpoints/*.ckpt", recursive=True)
    if not checkpoints:
        raise Exception(f"No checkpoints found in {output_dir}/lightning_logs/")
    
    # Sort by version number and epoch
    def get_checkpoint_info(path):
        version_match = re.search(r'version_(\d+)', path)
        epoch_match = re.search(r'epoch=(\d+)', path)
        version = int(version_match.group(1)) if version_match else 0
        epoch = int(epoch_match.group(1)) if epoch_match else 0
        return (version, epoch)
    
    checkpoints.sort(key=get_checkpoint_info, reverse=True)
    checkpoint_path = checkpoints[0]
    print(f"Auto-detected checkpoint: {checkpoint_path}")

# Verify checkpoint exists
if not os.path.exists(checkpoint_path):
    raise Exception(f"Checkpoint not found: {checkpoint_path}")

# Set output paths
output_onnx = f"{output_dir}/{output_name}.onnx"
config_path = f"{output_dir}/config.json"

print(f"\nExporting to ONNX...")
print(f"Checkpoint: {checkpoint_path}")
print(f"Output: {output_onnx}")
print("\n" + "="*50 + "\n")

%cd /content/piper/src/python

!python3 -m piper_train.export_onnx \
    "{checkpoint_path}" \
    "{output_onnx}"

# Copy config file
if os.path.exists(config_path):
    !cp "{config_path}" "{output_onnx}.json"
    print(f"\nConfig copied to: {output_onnx}.json")

# Verify export
if os.path.exists(output_onnx):
    size_mb = os.path.getsize(output_onnx) / (1024 * 1024)
    print("\n" + "="*50)
    print(f"Export successful!")
    print(f"ONNX model: {output_onnx} ({size_mb:.1f} MB)")
    print(f"Config: {output_onnx}.json")
    print("="*50)
else:
    print("\nError: Export failed. Check the output above for errors.")

In [None]:
#@markdown # **8. Download Model**
#@markdown ---
#@markdown Download the exported ONNX model and config to your local computer.

from google.colab import files
import os

# Get the ONNX file path from previous cell
onnx_path = f"{output_dir}/{output_name}.onnx"
config_path = f"{onnx_path}.json"

print("Preparing files for download...\n")

# Download ONNX model
if os.path.exists(onnx_path):
    print(f"Downloading: {os.path.basename(onnx_path)}")
    files.download(onnx_path)
else:
    print(f"Error: ONNX file not found at {onnx_path}")
    print("Please run the Export cell first.")

# Download config
if os.path.exists(config_path):
    print(f"Downloading: {os.path.basename(config_path)}")
    files.download(config_path)
else:
    print(f"Warning: Config file not found at {config_path}")

# **Usage Instructions**

After downloading your model files, you can use them with Piper TTS:

## Installation

```bash
pip install piper-tts
```

## Command Line Usage

```bash
echo "Hello, this is my custom voice!" | piper \
    --model my_voice.onnx \
    --config my_voice.onnx.json \
    --output_file output.wav
```

## Python Usage

```python
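import wave

from piper import PiperVoice

voice = PiperVoice.load("my_voice.onnx", "my_voice.onnx.json")

# In piper-tts 1.2.x, synthesize() writes into a wave.Wave_write object rather
# than a plain file handle; newer releases may name this method differently.
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize("Hello, this is my custom voice!", wav_file)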
```
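
When passing the synthesized audio to other tools, it helps to know the sample rate the voice was trained at. The sketch below reads it from the exported config, assuming the JSON stores it under `audio.sample_rate` as Piper's published voice configs do. `read_voice_sample_rate` is a hypothetical helper, and the demo writes a stand-in config rather than using a real one.

```python
import json
import os
import tempfile


def read_voice_sample_rate(config_path):
    """Return the sample rate recorded in a Piper voice config (assumed key layout)."""
    with open(config_path, "r", encoding="utf-8") as config_file:
        config = json.load(config_file)
    return int(config["audio"]["sample_rate"])


# Demo with a minimal stand-in config; a real my_voice.onnx.json has many more keys.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_config = os.path.join(demo_dir, "my_voice.onnx.json")
    with open(demo_config, "w", encoding="utf-8") as config_file:
        json.dump({"audio": {"sample_rate": 22050}}, config_file)
    demo_rate = read_voice_sample_rate(demo_config)
    print(demo_rate)
```

If the key is missing in your config, the function raises `KeyError`, which is a hint that the file layout differs from the assumption made here.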

## Resources

- [Piper GitHub](https://github.com/rhasspy/piper)
- [Piper Documentation](https://github.com/rhasspy/piper/blob/master/TRAINING.md)
- [Piper Samples](https://rhasspy.github.io/piper-samples/)

---

*Happy voice cloning!*