## TTS Starter Notebook
This is the starter notebook for fine-tuning the "microsoft/speecht5_tts" model on the "facebook/voxpopuli" dataset.

The starter includes the following steps:
1.   Install dependencies
2.   Load the dataset
3.   Pre-process the data
4.   Train the model




## Installation

In [None]:
!pip install transformers datasets soundfile speechbrain accelerate

Import dependencies

In [1]:
import os
import torch
from speechbrain.pretrained import EncoderClassifier
from functools import partial
from dataclasses import dataclass
from typing import Any, Dict, List, Union
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer
from datasets import load_dataset, Audio
from huggingface_hub import notebook_login
from collections import defaultdict
import matplotlib.pyplot as plt

Log in to Hugging Face if you want to push the final model to the Hugging Face Hub.

In [None]:
notebook_login()

## Loading dataset
Since the dataset is huge, let's load only the first 500 training samples of the Dutch (`nl`) subset.

In [None]:

dataset = load_dataset("facebook/voxpopuli", "nl", split="train[:500]")
len(dataset)


## Data processing

# Resample the audio to 16 kHz
This amount of data should be sufficient for fine-tuning. SpeechT5 expects audio data to have a sampling rate of 16 kHz, so let's make sure the dataset meets this requirement:

In [None]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
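
As a quick sanity check, a small helper can confirm that a decoded example really carries the expected rate. This is a minimal sketch: `check_sampling_rate` is a hypothetical helper, not part of `datasets`, and it assumes the decoded audio behaves like a dict with `array` and `sampling_rate` keys, as it is used in the cells of this notebook.

```python
def check_sampling_rate(audio, expected=16_000):
    """Return the clip duration in seconds, or raise if the rate is wrong.

    `audio` is assumed to be dict-like with "array" and "sampling_rate" keys.
    """
    rate = audio["sampling_rate"]
    if rate != expected:
        raise ValueError(f"expected {expected} Hz, got {rate} Hz")
    return len(audio["array"]) / rate


# Toy stand-in for dataset[0]["audio"]: one second of silence at 16 kHz.
fake_audio = {"array": [0.0] * 16_000, "sampling_rate": 16_000}
print(check_sampling_rate(fake_audio))
```

On the real data, the same call would be `check_sampling_rate(dataset[0]["audio"])`.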

# SpeechT5 Data Preprocessing

This preprocessing pipeline prepares Dutch audio-text data for SpeechT5 TTS model fine-tuning. The main challenges are handling non-English characters and ensuring proper text normalization.
Key Steps
1. Load Processor: Initialize the SpeechT5 processor, which contains both the tokenizer and the feature extractor for data preparation.
2. Text Selection: Use `normalized_text` instead of `raw_text` because the SpeechT5 tokenizer doesn't handle numbers well. The normalized version has numbers written out as words.
3. Character Compatibility: SpeechT5 was trained on English, so Dutch characters like à, ç, è, ë, í, ï, ö, ü get converted to `<unk>` tokens, losing meaning.
4. Vocabulary Analysis: Extract all unique characters from the dataset and compare them with the tokenizer vocabulary to identify unsupported characters.
5. Character Replacement: Map unsupported Dutch characters to their closest English equivalents (e.g., à → a, ë → e) to preserve meaning while maintaining tokenizer compatibility.
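
The vocabulary analysis in step 4 boils down to a set difference. A minimal sketch on toy data, assuming a made-up vocabulary (`toy_vocab` is hypothetical and is not the real SpeechT5 vocabulary; `find_unsupported_chars` is an illustrative helper name):

```python
def find_unsupported_chars(texts, tokenizer_vocab):
    """Return characters that occur in `texts` but not in `tokenizer_vocab`."""
    dataset_vocab = set("".join(texts))
    return dataset_vocab - set(tokenizer_vocab)


# Toy data: a tiny hypothetical vocabulary that encodes the space as "▁".
toy_vocab = set("abcdefghijklmnopqrstuvwxyz") | {"▁"}
toy_texts = ["een café", "één idee"]

print(sorted(find_unsupported_chars(toy_texts, toy_vocab)))
```

In this toy setup the plain space is reported as unsupported only because the vocabulary stores it under a different symbol, so it needs no replacement; the accented character is the one that would have to be mapped.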

In [None]:

# Load processor and tokenizer
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
tokenizer = processor.tokenizer

# Examine dataset structure
dataset[0]
# Output shows: raw_text, normalized_text, audio, speaker_id, gender, etc.

# Extract all unique characters from dataset
def extract_all_chars(batch):
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

# Compare dataset vocabulary with tokenizer vocabulary
dataset_vocab = set(vocabs["vocab"][0])
tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

# Find unsupported characters
unsupported_chars = dataset_vocab - tokenizer_vocab
# Result: {' ', 'à', 'ç', 'è', 'ë', 'í', 'ï', 'ö', 'ü'}

# Define character replacements
replacements = [
    ("à", "a"),
    ("ç", "c"),
    ("è", "e"),
    ("ë", "e"),
    ("í", "i"),
    ("ï", "i"),
    ("ö", "o"),
    ("ü", "u"),
]

# Apply text cleanup
def cleanup_text(inputs):
    for src, dst in replacements:
        inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
    return inputs

dataset = dataset.map(cleanup_text)
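
The explicit replacement list above only covers the characters found in this 500-sample subset. As a more general alternative (not what this notebook does), diacritics can be stripped with Unicode decomposition from the standard library. A minimal sketch; `strip_diacritics` is a hypothetical helper name:

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("à ç è ë í ï ö ü"))
```

Note that NFKD also rewrites compatibility characters, and letters without a decomposition (for example "ø") pass through unchanged, so the vocabulary check is still worth re-running afterwards.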

# Speaker Analysis and Embeddings
Multi-speaker TTS requires understanding speaker distribution and generating speaker embeddings to differentiate between voices during training.
Key Steps
1. Speaker Distribution Analysis: Count the unique speakers and the number of examples per speaker in the 500 downloaded VoxPopuli samples to understand data balance.
2. Data Filtering: Remove speakers with too few or too many examples to improve training efficiency. Here we keep speakers with 1-15 examples; since every speaker has at least one example, the filter effectively drops only speakers with more than 15.
3. Speaker Embeddings: Generate 512-dimensional speaker embeddings using pre-trained SpeechBrain X-vector model to capture voice characteristics.
4. Cross-Language Consideration: The X-vector model was trained on English (VoxCeleb), but we're using Dutch data. This may still work reasonably well, though optimal results would require Dutch-trained embeddings.
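
The counting and filtering in steps 1 and 2 can be sketched on toy data with the standard library. The speaker ids below are made up, and `keep_speakers` is a hypothetical helper used only for illustration:

```python
from collections import Counter


def keep_speakers(speaker_ids, min_examples=1, max_examples=15):
    """Return the set of speakers whose example count lies in the given range."""
    counts = Counter(speaker_ids)
    return {spk for spk, n in counts.items() if min_examples <= n <= max_examples}


# Toy speaker ids (hypothetical): "a" has 20 examples, "b" has 3, "c" has 1.
toy_ids = ["a"] * 20 + ["b"] * 3 + ["c"]
kept = keep_speakers(toy_ids)
print(sorted(kept))
```

Speaker "a" is dropped entirely because it exceeds the upper bound; its examples are not truncated to 15.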

In [None]:
# Analyze speaker distribution
from collections import defaultdict
import matplotlib.pyplot as plt

speaker_counts = defaultdict(int)

for speaker_id in dataset["speaker_id"]:
    speaker_counts[speaker_id] += 1

# Visualize speaker distribution
plt.figure()
plt.hist(speaker_counts.values(), bins=20)
plt.ylabel("Speakers")
plt.xlabel("Examples")
plt.show()

# Keep only speakers with 1-15 examples
def select_speaker(speaker_id):
    return 1 <= speaker_counts[speaker_id] <= 15

dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])

# Check filtered results
print(f"Speakers remaining: {len(set(dataset['speaker_id']))}")
print(f"Examples remaining: {len(dataset)}")

# Create speaker embedding function

spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"

speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)

def create_speaker_embedding(waveform):
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings
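
`torch.nn.functional.normalize` rescales each embedding to unit L2 norm, so embeddings differ in direction rather than magnitude. The idea can be sketched without PyTorch; `l2_normalize` is a hypothetical helper for illustration only, applied here to a 2-dimensional toy vector instead of a 512-dimensional x-vector:

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale `vector` to unit Euclidean length (eps guards against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)
```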

## **Dataset Processing and Data Collator**
Transform the raw dataset into model-ready format with tokenized text, log-mel spectrograms, and speaker embeddings, then create batching functionality for training.
Key Steps
1. Data Processing: Convert each example using SpeechT5Processor to:
   - Tokenize normalized text into `input_ids`
   - Convert audio to log-mel spectrogram `labels` (80 mel bins)
   - Generate 512-dimensional speaker embeddings
   - Create `stop_labels` for sequence termination
2. Length Filtering: Remove examples with 200 or more tokens (the model's maximum input length is 600) to enable larger batch sizes and prevent memory issues.
3. Train/Test Split: Create a 90/10 split for training and evaluation.
4. Custom Data Collator: Handle variable-length sequences by:
   - Padding shorter sequences to the batch maximum length
   - Masking padded spectrogram areas with -100 (ignored in the loss)
   - Adjusting lengths to multiples of the reduction factor (2)
   - Batching speaker embeddings as tensors
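
The two less obvious collator steps, masking padding with -100 and trimming to a multiple of the reduction factor, can be sketched on plain Python lists. This is an illustration only: `pad_and_mask` and `round_down` are hypothetical helpers, and each "frame" is a single number rather than an 80-bin mel vector.

```python
def round_down(length, reduction_factor=2):
    """Largest multiple of `reduction_factor` that is <= `length`."""
    return length - length % reduction_factor


def pad_and_mask(sequences, reduction_factor=2, ignore_value=-100):
    """Pad to the longest sequence with `ignore_value`, then trim the batch
    to the longest length rounded down to a multiple of `reduction_factor`."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [ignore_value] * (longest - len(seq)) for seq in sequences]
    max_length = max(round_down(len(seq), reduction_factor) for seq in sequences)
    return [row[:max_length] for row in padded]


batch = pad_and_mask([[1, 2, 3, 4, 5], [6, 7]])
print(batch)
```

The longest sequence has 5 frames, which is rounded down to 4, so the last frame of the first sequence is cut and the second sequence is padded with two ignored positions.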


In [None]:
# Define dataset processing function
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    # Strip off the batch dimension
    example["labels"] = example["labels"][0]

    # Use SpeechBrain to obtain x-vector
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

    return example

# Verify processing on single example
processed_example = prepare_dataset(dataset[0])
print(list(processed_example.keys()))


print(processed_example["speaker_embeddings"].shape)

# Visualize log-mel spectrogram
import matplotlib.pyplot as plt

plt.figure()
plt.imshow(processed_example["labels"].T)
plt.show()
# Note: Spectrogram appears upside down due to matplotlib y-axis convention

# Apply processing to entire dataset (5-10 minutes)
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

# Filter out examples longer than 200 tokens
def is_not_too_long(input_ids):
    input_length = len(input_ids)
    return input_length < 200

dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
print(len(dataset))

# Create train/test split
dataset = dataset.train_test_split(test_size=0.1)

# Define custom data collator
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # Collate the inputs and targets into a batch
        # (use the processor stored on the collator, not the global one)
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )

        # Replace padding with -100 to ignore loss correctly
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )

        # Not used during fine-tuning
        del batch["decoder_attention_mask"]

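        # Round down target lengths to a multiple of the reduction factor.
        # Note: `model` is the global loaded in the Model Training cell below,
        # so that cell must run before the collator is first called.
        if model.config.reduction_factor > 1: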
            target_lengths = torch.tensor(
                [len(feature["input_values"]) for feature in label_features]
            )
            target_lengths = target_lengths.new(
                [
                    length - length % model.config.reduction_factor
                    for length in target_lengths
                ]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        # Add speaker embeddings
        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch

# Initialize data collator
data_collator = TTSDataCollatorWithPadding(processor=processor)

## **Model Training**
Fine-tune the pre-trained SpeechT5 model on the Dutch dataset, using gradient checkpointing, gradient accumulation, and fp16 to keep memory usage low.
Key Steps
1. Model Loading: Load pre-trained checkpoint
2. Cache Configuration: Disable cache for training, enable for inference
3. Training Setup: Configure parameters and initialize trainer
4. Training Execution: Run training and push to Hub

In [None]:
# Load pre-trained model
from transformers import SpeechT5ForTextToSpeech

model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

# Configure cache settings
from functools import partial

# Disable cache during training (incompatible with gradient checkpointing)
model.config.use_cache = False

# Re-enable cache for generation to speed up inference
model.generate = partial(model.generate, use_cache=True)

# Define training arguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="Denhotech/tts",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=2,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    greater_is_better=False,
    label_names=["labels"],
    push_to_hub=True,
)

# Initialize trainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=processor,  # newer transformers releases name this argument processing_class
)

# Start training (takes several hours)
trainer.train()

# Push final model to Hub
trainer.push_to_hub()
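
To get a feel for the training budget, the effective batch size and the approximate number of passes over the data follow from the training arguments above. A minimal sketch, assuming a single device and a hypothetical 450 training examples (the real count depends on the filtering steps, so check `len(dataset["train"])`):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 4000
num_devices = 1           # assumption: a single GPU
num_train_examples = 450  # assumption: hypothetical value for illustration

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples seen during training, expressed as passes over the data.
examples_seen = max_steps * effective_batch_size
approx_epochs = examples_seen / num_train_examples

print(effective_batch_size, examples_seen, round(approx_epochs, 1))
```

Under these assumptions, 4000 steps amount to hundreds of passes over a very small training set, so it may be worth lowering `max_steps` or loading more samples when adapting this starter.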