# Fine-tuning Parler-TTS

## Goal of this notebook

In the following notebook, we'll fine-tune [Parler-TTS Mini v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on a 5h subset of the [Galsen AI TTS dataset](https://huggingface.co/datasets/galsenai/wolof_tts).

In particular, we'll:
- Annotate the dataset with natural language speech descriptions using [Data-Speech](https://github.com/huggingface/dataspeech).
- Fine-tune Parler-TTS with the created dataset.

**You should be able to adapt this notebook to your own datasets quite easily.**





## Prepare the Environment

Throughout this tutorial, we'll use a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.

##### <a name="installation"></a> We'll install Parler-TTS and Data-Speech from source in order to train our model.

In [3]:
!git clone https://github.com/huggingface/dataspeech.git
# No `cd` here: `!cd` runs in a throwaway subshell, so we point pip at the path directly.
!pip install --quiet -r ./dataspeech/requirements.txt

fatal: destination path 'dataspeech' already exists and is not an empty directory.


In [4]:
!git clone https://github.com/huggingface/parler-tts.git
%cd parler-tts
!pip install --quiet -e .[train]

fatal: destination path 'parler-tts' already exists and is not an empty directory.
/home/ubuntu/parler-tts




ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-tools 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 3.19.6 which is incompatible.
streamlit 1.38.0 requires protobuf<6,>=3.20, but you have protobuf 3.19.6 which is incompatible.
tensorboardx 2.6.2.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.
tortoise 3.0.0 requires tokenizers<0.14.0,>=0.13.2, but you have tokenizers 0.19.1 which is incompatible.
tortoise 3.0.0 requires torchaudio<0.14.0,>=0.13.1, but you have torchaudio 2.5.0 which is incompatible.

On Colab, we need to run an additional set-up step, which you can skip if you're on your local machine.

In [3]:
!pip install --upgrade protobuf wandb==0.16.6

Defaulting to user installation because normal site-packages is not writeable
Collecting protobuf
  Using cached protobuf-5.28.2-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
  Using cached protobuf-4.25.5-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Using cached protobuf-4.25.5-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.19.6
    Uninstalling protobuf-3.19.6:
      Successfully uninstalled protobuf-3.19.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
descript-audiotools 0.7.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.25.5 which is incompatible.
grpcio-tools 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.5 which is incompatible.
tortoise 3.0.0 requires tokenizers<0.14.0,>=0.13.2, but

You should link your Hugging Face account so that you can push model repositories to the Hub. This will allow you to save your trained models on the Hub and share them with the community.

Run the command below and then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.

In [1]:
!git config --global credential.helper store
!huggingface-cli login --token <YOUR_HUGGINGFACE_TOKEN>

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ubuntu/.cache/huggingface/token
Login successful


## 1. Creating our fine-tuning dataset


The aim here is to create an annotated version of the Wolof TTS dataset, whose voice we'll call Anta, in order to fine-tune the [Parler-TTS Mini v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on it.

Thanks to a [script similar to what's described in the Data-Speech FAQ](https://github.com/huggingface/dataspeech?tab=readme-ov-file#how-do-i-use-datasets-that-i-have-with-this-repository), we've uploaded the dataset to the HuggingFace hub, under the name [galsenai/wolof_tts](https://huggingface.co/datasets/galsenai/wolof_tts).

Since this notebook is a demonstration, we'll filter the dataset to keep only the female voice and save the result under the name [galsenai/women_wolof_tts](https://huggingface.co/datasets/galsenai/women_wolof_tts).

Feel free to follow the link above to listen to some samples of the TTS dataset thanks to the hub viewer.

We'll:
1. Annotate the dataset with continuous variables that measure the speech characteristics.
2. Map those annotations to text bins that describe those characteristics.
3. Create natural language descriptions from those text bins.

In [2]:
%cd ../dataspeech

/home/ubuntu/dataspeech




In [6]:
!pip install datasets

Defaulting to user installation because normal site-packages is not writeable


In [6]:
from datasets import load_dataset
dataset = load_dataset("galsenai/wolof_tts")

Downloading readme:   0%|          | 0.00/541 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/395M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/393M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/26684 [00:00<?, ? examples/s]

In [None]:
def get_women_voice(example):
    return example["gender"] == 'female'

dataset = dataset.filter(get_women_voice)

In [11]:
def normaliser(example):
    example["transcription_normalised"] = example["text"].lower()
    return example


# Apply the normaliser defined above (the filter was already applied in the previous cell)
dataset = dataset.map(normaliser)
dataset = dataset.rename_column('text', 'transcription')

dataset.push_to_hub('your_username/women_tts')

Dunu ay buqat.


In [12]:
from IPython.display import Audio

# Use the same sample for the transcription, the waveform and the sampling rate
sample = dataset["train"][-1]
print(sample["transcription"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

« Jàpp naa ne pólitig du liggéey, j mooy sama liggéey » (Isaa Sàll)
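When adapting this notebook to your own data, it can help to check clip durations before going further, since the training command later in this notebook passes `--min_duration_in_seconds 2.0` and `--max_duration_in_seconds 20`. The sketch below is only an illustration: `summarize_durations` and the sample clips are hypothetical, and whether the training script treats the bounds as inclusive is an assumption made here for simplicity.

```python
def clip_duration(num_samples, sampling_rate):
    """Duration of one clip in seconds."""
    return num_samples / sampling_rate


def summarize_durations(clips, min_seconds=2.0, max_seconds=20.0):
    """Summarize an iterable of (num_samples, sampling_rate) pairs.

    The bounds are treated as inclusive here, which is an assumption.
    """
    durations = [clip_duration(n, sr) for n, sr in clips]
    in_range = [d for d in durations if min_seconds <= d <= max_seconds]
    return {
        "total_hours": sum(durations) / 3600,
        "num_clips": len(durations),
        "num_in_range": len(in_range),
    }


# Hypothetical clips of 1 s, 5 s and 25 s at 16 kHz
clips = [(16000, 16000), (80000, 16000), (400000, 16000)]
summary = summarize_durations(clips)
print(summary["num_clips"], summary["num_in_range"])  # 3 1
```

On the real dataset, the pairs could be built from `len(example["audio"]["array"])` and `example["audio"]["sampling_rate"]` for each example.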


In [13]:
del dataset


### 1. Annotating the dataset

We'll use [`main.py`](https://github.com/huggingface/dataspeech/blob/main/main.py) to get the following continuous variables:
- Speaking rate `(nb_phonemes / utterance_length)`
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
- Reverberation
- Speech monotony
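To make the first metric concrete, here is a minimal sketch of the `nb_phonemes / utterance_length` formula. It is not Data-Speech's implementation: the function name `speaking_rate` is hypothetical, and how the phonemes are obtained from the transcription is deliberately left out.

```python
def speaking_rate(phonemes, num_samples, sampling_rate):
    """Phonemes per second, i.e. nb_phonemes / utterance_length."""
    utterance_length = num_samples / sampling_rate  # duration in seconds
    if utterance_length <= 0:
        raise ValueError("utterance must have a positive duration")
    return len(phonemes) / utterance_length


# Hypothetical example: 24 phonemes spoken over 2 seconds of 16 kHz audio
example_phonemes = ["p"] * 24
rate = speaking_rate(example_phonemes, num_samples=32000, sampling_rate=16000)
print(rate)  # 12.0
```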


In [None]:
!python3 main.py "your_username/women_tts" \
  --configuration "default" \
  --text_column_name "transcription" \
  --audio_column_name "audio" \
  --cpu_num_workers 2 \
  --rename_column \
  --repo_id "parler_tts" \
  --apply_squim_quality_estimation

Compute SI-SDR, PESQ, STOI
Map:   0%|                                     | 0/26684 [00:00<?, ? examples/s]
INFO - The local file (/home/ubuntu/.cache/torch/hub/torchaudio/models/squim_objective_dns2020.pth) exists. Skipping the download.
Map: 100%|█████████████████████████| 26684/26684 [16:21<00:00, 27.18 examples/s]
Compute pitch
Map: 100%|█████████████████████████| 26684/26684 [43:34<00:00, 10.21 examples/s]
Compute snr and reverb
Map:   0%|                                     | 0/26684 [00:00<?, ? examples/s]
INFO - Lightning automatically upgraded your loaded checkpoint from v1.6.5 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../.cache/huggingface/hub/models--ylacombe--brouhaha-best/snapshots/99bf97b13fd4dda2434a6f7c50855933076f2937/best.ckpt`
Model was trained with pyannote.audio 0.0.1, yours i

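The log above shows roughly 16 minutes for the SI-SDR, PESQ and STOI estimation and roughly 44 minutes for pitch estimation over the 26,684 examples, so expect the whole process to take over an hour on similar hardware.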

The resulting dataset will be pushed to the HuggingFace hub under your HuggingFace handle.

Let's see what the new dataset looks like:

In [None]:
from datasets import load_dataset
dataset = load_dataset("your_username/parler_tts")
print("SI-SDR 1st sample", dataset["train"][0]["si-sdr"])
print("C50 2nd sample", dataset["train"][1]["c50"])
del dataset

As you can see, the current annotations are continuous variables. To use them with Parler-TTS, we need to convert them to textual descriptions, which the next two steps take care of.

### 2. Map annotations to text bins

Since the ultimate goal here is to fine-tune the [Parler-TTS Mini v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on the dataset, we want to stay consistent with the text bins of the datasets on which that model was trained.

This is easy to do thanks to the following:

In [None]:
!python3 ./scripts/metadata_to_text.py \
    "your_username/parler_tts" \
    --repo_id "parler_tts-text-tags" \
    --configuration "default" \
    --cpu_num_workers 2 \
    --path_to_bin_edges "./examples/tags_to_annotations/v02_bin_edges.json" \
    --path_to_text_bins "./examples/tags_to_annotations/v02_text_bins.json" \
    --avoid_pitch_computation \
    --apply_squim_quality_estimation

Thanks to [`v02_bin_edges.json`](https://github.com/huggingface/dataspeech/blob/main/examples/tags_to_annotations/v02_bin_edges.json), we don't need to recompute bins from scratch and the above script takes a few seconds.
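To illustrate what mapping a continuous value onto text bins means, here is a minimal sketch using the standard library. It is not the logic of `metadata_to_text.py`: the helper `to_text_bin`, the edge values and the label list are all hypothetical and chosen only for the example.

```python
from bisect import bisect_right


def to_text_bin(value, bin_edges, text_bins):
    """Return the label of the bin that `value` falls into.

    `bin_edges` has one more entry than `text_bins`. Only the inner edges
    are used, so values outside the outer edges land in the first or last bin.
    """
    if len(bin_edges) != len(text_bins) + 1:
        raise ValueError("expected len(bin_edges) == len(text_bins) + 1")
    inner_edges = bin_edges[1:-1]
    return text_bins[bisect_right(inner_edges, value)]


# Hypothetical edges (phonemes per second) and labels
speaking_rate_edges = [0.0, 8.0, 12.0, 16.0, 40.0]
speaking_rate_bins = ["very slowly", "slightly slowly", "slightly fast", "very fast"]

print(to_text_bin(10.0, speaking_rate_edges, speaking_rate_bins))  # slightly slowly
print(to_text_bin(50.0, speaking_rate_edges, speaking_rate_bins))  # very fast
```

Reusing fixed edges rather than recomputing them from the new data is what keeps the labels comparable with those seen by the pretrained checkpoint.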

The resulting dataset will be pushed to the HuggingFace hub under your HuggingFace handle. Mine was pushed to `your_username/parler_tts-text-tags`.

Notice that text bins such as `slightly slowly` and `very monotone` have been added to the samples.

In [None]:
from datasets import load_dataset
dataset = load_dataset("your_username/parler_tts-text-tags")
print("Noise 1st sample:", dataset["train"][0]["sdr_noise"])
print("Speaking rate 2nd sample:", dataset["train"][1]["speaking_rate"])
del dataset


### 3. Create natural language descriptions from those text bins

Now that we have text bins associated with the dataset, the next step is to create natural language descriptions from the few features we created.

Here, we decided to create prompts that use the name `Anta` and look like the following:
`In a very expressive voice, Anta pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo`

This step generally demands more resources and time and should use one or more GPUs.

The following command shows how to do it using the [2B version of the Gemma 2 model from Google](https://huggingface.co/google/gemma-2-2b-it), which should run in about 50 minutes on the free Colab T4. We used this model because this notebook aims to show the potential of Parler-TTS fine-tuning in a time-efficient way. Otherwise, we would have gone for a bigger model.

As usual, we specify the name and configuration of the dataset we want to annotate. `model_name_or_path` should point to a `transformers` model for prompt annotation. You can find a list of such models [here](https://huggingface.co/models?pipeline_tag=text-generation&library=transformers&sort=trending).

**Note** how we've been able to specify that the dataset is mono-speaker and that we should name the voice Anta thanks to the flags:


`--speaker_name "Anta" --is_single_speaker`.


In [None]:
!python3 ./scripts/run_prompt_creation.py \
    --speaker_name "Anta" \
    --is_single_speaker \
    --is_new_speaker_prompt \
    --dataset_name "your_username/parler_tts-text-tags" \
    --output_dir "./tmp_anta" \
    --dataset_config_name "default" \
    --model_name_or_path "google/gemma-2-2b-it" \
    --per_device_eval_batch_size 5 \
    --attn_implementation "sdpa" \
    --dataloader_num_workers 2 \
    --push_to_hub \
    --hub_dataset_id "your_username/parler_tts-descriptions-tags" \
    --preprocessing_num_workers 2

Let's take a look at some created prompts:

In [None]:
from datasets import load_dataset
dataset = load_dataset("your_username/parler_tts-descriptions-tags")
print("1st sample:", dataset["train"][0]["text_description"])
print("2nd sample:", dataset["train"][1]["text_description"])
del dataset
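Since the descriptions are generated by a language model, a quick sanity check can be worthwhile before training, for instance verifying that each description actually mentions the speaker name. The helper below is a hypothetical sketch, not part of Data-Speech, and the second sample sentence is invented for the example.

```python
def descriptions_missing_speaker(descriptions, speaker_name):
    """Return the indices of descriptions that never mention the speaker name."""
    needle = speaker_name.lower()
    return [i for i, text in enumerate(descriptions) if needle not in text.lower()]


samples = [
    "In a very expressive voice, Anta pronounces her words incredibly slowly.",
    "A woman speaks quickly in a noisy room.",  # invented sample without the name
]
print(descriptions_missing_speaker(samples, "Anta"))  # [1]
```

On the real data, the list of strings would presumably come from the `text_description` column printed in the cell above.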

## Fine-tuning Parler-TTS



In [None]:
%cd ../parler-tts

We can now fully focus on fine-tuning Parler-TTS. Luckily, [the Parler-TTS library](https://github.com/huggingface/parler-tts) has a training script available [here](https://github.com/huggingface/parler-tts/tree/main/training) that can be used with just a few arguments.


> **Note:** you'll be asked to make a choice concerning WandB. If you don't have an account, you can enter `3` to avoid logging to WandB. Otherwise, you can log in to follow how your model trains.

In [None]:
!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler-tts-mini-v1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --prompt_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "your_username/women_tts" \
    --train_metadata_dataset_name "your_username/parler_tts-descriptions-tags" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "your_username/women_tts" \
    --eval_metadata_dataset_name "your_username/parler_tts-descriptions-tags" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.0001 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 5 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true
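With `--per_device_train_batch_size 2` and `--gradient_accumulation_steps 18`, each optimizer update sees 36 examples per device. The sketch below works through that arithmetic; the helper `optimizer_steps` and the example count of 3600 are hypothetical, and the exact step count reported by the training script may differ depending on how it handles the last partial batch.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, num_epochs=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Flags from the command above, with a hypothetical 3600 training examples
effective_batch, total_steps = optimizer_steps(
    num_examples=3600,
    per_device_batch_size=2,
    grad_accum_steps=18,
    num_epochs=2,
)
print(effective_batch, total_steps)  # 36 200
```

In such a hypothetical 200-step run, `--warmup_steps 50` would mean a quarter of training is spent warming up, which is worth keeping in mind when changing the dataset size.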

## Inference

The full training on the free T4 from Google Colab took about an hour.
Now, let's see how to do inference with the newly fine-tuned model!

First install the Parler-TTS library:

In [None]:
!pip install git+https://github.com/huggingface/parler-tts.git

Then:

In [None]:
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("/content/parler-tts/output_dir_training", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")  # same tokenizer as used for training

prompt = "Hey, how are you doing today?"
description = "Anta delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
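To listen to the result outside the notebook, the waveform needs to be written to disk. Here is a minimal, dependency-free sketch that writes mono float samples as a 16-bit WAV file with the standard library. The helper `save_wav` is hypothetical, and a synthetic tone at an assumed 44.1 kHz stands in for `audio_arr` so the block runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for `audio_arr`: half a second of a 440 Hz tone at a hypothetical 44.1 kHz
demo_rate = 44100
demo_audio = [0.3 * math.sin(2 * math.pi * 440 * t / demo_rate)
              for t in range(demo_rate // 2)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
save_wav(demo_path, demo_audio, demo_rate)
```

In the notebook, the same helper would presumably be called with `audio_arr` and the model's output sampling rate (assumed to be available as `model.config.sampling_rate`). The `Audio` widget used earlier in this notebook is another way to play the array inline.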

This is great! As you can see, the model now produces a **consistent** voice throughout generation that sounds like **Anta**!

Since we're quite happy about it, let's push it to the hub to be able to re-use it!

In [None]:
model.push_to_hub("parler-tts-mini-v1-wolof-colab")
tokenizer.push_to_hub("parler-tts-mini-v1-wolof-colab")

You'll now be able to load the model and the tokenizer using the repository id of your model, i.e. `<your_HF_handle>/parler-tts-mini-v1-wolof-colab`.

```python
model = ParlerTTSForConditionalGeneration.from_pretrained("<your_HF_handle>/parler-tts-mini-v1-wolof-colab").to(device)
tokenizer = AutoTokenizer.from_pretrained("<your_HF_handle>/parler-tts-mini-v1-wolof-colab")
```



## Conclusion

To conclude, we've shown here how to fine-tune Parler-TTS Mini v1 on this newly created dataset!

**If you want to fine-tune the model on your own dataset, you can follow and/or adapt the current notebook to make it work!**