# Fine-tuning Parler-TTS

## Goal of this notebook

In the following notebook, we'll fine-tune [Parler-TTS Mini v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on the `18h 47mn 19s` *female* voice of the [Galsen AI TTS dataset](https://huggingface.co/datasets/galsenai/wolof_tts).

In particular, we'll:
- Annotate the dataset with natural language speech descriptions using [Data-Speech](https://github.com/huggingface/dataspeech).
- Fine-tune Parler-TTS with the created dataset.

**You should be able to adapt this notebook to your own datasets quite easily.**





## Prepare the Environment

Throughout this tutorial, we'll use a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.
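
Before installing anything, it can help to confirm which accelerator the runtime actually sees. The snippet below is a minimal sketch rather than part of the original setup: `describe_accelerator` is a hypothetical helper name, and since PyTorch may not be installed yet at this point, it falls back to a message instead of failing.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return a short, human-readable summary of the visible GPU, if any."""
    # PyTorch may not be installed yet at this point of the notebook.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed yet"

    import torch

    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch"

    # Report the name and total memory of the first CUDA device.
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    return f"{props.name} with {total_gb:.1f} GB of memory"


print(describe_accelerator())
```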

##### <a name="installation"></a> We'll install Parler-TTS and Data-Speech from source in order to train our model.

In [4]:
%cd ../
!ls

/home/caytu
Wolof-ASR  Wolof-TTS


In [None]:
!cd src
!git clone https://github.com/huggingface/dataspeech.git
!pip install --quiet -r ./dataspeech/requirements.txt

In [None]:
!git clone https://github.com/huggingface/parler-tts.git
%cd parler-tts
!pip install --quiet -e .[train]

On Colab, we need to run an additional setup step, which you can skip if you're on your local machine.

In [None]:
!pip install --upgrade protobuf wandb==0.16.6

You should link your Hugging Face account so that you can push model repositories to the Hub. This will allow you to save your trained models on the Hub so that you can share them with the community.

Run the command below and then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.

In [9]:
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

True

In [10]:
import os

hf_token = os.getenv('HF_TOKEN')

In [11]:
!git config --global credential.helper store
!huggingface-cli login --token $hf_token

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `galsenai` has been saved to /home/caytu/.cache/huggingface/stored_tokens
Your token has been saved to /home/caytu/.cache/huggingface/token
Login successful.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## 1. Creating our fine-tuning dataset


The aim here is to create an annotated version of Anta TTS, in order to fine-tune the [Parler-TTS Mini v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on this dataset.

Thanks to a [script similar to what's described in the Data-Speech FAQ](https://github.com/huggingface/dataspeech?tab=readme-ov-file#how-do-i-use-datasets-that-i-have-with-this-repository), we've uploaded the dataset to the Hugging Face Hub, under the name [galsenai/wolof_tts](https://huggingface.co/datasets/galsenai/wolof_tts).

Since this notebook is a demonstration, we'll filter the dataset to keep only the female voice and save it under the name [galsenai/women_wolof_tts](https://huggingface.co/datasets/galsenai/women_wolof_tts).

Feel free to follow the link above to listen to some samples of the TTS dataset in the Hub viewer.

We'll:
1. Annotate the dataset with continuous variables that measure the speech characteristics.
2. Map those annotations to text bins that describe those characteristics.
3. Create natural language descriptions from those text bins.

In [3]:
%cd ../src/dataspeech

/home/caytu/Wolof-TTS/src/dataspeech


In [None]:
!pip install datasets

In [4]:
from datasets import load_dataset
dataset = load_dataset("derguene/anta_women_tts")

README.md:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

train-00000-of-00012.parquet:   0%|          | 0.00/377M [00:00<?, ?B/s]

train-00001-of-00012.parquet:   0%|          | 0.00/380M [00:00<?, ?B/s]

train-00002-of-00012.parquet:   0%|          | 0.00/383M [00:00<?, ?B/s]

train-00003-of-00012.parquet:   0%|          | 0.00/385M [00:00<?, ?B/s]

train-00004-of-00012.parquet:   0%|          | 0.00/386M [00:00<?, ?B/s]

train-00005-of-00012.parquet:   0%|          | 0.00/382M [00:00<?, ?B/s]

train-00006-of-00012.parquet:   0%|          | 0.00/386M [00:00<?, ?B/s]

train-00007-of-00012.parquet:   0%|          | 0.00/393M [00:00<?, ?B/s]

train-00008-of-00012.parquet:   0%|          | 0.00/383M [00:00<?, ?B/s]

train-00009-of-00012.parquet:   0%|          | 0.00/382M [00:00<?, ?B/s]

train-00010-of-00012.parquet:   0%|          | 0.00/401M [00:00<?, ?B/s]

train-00011-of-00012.parquet:   0%|          | 0.00/377M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/19918 [00:00<?, ? examples/s]

In [5]:
type(dataset["train"]["audio"])

list

In [27]:
def get_women_voice(example):
    return example["gender"] == 'female'

dataset = dataset.filter(get_women_voice)

In [None]:
def normaliser(example):
    example["transcription_normalised"] = example["text"].lower()
    example["gender"] = 1
    return example


dataset = dataset.map(normaliser)
dataset = dataset.rename_column('text', 'transcription')

dataset.push_to_hub('derguene/anta_women_tts')

In [7]:
from IPython.display import Audio
# Fetch the last sample once so the text, audio and sampling rate come from the same example
sample = dataset["train"][-1]
print(sample["text"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

xam na li mu bëgg te mu ngi koy def


In [19]:
del dataset


### 1. Annotating the dataset

We'll use [`main.py`](https://github.com/huggingface/dataspeech/blob/main/main.py) to get the following continuous variables:
- Speaking rate `(nb_phonemes / utterance_length)`
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
- Reverberation
- Speech monotony
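
To make the first two quantities concrete, here is a minimal sketch in plain Python. `speaking_rate` simply applies the ratio given above. `si_sdr_db` implements the textbook, reference-based definition of SI-SDR; this is an assumption made for illustration, since the annotation script has no clean reference signal and, judging by the `--apply_squim_quality_estimation` flag used in the command below, relies on a model-based estimate instead. Both function names and all numbers are made up for the example.

```python
import math
from typing import Sequence


def speaking_rate(nb_phonemes: int, utterance_length_s: float) -> float:
    """Phonemes per second, i.e. nb_phonemes / utterance_length."""
    if utterance_length_s <= 0:
        raise ValueError("utterance_length_s must be positive")
    return nb_phonemes / utterance_length_s


def si_sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Reference-based SI-SDR in dB between two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference: this projection is what makes
    # the measure invariant to the overall gain of the estimate.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy example with made-up numbers.
print(speaking_rate(nb_phonemes=42, utterance_length_s=3.5))  # 12.0

clean = [math.sin(0.1 * n) for n in range(1000)]
noisy = [c + 0.1 * math.cos(0.37 * n) for n, c in enumerate(clean)]
louder = [2.0 * x for x in noisy]

# Scaling the estimate leaves the score unchanged (up to rounding).
print(si_sdr_db(clean, noisy), si_sdr_db(clean, louder))
```

A higher SI-SDR means less distortion relative to the speech, which is why the value is later mapped to noise-related text bins.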


In [8]:
!ls

LICENSE    dataspeech  main.py		 scripts
README.md  examples    requirements.txt  tmp_anta


In [None]:
!python3 main.py "derguene/anta_women_tts" \
  --configuration "default" \
  --text_column_name "text" \
  --audio_column_name "audio" \
  --cpu_num_workers 2 \
  --rename_column \
  --repo_id "parler_tts" \
  --apply_squim_quality_estimation

  from speechbrain.pretrained import (
Compute SI-SDR, PESQ, STOI
Map:  97%|████████████████████████▎| 19372/19918 [05:56<00:09, 56.31 examples/s]

The whole process took under 10 minutes!

The resulting dataset will be pushed to the Hugging Face Hub under your Hugging Face handle.

Let's see what the new dataset looks like:

In [None]:
from datasets import load_dataset

dataset = load_dataset("derguene/parler_tts")
print("SI-SDR 1st sample:", dataset["train"][0]["si-sdr"])
print("C50 1st sample:", dataset["train"][0]["c50"])
del dataset

As you can see, the current annotations are continuous variables. To use them with Parler-TTS, we need to convert them to textual descriptions, which the next two steps take care of.

### 2. Map annotations to text bins

Since the ultimate goal here is to fine-tune the [Parler-TTS Mini v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on the dataset, we want to stay consistent with the text bins of the datasets on which that model was trained.

This is easy to do thanks to the following:

In [23]:
!python3 ./scripts/metadata_to_text.py \
    "Alwaly/parler_tts" \
    --repo_id "parler_tts-text-tags" \
    --configuration "default" \
    --cpu_num_workers 2 \
    --path_to_bin_edges "./examples/tags_to_annotations/v02_bin_edges.json" \
    --path_to_text_bins "./examples/tags_to_annotations/v02_text_bins.json" \
    --avoid_pitch_computation \
    --apply_squim_quality_estimation

Already computed bin edges have been passed for speaking_rate. Will use: [0.0, 3.8258038258038254, 7.651607651607651, 11.477411477411476, 15.303215303215302, 19.129019129019127, 22.95482295482295, 26.78062678062678].
Map (num_proc=2): 100%|█████████| 29949/29949 [00:02<00:00, 12232.41 examples/s]
Already computed bin edges have been passed for noise. Will use: [17.12751579284668, 25.4012325831822, 33.67494937351772, 41.94866616385323, 50.22238295418875, 58.49609974452427, 66.76981653485979, 75.04353332519531].
Map (num_proc=2): 100%|█████████| 29949/29949 [00:02<00:00, 12474.70 examples/s]
Already computed bin edges have been passed for reverberation. Will use: [10, 35, 45, 55, 59, 60].
Map (num_proc=2): 100%|█████████| 29949/29949 [00:02<00:00, 12210.38 examples/s]
Already computed bin edges have been passed for speech_monotony. Will use: [0.0, 20.37920924595424, 40.75841849190848, 70, 90, 142.6544647216797].
Map (num_proc=2): 100%|█████████| 29949/29949 [00:02<00:00, 11981.97 example

Thanks to [`v02_bin_edges.json`](https://github.com/huggingface/dataspeech/blob/main/examples/tags_to_annotations/v02_bin_edges.json), we don't need to recompute bins from scratch and the above script takes a few seconds.
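
Conceptually, mapping a continuous annotation to a text bin is an interval lookup. The sketch below illustrates it with the speaking-rate edges printed in the output above. The labels are an illustrative assumption (the real ones live in `v02_text_bins.json`), and `to_text_bin` is a hypothetical helper, not the function used by `metadata_to_text.py`.

```python
from bisect import bisect_right

# Bin edges copied from the script output above (8 edges define 7 intervals).
SPEAKING_RATE_EDGES = [
    0.0, 3.8258038258038254, 7.651607651607651, 11.477411477411476,
    15.303215303215302, 19.129019129019127, 22.95482295482295, 26.78062678062678,
]

# Illustrative labels (assumption): one label per interval between two edges.
SPEAKING_RATE_LABELS = [
    "very slowly", "slowly", "slightly slowly", "moderate speed",
    "slightly fast", "fast", "very fast",
]


def to_text_bin(value, edges, labels):
    """Return the label of the interval that contains `value`."""
    if len(labels) != len(edges) - 1:
        raise ValueError("expected exactly one label per interval")
    # bisect_right counts the edges that are <= value; subtracting one
    # gives the index of the interval the value falls into.
    index = bisect_right(edges, value) - 1
    # Values outside the known range are clamped to the first or last bin.
    index = max(0, min(index, len(labels) - 1))
    return labels[index]


print(to_text_bin(9.0, SPEAKING_RATE_EDGES, SPEAKING_RATE_LABELS))  # slightly slowly
```

Reusing precomputed edges keeps the vocabulary of the descriptions aligned with what the pretrained checkpoint has already seen.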

The resulting dataset will be pushed to the Hugging Face Hub under your Hugging Face handle, i.e. to `<your_HF_handle>/parler_tts-text-tags`.

Notice that text bins such as `slightly slowly` and `very monotone` have been added to the samples.

In [12]:
from datasets import load_dataset
dataset = load_dataset("derguene/parler_tts-text-tags")
print("Noise 1st sample:", dataset["train"][0]["sdr_noise"])
print("Speaking rate 1st sample:", dataset["train"][0]["speaking_rate"])
del dataset

README.md:   0%|          | 0.00/882 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/4.03M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/29949 [00:00<?, ? examples/s]

Noise 1st sample: noisy
Speaking rate 1st sample: slightly slowly



### 3. Create natural language descriptions from those text bins

Now that we have text bins associated with the Anta dataset, the next step is to create natural language descriptions out of the few features we created.

Here, we decided to create prompts that use the name `Anta`, prompts that look like the following:
`In a very expressive voice, Anta pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo.`

This step generally demands more resources and time, and should run on one or more GPUs.

The following command shows how to do it using the [2B version of the Gemma 2 model from Google](https://huggingface.co/google/gemma-2-2b-it), which should run in about 50 minutes on the free Colab T4. Note that we used this model because this notebook aims to show the potential of Parler-TTS fine-tuning, so it favors time-efficiency. Otherwise, we would have gone for a bigger model.

As usual, we specify the dataset name and configuration we want to annotate. `model_name_or_path` should point to a `transformers` model for prompt annotation. You can find a list of such models [here](https://huggingface.co/models?pipeline_tag=text-generation&library=transformers&sort=trending).

**Note** how we've been able to specify that the dataset is mono-speaker and that we should name the voice Anta thanks to the flags:


`--speaker_name "Anta" --is_single_speaker`.
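
To make the input and output of this step concrete, here is a template-based stand-in. It is only an illustration: the actual script asks the LLM to write each sentence, and `build_description` is a hypothetical helper. The `speaking_rate` and `sdr_noise` keys match the columns shown in the previous step; the `speech_monotony` and `reverberation` keys are assumed from the bin names printed by the mapping script.

```python
def build_description(speaker: str, tags: dict) -> str:
    """Turn a handful of text bins into one fixed-template description."""
    return (
        f"{speaker} speaks {tags['speaking_rate']} in a {tags['speech_monotony']} voice. "
        f"The recording is {tags['sdr_noise']} and {tags['reverberation']}."
    )


# Example text bins for one sample (values chosen for illustration).
tags = {
    "speaking_rate": "slightly slowly",
    "speech_monotony": "very monotone",
    "sdr_noise": "noisy",
    "reverberation": "slightly distant-sounding",
}

print(build_description("Anta", tags))
```

An LLM produces more varied phrasings than a fixed template, which is the reason the notebook uses one despite the extra compute.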


In [None]:
!python3 ./scripts/run_prompt_creation.py \
    --speaker_name "Anta" \
    --is_single_speaker \
    --is_new_speaker_prompt \
    --dataset_name "derguene/parler_tts-text-tags" \
    --output_dir "./tmp_anta" \
    --dataset_config_name "default" \
    --model_name_or_path "google/gemma-2-2b-it" \
    --per_device_eval_batch_size 5 \
    --attn_implementation "sdpa" \
    --dataloader_num_workers 2 \
    --push_to_hub \
    --hub_dataset_id "derguene/parler_tts-descriptions-tags" \
    --preprocessing_num_workers 2

11/11/2024 10:18:45 - INFO - __main__ - *** Load annotated dataset ***
11/11/2024 10:18:46 - INFO - __main__ - *** Load pretrained model ***
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:18<00:00,  9.13s/it]
Preparing prompts (num_proc=2): 100%|█| 29949/29949 [00:18<00:00, 1612.49 exampl
 ... :   0%|                                           | 0/5990 [00:00<?, ?it/s]You're using a GemmaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GemmaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[2024-11-11 10:19:28,300] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
11/11/2024 10:19:28 - INFO - root - gcc -pt

Let's take a look at some created prompts:

In [14]:
from datasets import load_dataset
dataset = load_dataset("derguene/parler_tts-descriptions-tags")
print("1st sample:", dataset["train"][0]["text_description"])
print("2nd sample:", dataset["train"][1]["text_description"])
del dataset

README.md:   0%|          | 0.00/929 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/29949 [00:00<?, ? examples/s]

1st sample: 

'Anta's speech has a slightly distant-sounding quality, with a noticeable amount of noise. Her voice is expressive and animated, though delivered at a slightly slow pace.'

Let me know if you have any questions. 



2nd sample: 

'Anta's voice sounds slightly distant, with a noticeable amount of noise. The tone of her speech is slightly expressive, and she delivers it in a slow pace.' 





## 2. Fine-tuning Parler-TTS



In [5]:
%cd ../../src/parler-tts

/home/caytu/Wolof-TTS/src/parler-tts


We can now fully focus on fine-tuning Parler-TTS. Luckily, [the Parler-TTS library](https://github.com/huggingface/parler-tts) has a training script available [here](https://github.com/huggingface/parler-tts/tree/main/training), that can be used with just a few arguments.


> **Note:** you need to enter your choice concerning WandB. If you don't have an account, you can enter `3` to avoid logging to WandB. Otherwise, you can log in to follow how your model trains.

In [6]:
import os

os.environ["WANDB_NOTEBOOK_NAME"] = "../../notebooks/ParlerTTS_v1_finetuning_on_single_wolof_tts_dataset.ipynb"
os.environ["WANDB_PROJECT"]       = "Parler_TTS"
os.environ["WANDB_LOG_MODEL"]     = "end"
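
If you would rather not answer the WandB prompt interactively, the `WANDB_MODE` environment variable offers another way to control logging. The sketch below is an optional addition, not part of the original run; `configure_wandb` is a hypothetical helper, and it needs to run before the training command is launched so that the child process inherits the variable.

```python
import os


def configure_wandb(mode: str = "offline") -> str:
    """Set WANDB_MODE for this process and the commands it launches."""
    # "online" syncs runs, "offline" keeps them on disk, "disabled" turns logging off.
    allowed = {"online", "offline", "disabled"}
    if mode not in allowed:
        raise ValueError(f"mode must be one of {sorted(allowed)}")
    os.environ["WANDB_MODE"] = mode
    return os.environ["WANDB_MODE"]


print(configure_wandb("offline"))
```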

In [7]:
import wandb
wandb.init(project="parler_tts")

[34m[1mwandb[0m: Currently logged in as: [33mmbayederguene[0m ([33mcadair[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler-tts-mini-v1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --prompt_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "derguene/anta_women_tts" \
    --train_metadata_dataset_name "derguene/parler_tts-descriptions-tags" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "derguene/anta_women_tts" \
    --eval_metadata_dataset_name "derguene/parler_tts-descriptions-tags" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 200 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.0001 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 5 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true
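
A quick bit of arithmetic helps when reading these arguments. The sketch below computes the effective batch size implied by `--per_device_train_batch_size 2` and `--gradient_accumulation_steps 18`, plus a rough upper bound on optimizer updates per epoch. It assumes a single GPU and the 19,918 examples reported when the dataset was loaded; the duration filters may drop some samples and the training script may round differently, so treat the second number as an estimate. Both helper names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch_size: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to a single optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Upper bound on optimizer updates per epoch (last partial batch counted)."""
    return math.ceil(num_examples / batch_size)


batch = effective_batch_size(per_device_batch_size=2, accumulation_steps=18)
print(batch)  # 36
print(updates_per_epoch(19_918, batch))  # 554
```

Gradient accumulation is what lets a 16GB GPU behave as if it were training with a batch of 36 while only holding 2 samples in memory at a time.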

## Inference

The full training on the free T4 from Google Colab took about an hour.
Now, let's see how to do inference with the newly fine-tuned model!

First install the Parler-TTS library:

In [None]:
!pip install --no-cache-dir --upgrade git+https://github.com/huggingface/parler-tts.git

In [1]:
%cd ../../src/parler-tts

/home/caytu/Wolof-TTS/src/parler-tts


Then:

In [3]:
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("output_dir_training/checkpoint-500-epoch-0").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

  WeightNorm.apply(module, name, dim)
Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder config: T5Config {
  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

Config of the audio_encoder: <

In [4]:
prompt      = "Màngi tudd Anta, di wax ak yéen ci wolof ngir nu mën dégganté bu baax"
description = "Anta delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks fast."

input_ids        = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, do_sample=False)
audio_arr  = generation.cpu().numpy().squeeze()

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [5]:
len(audio_arr)

5632
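
The array length is easier to interpret once converted to seconds. A minimal sketch, assuming a 44.1 kHz sampling rate (as the name of the `parler-tts/dac_44khZ_8kbps` feature extractor suggests; `model.config.sampling_rate` is the authoritative value) and a hypothetical helper name:

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of an audio array in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


print(round(duration_seconds(5632, 44_100), 3))  # 0.128
```

If the duration is much shorter than the sentence you asked for, the generation may have stopped early and is worth inspecting.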

In [8]:
# Export the audio to a file first, so that it can be played back in the next cell
import soundfile as sf
audio_path = "parler_tts_out.wav"
sf.write(audio_path, audio_arr, model.config.sampling_rate)

In [5]:
from IPython.display import Audio
Audio('parler_tts_out.wav', rate=44100)

This is great! As you can see, the model now produces a **consistent** voice throughout generation, one that sounds like **Anta**!

Since we're quite happy with it, let's push it to the Hub so we can reuse it!

In [6]:
model.push_to_hub("galsenai/parler-tts-mini-v1-wolof")
tokenizer.push_to_hub("galsenai/parler-tts-mini-v1-wolof")

model.safetensors:   0%|          | 0.00/3.51G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/galsenai/parler-tts-mini-v1-wolof/commit/5d2f938ee9bcf83cf02ebaee6c699689613983e1', commit_message='Upload tokenizer', commit_description='', oid='5d2f938ee9bcf83cf02ebaee6c699689613983e1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/galsenai/parler-tts-mini-v1-wolof', endpoint='https://huggingface.co', repo_type='model', repo_id='galsenai/parler-tts-mini-v1-wolof'), pr_revision=None, pr_num=None)

You'll now be able to load the model and the tokenizer using the repository id of your model, i.e. `<your_HF_handle>/parler-tts-mini-v1-wolof`.

```python
model = ParlerTTSForConditionalGeneration.from_pretrained("<your_HF_handle>/parler-tts-mini-v1-wolof").to(device)
tokenizer = AutoTokenizer.from_pretrained("<your_HF_handle>/parler-tts-mini-v1-wolof")
```



## Conclusion

To conclude, we've shown here how to fine-tune Parler-TTS Mini v1 on this newly created dataset!

**If you want to fine-tune the model on your own dataset, you can follow and/or adapt the current notebook to make it work!**