<a href="https://colab.research.google.com/github/SaleemMalik632/-Machine_Learning_Course_/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Parler-TTS

## Goal of this notebook

In the following notebook, we'll fine-tune [Parler-TTS Mini v0.1](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) on a 6-hour subset of the [Jenny TTS dataset](https://github.com/dioco-group/jenny-tts-dataset), a 30-hour, high-quality, single-speaker TTS dataset recorded by an Irish female speaker named Jenny.

In particular, we'll:
- Annotate the Jenny dataset with natural language speech descriptions using [Data-Speech](https://github.com/huggingface/dataspeech).
- Fine-tune Parler-TTS with the created dataset.

**You should be able to adapt this notebook to your own datasets quite easily.**





## Prepare the Environment

Throughout this tutorial, we'll use a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.

##### <a name="installation"></a> We'll install Parler-TTS and Data-Speech from source in order to train our model.

In [None]:
!git clone https://github.com/huggingface/dataspeech.git
!pip install --quiet -r ./dataspeech/requirements.txt

Cloning into 'dataspeech'...
remote: Enumerating objects: 383, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 383 (delta 57), reused 55 (delta 34), pack-reused 282
Receiving objects: 100% (383/383), 102.50 KiB | 846.00 KiB/s, done.
Resolving deltas: 100% (219/219), done.

In [None]:
!git clone https://github.com/huggingface/parler-tts.git
%cd parler-tts
!pip install --quiet -e .[train]

Cloning into 'parler-tts'...
remote: Enumerating objects: 683, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 683 (delta 16), reused 15 (delta 15), pack-reused 662
Receiving objects: 100% (683/683), 252.49 KiB | 1.02 MiB/s, done.
Resolving deltas: 100% (418/418), done.
/content/parler-tts
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done

On Colab, we need to run an additional setup step, which you can skip if you're on your local machine.

In [None]:
!pip install --upgrade protobuf wandb==0.16.6

Collecting protobuf
  Using cached protobuf-5.26.1-cp37-abi3-manylinux2014_x86_64.whl (302 kB)


You should link your Hugging Face account so that you can push model repositories to the Hub. This will allow you to save your trained models on the Hub and share them with the community.

Run the command below and then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.

In [None]:
!git config --global credential.helper store
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## 1. Creating our fine-tuning dataset


The aim here is to create an annotated version of Jenny TTS, in order to fine-tune the [Parler-TTS v0.1 checkpoint](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) on this dataset.

Thanks to a [script similar to what's described in the Data-Speech FAQ](https://github.com/huggingface/dataspeech?tab=readme-ov-file#how-do-i-use-datasets-that-i-have-with-this-repository), we've uploaded the dataset to the Hugging Face Hub, under the name [reach-vb/jenny_tts_dataset](https://huggingface.co/datasets/reach-vb/jenny_tts_dataset).

Since this notebook is a demonstration, we'll work with a 6-hour subset of the dataset that we've pushed to the Hub: [ylacombe/jenny-tts-6h](https://huggingface.co/datasets/ylacombe/jenny-tts-6h).

Feel free to follow the link above to listen to some samples of the Jenny TTS dataset in the Hub's dataset viewer.

> Refer to the [Data-Speech README](https://github.com/huggingface/dataspeech?tab=readme-ov-file#data-speech) for more detailed explanations of what's going on under-the-hood.

We'll:
1. Annotate the Jenny dataset with continuous variables that measure the speech characteristics.
2. Map those annotations to text bins that describe these characteristics.
3. Create natural language descriptions from those text bins.

In [None]:
%cd ../dataspeech

/content/dataspeech


But first, let's look at a few samples from the Jenny dataset!

In [None]:
from datasets import load_dataset
dataset = load_dataset("ylacombe/jenny-tts-6h")

In [None]:
from IPython.display import Audio
print(dataset["train"][0]["transcription"])
Audio(dataset["train"][0]["audio"]["array"], rate=dataset["train"][0]["audio"]["sampling_rate"])

It was a bright cold day in April, and the clocks were striking thirteen.


In [None]:
from IPython.display import Audio
print(dataset["train"][1]["transcription"])
Audio(dataset["train"][1]["audio"]["array"], rate=dataset["train"][1]["audio"]["sampling_rate"])

'I wonder if I shall ever be happy enough to have real lace on my clothes and bows on my caps?'


In [None]:
del dataset


### 1. Annotating the Jenny dataset

We'll use [`main.py`](https://github.com/huggingface/dataspeech/blob/main/main.py) to get the following continuous variables:
- Speaking rate `(nb_phonemes / utterance_length)`
- Signal-to-noise ratio (SNR)
- Reverberation
- Speech monotony


In [None]:
!python main.py "ylacombe/jenny-tts-6h" \
  --configuration "default" \
  --text_column_name "transcription" \
  --audio_column_name "audio" \
  --cpu_num_workers 2 \
  --num_workers_per_gpu_for_pitch 2 \
  --rename_column \
  --repo_id "jenny-tts-tags-6h"


The whole process took under 10 minutes!

The resulting dataset will be pushed to the Hugging Face Hub under your Hugging Face handle. Mine was pushed to [ylacombe/jenny-tts-tags-6h](https://huggingface.co/datasets/ylacombe/jenny-tts-tags-6h).

Let's see what the new dataset looks like:

In [None]:
from datasets import load_dataset
dataset = load_dataset("ylacombe/jenny-tts-tags-6h")
print("SNR 1st sample", dataset["train"][0]["snr"])
print("C50 1st sample", dataset["train"][0]["c50"])
del dataset

Downloading readme:   0%|          | 0.00/609 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/958k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4000 [00:00<?, ? examples/s]

SNR 1st sample 54.890892028808594
C50 1st sample 59.73095703125


As you can see, the current annotations are continuous variables. To use them with Parler-TTS, we need to convert them to text descriptions, which the next two steps take care of.

### 2. Map annotations to text bins

Since the ultimate goal here is to fine-tune the [Parler-TTS v0.1 checkpoint](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) on the Jenny dataset, we want to stay consistent with the text bins of the datasets on which the latter model was trained.

This is easy to do thanks to the following:

In [None]:
!python ./scripts/metadata_to_text.py \
    "ylacombe/jenny-tts-tags-6h" \
    --repo_id "jenny-tts-tags-6h" \
    --configuration "default" \
    --cpu_num_workers 2 \
    --path_to_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
    --avoid_pitch_computation

Already computed bin edges have been passed for speaking_rate. Will use: [3.508771929824561, 6.187242299296628, 8.865712668768696, 11.544183038240764, 14.22265340771283, 16.901123777184896, 19.579594146656966, 22.258064516129032].
Map (num_proc=2): 100% 4000/4000 [00:03<00:00, 1202.34 examples/s]
Already computed bin edges have been passed for noise. Will use: [50.0, 53.460838317871094, 56.92167663574219, 60.38251495361328, 63.843353271484375, 67.30419158935547, 70.76502990722656, 74.22586822509766].
Map (num_proc=2): 100% 4000/4000 [00:03<00:00, 1193.89 examples/s]
Already computed bin edges have been passed for reverberation. Will use: [30.498437881469727, 34.706024169921875, 38.91361045837402, 43.12119674682617, 47.32878303527832, 51.53636932373047, 55.74395561218262, 59.951541900634766].
Map (num_proc=2): 100% 4000/4000 [00:03<00:00, 1048.53 examples/s]
Already computed bin edges have been passed for speech_monotony. Will use: [0.0, 17.430070059640066, 34.86014011928013, 52.2902101

Thanks to [`v01_bin_edges.json`](https://github.com/huggingface/dataspeech/blob/main/examples/tags_to_annotations/v01_bin_edges.json), we don't need to recompute bins from scratch and the above script takes a few seconds.
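
To illustrate what binning against fixed edges means, here is a minimal sketch that is not the code of `metadata_to_text.py`. The edges are the noise edges printed in the log above. The label list is an assumption about how the seven bins are named (only `quite noisy` is confirmed by this notebook's output), and `to_text_bin` is a hypothetical helper.

```python
from bisect import bisect_right

# Noise (SNR) bin edges copied from the script log: 8 edges delimit 7 bins.
NOISE_BIN_EDGES = [
    50.0, 53.460838317871094, 56.92167663574219, 60.38251495361328,
    63.843353271484375, 67.30419158935547, 70.76502990722656, 74.22586822509766,
]
# Assumed labels, ordered from the lowest to the highest SNR.
NOISE_LABELS = [
    "very noisy", "quite noisy", "slightly noisy", "moderate ambient sound",
    "slightly clear", "quite clear", "very clear",
]


def to_text_bin(value, edges, labels):
    """Return the label of the bin containing value, clamping out-of-range values."""
    index = bisect_right(edges, value) - 1
    index = max(0, min(index, len(labels) - 1))
    return labels[index]


# SNR of the first sample, as printed earlier in this notebook.
print(to_text_bin(54.890892028808594, NOISE_BIN_EDGES, NOISE_LABELS))  # quite noisy
```

Because the edges are fixed, the same SNR value always maps to the same label, which keeps the fine-tuning descriptions consistent with those seen by the original checkpoint.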

The resulting dataset will be pushed to the Hugging Face Hub under your Hugging Face handle. Mine was pushed to [ylacombe/jenny-tts-tags-6h](https://huggingface.co/datasets/ylacombe/jenny-tts-tags-6h).

You'll notice that text bins such as `quite noisy` and `very fast` have been added to the samples.

In [None]:
from datasets import load_dataset
dataset = load_dataset("ylacombe/jenny-tts-tags-6h")
print("Noise 1st sample:", dataset["train"][0]["noise"])
print("Speaking rate 1st sample:", dataset["train"][0]["speaking_rate"])
del dataset

Downloading readme:   0%|          | 0.00/728 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/935k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4000 [00:00<?, ? examples/s]

Noise 1st sample: quite noisy
Speaking rate 1st sample: quite fast



### 3. Create natural language descriptions from those text bins

Now that we have text bins associated with the Jenny dataset, the next step is to create natural language descriptions from those few features.

Here, we decided to create prompts that use the name `Jenny`. They'll look like the following:
`In a very expressive voice, Jenny pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo.`

This step generally demands more resources and time and should use one or more GPUs.

The following command shows how to do it using the [2B version of the Gemma model from Google](https://huggingface.co/google/gemma-2b-it), which should run in about 15 minutes on the free Colab T4.

As usual, we specify the dataset name and the configuration we want to annotate. `model_name_or_path` should point to a `transformers` model for prompt annotation. You can find a list of such models [here](https://huggingface.co/models?pipeline_tag=text-generation&library=transformers&sort=trending).

**Note** how we've been able to specify that the dataset is mono-speaker and that we should name the voice Jenny thanks to the flags:


`--speaker_name "Jenny" --is_single_speaker`.
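
To picture what naming the voice changes, here is a deliberately simple sketch that builds a description from text bins with a plain string template. It is not the prompt or the model used by `run_prompt_creation.py`: the function `describe`, its arguments, the sentence template, and the fallback subject `A speaker` are all hypothetical.

```python
from typing import Optional


def describe(bins: dict, speaker_name: Optional[str] = None) -> str:
    """Build a one-line description from text bins (template-based stand-in for the LLM)."""
    # With a single, named speaker, use the name; otherwise fall back to a generic subject.
    subject = speaker_name if speaker_name else "A speaker"
    return (
        f"{subject} speaks {bins['speaking_rate']} in a {bins['speech_monotony']} voice. "
        f"The recording is {bins['noise']}."
    )


bins = {
    "speaking_rate": "quite fast",
    "noise": "quite noisy",
    "speech_monotony": "very monotone",
}
print(describe(bins, speaker_name="Jenny"))
print(describe(bins))
```

An LLM produces more varied phrasing than a fixed template, which is the reason the notebook uses one for this step.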


In [None]:
!python ./scripts/run_prompt_creation.py \
  --speaker_name "Jenny" \
  --is_single_speaker \
  --dataset_name "ylacombe/jenny-tts-tags-6h" \
  --output_dir "./tmp_jenny" \
  --dataset_config_name "default" \
  --model_name_or_path "google/gemma-2b-it" \
  --per_device_eval_batch_size 12 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 2 \
  --push_to_hub \
  --hub_dataset_id "jenny-tts-6h-tagged" \
  --preprocessing_num_workers 2

04/30/2024 14:17:59 - INFO - __main__ - *** Load annotated dataset ***
04/30/2024 14:18:01 - INFO - __main__ - *** Load pretrained model ***
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100% 2/2 [00:14<00:00,  7.41s/it]
 ... :   0% 0/334 [00:00<?, ?it/s]04/30/2024 14:18:21 - INFO - __main__ - Resuming train from step 334

Let's take a look at some created prompts:

In [None]:
from datasets import load_dataset
dataset = load_dataset("ylacombe/jenny-tts-6h-tagged")
print("1st sample:", dataset["train"][0]["text_description"])
print("2nd sample:", dataset["train"][1]["text_description"])
del dataset

1st sample: 'The speech sample is very noisy, contains a lot of background noise, and is delivered in a monotone tone with occasional fast bursts.'
2nd sample: 'Jenny's speech is very noisy, but she speaks in a very monotone voice with minimal variation in speed.'.


**Observation:** The first sample unfortunately doesn't have the name Jenny in it. This is probably because we use a smaller, and thus less precise, model than the one we would have picked if this notebook had more resources (e.g. we used [Mistral 7B Instruct v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) to create the Parler-TTS training dataset). This shouldn't prevent our model from learning what we want, though.
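
To get a feel for how often the name is missing, one could measure the share of descriptions that mention it. The sketch below runs on the two samples printed above; `share_mentioning` is a hypothetical helper, not part of Data-Speech.

```python
def share_mentioning(descriptions, name):
    """Return the fraction of descriptions that contain name (case-insensitive)."""
    if not descriptions:
        return 0.0
    hits = sum(name.lower() in text.lower() for text in descriptions)
    return hits / len(descriptions)


# The two generated descriptions shown in the cell output above.
samples = [
    "The speech sample is very noisy, contains a lot of background noise, "
    "and is delivered in a monotone tone with occasional fast bursts.",
    "Jenny's speech is very noisy, but she speaks in a very monotone voice "
    "with minimal variation in speed.",
]
print(share_mentioning(samples, "Jenny"))  # 0.5
```

The same function could be applied to the full `text_description` column of the tagged dataset to decide whether a larger prompt-creation model is worth the extra compute.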

## Fine-tuning Parler-TTS



In [None]:
%cd ../parler-tts

/content/parler-tts


We can now fully focus on fine-tuning Parler-TTS. Luckily, [the Parler-TTS library](https://github.com/huggingface/parler-tts) has a training script available [here](https://github.com/huggingface/parler-tts/tree/main/training), which can be used with just a few arguments.
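
The training command in this section combines a small per-device batch with gradient accumulation so that it fits on a T4. The sketch below works out the resulting effective batch size from the values used in that command (`--per_device_train_batch_size 2`, `--gradient_accumulation_steps 18`, one process). The step estimate is only a rough upper bound: it assumes all 4000 examples pass the duration filters, and the training script's own step count may differ. Both function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of examples that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_processes


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Rough number of optimizer updates, counting a partial last batch per epoch."""
    return math.ceil(num_examples / batch_size) * num_epochs


batch = effective_batch_size(2, 18, num_processes=1)
print(batch)  # 36
print(optimizer_steps(4000, batch, num_epochs=2))  # 224
```

If you run out of memory on your own data, lowering the per-device batch size while raising the accumulation steps keeps the effective batch size unchanged.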


> **Note:** you'll be asked to enter your choice concerning WandB. If you don't have an account, you can enter `3` to avoid logging to WandB. Otherwise, you can log in to follow how your model trains.

In [None]:
!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "ylacombe/jenny-tts-6h" \
    --train_metadata_dataset_name "ylacombe/jenny-tts-6h-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "ylacombe/jenny-tts-6h" \
    --eval_metadata_dataset_name "ylacombe/jenny-tts-6h-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
wandb: (1) Create a W&B account
wandb: (2) Use an existing W

## Inference

The full training on the free T4 from Google Colab took about an hour.
Now, let's see how to do inference with the newly fine-tuned model!

First install the Parler-TTS library:

In [None]:
!pip install git+https://github.com/huggingface/parler-tts.git

Collecting git+https://github.com/huggingface/parler-tts.git
  Cloning https://github.com/huggingface/parler-tts.git to /tmp/pip-req-build-arq29sva
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/parler-tts.git /tmp/pip-req-build-arq29sva
  Resolved https://github.com/huggingface/parler-tts.git to commit 10016fb0300c0dc31a0fb70e26f3affee7b62f16
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting protobuf<3.20,>=3.9.2 (from descript-audiotools>=0.7.2->descript-audio-codec->parler_tts==0.1)
  Using cached protobuf-3.19.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Building wheels for collected packages: parler_tts
  Building wheel for parler_tts (pyproject.toml) ... done
  Created wheel for parler_tts: filename=parler_tts-0.1-py3-none-any.whl size=40796 sha256=08cffceafde39484b5a5a844c87

Then:

In [None]:
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("/content/parler-tts/output_dir_training", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "'Jenny delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks fast.'"

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()



In [None]:
from IPython.display import Audio
Audio(audio_arr, rate=model.config.sampling_rate)

In [None]:
prompt = "Wow, I've really got the same voice as Jenny, huh?"

prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

Audio(audio_arr, rate=model.config.sampling_rate)



In [None]:
prompt = "What a time to be alive!"
description = "'Jenny's speech is very clear, and she speaks in a very monotone voice, really slowly and with minimal variation in speed.'"

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

Audio(audio_arr, rate=model.config.sampling_rate)



This is great! As you can see, the model now produces a **consistent** voice throughout generation that sounds like **Jenny**!
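
The cells above only play the audio inline. To keep a clip as a file, one option is Python's standard `wave` module. The sketch below is a generic helper and not part of Parler-TTS: it assumes mono float samples in the range [-1, 1], and `write_wav` is a hypothetical name. A synthetic tone stands in for a generated clip so that the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Synthetic 1-second, 440 Hz tone standing in for a generated clip.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
out_path = os.path.join(tempfile.gettempdir(), "parler_tts_sketch.wav")
write_wav(out_path, tone, sample_rate)
```

In this notebook, the call would look like `write_wav("jenny.wav", audio_arr, model.config.sampling_rate)`.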

Since we're quite happy with it, let's push it to the Hub so we can reuse it!

In [None]:
model.push_to_hub("parler-tts-mini-Jenny-colab")
tokenizer.push_to_hub("parler-tts-mini-Jenny-colab")

model.safetensors:   0%|          | 0.00/1.29G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/ylacombe/parler-tts-mini-Jenny-colab/commit/4f73db54a28230d3b95fb9b4507408ab71692b01', commit_message='Upload tokenizer', commit_description='', oid='4f73db54a28230d3b95fb9b4507408ab71692b01', pr_url=None, pr_revision=None, pr_num=None)

You'll now be able to load the model and the tokenizer directly from your model's repository id, i.e. `<your_HF_handle>/parler-tts-mini-Jenny-colab`.

```python
model = ParlerTTSForConditionalGeneration.from_pretrained("<your_HF_handle>/parler-tts-mini-Jenny-colab").to(device)
tokenizer = AutoTokenizer.from_pretrained("<your_HF_handle>/parler-tts-mini-Jenny-colab")
```



## Conclusion

To conclude, we've shown here:
1. how to annotate a 6-hour single-speaker dataset
2. how to fine-tune Parler-TTS Mini v0.1 on this newly created dataset!

**If you want to fine-tune the model on your own dataset, you can follow and/or adapt the current notebook to make it work! Don't forget to check how to push your own local dataset to the Hugging Face Hub using a [script similar to what's described in the Data-Speech FAQ](https://github.com/huggingface/dataspeech?tab=readme-ov-file#how-do-i-use-datasets-that-i-have-with-this-repository)!**