# Fine-tuning the xTTS v2 model
#### Note:
The xTTS model used here is a fork that has been modified to support languages the original model does not cover.

## Cloning the model repository

As mentioned above, the whole training process relies on this fork, https://github.com/mohaskii/XTTSv2-Finetuning-for-New-Languages.git, which contains everything needed to train the model.

In [2]:
!ls ../../

LICENSE    data      notebooks	       scripts	wandb
README.md  metadata  requirements.txt  src


In [3]:
%cd ../../src/

/home/caytu/Wolof-TTS/src


In [4]:
!git clone https://github.com/mohaskii/XTTSv2-Finetuning-for-New-Languages.git

Cloning into 'XTTSv2-Finetuning-for-New-Languages'...
remote: Enumerating objects: 621, done.
remote: Counting objects: 100% (619/619), done.
remote: Compressing objects: 100% (439/439), done.
remote: Total 621 (delta 147), reused 598 (delta 133), pack-reused 2 (from 1)
Receiving objects: 100% (621/621), 2.08 MiB | 12.77 MiB/s, done.
Resolving deltas: 100% (147/147), done.


In [6]:
!mv XTTSv2-Finetuning-for-New-Languages/ xTTS/

In [None]:
!pip install -r requirements.txt

In [None]:
!pip install --upgrade tensorflow tensorboard numpy

## Dataset formatting
As in the training code, we use the `coqui` formatter:
```
BaseDatasetConfig(
            formatter="coqui",
            dataset_name="ft_dataset",
            path=os.path.dirname(train_csv),
            meta_file_train=os.path.basename(train_csv),
            meta_file_val=os.path.basename(eval_csv),
            language=language,
        )
```
The dataset must therefore be organized as follows:
```
├── datasets/
│   ├── wavs/
│   │   ├── xxx.wav
│   │   ├── yyy.wav
│   │   ├── zzz.wav
│   │   └── ...
│   ├── metadata_train.csv
│   ├── metadata_eval.csv
```
And the metadata files as follows:
```
audio_file|text|speaker_name
wavs/xxx.wav|How do you do?|@X
wavs/yyy.wav|Nice to meet you.|@Y
wavs/zzz.wav|Good to see you.|@Z
```
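
Before training, it can help to check that a metadata file really follows this layout. The sketch below is one possible check and is not part of the fork: the function name `check_metadata` and the demo file are hypothetical, and it assumes only the pipe-delimited, three-column format shown above.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ["audio_file", "text", "speaker_name"]


def check_metadata(csv_path):
    """Return a list of problems found in a pipe-delimited metadata file."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|")
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            problems.append(f"unexpected header: {header}")
        # Data rows start at row 2, right after the header.
        for row_no, row in enumerate(reader, start=2):
            if len(row) != 3:
                problems.append(f"row {row_no}: expected 3 fields, got {len(row)}")
            elif not all(field.strip() for field in row):
                problems.append(f"row {row_no}: empty field")
    return problems


# Demo on a throwaway file that mirrors the example above.
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "metadata_train.csv")
with open(demo_csv, "w", encoding="utf-8") as f:
    f.write("audio_file|text|speaker_name\n")
    f.write("wavs/xxx.wav|How do you do?|@X\n")
    f.write("wavs/yyy.wav|Nice to meet you.\n")  # deliberately missing the speaker

demo_problems = check_metadata(demo_csv)
print(demo_problems)
```

An empty list means the file matches the expected layout; here the last row is reported because its speaker column is missing.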



In [2]:
from datasets import load_dataset
ds = load_dataset("galsenai/wolof_tts")



In [None]:
import os
import csv
import numpy as np
from scipy.io.wavfile import write
from tqdm import tqdm
from wolof_integer_speller import spell_number_in_wolof
import re
from re import Match


output_dir = "dataset"
wavs_dir = os.path.join(output_dir, "wavs")
os.makedirs(wavs_dir, exist_ok=True)


metadata_train_path = os.path.join(output_dir, "metadata_train.csv")
metadata_eval_path = os.path.join(output_dir, "metadata_eval.csv")


def replace_match(match: Match) -> str:
    number_str = match.group(0)
    try:
        number = int(number_str)
        return " " + spell_number_in_wolof(number) + " "
    except ValueError:
        return number_str


# Process train and test splits separately but maintain continuous IDs
file_id_counter = 0


def process_dataset_split(split_name, output_path, file_id_start) -> int:
    file_id = file_id_start
    with open(output_path, mode="w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, delimiter="|")
        writer.writerow(["audio_file", "text", "speaker_name"])  # Header

        # Process the split
        for sample in tqdm(ds[split_name], desc=f"Exporting {split_name} dataset"):
            audio_array = sample["audio"]["array"]
            sampling_rate = sample["audio"]["sampling_rate"]
            
            # Replace all numeric digits with their spelled-out form in Wolof
            text = re.sub(r"\d+", replace_match, sample["text"])

            gender = sample["gender"]

            wav_filename = f"{file_id:06d}.wav"
            wav_path = os.path.join(wavs_dir, wav_filename)

            write(wav_path, sampling_rate, np.array(audio_array, dtype=np.float32))

            speaker_name = f"@{gender[0].upper()}"  # Example: @M or @F
            writer.writerow([f"wavs/{wav_filename}", text, speaker_name])
            file_id += 1

    return file_id


# Process train split
file_id_counter = process_dataset_split("train", metadata_train_path, file_id_counter)

# Process test split
file_id_counter = process_dataset_split("test", metadata_eval_path, file_id_counter)


print("Export complete.")
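
The digit replacement is easy to try in isolation. The sketch below uses a hypothetical stand-in, `fake_speller`, in place of `spell_number_in_wolof`, so it runs without the `wolof_integer_speller` package and does not produce real Wolof. It also adds a whitespace cleanup step that is not in the export code above, since padding each number with spaces can leave doubled spaces. The sample sentence is arbitrary.

```python
import re


def fake_speller(number: int) -> str:
    # Hypothetical stand-in for spell_number_in_wolof: NOT real Wolof output.
    return f"<{number}>"


def replace_match(match: re.Match) -> str:
    # Same idea as the export code, with the stand-in speller.
    return " " + fake_speller(int(match.group(0))) + " "


def normalize_numbers(text: str) -> str:
    spelled = re.sub(r"\d+", replace_match, text)
    # Extra step (an assumption, not in the original): collapse repeated
    # whitespace and trim both ends.
    return re.sub(r"\s+", " ", spelled).strip()


print(normalize_numbers("am na 3 xale ak 12 téere"))
```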

## Downloading the pre-trained model

In [7]:
!python xTTS/download_checkpoint.py --output_path xTTS/checkpoints/

 > Downloading DVAE files!

## Extending the vocabulary and adjusting the configuration

In [11]:
!python xTTS/extend_vocab_config.py --output_path=xTTS/checkpoints/ \
--metadata_path ../data/metadata.csv \
--language wo --extended_vocab_size 2000 

<class 'list'>
[00:00:00] Tokenize words                 ██████████████████ 13905    /    13905
[00:00:00] Count pairs                    ██████████████████ 13905    /    13905
[00:00:00] Compute merges                 ██████████████████ 1938     /     1938


In [12]:
!ls xTTS/checkpoints/XTTS_v2.0_original_model_files

config.json  dvae.pth  mel_stats.pth  model.pth  vocab.json


## Fine-tuning

In [None]:
!CUDA_VISIBLE_DEVICES=0 python xTTS/train_gpt_xtts.py --output_path xTTS/checkpoints/ \
--metadatas ../data/metadata_train.csv,../data/metadata_eval.csv,wo \
--num_epochs 1500 \
--batch_size 4 \
--grad_acumm 4 \
--max_text_length 400 \
--max_audio_length 330750 \
--weight_decay 1e-2 \
--lr 5e-6 
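
Two of these flags are easier to reason about after a little arithmetic. The sketch below assumes that `--max_audio_length` is counted in samples at a training sample rate of 22050 Hz (an assumption; check the audio settings in `train_gpt_xtts.py`), and that gradient accumulation multiplies the per-step batch size in the usual way.

```python
batch_size = 4                # --batch_size
grad_accum_steps = 4          # --grad_acumm
max_audio_length = 330750     # --max_audio_length, assumed to be in samples
assumed_sample_rate = 22050   # Hz; an assumption, not stated in the command

# Number of samples contributing to each optimizer update.
effective_batch_size = batch_size * grad_accum_steps

# Longest clip kept for training, in seconds, under the assumed sample rate.
max_audio_seconds = max_audio_length / assumed_sample_rate

print(effective_batch_size)
print(max_audio_seconds)
```

Under these assumptions, each update uses 16 samples and clips longer than 15 seconds are excluded.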

## Inference

In [27]:
!ls xTTS/checkpoints/GPT_XTTS_FT-November-16-2024_08+24PM-38f71bd

best_model.pth	      events.out.tfevents.1731788642.asr-galsenai.2074919.0
best_model_72250.pth  train_gpt_xtts.py
config.json	      trainer_0_log.txt


For this step, simply replace the checkpoint folder name and the model file name with those produced by your own training run.
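
If retyping these names after every run becomes tedious, the lookup can be scripted. The sketch below is one possible approach and is not something the fork provides: the helper `find_latest_run` is hypothetical, and it assumes run folders start with `GPT_XTTS_FT-` and contain a `best_model.pth`, as in the listing above.

```python
import glob
import os
import tempfile
import time


def find_latest_run(checkpoints_root):
    """Return (run_dir, model_file) for the most recently modified run."""
    runs = [
        d for d in glob.glob(os.path.join(checkpoints_root, "GPT_XTTS_FT-*"))
        if os.path.isfile(os.path.join(d, "best_model.pth"))
    ]
    if not runs:
        raise FileNotFoundError(f"no finished run under {checkpoints_root}")
    latest = max(runs, key=os.path.getmtime)
    return latest, "best_model.pth"


# Demo with two fake run folders in a temporary directory.
root = tempfile.mkdtemp()
for name, age in [("GPT_XTTS_FT-old", 100), ("GPT_XTTS_FT-new", 0)]:
    run = os.path.join(root, name)
    os.makedirs(run)
    open(os.path.join(run, "best_model.pth"), "w").close()
    stamp = time.time() - age
    os.utime(run, (stamp, stamp))  # make modification times unambiguous

run_dir, model_file = find_latest_run(root)
print(os.path.basename(run_dir), model_file)
```

In the notebook, the returned values could be assigned to `checkpoint_path` and `model_path`, with `xTTS/checkpoints/` as the root.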

In [28]:
import os
checkpoint_path = "xTTS/checkpoints/GPT_XTTS_FT-November-16-2024_08+24PM-38f71bd"
model_path      = "best_model_72250.pth"

In [29]:
import torch
import torchaudio
from tqdm import tqdm

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts


device = "cuda:0" if torch.cuda.is_available() else "cpu"


xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config     = os.path.join(checkpoint_path,"config.json")
xtts_vocab      = "xTTS/checkpoints/XTTS_v2.0_original_model_files/vocab.json"


# Load model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(config, 
                           checkpoint_path   = xtts_checkpoint, 
                           vocab_path        = xtts_vocab, 
                           speaker_file_path = 'speakers_xtts.pth',
                           use_deepspeed     = False)
XTTS_MODEL.to(device)

print("Model loaded successfully!")

Model loaded successfully!


In [24]:
!ls ../data

anta_sample.wav  metadata.csv  metadata_eval.csv  metadata_train.csv  wavs


In [30]:
reference = "../data/anta_sample.wav"

gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(audio_path=[reference],
    gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,
    max_ref_length=XTTS_MODEL.config.max_ref_len,
    sound_norm_refs=XTTS_MODEL.config.sound_norm_refs)

text = "màngi lay jaajëfël bu baax ci ligéey bu am solo bi nga fi def."

In [32]:
#from scipy.io import wavfile
import numpy as np

result = XTTS_MODEL.inference(
    text = text,
    gpt_cond_latent    = gpt_cond_latent,
    speaker_embedding  = speaker_embedding,
    do_sample          = True,
    temperature        = 0.7,
    num_beams          = 2,
    speed              = 1.05,
    repetition_penalty = 350.9,
    language           = "en",
    enable_text_splitting=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [33]:
audio_signal = result['wav']
sample_rate  = 24000
#audio_signal = audio_signal / np.max(np.abs(audio_signal))

from IPython.display import Audio, display

display(Audio(audio_signal, rate=sample_rate))
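
To keep the result instead of only playing it in the notebook, the waveform can be written to disk. The sketch below uses only the standard library and assumes the samples are floats in [-1, 1]. The helper `save_wav` and the file name are hypothetical, and a synthetic tone stands in for `result['wav']` so that the block runs without the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples (assumed in [-1, 1]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for result['wav']: half a second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(12000)]
out_path = os.path.join(tempfile.mkdtemp(), "output.wav")
save_wav(out_path, tone)

# Read the file back to confirm its length and sample rate.
with wave.open(out_path, "rb") as wf:
    n_frames, rate = wf.getnframes(), wf.getframerate()
print(n_frames, rate)
```

In the notebook, the equivalent call would be `save_wav("output.wav", audio_signal, sample_rate)`.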