# FINE-TUNING XTTS-v2 #

This notebook contains code for training XTTS-v2 on Kaggle. However, the main point is to introduce you to the various things you should consider when fine-tuning XTTS-v2. The tips/advice/guides in the official Coqui [docs](https://docs.coqui.ai/en/latest/) are great, but it's not always clear which docs content pertains to which model(s).

This notebook has been tested with its original environment, Persistence set to 'Files only' and accelerator set to GPU P100.

## Creating Your Dataset:


### Dataset of this Notebook:
The dataset attached to this notebook comes from the freely available [M-AILABS Speech DataSet](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/). (It is supposed to be a UK English speaker reading the novel Jane Eyre, but to my ears sounds more like a US English speaker putting on an accent.)

### Preparing your own Data:
I am assuming you want to fine-tune on your own data. Load your audio files into [Audacity](https://www.audacityteam.org/), select them, and export as WAV (channels: mono, sample rate: 22050, encoding: Signed 16-bit PCM).
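If you would rather double-check the exported files programmatically than trust the export dialog, a minimal sketch using Python's standard `wave` module could look like the following. The function name `wav_format_problems` is my own invention for illustration. The defaults simply mirror the export settings listed above.

```python
import wave


def wav_format_problems(path, rate=22050, channels=1, sample_width_bytes=2):
    """Return a list of mismatches between a WAV file and the target format.

    An empty list means the file is mono, 22050 Hz, 16-bit PCM
    (or whatever targets were passed in).
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != channels:
            problems.append(f"channels={wf.getnchannels()} (expected {channels})")
        if wf.getframerate() != rate:
            problems.append(f"sample rate={wf.getframerate()} (expected {rate})")
        if wf.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width={wf.getsampwidth()} bytes (expected {sample_width_bytes})"
            )
    return problems
```

Calling `wav_format_problems("wavs/some_clip.wav")` on each file and printing any non-empty result is usually enough to catch a stray stereo or 44100 Hz export.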

Doing additional audio pre-processing in Audacity is risky. My experiments with normalising loudness between different source audio recordings ended up creating a voice that was lacking in dynamic range. Similarly, my attempts to remove noise from source recordings also removed too much of the actual voice/signal and ended up producing bad results. 

My advice (unless you are experienced with working with audio or you have a lot of time to play around) is just to listen to the audio you're thinking of using and have a quick look at the waveforms/spectrograms in Audacity. Simply discard any audio that is significantly worse quality than the rest. Examples of unpromising source audio include: frequent background noise (e.g., coughing, clapping, laughter), excessive clipping in the waveform view of Audacity, and poor-quality recordings with a constant whine/noise.

### Making an LJSpeech Style Dataset:
The format for LJSpeech is a dir that contains two things: a metadata.csv file and a dir called 'wavs' that contains your voice recordings. Each line of the metadata.csv file includes:

1. The name of an audio file
2. The text for that file. E.g., "Jane Eyre by Charlotte Bronte. Chapter 1."
3. The normalised text. E.g., "Jane Eyre by Charlotte Bronte. Chapter one."

**If you are fine-tuning XTTS-v2 you don't need to worry about normalising your text, because it gets done for you automatically at training time. So your second and third columns can be identical.**
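As a rough sketch of what one metadata.csv line looks like, the helper below builds a three-column row with identical second and third columns. Two things here are assumptions on my part rather than something stated above: that the columns are separated by a `|` character, and that the first column is the file name without the `.wav` extension (which is how I understand the LJSpeech layout). The name `metadata_line` is hypothetical.

```python
def metadata_line(file_id, text, normalised_text=None):
    """Build one LJSpeech-style metadata row: file_id|text|normalised_text.

    ASSUMPTION: columns are pipe-delimited and file_id has no '.wav' suffix.
    If normalised_text is omitted, the raw text is reused for the third column.
    """
    if normalised_text is None:
        normalised_text = text
    for field in (file_id, text, normalised_text):
        # A stray delimiter or newline would silently corrupt the row.
        if "|" in field or "\n" in field:
            raise ValueError(f"field contains '|' or a newline: {field!r}")
    return f"{file_id}|{text}|{normalised_text}"


print(metadata_line("jane_eyre_01_f000001", "Jane Eyre by Charlotte Bronte. Chapter 1."))
```

Writing one such line per clip into metadata.csv, next to the 'wavs' dir, gives the layout described above.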

[Here](https://github.com/zuverschenken/XTTSv2Scripts) is my github repo showing how to create an LJSpeech style dataset from 1 or more WAV files that may contain multiple speakers. (Alternatively you can try the [WhisperX](https://github.com/m-bain/whisperX) project for this task.)

After you have finished with these, your dataset will be in the LJSpeech format and ready for use with this notebook.

[Here](https://www.kaggle.com/code/maxbr0wn/inspect-tts-dataset/) is my kaggle notebook showing you how to check the quality of your dataset and sanitise it.

### Note on Model Performance:
Some degree of repetition/mushy mouth sounds seems to be inherent to the model. Even the pre-trained voices that come packaged with TTS suffer from this problem to a small extent. There are two ways I'm aware of to improve your performance (these are already covered in other parts of this/my other notebook, but I'm putting them here again since they're pretty important):

1. Improve the quality of your training data. Cull problematic items. Get more training data if your dataset is really small.
2. The model does not generalise well to unseen sequence lengths. If you only fine-tune on 10s long audio clips and then try to produce a 1s clip at inference time, it will probably struggle. Make sure you have a good distribution of training lengths. Note that when you try to generate audio from a long text string, *this program is automatically splitting that long string of text into several shorter strings*, because the model cannot generate sequences of arbitrary length. If you are suffering from garbled/repetitious outputs, then I recommend putting some print statements in the 'split_sentence' function in TTS.tts.layers.xtts.tokenizer. This will show you how your long text is being split up. If you see that your bad outputs are only occurring when the model is trying to generate audio for very short sequences or very long sequences, then you know what needs to be addressed.
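To get a feel for whether your training lengths are well distributed, one option is to bucket clip durations and count how many clips fall in each bucket. The sketch below is only an illustration: `duration_histogram` and the 2-second bucket width are my own choices, and the durations are assumed to have been computed already (for example as number of frames divided by sample rate for each WAV).

```python
from collections import Counter


def duration_histogram(durations_s, bucket_s=2.0):
    """Count clips per duration bucket.

    Returns a dict mapping (bucket_start, bucket_end) in seconds to a count,
    ordered from the shortest bucket to the longest. Empty buckets are omitted.
    """
    counts = Counter(int(d // bucket_s) for d in durations_s)
    return {
        (b * bucket_s, (b + 1) * bucket_s): counts[b]
        for b in sorted(counts)
    }


# Hypothetical durations in seconds, purely for demonstration.
for (start, end), n in duration_histogram([0.5, 1.9, 2.0, 9.5]).items():
    print(f"{start:>5.1f}s to {end:>5.1f}s: {n} clip(s)")
```

Large gaps in the histogram (for instance, no clips under 2 seconds) are a hint about which sequence lengths the fine-tuned model may struggle with.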

In [None]:
!pip install git+https://github.com/coqui-ai/TTS

In [None]:
!pip install transformers==4.37.1

In [None]:
###updated training zone####

In [None]:
from trainer import Trainer, TrainerArgs
#from trainer.logging.wandb_logger import WandbLogger
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
from TTS.utils.manage import ModelManager

import sys
import os
import wandb

### Monkey Patching for wandb (!!!) ###

XTTS-v2 uses tensorboard for logging by default. Officially wandb is supported, but it has broken things when I've used it (after a few epochs it creates a massive number of artifact files). For this reason I've monkey patched the offending method so that no artifacts are added.

If you're not using wandb, then you don't need to do this.

In [None]:
from trainer.logging.wandb_logger import WandbLogger

In [None]:
def add_artifact(self, file_or_dir, name, artifact_type, aliases=None):
    ###instead of adding artifact, do nothing###
    print(f"========Ignoring artifact: {name} {file_or_dir}========")
    return


WandbLogger.add_artifact = add_artifact

In [None]:
# Logging parameters
RUN_NAME = "kaggletest"
PROJECT_NAME = "gore" 
DASHBOARD_LOGGER = "wandb" 
LOGGER_URI = None

### Dir for Training Run ###

Set the training run to store model files in the persistent /kaggle/working dir. 

Note that disk space is limited to 20GB here, which will fill up quickly if you are saving more than a few checkpoints. In that case, if you must train on Kaggle, you can create a dir at /kaggle/temp/ and store the runs there. **Those files will not persist once the notebook session ends, so you will need to manually download them or copy them across to the /kaggle/working/ dir before the session ends.**
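If you do store runs under /kaggle/temp/, something like the following sketch can be used to check remaining space and to copy a run across before the session ends. The helper names `free_gb` and `persist_run` are hypothetical, and the example paths in the usage note are placeholders. It relies only on the standard library (`shutil.copytree` with `dirs_exist_ok=True` needs Python 3.8 or later).

```python
import shutil
from pathlib import Path


def free_gb(path):
    """Return free disk space (in GB) on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def persist_run(src_dir, dst_dir):
    """Copy everything under src_dir into dst_dir and list the files now in dst_dir.

    Existing files in dst_dir with the same name are overwritten.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```

For example, `persist_run("/kaggle/temp/run", "/kaggle/working/run")` could be run in a final cell, after checking `free_gb("/kaggle/working")` to make sure the checkpoint will fit.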

In [None]:
OUT_PATH = 'run/'
os.makedirs(OUT_PATH, exist_ok=True)

Retrieve the base model files.

In [None]:
# Define the path where XTTS v2.0.1 files will be downloaded
CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)

# DVAE files
DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"

# Set the path to the downloaded files
DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))

# download DVAE files if needed
if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
    print(" > Downloading DVAE files!")
    ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)

# Download XTTS v2.0 checkpoint if needed
TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"

# XTTS transfer learning parameters: you need to provide the paths of the XTTS model checkpoint that you want to fine-tune.
TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file

# download XTTS v2.0 files if needed
if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
    print(" > Downloading XTTS v2.0 files!")
    ModelManager._download_model_files(
        [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
    )

In [None]:
training_dir = "/culledjane-eyre-ljspeech"

### Batch Size ###

* BATCH_SIZE is the number of items loaded into VRAM/memory at once.

* GRAD_ACCUM_STEPS (spelled GRAD_ACUMM_STEPS in the code below) is the number of forward/backward passes, each over BATCH_SIZE items, that we accumulate gradients for before the optimizer (AdamW in this notebook) updates the parameters.

So a BATCH_SIZE of 2 and GRAD_ACCUM_STEPS of 32 would give an 'effective batch size' of 64 (i.e., 64 items considered per optimisation step). 

The creators of XTTS-v2 recommend an effective batch size of 252 for proper training. (I found that reducing this to 126 gave me similar performance and faster fine-tuning, but we should probably listen to them.)

Your BATCH_SIZE will depend on how big your dataset items are and how much VRAM you have available. Make it as large as possible while ensuring that everything fits in VRAM and then make BATCH_SIZE*GRAD_ACCUM_STEPS==252.
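The arithmetic above can be sketched as a small helper that picks the accumulation steps for whatever batch size fits in your VRAM. The function name `grad_accum_steps_for` is mine, and rounding up when 252 is not evenly divisible is my own choice (it makes the effective batch size slightly larger than 252 rather than smaller).

```python
import math

TARGET_EFFECTIVE_BATCH = 252  # the effective batch size recommended above


def grad_accum_steps_for(batch_size, target=TARGET_EFFECTIVE_BATCH):
    """Return the gradient accumulation steps needed to reach `target`.

    Rounds up, so batch_size * steps is always >= target.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(target / batch_size)


for bs in (1, 2, 4, 8):
    steps = grad_accum_steps_for(bs)
    print(f"BATCH_SIZE={bs} -> GRAD_ACCUM_STEPS={steps} (effective {bs * steps})")
```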

In [None]:

OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  
START_WITH_EVAL = True  
BATCH_SIZE = 1
GRAD_ACUMM_STEPS = 252
LANGUAGE = "en"

### Dataset Config ###

**NOTE: if you completed my [Inspect TTS Dataset](https://www.kaggle.com/code/maxbr0wn/inspect-tts-dataset/) notebook, you should use the maximum audio length from your dataset as max_wav_length and ensure the length of the reference audio you want to use is within the min-max range.**

See the comments below for things you will want to change.

Note that the lengths below are lengths of WAV files in samples. So if your WAV file has a sample rate of 22050, then a max_wav_length of 370000 is: 370000/22050 = ~16.78 seconds long.
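The conversion is simple enough to do by hand, but a few helper functions make it harder to get wrong when you are juggling several length parameters. This is a sketch with hypothetical function names, not part of the TTS library.

```python
def samples_to_seconds(n_samples, sample_rate):
    """Convert a length in samples to seconds."""
    return n_samples / sample_rate


def seconds_to_samples(seconds, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


def reference_in_range(n_samples, min_len, max_len):
    """Check that a reference clip's length (in samples) lies within [min_len, max_len]."""
    return min_len <= n_samples <= max_len


print(round(samples_to_seconds(370000, 22050), 2))  # the example from the text above
print(seconds_to_samples(3.0, 22050))
```

`reference_in_range` can be used to confirm that a candidate reference clip falls between min_conditioning_length and max_conditioning_length before you commit to it.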




In [None]:

model_args = GPTArgs(
    max_conditioning_length=143677,#the audio you will use for conditioning latents should be less than this 
    min_conditioning_length=66150,#and more than this
    debug_loading_failures=True,#this will print output to console and help you find problems in your ds
    max_wav_length=223997,#set this to >= the longest audio in your dataset  
    max_text_length=200, 
    mel_norm_file=MEL_NORM_FILE,
    dvae_checkpoint=DVAE_CHECKPOINT,
    xtts_checkpoint=XTTS_CHECKPOINT,  
    tokenizer_file=TOKENIZER_FILE,
    gpt_num_audio_tokens=1026, 
    gpt_start_audio_token=1024,
    gpt_stop_audio_token=1025,
    gpt_use_masking_gt_prompt_approach=True,
    gpt_use_perceiver_resampler=True,
)

### Audio Config ###

The coqui TTS docs mention inspecting your data with the CheckSpectrograms.ipynb notebook to help decide on audio parameters. I think this is irrelevant for XTTS-v2, because it doesn't use the same audio config as some of the older coqui models and doesn't have the same parameters.

The default is 22050 for input and 24000 for output. 

**The only reason my sample rate is lower is that the dataset I chose has a sample rate of 16000. Assuming you are using your own data, sample_rate and dvae_sample_rate should match your data, which should probably be 22050.**

In [None]:
audio_config = XttsAudioConfig(sample_rate=16000, dvae_sample_rate=16000, output_sample_rate=24000) 

### Speaker Reference ###

This is the audio file that will be used for creating the conditioning latent and speaker embedding. I think this is *not* used directly for training, but just for creating the audio outputs that are generated at checkpoints for you to review the training process.

Choosing the right speaker reference is **VERY** important for XTTS-v2. It can completely change how your model will sound. Even two clips taken from the same recording of the same speaker can produce markedly different outputs. Unfortunately I can't provide an algorithm for selecting this. I recommend that you manually go through your dataset and select approximately 10 clips of your speaker where they are saying a full sentence with an intonation/rhythm/speed/style that sounds pretty good. Then just experiment with all of them and find one you like. This is especially important at inference time.

Note that you can give a speaker reference that 'doesn't belong' to your model. For example, if you want to make a US English speaker model impersonate a UK English speaker, you can provide it with a UK speaker reference file. (A better way to achieve this impersonation effect might be to fine-tune very briefly on the voice you are 'impersonating'. This seems to work better than simply changing the embedding.)

In [None]:
SPEAKER_REFERENCE = "culledjane-eyre-ljspeech/wavs/jane_eyre_01_f000015.wav"

### Trainer Config ###

How long your fine-tuning needs to run for before it is 'done' depends on many factors. A common problem when fine-tuning/training generative models is encountered here: we don't have a satisfactory algorithm for evaluating which outputs are superior. After a few epochs the decreases in loss are small and it's difficult to tell by listening to the test outputs whether things are improving or not. I personally have ended training after roughly 100,000 dataset items put through the trainer (i.e., 20 epochs with a dataset of 5,000 audio files). If you are listening to the test outputs and your model is performing well after just a few epochs, then you can finish early. Listening to test outputs will give you a better sense of how your model is training rather than just looking at loss values.
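As a rough guide, my stopping point of about 100,000 items can be converted into a number of epochs for your own dataset size. The helper below is just that arithmetic. The name `epochs_for` is hypothetical and the 100,000 figure is a personal rule of thumb from above, not an official recommendation.

```python
import math


def epochs_for(target_items, dataset_size):
    """Return how many full epochs are needed to push `target_items` through the trainer."""
    if dataset_size < 1:
        raise ValueError("dataset_size must be at least 1")
    return math.ceil(target_items / dataset_size)


print(epochs_for(100_000, 5_000))  # the example from the text: 5,000 files
print(epochs_for(100_000, 1_200))  # a smaller, hypothetical dataset
```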

How quickly fine-tuning converges also depends on the target voice. For example, if your target voice is a US male, it will be faster than for a female speaker with a Russian accent.

Note: you can write your own text for the test_sentences list. Keep these the same between different runs so you can compare like with like. If you're using wandb, you can easily listen to these on their site while your model is training.

**DISCLAIMER:** Some of the parameters of this config don't make a lot of sense to me (such as the LR scheduler milestones being greater than the total amount of steps we would expect during the fine-tuning process). Anything I don't understand, I have just left the same as it was provided by the Coqui team. 
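To make the point about the scheduler milestones concrete, the sketch below compares the first milestone with the number of steps a run of roughly 100,000 items would take. I have not verified whether the scheduler counts forward steps or optimizer steps, so both are computed. The helper names are my own, and the milestone values are copied from the config in this notebook.

```python
import math

# Milestones copied from lr_scheduler_params in this notebook's config.
MILESTONES = [50000 * 18, 150000 * 18, 300000 * 18]


def steps_for(items_seen, items_per_step):
    """Number of steps needed to process `items_seen` at `items_per_step` items each."""
    return math.ceil(items_seen / items_per_step)


def milestones_reached(total_steps, milestones=MILESTONES):
    """Return the milestones that a run of `total_steps` steps would actually hit."""
    return [m for m in milestones if m <= total_steps]


forward_steps = steps_for(100_000, 1)      # BATCH_SIZE = 1
optimizer_steps = steps_for(100_000, 252)  # effective batch size of 252

print("first milestone:", MILESTONES[0])
print("forward steps:", forward_steps, "optimizer steps:", optimizer_steps)
print("milestones reached:", milestones_reached(forward_steps))
```

Under either way of counting, a run of this length stays far below the first milestone, so the learning rate would remain at its initial value for the whole fine-tune.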

I've put comments next to some of the parameters below.

In [None]:
config = GPTTrainerConfig(
    run_eval=True,
    epochs = 1000, # assuming you want to end training manually w/ keyboard interrupt
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name=PROJECT_NAME,
    run_description="""
        GPT XTTS training
        """,
    dashboard_logger=DASHBOARD_LOGGER,
    wandb_entity=None,
    logger_uri=LOGGER_URI,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=48,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=8, #consider decreasing if your jupyter env is crashing or similar
    eval_split_max_size=256, 
    print_step=50, 
    plot_step=100, 
    log_model_step=1000, 
    save_step=9999999999,  # A checkpoint is ALREADY saved every epoch. Set very high on Kaggle because the output dir is limited in size. I have previously set this to (size of training set)/2 so that I would effectively get a checkpoint every half epoch
    save_n_checkpoints=1,#if you want to store multiple checkpoint rather than just 1, increase this
    save_checkpoints=False,# Making this False on kaggle because Output dir is limited
    print_eval=False,
    optimizer="AdamW",
    optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
    optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
    lr=5e-06,  
    lr_scheduler="MultiStepLR",
    lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
    test_sentences=[ 
        {
            "text": "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "speaker_wav": SPEAKER_REFERENCE, 
            "language": LANGUAGE,
        },
        {
            "text": "This cake is great. It's so delicious and moist.",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        },
        {
            "text": "And soon, nothing more terrible, nothing more true, and specious stuff that says no rational being can fear a thing it will not feel, not seeing that this is what we fear.",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        }
        
    ],
) 

model = GPTTrainer.init_from_config(config)

### Load Dataset ###

The evaluation set is 1% of the training data by default (the cell below passes eval_split_size=0.02, i.e., 2%). This seems very low, but when you consider that you will probably want to evaluate performance by listening to tests rather than just comparing loss values, and that you might want to make the most of your potentially small dataset, it looks more reasonable.

In [None]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", language=LANGUAGE, path=training_dir
)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, eval_split_size=0.02)

### Train! ###

**Note on warnings:**

The trainer will print out warnings if it encounters items in your dataset where the text exceeds 250 chars or the length of your audio exceeds max_wav_length (It will also have problems if you have data items that are < ~0.2s long, which you won't want anyway). You should remove these from your dataset or re-think how you're creating your dataset.
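A quick pre-flight check along these lines can catch such items before you start a long run. This is a sketch under my own assumptions: items are given as `(name, text, n_samples)` tuples that you have built yourself, and the function name `find_problem_items` is hypothetical. The thresholds are the ones mentioned above.

```python
def find_problem_items(items, sample_rate, max_wav_length, max_chars=250, min_seconds=0.2):
    """Return (name, reasons) for every item likely to trigger a trainer warning.

    `items` is an iterable of (name, text, n_samples) tuples.
    `max_wav_length` is in samples, like the GPTArgs parameter of the same name.
    """
    problems = []
    for name, text, n_samples in items:
        reasons = []
        if len(text) > max_chars:
            reasons.append("text too long")
        if n_samples > max_wav_length:
            reasons.append("audio too long")
        if n_samples / sample_rate < min_seconds:
            reasons.append("audio too short")
        if reasons:
            problems.append((name, reasons))
    return problems


# Hypothetical items, purely for demonstration.
demo = [
    ("ok_clip", "A normal sentence.", 80000),
    ("long_text", "x" * 300, 80000),
    ("tiny_clip", "Hi.", 1000),
]
for name, reasons in find_problem_items(demo, sample_rate=16000, max_wav_length=223997):
    print(name, reasons)
```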




In [None]:
trainer = Trainer(
    TrainerArgs(
        restore_path=None,
        skip_train_epoch=False,
        start_with_eval=START_WITH_EVAL,
        grad_accum_steps=GRAD_ACUMM_STEPS,
    ),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

Your fine-tuned model will be stored in /kaggle/working/run.