<center><h1> XTTS-v2: Single-Speaker Fine-Tuning </h1></center>
<center> Roberto Caamano, Giuseppe Di Roberto </center>

# Table of Contents

1. [Introduction](#introduction)  
2. [Model Architecture Overview](#model-architecture-overview)  
3. [Data Preparation Workflow](#data-preparation-workflow)  
4. [Fine-Tuning Process](#fine-tuning-process)  
5. [Live Demo](#live-demo)  

<center><h1> Introduction </h1></center>

## Note: We will be using synthetic voices for this presentation. These voices come from models that we fine-tuned on our own voices.

<center><h1> XTTS Model Architecture Overview </h1></center>

# <img src="img/XTTS.png" width=600 height=500 />

In [None]:
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import torchaudio

In [None]:
# Set up file paths

vocab_path = "XTTS-files/vocab.json"  # BPE tokenizer vocabulary

checkpoint_dir = "training_outputs/xttsv2_finetune_20250504_1250-May-04-2025_12+50PM-ca1939c"
config_path = f"{checkpoint_dir}/config.json"  # assumes config.json sits in the checkpoint directory

speaker_ref = "datasets/noramlized_personal_voice/wavs/chunk_0016.wav"
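
Loading the checkpoint takes a while, so it can help to confirm that the paths above exist before going further. The snippet below is a minimal sketch of such a check; `missing_paths` is a hypothetical helper written for this notebook, not part of the TTS library.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk, in input order."""
    return [str(p) for p in paths if not Path(p).exists()]


# Hypothetical usage with the variables defined in the cell above:
# missing = missing_paths([checkpoint_dir, config_path, speaker_ref])
# if missing:
#     raise FileNotFoundError(f"Missing inputs: {missing}")
```

An empty list means every path was found; anything else names the files or directories to fix before loading the model.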


In [None]:
# Define input text
text = "Hello world, I now have a cloned voice."

# Init the XTTS config object and load it from JSON
cfg = XttsConfig()
cfg.load_json(config_path)


# Init model and load from checkpoint
model = Xtts.init_from_config(cfg)

model.load_checkpoint(
    cfg,
    checkpoint_dir=checkpoint_dir,
    vocab_path=vocab_path,
    eval=True, 
)

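# Pick a device (GPU if available), move the model there, and set eval mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()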

# Compute the GPT conditioning latents and the speaker embedding from the reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=[speaker_ref],  # Speaker reference wav(s); multiple can be used. Important for output quality
    gpt_cond_len=cfg.gpt_cond_len,  # Seconds of reference audio used for the GPT conditioning latents
    gpt_cond_chunk_len=cfg.gpt_cond_chunk_len,  # Length in seconds of each chunk passed to the PerceiverResampler
    max_ref_length=cfg.max_ref_len,  # Limits how much of the speaker reference audio is used
)


# Model's inference method
output = model.inference(
    text=text,  # Input text
    language="en",  # Set language to English
    gpt_cond_latent=gpt_cond_latent,  # Pass conditioning latents to the GPT
    speaker_embedding=speaker_embedding,  # Pass the speaker embedding to the decoder
    temperature=0.75,
    speed=1,
    length_penalty=cfg.length_penalty,
    repetition_penalty=cfg.repetition_penalty,
    top_k=cfg.top_k,
    top_p=cfg.top_p,
)

# Create wav tensor
wav_tensor = torch.tensor(output["wav"]).unsqueeze(0)  # shape: (1, samples)

# Save tensor output to audio file using torchaudio
output_path = "output.wav"  # placeholder output file name
torchaudio.save(output_path, wav_tensor, sample_rate=cfg.audio.output_sample_rate)
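
To build intuition for the `gpt_cond_len` and `gpt_cond_chunk_len` arguments passed to `get_conditioning_latents` above, the sketch below shows one way a reference clip could be cut into conditioning chunks. It assumes both values are in seconds, that the reference is first truncated to `gpt_cond_len` seconds, and that the audio is at 22,050 Hz. `chunk_bounds` is a hypothetical helper for illustration only, not the library's code; the real method may also resample, trim silence, or drop very short trailing chunks.

```python
def chunk_bounds(ref_seconds, gpt_cond_len, gpt_cond_chunk_len, sample_rate=22050):
    """Return (start, end) sample indices for each conditioning chunk.

    Assumption: the reference is truncated to `gpt_cond_len` seconds, then cut
    into consecutive windows of `gpt_cond_chunk_len` seconds. The last window
    may be shorter than the others.
    """
    step = int(gpt_cond_chunk_len * sample_rate)
    if step <= 0:
        raise ValueError("gpt_cond_chunk_len must be positive")
    total = int(min(ref_seconds, gpt_cond_len) * sample_rate)
    return [(start, min(start + step, total)) for start in range(0, total, step)]


# Example: a 10 s reference, 6 s conditioning length, 3 s chunks
for start, end in chunk_bounds(10, 6, 3):
    print(f"chunk covers samples {start} to {end}")
```

Under these assumptions, a longer `gpt_cond_len` lets more of the reference audio inform the voice, while `gpt_cond_chunk_len` controls how that audio is split up before it is summarized.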

<center><h1> Data Preparation for Fine-Tuning XTTS </h1></center>

# <img src="img/workflow.png" width=600 height=500 />