# Inference_manually_module

- https://docs.coqui.ai/en/latest/models/xtts.html
- Rename the ~1 GB .pth file to speaker_xtts.pth. This file holds the speaker embedding for the fine-tuned voice; XTTS uses it to adapt the model to a specific voice.
- Rename one of the ~5.7 GB model checkpoints to model.pth.
- There is no need to set paths to the model and speaker embeddings directly; just set the checkpoint directory. If vocab.json is in the same directory, vocab_path is not needed either.
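
As a rough sketch of the layout these notes describe, the helper below reports which expected files are missing from a checkpoint directory before any loading is attempted. The function name `check_checkpoint_dir` and the list of required file names are assumptions taken from the notes above; they are not part of the XTTS API.

```python
import os

# File names assumed from the notes above; adjust them if your layout differs.
EXPECTED_FILES = ("config.json", "model.pth", "speaker_xtts.pth", "vocab.json")


def check_checkpoint_dir(checkpoint_dir: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are missing from checkpoint_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(checkpoint_dir, name))
    ]
```

An empty list means every expected file was found; otherwise the returned names show what still needs to be renamed or copied into the directory.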

In [1]:
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import torch
import torchaudio
import os
import re



In [9]:

def genAudioManual(
    text: str,
    checkpoint_dir: str,
    vocab_path: str,
    reference_wav: str,
    output_path: str,
    split_sentences: bool = True,
    device: str = "cuda:0",
    temperature: float = 0.75,
) -> str:
    
    ### Follow docs page for inference without the TTS wrapper.
    
    
    # Load the config file in. 
    print("Loading model...")
    cfg = XttsConfig()
    cfg.load_json(os.path.join(checkpoint_dir, "config.json"))

    # Init model using the config. No TTS wrapper, do as done in the xtts_demo.py
    model = Xtts.init_from_config(cfg)

    # Load from checkpoint. This is where the base model weights and the learned speaker embeddings are loaded.
    model.load_checkpoint(
        cfg,
        checkpoint_dir=checkpoint_dir,
        vocab_path=vocab_path,
        eval=True,
        strict=False,
        use_deepspeed=False, # Need Deepspeed for this. Difficult on Windows...
    )

    # Set to eval
    model.to(device).eval()

    print("Compute speaker latents...")

    # Notes carried over from the original tortoise.py docstring:
    # Transforms one or more voice_samples into a tuple
    # (autoregressive_conditioning_latent, diffusion_conditioning_latent).
    # These are expressive learned latents that encode aspects of the provided
    # clips like voice, intonation, and acoustic properties.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[reference_wav],
        gpt_cond_len=cfg.gpt_cond_len,
        gpt_cond_chunk_len=cfg.gpt_cond_chunk_len,
        max_ref_length=cfg.max_ref_len,
    )
    
    if split_sentences:
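        # Break text into sentences. The pattern splits only on spaces that follow
        # '.', '!' or '?', so sentences separated by newlines stay in one segment.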
        sentences = re.split(r'(?<=[.!?]) +', text.strip())
    else:
        sentences = [text]

    segments = []
    # Loop through the sentences and run inference one at a time
    for sentence in sentences:
        print(f"Generating audio for: {sentence}")

        out = model.inference(
            text=sentence,
            language="en",
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=temperature,
            speed=1,
            length_penalty=cfg.length_penalty,
            repetition_penalty=cfg.repetition_penalty,
            top_k=cfg.top_k,
            top_p=cfg.top_p,
        )
        
        # Create a wav tensor, then add it to the segments list
        wav_tensor = torch.tensor(out["wav"]).unsqueeze(0)  # shape: (1, samples)
        segments.append(wav_tensor)


    # Concatenate all wav tensors along the time axis (dim=1) so the full
    # waveform can be saved as a wav file with torchaudio.
    finalAudio = torch.cat(segments, dim=1)
    
    torchaudio.save(output_path, finalAudio, sample_rate=cfg.audio.output_sample_rate)
    
    print(f"Output saved to {output_path}")
    # Return output path
    return output_path
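
The sentence split in `genAudioManual` matches only spaces after sentence-ending punctuation, which is why the captured output for this run shows segments that start with blank lines or contain two sentences. One possible alternative, shown here as a sketch rather than the code that produced the recorded output, is to split on any whitespace and drop empty pieces. The helper name `split_into_sentences` is hypothetical.

```python
import re


def split_into_sentences(text: str) -> list:
    """Split on any whitespace that follows '.', '!' or '?', dropping empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


sample = "First sentence. Second one!\n\nThird after a blank line?\nFourth."
for sentence in split_into_sentences(sample):
    print(repr(sentence))
```

Splitting on `\s+` treats newlines and repeated spaces the same way, so paragraph breaks in a long script no longer end up inside a segment. Abbreviations such as "e.g." would still be split, so a script with many of them may need a dedicated sentence tokenizer.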

In [10]:



vocab_path = "XTTS-files/vocab.json"

# models = ["xttsv2_finetune_20250418_2027-April-18-2025_08+27PM-7d4c6a1", 
#          "xttsv2_finetune_20250430_2033-April-30-2025_08+33PM-ca1939c",
#          "xttsv2_finetune_20250503_2111-May-03-2025_09+11PM-ca1939c"]

models = ["xttsv2_finetune_20250504_1250-May-04-2025_12+50PM-ca1939c"]

for i, voice in enumerate(models):
    checkpoint_dir = f"training_outputs/{voice}"


    DATASET = "noramlized_personal_voice"
    speaker_ref = f"datasets/{DATASET}/wavs/chunk_0016.wav"

    text = '''
    
For our project together, we worked on fine tuning KoKey’s XTT S model on a single speaker. XTT S is a multilingual Text to Speech model that is able to produce high quality synthetic speech. Note that this voice was trained using audio samples of me reading. So I may sound like I am reading directly from a script.
I will first be walking us through the model shown here. Beginning at the bottom of this flow chart, we can see the model’s inputs. At the bottom, we can see three inputs. A Reference spectrogram, a text input of some funny text, and a spectrogram marked as being the ground truth. 

The spectrograms here are referring to mel spectrograms. This is a specially formatted audio format that encodes for melody. If you are into music, this may already be familiar to you. Raw audio samples are often converted into mel spectrograms for text to speech models because it is a highly-compressed and information rich format. Which is useful for the neural network. 

The ground truth spectrogram here is used during training. This is an audio sample from a sample batch that will be used as the target. In cases of training the text input will be a transcript of the audio clip. A training set is composed of a set of audio files, each file between 4 to 10 seconds, labeled with transcribed text.

The reference spectrogram is used during both training and inference. This audio clip also should be between 4 to 10 seconds, but is not labeled with a transcript. This should be a high-quality audio sample of the target speaker. This reference sample will be used by the model to extract information on the target voice. For example if you are not fine-tuning the model, you can still get the model to mimic a voice using only a short reference audio sample. It is important this sample is of high quality. 

The text here during training is the labeled transcript for the target spectrogram. When using the model for inference, this text is what the model audio will attempt to output.

From this we can see the model’s general goal. During training it will try to match the speech of the target spectrogram. During inference it will try to convert the text into speech. 

Moving up to the next layer, we have the processing units. This VQ VAE is a type of discrete variational autoencoder. It will use a learned codebook to match words in the target spectrogram into discrete words. This turns a continuous task of having to match an infinite amount of possible words to a discrete problem, dramatically reducing the work the model needs to do. The BPE is taking the text, then converting them into fixed subtokens. So instead of needing unique tokens dedicated to every word and its varying tenses, it will take the base word, then modify it to match tense, or other grammatical transformations.

The Perceiver conditioner is a combination of the Conditional Encoder and perceiver resampler. In combination, these will take the reference audio clip, then encode for melody and rhythm in speech as well as speaker identity. The conditional latent codes, alongside the audio code embeddings and text token embeddings from the target, will be fed to the main GPT unit. Additionally, the speaker identity will separately be passed to the Decoder. An interesting note here, it is the speaker encoder here that makes XTT S able to work with multiple kinds of voices without needing to fine-tune the GPT layer. If training the same base model on multiple voices, you can fix the GPT layer and only save fine-tuned Speaker encoders which can be swapped in and out. 

The GPT here will take the combined inputs then try to predict the next audio code token at each step. The Dual language model heads here, one for text and one for audio, will be used to compute the loss. The text head computes logits over the text vocabulary. This is only done when training. The audio head computes logits over the audio-codebook. 

On the left side here, the Decoder will take the encoded audio output from the GPT, then reconstruct a mel spectrogram. This is then passed to a gan unit. The discriminator unit of the gan will judge the model’s output, this will be our gan loss. This is done to try to improve the realism of the output.

The speaker consistency loss compares the final output to the reference spectrogram to improve consistency between them. 

Moving down to some code, here is a quick walk through of the inputs moving along during inference. First we need to set file path to the needed units. This includes where our vocab list is, our model and configuration, and the speaker reference audio file.

Then we can define a text input, initialize and load our model, then set the mode to evaluation mode. To process the inputs, we can call the model’s get conditioning latents method. This will take as input the speaker reference, then output the conditional latent codes and speaker encoder.

We can then pass these units over to the model’s inference method. This will output a dictionary which contains a waveform array. To save this output, simply convert it to a tensor and use torchaudio. Notice many hyperparameters here are set based on the configuration file. This will be explained later during the fine-tuning section.


Now that we have a basic overview of the model’s architecture, we can look at the data preparation process. For fine-tuning each speaker, we began with an audio file in waveform format. This raw audio file should be at least an hour long, and contain mostly audio from the target speaker. Since this will be used for training, it is important that external noises, like music or other non speaker audio, not be included in this file. We observed that leaving these noises in will cause them to be reproduced during inference. For example, you may hear me taking the occasional deep breath, since I did not edit these out of my training set. 

XTT S does not take long audio files. The max and minimum lengths can be set in the fine-tuning configuration, but we found it is best to keep each sample between four and eleven seconds. There are various possible methods that can be used to chunk the raw audio file into many different samples. The method I used was to sample audio lengths, bound between four and eleven seconds, from a normal distribution. This ensures that my audio lengths are diverse, and that the typical six second length is well represented. A similar method would be to use a uniform distribution, ensuring each possible input length is shown equally. Yet another method we observed that performs well is to chunk the audio such that each sample represents one full sentence.

While chunking the audio files, it is important to give the output file names descriptive names. When forming the final dataset, we will need to associate a transcript with each chunk. For this, we used a file naming system that adds the iteration number for the chunk in the file name. This was possible because we chunked the audio files sequentially. 

Once chunking is complete, you should inspect the chunks to ensure no obvious audio errors are present. For example, it is possible if you did not chunk using the sentence method, that a chunk may contain no speech. Remove chunks with obvious errors or poor quality. Since this is a tedious task, and may not be feasible if handling hours worth of audio, later I will discuss other methods for removing poor quality chunks.

Next in the data processing workflow, we need to associate text labels for each chunk. The format of this dataset file will be a text file with three features. Feature one is the file name for the chunk. Feature two is the transcription text for the audio content. Feature three is a normalized transcription. This means that, for example, if the transcription had the number fourteen hundred, you could write it as one thousand four hundred in the normalized column. It is important to note that the file structure for this text file uses the vertical bar symbol as a separator. 

For the transcription process, we used Open AI’s Whisper model. This allowed us to transcribe hours of audio without having to manually go through each chunk. However, this process is not perfect. There are sometimes transcription errors, and this can cause errors. If you use this method, you can scan the transcription file manually, and check for any obvious error. This is an iterative process and the more work you put in here, the better the model will sound.

One optional method we used to clean up transcription and other chunk errors was outlier detection. For example, using the transcribed text and audio files, we used Z-score outlier detection to remove chunks that had unusually long or short words to seconds metrics. This squashes the variance in word to audio length duration within the dataset, hopefully giving the model more consistent samples without reducing audio length duration. Then finally, once the data file is prepared with chunked audio files and transcriptions, we are ready to move onto fine-tuning the model. 

Note that there are many other options here that can be used to improve the final output. For example, the outlier detection section can be expanded to capture more nuanced deviations. The raw audio file can have its sample rate normalized to be more consistent. 

An important observation that should be noted is the context for the speaker. For the models trained in this project, we used samples from speakers reading. People tend to have a specific tone and cadence when reading, which really affects the model output. If you want normal and casual speech, do your best to get samples from people speaking casually. 
    '''

    # Example call:
    out = genAudioManual(
        text=text,
        checkpoint_dir=checkpoint_dir,
        vocab_path=vocab_path,
        reference_wav=speaker_ref,
        output_path="output/project_voice_test.wav",
        split_sentences=True
    )

Loading model...
Compute speaker latents...
Generating audio for: For our project together, we worked on fine tuning KoKey’s XTT S model on a single speaker.
Generating audio for: XTT S is a multilingual Text to Speech model that is able to produce high quality synthetic speech.
Generating audio for: Note that this voice was trained using audio samples of me reading.
Generating audio for: So I may sound like I am reading directly from a script.
I will first be walking us through the model shown here.
Generating audio for: Beginning at the bottom of this flow chart, we can see the model’s inputs.
Generating audio for: At the bottom, we can see three inputs.
Generating audio for: A Reference spectrogram, a text input of some funny text, and a spectrogram marked as being the ground truth.
Generating audio for: 

The spectrograms here are referring to mel spectrograms.
Generating audio for: This is a specially formatted audio format that encodes for melody.
Generating audio for: If you are

Generating audio for: When forming the final dataset, we will need to associate a transcript with each chunk.
Generating audio for: For this, we used a file naming system that adds the iteration number for the chunk in the file name.
Generating audio for: This was possible because we chunked the auto files sequentially.
Generating audio for: 

Once chunking is complete, you should inspect the chunks to ensure no obvious audio errors are present.
Generating audio for: For example, it is possible if you did not chunk using the sentence method, that a chunk may contain no speech.
Generating audio for: Remove chunks with obvious errors or poor quality.
Generating audio for: Since this is a tedious task, and may not be feasible if handling hours worth of audio, later I will discuss other methods for removing poor quality chunks.

Next in the data processing workflow, we need to associate text labels for each chunk.
Generating audio for: The format of this dataset file will be a text file wi
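
The narration script in the cell above describes drawing chunk lengths, bounded between four and eleven seconds, from a normal distribution so that the typical six-second length is well represented. A minimal sketch of that idea using rejection sampling follows. The function names, the 1.5-second standard deviation, the fixed seed, and the choice to drop the leftover tail are assumptions for illustration, not the code used in the project. The file name pattern is inferred from `chunk_0016.wav` in the reference path.

```python
import random


def sample_chunk_lengths(total_seconds, mean=6.0, std=1.5, low=4.0, high=11.0, seed=0):
    """Draw chunk lengths (seconds) from a normal distribution, bounded to [low, high]."""
    rng = random.Random(seed)
    lengths = []
    remaining = total_seconds
    while remaining >= low:
        length = rng.gauss(mean, std)
        if not (low <= length <= high):
            continue  # rejection step: redraw values outside the bounds
        if length > remaining:
            break  # not enough audio left for this draw; the tail is dropped
        lengths.append(length)
        remaining -= length
    return lengths


def chunk_boundaries(lengths):
    """Turn sequential chunk lengths into (start, end) times in seconds."""
    bounds, start = [], 0.0
    for length in lengths:
        bounds.append((start, start + length))
        start += length
    return bounds


def chunk_file_name(index):
    """Sequential file name carrying the chunk's iteration number."""
    return f"chunk_{index:04d}.wav"


lengths = sample_chunk_lengths(total_seconds=3600.0)
for index, (start, end) in enumerate(chunk_boundaries(lengths)[:3]):
    print(chunk_file_name(index), round(start, 2), round(end, 2))
```

Swapping `rng.gauss` for `rng.uniform(low, high)` would give the uniform variant mentioned in the script. Cutting at these boundaries ignores sentence breaks, which is one reason the chunks still need to be inspected afterwards.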



split work:

--model arch
--Report model arch intro, BPE, dVAE, Conditioning encoder, the perceiver resampler.
Introduction section I
--VI inference
--V. Fine-tuning: Section B: Training and C. Experiments.
-- Mention in the training section that loss metrics were not used to pick the best model; checkpoints were compared by ear.
VII Novel Future Applications
Acknowledgements


Presentation:
1. Quick intro
2. Model overview
3. Data prep
4. Fine tuning
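
For the data prep part of the presentation, the dataset file and the outlier filter described in the narration script could be sketched roughly as below. It assumes each chunk is already paired with a transcript, a normalized transcript, and its duration in seconds. The function names, the row layout, and the z-score threshold are hypothetical, and whether the first column carries the `.wav` extension depends on the dataset loader being used.

```python
import statistics


def words_per_second(text: str, seconds: float) -> float:
    """Speaking rate of one chunk: transcript word count divided by duration."""
    return len(text.split()) / seconds


def filter_outliers(rows, threshold=3.0):
    """Keep rows whose words-per-second z-score is within the threshold.

    rows: list of (file_name, text, normalized_text, seconds) tuples.
    """
    rates = [words_per_second(text, seconds) for _, text, _, seconds in rows]
    mean = statistics.mean(rates)
    std = statistics.pstdev(rates)
    if std == 0:
        return list(rows)  # all chunks have the same rate; nothing to remove
    return [
        row
        for row, rate in zip(rows, rates)
        if abs((rate - mean) / std) <= threshold
    ]


def to_metadata_lines(rows):
    """Format rows as 'file_name|transcription|normalized transcription'."""
    return [f"{name}|{text}|{normalized}" for name, text, normalized, _ in rows]
```

A chunk with almost no speech, or one whose transcript is far too long for its duration, ends up with an extreme words-per-second value and is dropped, which matches the intent of reducing variance in the dataset without a manual pass over every file.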

notes: Use Jupyter notebooks. They should contain markdown blocks before each section clearly explaining things.