<a href="https://colab.research.google.com/github/Shafeeq260/Projects/blob/main/Copy_of_Tutorial_2_train_your_first_TTS_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train your first 🐸 TTS model 💫

### 👋 Hello and welcome to Coqui (🐸) TTS

The goal of this notebook is to show you a **typical workflow** for **training** and **testing** a TTS model with 🐸.

Let's train a very small model on a very small amount of data so we can iterate quickly.

In this notebook, we will:

1. Download data and format it for 🐸 TTS.
2. Configure the training and testing runs.
3. Train a new model.
4. Test the model and display its performance.

So, let's jump right in!


In [1]:
## Install Coqui TTS
! pip install -U pip
! pip install TTS



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## ✅ Data Preparation

### **First things first**: we need some data.

We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specifically, we want _transcribed speech_. The speech must be divided into audio clips, and each clip needs a transcription. More details about data requirements such as recording characteristics, background noise and vocabulary coverage can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).

If you have a single audio file, you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using the **wav** file format.
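As a rough illustration of the splitting step, the sketch below cuts a WAV file into fixed-length clips using only Python's standard `wave` module. The function name `split_wav` and the 5-second clip length are assumptions made for this example; fixed-length cuts can land in the middle of a word, so for a real dataset you would more likely split on pauses or sentence boundaries.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, clip_seconds=5.0):
    """Cut a WAV file into consecutive clips of at most `clip_seconds` each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    clip_paths = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_clip = int(clip_seconds * src.getframerate())
        index = 1
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            # Hypothetical naming scheme: audio1.wav, audio2.wav, ...
            clip_path = out_dir / f"audio{index}.wav"
            with wave.open(str(clip_path), "wb") as dst:
                # Copy channels, sample width and rate; the frame count in the
                # header is corrected automatically when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            clip_paths.append(clip_path)
            index += 1
    return clip_paths
```

Calling `split_wav("recording.wav", "MyTTSDataset/wavs")` (both paths are placeholders) would write the clips into the `wavs` folder and return their paths.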

The data format we will be adopting for this tutorial is taken from the widely used **LJSpeech** dataset, where the **wav** files are collected under a folder:

<span style="color:purple;font-size:15px">
/wavs<br />
 &emsp;| - audio1.wav<br />
 &emsp;| - audio2.wav<br />
 &emsp;| - audio3.wav<br />
  ...<br />
</span>

and a **metadata.csv** file lists each audio file name next to its transcript, delimited by `|`:

<span style="color:purple;font-size:15px">
# metadata.csv <br />
audio1|This is my sentence. <br />
audio2|This is maybe my sentence. <br />
audio3|This is certainly my sentence. <br />
audio4|Let this be your sentence. <br />
...
</span>
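The pipe-delimited layout above can be written and read back with a few lines of Python. This is a minimal sketch: the helper names `write_metadata` and `read_metadata` are made up for this example, and the reader uses `csv.QUOTE_NONE` so that double quotes inside a transcript are treated as ordinary characters.

```python
import csv
import io

entries = [
    ("audio1", "This is my sentence."),
    ("audio2", 'This is "maybe" my sentence.'),
]


def write_metadata(entries):
    """Render (file_name, transcript) pairs as pipe-delimited lines."""
    lines = []
    for name, text in entries:
        # The delimiter and line breaks cannot appear inside a field.
        if "|" in name or "|" in text or "\n" in text:
            raise ValueError(f"Field contains '|' or a newline: {name!r}")
        lines.append(f"{name}|{text}")
    return "\n".join(lines) + "\n"


def read_metadata(content):
    """Parse pipe-delimited lines back into (file_name, transcript) pairs."""
    reader = csv.reader(io.StringIO(content), delimiter="|", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


metadata_text = write_metadata(entries)
parsed = read_metadata(metadata_text)
```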

In the end, we should have the following **folder structure**:

<span style="color:purple;font-size:15px">
/MyTTSDataset <br />
&emsp;| <br />
&emsp;| -> metadata.csv<br />
&emsp;| -> /wavs<br />
&emsp;&emsp;| -> audio1.wav<br />
&emsp;&emsp;| -> audio2.wav<br />
&emsp;&emsp;| ...<br />
</span>
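Before training, it can help to confirm that every entry in `metadata.csv` has a matching clip in `wavs`. The check below is a sketch under the folder layout shown above; `find_missing_clips` is a hypothetical helper, and it assumes the file names in `metadata.csv` are stored without the `.wav` extension.

```python
import csv
from pathlib import Path


def find_missing_clips(dataset_dir):
    """Return the metadata entries that have no matching file in wavs/."""
    dataset_dir = Path(dataset_dir)
    missing = []
    with open(dataset_dir / "metadata.csv", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if not row or not row[0].strip():
                continue  # skip empty lines
            name = row[0].strip()
            if not (dataset_dir / "wavs" / f"{name}.wav").is_file():
                missing.append(name)
    return missing
```

An empty list means every transcript has its audio file; anything else is the list of names to fix.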

🐸TTS already provides tooling for the _LJSpeech_ format. If you use the same format, you can start training your models right away. <br />

After you collect and format your dataset, you need to check two things: whether you need a **_formatter_** and whether you need a **_text_cleaner_**. <br /> The **_formatter_** loads the text file (created above) as a list, and the **_text_cleaner_** performs a sequence of text normalization operations that convert the raw text into its spoken representation (e.g. converting numbers, acronyms, and symbols to their spoken form).
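To make the idea of text normalization concrete, here is a deliberately tiny cleaner. It is not one of the 🐸TTS cleaners: the function name, the abbreviation table and the digit-by-digit number handling are all assumptions chosen to keep the example short.

```python
import re

# Tiny illustrative lookup tables; real cleaners cover far more cases.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "mrs.": "misess"}
SYMBOLS = {"%": " percent", "&": " and "}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def toy_text_cleaner(text):
    """Lowercase the text and expand a few abbreviations, symbols and digits."""
    text = text.lower()
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Spell out each digit on its own; a real cleaner would read "42"
    # as "forty-two" rather than "four two".
    text = re.sub(r"[0-9]", lambda m: f" {DIGITS[int(m.group())]} ", text)
    # Collapse the extra whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()


cleaned = toy_text_cleaner("Dr. Smith paid 5% & left.")
```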

If you use a different dataset format than LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own **_formatter_** and **_text_cleaner_**.
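A custom formatter is essentially a function that reads your metadata file and returns one record per clip. The sketch below assumes a hypothetical tab-separated file (`clip_name<TAB>transcript`) and the `(root_path, meta_file, **kwargs)` signature with `text`, `audio_file`, `speaker_name` and `root_path` keys described in the 🐸TTS dataset documentation; verify both against the version you have installed.

```python
import os


def my_formatter(root_path, meta_file, **kwargs):
    """Load a hypothetical `clip_name<TAB>transcript` file as a list of samples."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip empty or malformed lines
            clip_name, text = line.split("\t", 1)
            items.append({
                "text": text,
                "audio_file": os.path.join(root_path, "wavs", clip_name + ".wav"),
                "speaker_name": "my_speaker",  # placeholder single-speaker name
                "root_path": root_path,
            })
    return items
```

The dataset documentation describes passing a function like this through the `formatter` argument of `load_tts_samples`.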

## ⏳️ Loading your dataset
Load one of the datasets supported by 🐸TTS.

We will start by defining the dataset config, setting LJSpeech as our target dataset and defining its path.


In [3]:
import os

# Google Drive mount point; change this to the folder that holds your own dataset
dataset_path = "/content/drive"

# Directory for the downloaded data, checkpoints and logs
output_path = "tts_train_dir"
os.makedirs(output_path, exist_ok=True)

In [4]:
# Download and extract LJSpeech dataset.

!wget -O $output_path/LJSpeech-1.1.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
!tar -xf $output_path/LJSpeech-1.1.tar.bz2 -C $output_path

--2025-07-25 09:23:40--  https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
Resolving data.keithito.com (data.keithito.com)... 143.244.50.83, 2400:52e0:1a01::900:1
Connecting to data.keithito.com (data.keithito.com)|143.244.50.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2748572632 (2.6G) [text/plain]
Saving to: ‘tts_train_dir/LJSpeech-1.1.tar.bz2’


2025-07-25 09:24:13 (79.8 MB/s) - ‘tts_train_dir/LJSpeech-1.1.tar.bz2’ saved [2748572632/2748572632]



In [20]:
from TTS.tts.configs.shared_configs import BaseDatasetConfig

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="tts_train_dir/LJSpeech-1.1"
)


## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀.

Which model architecture to use depends on your needs and available resources. Each model architecture has its pros and cons that define the run-time efficiency and the voice quality.
We have many recipes under `TTS/recipes/` that provide a good starting point. For this tutorial, we will be using `GlowTTS`.

We will begin by initializing the model training configuration.

In [21]:
# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=5,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    save_step=1000,
)

Next we will initialize the audio processor which is used for feature extraction and audio I/O.

In [22]:
from TTS.utils.audio import AudioProcessor
ap = AudioProcessor.init_from_config(config)
# Set the sample rate explicitly when using a custom audio dataset:
# ap.sample_rate = 22050


 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
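The `sample_rate`, `hop_length` and `win_length` values printed above determine how many spectrogram frames the model sees per second of audio. The arithmetic below is a sketch that assumes a centre-padded STFT, which produces `1 + n // hop_length` frames for `n` samples; the helper name `num_frames` is made up for this example.

```python
# Values taken from the AudioProcessor log above.
sample_rate = 22050
hop_length = 256
win_length = 1024


def num_frames(num_samples, hop=hop_length):
    """Frame count for a centre-padded STFT (an assumption of this sketch)."""
    return 1 + num_samples // hop


hop_ms = 1000 * hop_length / sample_rate   # time step between frames
win_ms = 1000 * win_length / sample_rate   # duration covered by one window
frames_one_second = num_frames(sample_rate)

print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms, "
      f"frames per second of audio: {frames_one_second}")
```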


Next we will initialize the tokenizer, which is used to convert text to sequences of token IDs. If characters are not defined in the config, default characters are passed to the config.

In [23]:
from TTS.tts.utils.text.tokenizer import TTSTokenizer
tokenizer, config = TTSTokenizer.init_from_config(config)
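Conceptually, a tokenizer maps each character (or phoneme) to an integer ID and drops anything outside its vocabulary. The class below is a toy stand-in written for this explanation, not the `TTSTokenizer` implementation; the vocabulary string is an arbitrary choice.

```python
class ToyTokenizer:
    """Character-level tokenizer that discards out-of-vocabulary characters."""

    def __init__(self, characters):
        self.char_to_id = {ch: i for i, ch in enumerate(characters)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}
        self.discarded = set()

    def encode(self, text):
        ids = []
        for ch in text:
            if ch in self.char_to_id:
                ids.append(self.char_to_id[ch])
            else:
                # Remember what was dropped so it can be reported later.
                self.discarded.add(ch)
        return ids

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


toy = ToyTokenizer("abcdefghijklmnopqrstuvwxyz ,.?")
ids = toy.encode("hello, wörld?")
round_trip = toy.decode(ids)
```

Here `ö` is not in the vocabulary, so it ends up in `toy.discarded` and is missing from `round_trip`. A character that is silently dropped like this never reaches the model, which is why it is worth checking the vocabulary against your text.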

Next we will load data samples. Each sample is a dictionary of the form ```{"text": ..., "audio_file": ..., "speaker_name": ...}```. You can define your own sample loader that returns the list of samples.

In [38]:
import os
import csv
from sklearn.model_selection import train_test_split

# Path to the LJSpeech-format dataset; keep it consistent with the path used in the other cells
dataset_path = "/content/tts_train_dir/LJSpeech-1.1"
metadata_file = os.path.join(dataset_path, "metadata.csv")
wavs_path = os.path.join(dataset_path, "wavs")

# This should match phoneme_language in your config
language = "en-us"

# Manually load and format the data
samples = []
# Define both splits up front so later cells do not fail if loading goes wrong
train_samples, eval_samples = [], []

# Check that the metadata file exists before trying to open it
if not os.path.exists(metadata_file):
    print(f"Error: metadata file not found at {metadata_file}")
else:
    with open(metadata_file, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='|')
        for row in reader:
            # Expected format: audio_file_name|transcription
            if len(row) >= 2:
                audio_unique_name = row[0].strip().replace('"', '')
                # Drop a ".wav" extension that is already part of the name
                # so that it is not duplicated when building the path
                if audio_unique_name.lower().endswith(".wav"):
                    audio_unique_name = audio_unique_name[:-4]
                text = row[1].strip()
                full_audio_path = os.path.join(wavs_path, audio_unique_name + ".wav")
                samples.append({
                    "text": text,
                    "audio_file": full_audio_path,
                    "speaker_name": "default",
                    "audio_unique_name": audio_unique_name,
                    "language": language,
                })
            else:
                print(f"Skipping malformed row: {row}")

    # Split the samples into training and evaluation sets
    if len(samples) >= 2:  # a split needs at least one sample on each side
        train_samples, eval_samples = train_test_split(
            samples,
            test_size=0.05,   # 5% of the samples, rounded up to at least one
            random_state=42,  # for reproducibility
        )

        # Print the number of samples loaded to verify
        print(f"Loaded {len(samples)} samples.")
        print(f"Train samples: {len(train_samples)}")
        print(f"Eval samples: {len(eval_samples)}")
    else:
        print("Not enough samples loaded from the metadata file to split.")

Skipping malformed row: ["audio1.wav|Hello, my name is Muhammed Shafeeq and I'm creating a voice cloning dataset."]
Skipping malformed row: ['audio4.wav|Technology has advanced rapidly in recent years, especially artificial intelligence.']
Skipping malformed row: ['audio18.wav|Data science combines statistics, programming, and domain expertise effectively.']
Loaded 17 samples.
Train samples: 16
Eval samples: 1
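Each `Skipping malformed row` message above shows a row with a single field that still contains the `|` separator. One plausible cause, assumed here rather than confirmed from the file itself, is that those lines are wrapped in double quotes, which `csv.reader` treats as one quoted field by default. The sketch below reproduces that behaviour on a made-up line and shows how `csv.QUOTE_NONE` changes it.

```python
import csv
import io

# Hypothetical file content: the whole line is wrapped in double quotes.
raw_line = '"audio1.wav|Hello, this is a test."\n'

# Default quoting: the quotes protect the separator, giving a single field.
default_row = next(csv.reader(io.StringIO(raw_line), delimiter="|"))

# QUOTE_NONE: quotes are ordinary characters, so the line splits on "|".
literal_row = next(
    csv.reader(io.StringIO(raw_line), delimiter="|", quoting=csv.QUOTE_NONE)
)

# The stray quotes then have to be stripped from the fields by hand.
cleaned_row = [field.strip().strip('"') for field in literal_row]

print(default_row)
print(cleaned_row)
```

If your own `metadata.csv` contains lines like this, reading it with `quoting=csv.QUOTE_NONE` (or removing the surrounding quotes from the file) should keep those samples from being skipped.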


Now we're ready to initialize the model.

Models take a config object and a speaker manager as input. Config defines the details of the model like the number of layers, the size of the embedding, etc. Speaker manager is used by multi-speaker models.

In [42]:
import os
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig

output_path = "tts_train_dir"
# Update this path to your dataset location in Google Drive
dataset_path = "/content/drive/MyDrive/myTTSDataset/"

# Dataset config pointing at the LJSpeech-format dataset
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path=dataset_path
)

# Same settings as the earlier config, except that the loader workers
# and mixed precision are disabled
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=0,  # load data in the main process
    num_eval_loader_workers=0,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=5,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config],
    save_step=1000,
)

# `ap` and `tokenizer` come from the cells above; re-run those cells
# after changing the config so that all three stay in sync
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)


Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training, distributed training, etc.

In [40]:
from trainer import Trainer, TrainerArgs

# Explicitly set the language key for each sample to ensure it's present
language = "en-us" # This should match phoneme_language in your config
for sample in train_samples:
    sample['language'] = language
for sample in eval_samples:
    sample['language'] = language

trainer = Trainer(
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
)

 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=tts_train_dir/run-July-25-2025_10+04AM-0000000

 > Model has 28610257 parameters


### AND... 3,2,1... START TRAINING 🚀🚀🚀

In [43]:
trainer.fit()


 > EPOCH: 0/5
 --> tts_train_dir/run-July-25-2025_10+04AM-0000000

 > TRAINING (2025-07-25 10:07:49)


wʊd ju laɪk tə d͡ʒɔɪn mi fɚ kɔfi ðɪs æftɚnun?
 [!] Character '͡' not found in the vocabulary. Discarding it.

   --> TIME: 2025-07-25 10:08:04 -- STEP: 0/1 -- GLOBAL_STEP: 0
     | > current_lr: 2.5e-07
     | > step_time: 10.6027  (10.602692127227783)
     | > loader_time: 4.2207  (4.220726490020752)

 [!] `train_step()` retuned `None` outputs. Skipping training step.

 > EVALUATION





> DataLoader initialization
| > Tokenizer:
	| > add_blank: False
	| > use_eos_bos: False
	| > use_phonemes: True
	| > phonemizer:
		| > phoneme language: en-us
		| > phoneme backend: gruut
| > Number of instances : 1
 | > Preprocessing samples
 | > Max text length: 64
 | > Min text length: 64
 | > Avg text length: 64.0
 | 
 | > Max audio length: 311079.0
 | > Min audio length: 311079.0
 | > Avg audio length: 311079.0
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.


  with autocast(enabled=False):  # avoid mixed_precision in criterion


 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.48752498626708984 (+0)
     | > avg_loss: 8893436928.0 (+0)
     | > avg_log_mle: 8893436928.0 (+0)
     | > avg_loss_dur: 4.549579620361328 (+0)

 > BEST MODEL : tts_train_dir/run-July-25-2025_10+04AM-0000000/best_model_1.pth

 > EPOCH: 1/5
 --> tts_train_dir/run-July-25-2025_10+04AM-0000000

 > TRAINING (2025-07-25 10:08:14)
 [!] `train_step()` retuned `None` outputs. Skipping training step.

 > EVALUATION

  with autocast(enabled=False):  # avoid mixed_precision in criterion

 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.3417623043060303 (-0.14576268196105957)
     | > avg_loss: 8893436928.0 (+0.0)
     | > avg_log_mle: 8893436928.0 (+0.0)
     | > avg_loss_dur: 4.549579620361328 (+0.0)

 > EPOCH: 2/5
 --> tts_train_dir/run-July-25-2025_10+04AM-0000000

 > TRAINING (2025-07-25 10:08:32)
 [!] `train_step()` retuned `None` outputs. Skipping training step.

 > EVALUATION

 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.3541584014892578 (+0.012396097183227539)
     | > avg_loss: 8893436928.0 (+0.0)
     | > avg_log_mle: 8893436928.0 (+0.0)
     | > avg_loss_dur: 4.549579620361328 (+0.0)

 > EPOCH: 3/5
 --> tts_train_dir/run-July-25-2025_10+04AM-0000000

 > TRAINING (2025-07-25 10:08:53)
 [!] `train_step()` retuned `None` outputs. Skipping training step.

 > EVALUATION

 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.5080325603485107 (+0.15387415885925293)
     | > avg_loss: 8893436928.0 (+0.0)
     | > avg_log_mle: 8893436928.0 (+0.0)
     | > avg_loss_dur: 4.549579620361328 (+0.0)

 > EPOCH: 4/5
 --> tts_train_dir/run-July-25-2025_10+04AM-0000000

 > TRAINING (2025-07-25 10:09:11)
 [!] `train_step()` retuned `None` outputs. Skipping training step.

 > EVALUATION

 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.36446475982666016 (-0.14356780052185059)
     | > avg_loss: 8893436928.0 (+0.0)
     | > avg_log_mle: 8893436928.0 (+0.0)
     | > avg_loss_dur: 4.549579620361328 (+0.0)
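The `STEP: 0/1` entry in the log follows from the dataset and batch sizes used in this run. The arithmetic below is a sketch that assumes the final partial batch is kept and no samples are filtered out; the numbers are taken from the cells and output above.

```python
import math

num_train_samples = 16  # "Train samples: 16" in the loader output
batch_size = 32         # from GlowTTSConfig
epochs = 5
save_step = 1000

# One optimizer step per batch; a partial batch still counts as a step.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * epochs
# Number of times the periodic checkpoint interval is reached.
periodic_checkpoints = total_steps // save_step

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"periodic checkpoints reached: {periodic_checkpoints}")
```

With only a handful of steps in total, and each of them reported as skipped in the log, the evaluation loss has no opportunity to change between epochs, which matches the constant `avg_loss` values above.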



#### 🚀 Run the Tensorboard. 🚀
You can monitor the progress of your model both in the notebook and in TensorBoard. TensorBoard also provides figures and sample outputs.

In [44]:
!pip install tensorboard
!tensorboard --logdir=tts_train_dir

2025-07-25 10:11:16.651672: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753438276.677353   14745 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753438276.684789   14745 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-07-25 10:11:21.151976: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, u

## ✅ Test the model

We made it! 🙌

Let's kick off the testing run, which displays performance metrics.

We're committing the cardinal sin of ML 😈 (aka - testing on our training data) so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇

You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.

When you start training your own models, make sure your testing data doesn't include your training data 😅

Let's get the latest saved checkpoint.

In [69]:
import glob, os
output_path = "tts_train_dir"

# Order checkpoints by modification time so that the last entry is the most recent one
ckpts = sorted(glob.glob(output_path + "/*/*.pth"), key=os.path.getmtime)
# Search the whole output directory, including sub-folders, for saved config files
configs = sorted(glob.glob(output_path + "/**/*.json", recursive=True), key=os.path.getmtime)

# Assign the latest checkpoint and config to variables
test_ckpt = ckpts[-1] if ckpts else ""
test_config = configs[-1] if configs else ""

print(f"Latest checkpoint: {test_ckpt}")
print(f"Latest config: {test_config}")

Latest checkpoint: tts_train_dir/run-July-25-2025_10+04AM-0000000/best_model_1.pth
Latest config: 
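Picking the last element of an alphabetically sorted list does not necessarily give the newest checkpoint once step numbers have different lengths. The sketch below illustrates the difference with made-up file names of the form `checkpoint_<step>.pth`; the helper `checkpoint_step` is hypothetical.

```python
import re


def checkpoint_step(path):
    """Extract the trailing step number from names like `checkpoint_1000.pth`."""
    match = re.search(r"_(\d+)\.pth$", path)
    return int(match.group(1)) if match else -1


# Illustrative paths, not files produced by this notebook.
paths = [
    "run/checkpoint_1000.pth",
    "run/checkpoint_200.pth",
    "run/checkpoint_30.pth",
]

alphabetical_last = sorted(paths)[-1]          # compares character by character
numeric_last = max(paths, key=checkpoint_step)  # compares the step numbers

print(alphabetical_last)
print(numeric_last)
```

Sorting by step number, or by file modification time with `os.path.getmtime`, avoids this pitfall.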


In [76]:
!tts --text "Text for TTS" \
     --model_path $test_ckpt \
     --config_path $test_config \
     --out_path out.wav

usage: tts [-h] [--list_models [LIST_MODELS]]
           [--model_info_by_idx MODEL_INFO_BY_IDX]
           [--model_info_by_name MODEL_INFO_BY_NAME] [--text TEXT]
           [--model_name MODEL_NAME] [--vocoder_name VOCODER_NAME]
           [--config_path CONFIG_PATH] [--model_path MODEL_PATH]
           [--out_path OUT_PATH] [--use_cuda USE_CUDA] [--device DEVICE]
           [--vocoder_path VOCODER_PATH]
           [--vocoder_config_path VOCODER_CONFIG_PATH]
           [--encoder_path ENCODER_PATH]
           [--encoder_config_path ENCODER_CONFIG_PATH] [--pipe_out [PIPE_OUT]]
           [--speakers_file_path SPEAKERS_FILE_PATH]
           [--language_ids_file_path LANGUAGE_IDS_FILE_PATH]
           [--speaker_idx SPEAKER_IDX] [--language_idx LANGUAGE_IDX]
           [--speaker_wav SPEAKER_WAV [SPEAKER_WAV ...]]
           [--gst_style GST_STYLE]
           [--capacitron_style_wav CAPACITRON_STYLE_WAV]
           [--capacitron_style_text CAPACITRON_STYLE_TEXT]
           [--list_speak
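The usage message above is what the command-line parser prints when it cannot make sense of the arguments. Since `test_config` was empty in the previous cell, `--config_path` was most likely left without a value. A small guard such as the one sketched below can catch that before the command is run; `build_tts_command` is a hypothetical helper that only uses the four flags from the cell above.

```python
import os


def build_tts_command(text, model_path, config_path, out_path="out.wav"):
    """Return the `tts` argument list, or raise if a required path is unusable."""
    for label, path in (("model_path", model_path), ("config_path", config_path)):
        if not path:
            raise ValueError(f"{label} is empty; was a checkpoint/config found?")
        if not os.path.isfile(path):
            raise FileNotFoundError(f"{label} does not exist: {path}")
    return [
        "tts",
        "--text", text,
        "--model_path", model_path,
        "--config_path", config_path,
        "--out_path", out_path,
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`, which also avoids shell quoting problems with spaces in the text.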

## 📣 Listen to the synthesized wave 📣

In [84]:
# ✅ Import display helpers
from IPython.display import Audio, display
import os

# 📂 Check files in current directory (optional)
print("Current files:", os.listdir())

# 🎧 Play audio if available
audio_path = "out.wav"  # Adjust if your file is inside a folder
if os.path.isfile(audio_path):
    # display() is needed here: a bare expression inside an if block is not rendered
    display(Audio(filename=audio_path))
else:
    print(f"File not found: {audio_path}")


Current files: ['.config', 'tts_train_dir', 'drive', 'sample_data']
File not found: out.wav


## 🎉 Congratulations! 🎉 You have now trained your first TTS model!
Follow up with the next tutorials to learn more advanced material.