# Train your first 🐸 TTS model 💫

### 👋 Hello and welcome to Coqui (🐸) TTS

The goal of this notebook is to show you a **typical workflow** for **training** and **testing** a TTS model with 🐸.

Let's train a very small model on a very small amount of data so we can iterate quickly.

In this notebook, we will:

1. Download data and format it for 🐸 TTS.
2. Configure the training and testing runs.
3. Train a new model.
4. Test the model and display its performance.

So, let's jump right in!


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
## Install Coqui TTS
! pip install -U pip
! pip install TTS

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 28.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting TTS
  Downloading TTS-0.22.0-cp311-cp311-manylinux1_x86_64.whl.metadata (21 kB)
Collecting anyascii>=0.3.0 (from TTS)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pysbd>=0.3.4 (from TTS)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting umap-learn>=0.5.1 (from TTS)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pandas<2.0,>=1.4 (from TTS)
  Downloading pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Colle

In [4]:
!pip install phonemizer
!sudo apt-get install espeak-ng -y  # On Google Colab / Linux


Collecting phonemizer
  Downloading phonemizer-3.3.0-py3-none-any.whl.metadata (48 kB)
Collecting segments (from phonemizer)
  Downloading segments-2.2.1-py2.py3-none-any.whl.metadata (3.3 kB)
Collecting dlinfo (from phonemizer)
  Downloading dlinfo-2.0.0-py3-none-any.whl.metadata (1.1 kB)
Collecting clldutils>=1.7.3 (from segments->phonemizer)
  Downloading clldutils-3.24.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting csvw>=1.5.6 (from segments->phonemizer)
  Downloading csvw-3.5.1-py2.py3-none-any.whl.metadata (10 kB)
Collecting colorlog (from clldutils>=1.7.3->segments->phonemizer)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting bibtexparser>=2.0.0b4 (from clldutils>=1.7.3->segments->phonemizer)
  Downloading bibtexparser-2.0.0b8-py3-none-any.whl.metadata (5.4 kB)
Collecting pylatexenc (from clldutils>=1.7.3->segments->phonemizer)
  Downloading pylatexenc-2.10.tar.gz (162 kB)
  Preparing metadata (setup.py) ... done
Collecting isodate (from

In [5]:
!pip install --upgrade pandas==2.2.2 networkx==3.2


Collecting pandas==2.2.2
  Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting networkx==3.2
  Downloading networkx-3.2-py3-none-any.whl.metadata (5.2 kB)
Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 98.7 MB/s eta 0:00:00
Downloading networkx-3.2-py3-none-any.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 80.5 MB/s eta 0:00:00
Installing collected packages: networkx, pandas
  Attempting uninstall: networkx
    Found existing installation: networkx 2.8.8
    Uninstalling networkx-2.8.8:
      Successfully uninstalled networkx-2.8.8
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERROR: pip's dependency res

In [6]:
import TTS
print(TTS.__version__)


0.22.0


!pip install TTS --target=/content/TTS_env
import sys
sys.path.append('/content/TTS_env')


## ✅ Data Preparation

### **First things first**: we need some data.

We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specifically, we want _transcribed speech_. The speech must be divided into audio clips, and each clip needs a transcription. More details about data requirements such as recording characteristics, background noise and vocabulary coverage can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).

If you have a single audio file, you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using the **wav** file format.
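
As a rough illustration of the splitting step, the sketch below cuts a wav file into fixed-length clips using only Python's standard `wave` module. The function name `split_wav`, the clip length, and the output naming scheme are assumptions made for this example. For real TTS data you would normally split at sentence or silence boundaries rather than at fixed intervals.

```python
import os
import wave


def split_wav(src_path, out_dir, clip_seconds=5.0):
    """Split src_path into consecutive clips of at most clip_seconds each.

    Hypothetical helper for illustration; returns the list of clip paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    clip_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_clip = int(params.framerate * clip_seconds)
        index = 1
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break  # end of the source file
            clip_path = os.path.join(out_dir, f"audio{index}.wav")
            with wave.open(clip_path, "wb") as dst:
                # Copy channel count, sample width and sample rate from the source.
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            clip_paths.append(clip_path)
            index += 1
    return clip_paths
```

Each clip keeps the source's sample rate and bit depth, so no resampling or re-encoding happens in this sketch.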

The data format we will be adopting for this tutorial is taken from the widely used **LJSpeech** dataset, where the **wav files** are collected under a folder:

<span style="color:purple;font-size:15px">
/wavs<br />
 &emsp;| - audio1.wav<br />
 &emsp;| - audio2.wav<br />
 &emsp;| - audio3.wav<br />
  ...<br />
</span>

and a **metadata.csv** file lists each audio file name alongside its transcript, delimited by `|`:

<span style="color:purple;font-size:15px">
# metadata.csv <br />
audio1|This is my sentence. <br />
audio2|This is maybe my sentence. <br />
audio3|This is certainly my sentence. <br />
audio4|Let this be your sentence. <br />
...
</span>
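
To make the format concrete, here is a small sketch that parses `|`-delimited metadata lines into (file name, transcript) pairs. The helper `parse_metadata` is hypothetical and is not part of 🐸TTS; it only illustrates how each line maps to a clip.

```python
def parse_metadata(lines):
    """Parse 'name|transcript' lines into (name, transcript) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '|' only, so the transcript may itself contain '|'.
        name, transcript = line.split("|", maxsplit=1)
        entries.append((name, transcript.strip()))
    return entries


example = [
    "audio1|This is my sentence.",
    "audio2|This is maybe my sentence.",
]
print(parse_metadata(example))
```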

In the end, we should have the following **folder structure**:

<span style="color:purple;font-size:15px">
/MyTTSDataset <br />
&emsp;| <br />
&emsp;| -> metadata.csv<br />
&emsp;| -> /wavs<br />
&emsp;&emsp;| -> audio1.wav<br />
&emsp;&emsp;| -> audio2.wav<br />
&emsp;&emsp;| ...<br />
</span>
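
A quick way to confirm that a dataset matches this layout is to check that every entry in `metadata.csv` has a matching file under `wavs/`. The sketch below assumes the two-column format shown earlier; `find_missing_wavs` is a hypothetical helper name.

```python
import os


def find_missing_wavs(dataset_root):
    """Return clip names listed in metadata.csv that have no wavs/<name>.wav file."""
    missing = []
    metadata_path = os.path.join(dataset_root, "metadata.csv")
    with open(metadata_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name = line.split("|", maxsplit=1)[0]
            wav_path = os.path.join(dataset_root, "wavs", name + ".wav")
            if not os.path.isfile(wav_path):
                missing.append(name)
    return missing
```

An empty list means every transcript has its audio clip; anything else is worth fixing before training starts.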

🐸TTS already provides tooling for the _LJSpeech_ format. If you use the same format, you can start training your models right away. <br />

After you collect and format your dataset, you need to check two things: whether you need a **_formatter_** and whether you need a **_text_cleaner_**. <br /> The **_formatter_** loads the text file (created above) as a list, and the **_text_cleaner_** performs a sequence of text normalization operations that convert the raw text into its spoken representation (e.g. converting numbers, acronyms, and symbols to their spoken form).
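
To illustrate the kind of normalization a text cleaner performs, here is a deliberately tiny sketch. It is not one of the cleaners shipped with 🐸TTS; the function name and the symbol table are assumptions chosen for the example.

```python
import re

# Hypothetical symbol table: written form -> spoken form.
_SPOKEN_FORMS = {"&": " and ", "%": " percent ", "@": " at "}


def simple_cleaner(text):
    """Lowercase, expand a few symbols, and collapse repeated whitespace."""
    text = text.lower()
    for symbol, spoken in _SPOKEN_FORMS.items():
        text = text.replace(symbol, spoken)
    return re.sub(r"\s+", " ", text).strip()


print(simple_cleaner("Salt  & Pepper: 50% off"))
```

A production cleaner would also handle numbers, abbreviations, and language-specific rules, which this sketch leaves out.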

If you use a dataset format other than LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own **_formatter_** and **_text_cleaner_**.
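
As a sketch of what a custom formatter can look like, the function below reads a `|`-delimited file whose first column already holds the full path to each clip. It follows the convention described in the 🐸TTS dataset-formatting documentation, where a formatter takes `root_path` and `meta_file` and returns a list of dictionaries. The function name, the placeholder speaker name, and the exact set of keys are assumptions that you should check against the TTS version you have installed.

```python
import os


def full_path_formatter(root_path, meta_file, **kwargs):
    """Hypothetical formatter for lines of the form '<path to wav>|<transcript>|...'."""
    items = []
    # If meta_file is an absolute path, os.path.join returns it unchanged.
    meta_path = os.path.join(root_path, meta_file)
    with open(meta_path, "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            if len(cols) < 2:
                continue  # skip blank or malformed lines
            items.append(
                {
                    "text": cols[1],
                    "audio_file": cols[0],  # used as-is, no '.wav' suffix appended
                    "speaker_name": "my_speaker",  # placeholder for a single-speaker dataset
                    "root_path": root_path,
                }
            )
    return items
```

A function like this can typically be passed to `load_tts_samples` through its `formatter` argument instead of naming a built-in formatter in the dataset config.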

## ⏳️ Loading your dataset
Load one of the datasets supported by 🐸TTS.

We will start by defining the dataset config, setting LJSpeech as our target dataset format, and defining its path.


In [7]:
import os

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig

output_path = "tts_train_dir"
if not os.path.exists(output_path):
    os.makedirs(output_path)


In [8]:
# Dataset location: the wav folder and the transcription file
dataset_path="/content/drive/MyDrive/dataset_ro/wavs"
transcriptions="/content/drive/MyDrive/dataset_ro/list.txt"


In [9]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train=transcriptions, path=dataset_path
)

## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀.

Which model architecture to use depends on your needs and available resources. Each model architecture has its pros and cons, which determine the run-time efficiency and the voice quality.
We have many recipes under `TTS/recipes/` that provide a good starting point. For this tutorial, we will be using `GlowTTS`.

We will begin by initializing the model training configuration.

In [8]:
!apt-get update
!apt-get install espeak-ng -y

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 391 kB in 1s (

In [9]:
!which espeak-ng
!espeak-ng --version

/usr/bin/espeak-ng
eSpeak NG text-to-speech: 1.50  Data at: /usr/lib/x86_64-linux-gnu/espeak-ng-data


In [10]:
# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

config = GlowTTSConfig(
    batch_size=16,
    eval_batch_size=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="ro",  # language code for Romanian, passed to the phonemizer
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    save_step=1000,
    phonemizer="espeak",
)
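
The `batch_size`, `epochs`, and `save_step` values above determine how many optimizer steps a run contains. The arithmetic below is a rough sketch, assuming 2758 training clips (the instance count that this notebook's training log reports) and that the final partial batch of each epoch is kept.

```python
import math

num_train_samples = 2758  # training clips reported in this notebook's log
batch_size = 16
epochs = 100
save_step = 1000

# Assumes the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * epochs
num_periodic_checkpoints = total_steps // save_step

print(steps_per_epoch, total_steps, num_periodic_checkpoints)
```

Under these assumptions an epoch has 173 steps, which agrees with the `STEP: 0/173` line in the training log, and a full 100-epoch run would pass the `save_step` interval 17 times.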

Next we will initialize the audio processor which is used for feature extraction and audio I/O.

In [11]:
from TTS.utils.audio import AudioProcessor
ap = AudioProcessor.init_from_config(config)
# Modify the sample rate for a custom audio dataset:
# ap.sample_rate = 22050


 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
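
The `sample_rate`, `hop_length`, and `win_length` values printed above determine how audio length translates into spectrogram frames. The arithmetic below is a sketch: the 2-second clip is hypothetical, and the frame count ignores padding details, which depend on the STFT implementation.

```python
sample_rate = 22050  # samples per second, from the log above
hop_length = 256     # samples between consecutive frames, from the log above
win_length = 1024    # samples per analysis window, from the log above
num_mels = 80        # mel bins per frame, from the log above

frame_shift_ms = 1000 * hop_length / sample_rate
window_ms = 1000 * win_length / sample_rate

num_samples = 2 * sample_rate              # a hypothetical 2-second clip
approx_frames = num_samples // hop_length  # padding may add a frame or so

print(f"frame shift: {frame_shift_ms:.2f} ms, window: {window_ms:.2f} ms")
print(f"{num_samples} samples -> about {approx_frames} frames of {num_mels} mel bins")
```

So each spectrogram frame advances by roughly 11.6 ms of audio, and a 2-second clip becomes a sequence of about 172 frames.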


Next we will initialize the tokenizer, which is used to convert text to sequences of token IDs. If characters are not defined in the config, default characters are passed to the config.

In [12]:
from TTS.tts.utils.text.tokenizer import TTSTokenizer
tokenizer, config = TTSTokenizer.init_from_config(config)
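
Conceptually, a tokenizer maps each symbol in the cleaned or phonemized text to an integer ID drawn from a fixed vocabulary. The toy class below illustrates only that idea; it is not how `TTSTokenizer` is implemented, and the vocabulary and class name are made up for the example.

```python
class ToyTokenizer:
    """Map characters to integer IDs using a fixed vocabulary (illustration only)."""

    def __init__(self, characters, pad="<PAD>"):
        # ID 0 is reserved for padding; the characters follow in order.
        self.symbols = [pad] + list(characters)
        self.symbol_to_id = {s: i for i, s in enumerate(self.symbols)}

    def encode(self, text):
        # Characters outside the vocabulary are dropped in this sketch.
        return [self.symbol_to_id[ch] for ch in text if ch in self.symbol_to_id]

    def decode(self, ids):
        return "".join(self.symbols[i] for i in ids)


toy = ToyTokenizer("abcdefghijklmnopqrstuvwxyz ")
ids = toy.encode("hello tts")
print(ids)
print(toy.decode(ids))
```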

Next we will load data samples. Each sample is a dictionary with keys such as `text`, `audio_file`, and `speaker_name`. You can also define a custom sample loader that returns the list of samples.

In [13]:
from TTS.tts.datasets import load_tts_samples
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

 | > Found 2785 files in /content/drive/MyDrive/dataset_ro/wavs
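
The eval split holds out a small fraction of the samples for validation. The sketch below shows the general idea with a seeded shuffle. It is not the library's implementation, and the 1% fraction is an assumption chosen because it is consistent with the counts in this notebook (2785 files found here, 2758 training instances in the training log).

```python
import random


def split_samples(samples, eval_fraction=0.01, max_eval=None, seed=0):
    """Shuffle a copy of samples and hold out a fraction for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    if max_eval is not None:
        n_eval = min(n_eval, max_eval)  # cap the size of the eval set
    return shuffled[n_eval:], shuffled[:n_eval]


train, held_out = split_samples(range(2785))
print(len(train), len(held_out))
```

With these inputs the split is 2758 training items and 27 held-out items.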


In [14]:
# Function that fixes a doubled .wav extension
def fix_audio_paths(samples):
    fixed_samples = []
    for sample in samples:
        corrected_path = sample["audio_file"].replace(".wav.wav", ".wav")
        sample["audio_file"] = corrected_path
        fixed_samples.append(sample)
    return fixed_samples

# Apply the function once `train_samples` has been generated
train_samples = fix_audio_paths(train_samples)
eval_samples = fix_audio_paths(eval_samples)

# Check the first 5 paths after the fix
for sample in train_samples[:5]:
    print("✅ Corrected path:", sample["audio_file"])

✅ Corrected path: /content/drive/MyDrive/dataset_ro/wavs/2375.wav
✅ Corrected path: /content/drive/MyDrive/dataset_ro/wavs/334.wav
✅ Corrected path: /content/drive/MyDrive/dataset_ro/wavs/2330.wav
✅ Corrected path: /content/drive/MyDrive/dataset_ro/wavs/2708.wav
✅ Corrected path: /content/drive/MyDrive/dataset_ro/wavs/1520.wav


Now we're ready to initialize the model.

Models take a config object and a speaker manager as input. Config defines the details of the model like the number of layers, the size of the embedding, etc. Speaker manager is used by multi-speaker models.

In [15]:
from TTS.tts.models.glow_tts import GlowTTS
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training, distributed training, etc.

In [16]:
from trainer import Trainer, TrainerArgs
trainer = Trainer(
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
)

 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: True
 | > Precision: fp16
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=tts_train_dir/run-February-18-2025_02+58PM-0000000
  self.scaler = torch.cuda.amp.GradScaler()

 > Model has 28610257 parameters


### AND... 3,2,1... START TRAINING 🚀🚀🚀

In [17]:

# Path to the dataset metadata file
met = "/content/drive/MyDrive/dataset_ro/list.txt"

# Check that the .wav files exist
missing_files = []

with open(met, "r", encoding="utf-8") as f:
    lines = f.readlines()

for line in lines:
    parts = line.strip().split("|")
    if len(parts) >= 1:
        audio_file = parts[0]
        if not os.path.exists(audio_file):
            missing_files.append(audio_file)

# Report the missing files
if missing_files:
    print("⚠️ Missing files:", missing_files[:10])  # Show the first 10
else:
    print("✅ All .wav files exist!")


✅ All .wav files exist!


In [70]:
!ls /content/drive/MyDrive/dataset_ro/wavs/


 1000.wav   1315.wav   162.wav	  1944.wav   2258.wav	     2568.wav   374.wav   689.wav
 1001.wav   1316.wav   1630.wav   1945.wav   2259.wav	     2569.wav   375.wav   68.wav
 1002.wav   1317.wav   1631.wav   1946.wav   225.wav	     256.wav    376.wav   690.wav
 1003.wav   1318.wav   1632.wav   1947.wav   2260.wav	     2570.wav   377.wav   691.wav
 1004.wav   1319.wav   1633.wav   1948.wav   2261.wav	     2571.wav   378.wav   692.wav
 1005.wav   131.wav    1634.wav   1949.wav   2262.wav	     2572.wav   379.wav   693.wav
 1006.wav   1320.wav   1635.wav   194.wav    2263.wav	     2573.wav   37.wav	  694.wav
 1007.wav   1321.wav   1636.wav   1950.wav   2264.wav	     2574.wav   380.wav   695.wav
 1008.wav   1322.wav   1637.wav   1951.wav   2265.wav	     2575.wav   381.wav   696.wav
 1009.wav   1323.wav   1638.wav   1952.wav   2266.wav	     2576.wav   382.wav   697.wav
 100.wav    1324.wav   1639.wav   1953.wav   2267.wav	     2577.wav   383.wav   698.wav
 1010.wav   1325.wav   163.wav	  195

In [124]:
with open("/content/drive/MyDrive/dataset_ro/list.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

for i, line in enumerate(lines[:10]):  # Show the first 10 lines as a check
    print(f"Line {i+1}: {repr(line)}")

Line 1: '/content/drive/MyDrive/dataset_ro/wavs/1.wav|vă mulțumesc din nou pentru sprijin|va multumesc din nou pentru sprijin\n'
Line 2: '/content/drive/MyDrive/dataset_ro/wavs/2.wav|parlamentul a făcuto la acest nivel|parlamentul a facuto la acest nivel\n'
Line 3: '/content/drive/MyDrive/dataset_ro/wavs/3.wav|cred că este nevoie de mai mult dialog|cred ca este nevoie de mai mult dialog\n'
Line 4: '/content/drive/MyDrive/dataset_ro/wavs/4.wav|ion tirinescu deține funcția de șef al poliției rutiere hunedoara|ion tirinescu detine functia de sef al politiei rutiere hunedoara\n'
Line 5: '/content/drive/MyDrive/dataset_ro/wavs/5.wav|este important să evităm alte cazuri ca acesta|este important sa evitam alte cazuri ca acesta\n'
Line 6: '/content/drive/MyDrive/dataset_ro/wavs/6.wav|suntem cel mai mare investitor din această regiune|suntem cel mai mare investitor din aceasta regiune\n'
Line 7: '/content/drive/MyDrive/dataset_ro/wavs/7.wav|prin urmare este o situație dificilă|prin urmar

In [None]:
trainer.fit()


 > EPOCH: 0/100
 --> tts_train_dir/run-February-18-2025_02+58PM-0000000


[*] Pre-computing phonemes...


100%|██████████| 2758/2758 [00:37<00:00, 73.07it/s]




> DataLoader initialization
| > Tokenizer:
	| > add_blank: False
	| > use_eos_bos: False
	| > use_phonemes: True
	| > phonemizer:
		| > phoneme language: ro
		| > phoneme backend: espeak
| > Number of instances : 2758



 > TRAINING (2025-02-18 15:01:34)


 | > Preprocessing samples
 | > Max text length: 97
 | > Min text length: 18
 | > Avg text length: 44.04894851341552
 | 
 | > Max audio length: 171483.0
 | > Min audio length: 44475.0
 | > Avg audio length: 86607.86656997824
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.



   --> TIME: 2025-02-18 15:01:48 -- STEP: 0/173 -- GLOBAL_STEP: 0
     | > current_lr: 2.5e-07 
     | > step_time: 2.9156  (2.915602922439575)
     | > loader_time: 11.1167  (11.116697788238525)

 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
 [!] `train_step()` retuned `None` outputs. Skipping training step.
  with autocast(enabled=False):  # avoid mixed_precision in criterion

   --> TIME: 2025-02-18 15:02:32 -- STE

#### 🚀 Run the Tensorboard. 🚀
You can monitor the progress of your model both in the notebook and in TensorBoard. TensorBoard also provides figures and sample outputs.

In [None]:
!pip install tensorboard
!tensorboard --logdir=tts_train_dir

## ✅ Test the model

We made it! 🙌

Let's kick off the testing run, which displays performance metrics.

We're committing the cardinal sin of ML 😈 (aka - testing on our training data) so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇

You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.

When you start training your own models, make sure your testing data doesn't include your training data 😅

Let's get the latest saved checkpoint.

In [None]:
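import glob, os
output_path = "tts_train_dir"
ckpts = sorted(glob.glob(output_path + "/*/*.pth"))
# Pick the most recently written checkpoint (string sorting does not order by step).
test_ckpt = max(ckpts, key=os.path.getmtime)
# Use the config saved in the same run folder as that checkpoint.
configs = sorted(glob.glob(os.path.join(os.path.dirname(test_ckpt), "*.json")))
test_config = configs[0]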
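
Sorting checkpoint paths as plain strings does not order them by training step: `checkpoint_10000.pth` sorts before `checkpoint_2000.pth`. The sketch below selects by step number instead. The `checkpoint_<step>.pth` naming pattern and the example paths are assumptions, so adjust the pattern to whatever your run folder actually contains.

```python
import os
import re


def latest_by_step(paths):
    """Return the path whose file name carries the largest step number, or None."""
    best_path, best_step = None, -1
    for path in paths:
        match = re.search(r"checkpoint_(\d+)\.pth$", os.path.basename(path))
        if match and int(match.group(1)) > best_step:
            best_path, best_step = path, int(match.group(1))
    return best_path


# Hypothetical file names, for illustration only.
example_ckpts = [
    "tts_train_dir/run-a/checkpoint_1000.pth",
    "tts_train_dir/run-a/checkpoint_10000.pth",
    "tts_train_dir/run-a/checkpoint_2000.pth",
]
print(sorted(example_ckpts)[-1])      # text order picks checkpoint_2000.pth
print(latest_by_step(example_ckpts))  # numeric order picks checkpoint_10000.pth
```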

In [None]:
!tts --text "Text for TTS" \
     --model_path $test_ckpt \
     --config_path $test_config \
     --out_path out.wav

## 📣 Listen to the synthesized wave 📣

In [None]:
import IPython
IPython.display.Audio("out.wav")

## 🎉 Congratulations! 🎉 You have now trained your first TTS model!
Follow up with the next tutorials to learn more advanced material.