The code for this training pipeline is derived from the ESPnetEZ fine-tuning notebook in the ESPnet documentation: https://espnet.github.io/espnet/notebook/ESPnetEZ/TTS/TTS_finetune_vctk_dump.html

# Installing espnet, espnet model zoo and camel tools

Installing espnet and espnet model zoo

In [None]:
!pip install espnet espnet_model_zoo

Collecting espnet
  Downloading espnet-202412-py3-none-any.whl.metadata (70 kB)
Collecting espnet_model_zoo
  Downloading espnet_model_zoo-0.1.7-py3-none-any.whl.metadata (10 kB)
Collecting setuptools<74.0.0,>=38.5.1 (from espnet)
  Downloading setuptools-73.0.1-py3-none-any.whl.metadata (6.6 kB)
Collecting configargparse>=1.2.1 (from espnet)
  Downloading ConfigArgParse-1.7-py3-none-any.whl.metadata (23 kB)
Collecting humanfriendly (from espnet)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Collecting librosa==0.9.2 (from espnet)
  Downloading librosa-0.9.2-py3-none-any.whl.metadata (8.2 kB)
Collecting jamo==0.4.1 (from espnet)
  Downloading jamo-0.4.1-py3-none-any.whl.metadata (2.3 kB)
Collecting kaldiio>=2.18.0 (from espn

Installing Camel Tools for converting the Buckwalter transcription to standard Arabic script

In [None]:
!pip install camel-tools --no-build-isolation

Collecting camel-tools
  Downloading camel_tools-1.5.6-py3-none-any.whl.metadata (10 kB)
Collecting docopt (from camel-tools)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting dill (from camel-tools)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting transformers<4.44.0,>=4.0 (from camel-tools)
  Downloading transformers-4.43.4-py3-none-any.whl.metadata (43 kB)
Collecting emoji (from camel-tools)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting pyrsistent (from camel-tools)
  Downloading pyrsistent-0.20.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting muddler (from camel-tools)
  Downloading muddler-0.1.3-py3-none-any.whl.metadata (7.5 kB)
Collecting camel-kenlm>=2025.4.8 (from camel-tools)
  Downloading camel-kenlm-2025.4.8.zip (5

# Downloading and Pre-Processing Dataset

Downloading and Extracting Arabic Speech Corpus Dataset

Link to website: https://en.arabicspeechcorpus.com/

Downloading the dataset from this URL several times in quick succession can sometimes fail with an error. Retrying after some time resolves it.

In [None]:
import os
import zipfile
import urllib.request

# Define the URL and output path
url = "https://en.arabicspeechcorpus.com/arabic-speech-corpus.zip"
zip_path = "/content/arabic-speech-corpus.zip"
extract_path = "/content"

# Download the dataset
print("Downloading dataset...")
urllib.request.urlretrieve(url, zip_path)
print("Download complete.")

# Unzip the dataset
print("Extracting files...")
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extraction complete.")

# List extracted files
os.listdir(extract_path)


Downloading dataset...
Download complete.
Extracting files...
Extraction complete.


['.config', 'arabic-speech-corpus.zip', 'arabic-speech-corpus', 'sample_data']
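The intermittent download errors mentioned above could be handled with a small retry wrapper. The sketch below is one possible approach, not part of the original pipeline: the function name `download_with_retry`, the number of attempts, and the wait time are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def download_with_retry(url, dest, attempts=3, wait_seconds=30.0,
                        fetch=urllib.request.urlretrieve):
    """Call fetch(url, dest), retrying on URL/OS errors up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, dest)
            return attempt  # number of tries that were needed
        except (urllib.error.URLError, OSError) as err:
            if attempt == attempts:
                raise  # give up and surface the last error
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
```

In the download cell, `urllib.request.urlretrieve(url, zip_path)` could then be swapped for `download_with_retry(url, zip_path)`. The `fetch` parameter exists only so the helper can be tried out without network access.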

Cloning Speaker Embeddings (X-vector) generated from the Dataset.

Link: https://github.com/Addalin-CP3445/speaker_embedding/tree/main

In [None]:
!git clone https://github.com/Addalin-CP3445/speaker_embedding.git
%cp speaker_embedding/extract_spk_embedding.py extract_spk_embedding.py
%cp speaker_embedding/train_speaker_embeddings -r train_speaker_embeddings
%cp speaker_embedding/test_speaker_embeddings -r test_speaker_embeddings

Cloning into 'speaker_embedding'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (9/9), 3.46 MiB | 7.37 MiB/s, done.


Renaming Audio files for Training and Test sets

In [None]:
import os

# Set the directory where your wav files are stored
wav_directory = '/content/arabic-speech-corpus/test set/wav'  # <-- update this path

# Define the prefix to remove
prefix = "ARA NORM  "

# Iterate through all files in the directory
for filename in os.listdir(wav_directory):
    if filename.startswith(prefix):
        # Remove the prefix
        new_filename = filename[len(prefix):]
        old_filepath = os.path.join(wav_directory, filename)
        new_filepath = os.path.join(wav_directory, new_filename)
        print(f"Renaming: {old_filepath} -> {new_filepath}")
        os.rename(old_filepath, new_filepath)

print("Renaming testing audio files complete!")


Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0031.wav -> /content/arabic-speech-corpus/test set/wav/0031.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0075.wav -> /content/arabic-speech-corpus/test set/wav/0075.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0008.wav -> /content/arabic-speech-corpus/test set/wav/0008.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0034.wav -> /content/arabic-speech-corpus/test set/wav/0034.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0001.wav -> /content/arabic-speech-corpus/test set/wav/0001.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0094.wav -> /content/arabic-speech-corpus/test set/wav/0094.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0080.wav -> /content/arabic-speech-corpus/test set/wav/0080.wav
Renaming: /content/arabic-speech-corpus/test set/wav/ARA NORM  0011.wav -> /content/arabic-speech-corpus/test s

In [None]:
import os

# Set the directory where your wav files are stored
wav_directory = '/content/arabic-speech-corpus/wav'  # <-- update this path

# Define the prefix to remove
prefix = "ARA NORM  "

# Iterate through all files in the directory
for filename in os.listdir(wav_directory):
    if filename.startswith(prefix):
        # Remove the prefix
        new_filename = filename[len(prefix):]
        old_filepath = os.path.join(wav_directory, filename)
        new_filepath = os.path.join(wav_directory, new_filename)
        print(f"Renaming: {old_filepath} -> {new_filepath}")
        os.rename(old_filepath, new_filepath)

print("Renaming training audio files complete!")


Renaming: /content/arabic-speech-corpus/wav/ARA NORM  1086.wav -> /content/arabic-speech-corpus/wav/1086.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  0503.wav -> /content/arabic-speech-corpus/wav/0503.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  0244.wav -> /content/arabic-speech-corpus/wav/0244.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  0883.wav -> /content/arabic-speech-corpus/wav/0883.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  1094.wav -> /content/arabic-speech-corpus/wav/1094.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  1397.wav -> /content/arabic-speech-corpus/wav/1397.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  0031.wav -> /content/arabic-speech-corpus/wav/0031.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  1682.wav -> /content/arabic-speech-corpus/wav/1682.wav
Renaming: /content/arabic-speech-corpus/wav/ARA NORM  0315.wav -> /content/arabic-speech-corpus/wav/0315.wav
Renaming: /content/
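The two renaming cells above differ only in the directory they process, so the logic could be folded into one helper. This is a minimal sketch assuming the same "ARA NORM  " prefix; the function name `strip_prefix` and the choice to skip files whose target name already exists are assumptions, not part of the original notebook.

```python
from pathlib import Path


def strip_prefix(wav_dir, prefix="ARA NORM  "):
    """Rename every file in wav_dir that starts with prefix; return the new names."""
    renamed = []
    for path in sorted(Path(wav_dir).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            target = path.with_name(path.name[len(prefix):])
            if target.exists():
                # Skip rather than overwrite an existing file.
                continue
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

It could then be called once per split, for example `strip_prefix('/content/arabic-speech-corpus/wav')` and `strip_prefix('/content/arabic-speech-corpus/test set/wav')`.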

Creating Kaldi-style files: wav.scp, text, and utt2spk. The Buckwalter-to-Arabic conversion is set up here but commented out, so the phonetic transcription is written to the text file unchanged.

In [None]:
import os
from camel_tools.utils.transliterate import Transliterator
from camel_tools.utils.charmap import CharMapper

# Initialize the Buckwalter transliterator
bw2ar = CharMapper.builtin_mapper("bw2ar")
bt = Transliterator(bw2ar)

# Set your dataset paths (update these paths as needed)
dataset_path = '/content/arabic-speech-corpus'  # Replace with your dataset root directory
wav_dir = os.path.join(dataset_path, 'wav')
transcript_file = os.path.join(dataset_path, 'phonetic-transcipt.txt')

# Output directory for Kaldi-style files (e.g., for training)
kaldi_data_dir = '/content/arabic-speech-corpus/kaldi_data/train'  # Update this as needed
os.makedirs(kaldi_data_dir, exist_ok=True)

# Open output files for Kaldi-style directory
wav_scp = open(os.path.join(kaldi_data_dir, 'wav.scp'), 'w', encoding='utf-8')
text_f = open(os.path.join(kaldi_data_dir, 'text'), 'w', encoding='utf-8')
utt2spk = open(os.path.join(kaldi_data_dir, 'utt2spk'), 'w', encoding='utf-8')

# Set a default speaker ID (adjust if you have multiple speakers)
default_spk = "arabic"

with open(transcript_file, 'r', encoding='utf-8') as f:
    for line in f:
        # Remove extra quotes and split the line into fields
        # Expected format: "ARA NORM  0002.wav" "buckwalter transcription"
        parts = line.strip().split('" "')
        if len(parts) < 2:
            continue
        # Clean up the fields (remove any remaining quotes)
        utt_field = parts[0].replace('"', '').strip()
        buckwalter_transcription = parts[1].replace('"', '').strip()

        # Extract the filename from utt_field. Example: "ARA NORM  0002.wav"
        utt_filename = utt_field.split()[-1]
        # Remove the file extension to create an utterance ID (e.g., "0002")
        utt_id = os.path.splitext(utt_filename)[0]

        # The Buckwalter-to-Arabic conversion below is commented out so that
        # the phonetic transcription is used unchanged.
        # arabic_transcription = bt.transliterate(buckwalter_transcription)

        # Write to wav.scp (assumes the wav files are in the wav/ directory)
        wav_path = os.path.join(wav_dir, utt_filename)
        wav_scp.write(f"{utt_id} {wav_path}\n")

        # Write the (unconverted) transcription to the text file
        text_f.write(f"{utt_id} {buckwalter_transcription}\n")

        # Write to utt2spk (assign default speaker)
        utt2spk.write(f"{utt_id} {default_spk}\n")

# Close the files
wav_scp.close()
text_f.close()
utt2spk.close()

print("Kaldi-style training data files have been created in:", kaldi_data_dir)


Kaldi-style training data files have been created in: /content/arabic-speech-corpus/kaldi_data/train
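To make the line parsing in the cell above easier to check in isolation, the same steps can be pulled into a function and run on one line. This is a sketch of the parsing only: the function name `parse_transcript_line` is an assumption, and the sample transcription is made up for illustration, following the expected format stated in the cell's comments.

```python
import os


def parse_transcript_line(line):
    """Return (utt_id, filename, transcription), or None for a malformed line."""
    parts = line.strip().split('" "')
    if len(parts) < 2:
        return None
    # Remove the remaining outer quotes from both fields.
    utt_field = parts[0].replace('"', '').strip()
    transcription = parts[1].replace('"', '').strip()
    # "ARA NORM  0002.wav" -> "0002.wav" -> "0002"
    filename = utt_field.split()[-1]
    utt_id = os.path.splitext(filename)[0]
    return utt_id, filename, transcription


# Illustrative line in the expected format (not taken from the corpus).
sample = '"ARA NORM  0002.wav" "sil w a r a\' sil"'
print(parse_transcript_line(sample))
```

Because the file name is taken as the last whitespace-separated piece of the first field, the result matches the names produced by the renaming step.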


In [None]:
# Set your dataset paths (update these paths as needed)
dataset_path = '/content/arabic-speech-corpus/test set'  # Replace with your dataset root directory
wav_dir = os.path.join(dataset_path, 'wav')
transcript_file = os.path.join(dataset_path, 'phonetic-transcript.txt')

# Output directory for Kaldi-style files (test split)
kaldi_data_dir = '/content/arabic-speech-corpus/kaldi_data/test'  # Update this as needed
os.makedirs(kaldi_data_dir, exist_ok=True)

# Open output files for Kaldi-style directory
wav_scp = open(os.path.join(kaldi_data_dir, 'wav.scp'), 'w', encoding='utf-8')
text_f = open(os.path.join(kaldi_data_dir, 'text'), 'w', encoding='utf-8')
utt2spk = open(os.path.join(kaldi_data_dir, 'utt2spk'), 'w', encoding='utf-8')

# Set a default speaker ID (adjust if you have multiple speakers)
default_spk = "arabic"

with open(transcript_file, 'r', encoding='utf-8') as f:
    for line in f:
        # Remove extra quotes and split the line into fields
        # Expected format: "ARA NORM  0002.wav" "buckwalter transcription"
        parts = line.strip().split('" "')
        if len(parts) < 2:
            continue
        # Clean up the fields (remove any remaining quotes)
        utt_field = parts[0].replace('"', '').strip()
        buckwalter_transcription = parts[1].replace('"', '').strip()

        # Extract the filename from utt_field. Example: "ARA NORM  0002.wav"
        utt_filename = utt_field.split()[-1]
        # Remove the file extension to create an utterance ID (e.g., "0002")
        utt_id = os.path.splitext(utt_filename)[0]

        # The Buckwalter-to-Arabic conversion below is commented out so that
        # the phonetic transcription is used unchanged.
        # arabic_transcription = bt.transliterate(buckwalter_transcription)

        # Write to wav.scp (assumes the wav files are in the wav/ directory)
        wav_path = os.path.join(wav_dir, utt_filename)
        wav_scp.write(f"{utt_id} {wav_path}\n")

        # Write the (unconverted) transcription to the text file
        text_f.write(f"{utt_id} {buckwalter_transcription}\n")

        # Write to utt2spk (assign default speaker)
        utt2spk.write(f"{utt_id} {default_spk}\n")

# Close the files
wav_scp.close()
text_f.close()
utt2spk.close()

print("Kaldi-style testing data files have been created in:", kaldi_data_dir)

Kaldi-style testing data files have been created in: /content/arabic-speech-corpus/kaldi_data/test


In [None]:
import os

# Set the paths to your utt2spk and the output spk2utt file
utt2spk_path = '/content/arabic-speech-corpus/kaldi_data/train/utt2spk'  # Update this path
spk2utt_path = '/content/arabic-speech-corpus/kaldi_data/train/spk2utt'    # Update this path

# Dictionary to accumulate utterances for each speaker
speaker_dict = {}

# Read utt2spk file
with open(utt2spk_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) != 2:
            continue  # Skip any malformed lines
        utt, spk = parts
        if spk not in speaker_dict:
            speaker_dict[spk] = []
        speaker_dict[spk].append(utt)

# Write spk2utt file
with open(spk2utt_path, 'w', encoding='utf-8') as f:
    for spk, utt_list in speaker_dict.items():
        f.write(f"{spk} {' '.join(utt_list)}\n")

print(f"train spk2utt file has been created at: {spk2utt_path}")


train spk2utt file has been created at: /content/arabic-speech-corpus/kaldi_data/train/spk2utt


In [None]:
import os

# Set the paths to your utt2spk and the output spk2utt file
utt2spk_path = '/content/arabic-speech-corpus/kaldi_data/test/utt2spk'  # Update this path
spk2utt_path = '/content/arabic-speech-corpus/kaldi_data/test/spk2utt'    # Update this path

# Dictionary to accumulate utterances for each speaker
speaker_dict = {}

# Read utt2spk file
with open(utt2spk_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) != 2:
            continue  # Skip any malformed lines
        utt, spk = parts
        if spk not in speaker_dict:
            speaker_dict[spk] = []
        speaker_dict[spk].append(utt)

# Write spk2utt file
with open(spk2utt_path, 'w', encoding='utf-8') as f:
    for spk, utt_list in speaker_dict.items():
        f.write(f"{spk} {' '.join(utt_list)}\n")

print(f"test spk2utt file has been created at: {spk2utt_path}")


test spk2utt file has been created at: /content/arabic-speech-corpus/kaldi_data/test/spk2utt
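The utt2spk to spk2utt conversion in the two cells above is an inversion of a mapping. The sketch below shows the same idea on in-memory lines using `collections.defaultdict`; the function name `invert_utt2spk` and the sample lines are assumptions for illustration.

```python
from collections import defaultdict


def invert_utt2spk(lines):
    """Map each speaker to the list of its utterance IDs, in input order."""
    spk2utt = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines, as the cells above do
        utt, spk = parts
        spk2utt[spk].append(utt)
    return dict(spk2utt)


# Illustrative input: two valid lines and one malformed line.
utt2spk_lines = ["0001 arabic", "0002 arabic", "broken-line"]
for spk, utts in invert_utt2spk(utt2spk_lines).items():
    print(spk, " ".join(utts))
```

With a single default speaker, as in this notebook, the resulting spk2utt file has one line listing every utterance ID.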


Creating Token list

In [None]:
kaldi_data_dir = "/content/arabic-speech-corpus/kaldi_data/train"

token_set = set()
text_path = os.path.join(kaldi_data_dir, 'text')
with open(text_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            transcript = parts[1]
            token_set.update(list(transcript))

# Write token list
token_list_path = os.path.join(kaldi_data_dir, 'token_list.txt')
with open(token_list_path, 'w', encoding='utf-8') as f:
    for token in sorted(token_set):
        f.write(token + "\n")
print("Token list created at:", token_list_path)


Token list created at: /content/arabic-speech-corpus/kaldi_data/train/token_list.txt
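The cell above collects individual characters. Since the text file holds the phonetic transcription, whose symbols are separated by spaces (see the inference input at the end of this notebook), a character-level list splits multi-character symbols and includes the space itself. The sketch below compares the two granularities on made-up sample strings; the function names are assumptions, and which granularity the training configuration expects is not stated here, so treat this as a comparison and not a recommendation.

```python
def char_tokens(transcripts):
    """Unique characters across transcripts, sorted (what the cell above collects)."""
    tokens = set()
    for transcript in transcripts:
        tokens.update(transcript)
    return sorted(tokens)


def symbol_tokens(transcripts):
    """Unique whitespace-separated symbols across transcripts, sorted."""
    tokens = set()
    for transcript in transcripts:
        tokens.update(transcript.split())
    return sorted(tokens)


# Illustrative phonetic strings (not taken from the corpus).
samples = ["sil w a r a' sil", "sil jj A sil"]
print(char_tokens(samples))
print(symbol_tokens(samples))
```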


NumPy 1.26.4 needs to be installed because Trainer.train() requires np.dtypes. To check whether a different NumPy version is suitable, run the code below.



```
import numpy as np
print(np.__version__)
print(hasattr(np, "dtypes"))
```



# Re-Installing Libraries and Model Weights for Training

In [None]:
!pip install numpy==1.26.4

Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.5
    Uninstalling numpy-1.23.5:
      Successfully uninstalled numpy-1.23.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
espnet 202412 requires numpy<1.24, but you h

Downloading the model checkpoint and training configuration file with espnet_model_zoo (the log below shows this model being fetched from Zenodo)

In [None]:
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader()  # <module_dir> is used as cachedir by default
# model_id = "espnet/kan-bayashi_libritts_xvector_vits" #originally used to train VITS model
model_id = "kan-bayashi/ljspeech_tacotron2"

model_dir = d.download_and_unpack(model_id)
print(f"Model '{model_id}' downloaded and unpacked at: {model_dir}")

https://zenodo.org/record/3989498/files/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best.zip?download=1: 100%|██████████| 102M/102M [00:32<00:00, 3.29MB/s] 


Model 'kan-bayashi/ljspeech_tacotron2' downloaded and unpacked at: {'train_config': '/usr/local/lib/python3.11/dist-packages/espnet_model_zoo/3b0a779f28d99232479e782d4d20292b/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/config.yaml', 'model_file': '/usr/local/lib/python3.11/dist-packages/espnet_model_zoo/3b0a779f28d99232479e782d4d20292b/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/199epoch.pth'}


Filling the variables DUMP_DIR and data_info for training configuration

In [None]:
arabic_data_dir = "/content/arabic-speech-corpus/kaldi_data"
# Directory containing your dumped Arabic dataset in Kaldi-style
DUMP_DIR = arabic_data_dir

# Data information mapping keys to file names and types:
data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}
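Each data_info entry maps a data name to a file name and a data type. Later in this notebook the trainer's data lists hold [path, name, type] triples, so the mapping can be pictured as being expanded per split roughly as below. This is an illustration of the triple format only, not ESPnetEZ's actual implementation; the function name `build_data_triples` is an assumption.

```python
import os


def build_data_triples(dump_dir, split, data_info):
    """Expand data_info into [path, name, type] triples for one split."""
    return [
        [os.path.join(dump_dir, split, filename), name, data_type]
        for name, (filename, data_type) in data_info.items()
    ]


data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}
for triple in build_data_triples("/content/arabic-speech-corpus/kaldi_data", "train", data_info):
    print(triple)
```

Seen this way, DUMP_DIR must contain one sub-directory per split, and each sub-directory must hold the files named in data_info.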

Installing Protobuf version 3.20.1, since trainer.collect_stats() requires it. The pip dependency error for the Google packages can be ignored, as it does not affect training.

In [None]:
!pip install protobuf==3.20.1

Collecting protobuf==3.20.1
  Downloading protobuf-3.20.1-py2.py3-none-any.whl.metadata (720 bytes)
Downloading protobuf-3.20.1-py2.py3-none-any.whl (162 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.29.4
    Uninstalling protobuf-5.29.4:
      Successfully uninstalled protobuf-5.29.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
espnet 202412 requires numpy<1.24, but you have numpy 1.26.4 which is incompatible.
google-cloud-resource-manager 1.14.2 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.20.2, but you hav

Importing espnetez

In [None]:
import espnetez as ez

Failed to import Flash Attention, using ESPnet default: No module named 'flash_attn'


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
  @torch.cuda.amp.autocast(enabled=False)


Logging in to Wandb for metric gathering. It will ask for a Wandb API key, which is used to store and display the metrics.

In [None]:
!wandb login --relogin

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


# Configuring for Training

Configuring the downloaded training config. Depending on the task, finetune_config["tts"] accepts only certain models; the supported models can be found in the task files at the link below. If an unsupported model is set, the error appears when Trainer.train() is called.

Link: https://github.com/espnet/espnet/tree/master/espnet2/tasks

In [None]:
TASK = "tts"  # The task depends on the model, e.g. VITS works only with the gan_tts task

pretrain_config = ez.config.from_yaml(TASK, model_dir["train_config"])

# Update the configuration with the downloaded model file path
pretrain_config["model_file"] = model_dir["model_file"]

# Modify configuration for fine-tuning
finetune_config = pretrain_config.copy()
finetune_config["tts"] = "tacotron2"  # Must be a model supported by the chosen task
finetune_config["batch_size"] = 1
finetune_config["num_workers"] = 1
finetune_config["max_epoch"] = 100
finetune_config["batch_bins"] = 500000
finetune_config["num_iters_per_epoch"] = 2
finetune_config["generator_first"] = True
finetune_config["use_wandb"] = True
finetune_config["wandb_project"] = "ESPnet Training"
finetune_config["wandb_name"] = "ESPnet Tacotron2 run 100 epochs"

# Disable distributed training
finetune_config["distributed"] = False
finetune_config["multiprocessing_distributed"] = False
finetune_config["dist_world_size"] = None
finetune_config["dist_rank"] = None
finetune_config["local_rank"] = None
finetune_config["dist_master_addr"] = None
finetune_config["dist_master_port"] = None
finetune_config["dist_launcher"] = None
finetune_config["pretrain_path"] = None
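The cell above derives finetune_config with `.copy()` and then sets only top-level keys. If the configuration behaves like a plain dict (an assumption; the return type of ez.config.from_yaml is not shown here), `.copy()` is shallow, which matters only once nested settings are edited. The sketch below demonstrates the difference on a hypothetical stand-in dict.

```python
import copy

# Hypothetical stand-in for a loaded training configuration.
pretrain = {"max_epoch": 200, "optim_conf": {"lr": 0.001}}

shallow = pretrain.copy()
shallow["max_epoch"] = 100            # top-level key: the original is untouched
shallow["optim_conf"]["lr"] = 0.0001  # nested dict is shared: the original changes too

deep = copy.deepcopy(pretrain)
deep["optim_conf"]["lr"] = 0.01       # fully independent copy

print(pretrain["max_epoch"], pretrain["optim_conf"]["lr"])
```

For the top-level overrides used in this notebook a shallow copy is sufficient; `copy.deepcopy` is the safer choice if nested sections of the configuration are modified as well.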

Dumping the contents of the configuration as YAML to verify them

In [None]:
import yaml

print("Fine-tuning configuration:")
print(yaml.dump(finetune_config, sort_keys=False))

Fine-tuning configuration:
log_level: INFO
drop_last_iter: false
dry_run: false
iterator_type: sequence
valid_iterator_type: null
output_dir: exp/tts_train_tacotron2_raw
seed: 0
num_workers: 1
num_att_plot: 3
dist_backend: nccl
dist_init_method: env://
dist_world_size: null
dist_rank: null
local_rank: null
dist_master_addr: null
dist_master_port: null
dist_launcher: null
multiprocessing_distributed: false
unused_parameters: false
sharded_ddp: false
use_deepspeed: false
deepspeed_config: null
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: true
use_tf32: false
collect_stats: false
write_collected_feats: false
max_epoch: 100
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
- - valid
  - loss
  - min
- - train
  - loss
  - min
keep_nbest_models: 5
nbest_averaging_interval: 0
grad_clip: 1.0
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use

Defining the experiment and stats directories and initializing ez.Trainer

In [None]:
DATASET_NAME = "asc"
EXP_DIR = f"./exp/finetune_{TASK}_{DATASET_NAME}"  # Output directory containing the trained model weights and config.yaml file
STATS_DIR = f"./exp/stats_{DATASET_NAME}"
ngpu = 1

trainer = ez.Trainer(
    task=TASK,
    train_config=finetune_config,
    train_dump_dir=f"{DUMP_DIR}/train",
    valid_dump_dir=f"{DUMP_DIR}/test",
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=ngpu,
)

# Add the xvector paths to the configuration
trainer.train_config.train_data_path_and_name_and_type += [
    ["/content/train_speaker_embeddings/train_spk_embed.scp", "spembs", "kaldi_ark"],
]
trainer.train_config.valid_data_path_and_name_and_type += [
    ["/content/test_speaker_embeddings/test_spk_embed.scp", "spembs", "kaldi_ark"],
]

Downloading the NLTK POS tagger. Even though averaged_perceptron_tagger is downloaded when espnetez is imported, an error is still thrown that averaged_perceptron_tagger_eng is missing.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

Collecting stats from the dataset

In [None]:
# Temporarily disable normalization to collect stats
trainer.train_config.normalize = None
trainer.train_config.pitch_normalize = None
trainer.train_config.energy_normalize = None

# Collect statistics from the training dump
trainer.collect_stats()

# After collecting stats, re-enable normalization if required.
trainer.train_config.write_collected_feats = False
if finetune_config.get("normalize") is not None:
    trainer.train_config.normalize = finetune_config["normalize"]
    trainer.train_config.normalize_conf["stats_file"] = f"{STATS_DIR}/train/feats_stats.npz"
if finetune_config.get("pitch_normalize") is not None:
    trainer.train_config.pitch_normalize = finetune_config["pitch_normalize"]
    trainer.train_config.pitch_normalize_conf["stats_file"] = f"{STATS_DIR}/train/pitch_stats.npz"
if finetune_config.get("energy_normalize") is not None:
    trainer.train_config.energy_normalize = finetune_config["energy_normalize"]
    trainer.train_config.energy_normalize_conf["stats_file"] = f"{STATS_DIR}/train/energy_stats.npz"


/usr/bin/python3 /usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py -f /root/.local/share/jupyter/runtime/kernel-9d3c8587-d5e9-4de8-a70d-14a233da3083.json
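The three `if` blocks in the cell above follow one pattern: if a normalizer was set in the saved configuration, restore it and point it at the matching stats file. The sketch below expresses that pattern as a loop over plain dicts. It is a simplification: the notebook's trainer.train_config is accessed through attributes, and the function name `restore_normalization` is an assumption.

```python
def restore_normalization(train_config, saved_config, stats_dir):
    """Re-enable each normalizer present in saved_config and set its stats file."""
    stats_names = {
        "normalize": "feats_stats.npz",
        "pitch_normalize": "pitch_stats.npz",
        "energy_normalize": "energy_stats.npz",
    }
    for key, filename in stats_names.items():
        if saved_config.get(key) is not None:
            train_config[key] = saved_config[key]
            # Reuse the existing *_conf dict if there is one, else start empty.
            conf = train_config.get(f"{key}_conf") or {}
            conf["stats_file"] = f"{stats_dir}/train/{filename}"
            train_config[f"{key}_conf"] = conf
    return train_config


# Illustrative values only; "global_mvn" stands in for whatever was configured.
current = {"normalize": None, "pitch_normalize": None}
saved = {"normalize": "global_mvn", "pitch_normalize": None}
print(restore_normalization(current, saved, "./exp/stats_asc"))
```

Normalizers that were never configured are left disabled, which matches the behaviour of the original three checks.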


# Training

Training (fine-tuning) the model. The output of this training is written to the exp folder, which will be needed for inference.

In [None]:
trainer.train()

/usr/bin/python3 /usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py -f /root/.local/share/jupyter/runtime/kernel-9d3c8587-d5e9-4de8-a70d-14a233da3083.json
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: addalin (addalinqmul) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Streaming output truncated to the last 5000 lines.


KeyboardInterrupt: 

# Inferencing

Running inference with the fine-tuned model weights. Some models only require the config.yaml file, but others will throw an error if the exp folder is missing.

In [None]:
from espnet2.bin.tts_inference import Text2Speech
import kaldiio  # Reads Kaldi-style scp/ark files
import soundfile as sf  # Writes the synthesized waveform to disk

# The scp file is just a mapping file - you need to get an actual embedding
# First, load the mapping
spk_dict = kaldiio.load_scp("/content/train_speaker_embeddings/train_spk_embed.scp")

# Get the first speaker embedding
spk_id = list(spk_dict.keys())[0]  # Get the first speaker ID
spembs = spk_dict[spk_id]  # Get the embedding for that speaker

# with local model
tts = Text2Speech.from_pretrained(model_file="/content/67epoch.pth")
wav = tts("sil w a r a' jj A H a tt A q r ii0' r u0 ll a * i0 < a E a' dd a h u0 m a' E h a d u0 < a b H aa' ^ i0 h A D A' b a t i0 tt i1' b t i0 f i0 l < a k aa d ii0 m ii0' y a t i0 SS II0 n ii0' y a t i0 l u0 l E u0 l uu0' m i0 sil < a' n t a s t a m i0' rr a d a r a j aa' t u0 l H a r aa' r a t i0 w a m u0 s t a w a y aa' t u0 rr U0 T UU0' b a t i0 f i0 l Ah i0 r t i0 f aa' E i0 T A' w A l a h aa' * a l q A' r n sil",spembs=spembs)["wav"]
sf.write("output.wav", wav.numpy(), tts.fs, "PCM_16")