# Train your TTS (VITS model) with Coqui TTS 🐸 and Whisper

This notebook is a gentle guide to training and testing your own VITS model with Coqui TTS, using Whisper to transcribe the training audio. I reorganized the Colab notebook code from the amazing YouTuber [NanoNomad](https://www.youtube.com/watch?v=6QAGk_rHipE&t=318s&ab_channel=NanoNomad). I am deeply thankful to all Coqui TTS contributors and the OpenAI team for making these remarkable AI models accessible to all of us.

In [None]:
from google.colab import drive

drive.mount("/content/drive")

## 💡 Define the variables

Additional information about the variables and directory paths:

*   ds_name: Dataset name, also used as the name of the dataset directory.
*   output_dir: Directory where everything created during training (such as config.json and .pth files) is stored.
*   MODEL_FILE: (No need to change) Default path of the VITS model from Coqui.
*   RUN_NAME: (Optional) A name for your training run.
*   test_path: Output audio files for checking your training results are saved here.
*   wavs: The original audio files, chunked for training.
*   mono: The converted 22050 Hz mono wav files are saved here.
*   output_path: Full path of output_dir, where the training data is stored.
*   open_path: Your metadata.csv file is saved here.
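
As a rough sketch of how these paths fit together, the snippet below rebuilds them from `ds_name` and `output_dir`. The helper name `build_paths` and the `drive_root` parameter are my own, made up for illustration; the notebook itself simply concatenates strings in the next cell.

```python
from pathlib import PurePosixPath

def build_paths(ds_name, output_dir, drive_root="/content/drive/MyDrive"):
    """Hypothetical helper: derive the notebook's directories from the dataset name."""
    base = PurePosixPath(drive_root) / ds_name
    return {
        "test_path": str(base / "testoutput"),
        "wavs": str(base / "wavs"),
        "mono": str(base / "mono"),
        "output_path": str(base / output_dir),
        "open_path": str(base / f"{ds_name}_metadata.csv"),
    }

paths = build_paths("vits-ds-a", "traineroutput-a")
for name, value in paths.items():
    print(f"{name}: {value}")
```

Everything lives under one dataset folder on Drive, so switching `ds_name` switches the whole workspace.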





In [None]:
ds_name = "vits-ds-a" #@param ["vits-ds-a","vits-ds-b","vits-ds-c"]
output_dir = "traineroutput-a" #@param ["traineroutput-a","traineroutput-b","traineroutput-c"]
MODEL_FILE = "/root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth" #@param {type:"string"}
RUN_NAME = "VITS-eng" #@param {type:"string"}

test_path = "/content/drive/MyDrive/"+ds_name+"/testoutput/"
wavs = "/content/drive/MyDrive/"+ds_name+"/wavs/"  #@param {type:"string"}
mono =  "/content/drive/MyDrive/"+ds_name+"/mono" #@param {type:"string"}

output_path = "/content/drive/MyDrive/"+ds_name + "/"+output_dir+"/" #output for the training
meta_name = ds_name + '_metadata.csv'
open_path = '/content/drive/MyDrive/'+ds_name+'/'+meta_name #metadata.csv path


In [None]:
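# -p creates missing parent folders and does not fail if the folder already exists
!mkdir -p /content/drive/MyDrive/$ds_name
!mkdir -p $test_path
!mkdir -p $wavs
!mkdir -p $output_path
!mkdir -p $mono
!mkdir -p $mono/wavs/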

### 📂Upload your audio files (Optional)

*   Recommended for a small number of audio files.
*   If there is a large number of files, upload them directly to the $wavs directory.

In [None]:
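import os
from google.colab import files

# Install ffmpeg-normalize in case it is not available in the runtime
!pip install -q ffmpeg-normalize

print("Select your audio samples for the training")
target_files = files.upload()
target_files = list(target_files.keys())
ds_path = "/content/drive/MyDrive/"+ds_name
wavs_dir = os.path.join(ds_path, 'wavs')

for sample in target_files:
    save_path = os.path.join(wavs_dir, sample)
    # RMS-normalize to -27 dB, resample to 16 kHz, overwrite existing output (-f)
    !ffmpeg-normalize "{sample}" -nt rms -t=-27 -o "{save_path}" -ar 16000 -f

# List the wavs folder (not the dataset root, which also holds other folders)
saved_files = os.listdir(wavs_dir)
print("Saved sample files: ")
print(saved_files)

# Every uploaded file should now exist in the wavs folder
missing = [sample for sample in target_files if sample not in saved_files]
assert not missing, f"Failed to save audio files: {missing}"
print("Audio files successfully saved")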

### Convert audio files to 22050 Hz mono wav files

In [None]:
import os
import subprocess

def convert_to_mono_22050hz_sox(input_file, output_file):
    # Resample to 22050 Hz and downmix to one channel; raise an error if sox fails
    subprocess.run(["sox", input_file, "-r", "22050", "-c", "1", output_file], check=True)

def convert_files_in_directory(ds_dir):
    out_dir = os.path.join(mono, "wavs")
    for root, _, files in os.walk(ds_dir):
        for file in files:
            if file.lower().endswith(('.mp3', '.wav')):
                input_file = os.path.join(root, file)
                output_file = os.path.join(out_dir, f"{os.path.splitext(file)[0]}.wav")
                print(output_file)
                convert_to_mono_22050hz_sox(input_file, output_file)
    # Report once, after every directory has been processed
    print(f"Files saved successfully to {out_dir}")

# sox must already be installed when this cell runs (see the install cell below)
convert_files_in_directory(wavs)
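
If you want to double-check the conversion, here is a minimal sketch that reads a wav header with Python's standard `wave` module. The helper `is_mono_22050` is hypothetical (it is not part of the original notebook), and the demo writes one second of silence to a temporary file so the snippet runs on its own. It only handles plain PCM wav files.

```python
import os
import tempfile
import wave

def is_mono_22050(path, expected_rate=22050):
    """Return True if the PCM wav file at `path` is mono with the expected sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == expected_rate

# Demo: write one second of 16-bit silence so the check has something to read.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(22050)
    wav_file.writeframes(b"\x00\x00" * 22050)

print(demo_path, is_mono_22050(demo_path))
```

In the notebook you could call the same function on every file under `mono + "/wavs/"` before moving on to transcription.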

## Install Whisper and Coqui-ai 🐸

Install whisper

In [None]:
%cd /content
!sudo apt install -y sox
!git clone https://github.com/openai/whisper.git
!pip install git+https://github.com/openai/whisper.git

Install Coqui TTS

In [None]:
%cd /content
!sudo apt-get install -y espeak-ng
!git clone https://github.com/coqui-ai/TTS.git
!pip install TTS
!pip install Trainer==0.0.20

## Let's make the metadata.csv file ✍

In [None]:
import glob
import pandas as pd
from tqdm import tqdm
from pathlib import Path

all_filenames = []
transcript_text = []

paths = glob.glob(os.path.join(mono+'/wavs/', '*.wav'))
print("Number of wav files: ",len(paths))

In [None]:
import whisper
model = whisper.load_model("medium.en") #suitable for English audio file
# model = whisper.load_model("large-v2") #you may look for other whisper models

In [None]:
with open(open_path, 'w', encoding='utf-8') as outfile:
    for filepath in tqdm(paths):
        base = os.path.basename(filepath)
        all_filenames.append(base)
        result = model.transcribe(filepath)
        # Keep each transcript on one line; "|" separates columns, so drop it from the text
        output = result["text"].strip().replace("\n", " ").replace("|", " ")
        # File name without extension (safe for names that contain extra dots)
        thefile = os.path.splitext(base)[0]
        # LJSpeech layout: file name | text | normalized text
        line = thefile + '|' + output + '|' + output
        outfile.write(line + '\n')
        print(line)

In [None]:
#Check your metadata.csv file
!cat $open_path
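
Before training, it can help to sanity-check the file. The sketch below assumes the three-column `name|text|text` layout written by the cell above; the function `check_metadata_lines` and the sample lines are hypothetical and only serve as an illustration.

```python
def check_metadata_lines(lines):
    """Return (line_number, message) pairs for lines that do not look like name|text|text."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != 3:
            problems.append((number, f"expected 3 fields, found {len(fields)}"))
        elif not fields[0].strip() or not fields[1].strip():
            problems.append((number, "empty file name or transcript"))
    return problems

# Made-up sample lines: one good, one truncated, one with a stray separator.
sample = [
    "clip_0001|Hello there.|Hello there.\n",
    "clip_0002|\n",
    "clip_0003|A stray | pipe.|A stray | pipe.\n",
]
problems = check_metadata_lines(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

To check the real file, you could pass `open(open_path, encoding="utf-8")` instead of `sample`.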

## Prepare Training

If you get an error about mecab-python3, refer to the common issues listed in its [GitHub repository](https://github.com/SamuraiT/mecab-python3#common-issues).

In [None]:
!tts --text "I am the very model of a modern Major General" --model_name "tts_models/en/ljspeech/vits" --out_path /content/ljspeech-vits.wav

In [None]:
import torch
%load_ext tensorboard

In [None]:
!ls -al $output_path

In [None]:
#Set the tensorboard. Refresh the tensorboard to track the training process
%tensorboard --logdir $output_path

In [None]:
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

In [None]:
SKIP_TRAIN_EPOCH = False

In [None]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train=open_path, path=mono
)

We are using the ***VITS*** model here! You can use other pre-trained models as well. Refer to the official [Coqui TTS documentation](https://tts.readthedocs.io/en/latest/models/vits.html).

In [None]:
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)

config = VitsConfig(
    audio=audio_config,
    run_name=RUN_NAME,  # name chosen in the variables cell
    batch_size=16,
    eval_batch_size=16,
    batch_group_size=16,
#    num_loader_workers=8,
    num_loader_workers=2,
    num_eval_loader_workers=2,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=25000, #adjust the epoch
    save_step=20,
    save_checkpoints=True,
    save_n_checkpoints=4,
    save_best_after=1000,
    #text_cleaner="english_cleaners",
    text_cleaner="multilingual_cleaners",
    use_phonemes=False,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
)

# INITIALIZE THE AUDIO PROCESSOR
ap = AudioProcessor.init_from_config(config)
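
To get a feel for the audio settings above, here is a small piece of arithmetic using the same `sample_rate`, `hop_length`, and `win_length` values. It is only a sketch to show what the numbers mean in time; it does not call Coqui TTS.

```python
sample_rate = 22050  # samples per second
hop_length = 256     # samples between consecutive spectrogram frames
win_length = 1024    # samples covered by each analysis window

frames_per_second = sample_rate / hop_length
hop_ms = 1000 * hop_length / sample_rate
window_ms = 1000 * win_length / sample_rate

print(f"frames per second: {frames_per_second:.2f}")
print(f"hop: {hop_ms:.1f} ms, window: {window_ms:.1f} ms")
```

So each second of audio becomes roughly 86 spectrogram frames, each looking at about 46 ms of sound.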


In [None]:
# INITIALIZE THE TOKENIZER
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size= config.eval_split_size, #Small number of samples can cause an error
)
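
The comment about small datasets is easier to see with numbers. The sketch below assumes, as far as I understand the Coqui split logic, that a fractional `eval_split_size` is multiplied by the number of samples and truncated to an integer, and that a result of zero is rejected. The function `eval_sample_count` is hypothetical and this behavior may differ between versions.

```python
def eval_sample_count(num_samples, eval_split_size=0.01):
    """Hypothetical helper: how many samples would end up in the evaluation set."""
    if eval_split_size > 1:
        # Assumption: values above 1 are treated as an absolute count
        return int(eval_split_size)
    # Assumption: fractions are multiplied by the dataset size and truncated
    return int(num_samples * eval_split_size)

for count, fraction in [(50, 0.01), (50, 0.1), (200, 0.01)]:
    print(count, fraction, eval_sample_count(count, fraction))
```

If the count comes out as zero for your dataset, raising `eval_split_size` in the config is one way around it.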

In [None]:
model = Vits.init_from_config(config)

In [None]:
run_type = "restore" #@param ["continue","restore","restore-ckpt"]
print(run_type + " run selected")

In [None]:
run_folder = "Replace this with the checkpoint folder you want to continue from" #@param {type:"string"}

In [None]:
import datetime
def get_today_yymmdd():
    today = datetime.datetime.now()
    return today.strftime("%y%m%d")

date = get_today_yymmdd()

ckpt_file = "checkpoint_"+date+"_"+ds_name+".pth" #@param {type:"string"}
print(ckpt_file + " selected for restore run")
if run_type=="continue":
  print("Warning:\n restore checkpoint selected, but run type set to continue.\nTrainer will load best loss from checkpoint directory.\n Are you sure this is what you want to do?\n\nIf not, change the run type below to 'restore'")
elif run_type=="restore-ckpt":
  print("Warning:\n restore checkpoint selected, run type set to restore from selected checkpoint, not default base model.\nIf this is not correct, adjust the run type.")


In [None]:
run_type = "restore" #@param ["continue","restore","restore-ckpt"]
print(run_type + " run selected")

In [None]:
print(run_type)
if run_type == "continue":
    # Resume a previous run from its own folder
    CONTINUE_PATH = output_path + run_folder
    trainer_args = TrainerArgs(continue_path=CONTINUE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type == "restore":
    # Start from the default pre-trained VITS model
    trainer_args = TrainerArgs(restore_path=MODEL_FILE, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type == "restore-ckpt":
    # Start from a specific checkpoint inside a previous run folder
    trainer_args = TrainerArgs(restore_path=output_path + run_folder + "/" + ckpt_file, skip_train_epoch=SKIP_TRAIN_EPOCH)
else:
    raise ValueError("Unknown run_type: " + run_type)

trainer = Trainer(
    trainer_args,
    config,
    output_path=output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

LET'S START TRAINING!

In [None]:
trainer.fit()

## Take a look at the results 🔊

In [None]:
ckpts = sorted([f for f in glob.glob(output_path+"/*/*.pth")])
configs = sorted([f for f in glob.glob(output_path+"/*/*.json")])
save_file = test_path + "test_audio.wav"

print("ckpts: ",ckpts[0])
print("configs: ",configs[0])
print("Saved file_name: ",save_file)
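
Note that `ckpts[0]` is simply the first path in alphabetical order. If you would rather test the most recent checkpoint, here is a minimal sketch that picks the highest step number. It assumes checkpoint files are named like `checkpoint_<step>.pth`; the function `latest_checkpoint` and the example paths are made up for illustration.

```python
import re

def latest_checkpoint(paths):
    """Hypothetical helper: return the checkpoint_<step>.pth path with the highest step."""
    best_step, best_path = -1, None
    for path in paths:
        match = re.search(r"checkpoint_(\d+)\.pth$", path)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Made-up example paths
example = [
    "run/best_model.pth",
    "run/checkpoint_1000.pth",
    "run/checkpoint_200.pth",
]
print("alphabetical first:", sorted(example)[0])
print("latest by step:", latest_checkpoint(example))
```

The step number is compared as an integer, so `checkpoint_1000.pth` ranks above `checkpoint_200.pth` even though it sorts earlier as text.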

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
import subprocess

text = "Hello, nice to meet you."

# Pass the arguments as a list so quotes or special characters in the text cannot break the command
command = ["tts", "--text", text, "--model_path", ckpts[0], "--config_path", configs[0], "--out_path", save_file]
subprocess.run(command, check=True)


In [None]:
from IPython.display import Audio
import librosa

# Load the audio file and get the audio data and sampling rate
audio_data, sampling_rate = librosa.load(save_file, sr=None)

# Display an audio player for the generated file
Audio(data=audio_data, rate=sampling_rate)
