**Fine tune or train a VITS model in Hindi with the Coqui TTS framework using Hindi audio samples.**


Thank you to all of the [Coqui TTS](https://github.com/coqui-ai/TTS) contributors.

**Install Coqui TTS** (https://github.com/coqui-ai/TTS) and the espeak-ng phonemizer (https://github.com/espeak-ng/espeak-ng). The commented-out lines download the Coqui TTS source and examples from GitHub if you need them.

In [None]:
!pip install requests
!pip install aiohttp
!pip install numpy==1.21
!pip install TTS

#%cd /content
!sudo apt-get install espeak-ng
#!git clone https://github.com/coqui-ai/TTS.git
#%cd TTS
#!pip install -e .[all,dev,notebooks]


**Run this cell to connect your Google Drive account to save files.**

In [None]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


**Set paths and then run the next cell**

ds_path is the base folder in your Google Drive that contains the dataset (it should contain the txt and wav48_silence_trimmed directories).

output_directory is the directory where training output is stored.

MODEL_FILE is the default path to the VITS model downloaded using Coqui (you do not need to change it).
Default model path: /root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth

RUN_NAME is a short name describing your training run.

Dataset should be in "VCTK" format:

-Dataset_Directory

->wav48_silence_trimmed

->->Speaker_Name_Subdirectory

->->->22050 Hz audio files, mono, .flac format, with a filename [base]_mic1.flac

->txt

->->Speaker_Name_Subdirectory

->->->Text in a transcript named [base].txt
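
Before training, it can help to confirm that the dataset folder really follows the layout above. The following is a minimal sketch that uses only the Python standard library; the function name `find_layout_problems` is hypothetical and is not part of Coqui TTS. It checks only names and pairing, not the sample rate or channel count of the audio.

```python
from pathlib import Path


def find_layout_problems(dataset_dir):
    """Return a list of human-readable problems found in a VCTK-style dataset."""
    root = Path(dataset_dir)
    audio_root = root / "wav48_silence_trimmed"
    text_root = root / "txt"

    # Both top-level directories must exist before anything else is checked.
    problems = [
        f"missing directory: {required.name}"
        for required in (audio_root, text_root)
        if not required.is_dir()
    ]
    if problems:
        return problems

    for speaker_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
        for audio in sorted(speaker_dir.glob("*.flac")):
            if not audio.name.endswith("_mic1.flac"):
                problems.append(f"unexpected audio name: {speaker_dir.name}/{audio.name}")
                continue
            # [base]_mic1.flac must be paired with txt/<speaker>/[base].txt
            base = audio.name[: -len("_mic1.flac")]
            transcript = text_root / speaker_dir.name / f"{base}.txt"
            if not transcript.is_file():
                problems.append(f"missing transcript: {speaker_dir.name}/{base}.txt")
    return problems
```

In this notebook it could be called as `find_layout_problems("/content/drive/MyDrive/" + ds_path)` once Drive is mounted; an empty list means every audio file has a matching transcript.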

In [None]:
import os

ds_path = "vctk-hi-22k-ds" #@param {type:"string"}
output_directory = "traineroutput" #@param {type:"string"}
MODEL_FILE = "/root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth" #@param {type:"string"}
RUN_NAME = "VITS-Hindi-test" #@param {type:"string"}


OUT_PATH = "/content/drive/MyDrive/"+ds_path+"/"+output_directory+"/"

**Set run type.**

continue resumes an interrupted session.

restore begins a new session from the default model file above (downloaded from the Coqui Hub by the download cell later on).

restore-ckpt begins a new session from a prior fine-tuned checkpoint. You can set this later on in the training section.

newmodel begins a new training session with an empty VITS model.
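
Each run type depends on different values set elsewhere in this notebook (`MODEL_FILE`, `run_folder`, `ckpt_file`). The sketch below records that dependency and reports anything still unset. The table and the helper `missing_inputs` are illustrative assumptions derived from how the training cell uses these variables; they are not part of Coqui TTS.

```python
# Which notebook variables each run type relies on (derived from the training cell).
RUN_TYPE_INPUTS = {
    "continue": ("run_folder",),
    "restore": ("MODEL_FILE",),
    "restore-ckpt": ("run_folder", "ckpt_file"),
    "newmodel": (),
}


def missing_inputs(run_type, **values):
    """Return the names of required inputs that are empty or not provided."""
    if run_type not in RUN_TYPE_INPUTS:
        raise ValueError(f"unknown run type: {run_type!r}")
    return [name for name in RUN_TYPE_INPUTS[run_type] if not values.get(name)]
```

For example, `missing_inputs("restore-ckpt", run_folder=run_folder, ckpt_file=ckpt_file)` should return an empty list once both later cells have been run.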

In [None]:
run_type = "restore-ckpt" #@param ["continue","restore","restore-ckpt","newmodel"]
print(run_type + " run selected")

restore-ckpt run selected


**(Optional) List pretrained models available on the Coqui Hub**

In [None]:
#@title
!tts --list_models

 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/your_tts
 2: tts_models/bg/cv/vits
 3: tts_models/cs/cv/vits
 4: tts_models/da/cv/vits
 5: tts_models/et/cv/vits
 6: tts_models/ga/cv/vits
 7: tts_models/en/ek1/tacotron2
 8: tts_models/en/ljspeech/tacotron2-DDC
 9: tts_models/en/ljspeech/tacotron2-DDC_ph
 10: tts_models/en/ljspeech/glow-tts
 11: tts_models/en/ljspeech/speedy-speech
 12: tts_models/en/ljspeech/tacotron2-DCA
 13: tts_models/en/ljspeech/vits
 14: tts_models/en/ljspeech/vits--neon
 15: tts_models/en/ljspeech/fast_pitch
 16: tts_models/en/ljspeech/overflow
 17: tts_models/en/ljspeech/neural_hmm
 18: tts_models/en/vctk/vits
 19: tts_models/en/vctk/fast_pitch
 20: tts_models/en/sam/tacotron-DDC
 21: tts_models/en/blizzard2013/capacitron-t2-c50
 22: tts_models/en/blizzard2013/capacitron-t2-c150_v2
 23: tts_models/es/mai/tacotron2-DDC
 24: tts_models/es/css10/vits
 25: tts_models/fr/mai/tacotron2-DDC
 26: tts_models/fr/css10/vits
 27: tts_model

**Download the VITS model and generate a sample WAV file at /content/ljspeech-vits.wav. This file is deleted when your Colab session closes.**

In [None]:
#@title
!tts --text "I am the very model of a modern Major General" --model_name "tts_models/en/ljspeech/vits" --out_path /content/ljspeech-vits.wav
#!tts --text "I am the very model of a modern Major General" --model_name "tts_models/en/vctk/vits" --out_path /content/ljspeech-vits.wav

 > tts_models/en/ljspeech/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Text: I am the very model of a modern Major General
 > Text splitted to sentences.
['I am the very model of a modern Major General']
 > Processing time: 2.840787649154663
 > Rea

**Load Tensorboard**

In [None]:
import torch
%load_ext tensorboard

**Load Dashboard**
The dashboard may take several minutes to appear; until then the cell shows a blank white box. If you use an ad blocker, you will probably need to whitelist Colab's domains or the dashboard won't load.

In [None]:
%tensorboard --logdir /content/drive/MyDrive/$ds_path/$output_directory/

**If continuing a run: use the next cell to list all run directories.**

**Copy and paste the name of the run you want to continue, or restore a checkpoint from, into the next box.**

In [None]:
#@title
!ls -al /content/drive/MyDrive/$ds_path/traineroutput

total 8
drwx------ 2 root root 4096 Mar 12 09:41 vits-vctk-hi-22k-March-12-2023_09+41AM-0000000
drwx------ 2 root root 4096 Mar 12 09:45 vits-vctk-hi-22k-March-12-2023_09+45AM-0000000


**Run folder to continue from or Run folder that contains your restore checkpoint**

In [None]:
run_folder = "vits-vctk-hi-22k-March-12-2023_09+45AM-0000000" #@param {type:"string"}


List the checkpoints in the run folder. A checkpoint only needs to be selected for a restore run.

A continue run loads the last best-loss checkpoint on its own, according to the config.json stored in the run directory. In other words, a continue run takes a directory, while a restore run takes a model file.
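
When picking a checkpoint for a restore run, a plain `ls` sorts names alphabetically, so `checkpoint_15000.pth` is listed before `checkpoint_2000.pth`. The sketch below sorts by training step instead. It assumes checkpoints are named `checkpoint_<step>.pth`, as in the `checkpoint_15000.pth` example used later in this notebook; the helper `list_checkpoints` is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: checkpoint_<step>.pth
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pth$")


def list_checkpoints(run_dir):
    """Return (step, filename) pairs for checkpoint files, lowest step first."""
    found = []
    for path in Path(run_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path.name))
    return sorted(found)
```

The last element of the returned list is the most recent checkpoint by step; other files in the run folder, such as config.json, are ignored.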

In [None]:
!ls -al /content/drive/MyDrive/$ds_path/traineroutput/$run_folder

**If changing to a different "restore" checkpoint to begin a new training session with a model you are already training, set the checkpoint filename here**

In [None]:
ckpt_file = "checkpoint_15000.pth" #@param {type:"string"}
print(ckpt_file + " selected for restore run")
if run_type=="continue":
  print("Warning:\n restore checkpoint selected, but run type set to continue.\nTrainer will load best loss from checkpoint directory.\n Are you sure this is what you want to do?\n\nIf not, change the run type below to 'restore-ckpt'")
elif run_type=="restore-ckpt":
  print("Warning:\n restore checkpoint selected, run type set to restore from selected checkpoint, not default base model.\nIf this is not correct, adjust the run type.")


**Last chance to change run type**

In [None]:
run_type = "restore-ckpt" #@param ["continue","restore","restore-ckpt","newmodel"]
print(run_type + " run selected")

restore-ckpt run selected


**Run the next cells in order to begin training**

In [None]:
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig, CharactersConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.tts.utils.data import get_length_balancer_weights
from TTS.tts.utils.languages import LanguageManager, get_language_balancer_weights
from TTS.tts.utils.speakers import SpeakerManager, get_speaker_balancer_weights, get_speaker_manager

In [None]:
output_path = os.path.dirname("/content/drive/MyDrive/"+ds_path+"/traineroutput/")
# Speaker encoder used to extract speaker embeddings
# https://github.com/coqui-ai/TTS/releases/tag/speaker_encoder_model
SPEAKER_ENCODER_CHECKPOINT_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"
SKIP_TRAIN_EPOCH = False
BATCH_SIZE = 16
SAMPLE_RATE = 22050
MAX_AUDIO_LEN_IN_SECONDS = 10
NUM_RESAMPLE_THREADS = 10


vctk_hi = BaseDatasetConfig(
    formatter="vctk",
    meta_file_train="",
    phonemizer=None,
    dataset_name="vctk-hi-22k-ds",
    language="hi",
    path="/content/drive/MyDrive/"+ds_path
)

DATASETS_CONFIG_LIST = [vctk_hi]
D_VECTOR_FILES=[]

In [None]:
characters_config = CharactersConfig(
    characters_class="TTS.tts.models.vits.VitsCharacters",
    pad="<PAD>",
    eos="<EOS>",
    bos="<BOS>",
    blank="<BLNK>",
    phonemes = None,
    characters = "ABCDEFGHIJKLMNOPRSTVWXYZabcdefghijklmnopqrstuvwxyzँगऊोग़डटणढ़ॉएपदझ़ंृघभसछिठक़कःहऔजाओत्ऋऐधईीथञज़लूखढचऑबनवशफआयख़ौड़रइऍअमफ़ॠैउषुेँंः",
    punctuations = "|।–!,-. ?"
    )
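
Because `use_phonemes` is False in this notebook, the model sees raw characters, so any character that appears in the transcripts but not in `characters` or `punctuations` cannot be represented. A minimal sketch for spotting such characters is shown below. The helper `characters_outside_vocabulary` is hypothetical, and the check runs on raw text, before the text cleaner is applied, so treat the result as an approximation.

```python
from collections import Counter


def characters_outside_vocabulary(transcripts, characters, punctuations):
    """Count characters in the transcripts that the character config does not cover."""
    allowed = set(characters) | set(punctuations)
    counts = Counter(
        ch for text in transcripts for ch in text if ch not in allowed
    )
    return dict(counts)
```

Pass it the stripped contents of the transcript files together with the two strings from the cell above; an empty dictionary means every character is covered.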

In [None]:
audio_config = VitsAudioConfig(
    sample_rate=22050,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    mel_fmin=0,
    mel_fmax=None
)

vitsArgs = VitsArgs(
    use_d_vector_file=True,
    d_vector_file=D_VECTOR_FILES,
    d_vector_dim=512,
    num_layers_text_encoder=6,
    embedded_language_dim=4,
    speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
    use_language_embedding=False,
    use_speaker_embedding=False,
    use_speaker_encoder_as_loss=True,
    use_sdp=True,
)

config = VitsConfig(
    model_args=vitsArgs,
    characters=characters_config,
    audio=audio_config,
    run_name="vits-vctk-hi-22k",
    max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
    add_blank=True,
    min_text_len=1,
    min_audio_len=1,
    #max_text_len=325,
    batch_size=16,
    eval_batch_size=16,
    batch_group_size=16,
    num_loader_workers=1,
    num_eval_loader_workers=1,
    eval_split_max_size=256,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=10000,
    save_step=1000,
    save_checkpoints=True,
    save_best_after=1000,
    save_n_checkpoints=4,
    use_weighted_sampler=True,
    #use_weighted_sampler=False,
    start_by_longest=True,
    weighted_sampler_attrs={"speaker_name": 1.0},
    #https://github.com/coqui-ai/TTS/pull/2234#issuecomment-1369538965
    weighted_sampler_multipliers={"speaker_name": {}},
    speaker_encoder_loss_alpha=9.0,
    text_cleaner="multilingual_cleaners",
    #text_cleaner="basic_cleaners"
    use_phonemes=False,
    compute_input_seq_cache=True,
    print_step=50,
    print_eval=True,
    mixed_precision=False,
    output_path=output_path,
    datasets=[vctk_hi],
    cudnn_benchmark=False,
)
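
The length and batch settings above translate into concrete numbers. The sketch below works them out from the values in this notebook. The step count assumes one optimizer step per batch and that the final partial batch is kept, which may not match the trainer exactly.

```python
import math

SAMPLE_RATE = 22050
MAX_AUDIO_LEN_IN_SECONDS = 10
HOP_LENGTH = 256
BATCH_SIZE = 16

# max_audio_len is expressed in samples: 22050 * 10 = 220500
max_audio_len = SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS

# Number of whole hops of 256 samples that fit in the longest allowed clip
max_frames = max_audio_len // HOP_LENGTH


def steps_per_epoch(num_train_samples, batch_size=BATCH_SIZE):
    """Approximate training steps per epoch (assumes the last partial batch is kept)."""
    return math.ceil(num_train_samples / batch_size)
```

Comparing `steps_per_epoch(len(train_samples))` with `save_step=1000` gives a rough idea of how many epochs pass between saved checkpoints.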

In [None]:
#This is from the Coqui TTS recipe.
#I am unclear on what exactly it is doing,
#or if it is still necessary. -nn
#
### force the conversion of the custom characters to a config attribute
#
config.from_dict(config.to_dict())

In [None]:
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

In [None]:
# Saves embeddings to a file {dataset_directory}/{dataset_name}_speakers.pth
#
# If you alter your dataset audio files after computing embeddings,
# please delete the embeddings file in your dataset directory to ensure that
# the embeddings are calculated accurately for the current version of your dataset
#
# Using old embeddings with altered audio files will result in poor training,
# but it may not be obvious why.
# It's easy to forget if you're fiddling with a messy dataset. -nn
#
for dataset_conf in DATASETS_CONFIG_LIST:
    # Compute the embeddings only if they have not been computed already
    print(dataset_conf.path)
    embbase=str(dataset_conf.dataset_name)
    #embeddings_file = MODEL_DIR+"speakers.pth"
    embeddings_file = os.path.join(dataset_conf.path, embbase+"_speakers.pth")
    print(embeddings_file)
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)
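
The warning in the comments above about stale embeddings can be partly automated by comparing modification times. This is a minimal sketch using only the standard library; the helper `embeddings_are_stale` is hypothetical, and it assumes file modification times on the mounted Drive are reliable, which may not always hold.

```python
from pathlib import Path


def embeddings_are_stale(embeddings_file, dataset_path, audio_glob="**/*.flac"):
    """Return True if the embeddings file is missing or older than any audio file."""
    embeddings = Path(embeddings_file)
    if not embeddings.is_file():
        return True
    embeddings_mtime = embeddings.stat().st_mtime
    # Any audio file modified after the embeddings were written makes them stale.
    return any(
        audio.stat().st_mtime > embeddings_mtime
        for audio in Path(dataset_path).glob(audio_glob)
    )
```

If it returns True for an existing embeddings file, delete that file and rerun the cell above so the embeddings are recomputed.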

In [None]:
train_samples, eval_samples = load_tts_samples(
    DATASETS_CONFIG_LIST,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

In [None]:
# init speaker manager for multi-speaker training
# it maps speaker-id to speaker-name in the model and data-loader
#speaker_manager = SpeakerManager()
speaker_manager = SpeakerManager(
    d_vectors_file_path=D_VECTOR_FILES,
    encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH
    )
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers
language_manager = LanguageManager(config=config)
language_manager.set_language_ids_from_config(config)
config.model_args.num_languages = language_manager.num_languages

In [None]:
ALL_SPEAKERS = []
ALL_SPEAKERS = speaker_manager.speaker_names
ALL_SENTENCES = []
TEST_SENTENCES_PH1 = "गजधर वास्तुकार थे। गांव-समाज हो या नगर-समाज - उसके नव निर्माण की, रख-रखाव की ज़िम्मेदारी गजधर निभाते थे। नगर नियोजन से लेकर छोटे से छोटे निर्माण के काम गजधर के कधों पर टिके थे।"
TEST_SENTENCES_PH2 = "वे योजना बनाते थे, कुल काम की लागत निकालते थे, काम में लगने वाली सारी सामग्री जुटाते थे और इस सबके बदले वे अपने जजमान से ऐसा कुछ नहीं मांग बैठते थे, जो वे दे न पाएं। लोग भी ऐसे थे कि उनसे जो कुछ बनता,वे गजधर को भेंट कर देते। "
TEST_SENTENCES_PH3="पसिखाई जाती थी तो कहीं यह जात से हट कर एक विशेष पांत भी जाती थी। बनाने वाले लोग कहीं एक जगह बसे मिलते थे तो कहीं -घूम कर इस काम को करते थे।"

for speakername in ALL_SPEAKERS:
    ph1 = [TEST_SENTENCES_PH1,speakername,None,"hi"]
    ph2 = [TEST_SENTENCES_PH2,speakername,None,"hi"]
    ph3 = [TEST_SENTENCES_PH3,speakername,None,"hi"]
    ALL_SENTENCES.append(ph1)
    ALL_SENTENCES.append(ph2)
    ALL_SENTENCES.append(ph3)
config.test_sentences=ALL_SENTENCES

In [None]:
tokenizer, config = TTSTokenizer.init_from_config(config)

Layer-related model arguments.

You must reinitialize the trainer after changing these settings.

In [None]:
config.model_args.reinit_text_encoder=False
config.model_args.reinit_DP=False
config.model_args.freeze_encoder=False
config.model_args.freeze_PE=False
config.model_args.freeze_DP=False
config.model_args.freeze_flow_decoder=False
config.model_args.freeze_waveform_decoder=False

In [None]:
model = Vits(config, ap, tokenizer, speaker_manager, language_manager)

 > External Speaker Encoder Loaded !!


In [None]:
#@title
print(run_type)

# Pick the trainer arguments that match the selected run type.
if run_type=="continue":
    # Resume an interrupted run from its run directory.
    CONTINUE_PATH="/content/drive/MyDrive/"+ds_path+"/traineroutput/"+run_folder
    trainer_args = TrainerArgs(continue_path=CONTINUE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="restore":
    # Start a new run from the default base model file.
    trainer_args = TrainerArgs(restore_path=MODEL_FILE, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="restore-ckpt":
    # Start a new run from the selected fine-tuned checkpoint.
    trainer_args = TrainerArgs(restore_path="/content/drive/MyDrive/"+ds_path+"/traineroutput/"+run_folder+"/"+ckpt_file, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="newmodel":
    # Train an empty VITS model from scratch.
    trainer_args = TrainerArgs()

trainer = Trainer(
    trainer_args,
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

**Run trainer**

In [None]:
trainer.fit()

In [None]:
!tts --model_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/best_model_1003928.pth \
--config_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/config.json \
--list_speaker_idxs \
--text ""

In [None]:
out_wav_file ="/content/drive/MyDrive/me-mmj.wav"

In [None]:
!tts --model_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/best_model_1003928.pth \
--config_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/config.json \
--speaker_idx VCTK_me \
--text "I am the very model of a modern Major-General,\
 I've information vegetable, animal, and mineral, \
 I know the kings of England, and I quote the fights historical \
 From Marathon to Waterloo, in order categorical; \
 I'm very well acquainted, too, with matters mathematical, \
 I understand equations, both the simple and quadratical, \
  About binomial theorem I'm teeming with a lot o' news, \
  With many cheerful facts about the square of the hypotenuse." \
  --out_path $out_wav_file
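
The two `tts` commands above hard-code the run directory. As a sketch, the same arguments can be assembled in Python so the paths come from variables. It uses only the flags that appear in the commands above; the helper `build_tts_command` and its parameter names are hypothetical.

```python
def build_tts_command(run_dir, model_file, text, out_path, speaker_idx=None):
    """Build the argument list for the tts CLI from a run directory and model file name."""
    command = [
        "tts",
        "--model_path", f"{run_dir}/{model_file}",
        "--config_path", f"{run_dir}/config.json",
        "--text", text,
        "--out_path", out_path,
    ]
    # Multi-speaker models also need a speaker name.
    if speaker_idx is not None:
        command += ["--speaker_idx", speaker_idx]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with long or multi-line text.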

In [None]:
from IPython.display import Audio
from IPython.display import display
wn = Audio(out_wav_file, autoplay=False)
display(wn)