<a href="https://colab.research.google.com/github/Conv-AI/TTS-Dev/blob/main/Finetuning_HiFiGAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd /content/drive/MyDrive/TTS-Finetuning-ConvAI

/content/drive/MyDrive/TTS-Finetuning-ConvAI


In [3]:
BRANCH = 'r1.8.2'
!apt-get install sox libsndfile1 ffmpeg
!pip install wget unidecode pynini==2.1.4
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
!pip install hydra-core==1.1

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1 is already the newest version (1.0.28-4ubuntu0.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 45 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://a

In [None]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/hifigan_finetune.py


--2022-06-04 18:15:11--  https://raw.githubusercontent.com/nvidia/NeMo/r1.8.2/examples/tts/hifigan_finetune.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1192 (1.2K) [text/plain]
Saving to: ‘hifigan_finetune.py.2’


2022-06-04 18:15:11 (26.5 MB/s) - ‘hifigan_finetune.py.2’ saved [1192/1192]



Download the config files from the hifigan folder of the NeMo repository: https://github.com/NVIDIA/NeMo/tree/main/examples/tts/conf/hifigan  

The folder was downloaded as a zip archive using DownGit: https://minhaskamal.github.io/DownGit/#/home

In [None]:
!cd conf && unzip hifigan.zip && cd .. 

Archive:  hifigan.zip
replace hifigan/model/validation_ds/val_ds.yaml? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
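
The `unzip` prompt above waits for keyboard input, which stalls a notebook cell when the archive has already been extracted once. A minimal sketch of a non-interactive alternative using Python's standard `zipfile` module follows. The helper name `extract_overwriting` is hypothetical, and the path `conf/hifigan.zip` is assumed from the cell above.

```python
import zipfile
from pathlib import Path


def extract_overwriting(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir without prompting.

    Existing files with the same name are overwritten.
    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Assumed location of the archive in this notebook; skipped if it is absent.
if Path("conf/hifigan.zip").exists():
    for name in extract_overwriting("conf/hifigan.zip", "conf"):
        print(name)
```

Overwriting silently is a deliberate choice here: the archive is treated as the source of truth for the config files, so any local edits under `conf/hifigan` would be lost on re-extraction.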

In [None]:
!curl -LO https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    93    0    93    0     0     65      0 --:--:--  0:00:01 --:--:--    65
  0  300M    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0^C


In [4]:
import numpy as np
import os
import json

import torch
import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt
from pathlib import Path

In [5]:
def infer(spec_gen_model, vocoder_model, str_input, speaker = None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Arguments:
    spec_gen_model -- Instance of FastPitch model
    vocoder_model -- Instance of a vocoder model (HiFiGAN in our case)
    str_input -- Text input for the synthesis
    speaker -- Speaker number (in the case of a multi-speaker model -- in the mixing case)
    
    Returns:
    spectrogram, waveform of the synthesized audio.
    """
    parser_model = spec_gen_model
    with torch.no_grad():
        parsed = parser_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().cuda()
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker = speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

def get_best_ckpt(experiment_base_dir, new_speaker_id, duration_mins, mixing_enabled, original_speaker_id):
    """
    Returns the model checkpoint paths of an experiment we ran.
    
    Arguments:
    experiment_base_dir -- Base experiment directory (specified on top of this notebook as exp_base_dir)
    new_speaker_id -- Speaker id of new HiFiTTS speaker we finetuned FastPitch on
    duration_mins -- total minutes of the new speaker data
    mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not
    original_speaker_id -- speaker id of the original HiFiTTS speaker
    
    Returns:
    List of (validation loss, checkpoint path) tuples sorted by validation loss, and the path of the last checkpoint (None if no "last" checkpoint exists)
    """
    if not mixing_enabled:
        exp_dir = "{}/{}_to_{}_no_mixing_{}_mins".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)
    else:
        exp_dir = "{}/{}_to_{}_mixing_{}_mins".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)
    
    ckpt_candidates = []
    last_ckpt = None
    for root, dirs, files in os.walk(exp_dir):
        for file in files:
            if file.endswith(".ckpt"):
                val_error = float(file.split("v_loss=")[1].split("-epoch")[0])
                if "last" in file:
                    last_ckpt = os.path.join(root, file)
                ckpt_candidates.append( (val_error, os.path.join(root, file)))
    ckpt_candidates.sort()
    
    return ckpt_candidates, last_ckpt
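
`get_best_ckpt` assumes that every `.ckpt` file name contains `v_loss=`; a file without it would raise an `IndexError` in the `split` call. A minimal sketch of a more tolerant parser is shown below. The helper `parse_val_loss` is hypothetical (it is not part of NeMo), and the file name pattern is taken from the checkpoint printed later in this notebook.

```python
import re

# Matches the validation loss embedded in a checkpoint file name,
# e.g. "FastPitch--v_loss=1.7047-epoch=199-last.ckpt".
_V_LOSS = re.compile(r"v_loss=(\d+(?:\.\d+)?)")


def parse_val_loss(filename):
    """Return the validation loss as a float, or None if the name has none."""
    match = _V_LOSS.search(filename)
    return float(match.group(1)) if match else None


print(parse_val_loss("FastPitch--v_loss=1.7047-epoch=199-last.ckpt"))  # 1.7047
print(parse_val_loss("weights.ckpt"))  # None
```

Files for which the helper returns `None` could simply be skipped when building `ckpt_candidates`.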

In [6]:
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.models import FastPitchModel

[NeMo W 2022-06-05 07:30:05 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

[NeMo W 2022-06-05 07:30:06 experimental:28] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.


In [None]:
!cat ./6097_5_mins/manifest.json | tail -n 2 > ./6097_manifest_dev_ns_all_local.json
!cat ./6097_5_mins/manifest.json | head -n -2 > ./6097_manifest_train_dur_5_mins_local.json
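
The two shell commands above put the last two manifest lines into the validation file and everything else into the training file. The same split can be sketched in Python, which avoids relying on GNU `head -n -2`. The helper `split_manifest` is hypothetical; unlike the shell version, it also drops blank lines.

```python
from pathlib import Path


def split_manifest(lines, n_val=2):
    """Return (train_lines, val_lines); the last n_val lines go to validation."""
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    if n_val <= 0:
        return lines, []
    return lines[:-n_val], lines[-n_val:]


# File names taken from the cell above; skipped if the manifest is absent.
src = Path("./6097_5_mins/manifest.json")
if src.exists():
    with open(src) as f:
        train, val = split_manifest(f.readlines())
    Path("./6097_manifest_train_dur_5_mins_local.json").write_text("".join(train))
    Path("./6097_manifest_dev_ns_all_local.json").write_text("".join(val))
```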

In [7]:
new_speaker_id = 6097
duration_mins = 5
mixing = False
original_speaker_id = "ljspeech"

_, last_ckpt = get_best_ckpt("./", new_speaker_id, duration_mins, mixing, original_speaker_id)
print(last_ckpt)

.//ljspeech_to_6097_no_mixing_5_mins/FastPitch/2022-06-03_19-22-49/checkpoints/FastPitch--v_loss=1.7047-epoch=199-last.ckpt


In [8]:
!ln -s ./6097_5_mins/audio audio

ln: failed to create symbolic link 'audio/audio': File exists


In [10]:
!ls -l audio

lrw------- 1 root root 19 Jun  3 19:13 audio -> ./6097_5_mins/audio


In [11]:
# Get records from the training manifest
manifest_train_path = "./6097_manifest_train_dur_5_mins_local.json"
manifest_val_path = "./6097_manifest_dev_ns_all_local.json"

records_train = []
records_val = []

with open(manifest_train_path, "r") as f:
    for i, line in enumerate(f):
        records_train.append(json.loads(line))

with open(manifest_val_path, "r") as f:
    for i, line in enumerate(f):
        records_val.append(json.loads(line))
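
Once the manifests are loaded, it can be useful to confirm that the amount of audio roughly matches `duration_mins`. The sketch below sums the `duration` field of each record, which NeMo manifests store in seconds. The helper `total_minutes` is hypothetical.

```python
def total_minutes(records):
    """Sum the 'duration' field (in seconds) of manifest records, in minutes."""
    return sum(float(r["duration"]) for r in records) / 60.0


# Three durations copied from the manifest entries shown in this notebook.
sample = [{"duration": 2.6}, {"duration": 2.68}, {"duration": 1.88}]
print(f"{total_minutes(sample):.2f} minutes")
```

In the notebook this would be called as `total_minutes(records_train)` and `total_minutes(records_val)`.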

In [12]:
_, last_ckpt = get_best_ckpt("./", new_speaker_id, duration_mins, mixing, original_speaker_id)
print(last_ckpt)

spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)
spec_model.eval().cuda()
parser_model = spec_model

.//ljspeech_to_6097_no_mixing_5_mins/FastPitch/2022-06-03_19-22-49/checkpoints/FastPitch--v_loss=1.7047-epoch=199-last.ckpt
[NeMo I 2022-06-05 07:34:47 tokenize_and_classify:88] Creating ClassifyFst grammars.


[NeMo W 2022-06-05 07:34:53 g2ps:85] apply_to_oov_word=None, it means that some of words will remain unchanged if they are not handled by one of rule in self.parse_one_word(). It is useful when you use tokenizer with set of phonemes and chars together, otherwise it can be not.
[NeMo W 2022-06-05 07:34:54 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: ./6097_manifest_train_dur_5_mins_local.json
      sample_rate: 22050
      sup_data_path: ./fastpitch_sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: 8000
      max_duration: null
      min_duration: 0.1
      ignore_file: null
      trim:

[NeMo I 2022-06-05 07:34:54 features:259] PADDING: 1
[NeMo I 2022-06-05 07:34:54 features:276] STFT using torch


In [13]:
# Generate a spectrogram for each item
for i, r in enumerate(records_train):
  with torch.no_grad():
      parsed = parser_model.parse(r['text'])
      spectrogram = spec_model.generate_spectrogram(tokens=parsed)
      if isinstance(spectrogram, torch.Tensor):
          spectrogram = spectrogram.to('cpu').numpy()
      if len(spectrogram.shape) == 3:
          spectrogram = spectrogram[0]
      np.save(f"mel_{i}", spectrogram)
      r["mel_filepath"] = f"mel_{i}.npy"

# Save to a new json
with open("hifigan_train_ft.json", "w") as f:
  for r in records_train:
    f.write(json.dumps(r) + '\n')


In [14]:
!cat hifigan_train_ft.json

{"audio_filepath": "audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,", "mel_filepath": "mel_0.npy"}
{"audio_filepath": "audio/roots_19_morris_0269.wav", "text": "they were men having no country to go back to", "duration": 2.68, "text_no_preprocessing": "they were men having no country to go back to,", "text_normalized": "they were men having no country to go back to,", "mel_filepath": "mel_1.npy"}
{"audio_filepath": "audio/swag_06_tompkins_0883.wav", "text": "no mistake can well be made", "duration": 1.88, "text_no_preprocessing": "no mistake can well be made.", "text_normalized": "no mistake can well be made.", "mel_filepath": "mel_2.npy"}
{"audio_filepath": "audio/glitteringplain_15_morris_0108.wav", "text": "if thou needs must depart", "duration": 1.88, "text_no_preprocessing":

In [15]:
# Generate a spectrogram for each validation item
for i, r in enumerate(records_val):
  with torch.no_grad():
      parsed = parser_model.parse(r['text'])
      spectrogram = spec_model.generate_spectrogram(tokens=parsed)
      if isinstance(spectrogram, torch.Tensor):
          spectrogram = spectrogram.to('cpu').numpy()
      if len(spectrogram.shape) == 3:
          spectrogram = spectrogram[0]
      # Use a separate prefix so the training mels (mel_{i}.npy) are not overwritten
      np.save(f"mel_val_{i}", spectrogram)
      r["mel_filepath"] = f"mel_val_{i}.npy"

# Save the validation records (not the training records) to a new json
with open("hifigan_val_ft.json", "w") as f:
  for r in records_val:
    f.write(json.dumps(r) + '\n')
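
Both loops write `.npy` files into the working directory, so it is worth confirming that no `mel_filepath` is shared between the training and validation records; a shared path would mean one spectrogram has overwritten another. A minimal sketch of such a check, with a hypothetical helper `shared_mel_paths`:

```python
def shared_mel_paths(train_records, val_records):
    """Return the sorted mel_filepath values present in both record lists."""
    train = {r["mel_filepath"] for r in train_records}
    val = {r["mel_filepath"] for r in val_records}
    return sorted(train & val)


# Illustrative records only; in the notebook, pass records_train and records_val.
train_demo = [{"mel_filepath": "mel_0.npy"}, {"mel_filepath": "mel_1.npy"}]
val_demo = [{"mel_filepath": "mel_0.npy"}]
print(shared_mel_paths(train_demo, val_demo))  # ['mel_0.npy']
```

An empty list means the two sets of spectrogram files are disjoint.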


In [None]:
!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz

--2022-06-04 18:41:38--  https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz
Resolving nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)... 3.5.132.11
Connecting to nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)|3.5.132.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11002569 (10M) [application/x-gzip]
Saving to: ‘6097_5_mins.tar.gz.2’


2022-06-04 18:41:40 (4.67 MB/s) - ‘6097_5_mins.tar.gz.2’ saved [11002569/11002569]



In [17]:
!(python hifigan_finetune.py \
--config-name=hifigan.yaml \
model.train_ds.dataloader_params.batch_size=32 \
model.max_steps=1000 \
model.optim.lr=0.0001 \
~model.optim.sched \
train_dataset=./hifigan_train_ft.json \
validation_datasets=./hifigan_val_ft.json \
exp_manager.exp_dir=hifigan_ft \
+init_from_nemo_model=tts_hifigan.nemo \
trainer.check_val_every_n_epoch=10 \
model/train_ds=train_ds_finetune \
model/validation_ds=val_ds_finetune)

[NeMo W 2022-06-05 07:40:54 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

[NeMo W 2022-06-05 07:40:54 experimental:28] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-06-05 07:40:56 exp_manager:281] Experiments will be logged at hifigan_ft/HifiGan/2022-06-05_07-40-56
[NeMo I 2022-06-05 07:40:56 exp_manag