# FastPitch Adapter Finetuning

This notebook walks through the FastPitch adapter fine-tuning pipeline. It contains the following sections:
1. **Transform pre-trained FastPitch checkpoint to adapter-compatible checkpoint**
2. **Fine-tune FastPitch on adaptation data**: fine-tune pre-trained multi-speaker FastPitch for a new speaker
* Dataset Preparation: download the dataset and extract the manifest files (the audio should total more than 15 minutes).
* Preprocessing: add absolute audio paths in manifest, calculate pitch stats.
* Training: fine-tune frozen multispeaker FastPitch with trainable adapters.
3. **Fine-tune HiFiGAN on adaptation data**: fine-tune a vocoder for the fine-tuned multi-speaker FastPitch
* Dataset Preparation: extract mel-spectrograms from fine-tuned FastPitch.
* Training: fine-tune HiFiGAN with fine-tuned adaptation data.
4. **Inference**: generate speech from adapted FastPitch
* Load Model: load pre-trained multi-speaker FastPitch with **fine-tuned adapters**.
* Output Audio: generate audio files.

# License

> Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> 
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
BRANCH = 'main'
# # If you're using Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget unidecode pynini==2.1.4 scipy==1.7.3
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:
!wandb login #PASTE_WANDB_APIKEY_HERE

In [None]:
YOUR_PRETRAINED_FASTPITCH_CHECKPOINT = ""
YOUR_FINETUNED_HIFIGAN_ON_MULTISPEAKER_CHECKPOINT = ""

In [None]:
sample_rate = 44100
# Store all Python scripts
codedir = 'NeMoTTS' 
# Store all manifest and audios
datadir = 'NeMoTTS_dataset'
# Store all related text-normalized files
normdir = 'NeMoTTS_normalize_files'
# Store all supplementary files
suppdir = "NeMoTTS_sup_data"
# Store all config files
confdir = "NeMoTTS_conf"
# Store all training logs
logsdir = "NeMoTTS_logs"
# Store all mel-spectrograms for vocoder training
melsdir = "NeMoTTS_mels"

In [None]:
import os
import json
import shutil
import nemo
import torch
import numpy as np

from pathlib import Path
from tqdm import tqdm

# 1. Transform pre-trained checkpoint to adapter-compatible checkpoint
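
The cells below rewrite the `_target_` of each FastPitch submodule (`input_fft`, `output_fft`, `pitch_predictor`, `duration_predictor`, `alignment_module`) to its adapter-capable counterpart whenever one is registered, then save the checkpoint back in place. As a rough illustration of that lookup-and-swap pattern, here is a minimal stand-alone sketch. The registry dict, the `pkg.modules.*` class paths, and the `swap_targets` helper are hypothetical stand-ins, not NeMo's API.

```python
# Hypothetical registry: base class path -> adapter-capable class path.
# In the notebook this lookup is done by adapter_mixins.get_registered_adapter.
ADAPTER_REGISTRY = {
    "pkg.modules.Encoder": "pkg.modules.EncoderAdapter",
    "pkg.modules.Decoder": "pkg.modules.DecoderAdapter",
}


def swap_targets(config: dict, module_names: list) -> dict:
    """Return a copy of `config` with registered `_target_` paths replaced."""
    updated = {name: dict(section) for name, section in config.items()}
    for name in module_names:
        adapter_path = ADAPTER_REGISTRY.get(updated[name]["_target_"])
        if adapter_path is not None:  # unregistered modules are left untouched
            updated[name]["_target_"] = adapter_path
    return updated


# Toy config; the class paths and values are made up for illustration.
config = {
    "input_fft": {"_target_": "pkg.modules.Encoder", "n_layers": 6},
    "output_fft": {"_target_": "pkg.modules.Decoder", "n_layers": 6},
    "aligner": {"_target_": "pkg.modules.Aligner"},
}
new_config = swap_targets(config, ["input_fft", "output_fft", "aligner"])
print(new_config["input_fft"]["_target_"])
print(new_config["aligner"]["_target_"])
```

Only the class path changes; all other hyperparameters in each section are carried over, which is why the pre-trained weights can still be loaded afterwards.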

In [None]:
from nemo.core import adapter_mixins
from omegaconf import DictConfig, OmegaConf, open_dict

In [None]:
def update_model_config_to_support_adapter(config) -> DictConfig:
    with open_dict(config):
        enc_adapter_metadata = adapter_mixins.get_registered_adapter(config.input_fft._target_)
        if enc_adapter_metadata is not None:
            config.input_fft._target_ = enc_adapter_metadata.adapter_class_path

        dec_adapter_metadata = adapter_mixins.get_registered_adapter(config.output_fft._target_)
        if dec_adapter_metadata is not None:
            config.output_fft._target_ = dec_adapter_metadata.adapter_class_path

        pitch_predictor_adapter_metadata = adapter_mixins.get_registered_adapter(config.pitch_predictor._target_)
        if pitch_predictor_adapter_metadata is not None:
            config.pitch_predictor._target_ = pitch_predictor_adapter_metadata.adapter_class_path

        duration_predictor_adapter_metadata = adapter_mixins.get_registered_adapter(config.duration_predictor._target_)
        if duration_predictor_adapter_metadata is not None:
            config.duration_predictor._target_ = duration_predictor_adapter_metadata.adapter_class_path

        aligner_adapter_metadata = adapter_mixins.get_registered_adapter(config.alignment_module._target_)
        if aligner_adapter_metadata is not None:
            config.alignment_module._target_ = aligner_adapter_metadata.adapter_class_path

    return config

In [None]:
state = torch.load(YOUR_PRETRAINED_FASTPITCH_CHECKPOINT)
state['hyper_parameters']['cfg'] = update_model_config_to_support_adapter(state['hyper_parameters']['cfg'])
torch.save(state, YOUR_PRETRAINED_FASTPITCH_CHECKPOINT)

In [None]:
shutil.copyfile(YOUR_PRETRAINED_FASTPITCH_CHECKPOINT, "FastPitch.pt")
shutil.copyfile(YOUR_FINETUNED_HIFIGAN_ON_MULTISPEAKER_CHECKPOINT, "HifiGan.pt")
YOUR_PRETRAINED_FASTPITCH_CHECKPOINT = "FastPitch.pt"
YOUR_FINETUNED_HIFIGAN_ON_MULTISPEAKER_CHECKPOINT = "HifiGan.pt"

# 2. Fine-tune FastPitch on adaptation data

## a. Data Preparation
For this tutorial, we use a small part of the VCTK dataset with a new target speaker (p267). In practice, the audio should total more than 15 minutes.
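
To check the 15-minute guideline on your own data, the total duration can be summed from the manifest. The sketch below assumes each manifest line is a JSON object with a `duration` field in seconds; that field name is an assumption about the manifest layout, and `total_duration_minutes` is a hypothetical helper.

```python
import json
import os
import tempfile


def total_duration_minutes(manifest_path: str) -> float:
    """Sum the `duration` field (seconds) over a JSON-lines manifest."""
    total_seconds = 0.0
    with open(manifest_path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 60.0


# Toy manifest for demonstration only; the values are made up.
entries = [
    {"audio_filepath": "a.wav", "text": "hello", "duration": 400.0},
    {"audio_filepath": "b.wav", "text": "world", "duration": 560.0},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    minutes = total_duration_minutes(path)

print(f"{minutes:.1f} min, enough data: {minutes >= 15}")
```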

In [None]:
!mkdir -p {datadir} && cd {datadir} && wget https://vctk-subset.s3.amazonaws.com/vctk_subset.tar.gz && tar zxf vctk_subset.tar.gz

In [None]:
manidir = f"{datadir}/vctk_subset"
!ls {manidir}

For simplicity, we use the original dev set as the training set and the original test set as the validation set.

In [None]:
train_manifest = os.path.abspath(os.path.join(manidir, 'train.json'))
valid_manifest = os.path.abspath(os.path.join(manidir, 'dev.json'))

## b. Preprocessing

In [None]:
# additional files
!mkdir -p {normdir} && cd {normdir} \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.10 \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-052722

### Add absolute file path in manifest

In [None]:
def json_reader(filename):
    """Read a JSON-lines manifest into a list of dicts."""
    with open(filename) as f:
        return [json.loads(line) for line in f]

def json_writer(manifest, filename):
    """Write a list of dicts as a JSON-lines manifest, one entry per line."""
    with open(filename, 'w') as fout:
        for m in manifest:
            fout.write(json.dumps(m) + '\n')

In [None]:
train_datas = json_reader(train_manifest)
for m in train_datas: m['audio_filepath'] = os.path.abspath(os.path.join(manidir, m['audio_filepath']))
json_writer(train_datas, train_manifest)

In [None]:
valid_datas = json_reader(valid_manifest)
for m in valid_datas: m['audio_filepath'] = os.path.abspath(os.path.join(manidir, m['audio_filepath']))
json_writer(valid_datas, valid_manifest)

### Calibrate speaker id to start from 0
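
Because the adaptation set holds a single target speaker, the cell below sets every `speaker` to 0 and keeps the original value in `old_speaker`. If a manifest held several speakers, a contiguous 0-based mapping could be built along the following lines. This is a sketch only; `remap_speakers` is a hypothetical helper that the notebook does not use.

```python
def remap_speakers(records):
    """Assign contiguous 0-based ids, ordered by the original speaker ids."""
    original_ids = sorted({r["speaker"] for r in records})
    mapping = {old: new for new, old in enumerate(original_ids)}
    remapped = [
        {**r, "old_speaker": r["speaker"], "speaker": mapping[r["speaker"]]}
        for r in records
    ]
    return remapped, mapping


# Toy records; the ids are made up for illustration.
records = [
    {"audio_filepath": "a.wav", "speaker": 267},
    {"audio_filepath": "b.wav", "speaker": 225},
    {"audio_filepath": "c.wav", "speaker": 267},
]
remapped, mapping = remap_speakers(records)
print(mapping)
print([r["speaker"] for r in remapped])
```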

In [None]:
train_datas = json_reader(train_manifest)
for m in train_datas: m['old_speaker'], m['speaker'] = m['speaker'], 0
json_writer(train_datas, train_manifest)

valid_datas = json_reader(valid_manifest)
for m in valid_datas: m['old_speaker'], m['speaker'] = m['speaker'], 0
json_writer(valid_datas, valid_manifest)

### Calculate Pitch Stats
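
The cells below extract a pitch contour per utterance with `librosa.pyin` (unvoiced frames are filled with 0.0), concatenate the training contours, and store their mean and standard deviation. The sketch below reproduces that statistic in plain Python and shows one way such values could be used for z-normalization. Keeping unvoiced frames at zero during normalization is an assumption made for illustration; how the training code consumes `pitch_mean` and `pitch_std` is not shown in this notebook.

```python
import math


def pitch_stats(contours):
    """Mean and population standard deviation over all frames of all contours."""
    frames = [value for contour in contours for value in contour]
    mean = sum(frames) / len(frames)
    variance = sum((value - mean) ** 2 for value in frames) / len(frames)
    return mean, math.sqrt(variance)


def normalize(contour, mean, std):
    """Z-normalize voiced frames; unvoiced frames (0.0) stay at zero (assumption)."""
    return [0.0 if value == 0.0 else (value - mean) / std for value in contour]


# Toy contours in Hz; 0.0 marks an unvoiced frame.
contours = [[100.0, 0.0], [200.0, 100.0]]
mean, std = pitch_stats(contours)
normalized = normalize([100.0, 0.0, 200.0], mean, std)
print(mean, round(std, 3))
print([round(value, 3) for value in normalized])
```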

In [None]:
import librosa
from nemo.collections.asr.parts.preprocessing.features import WaveformFeaturizer
from nemo.collections.tts.parts.utils.tts_dataset_utils import get_base_dir

In [None]:
def get_pitch(sample):    
    rel_audio_path = Path(sample["audio_filepath"]).relative_to(base_data_dir).with_suffix("")
    rel_audio_path_as_text_id = str(rel_audio_path).replace("/", "_")
    pitch_filepath = os.path.join(pitch_dir, f"{rel_audio_path_as_text_id}.pt")
    
    if os.path.exists(pitch_filepath):
        pitch = torch.load(pitch_filepath).numpy()

    else:
        features = wave_model.process(
            sample["audio_filepath"]
        )
        voiced_tuple = librosa.pyin(
            features.numpy(),
            fmin=librosa.note_to_hz('C2'),
            fmax=librosa.note_to_hz('C7'),
            frame_length=2048,
            sr=sample_rate,
            fill_na=0.0,
        )
        pitch = voiced_tuple[0]
        torch.save(torch.from_numpy(pitch).float(), pitch_filepath)
    
    return pitch

In [None]:
wave_model = WaveformFeaturizer(sample_rate=sample_rate)
pitch_dir = os.path.join(suppdir, 'pitch')
os.makedirs(suppdir, exist_ok=True)
os.makedirs(pitch_dir, exist_ok=True)

train_pitchs = []
train_datas = json_reader(train_manifest)
base_data_dir = get_base_dir([item["audio_filepath"] for item in train_datas])
for m in tqdm(train_datas): train_pitchs.append(get_pitch(m))
    
valid_datas = json_reader(valid_manifest)
base_data_dir = get_base_dir([item["audio_filepath"] for item in valid_datas])
for m in tqdm(valid_datas): get_pitch(m)

train_pitchs = np.concatenate(train_pitchs)
pitch_mean = float(np.mean(train_pitchs))
pitch_std = float(np.std(train_pitchs))

with open(os.path.join(manidir, 'pitch_stats.json'), 'w') as f:
    json.dump({'pitch':[pitch_mean, pitch_std]}, f)

## c. Training

In [None]:
!mkdir -p {confdir} && cd {confdir} && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/fastpitch_align_44100_adapter.yaml

In [None]:
!mkdir -p {codedir} && cd {codedir} && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/fastpitch_finetune_adapters.py

### Important notes
* **+init_from_ptl_ckpt**: initialize with a multi-speaker FastPitch checkpoint
* **~model.speaker_encoder.lookup_module**: remove the pre-trained looked-up speaker embedding
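
The `+` and `~` prefixes come from Hydra's override syntax: `+key=value` adds a key that is not in the config file, `~key` deletes a key, and a plain `key=value` changes an existing one. The sketch below mimics these three cases on a nested dict. It is a deliberately simplified stand-in: `apply_override` is a hypothetical helper, values are kept as strings, and the toy config contents are made up.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one simplified Hydra-style override to a nested dict in place."""
    if override.startswith("~"):
        # "~a.b.c" deletes key c under a.b
        *parents, leaf = override[1:].split(".")
        node = config
        for key in parents:
            node = node[key]
        del node[leaf]
        return

    add_new = override.startswith("+")
    dotted, value = override.lstrip("+").split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {}) if add_new else node[key]
    if not add_new and leaf not in node:
        raise KeyError(f"{dotted} does not exist; prefix with '+' to add it")
    node[leaf] = value


# Toy config for illustration only.
config = {
    "model": {
        "speaker_encoder": {"lookup_module": {"n_speakers": "5"}, "gst_module": {}},
        "optim": {"lr": "1e-3", "sched": {"name": "toy"}},
    }
}
for override in [
    "+init_from_ptl_ckpt=FastPitch.pt",
    "~model.speaker_encoder.lookup_module",
    "model.optim.lr=2e-4",
    "~model.optim.sched",
]:
    apply_override(config, override)
print(config)
```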

In [None]:
# A full run normally takes 100 epochs (about 15 mins); this cell sets trainer.max_epochs=10.
!(python {codedir}/fastpitch_finetune_adapters.py \
--config-path={os.path.abspath(confdir)} \
--config-name=fastpitch_align_44100_adapter.yaml \
+init_from_ptl_ckpt={YOUR_PRETRAINED_FASTPITCH_CHECKPOINT} \
sample_rate=44100 \
train_dataset={train_manifest} \
validation_datasets={valid_manifest} \
sup_data_types="['align_prior_matrix', 'pitch', 'speaker_id', 'reference_audio']" \
sup_data_path={suppdir} \
pitch_mean={pitch_mean} \
pitch_std={pitch_std} \
phoneme_dict_path={normdir}/cmudict-0.7b_nv22.10 \
heteronyms_path={normdir}/heteronyms-052722 \
~model.speaker_encoder.lookup_module \
model.speaker_encoder.gst_module._target_="nemo.collections.tts.modules.submodules.GlobalStyleToken" \
model.input_fft.condition_types="['add', 'layernorm']" \
model.output_fft.condition_types="['add', 'layernorm']" \
model.duration_predictor.condition_types="['add', 'layernorm']" \
model.pitch_predictor.condition_types="['add', 'layernorm']" \
model.alignment_module.condition_types="['add']" \
model.train_ds.dataloader_params.batch_size=8 \
model.validation_ds.dataloader_params.batch_size=8 \
model.train_ds.dataloader_params.num_workers=8 \
model.validation_ds.dataloader_params.num_workers=8 \
+model.text_tokenizer.add_blank_at=True \
model.optim.name=adam \
model.optim.lr=2e-4 \
model.optim.weight_decay=0.0 \
~model.optim.sched \
exp_manager.exp_dir={logsdir} \
+exp_manager.create_wandb_logger=True \
+exp_manager.wandb_logger_kwargs.name="tutorial-FastPitch-finetune-adaptation" \
+exp_manager.wandb_logger_kwargs.project="NeMo" \
+exp_manager.checkpoint_callback_params.save_top_k=-1 \
trainer.max_epochs=10 \
trainer.check_val_every_n_epoch=10 \
trainer.log_every_n_steps=1 \
trainer.devices=1 \
trainer.strategy=ddp \
trainer.precision=32 \
)

In [None]:
# e.g. NeMoTTS_logs/FastPitch/Y-M-D_H-M-S
last_checkpoint_dir = sorted(list([i for i in (Path(logsdir) / "FastPitch").iterdir() if i.is_dir()]))[-1] / "checkpoints"
YOUR_FINETUNED_ADAPTER_CHECKPOINT = list(last_checkpoint_dir.glob('adapters.pt'))[0]
YOUR_FINETUNED_ADAPTER_CHECKPOINT
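
The lookup above relies on run directories being named by timestamp (`Y-M-D_H-M-S`, as in the comment), so that sorting the names alphabetically also sorts them chronologically. A minimal sketch of that idea on a temporary directory follows; `latest_run_dir` is a hypothetical helper and the directory names are made up.

```python
import tempfile
from pathlib import Path


def latest_run_dir(exp_dir: Path) -> Path:
    """Return the run directory whose timestamp-style name sorts last."""
    runs = sorted(p for p in exp_dir.iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {exp_dir}")
    return runs[-1]


with tempfile.TemporaryDirectory() as tmp:
    exp_dir = Path(tmp) / "FastPitch"
    for name in ["2023-05-01_09-30-00", "2023-05-02_08-00-00", "2023-04-30_23-59-59"]:
        (exp_dir / name / "checkpoints").mkdir(parents=True)
    latest_name = latest_run_dir(exp_dir).name

print(latest_name)
```

Raising an explicit error for an empty experiment directory gives a clearer message than the `IndexError` a bare `[-1]` would produce when training did not run.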

# 3. Fine-tune HiFiGAN on adaptation data

## a. Dataset Preparation
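
When generating mel-spectrograms, `gen_spectrogram` in the cells below conditions FastPitch on a reference utterance from the same speaker. It excludes the current utterance from the candidates unless that utterance is the speaker's only one. The selection logic on its own can be sketched as follows; `pick_reference` is a hypothetical helper and the records are toy data.

```python
import random
from collections import defaultdict


def pick_reference(index, records, rng):
    """Pick the index of a same-speaker record, avoiding `index` when possible."""
    by_speaker = defaultdict(set)
    for i, record in enumerate(records):
        by_speaker[record["speaker"]].add(i)
    pool = by_speaker[records[index]["speaker"]]
    if len(pool) > 1:
        pool = pool - {index}  # do not use the utterance as its own reference
    # Sort before sampling: sets have no stable order and cannot be sampled directly.
    return rng.choice(sorted(pool))


records = [
    {"audio_filepath": "a.wav", "speaker": 0},
    {"audio_filepath": "b.wav", "speaker": 0},
    {"audio_filepath": "c.wav", "speaker": 0},
    {"audio_filepath": "d.wav", "speaker": 1},
]
rng = random.Random(0)
ref_for_first = pick_reference(0, records, rng)
ref_for_last = pick_reference(3, records, rng)
print(ref_for_first, ref_for_last)
```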

In [None]:
from nemo.collections.tts.parts.utils.tts_dataset_utils import BetaBinomialInterpolator
from nemo.collections.tts.models import FastPitchModel
from collections import defaultdict
import random

In [None]:
def gen_spectrogram(index, manifest, speaker_to_index, base_data_dir):
    
    record = manifest[index]
    audio_file = record["audio_filepath"]
    
    # Name the mel-spectrogram file after the audio file (e.g. .wav or .flac -> .npy).
    save_path = os.path.abspath(os.path.join(melsdir, Path(audio_file).with_suffix(".npy").name))
    
    if os.path.exists(save_path):
        return save_path
    
    if "normalized_text" in record:
        text = spec_model.parse(record["normalized_text"], normalize=False)
    else:
        text = spec_model.parse(record['text'])
        
    text_len = torch.tensor(text.shape[-1], dtype=torch.long, device=spec_model.device).unsqueeze(0)
    
    audio = wave_model.process(audio_file).unsqueeze(0).to(device=spec_model.device)
    audio_len = torch.tensor(audio.shape[1]).long().unsqueeze(0).to(device=spec_model.device)
    spect, spect_len = spec_model.preprocessor(input_signal=audio, length=audio_len) 
    
    attn_prior = torch.from_numpy(beta_binomial_interpolator(spect_len.item(), text_len.item())).unsqueeze(0).to(spec_model.device)
        
    reference_pool = speaker_to_index[record["speaker"]] - set([index]) if len(speaker_to_index[record["speaker"]]) > 1 else speaker_to_index[record["speaker"]]
    # random.sample() no longer accepts a set on Python 3.11+, so choose from a sorted list.
    reference_sample = manifest[random.choice(sorted(reference_pool))]
    reference_audio = wave_model.process(reference_sample["audio_filepath"]).unsqueeze(0).to(device=spec_model.device)
    reference_audio_length = torch.tensor(reference_audio.shape[1]).long().unsqueeze(0).to(device=spec_model.device)
    reference_spec, reference_spec_len = spec_model.preprocessor(input_signal=reference_audio, length=reference_audio_length)  
    
        
    with torch.no_grad():
        spectrogram = spec_model.forward(
          text=text, 
          input_lens=text_len,
          spec=spect, 
          mel_lens=spect_len, 
          attn_prior=attn_prior,
          reference_spec=reference_spec,
          reference_spec_lens=reference_spec_len,
        )[0]
    
    spec = spectrogram[0].to('cpu').numpy()
    np.save(save_path, spec)
    return save_path

In [None]:
# Pretrained FastPitch Weights
spec_model = FastPitchModel.load_from_checkpoint(YOUR_PRETRAINED_FASTPITCH_CHECKPOINT)

# Load Adapter Weights
spec_model.load_adapters(YOUR_FINETUNED_ADAPTER_CHECKPOINT)
spec_model.eval().cuda()

beta_binomial_interpolator = BetaBinomialInterpolator()

In [None]:
os.makedirs(melsdir, exist_ok=True)

# Train
train_datas = json_reader(train_manifest)
base_data_dir = get_base_dir([item["audio_filepath"] for item in train_datas])
speaker_to_index = defaultdict(list)
for i, d in enumerate(train_datas): speaker_to_index[d.get('speaker', None)].append(i)
speaker_to_index = {k: set(v) for k, v in speaker_to_index.items()}

for i, record in enumerate(tqdm(train_datas)):
    record["mel_filepath"] =  gen_spectrogram(i, train_datas, speaker_to_index, base_data_dir)

json_writer(train_datas, train_manifest)


# Valid
valid_datas = json_reader(valid_manifest)
base_data_dir = get_base_dir([item["audio_filepath"] for item in valid_datas])
speaker_to_index = defaultdict(list)
for i, d in enumerate(valid_datas): speaker_to_index[d.get('speaker', None)].append(i)
speaker_to_index = {k: set(v) for k, v in speaker_to_index.items()}

for i, record in enumerate(tqdm(valid_datas)):
    record["mel_filepath"] =  gen_spectrogram(i, valid_datas, speaker_to_index, base_data_dir)

json_writer(valid_datas, valid_manifest)

## b. Training

In [None]:
!cd {confdir} && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/hifigan/hifigan_44100.yaml
!cd {confdir} && mkdir -p model/train_ds && cd model/train_ds && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/train_ds/train_ds_finetune.yaml 
!cd {confdir} && mkdir -p model/validation_ds && cd model/validation_ds && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/validation_ds/val_ds_finetune.yaml
!cd {confdir} && mkdir -p model/generator && cd model/generator && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/generator/v1_44100.yaml
!cd {codedir} && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/hifigan_finetune.py

In [None]:
# A full run normally takes 500 epochs (about 30 mins); this cell sets trainer.max_epochs=5.
!(python {codedir}/hifigan_finetune.py \
--config-path={os.path.abspath(confdir)} \
--config-name=hifigan_44100.yaml \
train_dataset={train_manifest} \
validation_datasets={valid_manifest} \
+init_from_ptl_ckpt={YOUR_FINETUNED_HIFIGAN_ON_MULTISPEAKER_CHECKPOINT} \
model.train_ds.dataloader_params.batch_size=32 \
model.optim.lr=0.0001 \
+trainer.max_epochs=5 \
trainer.check_val_every_n_epoch=5 \
model/train_ds=train_ds_finetune \
model/validation_ds=val_ds_finetune \
trainer.devices=1 \
trainer.strategy='ddp' \
trainer.precision=16 \
exp_manager.exp_dir={logsdir} \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.name="tutorial-HiFiGAN-finetune-adaptation" \
exp_manager.wandb_logger_kwargs.project="NeMo" \
)

In [None]:
# e.g. NeMoTTS_logs/HifiGan/Y-M-D_H-M-S/checkpoints/HifiGan--val_loss=XXX-epoch=XXX.ckpt
last_checkpoint_dir = sorted(list([i for i in (Path(logsdir) / "HifiGan").iterdir() if i.is_dir()]))[-1] / "checkpoints"
YOUR_FINETUNED_HIFIGAN_ON_ADAPTATION_CHECKPOINT = list(last_checkpoint_dir.glob('*-last.ckpt'))[0]
YOUR_FINETUNED_HIFIGAN_ON_ADAPTATION_CHECKPOINT

# 4. Inference

In [None]:
from nemo.collections.tts.models import HifiGanModel
import IPython.display as ipd
import matplotlib.pyplot as plt

## a. Load Model

In [None]:
wave_model = WaveformFeaturizer(sample_rate=sample_rate)

In [None]:
# FastPitch
spec_model = FastPitchModel.load_from_checkpoint(YOUR_PRETRAINED_FASTPITCH_CHECKPOINT)
spec_model.load_adapters(YOUR_FINETUNED_ADAPTER_CHECKPOINT)
# spec_model.freeze()
# spec_model.unfreeze_enabled_adapters()
spec_model = spec_model.eval().cuda()

In [None]:
# HiFiGAN
vocoder_model = HifiGanModel.load_from_checkpoint(checkpoint_path=YOUR_FINETUNED_HIFIGAN_ON_ADAPTATION_CHECKPOINT).eval().cuda()

## b. Output Audio
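
The cells below play the synthesized audio inline. To keep a result on disk as well, a float waveform could be written as 16-bit PCM WAV with the standard library. The sketch assumes a one-dimensional sequence of floats in [-1, 1], so the array returned by `synth_audio` may need to be flattened first. `write_wav` is a hypothetical helper, and the sine tone only stands in for generated speech.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


sample_rate = 44100
num_samples = sample_rate // 10  # 0.1 s placeholder tone
samples = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(num_samples)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synth.wav")
    write_wav(path, samples, sample_rate)
    with wave.open(path, "rb") as wav:
        n_frames, rate = wav.getnframes(), wav.getframerate()

print(n_frames, rate)
```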

In [None]:
def gt_spectrogram(audio_path, wave_model, spec_gen_model):
    features = wave_model.process(audio_path, trim=False)
    audio, audio_length = features, torch.tensor(features.shape[0]).long()
    audio = audio.unsqueeze(0).to(device=spec_gen_model.device)
    audio_length = audio_length.unsqueeze(0).to(device=spec_gen_model.device)
    with torch.no_grad():
        spectrogram, spec_len = spec_gen_model.preprocessor(input_signal=audio, length=audio_length)
    return spectrogram, spec_len

def gen_spectrogram(text, spec_gen_model, reference_spec, reference_spec_lens):
    parsed = spec_gen_model.parse(text)
    with torch.no_grad():    
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, 
                                                          reference_spec=reference_spec, 
                                                          reference_spec_lens=reference_spec_lens)

    return spectrogram
  
def synth_audio(vocoder_model, spectrogram):    
    with torch.no_grad():  
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return audio

In [None]:
# Reference Audio
with open(train_manifest, "r") as f:
    for i, line in enumerate(f):
        reference_record = json.loads(line)
        break
        
# Validation Audio
num_val = 3
val_records = []
with open(valid_manifest, "r") as f:
    for i, line in enumerate(f):
        val_records.append(json.loads(line))
        if len(val_records) >= num_val:
            break

In [None]:
for i, val_record in enumerate(val_records):
    reference_spec, reference_spec_lens = gt_spectrogram(reference_record['audio_filepath'], wave_model, spec_model)
    reference_spec = reference_spec.to(spec_model.device)
    spec_pred = gen_spectrogram(val_record['text'], spec_model,
                                reference_spec=reference_spec, 
                                reference_spec_lens=reference_spec_lens)

    audio_gen = synth_audio(vocoder_model, spec_pred)
    
    audio_ref = ipd.Audio(reference_record['audio_filepath'], rate=sample_rate)
    audio_gt = ipd.Audio(val_record['audio_filepath'], rate=sample_rate)
    audio_gen = ipd.Audio(audio_gen, rate=sample_rate)
    
    print("------")
    print(f"Text: {val_record['text']}")
    print('Reference Audio')
    ipd.display(audio_ref)
    print('Ground Truth Audio')
    ipd.display(audio_gt)
    print('Synthesized Audio')
    ipd.display(audio_gen)
    plt.imshow(spec_pred[0].to('cpu').numpy(), origin="lower", aspect="auto")
    plt.show()

In [None]:
str(YOUR_PRETRAINED_FASTPITCH_CHECKPOINT)

In [None]:
str(YOUR_FINETUNED_ADAPTER_CHECKPOINT)

In [None]:
str(YOUR_FINETUNED_HIFIGAN_ON_ADAPTATION_CHECKPOINT)