# FastPitch Speaker Adaptation

This notebook shows how to run the FastPitch speaker adaptation pipeline. If you are not familiar with adapters, please go through this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/02_NeMo_Adapters.ipynb) first. This notebook contains the following sections:

- Fine-tune FastPitch: fine-tune pre-trained multi-speaker FastPitch for a new speaker
- Inference: generate speech from adapted FastPitch

# License

> Copyright 2022 NVIDIA. All Rights Reserved.
> 
> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
> 
> http://www.apache.org/licenses/LICENSE-2.0
> 
> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
BRANCH = 'main'
# If you're using Colab and not running locally, uncomment the lines below and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget unidecode pynini==2.1.4 scipy==1.7.3
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:
import os
import json
import nemo
import torch
import numpy as np
import IPython.display as ipd

from pathlib import Path
from tqdm.notebook import tqdm

In [None]:
from nemo.collections.tts.torch.g2ps import EnglishG2p
from nemo.collections.tts.torch.data import TTSDataset
from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo.collections.tts.torch.tts_tokenizers import EnglishPhonemesTokenizer, EnglishCharsTokenizer

## Fine-tune FastPitch using Adapters

### Data

Download a small dataset to demonstrate speaker adaptation using adapters.

In [None]:
# Dataset download
!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz

In [None]:
!head -n 1 ./6097_5_mins/manifest.json

For speaker adaptation, the manifest must contain a speaker ID (the `speaker` field). The downloaded manifest does not contain this field, so we will add it.

In [None]:
def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)
            

def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")

In [None]:
manifest = list(json_reader('./6097_5_mins/manifest.json'))
for m in manifest:
    m['speaker'] = 500
json_writer('./6097_5_mins/manifest.json', manifest)

Split the data into training and validation sets.

In [None]:
!cat ./6097_5_mins/manifest.json | tail -n 2 > ./val_manifest.json
!cat ./6097_5_mins/manifest.json | head -n -2 > ./train_manifest.json
!ln -s ./6097_5_mins/audio audio
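
The `head -n -2` form used above relies on GNU `head` and may not work on systems with a different `head` (for example macOS). A minimal Python sketch of the same split is shown here; `split_manifest` is a hypothetical helper, not part of NeMo, and it assumes one JSON record per line.

```python
from pathlib import Path


def split_manifest(manifest_path, train_path, val_path, n_val=2):
    """Write the last `n_val` records to `val_path` and the rest to `train_path`."""
    # Keep only non-empty lines; each line is assumed to hold one JSON record.
    lines = [ln for ln in Path(manifest_path).read_text().splitlines() if ln.strip()]
    if not 0 < n_val < len(lines):
        raise ValueError("n_val must be positive and smaller than the manifest size")
    Path(train_path).write_text("\n".join(lines[:-n_val]) + "\n")
    Path(val_path).write_text("\n".join(lines[-n_val:]) + "\n")
    # Return the number of training and validation records written.
    return len(lines) - n_val, n_val
```

For this notebook, it could be called as `split_manifest("./6097_5_mins/manifest.json", "./train_manifest.json", "./val_manifest.json")`.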

Download all additional files for training the model.

In [None]:
# additional files
!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.08 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-052722 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv \
&& cd ..

We use the `examples/tts/fastpitch_finetune_adapters.py` script to finetune the adapters with the `fastpitch_speaker_adaptation.yaml` configuration. So, let's download these files.

In [None]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch_finetune_adapters.py

!mkdir -p conf \
&& cd conf \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_speaker_adaptation.yaml \
&& cd ..

### Generate Supplementary Data

It is recommended to precompute supplementary data such as `pitch` and `align_prior_matrix` before training.
Let's define the normalizer and tokenizer for this purpose.

In [None]:
# Text normalizer
text_normalizer = Normalizer(
    lang="en", 
    input_case="cased", 
    whitelist="tts_dataset_files/lj_speech.tsv"
)

text_normalizer_call_kwargs = {
    "punct_pre_process": True,
    "punct_post_process": True
}

# Grapheme-to-phoneme module
g2p = EnglishG2p(
    phoneme_dict="tts_dataset_files/cmudict-0.7b_nv22.08",
    heteronyms="tts_dataset_files/heteronyms-052722"
)

# Text tokenizer
text_tokenizer = EnglishPhonemesTokenizer(
    punct=True,
    stresses=True,
    chars=True,
    apostrophe=True,
    pad_with_space=True,
    g2p=g2p,
)

Define a function that computes and saves the supplementary data. It returns the pitch statistics of the training set.

In [None]:
def pre_calculate_supplementary_data(manifest_dir, sup_data_path, sup_data_types, text_tokenizer, text_normalizer, 
                                     text_normalizer_call_kwargs, 
                                     batch_size=32, 
                                     sample_rate=22050):
    # init train and val dataloaders
    stages = ["train", "val"]
    stage2dl = {}
    for stage in stages:
        ds = TTSDataset(
            manifest_filepath=os.path.join(manifest_dir, f"{stage}_manifest.json"),
            sample_rate=sample_rate,
            sup_data_path=sup_data_path,
            sup_data_types=sup_data_types,
            n_fft=1024,
            win_length=1024,
            hop_length=256,
            window="hann",
            n_mels=80,
            lowfreq=0,
            highfreq=8000,
            text_tokenizer=text_tokenizer,
            text_normalizer=text_normalizer,
            text_normalizer_call_kwargs=text_normalizer_call_kwargs,
        )
        stage2dl[stage] = torch.utils.data.DataLoader(ds, batch_size=batch_size, collate_fn=ds._collate_fn, num_workers=12, pin_memory=True)
    
    # iteration over dataloaders
    pitch_mean, pitch_std, pitch_min, pitch_max = None, None, None, None
    for stage, dl in stage2dl.items():
        pitch_list = []
        for batch in tqdm(dl, total=len(dl)):
            tokens, tokens_lengths, audios, audio_lengths, attn_prior, pitches, pitches_lengths = batch
            pitch_list.append(pitches[pitches != 0])
            
        if stage == "train":
            pitch_tensor = torch.cat(pitch_list)
            pitch_mean, pitch_std = pitch_tensor.mean().item(), pitch_tensor.std().item()
            pitch_min, pitch_max = pitch_tensor.min().item(), pitch_tensor.max().item()
            
    return pitch_mean, pitch_std, pitch_min, pitch_max

In [None]:
fastpitch_sup_data_path = "fastpitch_sup_data_folder"
manifest_dir = "./"
sup_data_types = ["align_prior_matrix", "pitch"]

pitch_mean, pitch_std, pitch_min, pitch_max = pre_calculate_supplementary_data(
    manifest_dir, fastpitch_sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs
)
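
The function above keeps only non-zero pitch values, since zeros mark unvoiced frames. The sketch below reproduces that masking with the standard library and shows one common way such statistics are used, namely to standardize voiced frames. Both helpers are hypothetical, and the normalization step is an assumption about typical usage, not a copy of the NeMo implementation.

```python
import statistics


def voiced_pitch_stats(pitch_frames):
    """Return (mean, std, min, max) over voiced frames; zeros mark unvoiced frames."""
    voiced = [p for p in pitch_frames if p != 0]
    return (
        statistics.mean(voiced),
        statistics.stdev(voiced),  # sample std, like torch.Tensor.std() by default
        min(voiced),
        max(voiced),
    )


def normalize_pitch(pitch_frames, mean, std):
    """Standardize voiced frames and keep unvoiced frames at zero."""
    return [(p - mean) / std if p != 0 else 0.0 for p in pitch_frames]
```

Including the zeros would pull the mean down and inflate the standard deviation, which is why they are excluded before the statistics are computed.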

### Training Adapters

Now we are ready to train the adapters. Let's adapt the pretrained model to the new speaker.

In [None]:
!(python fastpitch_finetune_adapters.py \
    --config-name=fastpitch_speaker_adaptation.yaml \
    sample_rate=44100 \
    train_dataset=./train_manifest.json \
    validation_datasets=./val_manifest.json \
    sup_data_types="['align_prior_matrix', 'pitch', 'speaker_id']" \
    sup_data_path={fastpitch_sup_data_path} \
    +init_from_pretrained_model="tts_en_fastpitch_multispeaker" \
    pitch_mean={pitch_mean} \
    pitch_std={pitch_std} \
    pitch_fmin={pitch_min} \
    pitch_fmax={pitch_max} \
    model.n_speakers=12800 \
    phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.08 \
    heteronyms_path=tts_dataset_files/heteronyms-052722 \
    whitelist_path=tts_dataset_files/lj_speech.tsv \
    model.train_ds.dataloader_params.batch_size=24 \
    model.validation_ds.dataloader_params.batch_size=24 \
    model.train_ds.dataloader_params.num_workers=8 \
    model.validation_ds.dataloader_params.num_workers=8 \
    model.optim.lr=2e-4 \
    ~model.optim.sched \
    model.optim.name=adamw \
    model.optim.weight_decay=0.0 \
    +model.text_tokenizer.add_blank_at=True \
    trainer.check_val_every_n_epoch=10 \
    trainer.max_epochs=100 \
    trainer.log_every_n_steps=1 \
    trainer.devices=1 \
    trainer.precision=32 \
 )

Let's look at some of the options in the training command:

`+init_from_pretrained_model` : FastPitch will load this pretrained checkpoint, and add adapters on top of this model.

`model.n_speakers` : The pretrained model above already has 12800 speaker embeddings, so this parameter must match that number for the weights to load correctly.

`~model.optim.sched` : Since this is a finetuning activity for only a few steps, we do not need a scheduler.
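
The command mixes three kinds of Hydra overrides: `key=value` changes an existing entry, `+key=value` adds a new one, and `~key` removes one. The toy function below mimics these three cases on a plain nested dictionary to make the difference concrete. It is only an illustration: `apply_override` is a hypothetical helper, values are kept as strings, and Hydra's real parser handles much more (types, lists, quoting).

```python
def apply_override(cfg, override):
    """Apply one Hydra-style override string to a nested dict (toy semantics)."""
    if override.startswith("~"):
        action, body = "delete", override[1:]
    elif override.startswith("+"):
        action, body = "add", override[1:]
    else:
        action, body = "set", override
    path, _, value = body.partition("=")
    *parents, leaf = path.split(".")
    node = cfg
    for key in parents:
        # Only "+" is allowed to create missing intermediate nodes.
        node = node.setdefault(key, {}) if action == "add" else node[key]
    if action == "delete":
        del node[leaf]
    elif action == "set":
        if leaf not in node:
            raise KeyError(f"{path} does not exist; use +{path}=... to add it")
        node[leaf] = value  # values stay strings in this sketch
    else:
        node[leaf] = value
    return cfg
```

This is why `init_from_pretrained_model` and `model.text_tokenizer.add_blank_at` carry a `+` in the command: they are added on the command line, whereas `model.optim.lr` overrides an entry that is already present in the configuration.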

Let's look at some fields in `fastpitch_speaker_adaptation.yaml`:

`adapter` : This is a new field inside the `model` struct. It contains all adapter-related configs.

`adapter_name` : Name of the adapter.

`adapter_module_name` : Names of the modules where adapters are added/enabled.

`adapter_state_dict_name` : Name of the state dict to be saved with adapter weights only.

`add_random_speaker` : Since adapters are used to add new speakers, this flag controls how the new speaker's embedding is initialized: from a random speaker embedding, or from the embedding of another pretrained speaker.

Look at `nemo.collections.common.parts.adapter_modules` to see the type of adapters available.
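
For intuition, a common adapter design is a small bottleneck with a residual connection: the input is projected down, passed through a nonlinearity, projected back up, and added to the original input. The pure-Python sketch below illustrates that idea only. It is not the NeMo implementation, and the modules in `adapter_modules` may differ in details such as normalization and activation; `matvec` and `bottleneck_adapter` are hypothetical names.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def bottleneck_adapter(x, w_down, w_up):
    """Residual bottleneck: x + W_up @ relu(W_down @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_down, x)]  # down-projection + ReLU
    update = matvec(w_up, hidden)                      # up-projection
    return [xi + ui for xi, ui in zip(x, update)]      # residual connection
```

If the up-projection starts at zero, the adapter initially returns its input unchanged, so training starts from the behavior of the pretrained model and only the small adapter weights need to be learned.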

## Inference

Inference with an already trained adapter can be done using the following steps:

- Update the config of the pretrained model to support adapters. This step replaces each module that needs an adapter with the adapter class registered for that module.

- Load the pretrained model.

- Load the adapter weights using the `load_adapters` method.

- Generate a spectrogram using this model and convert it to audio using a vocoder.

Let's go through these steps one at a time.

In [None]:
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
from nemo.core import adapter_mixins
from omegaconf import DictConfig, open_dict

def update_model_config_to_support_adapter(config) -> DictConfig:
    with open_dict(config):
        enc_adapter_metadata = adapter_mixins.get_registered_adapter(config.input_fft._target_)
        if enc_adapter_metadata is not None:
            config.input_fft._target_ = enc_adapter_metadata.adapter_class_path

        dec_adapter_metadata = adapter_mixins.get_registered_adapter(config.output_fft._target_)
        if dec_adapter_metadata is not None:
            config.output_fft._target_ = dec_adapter_metadata.adapter_class_path

        pitch_predictor_adapter_metadata = adapter_mixins.get_registered_adapter(config.pitch_predictor._target_)
        if pitch_predictor_adapter_metadata is not None:
            config.pitch_predictor._target_ = pitch_predictor_adapter_metadata.adapter_class_path

        duration_predictor_adapter_metadata = adapter_mixins.get_registered_adapter(config.duration_predictor._target_)
        if duration_predictor_adapter_metadata is not None:
            config.duration_predictor._target_ = duration_predictor_adapter_metadata.adapter_class_path

        aligner_adapter_metadata = adapter_mixins.get_registered_adapter(config.alignment_module._target_)
        if aligner_adapter_metadata is not None:
            config.alignment_module._target_ = aligner_adapter_metadata.adapter_class_path

    return config
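
The function above repeats the same lookup-and-replace pattern for five modules. A more compact, loop-based sketch of that pattern is given below on a plain dictionary. Here `lookup` is a hypothetical callable that returns the adapter class path or `None`; in the notebook it could wrap `adapter_mixins.get_registered_adapter` and return the `adapter_class_path` of the result.

```python
ADAPTER_MODULES = (
    "input_fft",
    "output_fft",
    "pitch_predictor",
    "duration_predictor",
    "alignment_module",
)


def swap_adapter_targets(config, lookup, module_names=ADAPTER_MODULES):
    """Point each listed module's `_target_` at its adapter-capable class, if one is registered."""
    for name in module_names:
        adapter_class_path = lookup(config[name]["_target_"])
        if adapter_class_path is not None:
            config[name]["_target_"] = adapter_class_path
    return config
```

Modules without a registered adapter class keep their original `_target_`, which matches the `is not None` checks in the explicit version.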

Let's prepare the config so that FastPitch is built with adapter-capable modules instead of the plain FastPitch modules.

In [None]:
pretrained_model = "tts_en_fastpitch_multispeaker"

# Get the model config from the pretrained checkpoint. 
model_cfg = FastPitchModel.from_pretrained(pretrained_model, return_config=True)
# Update the config to support adapters.
model_cfg = update_model_config_to_support_adapter(model_cfg)
# Use the updated config to load the pretrained checkpoint. This allows loading the pretrained
# checkpoint along with adapters.
spec_gen_model = FastPitchModel.from_pretrained(pretrained_model, override_config_path=model_cfg)
spec_gen_model.summarize()

Load the adapter checkpoint. The checkpoint produced by the training run above is saved as `./nemo_experiments/FastPitch/{version}/checkpoints/adapters.pt`, where `version` is based on the timestamp of the training run.

After adapter training finishes, check the folder `./nemo_experiments/FastPitch/`. It contains a folder with a timestamp as its name; this name is the `version`. Replace the version in the cell below with the one your experiment generated.

In [None]:
# Replace `version` with the version in your experiment
version = "2022-10-12_19-29-21"
adapter_checkpoint_path = f"nemo_experiments/FastPitch/{version}/checkpoints/adapters.pt"
spec_gen_model.load_adapters(adapter_checkpoint_path, name=None, map_location='cpu')
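
Instead of copying the folder name by hand, the most recent run can be located programmatically. The sketch below assumes that run folders are named with timestamps in the `YYYY-MM-DD_HH-MM-SS` format shown above, so that sorting the paths also sorts the runs chronologically. `latest_adapter_checkpoint` is a hypothetical helper, not a NeMo function.

```python
from pathlib import Path


def latest_adapter_checkpoint(exp_dir="nemo_experiments/FastPitch"):
    """Return the adapters.pt path of the most recent run directory that has one."""
    # Timestamp-named folders sort chronologically when sorted as strings.
    candidates = sorted(Path(exp_dir).glob("*/checkpoints/adapters.pt"))
    if not candidates:
        raise FileNotFoundError(f"no adapters.pt found under {exp_dir}")
    return candidates[-1]
```

The returned path could then be passed to `load_adapters` in place of the hard-coded `adapter_checkpoint_path`.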

In [None]:
spec_gen_model.freeze()
spec_gen_model.summarize()

To convert the spectrogram to a waveform, we need a vocoder. Let's load the HiFiGAN vocoder.

In [None]:
pretrained_vocoder = "tts_en_hifitts_hifigan_ft_fastpitch"
vocoder = HifiGanModel.from_pretrained(pretrained_vocoder)

In [None]:
def infer(spec_gen_model, vocoder_model, str_input, speaker=None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Args:
        spec_gen_model: Spectrogram generator model (FastPitch in our case)
        vocoder_model: Vocoder model (HiFiGAN in our case)
        str_input: Text input for the synthesis
        speaker: Speaker ID
    
    Returns:
        spectrogram and waveform of the synthesized audio.
    """
    spec_gen_model.cuda().eval()
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    else:
        raise Exception("None value was generated for spectrogram")
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

In [None]:
# Let's load validation manifest for inference
val_manifest = list(json_reader('./val_manifest.json'))
for val_record in val_manifest:
    str_input = val_record['text_normalized']
    speaker_id = val_record['speaker']
    _, audio_adapter = infer(spec_gen_model, vocoder, str_input, speaker=speaker_id)
    
    print("Original Audio")
    ipd.display(ipd.Audio(val_record['audio_filepath'], rate=44100))
    print("Generated Audio with Adapter")
    ipd.display(ipd.Audio(audio_adapter, rate=44100))
    print("-------------")
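
To keep the generated audio for later comparison, the waveform can also be written to disk. The sketch below uses only the standard library and assumes mono float samples in the range [-1, 1] at 44100 Hz, matching the playback rate used above. `save_wav` is a hypothetical helper.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=44100):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    # Clip out-of-range values so the integer conversion cannot overflow.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
```

Inside the loop above, one possible call is `save_wav("generated.wav", audio_adapter[0])`, assuming `audio_adapter` has shape `(1, num_samples)`.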

**Note:** The quality of the generated audio may not be very good because, for demo purposes, we trained on less than 5 minutes of data. It is recommended to use 15-30 minutes of data from the new speaker.