# FastPitch Speaker Adaptation and Speaker Representation

This notebook walks through the FastPitch speaker adaptation pipeline. If you are not familiar with adapters, please go through the following [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/02_NeMo_Adapters.ipynb) first. This tutorial contains the following sections:

- Speaker Representation: a discussion of a few methods to represent speakers in FastPitch
- Fine-tune FastPitch: fine-tune a pre-trained multi-speaker FastPitch model for a new speaker
- Inference: generate speech from the adapted FastPitch model

# License

> Copyright 2022 NVIDIA. All Rights Reserved.
> 
> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
> 
> http://www.apache.org/licenses/LICENSE-2.0
> 
> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
BRANCH = 'main'
# # If you're using Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget unidecode pynini==2.1.4 scipy==1.7.3
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:
import os
import json
import nemo
import torch
import numpy as np
import IPython.display as ipd

from pathlib import Path
from tqdm.notebook import tqdm

In [None]:
%%capture --no-display
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment
from nemo.collections.tts.torch.g2ps import EnglishG2p
from nemo.collections.tts.torch.data import TTSDataset
from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo.collections.tts.torch.tts_tokenizers import EnglishPhonemesTokenizer, EnglishCharsTokenizer

import librosa

## Fine-tune FastPitch using Adapters

### Data

Download a small dataset to demonstrate speaker adaptation using adapters.

In [None]:
# Dataset download
!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz

In [None]:
!head -n 1 ./6097_5_mins/manifest.json

For speaker adaptation, the manifest must contain a speaker ID (the `speaker` field). The downloaded manifest does not contain this field, so we add it.

In [None]:
def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)
            

def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")

In [None]:
manifest = list(json_reader('./6097_5_mins/manifest.json'))
for m in manifest:
    m['speaker'] = 500
json_writer('./6097_5_mins/manifest.json', manifest)
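
Before moving on, it can help to confirm that every entry now carries the field. The following is a minimal sketch and not part of the NeMo pipeline: `find_entries_missing_field` is a hypothetical helper, and the sample lines are made up to mimic the JSON-lines layout of the manifest.

```python
import json


def find_entries_missing_field(manifest_lines, field="speaker"):
    """Return the 0-based indices of manifest lines that lack `field`."""
    missing = []
    for index, line in enumerate(manifest_lines):
        entry = json.loads(line)
        if field not in entry:
            missing.append(index)
    return missing


# Made-up manifest lines; a real check would read them from the manifest file.
sample_lines = [
    '{"audio_filepath": "audio/a.wav", "text": "hello", "speaker": 500}',
    '{"audio_filepath": "audio/b.wav", "text": "world"}',
]
missing_indices = find_entries_missing_field(sample_lines)
print(missing_indices)
```

An empty result means every entry has the field and the manifest is ready for the steps that follow.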

Split the data into train and validation sets.

In [None]:
!cat ./6097_5_mins/manifest.json | tail -n 2 > ./val_manifest.json
!cat ./6097_5_mins/manifest.json | head -n -2 > ./train_manifest.json
!ln -s ./6097_5_mins/audio audio
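
The same split can be expressed in Python, which may be convenient where `head -n -2` is not available. This is only a sketch: `split_manifest` is a hypothetical helper and the entries are made up; it keeps the last two entries for validation, mirroring the shell commands.

```python
def split_manifest(entries, num_val=2):
    """Put the last `num_val` entries in validation and the rest in train."""
    if not 0 < num_val < len(entries):
        raise ValueError("num_val must be between 1 and len(entries) - 1")
    return entries[:-num_val], entries[-num_val:]


# Made-up entries standing in for the parsed manifest.
sample_entries = [
    {"audio_filepath": f"audio/{i}.wav", "speaker": 500} for i in range(5)
]
train_entries, val_entries = split_manifest(sample_entries)
print(len(train_entries), len(val_entries))
```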

Download all additional files for training the model.

In [None]:
# additional files
!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.10 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-052722 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv \
&& cd ..

We use the `examples/tts/fastpitch_finetune_adapters.py` script to finetune the adapters with the `fastpitch_speaker_adaptation.yaml` configuration. So, let's download these files.

In [None]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch_finetune_adapters.py

!mkdir -p conf \
&& cd conf \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_speaker_adaptation.yaml \
&& cd ..

In [None]:
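# Optional: overwrite the downloaded config with the copy from a local NeMo checkout.
# This path only exists if NeMo is checked out at /workspace/NeMo; skip this cell otherwise.
!cp /workspace/NeMo/examples/tts/conf/fastpitch_speaker_adaptation.yaml conf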

### Generate Supplementary Data

It is recommended to precompute supplementary data like `pitch` and `alignment_matrix` before training.
Let's define the normalizer and tokenizer for this purpose.

In [None]:
# Text normalizer
text_normalizer = Normalizer(
    lang="en", 
    input_case="cased", 
    whitelist="tts_dataset_files/lj_speech.tsv"
)

text_normalizer_call_kwargs = {
    "punct_pre_process": True,
    "punct_post_process": True
}

# Grapheme-to-phoneme module used by the text tokenizer
g2p = EnglishG2p(
    phoneme_dict="tts_dataset_files/cmudict-0.7b_nv22.10",
    heteronyms="tts_dataset_files/heteronyms-052722"
)

# Text tokenizer
text_tokenizer = EnglishPhonemesTokenizer(
    punct=True,
    stresses=True,
    chars=True,
    apostrophe=True,
    pad_with_space=True,
    g2p=g2p,
)

Define a function that computes and saves the supplementary data and returns the pitch statistics for the dataset.

In [None]:
def pre_calculate_supplementary_data(manifest_dir, sup_data_path, sup_data_types, text_tokenizer, text_normalizer, 
                                     text_normalizer_call_kwargs, 
                                     batch_size=32, 
                                     sample_rate=44100):
    stages = ["train", "val"]
    stage2dl = {}
    for stage in stages:
        ds = TTSDataset(
            manifest_filepath=os.path.join(manifest_dir, f"{stage}_manifest.json"),
            sample_rate=sample_rate,
            sup_data_path=sup_data_path,
            sup_data_types=sup_data_types,
            n_fft=2048,
            win_length=2048,
            hop_length=512,
            window="hann",
            n_mels=80,
            lowfreq=0,
            highfreq=22050,
            text_tokenizer=text_tokenizer,
            text_normalizer=text_normalizer,
            text_normalizer_call_kwargs=text_normalizer_call_kwargs,
        )
        stage2dl[stage] = torch.utils.data.DataLoader(ds, batch_size=batch_size, collate_fn=ds._collate_fn, num_workers=12, pin_memory=True)
    
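    # Iterate over both dataloaders so the supplementary data is computed and saved for every split.
    # The pitch statistics are taken from the train split only.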
    pitch_mean, pitch_std, pitch_min, pitch_max = None, None, None, None
    for stage, dl in stage2dl.items():
        pitch_list = []
        for batch in tqdm(dl, total=len(dl)):
            tokens, tokens_lengths, audios, audio_lengths, attn_prior, pitches, pitches_lengths = batch
            pitch_list.append(pitches[pitches != 0])
            
        if stage == "train":
            pitch_tensor = torch.cat(pitch_list)
            pitch_mean, pitch_std = pitch_tensor.mean().item(), pitch_tensor.std().item()
            pitch_min, pitch_max = pitch_tensor.min().item(), pitch_tensor.max().item()
            
    return pitch_mean, pitch_std, pitch_min, pitch_max

In [None]:
fastpitch_sup_data_path = "fastpitch_sup_data_folder"
manifest_dir = "./"
sup_data_types = ["align_prior_matrix", "pitch"]

# Precompute the supplementary data like pitch, alignment matrix.
pitch_mean, pitch_std, pitch_min, pitch_max = pre_calculate_supplementary_data(
    manifest_dir, fastpitch_sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs
)

pitch_mean, pitch_std, pitch_min, pitch_max
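
To make the statistics concrete, here is a small standalone sketch of the same computation on a made-up list of per-frame pitch values. It assumes, as the function above does, that frames with a pitch of exactly 0 are excluded (they presumably mark unvoiced or padded frames). `pitch_statistics` is a hypothetical helper; it uses the sample standard deviation, which matches the default of `torch.Tensor.std`.

```python
import statistics


def pitch_statistics(pitch_frames):
    """Return (mean, std, min, max) over the non-zero pitch frames."""
    voiced = [p for p in pitch_frames if p != 0]
    if len(voiced) < 2:
        raise ValueError("need at least two non-zero pitch frames")
    return (
        statistics.mean(voiced),
        statistics.stdev(voiced),  # sample standard deviation
        min(voiced),
        max(voiced),
    )


# Made-up pitch track in Hz; zeros are skipped.
frames = [0.0, 100.0, 0.0, 200.0, 300.0]
mean, std, lowest, highest = pitch_statistics(frames)
print(mean, std, lowest, highest)
```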

New speakers are adapted on top of a pretrained multi-speaker model. In this example we use the [tts_en_fastpitch_multispeaker](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_multispeaker_fastpitchhifigan) model. We need the speaker IDs that the pretrained model was trained on; for this model they range from 1 to 20.

In [None]:
speakers = list(range(1, 21))

### Speaker Representation

In this notebook we go through a few ways to represent speakers in TTS models such as FastPitch.

1. **Lookup table embeddings:** This method represents each speaker by a fixed embedding that is learnt during training/finetuning of FastPitch. An embedding layer (`torch.nn.Embedding`) is added to FastPitch, and its embeddings are used to condition the encoder, pitch/duration predictors and decoder. A disadvantage of this representation is that no embedding exists for a new speaker, so it cannot be used directly for unseen speakers, and a significant number of training steps is needed to learn one. One way to address this is to use a weighted mean of the existing speaker embeddings (see the implementation in `nemo.collections.tts.modules.submodules.WeightedSpeakerEmbedding`), but this method does not produce very good speech. To use this in NeMo, add `speaker_id` to the list of `sup_data_types`.

In [None]:
# If Lookup table embeddings are used to represent speakers
sup_data_types="['align_prior_matrix','pitch','speaker_id']"
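
The weighted-mean idea mentioned above can be sketched in plain Python. This is an illustration only: it assumes the mixing weights are normalized with a softmax, which may differ from what `WeightedSpeakerEmbedding` actually does, and the embeddings and function names are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_speaker_embedding(embeddings, scores):
    """Combine pretrained speaker embeddings into one vector for a new speaker."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


# Three made-up 2-dimensional "pretrained" speaker embeddings.
pretrained = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scores give a plain average; training would adjust the scores.
new_speaker = weighted_speaker_embedding(pretrained, [0.0, 0.0, 0.0])
print(new_speaker)
```

Because only the scores need to be learnt, the new speaker starts from a point inside the span of the known speakers instead of from a random vector.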

2. **Global Style Tokens:** As introduced in the paper [Global Style Tokens (GST)](https://arxiv.org/pdf/1803.09017.pdf), in this method a bank of embeddings (tokens) is learnt along with the TTS model to represent a range of acoustic expressiveness. These style tokens are combined into a single embedding by an attention layer. During multi-speaker training, reference audio of the same speaker is used to calculate the GST embedding that represents the speaker. During inference, a fixed reference audio of the speaker or the centroid of the speaker's GST embeddings can be used to condition the TTS model. To use this in NeMo, add `speaker_id` and `gst_ref_audio` to the list of `sup_data_types`.

In [None]:
# If GST embeddings are used to represent speakers
sup_data_types="['align_prior_matrix','pitch','speaker_id','gst_ref_audio']"
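
For the centroid option at inference time, the computation is an element-wise average of the per-utterance GST embeddings of one speaker. The sketch below uses made-up 2-dimensional vectors and a hypothetical `centroid` helper; how the per-utterance embeddings are extracted from the model is not shown here.

```python
def centroid(embeddings):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(emb[d] for emb in embeddings) / count for d in range(dim)]


# Made-up GST embeddings computed from three utterances of the same speaker.
utterance_embeddings = [[0.2, 0.4], [0.4, 0.8], [0.6, 0.6]]
speaker_centroid = centroid(utterance_embeddings)
print(speaker_centroid)
```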

3. **Speaker Verification:** As explained in this [paper](https://arxiv.org/abs/1806.04558), speaker embeddings can be extracted from Speaker Verification models like [Titanet](https://arxiv.org/abs/2110.04410) and used to condition TTS models like FastPitch. To use this in NeMo:
    - Extract speaker embeddings from Titanet and save them.
    - Add `speaker_embedding` to the list of `sup_data_types`.
    - Add `+model.train_ds.dataset.speaker_embedding_path=<path/to/speaker/embeddings/from/Titanet>` to the config file.

In [None]:
# Extract speaker embeddings from speaker verification model (Titanet)

def add_label_and_embeddingid(file_name):
    """
    For extracting speaker embeddings from Titanet, the manifest needs to have a `label` field which is the same as speaker id,
    so we will add this field to the manifest. We will also add another field `embedding_id` to the manifest, which denotes
    the ID of the embedding in the embedding file that corresponds to the particular audio, and is needed when using this
    embedding in FastPitch.
    """
    manifest = list(json_reader(file_name))
    for i, m in enumerate(manifest):
        m['label'] = m['speaker']
        m['embedding_id'] = i
    json_writer(file_name, manifest)


import nemo.collections.asr as nemo_asr

!mkdir -p speaker_embeddings

# For best performance, check that sampling rate of SV model matches that of TTS model
verification_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained('titanet_large')
verification_model.eval().cuda()

stages = ['train', 'val']

for stage in stages:
    manifest_path = f"./{stage}_manifest.json"
    add_label_and_embeddingid(manifest_path)
    embedding_path = f"./speaker_embeddings/{stage}_titanet_embedding.npy"
    embs, _, _, _ = nemo_asr.models.EncDecSpeakerLabelModel.get_batch_embeddings(verification_model,
                                                                             manifest_filepath=manifest_path,
                                                                             batch_size=32, 
                                                                             sample_rate=44100, 
                                                                             device='cuda')
    np.save(embedding_path, embs)

In [None]:
# If Titanet embeddings are used to represent speakers
sup_data_types="['align_prior_matrix','pitch','speaker_embedding']"

All three methods above can also be used together (see the implementation in `nemo.collections.tts.modules.submodules.SpeakerEncoder`). In this case `sup_data_types` should look like the following:

In [None]:
# If Lookup table, GST speaker embedding, Titanet embeddings are used together
sup_data_types="['align_prior_matrix','pitch','speaker_id','gst_ref_audio','speaker_embedding']"
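
Since the value is passed to the training script as a quoted list string, it can be convenient to build it from flags instead of editing it by hand. The helper below is a hypothetical convenience, not a NeMo function; it reproduces the four strings shown in this section.

```python
def build_sup_data_types(use_lookup=False, use_gst=False, use_sv=False):
    """Build the `sup_data_types` string for the chosen speaker representations."""
    types = ["align_prior_matrix", "pitch"]
    # Both the lookup table and GST variants above include `speaker_id`.
    if use_lookup or use_gst:
        types.append("speaker_id")
    if use_gst:
        types.append("gst_ref_audio")
    if use_sv:
        types.append("speaker_embedding")
    return "[" + ",".join(f"'{t}'" for t in types) + "]"


all_three = build_sup_data_types(use_lookup=True, use_gst=True, use_sv=True)
print(all_three)
```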

### Training Adapters

Now we are ready for training the adapters! Let's try to adapt new speakers using adapters.

In [None]:
!(python fastpitch_finetune_adapters.py \
    --config-name=fastpitch_speaker_adaptation.yaml \
    sample_rate=44100 \
    train_dataset="./train_manifest.json" \
    validation_datasets="./val_manifest.json" \
    sup_data_types={sup_data_types} \
    sup_data_path={fastpitch_sup_data_path} \
    +init_from_pretrained_model="tts_en_fastpitch_multispeaker" \
    pitch_mean={pitch_mean} \
    pitch_std={pitch_std} \
    pitch_fmin={pitch_min} \
    pitch_fmax={pitch_max} \
    phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.10 \
    heteronyms_path=tts_dataset_files/heteronyms-052722 \
    whitelist_path=tts_dataset_files/lj_speech.tsv \
    model.speaker_embedding_dim=192 \
    model.n_speakers=12800 \
    model.adapter.add_weight_speaker=True \
    +model.adapter.add_weight_speaker_list="{speakers}" \
    model.train_ds.dataloader_params.batch_size=24 \
    model.validation_ds.dataloader_params.batch_size=24 \
    model.train_ds.dataloader_params.num_workers=8 \
    model.validation_ds.dataloader_params.num_workers=8 \
    +model.train_ds.dataset.speaker_embedding_path=./speaker_embeddings/train_titanet_embedding.npy \
    +model.validation_ds.dataset.speaker_embedding_path=./speaker_embeddings/val_titanet_embedding.npy \
    model.optim.lr=1e-5 \
    ~model.optim.sched \
    model.optim.name=adamw \
    model.optim.weight_decay=0.0 \
    +model.text_tokenizer.add_blank_at=True \
    trainer.check_val_every_n_epoch=10 \
    trainer.max_epochs=50 \
    trainer.log_every_n_steps=1 \
    trainer.devices=1 \
    trainer.precision=32 \
    exp_manager.exp_dir="nemo_experiments" \
)

The model summary shows that the number of trainable parameters is ~7% of the total number of parameters. Training is fast because we train only the additional parameters introduced by the adapters and keep the pretrained model frozen.

Let's look at some of the options in the training command:

`+init_from_pretrained_model` : FastPitch will load this pretrained checkpoint, and add adapters on top of this model.

`model.n_speakers` : The above pretrained model already has 12800 speaker embeddings, so this parameter must match that number for the weights to load correctly.

`~model.optim.sched` : Since this is a finetuning activity for only a few steps, we do not need a scheduler.

Let's look at some fields in `fastpitch_speaker_adaptation.yaml`

`adapter` : This is a new field inside the `model` struct. It contains all adapter-related configs.

`adapter_name` : Name of the adapter.

`adapter_module_name` : Names of the modules where the adapter has to be added/enabled.

`adapter_state_dict_name` : Name of the state dict to be saved with adapter weights only.

`add_random_speaker` : Since adapters are used to add new speakers, this flag controls whether the new speaker's embedding is initialized from a random speaker embedding or from some other pretrained speaker embedding.

`add_weight_speaker` : New/unseen speaker's embeddings are not present in the pretrained model, so if this option is set to true then the unseen speaker is represented by a weighted mean of the pretrained speakers.

`add_weight_speaker_list` : The user can choose which pretrained speakers are used to calculate the weighted mean when `add_weight_speaker` is set to true.

`speaker_encoder` : This is a new field inside the `model` struct. It contains all speaker-encoder-related configs. The implementation can be found in `nemo.collections.tts.modules.submodules.SpeakerEncoder`. This module can combine GST (global style token) based speaker embeddings, speaker embeddings from speaker verification models and lookup table speaker embeddings. The struct has sub-structs for GST, speaker verification based embeddings and lookup table based embeddings, corresponding to the three parameters the `SpeakerEncoder` class expects: `gst_module`, `sv_projection_module` and `lookup_emb_projection_module`.

 - `speaker_encoder.gst_module` : This is the GST-related struct inside `speaker_encoder`. If you are not using GST based speaker embeddings, remove it by adding `~model.speaker_encoder.gst_module` to the above command.

 - `speaker_encoder.sv_projection_module` : This is the speaker verification related struct inside `speaker_encoder`. If you are not using speaker verification based speaker embeddings, remove it by adding `~model.speaker_encoder.sv_projection_module` to the above command. If you are using them, add the `speaker_embedding_path` for each dataset to the above command, like:
```
+model.train_ds.dataset.speaker_embedding_path=speaker_embeddings/train_titanet_embedding.npy \
+model.validation_ds.dataset.speaker_embedding_path=./speaker_embeddings/val_titanet_embedding.npy \
```
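
The rules for these two sub-structs can be collected in a small helper that returns the extra command-line overrides. This is a sketch under the assumptions stated in the bullets above; `speaker_encoder_overrides` is a hypothetical function, not part of NeMo.

```python
def speaker_encoder_overrides(use_gst, use_sv, train_emb_path=None, val_emb_path=None):
    """Return the overrides to append to the training command."""
    overrides = []
    if not use_gst:
        # Drop the GST sub-struct when GST embeddings are not used.
        overrides.append("~model.speaker_encoder.gst_module")
    if use_sv:
        if not (train_emb_path and val_emb_path):
            raise ValueError("embedding paths are required when use_sv is True")
        overrides.append(f"+model.train_ds.dataset.speaker_embedding_path={train_emb_path}")
        overrides.append(f"+model.validation_ds.dataset.speaker_embedding_path={val_emb_path}")
    else:
        # Drop the speaker verification sub-struct when it is not used.
        overrides.append("~model.speaker_encoder.sv_projection_module")
    return overrides


overrides = speaker_encoder_overrides(
    use_gst=False,
    use_sv=True,
    train_emb_path="speaker_embeddings/train_titanet_embedding.npy",
    val_emb_path="speaker_embeddings/val_titanet_embedding.npy",
)
print(" \\\n".join(overrides))
```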


Look at `nemo.collections.common.parts.adapter_modules` to see the type of adapters available.

## Inference

Inference with an already trained adapter can be done using the following steps:

- Update the config of the pretrained model to support adapters. This step replaces the module, where adapters need to be used, with the registered adapter class for that particular module.

- Load the pretrained model.

- Load the adapter weights using the `load_adapters` method.

- Generate a spectrogram using this model and convert it to a waveform using a vocoder.

Let's go through these steps one at a time.

In [None]:
import random
# Let's load validation manifest for inference
val_manifest = list(json_reader('val_manifest.json'))
# Randomly select an audio sample of the given speaker to use as reference audio for the GST module or the speaker verification module.
ref = random.sample(val_manifest, 1)[0]
# For illustration purposes we will infer on 2 samples only.
val_manifest = random.sample(val_manifest, 2)

#### Setup adapters for inference

Let's prepare the config so that FastPitch uses adapter modules instead of just the plain FastPitch model.

Next, load the adapter checkpoint. The checkpoint produced by the above training run was saved in the folder `./nemo_experiments/FastPitch/{version}/checkpoints/`, where `version` is based on the timestamp during training.

After the adapter training is done, check the folder `./nemo_experiments/FastPitch/`. You will find a folder with a timestamp as its name. This folder name is the `version`; replace the version in the cell below with the one that your experiment generated.
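
If you prefer not to copy the folder name by hand, the lookup can be scripted. The sketch below makes two assumptions that are not confirmed by the text above: that the timestamped version folder names sort chronologically as strings, and that the most recently modified `.ckpt` file is the one you want. `find_latest_checkpoint` is a hypothetical helper, and the demo runs on a throwaway directory tree with made-up names.

```python
import os
import tempfile
from pathlib import Path


def find_latest_checkpoint(experiment_dir):
    """Return the newest .ckpt file inside the newest version folder."""
    version_dirs = sorted(p for p in Path(experiment_dir).iterdir() if p.is_dir())
    if not version_dirs:
        raise FileNotFoundError(f"no version folders in {experiment_dir}")
    checkpoints = sorted(
        (version_dirs[-1] / "checkpoints").glob("*.ckpt"),
        key=lambda p: p.stat().st_mtime,
    )
    if not checkpoints:
        raise FileNotFoundError(f"no .ckpt files in {version_dirs[-1]}")
    return checkpoints[-1]


# Demo on a temporary directory tree (all names are made up).
layout = {
    "2023-01-01_10-00-00": ["a.ckpt"],
    "2023-01-02_09-30-00": ["b.ckpt", "c-last.ckpt"],
}
with tempfile.TemporaryDirectory() as tmp:
    for version, names in layout.items():
        ckpt_dir = Path(tmp) / version / "checkpoints"
        ckpt_dir.mkdir(parents=True)
        for offset, name in enumerate(names):
            path = ckpt_dir / name
            path.touch()
            # Set explicit modification times so the ordering is deterministic.
            os.utime(path, (1_000 + offset, 1_000 + offset))
    latest = find_latest_checkpoint(tmp)
    latest_name = latest.name
    latest_version = latest.parent.parent.name
print(latest_version, latest_name)
```

With a real run, the call would be `find_latest_checkpoint("./nemo_experiments/FastPitch")`, and the returned path could be used as the checkpoint path in the cell below.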

In [None]:
from nemo.collections.common.parts import adapter_modules
from nemo.collections.tts.modules.submodules import WeightedSpeakerEmbedding
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_model_ckpt = "./nemo_experiments/FastPitch/<version_folder_name>/checkpoints/<last_checkpoint.ckpt>"

spec_model = FastPitchModel.load_from_checkpoint(spec_model_ckpt, strict=False)
state_dict = torch.load(spec_model_ckpt)['state_dict']
has_adapter = any('adapter' in k for k in state_dict)
if has_adapter:
    adapter_cfg = adapter_modules.LinearAdapterConfig(
        in_features=spec_model.cfg.output_fft.d_model,  # model dimension of the FastPitch transformer blocks. Every layer emits this dim at its output.
        dim=256,  # the bottleneck dimension of the adapter
        activation='swish',  # activation used in bottleneck block
        norm_position='pre',  # whether to use LayerNorm at the beginning or the end of the adapter
    )
    spec_model.add_adapter(name='encoder+decoder+duration_predictor+pitch_predictor+aligner:adapter', cfg=adapter_cfg)
    spec_model.set_enabled_adapters(enabled=False)
    spec_model.set_enabled_adapters('adapter', enabled=True)
    spec_model.unfreeze_enabled_adapters()

spec_model.fastpitch.speaker_emb = WeightedSpeakerEmbedding(pretrained_embedding=spec_model.fastpitch.speaker_emb, speaker_list=speakers)
spec_model.load_state_dict(state_dict)
spec_model.eval().cuda()

Since the above model uses GST based speaker representation, we will need the spectrogram and spectrogram length of the GST reference audio we selected previously.

In [None]:
import soundfile as sf

def load_wav(audio_file, target_sr=None):
    '''
    Load a wav file and, if `target_sr` is given, resample it to that sampling rate.
    '''
    with sf.SoundFile(audio_file, 'r') as f:
        # Transpose so that time is the last axis (a no-op for mono audio)
        samples = f.read(dtype='float32').transpose()
        sample_rate = f.samplerate
    if target_sr is not None and sample_rate != target_sr:
        # librosa resamples along the last axis by default
        samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=target_sr)
    return samples


device = spec_model.device
gst_audio = load_wav(ref["audio_filepath"], 44100)
gst_audio = torch.from_numpy(gst_audio).unsqueeze(0).to(device)
gst_audio_len = torch.tensor(gst_audio.shape[1], dtype=torch.long, device=device).unsqueeze(0)
gst_ref_spec, gst_ref_spec_lens = spec_model.preprocessor(input_signal=gst_audio, length=gst_audio_len)

Extract the reference speaker embedding from the precalculated Titanet based speaker embeddings.

In [None]:
sv_embeddings = np.load("./speaker_embeddings/val_titanet_embedding.npy")
sv_speaker_emb = torch.tensor(sv_embeddings[ref['embedding_id']].reshape(-1, 1)).cuda()

To convert spectrogram to waveform we will need a **vocoder**. Let's load HiFiGAN vocoder.

In [None]:
pretrained_vocoder = "tts_en_hifitts_hifigan_ft_fastpitch"
vocoder = HifiGanModel.from_pretrained(pretrained_vocoder)
vocoder.eval().cuda()

In [None]:
def infer(spec_gen_model, vocoder_model, str_input, gst_ref_spec=None, gst_ref_spec_lens=None, speaker_embedding=None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Args:
        spec_gen_model: Spectrogram generator model (FastPitch in our case)
        vocoder_model: Vocoder model (HiFiGAN in our case)
        str_input: Text input for the synthesis
        gst_ref_spec: Reference spectrogram from the same speaker for GST input
        gst_ref_spec_lens: Reference spectrogram length
        speaker_embedding: Precomputed speaker embedding (Titanet in our case) of the reference audio
    
    Returns:
        spectrogram and waveform of the synthesized audio.
    """
    spec_gen_model.cuda().eval()
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        spectrogram = spec_gen_model.generate_spectrogram(
            tokens=parsed, 
            gst_ref_spec=gst_ref_spec,
            gst_ref_spec_lens=gst_ref_spec_lens,
            speaker_embedding=speaker_embedding,
        )
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    else:
        raise Exception("None value was generated for spectrogram")
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

In [None]:
for val_record in val_manifest:
    str_input = val_record['text_normalized']
    speaker_id = val_record['speaker']
    _, audio_adapter = infer(spec_model, vocoder, str_input, gst_ref_spec, gst_ref_spec_lens, sv_speaker_emb.T)
    print(f"text: {str_input}")
    print("Original Audio")
    ipd.display(ipd.Audio(val_record['audio_filepath'], rate=44100))
    print("Generated Audio with Adapter")
    ipd.display(ipd.Audio(audio_adapter, rate=44100))
    print("-------------")

**Note:** The quality of the generated audio may not be very good because, for demo purposes, we trained on less than 5 minutes of data. It is recommended to use 15-30 minutes of data for the new speaker.