In [1]:
from nemo.collections.tts.data.dataset import TTSDataset
import json
import nemo
import torch
import librosa
import numpy as np

from pathlib import Path
from tqdm.notebook import tqdm

from nemo.collections.tts.models.base import SpectrogramGenerator
from nemo.collections.tts.models import FastPitchModel

from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

from torch.utils.data.dataloader import DataLoader

In [None]:
import torch
torch.cuda.is_available()

In [None]:
# !./reinstall.sh dev
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget text-unidecode scipy==1.7.3
# !pip install phonemizer && apt-get update
# !apt-get install espeak-ng

The NeMo TTS configuration files are documented at https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/configs.html

### FastPitch

FastPitch is a non-autoregressive model for mel-spectrogram generation based on FastSpeech and conditioned on fundamental frequency contours. The model predicts pitch contours during inference ([paper](https://ieeexplore.ieee.org/abstract/document/9413889)).

### HiFiGAN

HiFiGAN is a generative adversarial network (GAN) model that generates audio from mel spectrograms. The generator uses transposed convolutions to upsample mel spectrograms to audio ([paper](https://arxiv.org/abs/2010.05646)).

## Dataset Preparation

* Creating manifests
* Normalizing text
* Phonemization
* Creating supplementary data

### Creating manifests 

I created the script `my_get_data.py`, which reads the `il_fu_mattia_pascal/metadata.csv` file provided with the dataset and generates the following fields for each datapoint:
1. `audio_filepath`: location of the wav file
2. `duration`: duration of the wav file
3. `text`: original text
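
A minimal sketch of how one such record could be built and written, using only the standard library. The helper names `make_record` and `write_manifest` are hypothetical and are not taken from `my_get_data.py`; the sketch also assumes uncompressed PCM wav files that the `wave` module can read.

```python
import json
import wave
from pathlib import Path


def make_record(wav_path, text):
    """Build one manifest record (hypothetical helper, not from my_get_data.py)."""
    # Duration in seconds = number of audio frames / sample rate
    with wave.open(str(wav_path), "rb") as wav_file:
        duration = wav_file.getnframes() / wav_file.getframerate()
    return {"audio_filepath": str(Path(wav_path)), "duration": duration, "text": text}


def write_manifest(records, manifest_path):
    """Write one JSON object per line, the layout of the manifests shown in this notebook."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes non-ASCII characters by default (e.g. \u00e9)
            f.write(json.dumps(record) + "\n")
```

Each line of the resulting file is an independent JSON object, so a manifest can be read back line by line with `json.loads`.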
    
After that, the script randomly splits the data into three manifests: `train_manifest.json`, `val_manifest.json` and `test_manifest.json`. With the settings used below, 10% of the datapoints go to the validation set, 20% to the test set and the remaining 70% to the training set.

`my_get_data_multi_speaker.py` works the same way but processes several datasets at once, producing multi-speaker manifests.
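
The random split can be sketched as follows. This is an illustration under assumptions, not the code of `my_get_data.py`: the function name `split_records` and the `seed` parameter are hypothetical, and the real script may round or shuffle differently.

```python
import random


def split_records(records, val_size=0.1, test_size=0.2, seed=0):
    """Shuffle the records and split them into train/val/test lists."""
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_size)
    n_test = round(len(shuffled) * test_size)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]  # everything that is left
    return train, val, test
```

With 100 records and the default sizes this yields 70 training, 10 validation and 20 test records, and every record lands in exactly one split.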

In [None]:
!python my_get_data.py \
    --data-root /home/giacomo/ \
    --val-size 0.1 \
    --test-size 0.2

### Normalizing text

The script above, `my_get_data.py`, also generates one more field for each datapoint:
- `normalized_text`: the text normalized with NeMo's text normalizer for Italian:
    ```
    nemo_text_processing.text_normalization.normalize.Normalizer(lang="it", input_case="cased", overwrite_cache=True, cache_dir=str(file_path / "cache_dir"))
    ```
    [NeMo-text-processing Italian grammars on GitHub](https://github.com/NVIDIA/NeMo-text-processing/tree/main/nemo_text_processing/text_normalization/it)
    
Here are some example records:
```json

{"audio_filepath": "/home/giacomoleonemaria/NeMo/il_fu_mattia_pascal/wavs/mattiapascal_10_pirandello_f000400.wav", "duration": 4.989813, "text": "\u2014 No! ora! \u2014 ribatt\u00e9 quegli, afferrandole un braccio e attirandola a s\u00e9.", "normalized_text": "\u2014 No! ora! \u2014 ribatt\u00e9 quegli, afferrandole un braccio e attirandola a s\u00e9."}

```

### Phonemization

In [None]:
!(python my_phonemizer.py \
    --manifests /home/giacomo/il_fu_mattia_pascal/test_manifest.json /home/giacomo/il_fu_mattia_pascal/val_manifest.json /home/giacomo/il_fu_mattia_pascal/train_manifest.json \
    --language it \
    --preserve-punctuation)


To better understand the phonemize method, refer to the docs [here](https://github.com/bootphon/phonemizer/blob/master/phonemizer/backend/base.py#L137).

`my_phonemizer.py` generates `train_manifest_phonemes.json`, `test_manifest_phonemes.json` and `val_manifest_phonemes.json` from the corresponding input manifests.

This effectively doubles the size of the dataset. Each original record maps to two records: one with the original `normalized_text` value and `is_phoneme` set to 0, and another with the phonemized text and `is_phoneme` set to 1.

Example:
```json
{"audio_filepath": "/home/giacomoleonemaria/NeMo/il_fu_mattia_pascal/wavs/mattiapascal_10_pirandello_f000400.wav", "duration": 4.989813, "text": "\u2014 No! ora! \u2014 ribatt\u00e9 quegli, afferrandole un braccio e attirandola a s\u00e9.", "normalized_text": "\u2014 No! ora! \u2014 ribatt\u00e9 quegli, afferrandole un braccio e attirandola a s\u00e9.", "is_phoneme": 0}

{"audio_filepath": "/home/giacomoleonemaria/NeMo/il_fu_mattia_pascal/wavs/mattiapascal_10_pirandello_f000400.wav", "duration": 4.989813, "text": "\u2014 No! ora! \u2014 ribatt\u00e9 quegli, afferrandole un braccio e attirandola a s\u00e9.", "normalized_text": "\u2014 n\u0254! ora! \u2014 ribat\u02d0e kwe\u028e\u026a, affer\u027eandole \u028an brat\u0283\u02d0o e at\u02d0irandola a se.", "is_phoneme": 1}
```
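
The record doubling can be sketched like this. The function `expand_with_phonemes` is hypothetical, and `phonemize_fn` is a stand-in for whatever phonemizer call `my_phonemizer.py` actually makes; only the field names come from the example records.

```python
def expand_with_phonemes(record, phonemize_fn):
    """Return two records: the grapheme version and its phonemized copy."""
    # Copy of the original record, flagged as plain text
    grapheme = dict(record, is_phoneme=0)
    # Same record with `normalized_text` replaced by its phonemized form
    phoneme = dict(
        record,
        normalized_text=phonemize_fn(record["normalized_text"]),
        is_phoneme=1,
    )
    return [grapheme, phoneme]


# Usage with a dummy phonemizer (upper-casing is NOT real phonemization)
example = {"text": "ora", "normalized_text": "ora"}
print(expand_with_phonemes(example, str.upper))
```

The `text` field is left untouched in both copies, which matches the example records shown here.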

### Creating supplementary data

To accelerate and stabilize training, we also need to extract the pitch of every audio file, estimate pitch statistics (mean and standard deviation) and pre-calculate the alignment prior matrices used by the alignment framework. All of this requires only a single pass over the data.
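
As a rough illustration of the pitch statistics step, the sketch below computes the four values over voiced frames only, treating a pitch of 0 as unvoiced. It uses plain Python lists instead of tensors so that it runs without torch; the function name `pitch_statistics` is hypothetical.

```python
import statistics


def pitch_statistics(pitch_contours):
    """Mean, std, min and max over the voiced frames (pitch != 0) of all contours."""
    voiced = [f0 for contour in pitch_contours for f0 in contour if f0 != 0]
    return (
        statistics.fmean(voiced),
        statistics.stdev(voiced),  # sample std (n - 1), like torch.Tensor.std() by default
        min(voiced),
        max(voiced),
    )


# Two toy contours in Hz; zeros mark unvoiced frames
print(pitch_statistics([[0, 100, 200], [0, 0, 300]]))
```

Excluding the zeros matters: counting unvoiced frames would pull the mean down and inflate the standard deviation.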

The `pre_calculate_supplementary_data` function defined later in this notebook takes the following arguments:
- `sup_data_path`: path to the folder that contains the supplementary data. If the folder or the supplementary data does not exist yet, it will be created.

- `sup_data_types`: types of supplementary data to be provided to the model.

- `text_tokenizer`: the text tokenizer object that we already created.

- `text_normalizer`: the text normalizer object that we already created.

- `text_normalizer_call_kwargs`: dictionary of arguments used when calling the text normalizer.

In [None]:
!python extract_sup_data.py \
        --config-path . \
        --config-name ds_for_fastpitch_align.yaml \
        ++dataloader_params.num_workers=6

Malavoglia:

PITCH_MEAN=188.20228576660156, PITCH_STD=60.07517623901367

PITCH_MIN=65.4063949584961, PITCH_MAX=2057.0478515625

Il fu Mattia Pascal:

PITCH_MEAN=159.78489685058594, PITCH_STD=31.194135665893555

PITCH_MIN=65.4063949584961, PITCH_MAX=651.6829223632812

In [None]:
from nemo_text_processing.text_normalization.normalize import Normalizer
# Text normalizer
text_normalizer = Normalizer(
    lang="it", 
    input_case="cased", 
    whitelist="/home/giacomo/NeMo-text-processing/nemo_text_processing/text_normalization/it/data/whitelist.tsv"
)

text_normalizer_call_kwargs = {
    "punct_pre_process": True,
    "punct_post_process": True
}

from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import ItalianPhonemesTokenizer
# Text tokenizer
text_tokenizer = ItalianPhonemesTokenizer()

In [None]:
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case='cased', lang='it')
written = "2 km/m dip. Fisica"
norm_it = normalizer.normalize(written, punct_post_process=True, verbose=True)
print(norm_it)

In [None]:
from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import ItalianPhonemesTokenizer
tokenizer = ItalianPhonemesTokenizer()
text = "E dunque? Ci sono poi tanti mezzi: di controllo!"
tokens = tokenizer(text)
print(tokens)

In [None]:
from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import ItalianCharsTokenizer
tokenizer = ItalianCharsTokenizer()
text = "E dunque? Ci sono poi tanti mezzi: di controllo!"
tokens = tokenizer(text)
print(tokens)

In [None]:
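def pre_calculate_supplementary_data(sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs):
    """Iterate once over the train and val data so the supplementary data gets created; return the train pitch statistics."""
    # init train and val dataloaders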
    stages = ["train", "val"]
    stage2dl = {}
    for stage in stages:
        ds = TTSDataset(
            manifest_filepath=f"/home/giacomo/dataset/MAILABS/it_IT/by_book/female/lisa_caputo/malavoglia/{stage}_manifest_phonemes.json",
            sample_rate=16000,
            sup_data_path=sup_data_path,
            sup_data_types=sup_data_types,
            n_fft=1024,
            win_length=1024,
            hop_length=256,
            window="hann",
            n_mels=80,
            lowfreq=0,
            highfreq=8000,
            text_tokenizer=text_tokenizer,
            text_normalizer=text_normalizer,
            text_normalizer_call_kwargs=text_normalizer_call_kwargs

        ) 
        stage2dl[stage] = torch.utils.data.DataLoader(ds, batch_size=1, collate_fn=ds._collate_fn, num_workers=1)

    # iteration over dataloaders
    pitch_mean, pitch_std, pitch_min, pitch_max = None, None, None, None
    for stage, dl in stage2dl.items():
        pitch_list = []
        for batch in tqdm(dl, total=len(dl)):
            tokens, tokens_lengths, audios, audio_lengths, attn_prior, pitches, pitches_lengths = batch
            pitch = pitches.squeeze(0)
            pitch_list.append(pitch[pitch != 0])

        if stage == "train":
            pitch_tensor = torch.cat(pitch_list)
            pitch_mean, pitch_std = pitch_tensor.mean().item(), pitch_tensor.std().item()
            pitch_min, pitch_max = pitch_tensor.min().item(), pitch_tensor.max().item()
            
    return pitch_mean, pitch_std, pitch_min, pitch_max

In [None]:
fastpitch_sup_data_path = "fastpitch_sup_data_folder"
sup_data_types = ["align_prior_matrix", "pitch"]

pitch_mean, pitch_std, pitch_min, pitch_max = pre_calculate_supplementary_data(
    fastpitch_sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs
)
print(pitch_mean, pitch_std, pitch_min, pitch_max)

In [None]:
pitch_max = 651.6829223632812
pitch_min = 65.4063949584961
pitch_mean = 159.78488159179688
pitch_std = 31.194143295288086

fastpitch_sup_data_path = "fastpitch_sup_data_folder"
sup_data_types = ["align_prior_matrix", "pitch"]

The same supplementary data and pitch statistics can also be produced with the `extract_sup_data.py` script shown earlier.

## Training

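Training is launched with `fastpitch.py`, as in the command below. All default parameters are set in the YAML config file (the command below uses `fastpitch_align_ITA.yaml`), and the `key=value` arguments on the command line override them.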

In [3]:
!(CUDA_VISIBLE_DEVICES=0 python fastpitch.py --config-path . --config-name=fastpitch_align_ITA.yaml \
  sample_rate=16000 \
  train_dataset=/home/giacomo/il_fu_mattia_pascal/train_manifest_phonemes.json \
  validation_datasets=/home/giacomo/il_fu_mattia_pascal/val_manifest_phonemes.json \
  sup_data_path=sup_data \
  exp_manager.exp_dir=/home/giacomo/il_fu_mattia_pascal/checkpoint \
  trainer.check_val_every_n_epoch=1 \
)

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2023-10-12 23:35:25 exp_manager:384] Experiments will be logged at /home/giacomo/il_fu_mattia_pascal/checkpoint/FastPitch/2023-10-12_23-35-25
[NeMo I 2023-10-12 23:35:25 exp_manager:823] TensorboardLogger has been set up
Creating ClassifyFst grammars. This might take some time...
[NeMo I 2023-10-12 23:35:42 dataset:228] Loading dataset from /home/giacomo/il_fu_mattia_pascal/train_manifest_phonemes.json.
0it [00:00, ?it/s][NeMo W 2023-10-12 23:35:42 tts_tokenizers:429] Text: [e che– povero impiegato– aveva vissuto sempre lontano dalla famiglia, un po' qua, un po' là.] contains unknown char: [–]. Symbol will be skipped.
[NeMo W 2023-10-12 23:35:42 tts_tokenize

Note:
1. We use `CUDA_VISIBLE_DEVICES=0` to limit training to a single GPU.
2. For debugging, you may also set the environment variables `HYDRA_FULL_ERROR=1` and `CUDA_LAUNCH_BLOCKING=1`.
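
If you prefer to launch training from Python rather than from a shell cell, the command and the variables from the note can be assembled as in the sketch below. The helpers `build_training_command` and `build_environment` are hypothetical conveniences, not part of NeMo or Hydra, and the overrides shown are only a subset of those used in this notebook.

```python
import os


def build_training_command(config_name, overrides, script="fastpitch.py"):
    """Assemble the argument list for the training script with key=value overrides."""
    cmd = ["python", script, "--config-path", ".", f"--config-name={config_name}"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


def build_environment(gpu_id=0, debug=False):
    """Copy the current environment and add the variables from the note above."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    if debug:
        env.update(HYDRA_FULL_ERROR="1", CUDA_LAUNCH_BLOCKING="1")
    return env


cmd = build_training_command(
    "fastpitch_align_ITA.yaml",
    {"sample_rate": 16000, "sup_data_path": "sup_data", "trainer.check_val_every_n_epoch": 1},
)
env = build_environment(gpu_id=0, debug=True)
print(" ".join(cmd))
# subprocess.run(cmd, env=env, check=True) would start the training run
```

Passing the variables through `env` keeps them scoped to the training process instead of changing the notebook's own environment.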

## Evaluating FastPitch + pretrained HiFi-GAN

Let's evaluate the quality of the FastPitch model generated so far using a HiFi-GAN model pre-trained on English.

In [None]:
import IPython.display as ipd
from nemo.collections.tts.models import HifiGanModel, FastPitchModel
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

In [None]:
test = "E non le pare che fosse rosso, ad esempio, il lanternone della Virt\u00f9 pagana?" # text input to the model
test_id = "mattiapascal_12_pirandello3_f000058" # identifier for the audio corresponding to the test text
data_path = "/home/giacomo/il_fu_mattia_pascal/wavs/" # path to dataset folder with wav files from original dataset
seed = 1234

In [None]:
def evaluate_spec_fastpitch_ckpt(spec_gen_model, v_model, test):
    with torch.no_grad():
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = False
        parsed = spec_gen_model.parse(str_input=test, normalize=True)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed)
        print(spectrogram.size())
        audio = v_model.convert_spectrogram_to_audio(spec=spectrogram)

    spectrogram = spectrogram.to('cpu').numpy()[0]
    audio = audio.to('cpu').numpy()[0]
    audio = audio / np.abs(audio).max()
    return audio, spectrogram

In [None]:
# load hifigan models
hfg_ngc = "tts_en_lj_hifigan_ft_mixerttsx" # NGC pretrained model name: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_hifigan 
vocoder_model = HifiGanModel.from_pretrained(hfg_ngc, strict=False).eval().cuda()

In [None]:
# load fastpitch
import glob, os
fastpitch_model_path = sorted(
    glob.glob("FastPitch.ckpt"), 
    key=os.path.getmtime)[-1] # path_to_fastpitch_nemo_or_ckpt

if ".nemo" in fastpitch_model_path:
    spec_gen_model = FastPitchModel.restore_from(fastpitch_model_path).eval().cuda()
else:
    spec_gen_model = FastPitchModel.load_from_checkpoint(checkpoint_path=fastpitch_model_path).eval().cuda()

In [None]:
audio, spectrogram = evaluate_spec_fastpitch_ckpt(spec_gen_model, vocoder_model, test)

# visualize the spectrogram
if spectrogram is not None:
    imshow(spectrogram, origin="lower")
    plt.show()

# audio
print("original audio")
ipd.display(ipd.Audio(data_path+test_id+'.wav', rate=16000))
print("predicted audio")
ipd.display(ipd.Audio(audio, rate=16000))

## Finetuning HiFi-GAN

We can improve speech quality by finetuning HiFi-GAN on mel spectrograms synthesized by FastPitch.

In [None]:
test_audio_text = "E non le pare che fosse rosso, ad esempio, il lanternone della Virt\u00f9 pagana?"
test_audio_filepath = "/home/giacomo/il_fu_mattia_pascal/wavs/mattiapascal_12_pirandello3_f000058.wav" 

In [None]:
from matplotlib.pyplot import imshow
from nemo.collections.tts.models import FastPitchModel
from matplotlib import pyplot as plt
import librosa
import librosa.display
import torch
import soundfile as sf
import numpy as np
from nemo.collections.tts.parts.utils.tts_dataset_utils import BetaBinomialInterpolator

def load_wav(audio_file):
    with sf.SoundFile(audio_file, 'r') as f:
        samples = f.read(dtype='float32')
    return samples.transpose()

def plot_logspec(spec, axis=None):    
    librosa.display.specshow(
        librosa.amplitude_to_db(spec, ref=np.max),
        y_axis='linear', 
        x_axis="time",
        fmin=0, 
        fmax=8000,
        ax=axis
    )

In [None]:
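# NOTE: restore_from expects a .nemo file; for a .ckpt use FastPitchModel.load_from_checkpoint as in the evaluation section
spec_model = FastPitchModel.restore_from(fastpitch_model_path).eval().cuda()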

### Original mel spectrogram generated from original audio file

In [None]:
print("loading original melspec")
# load at the dataset sample rate instead of librosa's 22050 Hz default
y, sr = librosa.load(test_audio_filepath, sr=16000)
# change n_fft, win_length, hop_length, n_mels and fmax below based on your specific config file
spectrogram = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80, fmin=0, fmax=8000
)
print("spectrogram shape = ", spectrogram.shape)
plot_logspec(spectrogram)
plt.show()

### Mel spectrogram predicted from FastPitch

In [None]:
print("loading fastpitch melspec via generate_spectrogram")
with torch.no_grad():
    text = spec_model.parse(test_audio_text, normalize=False)
    spectrogram = spec_model.generate_spectrogram(
      tokens=text, 
      speaker=None,
    )
spectrogram = spectrogram.to('cpu').numpy()[0]
plot_logspec(spectrogram)
print("spectrogram shape = ", spectrogram.shape)
plt.show()

**Note**: The predicted spectrogram above is shorter, in frames, than the ground truth (498 frames), because at inference FastPitch uses its own predicted durations. To finetune HiFi-GAN we need mel spectrograms predicted by FastPitch with the ground-truth alignment and durations.
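
To check the length mismatch numerically, the ground-truth frame count can be estimated from the number of audio samples. The sketch assumes a centered STFT as in librosa's default, where the frame count is `1 + num_samples // hop_length`; the NeMo preprocessor may differ by a frame, and both function names are hypothetical.

```python
def expected_mel_frames(num_samples, hop_length=256):
    """Frame count of a centered STFT: 1 + num_samples // hop_length."""
    return 1 + num_samples // hop_length


def frame_mismatch(predicted_frames, num_samples, hop_length=256):
    """Predicted minus ground-truth frame count (0 means the lengths agree)."""
    return predicted_frames - expected_mel_frames(num_samples, hop_length)


# A 5-second clip at 16 kHz, and a prediction of 300 frames
print(expected_mel_frames(16000 * 5))
print(frame_mismatch(300, 16000 * 5))
```

A non-zero mismatch means the predicted mel frames cannot be paired one-to-one with the audio samples, which is why the ground-truth alignment is needed for vocoder finetuning.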

### Mel spectrogram predicted from FastPitch with groundtruth alignment and duration 

In [None]:
print("loading fastpitch melspec via forward method with groundtruth alignment and duration")
with torch.no_grad():
    device = spec_model.device
    beta_binomial_interpolator = BetaBinomialInterpolator()
    text = spec_model.parse(test_audio_text, normalize=False)
    text_len = torch.tensor(text.shape[-1], dtype=torch.long, device=device).unsqueeze(0)
    audio = load_wav(test_audio_filepath)
    audio = torch.from_numpy(audio).unsqueeze(0).to(device)
    audio_len = torch.tensor(audio.shape[1], dtype=torch.long, device=device).unsqueeze(0)
    spect, spect_len = spec_model.preprocessor(input_signal=audio, length=audio_len)
    attn_prior = torch.from_numpy(
      beta_binomial_interpolator(spect_len.item(), text_len.item())
    ).unsqueeze(0).to(text.device)
    spectrogram = spec_model.forward(
      text=text, 
      input_lens=text_len, 
      spec=spect, 
      mel_lens=spect_len, 
      attn_prior=attn_prior,
      speaker=None,
    )[0]
spectrogram = spectrogram.to('cpu').numpy()[0]
print("spectrogram shape = ", spectrogram.shape)
plot_logspec(spectrogram)
plt.show()

- Finetuning without ground-truth alignment and duration carries artifacts (noise) from the original audio into the vocoder input, which then show up as noise in the vocoder output.
- <b> On the other hand, `Mel spectrogram predicted from FastPitch with groundtruth alignment and duration` gives the best results because it enables HiFi-GAN to learn the mel spectrograms generated by FastPitch together with duration distributions closer to real-world (i.e. ground-truth) durations. </b>

From an implementation perspective, we follow the same process described in [Finetuning FastPitch for a new speaker](FastPitch_Finetuning.ipynb): take the latest checkpoint from FastPitch training and predict a spectrogram for each record in the train, test and validation manifests.

NeMo provides a script for this, [scripts/dataset_processing/tts/generate_mels.py](https://raw.githubusercontent.com/nvidia/NeMo/main/scripts/dataset_processing/tts/generate_mels.py), which generates the mel spectrograms and creates new JSON manifests with the suffix `_mel`, adding a new key `"mel_filepath"` to each record. For example, `train_manifest.json` corresponds to `train_manifest_mel.json`. You can run the following CLI to obtain the new JSON manifests.

In [None]:
!python generate_mels.py \
    --cpu \
    --input-json-manifests /home/giacomo/il_fu_mattia_pascal/train_manifest.json /home/giacomo/il_fu_mattia_pascal/test_manifest.json /home/giacomo/il_fu_mattia_pascal/val_manifest.json \
    --fastpitch-model-ckpt {fastpitch_model_path} \
    --output-json-manifest-root ./
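
Once the script has finished, it can be worth checking that every record in a `_mel` manifest points to a mel file that exists. The sketch below is a hypothetical sanity check, not part of NeMo; it relies only on the `"mel_filepath"` key and on the one-JSON-object-per-line manifest layout.

```python
import json
from pathlib import Path


def check_mel_manifest(manifest_path):
    """Return the records of a `_mel` manifest whose mel file is missing."""
    missing = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            mel_path = record.get("mel_filepath")
            # Flag records without the key or with a path that is not a file
            if mel_path is None or not Path(mel_path).is_file():
                missing.append(record)
    return missing
```

For example, `check_mel_manifest("train_manifest_mel.json")` returning an empty list would mean that all mel files referenced by that manifest were found.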