<a href="https://colab.research.google.com/github/Morozhkaa/Project-TTS/blob/main/4_%D0%9A%D0%BE%D0%BF%D0%B8%D1%8F_%D0%B1%D0%BB%D0%BE%D0%BA%D0%BD%D0%BE%D1%82%D0%B0_%22FastPitch_Finetuning_ipynb%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning FastPitch for a new speaker

In this tutorial, we will finetune a single-speaker FastPitch (with alignment) model on 5 minutes of a new speaker's data. We will finetune the model parameters only on the new speaker's text and speech pairs (though see the section at the end to learn more about mixing speaker data).

We will download the training data, then generate and run a training command to finetune FastPitch on 5 minutes of data, and finally synthesize audio from the trained checkpoint.

A final section will describe approaches to improve audio quality past this notebook.

## License

> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
>
>     http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [1]:
BRANCH = 'main'
!apt-get install sox libsndfile1 ffmpeg
!pip install wget unidecode pynini==2.1.4
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1 is already the newest version (1.0.28-4ubuntu0.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
The following packages were automatically installed and are no longer required:
  libnvidia-common-460 nsight-compute-2020.2.0
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 67 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3

## Downloading data

In [2]:
!tar -xzf 6097_5_mins.tar.gz

Let's take 2 samples from the dataset and split them off into a validation set. All other samples go into the training set.

Since the paths in the manifest are relative, we also create a symbolic link to the audio folder so that `audio/` resolves to the correct directory.

In [3]:
!cat ./6097_5_mins/manifest.json | tail -n 2 > ./6097_manifest_dev_ns_all_local.json
!cat ./6097_5_mins/manifest.json | head -n -2 > ./6097_manifest_train_dur_5_mins_local.json
!ln -s ./6097_5_mins/audio audio
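
The same split can also be done in Python. The following is a minimal sketch, assuming the manifest is a JSON-lines file with one sample per line (which is what the `head`/`tail` commands above rely on). `split_manifest` is a hypothetical helper name, not part of NeMo.

```python
from pathlib import Path


def split_manifest(manifest_path, train_path, dev_path, n_dev=2):
    """Write the last `n_dev` entries to dev_path and all others to train_path."""
    text = Path(manifest_path).read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]  # skip blank lines
    if not 0 < n_dev < len(lines):
        raise ValueError("n_dev must be between 1 and the number of entries - 1")
    train, dev = lines[:-n_dev], lines[-n_dev:]
    Path(train_path).write_text("\n".join(train) + "\n", encoding="utf-8")
    Path(dev_path).write_text("\n".join(dev) + "\n", encoding="utf-8")
    return len(train), len(dev)


# Example call with the file names used in this notebook:
# split_manifest("./6097_5_mins/manifest.json",
#                "./6097_manifest_train_dur_5_mins_local.json",
#                "./6097_manifest_dev_ns_all_local.json")
```

Unlike the shell version, this variant fails loudly if the manifest has too few entries to leave anything for training.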

Let's also download the pretrained checkpoint that we want to finetune from. NeMo saves downloaded checkpoints under `~/.cache`, so let's copy the checkpoint to our current directory.

*Note: please check that `home_path` refers to your home folder. Otherwise, change it manually.*

In [4]:
home_path = !(echo $HOME)
home_path = home_path[0]
print(home_path)

/root


In [5]:
import os
import json

import torch
import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

from nemo.collections.tts.models import FastPitchModel
FastPitchModel.from_pretrained("tts_en_fastpitch")

from pathlib import Path
nemo_files = [p for p in Path(f"{home_path}/.cache/torch/NeMo/").glob("**/tts_en_fastpitch_align.nemo")]
print(f"Copying {nemo_files[0]} to ./")
Path("./tts_en_fastpitch_align.nemo").write_bytes(nemo_files[0].read_bytes())

[NeMo W 2022-05-18 06:59:05 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-05-18 06:59:05 experimental:28] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-05-18 06:59:05 experimental:28] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


[NeMo I 2022-05-18 06:59:06 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo to /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo
[NeMo I 2022-05-18 06:59:18 common:789] Instantiating model from pre-trained checkpoint
[NeMo I 2022-05-18 06:59:19 tokenize_and_classify:92] Creating ClassifyFst grammars.


[NeMo W 2022-05-18 06:59:37 g2ps:85] apply_to_oov_word=None, it means that some of words will remain unchanged if they are not handled by one of rule in self.parse_one_word(). It is useful when you use tokenizer with set of phonemes and chars together, otherwise it can be not.
[NeMo W 2022-05-18 06:59:37 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: 8000
      max_duration: null
      min_duration: 0.1
      ignore_file: nu

[NeMo I 2022-05-18 06:59:37 features:200] PADDING: 1
[NeMo I 2022-05-18 06:59:49 save_restore_connector:243] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo.
Copying /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo to ./


187023360

To finetune the FastPitch model on the manifests created above, we use the `examples/tts/fastpitch_finetune.py` script with the `fastpitch_align_v1.05.yaml` configuration.

Let's grab those files.

In [6]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch_finetune.py

!mkdir -p conf \
&& cd conf \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_align_v1.05.yaml \
&& cd ..

--2022-05-18 07:00:03--  https://raw.githubusercontent.com/nvidia/NeMo/main/examples/tts/fastpitch_finetune.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1751 (1.7K) [text/plain]
Saving to: ‘fastpitch_finetune.py’


2022-05-18 07:00:03 (26.0 MB/s) - ‘fastpitch_finetune.py’ saved [1751/1751]

--2022-05-18 07:00:03--  https://raw.githubusercontent.com/nvidia/NeMo/main/examples/tts/conf/fastpitch_align_v1.05.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6625 (6.5K) [text/plain]
Saving to: ‘fastpitch_align_v1.05.yaml’


20

We also need some additional files for training (see the `FastPitch_MixerTTS_Training.ipynb` tutorial for more details). Let's download these, too.

In [7]:
# additional files
!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.01 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-030921 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv \
&& cd ..

--2022-05-18 07:00:14--  https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/cmudict-0.7b_nv22.01
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3721964 (3.5M) [text/plain]
Saving to: ‘cmudict-0.7b_nv22.01’


2022-05-18 07:00:14 (44.5 MB/s) - ‘cmudict-0.7b_nv22.01’ saved [3721964/3721964]

--2022-05-18 07:00:14--  https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/heteronyms-030921
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3132 (3.1K) [text/plain]
Saving to: ‘heteronyms-030

## Finetuning FastPitch

We can now train our model with the following command:

**NOTE: This will take about 50 minutes on Colab's K80 GPUs.**

In [8]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [9]:
# TODO(oktai15): remove +model.text_tokenizer.add_blank_at=true when we update FastPitch checkpoint
!(python fastpitch_finetune.py --config-name=fastpitch_align_v1.05.yaml \
  train_dataset=./6097_manifest_train_dur_5_mins_local.json \
  validation_datasets=./6097_manifest_dev_ns_all_local.json \
  sup_data_path=./fastpitch_sup_data \
  phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.01 \
  heteronyms_path=tts_dataset_files/heteronyms-030921 \
  whitelist_path=tts_dataset_files/lj_speech.tsv \
  exp_manager.exp_dir=./gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins \
  +init_from_nemo_model=./tts_en_fastpitch_align.nemo \
  +trainer.max_steps=1000 ~trainer.max_epochs \
  trainer.check_val_every_n_epoch=25 \
  model.train_ds.dataloader_params.batch_size=10 model.validation_ds.dataloader_params.batch_size=10 \
  model.n_speakers=1 model.pitch_mean=228.6525415 model.pitch_std=34.4261 \
  model.pitch_fmin=108.620689 model.pitch_fmax=479.347826 model.optim.lr=2e-4 \
  ~model.optim.sched model.optim.name=adam trainer.devices=1 trainer.strategy=null \
)

[NeMo W 2022-05-18 07:01:25 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-05-18 07:01:26 experimental:28] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-05-18 07:01:26 experimental:28] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-05-18 07:01:28 exp_manager:287] Experiments will be logged at gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins/FastPitch/2022-05-18_07-01-27
[NeMo I 2022-05

Let's take a closer look at the training command:

* `--config-name=fastpitch_align_v1.05.yaml`
  * We first tell the script what config file to use.

* `train_dataset=./6097_manifest_train_dur_5_mins_local.json 
  validation_datasets=./6097_manifest_dev_ns_all_local.json 
  sup_data_path=./fastpitch_sup_data`
  * We tell the script what manifest files to train and eval on, as well as where supplementary data is located (or will be calculated and saved during training if not provided).
  
* `phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.01 
  heteronyms_path=tts_dataset_files/heteronyms-030921 
  whitelist_path=tts_dataset_files/lj_speech.tsv`
  * We tell the script where the phoneme dictionary, the heteronyms file and the whitelist are located. These are the additional files we downloaded earlier, and they are used in preprocessing the data.
  
* `exp_manager.exp_dir=./gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins`
  * Where we want to save our log files, TensorBoard file, checkpoints, and more. In this notebook, the directory is on the mounted Google Drive.

* `+init_from_nemo_model=./tts_en_fastpitch_align.nemo`
  * We tell the script what checkpoint to finetune from.

* `+trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25`
  * For this experiment, we tell the script to train for 1000 training steps/iterations rather than specifying a number of epochs to run. Since the config file specifies `max_epochs` instead, we need to remove that using `~trainer.max_epochs`.

* `model.train_ds.dataloader_params.batch_size=10 model.validation_ds.dataloader_params.batch_size=10`
  * Set batch sizes for the training and validation data loaders.

* `model.n_speakers=1`
  * The number of speakers in the data. There is only 1 for now, but we will revisit this parameter later in the notebook.

* `model.pitch_mean=228.6525415 model.pitch_std=34.4261 model.pitch_fmin=108.620689 model.pitch_fmax=479.347826`
  * For the new speaker, we need to define new pitch hyperparameters for better audio quality.
  * The values above are the ones passed in the training command in this notebook.
  * For speaker 6097 from the Hi-Fi TTS dataset, `model.pitch_mean=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512` work.
  * For speaker 92, we suggest `model.pitch_mean=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`.
  * `fmin` and `fmax` are hyperparameters to librosa's `pyin` function. We recommend tweaking these per speaker.
  * After `fmin` and `fmax` are defined, the pitch mean and standard deviation can be computed from the pitch values extracted over the training audio.

* `model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`
  * For fine-tuning, we lower the learning rate.
  * We use a fixed learning rate of 2e-4.
  * We switch from the lamb optimizer to the adam optimizer.

* `trainer.devices=1 trainer.strategy=null`
  * For this notebook, we default to 1 GPU, which means that we do not need DDP (distributed data parallel).
  * If you have the compute resources, feel free to scale this up to the number of free GPUs you have available.
  * Please remove the `trainer.strategy=null` override if you intend to train on multiple GPUs.
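
To illustrate how the pitch mean and standard deviation mentioned above might be obtained, here is a minimal sketch. It assumes that unvoiced frames are marked as NaN (as `librosa.pyin` does) and that a population standard deviation is wanted; both helper names are hypothetical, and the `frame_length` and `hop_length` defaults simply mirror the `win_length` and `hop_length` values printed in the dataset config above. It is not necessarily how the values in this notebook were produced.

```python
import math
import statistics


def pitch_stats(f0_tracks):
    """Return (mean, std) of F0 over the voiced frames of several tracks.

    Unvoiced frames are expected to be NaN and are ignored.
    """
    voiced = [float(f) for track in f0_tracks for f in track if not math.isnan(f)]
    if not voiced:
        raise ValueError("no voiced frames found")
    # Population standard deviation; this choice is an assumption.
    return statistics.fmean(voiced), statistics.pstdev(voiced)


def extract_f0(wav_path, fmin, fmax, sample_rate=22050,
               frame_length=1024, hop_length=256):
    """Estimate an F0 track for one audio file with librosa's pyin."""
    import librosa  # third-party; imported lazily, only needed here

    audio, sr = librosa.load(wav_path, sr=sample_rate)
    f0, _voiced_flag, _voiced_prob = librosa.pyin(
        audio, fmin=fmin, fmax=fmax, sr=sr,
        frame_length=frame_length, hop_length=hop_length,
    )
    return f0
```

A typical use would be to call `extract_f0` on every training file with the chosen `fmin` and `fmax`, then pass the resulting tracks to `pitch_stats` and use the two numbers as `model.pitch_mean` and `model.pitch_std`.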

## Synthesize Samples from Finetuned Checkpoints

Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFi-GAN vocoder trained on LJSpeech.

We also define a helper function, `infer`, that runs both models on a text string.

In [10]:
from nemo.collections.tts.models import HifiGanModel

vocoder = HifiGanModel.from_pretrained("tts_hifigan")
vocoder = vocoder.eval().cuda()

[NeMo I 2022-05-18 07:09:49 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2022-05-18 07:10:08 common:789] Instantiating model from pre-trained checkpoint


[NeMo W 2022-05-18 07:10:12 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2022-05-18 07:10:12 modelPT:156] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2022-05-18 07:10:12 features:200] PADDING: 0


[NeMo W 2022-05-18 07:10:12 features:178] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2022-05-18 07:10:12 features:200] PADDING: 0
[NeMo I 2022-05-18 07:10:14 save_restore_connector:243] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


In [11]:
def infer(spec_gen_model, vocoder_model, str_input):
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

In [12]:
last_ckpt = "/content/gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins/FastPitch/2022-05-13_18-13-53/checkpoints/FastPitch--v_loss=0.7351-epoch=24-last.ckpt"

spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)
spec_model.eval().cuda()

spec, audio = infer(spec_model, vocoder, "Enter your text here")
ipd.display(ipd.Audio(audio, rate=22050))

[NeMo I 2022-05-18 07:11:39 tokenize_and_classify:92] Creating ClassifyFst grammars.


[NeMo W 2022-05-18 07:11:59 g2ps:85] apply_to_oov_word=None, it means that some of words will remain unchanged if they are not handled by one of rule in self.parse_one_word(). It is useful when you use tokenizer with set of phonemes and chars together, otherwise it can be not.
[NeMo W 2022-05-18 07:11:59 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: ./6097_manifest_train_dur_5_mins_local.json
      sample_rate: 22050
      sup_data_path: ./fastpitch_sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: 8000
      max_duration: null
      min_duration: 0.1
      ignore_file: null
      trim:

[NeMo I 2022-05-18 07:11:59 features:200] PADDING: 1
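
The checkpoint path above is hard-coded to one particular run directory. As a possible alternative, the sketch below looks up the most recently modified `*-last.ckpt` file under an experiment directory. It assumes the `<exp_dir>/FastPitch/<timestamp>/checkpoints/` layout seen in the path above; `find_last_checkpoint` is a hypothetical helper, not a NeMo function.

```python
from pathlib import Path


def find_last_checkpoint(exp_dir):
    """Return the most recently modified '*-last.ckpt' file under exp_dir."""
    candidates = list(Path(exp_dir).glob("**/checkpoints/*-last.ckpt"))
    if not candidates:
        raise FileNotFoundError(f"no '*-last.ckpt' file found under {exp_dir}")
    # Pick the newest file by modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example with the experiment directory used in this notebook:
# last_ckpt = str(find_last_checkpoint(
#     "/content/gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins"))
```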


In [13]:
!pip install fastapi nest-asyncio pyngrok uvicorn -q

import os

import soundfile as sf
import unidecode
import nest_asyncio
import uvicorn
from pyngrok import ngrok
from fastapi.responses import Response, FileResponse

from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Set to True once models_init() has loaded both models.
is_initialized = False



In [14]:
synthesis_models = {
    'en': {
        'spec_gen': 'tts_en_fastpitch',
        'vocoder': 'tts_hifigan'
    }
}

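def models_init():
    """Load the finetuned FastPitch checkpoint and the HiFi-GAN vocoder once.

    Returns 200 on success (or if already initialized) and 404 on failure.
    """
    global is_initialized
    if is_initialized:
        return 200
    try:
        ckpt = "/content/gdrive/MyDrive/ljspeech_to_6097_no_mixing_5_mins/FastPitch/2022-05-13_18-13-53/checkpoints/FastPitch--v_loss=0.7351-epoch=24-last.ckpt"
        synthesis_models['en']['spec_gen'] = FastPitchModel.load_from_checkpoint(ckpt).eval().cuda()
        synthesis_models['en']['vocoder'] = HifiGanModel.from_pretrained('tts_hifigan').eval().cuda()
        is_initialized = True
        return 200
    except Exception as err:
        # Report the reason instead of silently swallowing every error.
        print(f"Model initialization failed: {err}")
        return 404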

def process_text(text: str) -> str:
    text = text.replace('&quest', '?')
    return text

def normalize_text(txt: str) -> str:
    valid_chars = (" ", "'", "!", "?", ".", ",", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
    "v", "w", "x", "y", "z")
    new_txt = unidecode.unidecode(txt.lower().strip())
    res_arr = []
    for c in new_txt:
        if c in valid_chars:
            res_arr.append(c)
        else:
            res_arr.append(' ')
    res = ''.join(res_arr).strip()
    return ' '.join(res.split())

In [15]:
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Translator for message exchange API is currently on."}

@app.get('/test_audio')
async def test_audio():
    return FileResponse('test_audio.wav')

@app.get("/init")
async def init():
    init_result = models_init()
    # Return the actual status code rather than the literal string "init_result".
    return {"status": init_result}

@app.get("/synthesize/{text}")
async def synthesize(text: str):
    pr_text = process_text(text)
    normalized_text = normalize_text(pr_text)
    # Use one relative path for both the cache lookup and the write.
    wav_path = f'audio/{normalized_text}.wav'

    # Serve the cached file if this text was synthesized before.
    if os.path.isfile(wav_path):
        return FileResponse(wav_path)

    # Load the models on first use; fail cleanly if that is not possible.
    if not is_initialized:
        models_init()
    if not is_initialized:
        return Response(status_code=503)

    spec_gen = synthesis_models['en']['spec_gen']
    vocoder = synthesis_models['en']['vocoder']

    with torch.no_grad():
        parsed = spec_gen.parse(normalized_text)
        spectrogram = spec_gen.generate_spectrogram(tokens=parsed)
        waveform = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    audio = waveform[0].detach().cpu().numpy()
    sf.write(wav_path, audio, 22050)
    return FileResponse(wav_path)
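
The endpoint above uses the normalized text itself as the cache file name. Since `normalize_text` keeps characters such as `?` and `.`, and the text can be arbitrarily long, one possible alternative is to derive the file name from a hash of the text. The sketch below is an illustration only; `cache_path_for` is a hypothetical helper and the 16-character digest length is an arbitrary choice.

```python
import hashlib
from pathlib import Path


def cache_path_for(text, cache_dir="audio"):
    """Map normalized text to a fixed-length WAV file name inside cache_dir."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.wav"
```

The same text always maps to the same path, so the cache lookup in `synthesize` would keep working if `wav_path` were computed this way.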


In [None]:
ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)
nest_asyncio.apply()
uvicorn.run(app, port=8000)

Public URL: http://45ac-35-201-201-113.ngrok.io


INFO:     Started server process [71]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)


INFO:     91.228.178.74:0 - "GET / HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
[NeMo I 2022-05-18 07:14:09 tokenize_and_classify:92] Creating ClassifyFst grammars.


[NeMo W 2022-05-18 07:14:28 g2ps:85] apply_to_oov_word=None, it means that some of words will remain unchanged if they are not handled by one of rule in self.parse_one_word(). It is useful when you use tokenizer with set of phonemes and chars together, otherwise it can be not.
[NeMo W 2022-05-18 07:14:28 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: ./6097_manifest_train_dur_5_mins_local.json
      sample_rate: 22050
      sup_data_path: ./fastpitch_sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: 8000
      max_duration: null
      min_duration: 0.1
      ignore_file: null
      trim:

[NeMo I 2022-05-18 07:14:28 features:200] PADDING: 1
[NeMo I 2022-05-18 07:14:28 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2022-05-18 07:14:28 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2022-05-18 07:14:28 common:789] Instantiating model from pre-trained checkpoint


[NeMo W 2022-05-18 07:14:32 modelPT:149] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2022-05-18 07:14:32 modelPT:156] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2022-05-18 07:14:32 features:200] PADDING: 0


[NeMo W 2022-05-18 07:14:32 features:178] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2022-05-18 07:14:32 features:200] PADDING: 0
[NeMo I 2022-05-18 07:14:34 save_restore_connector:243] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.9.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
INFO:     91.228.178.74:0 - "GET /init HTTP/1.1" 200 OK
INFO:     212.111.29.197:0 - "GET /init HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /test_audio HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /test_audio HTTP/1.1" 200 OK
INFO:     95.168.222.132:0 - "GET /test_audio HTTP/1.1" 200 OK
INFO:     95.168.222.132:0 - "GET /test_audio HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /synthesize/wonderful%20weather%20today%21 HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /synthesize/wonderful%20weather%20today%21 HTTP/1.1" 200 OK
INFO:     95.168.222.6:0 - "GET /synthesize/wonderful%20weather%20today%21 HTTP/1.1" 200 OK
INFO:     91.228.178.74:0 - "GET /synthesize/wonderful%20weather%20today%21 HTTP/1.1" 200 OK
INFO: 
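
The request lines in the log above show percent-encoded paths such as `/synthesize/wonderful%20weather%20today%21`. A client can build such a URL as sketched below; `build_synthesize_url` is a hypothetical helper, and the base URL is whatever public ngrok address or local address the server printed at startup.

```python
from urllib.parse import quote


def build_synthesize_url(base_url, text):
    """Return the /synthesize URL for `text`, percent-encoding the path segment."""
    return f"{base_url.rstrip('/')}/synthesize/{quote(text, safe='')}"


# Example (requires the server from this notebook to be running):
# from urllib.request import urlopen
# url = build_synthesize_url("http://127.0.0.1:8000", "wonderful weather today!")
# with urlopen(url) as resp, open("reply.wav", "wb") as out:
#     out.write(resp.read())
```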