# FastPitch: Voice Merkel

## Generate audio samples

Training a FastPitch model from scratch takes 3 to 27 hours, depending on the type and number of GPUs; performance numbers can be found in the "Training performance results" section of `README.md`. To save time when running this notebook, we recommend downloading the pretrained FastPitch checkpoints from NGC for inference.

You can find the FP32 checkpoint at [NGC](https://ngc.nvidia.com/catalog/models/nvidia:fastpitch_pyt_fp32_ckpt_v1/files), and the AMP (Automatic Mixed Precision) checkpoint at [NGC](https://ngc.nvidia.com/catalog/models/nvidia:fastpitch_pyt_amp_ckpt_v1/files).

To synthesize audio, you will need a WaveGlow model, which generates waveforms from the mel-spectrograms produced by FastPitch. You can download a pre-trained WaveGlow AMP model at [NGC](https://ngc.nvidia.com/catalog/models/nvidia:waveglow256pyt_fp16).
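
To make the two-stage pipeline more concrete, the relationship between mel-spectrogram length and audio length can be sketched with simple arithmetic. The sketch assumes the vocoder emits roughly one hop of waveform samples per mel frame, and it uses the hop length (256) and sampling rate (16000 Hz) that appear in the inference log later in this notebook. The function name and the frame count are illustrative and not part of the FastPitch code.

```python
def mel_frames_to_seconds(n_frames, hop_length=256, sampling_rate=16000):
    """Approximate audio duration for a mel-spectrogram with n_frames frames.

    Assumes about hop_length waveform samples per frame (an approximation;
    the exact sample count depends on padding inside the vocoder).
    """
    n_samples = n_frames * hop_length
    return n_samples / sampling_rate


# Hypothetical example: 500 mel frames at a hop of 256 samples and 16 kHz.
duration = mel_frames_to_seconds(500)
print(f"{duration:.2f} s")  # prints "8.00 s"
```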

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
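import random

def get_random_examples():
    """Return the ID (text before the first '|') of one random validation entry."""
    with open('/raid/dsaha/Merkel_one/metadata_new_val.csv', 'r') as f:
        lines = f.readlines()
    # Each metadata line starts with "<id>|"; keep only the ID.
    return random.choice(lines).split('|')[0]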
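
If a repeatable choice is wanted, the same idea can be sketched with a seeded generator. This is only an illustration: `pick_example_id`, the `seed` parameter, and the sample lines are hypothetical, and only the `<id>|...` line layout is taken from the cell above.

```python
import random


def pick_example_id(lines, seed=None):
    """Return the ID field (text before the first '|') of one randomly chosen line."""
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    candidates = [line for line in lines if line.strip()]  # skip blank lines
    return rng.choice(candidates).split('|')[0]


# Hypothetical metadata lines; the text after '|' is a placeholder.
sample_lines = [
    "2017-04-29_0039|placeholder text\n",
    "2017-04-29_0040|placeholder text\n",
    "\n",
]
print(pick_example_id(sample_lines, seed=0))
```

Calling the function twice with the same seed returns the same ID, which helps when comparing synthesized and original audio across runs.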

In [3]:
!gpustat

[1m[37mltgpu2                       [m  Sat Jul 16 13:36:56 2022  [1m[30m470.57.02[m
[36m[0][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 24'C[m, [32m  0 %[m | [36m[1m[33m10889[m / [33m11019[m MB | [1m[30mneima[m([33m6605M[m) [1m[30mneima[m([33m4281M[m)
[36m[1][m [34mNVIDIA GeForce GTX 1080 Ti[m |[31m 23'C[m, [32m  0 %[m | [36m[1m[33m 3131[m / [33m11178[m MB | [1m[30mneima[m([33m3127M[m)
[36m[2][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 24'C[m, [32m  0 %[m | [36m[1m[33m 4559[m / [33m11019[m MB | [1m[30mtester[m([33m1067M[m) [1m[30mneima[m([33m3489M[m)
[36m[3][m [34mNVIDIA GeForce GTX 1080 Ti[m |[31m 20'C[m, [32m  0 %[m | [36m[1m[33m 3131[m / [33m11178[m MB | [1m[30mneima[m([33m3127M[m)
[36m[4][m [34mNVIDIA GeForce GTX 1080 Ti[m |[31m 22'C[m, [32m  0 %[m | [36m[1m[33m 3155[m / [33m11178[m MB | [1m[30mneima[m([33m3151M[m)
[36m[5][m [34mNVIDIA TITAN Xp           [m |[31m 18

In [4]:
import os
# Restrict this notebook to GPU 3; set this before any CUDA library is initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '3'
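
The GPU above was picked by hand from the `gpustat` listing. As a rough sketch, the choice could also be automated by parsing the header-less CSV that `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` prints. The helper names below are hypothetical, and the sample rows are copied from the first three GPUs in the listing above.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' rows into (index, free_MB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(','))
        gpus.append((index, total - used))
    return gpus


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory."""
    return max(parse_gpu_memory(csv_text), key=lambda gpu: gpu[1])[0]


def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU memory usage as header-less CSV (needs an NVIDIA driver)."""
    return subprocess.check_output(
        ['nvidia-smi',
         '--query-gpu=index,memory.used,memory.total',
         '--format=csv,noheader,nounits'],
        text=True,
    )


# Illustrative input: used / total MB of GPUs 0-2 from the gpustat output.
sample = "0, 10889, 11019\n1, 3131, 11178\n2, 4559, 11019\n"
print(pick_freest_gpu(sample))  # prints 1
```

On a machine with GPUs, `os.environ['CUDA_VISIBLE_DEVICES'] = str(pick_freest_gpu(query_nvidia_smi()))` would apply the result. Note that `nvidia-smi` numbers devices in PCI bus order, which may differ from CUDA's default ordering unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set.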

In [5]:
# ! mkdir -p output
# ! MODEL_DIR='../pretrained_models' ../scripts/download_fastpitch.sh
# ! MODEL_DIR='../pretrained_models' ../scripts/download_waveglow.sh

You can perform inference using the respective checkpoints, which are passed as the `--fastpitch` and `--waveglow` arguments. Next, you will use the FastPitch model to generate audio samples for the input text, including the basic version and variations in pace, fade-out, pitch transforms, etc.

In [6]:
import IPython

# store paths in aux variables
# fastp = '../pretrained_models/nvidia_fastpitch_200518.pt'
fastp = '../output_merkel/FastPitch_checkpoint_1000.pt'
# waveg = '../pretrained_models/nvidia_waveglow256pyt_fp16.pt'
# waveg = '../../Tacotron2/pretrained_models/nvidia_waveglow256pyt_fp16.pt'
# flags = f'--cuda --fastpitch {fastp} --waveglow {waveg} --wn-channels 256 --energy-conditioning --pace 0.726'# --sampling-rate 16000'

# waveg = '../../Tacotron2/output_wg_merkel/checkpoint_WaveGlow_last.pt'
# flags = f'--cuda --fastpitch {fastp} --waveglow {waveg} --wn-channels 512 --energy-conditioning --pace 0.726 --sampling-rate 16000'

waveg = '../../SpeechSynthesis/Tacotron2/output_wg_merkel_2/checkpoint_WaveGlow_last.pt'
flags = f'--cuda --fastpitch {fastp} --waveglow {waveg} --wn-channels 256 --energy-conditioning --sampling-rate 16000'

In [7]:
IS_GERMAN = ('merkel' in fastp)
if IS_GERMAN:
    flags += ' --text-cleaners german_phoneme_cleaners --symbol-set german_basic'
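
The flag string assembled in the two cells above could also be built by a small helper, which makes it easier to toggle options such as `--pace`. This is a minimal sketch: `build_flags` and its parameters are hypothetical, and it uses only flags that already appear in this notebook.

```python
def build_flags(fastpitch, waveglow, wn_channels=256, sampling_rate=16000,
                pace=None, german=False):
    """Assemble the command-line flag string passed to ../inference.py."""
    parts = [
        '--cuda',
        f'--fastpitch {fastpitch}',
        f'--waveglow {waveglow}',
        f'--wn-channels {wn_channels}',
        '--energy-conditioning',
        f'--sampling-rate {sampling_rate}',
    ]
    if pace is not None:
        parts.append(f'--pace {pace}')  # only added when a pace is requested
    if german:
        parts.append('--text-cleaners german_phoneme_cleaners --symbol-set german_basic')
    return ' '.join(parts)


fastp = '../output_merkel/FastPitch_checkpoint_1000.pt'
waveg = '../../SpeechSynthesis/Tacotron2/output_wg_merkel_2/checkpoint_WaveGlow_last.pt'
flags = build_flags(fastp, waveg, german='merkel' in fastp)
print(flags)
```

With these arguments the helper reproduces the same string as the cells above; passing `pace=0.726` would append the `--pace` option used in the commented-out variants.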

### 1. Basic speech synthesis

You need to create an input file with some text, or just enter the text in the cell below:

In [8]:
%%writefile text.txt
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves.

Overwriting text.txt

Run the script below to generate audio from the input text file:

In [9]:
id_ = "2017-04-29_0039"  # or pick a random ID: get_random_examples()

In [10]:
print(id_)
with open("/raid/dsaha/Merkel_one/{}/text.txt".format(id_)) as f:
    print(f.read())

2017-04-29_0039
Und ich freue mich auch, dass viele Wissenschaftler wieder zurückgekommen sind, weil wir stabile Rahmenbedingungen für Forschungsprojekte anbieten können an unseren außeruniversitären und universitären Forschungseinrichtungen.


In [11]:
# remove any previous result; -f avoids an error if the file does not exist yet
!rm -f output/original/audio_0.wav

In [12]:
# basic synthesis
!/srv/home/dsaha/miniconda/envs/videotts/bin/python ../inference.py {flags} -i /raid/dsaha/Merkel_one/{id_}/text.txt -o output/original

IPython.display.Audio("output/original/audio_0.wav")

DLL 2022-07-16 13:36:58.817211 - PARAMETER | input :  /raid/dsaha/Merkel_one/2017-04-29_0039/text.txt
DLL 2022-07-16 13:36:58.817262 - PARAMETER | output :  output/original
DLL 2022-07-16 13:36:58.817300 - PARAMETER | log_file : 
DLL 2022-07-16 13:36:58.817365 - PARAMETER | save_mels :  False
DLL 2022-07-16 13:36:58.817400 - PARAMETER | cuda :  True
DLL 2022-07-16 13:36:58.817433 - PARAMETER | cudnn_benchmark :  False
DLL 2022-07-16 13:36:58.817465 - PARAMETER | fastpitch :  ../output_merkel/FastPitch_checkpoint_1000.pt
DLL 2022-07-16 13:36:58.817496 - PARAMETER | waveglow :  ../../SpeechSynthesis/Tacotron2/output_wg_merkel_2/checkpoint_WaveGlow_last.pt
DLL 2022-07-16 13:36:58.817528 - PARAMETER | sigma_infer :  0.9
DLL 2022-07-16 13:36:58.817561 - PARAMETER | denoising_strength :  0.01
DLL 2022-07-16 13:36:58.817608 - PARAMETER | sampling_rate :  16000
DLL 2022-07-16 13:36:58.817641 - PARAMETER | stft_hop_length :  256
DLL 2022-07-16 13:36:58.817673 - PARAMETER | amp :  False
DLL 2022

In [13]:
IPython.display.Audio("/raid/dsaha/Merkel_one/{}/{}.wav".format(id_, id_))
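
Before comparing the synthesized and original recordings by ear, it can help to check that both files have the same sampling rate and a similar duration. The sketch below uses Python's standard `wave` module and assumes the files are uncompressed PCM WAV. `describe_wav` is a hypothetical helper, and the demo file is generated silence rather than real model output.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sampling_rate, n_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, 'rb') as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return rate, channels, frames / rate


# Demo on a generated file: 1 second of 16-bit mono silence at 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(16000)
    wav.writeframes(b'\x00\x00' * 16000)

print(describe_wav(demo_path))  # prints (16000, 1, 1.0)
```

In this notebook the same call could be pointed at `output/original/audio_0.wav` and at the reference recording to confirm that both use the 16000 Hz rate passed via `--sampling-rate`.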