# Demo: NVIDIA NeMo TTS quickstart with a simple explanation

This script summarizes the NeMo TTS tutorials. It covers the same material as the tutorials listed at:
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/starthere/tutorials.html

**THIS SCRIPT IS FORMATTED TO RUN CORRECTLY ON GOOGLE COLAB**

#### The first step is to install the required dependencies. Remember to select a GPU runtime (Runtime -> Change runtime type -> Hardware accelerator).

In [None]:
BRANCH = 'r1.11.0'
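# System libraries for audio I/O and format conversion (-y skips the confirmation prompt)
!apt-get install -y sox libsndfile1 ffmpeg
# NeMo itself, installed from the branch selected above
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
# Python helpers, with pynini pinned to 2.1.4
!pip install wget unidecode pynini==2.1.4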
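
The models loaded later are moved to the GPU with `.cuda()`, so it can help to confirm that the runtime actually exposes one before going further. The following is a minimal sketch that relies only on the `nvidia-smi` command-line tool being on the path when a GPU is attached. The helper name `gpu_visible` is hypothetical and not part of NeMo or Colab.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    # Without the driver tools on the PATH there is nothing to query.
    if shutil.which("nvidia-smi") is None:
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

if gpu_visible():
    print("GPU runtime detected")
else:
    print("No GPU found: switch the runtime before loading the models")
```

Once `torch` is installed, `torch.cuda.is_available()` gives the same answer from inside Python.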

#### Once everything is installed, we start by importing the required libraries. This example uses the FastPitch + HiFi-GAN models.

In [None]:
import IPython.display as ipd
import librosa
import librosa.display
import numpy as np
import torch
from matplotlib import pyplot as plt
%matplotlib inline

# Reduce logging messages for this notebook
from nemo.utils import logging
logging.setLevel(logging.ERROR)

from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.helpers.helpers import regulate_len

# Load the models from NGC
fastpitch = FastPitchModel.from_pretrained("tts_en_fastpitch").eval().cuda()
hifigan = HifiGanModel.from_pretrained("tts_hifigan").eval().cuda()
sr = 22050

# Define a helper function to go from string to audio
def str_to_audio(inp, pace=1.0, durs=None, pitch=None):
    with torch.no_grad():
        tokens = fastpitch.parse(inp)
        spec, _, durs_pred, _, pitch_pred, *_ = fastpitch(text=tokens, durs=durs, pitch=pitch, speaker=None, pace=pace)
        audio = hifigan.convert_spectrogram_to_audio(spec=spec).to('cpu').numpy()
    return spec, audio, durs_pred, pitch_pred

# Define a helper function to plot spectrograms with pitch and display the audio
def display_pitch(audio, pitch, sr=22050, durs=None):
    fig, ax = plt.subplots(figsize=(12, 6))
    spec = np.abs(librosa.stft(audio[0], n_fft=1024))
    # Check to see if pitch has been unnormalized
    if torch.abs(torch.mean(pitch)) <= 1.0:
        # Unnormalize the pitch with LJSpeech's mean and std
        pitch = pitch * 65.72037058703644 + 214.72202032404294
    # Check to see if pitch has been expanded to the spec length yet
    if len(pitch) != spec.shape[0] and durs is not None:
        pitch = regulate_len(durs, pitch.unsqueeze(-1))[0].squeeze(-1)
    # Plot and display audio, spectrogram, and pitch
    ax.plot(pitch.cpu().numpy()[0], color='cyan', linewidth=1)
    librosa.display.specshow(np.log(spec+1e-12), y_axis='log')
    ipd.display(ipd.Audio(audio, rate=sr))
    plt.show()

#### First, let's observe how synthesis speed varies with the length of the input string. We will use inputs of 1, 5, and about 25 words.

In [None]:
input_string = "One"
_, audio, *_ = str_to_audio(input_string)
ipd.display(ipd.Audio(audio, rate=sr))

In [None]:
input_string = "This will be one word"
_, audio, *_ = str_to_audio(input_string)
ipd.display(ipd.Audio(audio, rate=sr))

In [None]:
input_string = "The following string contains five times the words utilized in the previous string. We will be able to evaluate how much longer it takes to compute"
_, audio, *_ = str_to_audio(input_string)
ipd.display(ipd.Audio(audio, rate=sr))
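
The three cells above play the audio but do not report how long synthesis took. One way to put a number on it is to wrap the call in a small timer. The sketch below uses a hypothetical helper, `time_call`, and a stand-in function, `fake_synthesize`, that only sleeps in proportion to the word count, so that it runs without NeMo or a GPU. In the notebook, `fake_synthesize` would be replaced by `str_to_audio`.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

def fake_synthesize(text):
    """Stand-in for str_to_audio: sleeps 1 ms per word (illustration only)."""
    time.sleep(0.001 * len(text.split()))

prompts = [
    "One",
    "This will be one word",
    "The following string contains five times the words utilized in the "
    "previous string. We will be able to evaluate how much longer it takes "
    "to compute",
]

for prompt in prompts:
    seconds = time_call(fake_synthesize, prompt)
    print(f"{len(prompt.split()):>2} words: {seconds:.4f} s")
```

Taking the minimum over several runs reduces the effect of one-off warm-up costs on the first call. With the real `str_to_audio`, the final `.to('cpu')` copy waits for the GPU work to finish, so the wall-clock reading should cover the whole synthesis.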

#### Now we can modify the speaking rate. We have the original audio, a faster version, and a slower version.

In [None]:
# Define what we want the model to say
input_string = "Hey, I am speaking at different paces!"  # Feel free to change it and experiment

# Let's run fastpitch normally
_, audio, *_ = str_to_audio(input_string)
print(f"This is fastpitch speaking at the regular pace of 1.0. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can speed up the speech by adjusting the pace
_, audio, *_ = str_to_audio(input_string, pace=1.3)
print(f"This is fastpitch speaking at the faster pace of 1.3. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can slow down the speech by adjusting the pace
_, audio, *_ = str_to_audio(input_string, pace=0.7)
print(f"This is fastpitch speaking at the slower pace of 0.7. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))
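
To see why a pace above 1.0 shortens the clip and a pace below 1.0 lengthens it, here is a minimal sketch of the arithmetic. It assumes that each predicted per-token duration (in spectrogram frames) is divided by `pace` and rounded to the nearest frame, which is our reading of NeMo's `regulate_len`, and that the vocoder emits one hop of samples per frame. The hop length of 256, the helper names, and the duration values are assumptions made for illustration only.

```python
import math

SAMPLE_RATE = 22050  # same value as `sr` in this notebook
HOP_LENGTH = 256     # assumption: samples generated per spectrogram frame

def scaled_durations(durs, pace):
    """Divide each per-token frame count by pace and round to the nearest frame."""
    return [int(math.floor(d / pace + 0.5)) for d in durs]

def audio_seconds(durs, pace=1.0):
    """Estimate the clip length from the total number of frames after scaling."""
    frames = sum(scaled_durations(durs, pace))
    return frames * HOP_LENGTH / SAMPLE_RATE

# Made-up frame counts for five tokens, for illustration only
example_durs = [4, 7, 3, 10, 6]

for pace in (1.0, 1.3, 0.7):
    frames = sum(scaled_durations(example_durs, pace))
    print(f"pace={pace}: {frames} frames, about {audio_seconds(example_durs, pace):.3f} s")
```

Because of the per-token rounding, the measured length only approximately equals the original length divided by `pace`.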