# TTS Inference

TODO

# License

> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
>
>     http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# # If you're using Google Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget unidecode
# BRANCH = 'main'
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]

## Imports

Please run the cell below to set up this notebook.

In [None]:
import IPython.display as ipd
import librosa
import librosa.display
import numpy as np
import torch
from matplotlib import pyplot as plt
%matplotlib inline

# Reduce logging messages for this notebook
from nemo.utils import logging
logging.setLevel(logging.ERROR)

from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.helpers.helpers import regulate_len

# Load the models from NGC
fastpitch = FastPitchModel.from_pretrained("tts_en_fastpitch").eval().cuda()
hifigan = HifiGanModel.from_pretrained("tts_hifigan").eval().cuda()
sr = 22050

def display_pitch(audio, pitch, sr=22050):
    fig, ax = plt.subplots(figsize=(12, 6))
    spec = np.abs(librosa.stft(audio[0], n_fft=1024))
    ax.plot(pitch.cpu().numpy()[0], color='cyan', linewidth=1)
    librosa.display.specshow(np.log(spec+1e-12), y_axis='log')
    ipd.display(ipd.Audio(audio, rate=sr))
    plt.show()

## Duration Control

This section applies to models that use a duration predictor module. This module is part of the Length Regulator introduced in FastSpeech [1]. The following NeMo models support duration predictors:

- [FastPitch](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/fastpitch.py)
- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)
- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)
- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)
- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)
- [Glow-TTS](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts)

While each model has its own implementation of this duration predictor, all of them follow a simple convolutional architecture. The input is the encoded tokens, and the output is one value per token that represents how many decoder frames correspond to that token. It is essentially a hard attention mechanism.
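
As a rough illustration of that hard alignment, the sketch below repeats each token's value for as many frames as its predicted duration. It uses plain Python lists instead of tensors, and the `expand_by_durations` name is hypothetical rather than part of NeMo.

```python
def expand_by_durations(token_values, durations):
    """Repeat each per-token value durations[i] times to get one value per frame."""
    if len(token_values) != len(durations):
        raise ValueError("token_values and durations must have the same length")
    frames = []
    for value, n_frames in zip(token_values, durations):
        frames.extend([value] * n_frames)
    return frames


# Three tokens lasting 2, 0 and 3 decoder frames respectively
frames = expand_by_durations(["h", "e", "y"], [2, 0, 3])
print(frames)  # ['h', 'h', 'y', 'y', 'y']
```

A token with duration 0 simply contributes no frames, and the output length always equals the sum of the durations.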

Since each model outputs a duration value per token, it is simple to slow down or speed up the speech by increasing or decreasing these values. Consider the following:

```python
def regulate_len(durations, pace=1.0):
    durations = durations.float() / pace
    # The output from the duration predictor module is still a float
    # If we want the speech to be faster, we can increase the pace and make each token duration shorter
    # Alternatively we can slow down the pace by decreasing the pace parameter
    return durations.long()  # Lastly, we need to make the durations integers for subsequent processes
```
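
To make the effect of `pace` concrete, the following sketch applies the same divide-then-truncate rule to a made-up list of durations and converts the frame count to seconds. The hop length of 256 samples per frame is an assumption for illustration (check your model's configuration), and the helper names are hypothetical.

```python
SAMPLE_RATE = 22050  # sampling rate used throughout this notebook
HOP_LENGTH = 256     # ASSUMED number of audio samples per spectrogram frame


def scale_durations(durations, pace=1.0):
    """Divide each duration by pace and truncate to an integer frame count."""
    return [int(d / pace) for d in durations]


def frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Convert a number of spectrogram frames to a length in seconds."""
    return n_frames * hop_length / sample_rate


durations = [3, 5, 4, 8]  # made-up per-token durations, in frames
for pace in (1.0, 1.25, 0.5):
    scaled = scale_durations(durations, pace)
    total = sum(scaled)
    print(f"pace={pace}: {scaled} -> {total} frames, {frames_to_seconds(total):.3f} s")
```

Because each duration is truncated separately, the total at a pace of 1.25 is 15 frames rather than the 16 that dividing the original 20 frames by 1.25 would suggest.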

Let's try this out with FastPitch.

In [None]:
#Define what we want the model to say
input_string = "Hey, I am speaking at different paces!"  # Feel free to change it and experiment

# Define a quick helper function to go from string to audio
def str_to_audio(inp, pace=1.0):
    with torch.no_grad():
        tokens = fastpitch.parse(inp)
        spec = fastpitch.generate_spectrogram(tokens=tokens, pace=pace)
        audio = hifigan.convert_spectrogram_to_audio(spec=spec).to('cpu').numpy()
    return audio

# Let's run fastpitch normally
audio = str_to_audio(input_string)
print(f"This is fastpitch speaking at the regular pace of 1.0. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can speed up the speech by adjusting the pace
audio = str_to_audio(input_string, pace=1.2)
print(f"This is fastpitch speaking at the faster pace of 1.2. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can slow down the speech by adjusting the pace
audio = str_to_audio(input_string, pace=0.75)
print(f"This is fastpitch speaking at the slower pace of 0.75. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

## Pitch Control

Newer spectrogram generator models predict pitch as part of synthesis (FastPitch, for example, predicts one pitch value per token). Since these models predict pitch, we can adjust the predicted pitch in the same way as the predicted durations in the previous section. The following NeMo models support pitch control:

- [FastPitch](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/fastpitch.py)
- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)
- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)
- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)
- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)

### FastPitch

As in the previous section, we will focus on FastPitch. FastPitch differs from some other models in that it predicts pitch normalized by the speaker's statistics (mean 0, standard deviation 1), whereas other models predict the unnormalized pitch directly. Looking at a simplified version of the FastPitch model, we see:

```python
# Predict pitch
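pitch_predicted = self.pitch_predictor(enc_out, enc_mask)  # Predicts a pitch that is normalized with speaker statistics
if pitch is None:
    pitch = pitch_predicted  # Use the prediction unless the caller supplies a pitch
pitch_emb = self.pitch_emb(pitch.unsqueeze(1))  # A simple 1D convolution to map the float pitch to an embedding

enc_out = enc_out + pitch_emb.transpose(1, 2)  # We add the pitch embedding to the encoder output
# Expand each token to its predicted number of frames (durs_predicted comes from the duration predictor, not shown)
len_regulated, dec_lens = regulate_len(durs_predicted, enc_out, pace)
spec, *_ = self.decoder(input=len_regulated, seq_lens=dec_lens)  # We send the expanded sum to the decoder to get the spectrogram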
```
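
The normalization mentioned above is a plain standardization with the speaker's pitch statistics. The sketch below converts between Hz and normalized pitch using the LJSpeech mean and standard deviation that appear in the code cells of this notebook; the function names are hypothetical.

```python
PITCH_MEAN = 214.72202032404294  # LJSpeech pitch mean in Hz, as used in this notebook
PITCH_STD = 65.72037058703644    # LJSpeech pitch standard deviation in Hz


def normalize_pitch(pitch_hz):
    """Map pitch values in Hz to normalized units (0 is the speaker's mean)."""
    return [(p - PITCH_MEAN) / PITCH_STD for p in pitch_hz]


def denormalize_pitch(pitch_norm):
    """Map normalized pitch values back to Hz."""
    return [p * PITCH_STD + PITCH_MEAN for p in pitch_norm]


# 0 is the mean pitch; +1 and -1 are one standard deviation above and below it
print(denormalize_pitch([0.0, 1.0, -1.0]))
```

A normalized value of 0 therefore corresponds to roughly 215 Hz, and each unit corresponds to roughly 66 Hz.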

Let's look at `pitch_predicted` for a sample text by running the cell below. For the input `Hey, what is my pitch?`, you should get an image that looks like this:
<img src="files/fastpitch-pitch.png">
Notice that the last word `pitch` has an increase in pitch to stress that it is a question.

In [None]:
import librosa
import librosa.display
from matplotlib import pyplot as plt
import numpy as np
from nemo.collections.tts.helpers.helpers import regulate_len
%matplotlib inline

#Define what we want the model to say
input_string = "Hey, what is my pitch?"  # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec, _, durs_predicted, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio = hifigan.convert_spectrogram_to_audio(spec=spec).to('cpu').numpy()
    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch, _ = regulate_len(durs_predicted, pitch.unsqueeze(-1))
    pitch = pitch.squeeze(-1)
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch = pitch * 65.72037058703644 + 214.72202032404294

# Let's plot the predicted pitch and how it affects the predicted audio
fig, ax = plt.subplots(figsize=(12, 6))
spec = np.abs(librosa.stft(audio[0], n_fft=1024))
ax.plot(pitch.cpu().numpy()[0], color='cyan', linewidth=1)
librosa.display.specshow(np.log(spec+1e-12), y_axis='log')
ipd.display(ipd.Audio(audio, rate=sr))

## Pitch Manipulation

Now that we have seen how the pitch affects the predicted spectrogram, we can adjust it to add some effects. We will explore four different manipulations:

1) Pitch shift

2) Pitch flatten

3) Pitch inversion

4) Pitch amplification
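
In terms of the normalized pitch that FastPitch consumes, each of these manipulations is a one-line transformation. The sketch below is a minimal illustration on plain Python lists with hypothetical function names and a made-up contour; the cells that follow apply the same ideas to the model's tensors.

```python
PITCH_STD = 65.72037058703644  # LJSpeech pitch standard deviation in Hz


def shift(pitch_norm, hz):
    """Shift by hz Hertz; in normalized units this is hz / PITCH_STD."""
    return [p + hz / PITCH_STD for p in pitch_norm]


def flatten(pitch_norm):
    """Set every value to 0, which is the speaker's mean pitch."""
    return [0.0 for _ in pitch_norm]


def invert(pitch_norm):
    """Mirror the contour around the speaker's mean pitch."""
    return [-p for p in pitch_norm]


def amplify(pitch_norm, factor=1.5):
    """Exaggerate deviations from the mean pitch by a positive factor."""
    return [p * factor for p in pitch_norm]


contour = [-0.5, 0.25, 1.0]  # made-up normalized pitch values
print(flatten(contour))  # [0.0, 0.0, 0.0]
print(invert(contour))   # [0.5, -0.25, -1.0]
print(amplify(contour))  # [-0.75, 0.375, 1.5]
print(shift(contour, -50))
```

Only the shift needs the standard deviation, because it is specified in Hz while the other three operate directly on normalized values.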

In [None]:
# First we look at pitch shift. To shift the pitch up or down by some Hz, we can just add or subtract as needed
# Let's shift the pitch down by 50 Hz
# First, let's run the previous example and then shift down

#Define what we want the model to say
input_string = "Hey, what is my pitch?"  # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch = pitch * 65.72037058703644 + 214.72202032404294
    pitch_norm = pitch
    
    # Now let's pitch shift down by 50Hz
    pitch_shift = pitch - 50
    pitch = (pitch_shift - 214.72202032404294) / 65.72037058703644
    
    # Now we can pass it to the model
    spec_shift, _, durs_shift_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=pitch, speaker=None)
    audio_shift = hifigan.convert_spectrogram_to_audio(spec=spec_shift).to('cpu').numpy()
    
    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch_shift, _ = regulate_len(durs_shift_pred, pitch_shift.unsqueeze(-1))
    pitch_shift = pitch_shift.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    

# Let's see both results
print("The first unshifted sample")
display_pitch(audio_norm, pitch_norm)

print("The second shifted sample. This sample is much deeper than the previous.")
display_pitch(audio_shift, pitch_shift)

In [None]:
# Second, we look at pitch flattening. To flatten the pitch, we just set the normalized pitch to 0 (the speaker's mean).
# First, let's run the previous example and then compare it to flattening

#Define what we want the model to say
input_string = "Hey, what is my pitch?"  # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch = pitch * 65.72037058703644 + 214.72202032404294
    pitch_norm = pitch
    
    # Now let's set the pitch to 0
    pitch_flat = pitch * 0
    
    # Now we can pass it to the model
    spec_flat, _, durs_flat_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=pitch_flat, speaker=None)
    audio_flat = hifigan.convert_spectrogram_to_audio(spec=spec_flat).to('cpu').numpy()
    pitch_flat = pitch_flat + 214.72202032404294
    
    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch_flat, _ = regulate_len(durs_flat_pred, pitch_flat.unsqueeze(-1))
    pitch_flat = pitch_flat.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm)

print("The second flattened sample. This sample is more monotone than the previous.")
display_pitch(audio_flat, pitch_flat)

In [None]:
# Third, we look at pitch inversion. To invert the pitch, we just multiply it by -1.
# First, let's run the previous example and then compare it to inversion

#Define what we want the model to say
input_string = "Hey, what is my pitch?"  # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch_norm = pitch * 65.72037058703644 + 214.72202032404294
    
    # Now let's invert the pitch
    pitch_invert = pitch * -1
    
    # Now we can pass it to the model
    spec_inv, _, durs_inv_pred, _, pitch_inv, *_ = fastpitch(text=tokens, durs=None, pitch=pitch_invert, speaker=None)
    audio_inv = hifigan.convert_spectrogram_to_audio(spec=spec_inv).to('cpu').numpy()
    pitch_inv = pitch_invert * 65.72037058703644 + 214.72202032404294
    
    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch_inv, _ = regulate_len(durs_inv_pred, pitch_inv.unsqueeze(-1))
    pitch_inv = pitch_inv.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm)

print("The second inverted sample. This sample sounds less like a question and more like a statement.")
display_pitch(audio_inv, pitch_inv)

In [None]:
# Lastly, we look at pitch amplification. To amplify the pitch, we just multiply it by a positive constant.
# First, let's run the previous example and then compare it to amplification.

#Define what we want the model to say
input_string = "Hey, what is my pitch?"  # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch_norm = pitch * 65.72037058703644 + 214.72202032404294
    
    # Now let's amplify the pitch
    pitch_amp = pitch * 1.5
    
    # Now we can pass it to the model
    spec_amp, _, durs_amp_pred, _, _, *_ = fastpitch(text=tokens, durs=None, pitch=pitch_amp, speaker=None)
    audio_amp = hifigan.convert_spectrogram_to_audio(spec=spec_amp).to('cpu').numpy()
    pitch_amp = pitch_amp * 65.72037058703644 + 214.72202032404294
    
    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch_amp, _ = regulate_len(durs_amp_pred, pitch_amp.unsqueeze(-1))
    pitch_amp = pitch_amp.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm)

print("The second amplified sample.")
display_pitch(audio_amp, pitch_amp)

## Putting it all together

Now that we understand how to control the duration and pitch of TTS models, we can adjust the voice to sound more solemn (slower speed + lower pitch) or more excited (higher speed + higher pitch).
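
Both effects combine a pace change with an affine change to the normalized pitch, `pitch * scale + offset`. As a sketch, the hypothetical `Style` container below holds the pace, scale, and offset values used in the two cells that follow and reports how many Hz each offset term adds, assuming the LJSpeech standard deviation used elsewhere in this notebook.

```python
from dataclasses import dataclass

PITCH_STD = 65.72037058703644  # LJSpeech pitch standard deviation in Hz


@dataclass(frozen=True)
class Style:
    pace: float    # > 1 speaks faster, < 1 speaks slower
    scale: float   # multiplies the normalized pitch (range of the contour)
    offset: float  # added to the normalized pitch (overall height)

    def apply(self, pitch_norm):
        """Apply the affine change to a list of normalized pitch values."""
        return [p * self.scale + self.offset for p in pitch_norm]

    def offset_hz(self):
        """Express the offset term in Hz."""
        return self.offset * PITCH_STD


solemn = Style(pace=0.9, scale=0.75, offset=-0.75)
excited = Style(pace=1.1, scale=1.7, offset=0.5)

print(f"solemn:  {solemn.offset_hz():+.1f} Hz")   # -49.3 Hz
print(f"excited: {excited.offset_hz():+.1f} Hz")  # +32.9 Hz
```

A scale below 1 narrows the pitch range and a negative offset lowers the voice, which together with the slower pace gives the solemn reading; the excited preset does the opposite on all three axes.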

In [None]:
#Define what we want the model to say
input_string = "I want to pass on my condolences for your loss."

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch_norm, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
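    # Make the voice more solemn: narrow the pitch range (x0.75) and lower it by 0.75 standard deviations
    new_pitch = pitch_norm * 0.75 - 0.75
    new_pitch[0][-5] += 0.2  # Slightly raise the pitch of the fifth-to-last token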
    spec_sol, _, durs_sol_pred, _, _, _, _, _, _, pitch_sol = fastpitch(
        text=tokens, durs=None, pitch=new_pitch, speaker=None, pace=0.9)
    audio_sol = hifigan.convert_spectrogram_to_audio(spec=spec_sol).to('cpu').numpy()
    
    pitch_sol_t = pitch_sol
    
    pitch_sol, _ = regulate_len(durs_sol_pred, pitch_sol.unsqueeze(-1), pace=0.9)
    pitch_sol = pitch_sol.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    pitch_norm = pitch_norm * 65.72037058703644 + 214.72202032404294
    pitch_sol = pitch_sol * 65.72037058703644 + 214.72202032404294
    
# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm)


print("The second solemn sample")
display_pitch(audio_sol, pitch_sol)

In [None]:
#Define what we want the model to say
input_string = "Congratulations on your promotion."

# Run inference to get spectrogram and pitch
with torch.no_grad():
    tokens = fastpitch.parse(input_string)
    spec_norm, _, durs_norm_pred, _, pitch_norm, *_ = fastpitch(text=tokens, durs=None, pitch=None, speaker=None)
    audio_norm = hifigan.convert_spectrogram_to_audio(spec=spec_norm).to('cpu').numpy()
    
    # Make the voice more excited: widen the pitch range (x1.7) and raise it by 0.5 standard deviations
    new_pitch = pitch_norm * 1.7 + 0.5
    spec_sol, _, durs_sol_pred, _, _, _, _, _, _, pitch_sol = fastpitch(
        text=tokens, durs=None, pitch=new_pitch, speaker=None, pace=1.1)
    audio_sol = hifigan.convert_spectrogram_to_audio(spec=spec_sol).to('cpu').numpy()
    
    pitch_sol_t = pitch_sol
    
    pitch_sol, _ = regulate_len(durs_sol_pred, pitch_sol.unsqueeze(-1), pace=1.1)
    pitch_sol = pitch_sol.squeeze(-1)
    pitch_norm, _ = regulate_len(durs_norm_pred, pitch_norm.unsqueeze(-1))
    pitch_norm = pitch_norm.squeeze(-1)
    pitch_norm = pitch_norm * 65.72037058703644 + 214.72202032404294
    pitch_sol = pitch_sol * 65.72037058703644 + 214.72202032404294
    
# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm)


print("The second excited sample")
display_pitch(audio_sol, pitch_sol)

## References

[1] https://arxiv.org/abs/1905.09263

| Model | Module |
|---|---|
|[FastPitch](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/fastpitch.py)|[TemporalPredictor](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/fastpitch.py#L117)|
|[FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)|[TemporalPredictor](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/fastpitch.py#L117)|
|[FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)|[VarianceAdaptor](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/fastspeech2.py)|
|[FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)|[VarianceAdaptor](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/fastspeech2.py)|
|[TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)| [ConvASREncoder](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/modules/conv_asr.py#L53) |
|[Glow-TTS](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts)| [TextEncoder](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/glow_tts.py#L63) |

In [None]:
tokens[0]

In [None]:
pitch_sol_t[0]