# Speech Synthesis

We've reached the final step of the voice assistant pipeline! After a user asks a question or says a command to a voice assistant, the assistant is expected to vocalize a response. This step is known as **speech synthesis**, or generating speech from text.

<img src="https://www.izunnaokpala.com/wp-content/uploads/2017/01/text-to-speech.jpg" width=300>

Accurate text-to-speech converters are particularly vital for people who have visual impairments or difficulty with reading for any reason. **What might be some challenges in implementing accurate text-to-speech converters?**


**Why is this challenging?**

1. Text can be composed of abbreviations, which can be ambiguous. Example: "in" can be the word "in" or an abbreviation for "inches."

2. Text can contain numbers which can be read aloud differently. Example: "123" can be read as "one two three" or "one hundred twenty three."

3. Text can contain heteronyms (words that are spelled identically but have different pronunciations and meanings). Example: "read" is pronounced "reed" in the present tense and "red" in the past tense.

4. A single word can be pronounced in numerous ways (depending on context or just in general). Example: "either" has two pronunciations: [EE-ther] or [AHY-ther].

Can you think of other reasons why converting text into speech is a difficult task?





## Background Information

Here are helpful definitions to understand, pertaining to linguistics.

*   **Phone**: smallest discrete segment of sound

*   **Phoneme**: the smallest unit of sound that distinguishes one word from another


> Example: "hope" is a three-phoneme word, composed of the "h" sound, the long "o" sound, and the "p" sound.

> There are 44 phonemes in the English language.


*   **Grapheme**: a letter or group of letters used to represent a phoneme in writing

> Example: "team" is composed of the graphemes: \<t\>, \<ea\>, \<m\>

> Confusingly, some phonemes (sounds) can be spelled with different graphemes (letters), so there are more than 44 graphemes in English.

> Also, the same grapheme can correspond to different phonemes. Example: the "oo" grapheme represents one sound in "boot" and a different sound in "book."

<img src="https://images.squarespace-cdn.com/content/v1/5382de75e4b092b699c496fd/1423466660597-Q1FT622P9BIER0XPAR8X/ke17ZwdGBToddI8pDm48kI54ULE27xmTcPfCQGs1vVQUqsxRUqqbr1mOJYKfIPR7LoDQ9mXPOjoJoqy81S2I8N_N4V1vUb5AoIIIbLZhVYxCRW4BPu10St3TBAUQYVKc47-IGNlq2cBoYxnpJKpAk6A4IA1Flr_lnNM8eXymD8dvywz9opra1nexca3u_Jrz/phonemes-graphemes-letters.jpg" width=350>

* **Prosody**: elements of speech like intonation, tone, stress, duration, and rhythm.

> These linguistic elements can reflect features of a speaker like their emotional state, whether they asked a question or said a command, and if they're being sarcastic or ironic.

> Example: "I never said she stole my money" can have 7 different meanings & interpretations depending on intonation and which word is stressed by the speaker.



### Exercise 1

How many phonemes are in the word "leather?"

## Pipeline

1. **Text to Words**

The first step of speech synthesis is called pre-processing or *normalization*. The goal is to reduce the kinds of ambiguity discussed earlier, such as abbreviations and numbers.

Statistical probability techniques like Hidden Markov Models (HMM) or neural networks are used for this step.

For example, if the word "year" occurs in the same sentence as "1843," a model would assign higher probability that the number is pronounced "eighteen forty three."
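
To make the idea concrete, here is a minimal rule-based sketch of number normalization. It is not a statistical model like the HMMs or neural networks mentioned above: it simply checks whether the cue word "year" appears in the sentence and picks a reading accordingly. All function names here (`as_cardinal`, `as_year`, `normalize`) are made up for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def as_cardinal(n):
    """Spell out an integer in the range 0-9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Spell out a four-digit number the way years are often read aloud."""
    if 2000 <= n <= 2009:
        return as_cardinal(n)
    first, last = divmod(n, 100)
    if last == 0:
        return f"{two_digits(first)} hundred"
    if last < 10:
        return f"{two_digits(first)} oh {ONES[last]}"
    return f"{two_digits(first)} {two_digits(last)}"


def normalize(sentence):
    """Replace digit strings with words, using 'year' as a context cue."""
    has_year_cue = "year" in re.findall(r"[a-z]+", sentence.lower())

    def spell(match):
        digits = match.group()
        if len(digits) > 4:
            # Long digit strings are read one digit at a time.
            return " ".join(ONES[int(d)] for d in digits)
        if has_year_cue and len(digits) == 4:
            return as_year(int(digits))
        return as_cardinal(int(digits))

    return re.sub(r"\d+", spell, sentence)


print(normalize("In the year 1843 she was born"))
# In the year eighteen forty three she was born
print(normalize("The box weighs 1843 grams"))
# The box weighs one thousand eight hundred forty three grams
```

A real system would weigh many cues at once instead of one hard-coded keyword, which is where the probabilistic models come in.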

2. **Words to Phonemes**

One option would be to store a huge dictionary mapping each written word to its phonemes, but there are two issues with this: (1) the dictionary takes a lot of memory, and (2) one word can map to different phonemes depending on context, since its pronunciation can differ.

Alternatively, words can be broken down into their *graphemes*. One downside of this approach is that English has many irregular words that are pronounced in a very different way from how they're written.
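
The two options can be combined: look a word up in a dictionary first, and fall back to grapheme rules when it is missing. The sketch below assumes a tiny made-up lexicon and rule table with ARPAbet-style phoneme symbols; neither comes from a real pronunciation resource.

```python
# Hypothetical mini pronunciation dictionary (ARPAbet-style symbols).
LEXICON = {
    "hope": ["HH", "OW", "P"],
    "team": ["T", "IY", "M"],
    "book": ["B", "UH", "K"],
    "boot": ["B", "UW", "T"],
}

# Hypothetical grapheme-to-phoneme rules: one phoneme per grapheme.
RULES = {
    "ee": "IY", "ea": "IY", "oo": "UW", "sh": "SH",
    "b": "B", "d": "D", "f": "F", "k": "K",
    "m": "M", "p": "P", "s": "S", "t": "T",
}


def to_phonemes(word):
    """Dictionary lookup first, then greedy longest-match grapheme rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes, i = [], 0
    while i < len(word):
        for size in (2, 1):  # try two-letter graphemes before single letters
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no rule for this character, so skip it
    return phonemes


print(to_phonemes("book"))  # ['B', 'UH', 'K']  (from the dictionary)
print(to_phonemes("seed"))  # ['S', 'IY', 'D']  (from the rules)
print(to_phonemes("foot"))  # ['F', 'UW', 'T']  (from the rules, but wrong)
```

The last line shows the downside described above: the rule for "oo" gives the "boot" vowel, so "foot" comes out mispronounced unless it is added to the dictionary.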

3. **Phonemes to Sound**

There are three main approaches to do this: use human recordings, computer-generated phonemes (frequencies), or mimic human voice mechanisms. These are covered below.

#### Types of text to speech models


I.  **Concatenation synthesis**


> Concatenate (string together) segments of recorded speech.


> *Unit Selection Synthesis*:


>> Recorded utterances are broken down into phones, syllables, words, phrases, etc. (typically using a speech recognition system!).

>> Those segments are stored in a database according to pitch, duration, neighboring phones, etc. 

>> Desired utterance is produced by speech synthesizer by chaining best candidate segments from database at runtime (typically with a decision tree).

> Pro: usually sounds the most natural and least robotic.

> Con: requires a large database and is time-consuming.
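
As a rough illustration of the "stringing together" step, the sketch below joins segments with a short crossfade so the boundaries do not click. Plain sine tones stand in for recorded speech segments, and the tiny `database` dictionary is hypothetical; a real unit selection system would also have to choose the best candidate for each segment.

```python
import numpy as np

RATE = 22050  # samples per second


def fake_unit(freq_hz, duration_s=0.2):
    """Stand-in for a recorded speech segment: a plain sine tone."""
    t = np.arange(int(RATE * duration_s)) / RATE
    return np.sin(2 * np.pi * freq_hz * t)


def concatenate_units(units, fade_s=0.01):
    """Join segments, crossfading over `fade_s` seconds at each boundary."""
    fade = int(RATE * fade_s)
    ramp = np.linspace(0.0, 1.0, fade)
    out = units[0].copy()
    for unit in units[1:]:
        # Blend the tail of the audio so far with the head of the next unit.
        out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out, unit[fade:]])
    return out


# Hypothetical database: one stored segment per unit label.
database = {"h": fake_unit(200), "e": fake_unit(300), "y": fake_unit(400)}
speech = concatenate_units([database[label] for label in ["h", "e", "y"]])
print(speech.shape)
```

Each crossfade overlaps two neighbors, so the result is slightly shorter than the three segments laid end to end.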


II.   **Formant synthesis**

> Doesn't use human-recorded speech samples at runtime.

> Generates sound by adding sine waves together (additive synthesis), guided by an acoustic model.

> Parameters like frequency and noise are varied to create waveforms of artificial speech.

> Pro: works at high speeds and doesn't require database or pre-recorded samples.

> Con: sounds robotic.
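
Here is a very rough sketch of the "adding sine waves together" idea. It sums three sine waves at approximate, commonly cited formant frequencies for the vowel in "father" (about 730, 1090, and 2440 Hz). The amplitudes and the function name `synthesize_vowel` are assumptions chosen for illustration, and real formant synthesizers are considerably more elaborate.

```python
import numpy as np

RATE = 22050  # samples per second


def synthesize_vowel(formants_hz, amplitudes, duration_s=0.5):
    """Sum one sine wave per formant and scale the result to [-1, 1]."""
    t = np.arange(int(RATE * duration_s)) / RATE
    wave = np.zeros_like(t)
    for freq, amp in zip(formants_hz, amplitudes):
        wave += amp * np.sin(2 * np.pi * freq * t)
    return wave / np.max(np.abs(wave))


# Approximate formants for "ah"; amplitudes are arbitrary illustrative values.
vowel_ah = synthesize_vowel([730, 1090, 2440], [1.0, 0.5, 0.25])

# The strongest frequency in the result should be the first formant.
spectrum = np.abs(np.fft.rfft(vowel_ah))
freqs = np.fft.rfftfreq(len(vowel_ah), d=1 / RATE)
print(freqs[np.argmax(spectrum)])
```

In a notebook, `Audio(vowel_ah, rate=RATE)` from `IPython.display` plays the result, and it sounds about as robotic as the "Con" above suggests.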


<img src="https://cdn4.explainthatstuff.com/concatenative-versus-formant-speech.png" width=350>


III.   **Articulatory synthesis**

> Models the human vocal tract and our articulation processes.

> Not really used in commercial speech synthesis systems.

<img src="https://slideplayer.com/slide/6811114/23/images/6/The+articulatory+speech+synthesizer+%281%29.jpg" width=350>



##### Exercise 2

Which type of speech synthesis system (concatenative, formant, or articulatory) do you think Siri is? Why?


#### Deep-Learning Based Speech Synthesis

* **Waveform Modeling**

> Model raw audio waveforms directly, generating speech from acoustic features like spectrograms or mel-scale spectrograms. The audio samples are generated *autoregressively* (each new sample is conditioned on the samples before it), and a model that turns spectrograms into audio in this way is called a *vocoder*.

> Example: [WaveNet](https://arxiv.org/abs/1609.03499)

> Pro: can provide both the *high quality* of unit selection synthesis and *flexibility* of parametric synthesis.

> Con: extremely high computational cost.
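
"Autoregressive" just means that each new value is predicted from the values already generated. The toy sketch below shows that loop with a hand-written two-sample predictor that happens to reproduce a sine wave. This is only a stand-in: WaveNet replaces `toy_predictor` with a deep network of dilated convolutions and samples from a predicted distribution. The one-sample-at-a-time loop is also the reason generation is so costly.

```python
import numpy as np

RATE = 22050
FREQ = 440.0
OMEGA = 2 * np.pi * FREQ / RATE  # phase advance per sample


def toy_predictor(history):
    """Predict the next sample from the two previous ones.

    The recurrence x[t] = 2*cos(w)*x[t-1] - x[t-2] continues a sine wave.
    """
    return 2 * np.cos(OMEGA) * history[-1] - history[-2]


def generate_autoregressively(predict_next, seed, n_samples):
    """Grow the signal one sample at a time, feeding outputs back in."""
    samples = list(seed)
    while len(samples) < n_samples:
        samples.append(predict_next(samples))
    return np.array(samples)


seed = [0.0, np.sin(OMEGA)]  # the first two samples of the sine wave
audio = generate_autoregressively(toy_predictor, seed, n_samples=RATE // 10)
print(audio.shape)
```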

<img src="https://image.slidesharecdn.com/dlsl2017d4l3speechsynthesis-wavenet-170127184643/95/speech-synthesis-wavenet-d4l3-deep-learning-for-speech-and-language-upc-2017-3-638.jpg?cb=1485545046" width=350>

* **Acoustic feature generation** 

> Other models focus on generating engineered acoustic features directly from input text, which can be combined with waveform modeling for *end-to-end* speech synthesis.

> Example: [Tacotron](https://arxiv.org/abs/1703.10135)

> Pro: end-to-end methods only require a single model and less feature engineering.

> Con: requires a lot of data and can be slow.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTwK4WEYe2-qpGTex-IlVjHcr3SFwK9nmTBhw&usqp=CAU" width=380>

##### Exercise 3

We just said that spectrograms or mel spectrograms can be used as input features for a model that generates audio waves. Do you think we can use MFCCs (Mel-frequency cepstral coefficients)?

## End to End Synthesis

**Note**:  Go to the `Runtime` tab at the top --> click `Change runtime type` --> select `GPU` from the dropdown and click `Save`.

The **Tacotron2 and WaveGlow** models together form a text-to-speech (TTS) system that synthesizes natural-sounding speech from raw transcripts without any additional prosody information.

The Tacotron2 model produces *mel spectrograms* from input text using an encoder-decoder architecture. WaveGlow (also available via torch.hub) takes in the mel spectrograms and generates speech.
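
As a reminder of what "mel" means here, the sketch below converts between hertz and mels using one common formula, `mel = 2595 * log10(1 + f / 700)`. Other variants of the mel scale exist, so treat this as an illustration, not as the exact filterbank these models use. The choice of 80 bands matches the 80 input channels in the WaveGlow printout later in this notebook; the 8000 Hz upper limit is an assumption.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common variant of the scale)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


# Band centers that are evenly spaced in mels are unevenly spaced in Hz.
n_mels = 80
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), n_mels + 2)
centers_hz = mel_to_hz(edges_mel)[1:-1]

print(np.round(centers_hz[:3]))   # low bands sit close together
print(np.round(centers_hz[-3:]))  # high bands are spread far apart
```

The bands are narrow at low frequencies and wide at high frequencies, which roughly mirrors how human hearing resolves pitch.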

This implementation of Tacotron2 uses Dropout instead of Zoneout (what the original [paper](https://arxiv.org/abs/1712.05884) uses) to regularize the LSTM layers.

<img src="https://pytorch.org/assets/images/tacotron2_diagram.png" alt="alt" width="40%"/>

In [9]:
#@title Run to install packages
%%bash
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1


In [10]:
#@title Run to import packages
import numpy as np
import torch
from scipy.io.wavfile import write
from IPython.display import Audio

### Download pretrained Tacotron model

In [11]:
tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')

Using cache found in /root/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_fp32/versions/19.09.0/files/nvidia_tacotron2pyt_fp32_20190427


In [12]:
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (lin

Take a look at the output above and see what sections and layers you recognize and understand from the Tacotron architecture.

<img src="https://nvidia.github.io/OpenSeq2Seq/html/_images/Tacotron-2.png" width=400>

#### Discussion Questions

Notice that there is an encoder, decoder, and postnet.

1. What do an encoder and a decoder do within a Transformer in general?

2. What is a postnet?

3. How many convolution layers are there within the postnet?



The **encoder** is made of 3 parts. First, a character embedding is learned. The embedding is then passed through a stack of three convolutional layers. Lastly, the result is fed to a bidirectional LSTM.

The encoder and decoder are connected via an **attention mechanism**.

The **decoder** is made of a 2-layer LSTM network, a convolutional postnet, and a fully connected prenet.
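
To get a feel for what an attention mechanism computes, here is a minimal sketch of plain dot-product attention in NumPy. This is a simplification for illustration only: Tacotron 2 uses a more elaborate location-sensitive attention, and the array sizes below are toy values.

```python
import numpy as np


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()


def attend(decoder_state, encoder_outputs):
    """Weight each encoder output by how well it matches the decoder state."""
    scores = encoder_outputs @ decoder_state  # one score per input position
    weights = softmax(scores)
    context = weights @ encoder_outputs       # weighted average of the inputs
    return context, weights


rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(6, 4))  # 6 input characters, 4 features
decoder_state = rng.normal(size=4)

context, weights = attend(decoder_state, encoder_outputs)
print(weights.round(3))  # how much the decoder "looks at" each character
print(context.shape)
```

At every decoding step the weights are recomputed, so the decoder can focus on different parts of the text as it produces successive spectrogram frames.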

### Download pretrained WaveGlow model

In [13]:
waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /root/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0): WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (1): Conv1d(51

What layers make sense or are new to you from the WaveGlow architecture?
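
One detail worth noticing in the printout is the `dilation` argument, which doubles from 1 to 128 across the eight `in_layers`. Dilation spreads a kernel's taps apart so a stack of small convolutions can see a long stretch of audio. The sketch below is a simplification that treats the eight layers as a plain chain and estimates how many input steps one output step depends on.

```python
def receptive_field(kernel_size, dilations):
    """Input steps seen by one output step of a chain of dilated convs."""
    field = 1
    for dilation in dilations:
        # Each layer widens the view by (kernel_size - 1) * dilation steps.
        field += (kernel_size - 1) * dilation
    return field


dilations = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128 as in the printout
print(receptive_field(3, dilations))     # 511
print(receptive_field(3, [1] * 8))       # 17, the same depth without dilation
```

With a kernel size of 3, setting the padding equal to the dilation (as the printout shows) keeps the output the same length as the input.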

### Try the model!

In [28]:
text = "Maybe a chipmunk version of my voice won’t be so bad"

In [29]:
# preprocessing
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

# run the models
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


In [30]:
Audio(audio_numpy, rate=rate)
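
If you want to keep the result, you can save it as a WAV file. The notebook imports `write` from `scipy.io.wavfile`, which does this in one call (`write(filename, rate, data)`). The sketch below uses the standard-library `wave` module instead so the file format details are visible. It assumes floating-point samples in the range [-1, 1] and uses a sine tone as a placeholder for `audio_numpy`; the helper name `save_wav` is made up for illustration.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, samples, rate):
    """Write floating-point samples in [-1, 1] as 16-bit mono PCM."""
    clipped = np.clip(samples, -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian 16-bit integers
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # bytes per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())


rate = 22050
t = np.arange(rate) / rate
demo_audio = 0.5 * np.sin(2 * np.pi * 440 * t)  # placeholder for audio_numpy

wav_path = os.path.join(tempfile.gettempdir(), "tts_demo.wav")
save_wav(wav_path, demo_audio, rate)
print(wav_path)
```

In the notebook itself you would pass `audio_numpy` and `rate` from the cells above instead of the placeholder tone.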

### Exercise 4

Change the input text to anything that you want to convert into audio and see how the model performs.

What words or phrases does it seem to not handle as well?

Reference: NVIDIA [notebook](https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb#scrollTo=9mLWtMnYr8kB)