## TTS inference with Tacotron2 + WaveGlow on Google Colab

by Hyungon Ryu | Sr. Solution Architect at NVIDIA


In this Colab notebook, I'll demonstrate how to generate speech from input text with the Tacotron2 and WaveGlow models.

## DevOps

Check the available GPU. Colab provides a Tesla K80.

In [0]:
!nvidia-smi

Sun Mar 31 20:04:41 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

### set max clock

In [0]:
%%bash
# check the environment
echo "Check H/W"
lscpu | grep 'CPU(s):            '
lscpu | grep GHz
echo "memory" && free -m | cut -c-49 |  head -n 2 
echo "storage" && df -h |  cut -c-60 | head -n 2
df -h |  grep '/dev/sda1'
echo " " && nvidia-smi -L | cut -c-17
echo "confure Max Application Clock for K80 875Mhz"
nvidia-smi -ac 2505,875 && nvidia-smi -pm 1
echo " " &&echo "Check S/W"
cat /etc/*-release | grep PRETTY_NAME
python --version 
nvcc --version | grep  tools

Check H/W
CPU(s):              2
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
memory
              total        used        free      
Mem:          13022         439       10849      
storage
Filesystem      Size  Used Avail Use% Mounted on
overlay         359G   23G  318G   7% /
/dev/sda1       365G   27G  339G   8% /opt/bin
 
GPU 0: Tesla K80 
confure Max Application Clock for K80 875Mhz
Applications clocks set to "(MEM 2505, SM 875)" for GPU 00000000:00:04.0


All done.
Enabled persistence mode for GPU 00000000:00:04.0.
All done.
 
Check S/W
PRETTY_NAME="Ubuntu 18.04.2 LTS"
Python 3.6.7
Cuda compilation tools, release 10.0, V10.0.130


### install PyTorch 1.0
The current implementations of the Tacotron2 and WaveGlow models require PyTorch 1.0.
Installation takes about 30 to 40 seconds.

PyTorch 1.0 build: `torch-nightly-1.0.0.dev20181128`

In [0]:
%%time
%%bash
pip install numpy
pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly-1.0.0.dev20181128-cp36-cp36m-linux_x86_64.whl

Looking in links: https://download.pytorch.org/whl/nightly/cu92/torch_nightly-1.0.0.dev20181128-cp36-cp36m-linux_x86_64.whl
Collecting torch_nightly
  Downloading https://download.pytorch.org/whl/nightly/cu92/torch_nightly-1.0.0.dev20181128-cp36-cp36m-linux_x86_64.whl (576.6MB)
Installing collected packages: torch-nightly
Successfully installed torch-nightly-1.0.0.dev20181128
CPU times: user 3.29 ms, sys: 9.89 ms, total: 13.2 ms
Wall time: 40.7 s


### install required python utilities
This takes up to about 30 seconds (18.3 s in the run below).

In [0]:
%%time 
%%bash
pip install -q \
 inflect==0.2.5 \
 librosa==0.6.0 \
 scipy==1.0.0 \
 tensorboardX==1.1 \
 Unidecode==1.0.22 

magenta 0.3.19 has requirement librosa>=0.6.2, but you'll have librosa 0.6.0 which is incompatible.
imgaug 0.2.8 has requirement numpy>=1.15.0, but you'll have numpy 1.14.6 which is incompatible.
fastai 1.0.50.post1 has requirement numpy>=1.15, but you'll have numpy 1.14.6 which is incompatible.
cvxpy 1.0.15 has requirement scipy>=1.1.0, but you'll have scipy 1.0.0 which is incompatible.
albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.8 which is incompatible.


CPU times: user 6.68 ms, sys: 3.43 ms, total: 10.1 ms
Wall time: 18.3 s


## Git clone WaveGlow and Tacotron2 models

If you run into any problems, use these exact commits:
 - Tacotron2 : commit `6e430556bd4e1404c4dbf7cf4c790b4dd53ee93d`
 - WaveGlow : commit `6c6d5fcce1351203c2029dcf0fefd06f5647b948`
 


In [0]:
%%bash
rm -rf waveglow
git clone https://github.com/NVIDIA/waveglow.git
cd waveglow
git submodule init
git submodule update --remote --merge

Submodule path 'tacotron2': checked out 'ece7d3f5681bf8fe46a6c3e5293bf8c5aab6cbce'


Cloning into 'waveglow'...
Submodule 'tacotron2' (http://github.com/NVIDIA/tacotron2) registered for path 'tacotron2'
Cloning into '/content/waveglow/tacotron2'...
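
If the latest revisions misbehave, the commits listed above can be checked out after cloning. The following is a minimal sketch, assuming the directory layout produced by the clone cell (`waveglow` with the `tacotron2` submodule inside it). The helper name `pin_commits` and its `dry_run` flag are illustrative and not part of either repository.

```python
import subprocess

# Commit hashes quoted in this notebook, keyed by checkout directory.
PINNED_COMMITS = {
    "waveglow": "6c6d5fcce1351203c2029dcf0fefd06f5647b948",
    "waveglow/tacotron2": "6e430556bd4e1404c4dbf7cf4c790b4dd53ee93d",
}

def pin_commits(pins=PINNED_COMMITS, dry_run=True):
    """Build (and optionally run) one `git checkout` command per repository."""
    commands = [["git", "-C", path, "checkout", commit]
                for path, commit in pins.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises CalledProcessError if git fails
    return commands

# Dry run: only print what would be executed.
for cmd in pin_commits(dry_run=True):
    print(" ".join(cmd))
```

Calling `pin_commits(dry_run=False)` would actually run the checkouts, which puts both repositories in a detached-HEAD state at the listed commits.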


## Download Checkpoint file


### Python script

In [0]:
import requests

def download_file_from_google_drive(file_id, destination):
    """Download a publicly shared Google Drive file to `destination`."""

    def get_confirm_token(response):
        # For large files Google Drive shows a warning page first; the
        # confirmation token arrives in a cookie named 'download_warning...'.
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768  # bytes per write

        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    # First request: returns either the file or a warning page plus cookie.
    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    # Second request: repeat with the confirmation token if one was issued.
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)




### Tacotron2 

Tacotron2 model checkpoint file: *`tacotron2_statedict.pt`* (108MB)

In [0]:
%%time
destination="tacotron2_statedict.pt"
file_id="1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA"
download_file_from_google_drive(file_id, destination)

CPU times: user 208 ms, sys: 259 ms, total: 468 ms
Wall time: 1.5 s


### WaveGlow
WaveGlow model checkpoint file: *`waveglow_old.pt`* (2GB).
The download takes about 15 to 20 seconds.

In [0]:
%%time
destination="waveglow_old.pt"
file_id="1cjKPHbtAMh_4HTHmuIGNkbOkPBD9qwhj"
download_file_from_google_drive(file_id, destination)

CPU times: user 3.76 s, sys: 4.79 s, total: 8.55 s
Wall time: 17.7 s
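
Google Drive can return a small HTML page instead of the file (for example when a download quota is exceeded), so it can be worth checking the size of each checkpoint before loading it. This is a minimal sketch: the helper `looks_complete` and the size thresholds are assumptions, chosen to sit below the sizes quoted above (108MB and 2GB) rather than taken from any official source.

```python
import os

def looks_complete(path, min_bytes):
    """Return True if `path` is a file of at least `min_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes

# Rough lower bounds (assumed), below the quoted checkpoint sizes.
EXPECTED_MIN_BYTES = {
    "tacotron2_statedict.pt": 100 * 10**6,  # quoted as 108MB
    "waveglow_old.pt": 10**9,               # quoted as 2GB
}

for name, min_bytes in EXPECTED_MIN_BYTES.items():
    status = "ok" if looks_complete(name, min_bytes) else "missing or truncated"
    print("{}: {}".format(name, status))
```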


### configure 


In [0]:
import os
import sys
import time
import numpy as np
from scipy.io.wavfile import write

import warnings
warnings.filterwarnings('ignore')

##### for figure

In [0]:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pylab as plt
import IPython.display as ipd


%matplotlib inline

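def plot_data(data, figsize=(16, 4)):
    # Draw each 2-D array in `data` as an image, side by side.
    fig, axes = plt.subplots(1, len(data), figsize=figsize)
    for i in range(len(data)):
        # 'lower' puts index 0 at the bottom and is accepted by all Matplotlib versions
        axes[i].imshow(data[i], aspect='auto', origin='lower',
                       interpolation='none')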

##### waveglow and tacotron model

In [0]:
import torch
sys.path.insert(0, 'waveglow')
sys.path.insert(0, 'waveglow/tacotron2')
from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence

##### Setup Hparams

In [0]:
hparams = create_hparams()
hparams.sampling_rate = 22050





### Load Tacotron2 model

In [0]:
%%time
checkpoint_path = "tacotron2_statedict.pt"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.eval()

CPU times: user 2.3 s, sys: 894 ms, total: 3.19 s
Wall time: 3.27 s


### Load WaveGlow model

In [0]:
%%time
waveglow_path ="waveglow_old.pt"
waveglow = torch.load(waveglow_path)['model']
waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

CPU times: user 554 ms, sys: 1.64 s, total: 2.2 s
Wall time: 2.2 s


### Generate Mel Spectrogram from input text


#### input sentence

In [0]:
text = "This is a demo of Tacotron 2 and waveglow. I find it really cool that you can make the python scripts talk!"

#### text preprocessing 

In [0]:
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
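
Conceptually, `text_to_sequence` normalises the text and maps each symbol to an integer ID, and `[None, :]` adds a leading batch dimension of size 1. The sketch below imitates that idea with a made-up symbol table (`TOY_SYMBOLS`) and a hypothetical `toy_text_to_sequence`. The real cleaners and symbol IDs in the Tacotron2 repository differ.

```python
# Hypothetical symbol table; the real Tacotron2 table is larger and ordered differently.
TOY_SYMBOLS = "_ abcdefghijklmnopqrstuvwxyz!'.,?"
SYMBOL_TO_ID = {s: i for i, s in enumerate(TOY_SYMBOLS)}

def toy_text_to_sequence(text):
    """Lowercase the text and map each known character to an integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]

ids = toy_text_to_sequence("Hi!")
batch = [ids]  # a batch holding one sequence, analogous to `[None, :]` on a NumPy array
print(batch)
```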

#### load sequence

In [0]:
# torch.autograd.Variable is a deprecated no-op wrapper since PyTorch 0.4; tensors suffice
sequence = torch.from_numpy(sequence).cuda().long()

#### generate Mel spectrogram from input text 

In [0]:
%%time
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

CPU times: user 1.04 s, sys: 46.1 ms, total: 1.09 s
Wall time: 1.09 s


####  plot results

In [0]:
plot_data((mel_outputs.data.cpu().numpy()[0],
           mel_outputs_postnet.data.cpu().numpy()[0],
           alignments.data.cpu().numpy()[0].T))

### Synthesize audio from Mel spectrogram using WaveGlow
This takes up to about 17 seconds on a K80 (9.74 s in the run below). The measurement includes redundant memory-copy time.

In [0]:
%%time
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)

CPU times: user 4.85 s, sys: 4.95 s, total: 9.8 s
Wall time: 9.74 s


#### check audio

In [0]:
print(text)
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)

This is a demo of Tacotron 2 and waveglow. I find it really cool that you can make the python scripts talk!


## Measure exact inference time

#### reduce batch to simplify code

In [0]:
print(mel_outputs_postnet.size())
print(mel_outputs_postnet[-1, :, :].size())

torch.Size([1, 80, 598])
torch.Size([80, 598])


#### configure waveglow 

### Generate Audio from Mel spectrogram
In general, this takes about 6.6 seconds on Colab with a Tesla K80 (Kepler architecture); the run below took 9.69 s. With a Tesla V100 (Volta architecture) or a Tesla T4 (Turing architecture), you could generate audio in real time.

In [0]:
%%time

from mel2samp import MAX_WAV_VALUE
sampling_rate = 22050
sigma = 0.6

# Take the last (here: only) item of the batch and detach it from the graph.
mel = mel_outputs_postnet[-1, :, :].detach()
# Restore the batch dimension expected by waveglow.infer.
mel = torch.unsqueeze(mel.cuda(), 0)
start = time.perf_counter()
with torch.no_grad():
    # Scale the float waveform to the 16-bit integer range.
    audio2 = MAX_WAV_VALUE * waveglow.infer(mel, sigma=sigma)[0]
duration = time.perf_counter() - start
print("inference time {:.2f}s/it".format(duration))

inference time 9.69s/it
CPU times: user 4.85 s, sys: 4.77 s, total: 9.62 s
Wall time: 9.69 s
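
To put the timing in context, the length of the generated audio can be estimated from the number of mel frames. This sketch assumes a hop length of 256 samples per frame (the default I would expect in the Tacotron2 `hparams`; it is not stated in this notebook) and reuses the 598 frames and the 9.69 s measurement printed above. A real-time factor above 1 means synthesis is slower than playback.

```python
def audio_seconds(n_frames, hop_length=256, sampling_rate=22050):
    """Approximate audio duration covered by `n_frames` mel frames."""
    return n_frames * hop_length / sampling_rate

def real_time_factor(inference_seconds, audio_length_seconds):
    """Inference time divided by audio length (smaller is faster)."""
    return inference_seconds / audio_length_seconds

length = audio_seconds(598)            # frames from torch.Size([1, 80, 598])
rtf = real_time_factor(9.69, length)   # inference time measured on the K80
print("audio length {:.2f}s, real-time factor {:.2f}".format(length, rtf))
```

Under these assumptions the K80 run comes out somewhat slower than real time, which is consistent with the remark that a newer GPU is needed for real-time synthesis.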



Copying the audio from the GPU to host memory takes about 30 seconds or longer (1 min 14 s in the run below).

In [0]:
%%time
audio2=audio2.data.cpu().numpy()

CPU times: user 39.9 s, sys: 34.6 s, total: 1min 14s
Wall time: 1min 14s
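
CUDA kernels are launched asynchronously, so a host-side timer can stop before the GPU has finished; the remaining work is then paid for by the next blocking call, such as `.cpu()`. That may be why the copy above appears so slow, although this is an assumption rather than something measured here. Below is a minimal sketch of a timer with an optional synchronisation hook. `timed` is a hypothetical helper, and on a GPU one would pass `torch.cuda.synchronize` as `sync`.

```python
import time

def timed(fn, sync=None):
    """Run `fn()` and return (result, elapsed seconds).

    `sync` is called before starting and before stopping the clock so that
    pending asynchronous work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# CPU-only demonstration; in the notebook one might instead write
#   timed(lambda: waveglow.infer(mel, sigma=sigma), sync=torch.cuda.synchronize)
result, elapsed = timed(lambda: sum(range(100000)))
print("result {}, elapsed {:.4f}s".format(result, elapsed))
```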


### check audio

In [0]:
print(text)
ipd.Audio(audio2 , rate=sampling_rate)

This is a demo of Tacotron 2 and waveglow. I find it really cool that you can make the python scripts talk!
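
The factor `MAX_WAV_VALUE` scales the float waveform to the 16-bit integer range, which is the form needed to save it as a WAV file. The notebook imports `scipy.io.wavfile.write` but never calls it; the sketch below shows the same idea with Python's standard-library `wave` module so that it runs on its own. The value 32768.0 for `MAX_WAV_VALUE`, the helper `save_wav`, and the synthetic sine tone standing in for the synthesised audio are assumptions for illustration.

```python
import math
import struct
import wave

MAX_WAV_VALUE = 32768.0  # assumed value of the constant in mel2samp.py

def save_wav(path, samples, sampling_rate=22050):
    """Write floats in [-1, 1] as a mono 16-bit PCM WAV file."""
    ints = []
    for s in samples:
        v = int(s * MAX_WAV_VALUE)
        ints.append(max(-32768, min(32767, v)))  # clip to the int16 range
    with wave.open(path, "wb") as f:
        f.setnchannels(1)              # mono
        f.setsampwidth(2)              # 2 bytes = 16 bits per sample
        f.setframerate(sampling_rate)
        f.writeframes(struct.pack("<{}h".format(len(ints)), *ints))

# Stand-in for the synthesised audio: 0.5 s of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(11025)]
save_wav("demo.wav", tone)
```

In the notebook, `audio2` has already been multiplied by `MAX_WAV_VALUE`, so it would only need casting to `int16` before being passed to `write`.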
