Contributors: @GrzegorzKarchNV, @rajeevsrao

Tacotron 2 and WaveGlow Inference with TensorRT

This is a subfolder of the Tacotron 2 for PyTorch repository. It is tested and maintained by NVIDIA and provides scripts to perform high-performance inference using NVIDIA TensorRT.

The Tacotron 2 and WaveGlow models form a text-to-speech (TTS) system that synthesizes natural-sounding speech from raw transcripts, without any additional information such as the patterns or rhythms of speech. More information about the TTS system and its training can be found in the Tacotron 2 PyTorch README.

NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and a runtime that deliver low latency and high throughput for deep learning inference applications. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.4x over native PyTorch in mixed precision.

Quick Start Guide

  1. Clone the repository.

    git clone
    cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2
  2. Download pretrained checkpoints from NGC and copy them to the ./checkpoints directory:

  3. Build the Tacotron 2 and WaveGlow PyTorch NGC container.

    bash scripts/docker/
  4. After you build the container image, start an interactive CLI session in the NGC container to run training/inference:

    bash scripts/docker/
  5. Verify that the installed TensorRT version is 7.0 or greater. If necessary, download and install the latest release from

    pip list | grep tensorrt
    dpkg -l | grep TensorRT
  6. Export the models to ONNX intermediate representation (ONNX IR). Export Tacotron 2 to three ONNX parts: Encoder, Decoder, and Postnet:

    mkdir -p output
    python exports/ --tacotron2 ./checkpoints/nvidia_tacotron2pyt_fp16_20190427 -o output/

    Export WaveGlow to ONNX IR:

    python exports/ --waveglow ./checkpoints/nvidia_waveglow256pyt_fp16 --wn-channels 256 -o output/

    After running the above commands, there should be four new ONNX files in the ./output/ directory: encoder.onnx, decoder_iter.onnx, postnet.onnx, and waveglow.onnx.

  7. Export the ONNX IRs to TensorRT engines with fp16 mode enabled:

    python trt/ --encoder output/encoder.onnx --decoder output/decoder_iter.onnx --postnet output/postnet.onnx --waveglow output/waveglow.onnx -o output/ --fp16

    After running the command, there should be four new engine files in the ./output/ directory: encoder_fp16.engine, decoder_iter_fp16.engine, postnet_fp16.engine, and waveglow_fp16.engine.

  8. Run the TTS inference pipeline with fp16:

    python trt/ -i phrases/phrase.txt --encoder output/encoder_fp16.engine --decoder output/decoder_iter_fp16.engine --postnet output/postnet_fp16.engine --waveglow output/waveglow_fp16.engine -o output/
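
The TensorRT version requirement and the output file names listed in the steps above can be checked with a short script. The following is a minimal sketch and not part of the repository: the helper names `version_at_least`, `engine_name`, and `missing_files` are hypothetical, and the `<stem>_<precision>.engine` naming rule is inferred from the file names listed above rather than taken from the export script.

```python
import os
import re

# ONNX files that the export steps are expected to produce (from the text above).
ONNX_FILES = ["encoder.onnx", "decoder_iter.onnx", "postnet.onnx", "waveglow.onnx"]


def version_at_least(version, minimum=(7, 0)):
    """Return True if a dotted version string is >= minimum (major, minor)."""
    # Keep only the leading numeric components, e.g. "7.0.0.11" -> (7, 0).
    parts = tuple(int(p) for p in re.findall(r"\d+", version)[:len(minimum)])
    # Pad short versions such as "7" with zeros so the comparison is well defined.
    parts = parts + (0,) * (len(minimum) - len(parts))
    return parts >= minimum


def engine_name(onnx_name, precision="fp16"):
    """Map an ONNX file name to the engine name pattern seen above (assumed rule)."""
    stem, _ = os.path.splitext(onnx_name)
    return f"{stem}_{precision}.engine"


def missing_files(directory, names):
    """Return the names that do not exist as regular files in directory."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


if __name__ == "__main__":
    try:
        import tensorrt  # only available where the TensorRT Python package is installed
        ok = version_at_least(tensorrt.__version__)
        print("TensorRT", tensorrt.__version__, "meets minimum:", ok)
    except ImportError:
        print("TensorRT Python package not found")

    engines = [engine_name(n) for n in ONNX_FILES]
    print("Missing ONNX files:", missing_files("output", ONNX_FILES))
    print("Missing engine files:", missing_files("output", engines))
```

Run from the Tacotron2 directory, an empty list for both checks suggests the ONNX and engine exports completed as described.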

Inference performance: NVIDIA T4

Our results were obtained by running the ./trt/ script in the PyTorch-19.11-py3 NGC container. To reproduce the results, you need pretrained checkpoints for Tacotron 2 and WaveGlow; edit the script to provide your checkpoint filenames. All tests in this table used WaveGlow with 256 residual channels.

| Framework | Batch size | Input length | Precision | Avg latency (s) | Latency std (s) | Latency confidence interval 90% (s) | Latency confidence interval 95% (s) | Latency confidence interval 99% (s) | Throughput (samples/sec) | Speed-up PyT+TRT/PyT | Avg mels generated (81 mels=1 sec of speech) | Avg audio length (s) | Avg RTF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyT+TRT | 1 | 128 | FP16 | 1.14 | 0.02 | 1.16 | 1.16 | 1.21 | 137,050 | 1.45 | 611 | 7.09 | 6.20 |
| PyT | 1 | 128 | FP16 | 1.63 | 0.07 | 1.71 | 1.73 | 1.81 | 94,758 | 1.00 | 601 | 6.98 | 4.30 |
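
The derived columns in the table can be reproduced approximately from the measured ones. The sketch below assumes that the speed-up is the ratio of the two throughputs and that the real-time factor (RTF) is the average audio length divided by the average latency. Neither definition is stated in the text, so treat both as assumptions. Because the inputs are rounded averages, the computed RTF values can differ slightly from the table.

```python
# Measured values copied from the table above.
rows = {
    "PyT+TRT": {"latency_s": 1.14, "throughput": 137050, "audio_s": 7.09},
    "PyT": {"latency_s": 1.63, "throughput": 94758, "audio_s": 6.98},
}


def speed_up(candidate, baseline):
    """Throughput of the candidate relative to the baseline (assumed definition)."""
    return candidate["throughput"] / baseline["throughput"]


def real_time_factor(row):
    """Seconds of audio generated per second of compute (assumed definition)."""
    return row["audio_s"] / row["latency_s"]


print("Speed-up:", round(speed_up(rows["PyT+TRT"], rows["PyT"]), 2))
for name, row in rows.items():
    print(name, "RTF:", round(real_time_factor(row), 2))
```

Under these assumptions the script gives a speed-up of about 1.45 and RTF values of about 6.2 and 4.3, in line with the table.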