Best practices for calling the Azure TTS service in server scenarios

Yulin Li edited this page Jan 20, 2020 · 2 revisions

In many scenarios, you may want to call the Azure TTS service from the server side (e.g. a website backend). Here are some suggestions for improving performance in such scenarios.

TTS service latency is determined by two factors: synthesis time and network latency.

For synthesis time, longer text usually takes longer to synthesize. Using a streaming output mode helps with long text, because audio can be consumed before synthesis has finished. If you have long text, it is also useful to send it to the service sentence by sentence to reduce latency.
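As a rough illustration of sending text sentence by sentence, the sketch below splits the input on sentence-ending punctuation. The `SentenceSplitter` class and its splitting rule are hypothetical and deliberately naive (abbreviations such as "e.g." would be split incorrectly); a production system would likely need a language-aware splitter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Speech SDK.
public static class SentenceSplitter
{
    // Assumption for illustration: a sentence ends at '.', '!' or '?'
    // followed by whitespace.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[.!?])\s+", RegexOptions.Compiled);

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (var part in Boundary.Split(text ?? string.Empty))
        {
            var trimmed = part.Trim();
            if (trimmed.Length > 0)
            {
                sentences.Add(trimmed);
            }
        }
        return sentences;
    }
}
```

Each returned sentence can then be submitted as its own synthesis request, in order, so that the first audio is available while later sentences are still being synthesized.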

For network latency, establishing an HTTP connection usually takes time. When possible, use persistent connections and reuse the connection for multiple requests.

Use the Speech SDK

We recommend using our Speech SDK to call the TTS service.

Reuse the synthesizer

Each synthesizer has its own HTTP/WebSocket connection, so reusing synthesizers reduces latency: there is no need to establish a new connection for each synthesis request.

You can use an object pool to manage the synthesizers.
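A minimal sketch of such a pool is shown below. `SimplePool<T>` is a hypothetical generic helper, not an SDK type; it is kept independent of the SDK so that it compiles on its own. It does not cap the number of instances or check whether a returned instance is still healthy, both of which a real server would probably want.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical thread-safe pool; T would be the synthesizer type in practice.
public sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public SimplePool(Func<T> factory)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    // Take an idle instance, or create a new one if the pool is empty.
    public T Rent()
    {
        return _items.TryTake(out var item) ? item : _factory();
    }

    // Put the instance back so that its connection can be reused
    // by the next request.
    public void Return(T item)
    {
        if (item != null)
        {
            _items.Add(item);
        }
    }
}
```

With the Speech SDK, the factory would create a synthesizer from your speech configuration. A request handler would call `Rent()`, run the synthesis, and call `Return()` in a `finally` block.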

Receive the audio in streaming mode

You can subscribe to the Synthesizing event or use AudioDataStream to receive the audio asynchronously in streaming mode.

Synthesizing event:

synthesizer.Synthesizing += (s, e) =>
{
    // Receive each audio chunk here as it arrives.
    Console.WriteLine($"Synthesizing event received with audio chunk of {e.Result.AudioData.Length} bytes.");
};

AudioDataStream

using (var audioDataStream = AudioDataStream.FromResult(result))
{
    // You can save all the data in the audio data stream to a file
    string fileName = "outputaudio.wav";
    await audioDataStream.SaveToWaveFileAsync(fileName);
    Console.WriteLine($"Audio data was saved to [{fileName}]");

    // You can also read data from audio data stream and process it in memory
    // Reset the stream position to the beginning, since saving to a file moves the position to the end
    audioDataStream.SetPosition(0);

    byte[] buffer = new byte[16000];
    uint totalSize = 0;
    uint filledSize = 0;

    while ((filledSize = audioDataStream.ReadData(buffer)) > 0)
    {
        Console.WriteLine($"{filledSize} bytes received.");
        totalSize += filledSize;
    }

    Console.WriteLine($"{totalSize} bytes of audio data received");
}

For more details, see our samples in C#.

Call the REST API

If you call the REST API directly, try the following steps:

  1. Establish the connection before posting the actual content, for example by sending a warm-up request.
  2. Reuse the HTTP connection. For example, in C#, reuse a single HttpClient object for all requests instead of creating a new one each time.
  3. You need an auth token to call the TTS REST API. You can fetch and refresh the token asynchronously in a background thread so that a valid token is always ready.
  4. Use streaming to receive the synthesized audio. In C#, refer to this.
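
The steps above could be combined roughly as in the following sketch. `TtsRestClient` is a hypothetical wrapper. The token fetcher, the endpoint URI and the refresh interval are supplied by the caller and should be taken from the documentation for your Speech resource. The header names and the output format value are assumptions based on the public REST API and should be verified.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper, not an SDK type.
public sealed class TtsRestClient : IDisposable
{
    // Step 2: one HttpClient for the whole process so connections are reused.
    private static readonly HttpClient Http = new HttpClient();

    private readonly Func<Task<string>> _fetchToken;
    private readonly Uri _ttsEndpoint;
    private readonly Timer _refreshTimer;
    private volatile string _token;

    public TtsRestClient(Func<Task<string>> fetchToken, Uri ttsEndpoint, TimeSpan refreshInterval)
    {
        _fetchToken = fetchToken ?? throw new ArgumentNullException(nameof(fetchToken));
        _ttsEndpoint = ttsEndpoint ?? throw new ArgumentNullException(nameof(ttsEndpoint));
        // Step 3: refresh the token in the background so requests do not wait for it.
        _refreshTimer = new Timer(
            state => { var ignored = RefreshQuietlyAsync(); },
            null, refreshInterval, refreshInterval);
    }

    public string CurrentToken => _token;

    public async Task RefreshTokenAsync()
    {
        _token = await _fetchToken().ConfigureAwait(false);
    }

    private async Task RefreshQuietlyAsync()
    {
        try { await RefreshTokenAsync().ConfigureAwait(false); }
        catch (Exception) { /* keep the previous token; the next tick retries */ }
    }

    public async Task SynthesizeToStreamAsync(string ssml, Stream destination)
    {
        if (_token == null)
        {
            await RefreshTokenAsync().ConfigureAwait(false);
        }

        using (var request = new HttpRequestMessage(HttpMethod.Post, _ttsEndpoint))
        {
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", _token);
            // Assumed header names and values; check them against the REST API reference.
            request.Headers.TryAddWithoutValidation("X-Microsoft-OutputFormat", "riff-24khz-16bit-mono-pcm");
            request.Headers.TryAddWithoutValidation("User-Agent", "tts-sample");
            request.Content = new StringContent(ssml, Encoding.UTF8, "application/ssml+xml");

            // Step 4: ResponseHeadersRead returns as soon as the headers arrive,
            // so the audio body can be copied out while it is still downloading.
            using (var response = await Http.SendAsync(
                request, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
            {
                response.EnsureSuccessStatusCode();
                using (var audio = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
                {
                    await audio.CopyToAsync(destination).ConfigureAwait(false);
                }
            }
        }
    }

    public void Dispose()
    {
        _refreshTimer.Dispose();
    }
}
```

For step 1, one option is to call `RefreshTokenAsync()` and synthesize a very short text once at startup, so that the connection is already established before the first real request arrives.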