Description
I am a newbie to text-to-speech (TTS). I trained a Tacotron2 model from scratch on the KSS dataset (a Korean-language dataset). After training, the model produces good speech audio. Its performance on the validation set is:
- Stop token loss: 0.0000
- Mel spectrogram loss (before Postnet): 0.1331
- Mel spectrogram loss (after Postnet): 0.1089
- Guided attention loss: 0.0008
There is one problem that I have no idea how to solve. Given a short text like "윤 후보는 앞서 경기 구리시", Tacotron2 takes nearly 13 seconds to generate the mel-spectrogram. The alignment figure returned by the Location Sensitive Attention module in Tacotron2 shows that all characters are aligned well within the first few seconds. If Tacotron2 stopped generating the spectrogram right there, the audio (produced by the vocoder; I use ParallelWaveGAN) would be good. However, it keeps generating the spectrogram for around 10 more seconds, which makes the rest of the audio noisy and full of nonsense words. This suggests the stop token is not firing for such inputs, so decoding only halts when it hits the maximum number of decoder steps.
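
For reference, here is a minimal sketch of how inference-time stopping is typically controlled in a Tacotron2-style decoder: a stop-token (gate) threshold plus a hard cap on decoder steps. This is a hypothetical loop, not this repo's API; `init_decoder_state`, `decode_step`, and the parameter names are assumptions. Lowering `gate_threshold` or tightening `max_decoder_steps` are the two knobs usually available for this symptom.

```python
import torch

def infer_mel(model, text_ids, gate_threshold=0.5, max_decoder_steps=1000):
    """Sketch of autoregressive inference: stop as soon as the sigmoid of the
    stop-token (gate) logit crosses gate_threshold, and hard-cap the number
    of decoder steps so a mis-trained gate cannot run on indefinitely."""
    mel_frames = []
    state = model.init_decoder_state(text_ids)  # hypothetical helper
    for step in range(max_decoder_steps):
        # hypothetical single-step decode: one mel frame + one gate logit
        mel_frame, gate_logit, state = model.decode_step(state)
        mel_frames.append(mel_frame)
        if torch.sigmoid(gate_logit).item() > gate_threshold:
            break  # stop token fired: treat this as end of utterance
    return torch.stack(mel_frames, dim=-1)
```

If the gate probabilities for your short inputs never reach the default threshold, a lower threshold (or a smaller step cap derived from the input length) would truncate the trailing noise, at the risk of early cutoffs on long sentences.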
One trick that works around this problem is appending a period to the text (e.g., "윤 후보는 앞서 경기 구리시."), but this trick is impractical and sometimes ineffective; see the sketch below.
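
If you do keep that workaround while looking for a real fix, it can at least be automalized in the text front end rather than done by hand. A minimal sketch (the function name and punctuation set are my own assumptions; KSS transcripts are punctuated sentences, so inference inputs arguably should be too):

```python
def normalize_text(text: str) -> str:
    """Append sentence-final punctuation if missing, so inference-time
    inputs look like the punctuated sentences seen during training."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(normalize_text("윤 후보는 앞서 경기 구리시"))  # -> "윤 후보는 앞서 경기 구리시."
```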
What should I do to solve this problem? Thank you for your advice.
