Description
I am a newbie to text-to-speech (TTS). I trained a Tacotron2 model from scratch on the KSS dataset (a Korean-language dataset). After training, the model produces good speech audio. Its performance on the validation set is:
- Stop token loss: 0.0000
- Mel spectrogram loss (before Postnet): 0.1331
- Mel spectrogram loss (after Postnet): 0.1089
- Guided attention loss: 0.0008
There is one problem that I have no idea how to solve. Given a short text like "윤 후보는 앞서 경기 구리시", Tacotron2 takes nearly 13 seconds to generate the mel-spectrogram. The alignment figure returned by the Location Sensitive Attention module in Tacotron2 shows that all characters are aligned well within the first few seconds. If Tacotron2 stopped generating the spectrogram right there, the audio (produced by the vocoder; I use ParallelWaveGAN) would be good. However, it keeps generating the spectrogram for around 10 more seconds, which makes the rest of the audio noisy and full of nonsense words. This suggests the stop token is not firing for such inputs, so decoding only halts when it hits the maximum number of decoder steps.
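
For reference, here is a minimal sketch of how inference-time stopping is typically controlled in a Tacotron2-style decoder: a stop-token (gate) threshold plus a hard cap on decoder steps. This is a hypothetical loop, not this repo's API; `init_decoder_state`, `decode_step`, and the parameter names are assumptions. Lowering `gate_threshold` or tightening `max_decoder_steps` are the two knobs usually available for this symptom.

```python
import torch

def infer_mel(model, text_ids, gate_threshold=0.5, max_decoder_steps=1000):
    """Sketch of autoregressive inference: stop as soon as the sigmoid of the
    stop-token (gate) logit crosses gate_threshold, and hard-cap the number
    of decoder steps so a mis-trained gate cannot run on indefinitely."""
    mel_frames = []
    state = model.init_decoder_state(text_ids)  # hypothetical helper
    for step in range(max_decoder_steps):
        # hypothetical single-step decode: one mel frame + one gate logit
        mel_frame, gate_logit, state = model.decode_step(state)
        mel_frames.append(mel_frame)
        if torch.sigmoid(gate_logit).item() > gate_threshold:
            break  # stop token fired: treat this as end of utterance
    return torch.stack(mel_frames, dim=-1)
```

If the gate probabilities for your short inputs never reach the default threshold, a lower threshold (or a smaller step cap derived from the input length) would truncate the trailing noise, at the risk of early cutoffs on long sentences.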
One trick that works around this problem is appending a period to the text (e.g., "윤 후보는 앞서 경기 구리시."), but this trick is impractical and sometimes ineffective; see the sketch below.
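
If you do keep that workaround while looking for a real fix, it can at least be automalized in the text front end rather than done by hand. A minimal sketch (the function name and punctuation set are my own assumptions; KSS transcripts are punctuated sentences, so inference inputs arguably should be too):

```python
def normalize_text(text: str) -> str:
    """Append sentence-final punctuation if missing, so inference-time
    inputs look like the punctuated sentences seen during training."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(normalize_text("윤 후보는 앞서 경기 구리시"))  # -> "윤 후보는 앞서 경기 구리시."
```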
What should I do to solve this problem? Thank you for your advice.
