
Tacotron2 model generates long Mel-spectrogram for a short input text #756

@hoangtrong2305

Description


I am new to text-to-speech (TTS). I trained a Tacotron2 model from scratch on the KSS dataset (a Korean-language dataset). After training, the model produces good speech audio. Performance on the validation set:

  • Stop token loss: 0.0000
  • Mel spectrogram loss (before Postnet): 0.1331
  • Mel spectrogram loss (after Postnet): 0.1089
  • Guided attention loss: 0.0008

There is one problem I don't know how to solve. Given a short text like "윤 후보는 앞서 경기 구리시", Tacotron2 generates a Mel-spectrogram corresponding to nearly 13 seconds of audio. The alignment figure returned by the Location-Sensitive Attention module in Tacotron2 shows that all characters are aligned well within the first few seconds. If Tacotron2 stopped generating the spectrogram right there, the audio (produced by the vocoder; I use ParallelWaveGAN) would be good. However, it keeps generating the spectrogram for around 10 more seconds, so the audio ends with noise and many nonsense words.
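One post-hoc workaround, assuming the decoder exposes its per-frame stop-token (gate) logits at inference time, is to truncate the generated spectrogram at the first frame where the gate probability crosses a threshold. The function below is a minimal NumPy sketch; the names `truncate_at_stop_gate` and `gate_logits` are hypothetical, not part of any particular Tacotron2 implementation:

```python
import numpy as np

def truncate_at_stop_gate(mel, gate_logits, threshold=0.5):
    """Cut a generated Mel-spectrogram at the first frame whose stop-token
    (gate) probability exceeds `threshold`.

    mel:         (n_frames, n_mels) array of decoder outputs.
    gate_logits: (n_frames,) array of raw gate logits from the decoder.
    """
    probs = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid over logits
    fired = np.nonzero(probs > threshold)[0]
    if fired.size == 0:
        return mel  # gate never fired; keep the full spectrogram
    return mel[: fired[0] + 1]  # keep frames up to and including the stop frame
```

Lowering `threshold` makes the model stop earlier; if the gate output never fires on unpunctuated inputs, this suggests the stop-token targets seen during training (which typically come from punctuated transcripts) do not match the inference-time input distribution.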

(Attention alignment figure for the input text.)

One trick that mitigates the problem is appending a full stop to the input text (e.g. "윤 후보는 앞서 경기 구리시."), but this workaround is impractical for arbitrary user input and does not always help.
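If the punctuation trick does help, it can at least be automated as a preprocessing step before synthesis. A minimal sketch (the helper name is hypothetical, and the set of terminal marks is an assumption; extend it for your text-normalization pipeline):

```python
def ensure_terminal_punctuation(text, punct=".", terminals=".?!。？！"):
    """Append sentence-final punctuation if the input lacks one.

    Hypothetical preprocessing helper: strip surrounding whitespace,
    then add `punct` only when the text does not already end with a
    terminal punctuation mark.
    """
    text = text.strip()
    if text and text[-1] not in terminals:
        text += punct
    return text
```

This keeps inference inputs closer to the punctuated sentences the model likely saw during training, but it only papers over the mismatch rather than fixing the stop-token behavior itself.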

What should I do to solve this problem? Thank you for your advice.
