New pretrained synthesizer model (tensorflow) #538

Closed
ghost opened this issue Sep 30, 2020 · 3 comments
ghost commented Sep 30, 2020

Trained on LibriSpeech, using the current synthesizer (tensorflow). This performs similarly to the current model, with fewer random gaps appearing in the middle of synthesized utterances. It handles short input texts better too.

Download link: https://www.dropbox.com/s/3kyjgew55c4yxtf/librispeech_270k_tf.zip?dl=0

Unzip the file and move the logs-pretrained folder to synthesizer/saved_models.

I am not going to provide scripts to reproduce the training. For anyone interested, you will need to curate LibriSpeech for more consistent prosody. This is what I did when running synthesizer_preprocess_audio.py:

  1. In synthesizer/hparams.py, set silence_min_duration_split=0.05
  2. Right before this line, run encoder.preprocess_wav() on each wav; this uses voice activity detection to trim silences (see Trim silences during synthesizer preprocess #501). Compare the lengths of the "before" and "after" wavs: if they don't match, a silence was detected and trimmed, and the wav is discarded. I keep the "before" wav only when the lengths match (see the sketch after this list).
  3. Post-process datasets_root/SV2TTS/synthesizer/train.txt to keep only utterances between 225 and 600 mel frames (2.8 to 7.5 sec). This leaves 48 hours of training data.
  4. Train from scratch for about 270k steps. I used a batch size of 12 because of limited GPU memory.
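
A minimal sketch of the filters in steps 2 and 3, assuming `encoder.preprocess_wav()` (the repo's VAD-trimming helper) is importable from `encoder.inference`, that the encoder operates at 16 kHz, and that the mel frame count sits in the fifth pipe-separated field of train.txt; the helper names are mine, not from the original post:

```python
import librosa
from encoder.inference import preprocess_wav  # assumption: repo's VAD-trimming helper

SAMPLING_RATE = 16000  # assumption: the encoder's sample rate

def is_silence_free(wav_fpath):
    """Step 2: keep a wav only if VAD trimming leaves its length unchanged."""
    wav, _ = librosa.load(str(wav_fpath), sr=SAMPLING_RATE)
    trimmed = preprocess_wav(wav, source_sr=SAMPLING_RATE)
    # A shorter "after" wav means a silence was detected, so discard the utterance.
    return len(trimmed) == len(wav)

def filter_metadata(train_txt_path, min_frames=225, max_frames=600):
    """Step 3: keep only utterances whose mels span 225..600 frames
    (~2.8 to 7.5 s at a 12.5 ms hop)."""
    with open(train_txt_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    kept = [line for line in lines
            if min_frames <= int(line.split("|")[4]) <= max_frames]
    with open(train_txt_path, "w", encoding="utf-8") as f:
        f.writelines(kept)
```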

ghost commented Sep 30, 2020

This model still has the occasional attention failure. However, this is not caused by Corentin's modifications to Rayhane's taco2: I have studied the differences line by line and concluded that no errors were introduced. Rather, I think the attention problems are inherent to the SV2TTS architecture, particularly because the speaker embedding is an input to the attention mechanism.

[image: alignment plot showing an example of an attention failure]

Attention is problematic even in single-speaker tacotrons, and it gets worse in multispeaker models due to the speaker embedding concatenation. This highlights the need for a better attention mechanism in SV2TTS.
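
For concreteness, here is a minimal numpy sketch of that concatenation (shapes are illustrative, not the repo's actual hyperparameters): the speaker embedding is tiled across time and joined to every encoder output frame, so the attention memory itself is speaker-conditioned.

```python
import numpy as np

T, enc_dim, spk_dim = 120, 512, 256            # illustrative sizes
encoder_outputs = np.random.randn(T, enc_dim)  # one frame per input timestep
speaker_embed = np.random.randn(spk_dim)       # single utterance-level embedding

# Broadcast the embedding over all T encoder timesteps, then concatenate.
tiled = np.tile(speaker_embed, (T, 1))                                 # (T, spk_dim)
attention_memory = np.concatenate([encoder_outputs, tiled], axis=-1)   # (T, enc_dim + spk_dim)
# Every alignment score is now computed against speaker-conditioned frames.
```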


Choons commented Sep 30, 2020

Amazing work! Thanks @blue-fish!

SmartPoly1 commented

Why are your Dropbox links not working?

This issue was closed.