
Steps to replicate pretrained models on LibriTTS #57

Open
ghost opened this issue Aug 14, 2020 · 25 comments

@ghost

ghost commented Aug 14, 2020

First of all, thank you for the amazing paper and for releasing the code.

I have read the instructions and all the issues, but I can't find a single place with the steps that would allow me to faithfully replicate the training of the models you shared (the Flowtron LibriTTS model).

Would it be possible to provide a detailed step by step guide to do that?
Something that would include exactly:

  • Your OS environment
  • CUDA libraries
  • Seeds used for training (roughly the kind of settings sketched after this list)
  • Exactly how many steps the model was trained for, for each flow training
  • Anything else that would make my training match your training exactly
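
Just to be concrete about the seed point, this is the kind of generic PyTorch pinning I have in mind (an illustration only, not this repo's code; the value 1234 is a placeholder):

import random
import numpy as np
import torch

# Generic PyTorch determinism settings (illustration, not taken from this repo).
seed = 1234  # placeholder; the actual value you used is what I am asking about
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False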

I am a big fan of easy reproducibility :)

Thanks again.

@rafaelvalle
Contributor

Ciao Dario,

Our paper describes in detail how we trained the LibriTTS model. You will not be able to exactly match our training because the LSH model was trained on LJSpeech and two proprietary datasets. Nonetheless, you should be able to reproduce our results by following the steps in the paper, substituting the LSH dataset with the LJS dataset. Post issues on this repo if you run into any.

@ghost
Author

ghost commented Aug 21, 2020

Ciao Rafael,
Thank you for your answer.

I decided to train on LibriTTS with a warm start from your pretrained LibriTTS model.

1 Flow

As suggested I started with 1 flow.
After more than 1 million steps, the training and validation losses look good, together with the attention weights:

[Screenshots: training loss, validation loss, attention weights]

Results

After running inference at different checkpoints, I found that the outputs that "sounded" best came from approximately step 580,000 (which is also where the validation loss is at its minimum).
Still, the output wasn't satisfactory, but it was at least intelligible.

2 Flows

I am now training with 2 flows. I started from the checkpoint at step 580,000 and set the appropriate include_layers to null; a rough sketch of the config change is below.
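
Roughly, the edit looks like this (a minimal sketch; the key names are taken from my copy of the repo's config.json and the checkpoint path is a placeholder, so adjust to your setup):

import json

# Sketch of the config change for the 2-flow warm start.
with open("config.json") as f:
    config = json.load(f)

config["model_config"]["n_flows"] = 2
config["train_config"]["warmstart_checkpoint_path"] = "outdir/model_580000"  # 1-flow checkpoint (placeholder path)
config["train_config"]["include_layers"] = None  # null in JSON: warm start all matching layers

with open("config_2flows.json", "w") as f:
    json.dump(config, f, indent=4)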

So far this is how the training is going:

[Screenshots: training loss, validation loss, attention weights 0, attention weights 1]

Results

When I run inference on an early checkpoint of this 2-flow training (step 10,000), the output is still "ok":

[Step 10,000 outputs: attention plots for speaker 40, sigma 0.5, attention layers 0 and 1]

At step 240,000, even though the losses are lower, the inference results are bad:

[Step 240,000 outputs: attention plots for speaker 40, sigma 0.5, attention layers 0 and 1]

My questions:

  1. Is it expected that during the training of the 2-flow network, the output will momentarily get worse?
  2. Why are the attention weights so bad at inference time, when they are not bad during training? (See Tensorboard plots)

Thanks a lot again @rafaelvalle

@rafaelvalle
Contributor

  1. Yes, because the model does not know how to attend on the most recently added flow.
  2. During training we perform a forward pass, and the first flow step already knows how to attend to the inputs. During inference, the last flow step (closest to z) is the first to attend to the inputs, but, judging from your Attention Weights 1 image, that flow step does not yet know how to attend (see the sketch below).
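
Schematically, writing f_1 for the flow trained in the 1-flow run and f_2 for the newly added one (my notation, not the paper's):

z   = f_2(f_1(mel; text); text)           (training: forward pass, f_1 attends to the text first)
mel = f_1^{-1}(f_2^{-1}(z; text); text)   (inference: inverse pass, the new f_2 attends first)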

Try inference again once your Attention Weights 1 look better.

@ghost
Author

ghost commented Aug 21, 2020

That makes sense, thanks!
Will keep you posted and summarize (for future readers) what I have done

@ghost
Author

ghost commented Sep 2, 2020

Ok, I have been running the training with 2 flows now for a while.

This is what I see on TensorBoard:

[Screenshots: training loss, validation loss, attention weights 0 and 1]

I would say that everything looks great.

When I run inference, everything looks (and sounds) bad:

[Inference attention plots: speaker 40, sigma 0.5, attention layers 0 and 1]

@rafaelvalle What would you recommend?
Things looked and sounded better at the end of the 1-flow training.

Thanks

@rafaelvalle
Contributor

Confirm that during inference the hyperparameters in config.json match what was used during training.
As a sanity check, generate a few sentences from the training data.
Then check whether the issue is sentence- or speaker-dependent.
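
For the first check, a quick script along these lines works (the paths are placeholders; point them at the config you trained with and the one you pass to inference.py):

import json

# Compare the config used for training against the one used for inference
# (placeholder file names).
with open("config_train.json") as f:
    train_cfg = json.load(f)
with open("config_inference.json") as f:
    infer_cfg = json.load(f)

for section in ("data_config", "model_config"):
    keys = set(train_cfg[section]) | set(infer_cfg[section])
    diff = [k for k in sorted(keys) if train_cfg[section].get(k) != infer_cfg[section].get(k)]
    if diff:
        print(f"{section} differs in: {diff}")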

@ghost
Author

ghost commented Sep 2, 2020

config.json is the same

A couple of training sentences with speakers 40 and 887:

Speaker 40

[Inference attention plots: speaker 40, sigma 0.5, attention layers 0 and 1]

Speaker 887

[Inference attention plots: speaker 887, sigma 0.5, attention layers 0 and 1]

Better but not good.
It seems to be sentence-dependent.

@rafaelvalle
Contributor

If you're not doing so already, make sure to add punctuation to the phrases.

@ghost
Author

ghost commented Sep 2, 2020

I did add punctuation.
Should I just train longer?

@rafaelvalle
Contributor

Did you try a lower value of sigma?

@ghost
Author

ghost commented Sep 2, 2020

I was already running it with sigma=0.5

@rafaelvalle
Contributor

rafaelvalle commented Sep 2, 2020

Try something even more conservative, 0.25.
Is this model trained with speaker embeddings?
Also, can you share the phrases you've been evaluating?

@rafaelvalle
Contributor

What happens if you set n_frames to be 6 times the number of tokens?

@ghost
Author

ghost commented Sep 3, 2020

Yes, the model is trained with speaker embeddings.

Here are some examples:

I set sigma as low as 0.25 as you suggested.

"I was good enough not to contradict this startling assertion." -i 887 -s 0.25 
"Then one begins to appraise." -i 1116 -s 0.25
"Now let us return to your particular world." -i 40 -s 0.25

And in the inference.py script I added the computation for n_frames:

text = trainset.get_text(text).cuda()  # encode the phrase into token ids
n_frames = len(text) * 6               # ~6 mel frames per input token, as suggested

Still bad results

@rafaelvalle
Contributor

Try these modifications to the phrases:

"I was good enough to contradict this startling assertion."
"Now let us return your particular world."

@ghost
Author

ghost commented Sep 3, 2020

Speaker 40: "Now let us return your particular world."

[Inference attention plots: speaker 40, sigma 0.25, attention layers 0 and 1]

Speaker 887: "I was good enough to contradict this startling assertion."

[Inference attention plots: speaker 887, sigma 0.25, attention layers 0 and 1]

@rafaelvalle
Contributor

That's very surprising. Give us some time to look into it.

@ghost
Author

ghost commented Sep 3, 2020

Thanks a lot! I really appreciate your help.
Please let me know if I can be more involved in the investigation

@ghost
Author

ghost commented Sep 3, 2020

One thing: there are differences in the output when running inference on different checkpoints.
None of them is good enough, but there are of course significant fluctuations.

@rafaelvalle
Contributor

Are the speaker ids you're sharing the LibriTTS ids? The model should have about 123 speakers.

@ghost
Author

ghost commented Sep 3, 2020

Yes, from the LibriTTS ids: list
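
To double-check, I counted the distinct speaker ids in my training filelist (throwaway snippet; the filelist path is mine, and I am assuming the usual audiopath|text|speaker_id format):

# Count distinct speaker ids in the training filelist.
with open("filelists/libritts_train_filelist.txt") as f:
    speaker_ids = {line.strip().split("|")[2] for line in f if line.strip()}
print(len(speaker_ids), "speakers")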

@rafaelvalle
Contributor

I synthesized the 3 phrases with our LibriTTS-100 model trained with speaker embeddings, using sigma=0.75 and n_frames=1000.

Your attention weights during training look really good and your validation loss is similar to what we reached.
Can you share your model weights?

phrases.zip

@ghost
Author

ghost commented Sep 3, 2020

Those phrases sound like what I'd like to hear.

I uploaded the checkpoint I used here

There is one small difference in the dataset:
speaker 40 has a few sentences that were removed.

This is the config file

This is the training files list

@ghost
Author

ghost commented Sep 10, 2020

@rafaelvalle did you manage to run the inference using the weights I shared?
Thanks

@rafaelvalle
Contributor

Yes, I get results similar to yours when using your model.
I will take a look at it once the paper deadlines are over.
