
detailed work pipeline to train a multi-speaker flowtron model #113

Open
JohnHerry opened this issue Apr 6, 2021 · 7 comments

@JohnHerry

JohnHerry commented Apr 6, 2021

Hi, all,
I am new to this area; has anybody tried to train a Flowtron in multi-speaker mode?
It seems Flowtron needs TWO-STAGE training, but there is only one config.json file and I don't know how to modify this config for the two stages. What does "n_flows" mean?
Is there any demo for a multi-speaker setup? And if my language is not English, what steps should I follow?

@rafaelvalle
Contributor

rafaelvalle commented Apr 6, 2021 via email

@JohnHerry
Author

@rafaelvalle Thanks for your help. I am training Flowtron on a language other than English, so I have to train from scratch. There is no pretrained Tacotron 2 model available to me as a text encoder, so do I need to train a Tacotron 2 on my multi-speaker corpus first?

@rafaelvalle
Contributor

No, you will not need Tacotron 2.
Just make sure to set the attention prior to True until the model learns attention; it's OK to train 2 steps of flow at once.
Then set the attention prior to False and resume training.
https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
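
For anyone following along, one way to script this two-stage schedule is to toggle the flag in config.json between runs. This is a minimal sketch in Python, assuming the data_config.use_attn_prior key at the linked config line; the train_config.checkpoint_path override used for resuming follows the pattern in the repo README and is an assumption here, and the checkpoint path shown is hypothetical. The same flag can also be passed on the command line with -p data_config.use_attn_prior=..., as shown later in this thread.

import json

CONFIG = "config.json"

def set_attn_prior(enabled: bool) -> None:
    """Toggle data_config.use_attn_prior in config.json between training stages."""
    with open(CONFIG) as f:
        cfg = json.load(f)
    cfg["data_config"]["use_attn_prior"] = enabled
    with open(CONFIG, "w") as f:
        json.dump(cfg, f, indent=4)

# Stage 1: train with the attention prior enabled until attention looks diagonal.
set_attn_prior(True)
# run: python train.py -c config.json

# Stage 2: disable the prior and resume from the stage-1 checkpoint.
set_attn_prior(False)
# run: python train.py -c config.json -p train_config.checkpoint_path=outdir/model_100000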

@JohnHerry
Author

@rafaelvalle
My config.json is as follows:
[screenshot of config.json]

I changed three values according to my dataset. I did not set use_attn_prior in the config; instead, I strictly followed the training command in your documentation:

python train.py -c config.json -p data_config.use_attn_prior=1

In our dataset there are 67 hours of speech from 142 speakers.

Should I first set the parameter "n_flows" to 1 until attention is good, then to 2 for the second stage, and so on?

How many training steps should it take to get attention in the first stage?

@JohnHerry
Author

I have run the first stage from scratch for three days on 6 RTX 3090 GPUs in total, but the attention still looks strange. Is there any problem?
[four attention plots from the first training stage]

@JohnHerry
Author

JohnHerry commented Apr 15, 2021

@rafaelvalle What do the x-ticks and y-ticks in the attention plot mean? I see the attention channels are 640, while my attention image above has x-ticks up to 200 and y-ticks up to 70; what do these represent?

I used config.json with n_texts=200. I saw that only a few samples have a text length over 160, so I removed the samples whose text length is greater than 160, but the attention picture is still not good.

[attention plot]

Is there any suggestion on how to use the attention plot to find my problems? I think most of these problems are about preprocessing, though.
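
For reference, dropping filelist entries above a text-length threshold can be done with a short script. This is a minimal sketch, assuming the pipe-separated audio_path|text|speaker_id filelist format used by Flowtron's filelists; the file names are hypothetical.

# Keep only filelist entries whose transcript is at most MAX_TEXT_LEN characters.
MAX_TEXT_LEN = 160  # threshold mentioned above

with open("train_filelist.txt", encoding="utf-8") as src, \
     open("train_filelist_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        text = parts[1]
        if len(text) <= MAX_TEXT_LEN:
            dst.write(line)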

@JohnHerry
Author

My corpus has multiple speakers, but my speaker IDs are not consecutive integers. There are 142 different speakers, while the speaker IDs range from 1 to 240; many speakers in the middle of the range were removed due to a low sample count. Is this the reason for the bad attention?
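
One way to rule that out is to remap the sparse speaker IDs to consecutive 0-based indices before training, so the speaker embedding table has no unused rows. A minimal sketch, again assuming the pipe-separated audio_path|text|speaker_id filelist format and hypothetical file names; the model's speaker count setting in config.json (n_speakers, if that is the key your config uses) should then match the 142 distinct speakers.

# Remap sparse speaker IDs (e.g. 1..240 with gaps) to consecutive indices 0..N-1.
def remap_speaker_ids(src_path: str, dst_path: str) -> dict:
    with open(src_path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("|") for line in f if line.strip()]
    rows = [r for r in rows if len(r) >= 3]  # drop malformed lines

    # Build a stable old-ID -> new-ID mapping over the speakers actually present.
    old_ids = sorted({r[2] for r in rows}, key=int)
    id_map = {old: str(new) for new, old in enumerate(old_ids)}

    with open(dst_path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(f"{r[0]}|{r[1]}|{id_map[r[2]]}\n")
    return id_map

mapping = remap_speaker_ids("train_filelist.txt", "train_filelist_remapped.txt")
print(f"{len(mapping)} speakers remapped to 0..{len(mapping) - 1}")

The same mapping would need to be applied to the validation filelist so that train and validation speaker IDs stay consistent.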
