
Tacotron-2 plus World vocoder #304

Closed · begeekmyfriend opened this issue Dec 28, 2018 · 58 comments

@begeekmyfriend (Contributor) commented Dec 28, 2018

Hey, I am glad to inform you that I have succeeded in merging the Tacotron model with the WORLD vocoder and generated some evaluation results as follows. The results sound not bad, but still not perfect. However, this shows another way to train different feature parameters with Tacotron. The WORLD vocoder is an open source project, so everyone is free to use it. Moreover, the quality of resynthesis results from that vocoder is better than that from Griffin-Lim, since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore, the depth of the features is low enough that we do not need the postnet in the Tacotron model. Training time can be reduced to 0.7 seconds per step, and inference is quick enough even when it only runs on CPU. So it is really worth trying.

I would like to share my experimental source code with you as follows. Note that it currently only supports Mandarin Chinese; you may modify it for other languages:
tacotron-world-vocoder branch
Python-Wrapper-for-World-Vocoder
pysptk merlin-world-vocoder branch
By the way, you need to run python setup.py install and then copy the .so file manually into the system path for the pysptk and Python wrapper projects.

Besides, I would also like to provide two Python scripts for a WORLD vocoder resynthesis test; a minimal sketch of the round trip follows below.
world_vocoder_resynth_scripts.zip
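
As a rough guide to what such a resynthesis script does, here is a minimal sketch using the public pyworld and pysptk APIs. It is an assumption-laden outline, not the zip's actual code: the input file name, the dio/stonemask front end, the warping coefficient alpha, and the MGC order are all placeholders that the real scripts may set differently.

```python
# Minimal WORLD analysis/resynthesis round trip (sketch; not the zip's code).
import numpy as np
import pyworld as pw
import pysptk
from scipy.io import wavfile

fs, x = wavfile.read('input.wav')               # hypothetical input file
x = x.astype(np.float64)

# Analysis: F0 contour, smoothed spectral envelope, aperiodicity
f0, t = pw.dio(x, fs)                           # raw F0 estimate
f0 = pw.stonemask(x, f0, t, fs)                 # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)                # spectral envelope
ap = pw.d4c(x, f0, t, fs)                       # aperiodicity

# Compact parameterization as in the post: lf0[1], mgc[60], band aperiodicity
alpha = 0.58                                    # warping coefficient (assumption)
mgc = pysptk.sp2mc(sp, order=59, alpha=alpha)   # 60 coefficients per frame
lf0 = np.where(f0 > 0.0, np.log(f0), 0.0)       # log F0; 0 marks unvoiced frames
bap = pw.code_aperiodicity(ap, fs)              # band count depends on fs

# Decode the compact features and resynthesize
fft_size = pw.get_cheaptrick_fft_size(fs)
sp_hat = pysptk.mc2sp(mgc, alpha=alpha, fftlen=fft_size)
ap_hat = pw.decode_aperiodicity(np.ascontiguousarray(bap), fs, fft_size)
f0_hat = np.where(lf0 > 0.0, np.exp(lf0), 0.0)
y = pw.synthesize(f0_hat, sp_hat, ap_hat, fs)
wavfile.write('resynth.wav', fs, y.astype(np.int16))
```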

@Rayhane-mamah Let us rock with it! And @r9y9, thanks for your pysptk project.
world_vocoder_demo.zip
[image]

@begeekmyfriend (Contributor, Author) commented Dec 28, 2018

It gets better with more training now.
world_vocoder_demo.zip
[image]

@m-toman (Contributor) commented Dec 28, 2018

This is great. I also wanted to tackle this some time ago but was busy with other projects.

So you use mgc2sp and vice versa from SPTK, as in the Merlin project, and not the codec WORLD provides? (https://github.com/mmorise/World/blob/master/examples/codec_test/readandsynthesis.cpp)

I've tried the WORLD codec with Merlin and found that the MGC parameterization performed better (REAPER also got rid of most of the V/UV errors), but I never dug deeply into the reason for it.

@begeekmyfriend (Contributor, Author)

I am using an early version of the WORLD vocoder source from Merlin, not the latest version in mmorise's repo, which seems to have difficulty passing the resynthesis test scripts provided by Merlin. I do not yet have deep insight into why. But I have forked my own modified early-version WORLD vocoder source in my repo, and it works for me.

@m-toman (Contributor) commented Dec 28, 2018

Thanks.
Perhaps it is worth integrating the modifications into this repository... do you know where the critical differences between the keithito repo and this one are?

@begeekmyfriend (Contributor, Author)

Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) to my T1 fork to see what would happen. Generally speaking, there are few differing modules between the two repos. You might regard my T1 fork as a simplified version of this T2 project.

@begeekmyfriend (Contributor, Author)

Here is the implementation on my T2 fork branch, only for Mandarin Chinese: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder

@shartoo commented Jan 7, 2019

Nice job, thanks for sharing!

@QueenKeys

> Here is the implementation on my T2 fork branch, only for Mandarin Chinese: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder

Hello, may I ask: the last dimension of the bap feature extracted by pyworld is 1025, so do I need to change the bap parameter num_bap = 5 in hparams to num_bap = 1025?

@begeekmyfriend (Contributor, Author)

@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.

git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init

@QueenKeys

> @QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.
>
> git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
> git submodule update --init

When I type 'git submodule update --init', is this output normal?
Submodule 'lib/World' (https://github.com/mmorise/World) not registered for path 'lib/World'
Cloning into '/home/queen/document/Python-Wrapper-for-World-Vocoder/lib/World'...
Submodule path 'lib/World': checked out 'd7c03432d572c5a162edba9c611b3c8e367069a9'

@begeekmyfriend (Contributor, Author)

@QueenKeys You might use the world_vocoder_resynth_scripts.zip provided in the first post to verify whether it has been installed successfully.

@QueenKeys

> @QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.
>
> git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
> git submodule update --init

I have completed the installation according to your instructions, but the following error still occurs when running train.py.
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/queen/下载/Tacotron-2-mandarin-world-vocoder/tacotron/feeder.py", line 173, in _enqueue_next_test_group
self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 1236, 513) for Tensor 'datafeeder/bap_targets:0', which has shape '(?, ?, 5)'

@begeekmyfriend (Contributor, Author)

@QueenKeys Did you check out the right branch, mandarin-world-vocoder, for this test?

@QueenKeys

> @QueenKeys Did you check out the right branch, mandarin-world-vocoder, for this test?

Yes. I have tested the dimensions of the three parameters lf0, mgc, and bap in Python-Wrapper-for-World-Vocoder, which are (, 1), (, 60), and (, 513). When I change num_bap = 5 to num_bap = 513 in the hparams.py file, train.py runs normally, but the parameters set in world-v2 in Merlin should be (, 1), (, 60), (, 5).

@begeekmyfriend (Contributor, Author)

I do not use world-v2.

@QueenKeys

> I do not use world-v2.

Hello, you may have misunderstood me. I didn't say that you are using world-v2, but you set num_bap to 5 in hparams.py, so I guess it is possible that you set bap as in world-v2; otherwise I am not really sure why you set num_bap to 5?
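
The dimensions under discussion are consistent with pyworld's band-coded aperiodicity: the raw d4c output has fft_size/2 + 1 bins per frame (1025 at 48 kHz and 513 at 16 kHz, which would match the figures reported above), while code_aperiodicity compresses it to a few bands whose count depends on the sampling rate — 5 bands at 48 kHz, which would explain num_bap = 5. A minimal sketch, assuming 48 kHz audio:

```python
# Raw vs. band-coded aperiodicity dimensions in pyworld
# (assumption: 48 kHz audio, where pyworld codes 5 aperiodicity bands).
import numpy as np
import pyworld as pw

fs = 48000
x = np.random.randn(fs).astype(np.float64)   # 1 second of noise as a stand-in signal

f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)                    # full-band: (T, fft_size/2 + 1) = (T, 1025)
bap = pw.code_aperiodicity(ap, fs)           # band-coded: (T, 5) at fs = 48000

print(ap.shape, bap.shape)
```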

@OswaldoBornemann

@begeekmyfriend It seems that in your https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder, hparams.py does not have CBHG parameters. Did I do something wrong?

@begeekmyfriend (Contributor, Author)

Hi everyone, I have upgraded the WORLD vocoder to the latest version, where we can use harvest instead of dio for F0 pitch extraction. The link is still in the first post. Any suggestions are welcome!
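
For anyone comparing the two front ends, a minimal dio-versus-harvest sketch with pyworld follows (the file name and frame period are assumptions); harvest is slower but typically makes fewer voiced/unvoiced errors than dio:

```python
# Comparing pyworld's two F0 estimators on one utterance (sketch).
import numpy as np
import pyworld as pw
from scipy.io import wavfile

fs, x = wavfile.read('sample.wav')           # hypothetical input file
x = x.astype(np.float64)

f0_dio, t = pw.dio(x, fs, frame_period=5.0)
f0_dio = pw.stonemask(x, f0_dio, t, fs)      # refinement step usually paired with dio

f0_harvest, _ = pw.harvest(x, fs, frame_period=5.0)

print('voiced frames (dio):    ', int((f0_dio > 0).sum()))
print('voiced frames (harvest):', int((f0_harvest > 0).sum()))
```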

@Edresson commented Mar 19, 2019

Hi, I've been training the model for Portuguese. I have the same problem reported above with the LJ Speech dataset: during eval (during training, with teacher forcing) I get good results, but during synthesis the results are bad. My dataset has 10 hours of audio and I trained for approximately 272k steps. Is it necessary to train more, or is there a problem with the model?

Were the results reported here obtained during eval (during training, with teacher forcing)?

@begeekmyfriend (Contributor, Author)

@Edresson You need to check the alignment, as in mozilla/TTS#9 (comment)

@Edresson

@begeekmyfriend I updated the repository as described, but the network does not converge; I believe it may be overfitting. I tried Tacotron with Griffin-Lim and also did not get good results. Tacotron does not seem to converge on my own dataset. With my own dataset I managed to get good results with DCTTS, but with Tacotron the results are very bad. Do you have any suggestions?

@begeekmyfriend (Contributor, Author)

The Griffin-Lim branch is only for Mandarin Chinese. Did you change the dictionary to your own?
As for WORLD features in my tests: for some datasets the model can learn alignment quickly, but for others it fails. I am still working on it.

@Edresson commented Mar 26, 2019

@begeekmyfriend Yes, I changed it, but I only trained the model for a few steps. I believe that with Griffin-Lim, Tacotron would converge on my dataset after many steps; with DCTTS I needed 2000k steps to get good results. For Tacotron-World the loss varies a lot during training, and the model does not learn the alignment. I also tried DCTTS with the WORLD vocoder and have the same problem: the model does not converge. If you get new results please let me know; I am working on it too, and I will report any progress here.

@begeekmyfriend (Contributor, Author)

If it fails under the griffin-lim branch, it may well be that your dataset is not good enough for TTS.

@Edresson

@begeekmyfriend I agree with you; however, DCTTS gets good results, and as Tacotron is more powerful I believe it needs more data. I will train a model with Griffin-Lim to check whether that is the problem.

@begeekmyfriend (Contributor, Author)

Latest commit begeekmyfriend@e40a7b7
Any feedback is welcome!
[image: step-13000-align]

@superhg2012

[image: step-20000-eval-align]

Hi @begeekmyfriend, I am running Tacotron2 + pyworld using the Biaobei (10000-sentence) TTS corpus. Why is my alignment result above not continuous?

@begeekmyfriend (Contributor, Author) commented Mar 28, 2019

I forgot to tell you that for different datasets you should adjust hp.max_frame_num and hp.max_text_length, which guided attention uses as the alignment slope, for better convergence.
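
For reference, guided attention (from the DCTTS paper) penalizes attention mass far from a diagonal whose slope is set by the expected text-length to frame-count ratio, which is why those two hyperparameters matter for convergence. A minimal sketch; the function and argument names are assumptions, not the repo's code:

```python
# Guided-attention penalty W[n, t] = 1 - exp(-((t/T - n/N)^2) / (2 g^2)):
# near 0 on the expected diagonal, approaching 1 far away from it.
import numpy as np

def guided_attention_weights(max_text_length, max_frame_num, g=0.2):
    n = np.arange(max_text_length)[:, None] / max_text_length  # text positions in [0, 1)
    t = np.arange(max_frame_num)[None, :] / max_frame_num      # frame positions in [0, 1)
    return 1.0 - np.exp(-((t - n) ** 2) / (2.0 * g ** 2))

# The auxiliary loss is mean(alignment * W), added to the main loss so the
# model is pushed toward a roughly diagonal alignment early in training.
W = guided_attention_weights(max_text_length=100, max_frame_num=800)
```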

@superhg2012

@begeekmyfriend I get it! thanks!!

@superhg2012

Hi @begeekmyfriend, during my T2 training, the eval stop_token loss is increasing while the train stop_token loss is decreasing. I found that my training corpus contains no punctuation, while some of the eval sentences in hparams contain punctuation. Is this the root cause?

@Edresson

Hi @begeekmyfriend, the alignment looks good; however, when I run synthesize.py I get audio containing only noise, without speech. Did you get good results using synthesize.py?
See the alignment images below.

During training:
[image: alignment at 57k steps, training]
During eval (in training):
[image: alignment at 57k steps, eval]

@begeekmyfriend (Contributor, Author)

Because the stop token loss has not reduced to zero. My test is still ongoing as well.

@superhg2012

> Because the stop token loss has not reduced to zero. My test is still ongoing as well.

Does switching from dio to harvest for F0 estimation improve synthesis?

@begeekmyfriend (Contributor, Author)

You may test it with the resynthesis script.
world_resynth.zip

@begeekmyfriend (Contributor, Author)

I found that MSE fits WORLD features better than MAE does, because the value scales of lf0, mgc and bap differ; with MSE the attention can be kept through the whole training. MAE works well for mel spectrograms because they contain only one kind of feature. See begeekmyfriend@5863d55
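
One plausible reading of this: MSE penalizes errors quadratically, so the streams with larger values dominate the objective, while under MAE every element's error counts linearly regardless of scale. A toy illustration of each stream's share of the total loss (the scales below are assumptions for illustration, not measurements from the repo):

```python
# Per-stream contribution to the summed loss under MAE vs. MSE when the
# feature streams live on different scales (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)
T = 200
targets = {
    'lf0': rng.normal(5.0, 1.0, (T, 1)),    # log F0: values around 5
    'mgc': rng.normal(0.0, 1.0, (T, 60)),   # cepstral coefficients: unit scale
    'bap': rng.normal(0.0, 0.1, (T, 5)),    # band aperiodicity: small scale
}
# Pretend each stream is mispredicted by roughly 5% of its typical magnitude.
errors = {k: 0.05 * np.abs(v).mean() * rng.standard_normal(v.shape)
          for k, v in targets.items()}

for k, err in errors.items():
    print(f'{k}  MAE contribution: {np.abs(err).sum():10.3f}   '
          f'MSE contribution: {np.square(err).sum():10.5f}')
```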

@superhg2012

@begeekmyfriend can you share a better synthesized sample audio?

@begeekmyfriend (Contributor, Author)

The demos have been shared in the earlier posts that include spectrogram graphs. Maybe I need to reduce the frame period to obtain better quality. However, I was told that the fidelity of the samples is not as good as that from Griffin-Lim, even though the quality is indeed better.

@begeekmyfriend (Contributor, Author)

By the way, when you want to hear complete synthesized samples, please wait until the stop token loss has reduced to zero.

@begeekmyfriend (Contributor, Author)

As for alignment, remember to adapt your max_text_length and max_frame_num to the best N:T ratio, which depends on your dataset; see the sketch below.
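
A hedged helper for picking that ratio from training metadata; the (text, frame count) pair format is an assumption about how the metadata is stored:

```python
# Estimate the median text-length to frame-count (N:T) ratio of a dataset,
# so max_text_length and max_frame_num can be chosen in the same proportion.
import numpy as np

def median_nt_ratio(samples):
    """samples: iterable of (text, n_frames) pairs (assumed metadata format)."""
    ratios = [len(text) / n_frames for text, n_frames in samples if n_frames > 0]
    return float(np.median(ratios))

# Example: pick max_text_length / max_frame_num close to
# median_nt_ratio(metadata) so the guided-attention diagonal matches the data.
```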

@begeekmyfriend (Contributor, Author)

[image]

@mrgloom commented Apr 2, 2019

@begeekmyfriend
Is a pretrained model compatible with this repo, https://github.com/begeekmyfriend/tacotron/tree/mandarin-world-vocoder, available for testing?

@begeekmyfriend (Contributor, Author)

Incompatible. In fact, maintaining both of those Tacotron projects would exhaust me, so I am focused on my T2 fork currently.

@superhg2012

@begeekmyfriend your learning curve is better than mine, great!!

@begeekmyfriend (Contributor, Author)

Here is a Biaobei Mandarin demo from T2 + WORLD. F0 feature value prediction is tough for this model.
xmly_biaobei_world.zip

@sujeendran

> Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) to my T1 fork to see what would happen. Generally speaking, there are few differing modules between the two repos. You might regard my T1 fork as a simplified version of this T2 project.

@begeekmyfriend Can you please tell me what modifications/steps are required in the current version of your T1 repo to make it run with the LJ Speech dataset?
I am sorry to ask this now; most of the steps are scattered across the previous comments, and I thought it would be helpful for others to have them in one place too.
Also, does this T1 version run with your updated WORLD vocoder repo?

@begeekmyfriend (Contributor, Author)

The T1 modification is just a trivial version. I am focused on T2 currently.

@sujeendran

@begeekmyfriend Thanks for the response. In that case, do you mind giving a concise list of the steps required for the LJ Speech dataset?

@begeekmyfriend (Contributor, Author)

https://github.com/begeekmyfriend/Tacotron-2/tree/griffin-lim

@sujeendran
Copy link

> https://github.com/begeekmyfriend/Tacotron-2/tree/griffin-lim

Sorry, but the synthesizer is still using the Griffin-Lim algorithm in this branch, right? I was hoping for your WORLD vocoder implementation for LJ Speech. What changes are needed in hparams and preprocessing?

@begeekmyfriend (Contributor, Author)

Sorry, I forgot that. You might diff the griffin-lim and mandarin-griffin-lim branches and merge the differences into the mandarin-world-vocoder branch. By the way, in my tests the evaluation from WORLD plus Tacotron did not work as well as that from Griffin-Lim.

@sujeendran

> Sorry, I forgot that. You might diff the griffin-lim and mandarin-griffin-lim branches and merge the differences into the mandarin-world-vocoder branch. By the way, in my tests the evaluation from WORLD plus Tacotron did not work as well as that from Griffin-Lim.

Thanks! I actually wanted to try the WORLD vocoder since GL is really slow. I have managed to optimize the Tacotron part of the model and it performs faster than before on CPU, but GL is still slow, and reducing the number of iterations just kills the quality. I will try your implementation and see if I can get a good result.

@begeekmyfriend (Contributor, Author)

I am giving up on this solution and turning to WaveRNN. Feel free to reopen this issue.

@byuns9334

@begeekmyfriend Hi, thank you for your great work. Have you tried Tacotron2 + WORLD vocoder on long sentences? When I test the WORLD vocoder on a very long sentence (about 200 characters), I get some uncomfortable sounds (something like breaking sounds) in the audio. Have you experienced this problem, and what do you think the solution would be?
long.wav.zip

@begeekmyfriend (Contributor, Author)

You might split that long sentence into two at the pause point; a sketch of this workaround follows.
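
Split the input at pause punctuation, synthesize each chunk, and concatenate the waveforms with short silences. The synthesize callable, sampling rate, and gap length below are assumptions, not the repo's API:

```python
# Split long input text at pause punctuation and stitch the chunks together.
import re
import numpy as np

def split_at_pauses(text):
    # Split on common pause punctuation (Chinese and Latin), dropping empty chunks.
    return [s.strip() for s in re.split(r'[,，。.;；!！?？]', text) if s.strip()]

def synthesize_long(text, synthesize, fs=16000, gap_ms=150):
    """`synthesize` is assumed to map one short sentence to a float32 waveform."""
    gap = np.zeros(int(fs * gap_ms / 1000), dtype=np.float32)
    parts = []
    for chunk in split_at_pauses(text):
        parts.append(synthesize(chunk))
        parts.append(gap)
    return np.concatenate(parts[:-1]) if parts else np.zeros(0, dtype=np.float32)
```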

@JJun-Guo

Hello, the Python-Wrapper-for-World-Vocoder link is no longer valid. Could you post it again? Has a pretrained model been provided?

@begeekmyfriend (Contributor, Author)

@JJ-Guo1996 I am not engaged in TTS any more, and I do not remember how the repo you mentioned got deleted. There may be substitutes for the WORLD vocoder listed in the README file; you can refer to that.
