Tacotron-2 plus World vocoder #304
It gets better with more training now.
This is great; I also wanted to tackle this some time ago but was busy with other projects. So you use mgc2sp and vice versa from SPTK, as in the Merlin project, rather than the codec WORLD provides? (https://github.com/mmorise/World/blob/master/examples/codec_test/readandsynthesis.cpp) I tried the WORLD codec with Merlin and found that the MGC parameterization performed better (REAPER also got rid of most of the V/UV errors), but I never dug deep into the reason for it.
I am using an early version of the WORLD vocoder source from Merlin, not the latest version in mmorise's repo, which seems to have trouble passing the resynthesis test scripts provided by Merlin. I do not yet have deep insight into why. But I have forked my own modified early-version WORLD vocoder source in my repo, and it works for me.
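For concreteness, here is a minimal sketch of this Merlin-style parameterization using pyworld plus the pysptk equivalents of the SPTK conversions. The file name, sample rate and all-pass constant alpha are illustrative assumptions; order=59 gives the 60-dim mgc used later in this thread.

```python
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

x, fs = sf.read('sample.wav')                 # mono float64 audio
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.dio(x, fs)                         # raw F0 track
f0 = pw.stonemask(x, f0, t, fs)               # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope, (T, fft//2+1)
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity, same shape as sp

alpha = 0.58                                  # all-pass constant, typical for 16 kHz
mgc = pysptk.sp2mc(sp, order=59, alpha=alpha) # 60-dim mel-cepstrum per frame
sp_back = pysptk.mc2sp(mgc, alpha=alpha, fftlen=(sp.shape[1] - 1) * 2)
```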
Thanks.
Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) onto my T1 fork to see what would happen. Generally speaking, there are few differing modules between these two repos. You might regard my T1 fork as a simplified version of this T2 project.
Here is the implementation on my T2 fork branch, currently only for Chinese Mandarin: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder
Nice job, thanks for sharing!
Hello, may I ask: the last dimension of the bap feature extracted by pyworld is 1025, so do I need to change the bap parameter num_bap = 5 in hparams to num_bap = 1025?
@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version in the repo. So please run the following commands and you will get the right vocoder library:

```
git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init
```
When I type 'git submodule update --init', is this normal?
@QueenKeys You might use
I have completed the installation according to your instructions, but the following error still occurs when running train.py.
@QueenKeys Did you check out the right branch?
I do not use world-v2.
Hello, you may have misunderstood me. I did not say that you are using world-v2, but you set num_bap to 5 in hparams.py, so I guess it is possible that you set bap as in world-v2; otherwise I am not really sure why you set num_bap to 5?
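For reference, a small num_bap usually comes from WORLD's coded band aperiodicity rather than the raw 1025-bin aperiodicity. A minimal sketch, assuming a recent pyworld build that exposes code_aperiodicity/decode_aperiodicity; at 48 kHz (the Biaobei corpus sample rate) the band count works out to 5, which would match num_bap = 5, and the default CheapTrick FFT size gives the 1025 bins reported above.

```python
import numpy as np
import pyworld as pw

fs = 48000
fft_size = pw.get_cheaptrick_fft_size(fs)   # 2048 at 48 kHz -> 1025 spectral bins
ap = np.ones((100, fft_size // 2 + 1))      # dummy full aperiodicity, shape (T, 1025)

bap = pw.code_aperiodicity(ap, fs)          # coded band aperiodicity, (T, 5) at 48 kHz
ap_rec = pw.decode_aperiodicity(bap, fs, fft_size)
print(bap.shape, ap_rec.shape)
```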
@begeekmyfriend It seems that in your
Hi everyone, I have upgraded the WORLD vocoder to the latest version, where we can use harvest instead of dio for F0 pitch extraction. The link is still in the original post. Any suggestion is welcome!
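For anyone switching, a minimal sketch of both F0 paths with pyworld; the file name and frame period are illustrative:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read('sample.wav')                 # mono float64 expected by pyworld
x = np.ascontiguousarray(x, dtype=np.float64)

# old path: dio (fast), refined by stonemask
f0_dio, t = pw.dio(x, fs, frame_period=5.0)
f0_dio = pw.stonemask(x, f0_dio, t, fs)

# new path: harvest, slower but typically fewer V/UV errors
f0_harvest, t = pw.harvest(x, fs, frame_period=5.0)
```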
Hi, I have been training the model for Portuguese. I have the same problem reported above with the LJ Speech dataset: during eval (during training, with teacher forcing) I get good results, but during synthesis the results are bad. My dataset has 10 hours of audio and I trained for approximately 272k steps. Is it necessary to train more, or is there a problem with the model? Were the results reported here obtained during eval (during training, with teacher forcing)?
@Edresson You need to check the alignment, like this: mozilla/TTS#9 (comment)
@begeekmyfriend I upgraded the repository as described, but the network does not converge; I believe it may be overfitting. I tried Tacotron with Griffin-Lim and also did not get good results. Tacotron does not seem to converge on my own dataset. With my own dataset I managed to get good results with DCTTS, but with Tacotron the results are very bad. Do you have any suggestions?
The
@begeekmyfriend Yes, I changed it, but I only trained the model for a few steps. I believe that with Griffin-Lim, Tacotron converges on my dataset after many steps; using DCTTS I needed 2000k steps to get good results. For Tacotron-World the loss varies a lot during training, and the model does not learn the alignment. I also tried using DCTTS with the WORLD vocoder, and I have the same problem: the model does not converge. If you get new results please inform me; I am working on it too, and I will report any progress here.
If it fails under
@begeekmyfriend I agree with you; however, DCTTS gets good results, and since Tacotron is more powerful, I believe it needs more data. I will train a model with Griffin-Lim to check whether that is the problem.
Latest commit begeekmyfriend@e40a7b7
Hi @begeekmyfriend, I am running Tacotron2 + pyworld on the Biaobei (10000-utterance) TTS corpus. Why is my alignment result above not continuous?
I forgot to tell you that for a different dataset, you should adjust the
@begeekmyfriend I get it! Thanks!!
Hi @begeekmyfriend, during my T2 training, the eval stop_token loss is increasing while the training stop_token loss is decreasing. I found that my training corpus contains no punctuation, while some of the eval sentences in hparams contain punctuation. Is this the root cause?
Hi @begeekmyfriend, the alignment looks good; however, when I run synthesize.py, I get audio with only noise and no speech. Did you get good results using synthesize.py?
That is because the stop token loss has not reduced to zero. My own test is still ongoing as well.
Switching F0 estimation from dio to harvest: does it improve synthesis?
You may test it with the resynthesis script.
I found that MSE fits the WORLD features better than MAE does, because the value scales of lf0, mgc and bap differ. The attention can also be kept through the whole training. MAE works well for mel spectrograms because they contain only one kind of feature. See begeekmyfriend@5863d55
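To make the scale argument concrete, here is a toy numpy illustration (not the repo's actual loss code; the shapes and scales below are made up): when streams with very different value ranges are concatenated into one target, squared error weights the large-scale stream quadratically more than L1 does, so the small-scale streams cannot dominate the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
streams = {                        # made-up scales, loosely mimicking lf0/mgc/bap
    'lf0': rng.normal(0.0, 5.0, (T, 1)),
    'mgc': rng.normal(0.0, 0.5, (T, 60)),
    'bap': rng.normal(0.0, 0.1, (T, 5)),
}
for name, target in streams.items():
    pred = target + rng.normal(0.0, 0.1 * target.std(), target.shape)
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    # the MAE ratio between streams is linear in scale; the MSE ratio is squared
    print(f'{name}: MAE={mae:.4f}  MSE={mse:.4f}')
```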
@begeekmyfriend Can you share a better synthesized sample audio?
The demo has been shown in the earlier posts with the spectrogram graphs. Maybe I need to reduce the frame period to obtain better quality. However, I was told that the fidelity of the samples is not as good as that from Griffin-Lim, even though the quality is indeed better.
By the way, if you want to hear complete synthesized samples, please wait until the stop token loss has reduced to zero.
As for alignment, remember to adapt your
@begeekmyfriend
Incompatible. In fact, maintaining both of those Tacotron projects would exhaust me, so I am focused on my T2 fork currently.
@begeekmyfriend Your learning curve is better than mine, great!!
Here is a Biaobei Mandarin demo from T2 + WORLD. F0 feature value prediction is tough for this model.
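On the difficulty of F0 prediction: a common Merlin-style treatment (a sketch of the general technique, not necessarily what this fork does) is to regress a continuous interpolated log-F0 plus a separate V/UV flag, instead of the raw F0 track that drops to zero in unvoiced frames:

```python
import numpy as np

def f0_to_lf0_vuv(f0):
    # assumes at least one voiced frame (f0 > 0)
    voiced = f0 > 0
    vuv = voiced.astype(np.float64)          # 1 = voiced, 0 = unvoiced
    idx = np.arange(len(f0))
    # interpolate log-F0 linearly through unvoiced gaps for a continuous target
    lf0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return lf0, vuv
```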
@begeekmyfriend Can you please tell me what modifications/steps are required in the current version of your T1 repo to make it run with the LJ Speech dataset?
The T1 modification is just a trivial version. I am focused on T2 currently.
@begeekmyfriend Thanks for the response. In that case, do you mind giving a concise list of the steps required for the LJ Speech dataset?
Sorry, but the synthesizer is still using the Griffin-Lim algorithm in this branch, right? I was hoping for your WORLD vocoder implementation for LJ Speech. What changes are needed in hparams and preprocessing?
Sorry, I forgot it. You might diff the
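Since the reply above is cut off, here is a hypothetical sketch of the WORLD-related hparams one would expect to differ, with dimensions taken from the issue body (lf0[1], mgc[60], ap[5]). The exact key names are assumptions, not necessarily the repo's actual ones:

```python
num_lf0 = 1        # log-F0 stream (assumed key name)
num_mgc = 60       # mel-generalized cepstral coefficients (assumed key name)
num_bap = 5        # coded band aperiodicities for 48 kHz audio (appears in this thread)
frame_period = 5   # WORLD analysis/synthesis hop in milliseconds (assumed key name)
```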
Thanks! I actually wanted to try the WORLD vocoder because GL is really slow. I have managed to perform optimizations in the Tacotron part of the model and it is performing faster than before on CPU, but GL is still slow, and reducing the number of iterations just kills the quality. I will try your implementation and see if I can get a good result.
I have given up on this solution and turned to WaveRNN. Feel free to reopen this issue.
@begeekmyfriend Hi, thank you for your great work. Have you tried Tacotron2 + WORLD vocoder on long sentences? When I test the WORLD vocoder on very long sentences (about 200 characters), I get some uncomfortable sounds (something like breaking sounds) in the audio. Have you experienced this problem, and what do you think the solution might be?
You might split that long sentence into two sentences at a pause point.
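One simple way to implement this suggestion, as a sketch (the punctuation set and length limit are assumptions; it covers both ASCII and Chinese full-width pause marks):

```python
import re

def split_at_pauses(text, max_chars=100):
    # split after pause punctuation, then regroup into chunks under max_chars
    parts = re.split(r'(?<=[,.;:!?，。；：！？])\s*', text)
    chunks, current = [], ''
    for part in parts:
        if current and len(current) + len(part) > max_chars:
            chunks.append(current)
            current = part
        else:
            current += part
    if current:
        chunks.append(current)
    return chunks
```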
Hello, the Python-Wrapper-for-World-Vocoder link is no longer valid. Could you post it again? Has a pretrained model been provided?
@JJ-Guo1996 I am not engaged in TTS any more, and I do not remember how I deleted the repo you mentioned. There may be substitutes for the WORLD vocoder listed in the README file; you can refer to it.
Hey, I am glad to inform you that I have succeeded in merging the Tacotron model with the WORLD vocoder and have generated some evaluation results, attached below. The results sound not bad, though still not perfect. However, it shows another way to train different feature parameters with Tacotron. The WORLD vocoder is an open source project, so everyone can use it freely. Moreover, the quality of the resynthesis results from that vocoder is better than that from Griffin-Lim, since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore, the depth of the features is low enough that we do not need the postnet in the Tacotron model. Training time can be reduced to 0.7 seconds per step, and inference can also be quick enough even when it runs only on CPU. So it is really worth trying.
I would like to share my experimental source code with you as follows. Note that it currently only supports Chinese Mandarin; you may modify it for other languages:
tacotron-world-vocoder branch
Python-Wrapper-for-World-Vocoder
pysptk merlin-world-vocoder branch
By the way, you need to run
python setup.py install
and then copy the .so file manually into the system path for both pysptk and the Python wrapper project. Besides, I would also like to provide two Python scripts for a WORLD vocoder resynthesis test.
world_vocoder_resynth_scripts.zip
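For readers who cannot download the attachment, here is a minimal analysis/resynthesis round trip in the same spirit (a sketch assuming a recent pyworld; it is not the attached script). Reducing frame_period, as mentioned in the thread, trades speed for quality.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read('input.wav')                  # mono audio
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs, frame_period=5.0)   # or pw.dio + pw.stonemask
sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity

y = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write('resynth.wav', y, fs)
```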
@Rayhane-mamah Let us rock with it! And @r9y9, thanks for your pysptk project.
world_vocoder_demo.zip