Tacotron-2 plus World vocoder #304
It gets better with more training now.
This is great; I also wanted to tackle this some time ago but was busy with other projects. So you use mgc2sp and vice versa from SPTK, as in the Merlin project, rather than the codec WORLD provides? (https://github.com/mmorise/World/blob/master/examples/codec_test/readandsynthesis.cpp) I tried the WORLD codec with Merlin and found that the MGC parameterization performed better (REAPER also got rid of most of the V/UV errors), but I never dug deep into the reason for it.
I am using an early version of the WORLD vocoder source from Merlin, not the latest version in mmorise's repo, which seems to have trouble passing the resynthesis test scripts provided by Merlin. I do not yet have deep insight into why. But I have forked my own modified early-version WORLD vocoder source in my repo, and it works for me.
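For concreteness, here is a minimal sketch of this Merlin-style parameterization using pyworld plus the pysptk equivalents of the SPTK conversions. The file name, sample rate and all-pass constant alpha are illustrative assumptions; order=59 gives the 60-dim mgc used later in this thread.

```python
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

x, fs = sf.read('sample.wav')                 # mono float64 audio
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.dio(x, fs)                         # raw F0 track
f0 = pw.stonemask(x, f0, t, fs)               # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope, (T, fft//2+1)
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity, same shape as sp

alpha = 0.58                                  # all-pass constant, typical for 16 kHz
mgc = pysptk.sp2mc(sp, order=59, alpha=alpha) # 60-dim mel-cepstrum per frame
sp_back = pysptk.mc2sp(mgc, alpha=alpha, fftlen=(sp.shape[1] - 1) * 2)
```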
Thanks.
Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) onto my T1 fork to see what would happen. Generally speaking, there are few differing modules between these two repos. You might regard my T1 fork as a simplified version of this T2 project.
Here is the implementation on my T2 fork branch, currently only for Chinese Mandarin: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder
Nice job, thanks for sharing!
Hello, may I ask: the last dimension of the bap feature extracted by pyworld is 1025, so do I need to change the bap parameter num_bap = 5 in hparams to num_bap = 1025?
@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version in the repo. So please run the following commands and you will get the right vocoder library:

```
git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init
```
When I type 'git submodule update --init', is this normal?
@QueenKeys You might use
I have completed the installation according to your instructions, but the following error still occurs when running train.py.
@QueenKeys Did you check out the right branch?
I do not use world-v2.
Hello, you may have misunderstood me. I did not say that you are using world-v2, but you set num_bap to 5 in hparams.py, so I guess it is possible that you set bap as in world-v2; otherwise I am not really sure why you set num_bap to 5?
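For reference, a small num_bap usually comes from WORLD's coded band aperiodicity rather than the raw 1025-bin aperiodicity. A minimal sketch, assuming a recent pyworld build that exposes code_aperiodicity/decode_aperiodicity; at 48 kHz (the Biaobei corpus sample rate) the band count works out to 5, which would match num_bap = 5, and the default CheapTrick FFT size gives the 1025 bins reported above.

```python
import numpy as np
import pyworld as pw

fs = 48000
fft_size = pw.get_cheaptrick_fft_size(fs)   # 2048 at 48 kHz -> 1025 spectral bins
ap = np.ones((100, fft_size // 2 + 1))      # dummy full aperiodicity, shape (T, 1025)

bap = pw.code_aperiodicity(ap, fs)          # coded band aperiodicity, (T, 5) at 48 kHz
ap_rec = pw.decode_aperiodicity(bap, fs, fft_size)
print(bap.shape, ap_rec.shape)
```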
@begeekmyfriend It seems that in your
Hi everyone, I have upgraded the WORLD vocoder to the latest version, where we can use harvest instead of dio for F0 pitch extraction. The link is still in the original post. Any suggestion is welcome!
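For anyone switching, a minimal sketch of both F0 paths with pyworld; the file name and frame period are illustrative:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read('sample.wav')                 # mono float64 expected by pyworld
x = np.ascontiguousarray(x, dtype=np.float64)

# old path: dio (fast), refined by stonemask
f0_dio, t = pw.dio(x, fs, frame_period=5.0)
f0_dio = pw.stonemask(x, f0_dio, t, fs)

# new path: harvest, slower but typically fewer V/UV errors
f0_harvest, t = pw.harvest(x, fs, frame_period=5.0)
```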
Hi, I have been training the model for Portuguese. I have the same problem reported above with the LJ Speech dataset: during eval (during training, with teacher forcing) I get good results, but during synthesis the results are bad. My dataset has 10 hours of audio and I trained for approximately 272k steps. Is it necessary to train more, or is there a problem with the model? Were the results reported here obtained during eval (during training, with teacher forcing)?
@Edresson You need to check the alignment, like this: mozilla/TTS#9 (comment)
@begeekmyfriend I upgraded the repository as described, but the network does not converge; I believe it may be overfitting. I tried Tacotron with Griffin-Lim and also did not get good results. Tacotron does not seem to converge on my own dataset. With my own dataset I managed to get good results with DCTTS, but with Tacotron the results are very bad. Do you have any suggestions?
The
@begeekmyfriend Yes, I changed it, but I only trained the model for a few steps. I believe that with Griffin-Lim, Tacotron converges on my dataset after many steps; using DCTTS I needed 2000k steps to get good results. For Tacotron-World the loss varies a lot during training, and the model does not learn the alignment. I also tried using DCTTS with the WORLD vocoder, and I have the same problem: the model does not converge. If you get new results please inform me; I am working on it too, and I will report any progress here.
If it fails under
@begeekmyfriend I agree with you; however, DCTTS gets good results, and since Tacotron is more powerful, I believe it needs more data. I will train a model with Griffin-Lim to check whether that is the problem.
Latest commit begeekmyfriend@e40a7b7
Hi @begeekmyfriend, I am running Tacotron2 + pyworld on the Biaobei (10000-utterance) TTS corpus. Why is my alignment result above not continuous?
I forgot to tell you that for a different dataset, you should adjust the
@begeekmyfriend I get it! Thanks!!
Hi @begeekmyfriend, during my T2 training, the eval stop_token loss is increasing while the training stop_token loss is decreasing. I found that my training corpus contains no punctuation, while some of the eval sentences in hparams contain punctuation. Is this the root cause?
Hi @begeekmyfriend, the alignment looks good; however, when I run synthesize.py, I get audio with only noise and no speech. Did you get good results using synthesize.py?
That is because the stop token loss has not reduced to zero. My own test is still ongoing as well.
Switching F0 estimation from dio to harvest: does it improve synthesis?
You may test it with the resynthesis script.
I found that MSE fits the WORLD features better than MAE does, because the value scales of lf0, mgc and bap differ. The attention can also be kept through the whole training. MAE works well for mel spectrograms because they contain only one kind of feature. See begeekmyfriend@5863d55
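To make the scale argument concrete, here is a toy numpy illustration (not the repo's actual loss code; the shapes and scales below are made up): when streams with very different value ranges are concatenated into one target, squared error weights the large-scale stream quadratically more than L1 does, so the small-scale streams cannot dominate the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
streams = {                        # made-up scales, loosely mimicking lf0/mgc/bap
    'lf0': rng.normal(0.0, 5.0, (T, 1)),
    'mgc': rng.normal(0.0, 0.5, (T, 60)),
    'bap': rng.normal(0.0, 0.1, (T, 5)),
}
for name, target in streams.items():
    pred = target + rng.normal(0.0, 0.1 * target.std(), target.shape)
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    # the MAE ratio between streams is linear in scale; the MSE ratio is squared
    print(f'{name}: MAE={mae:.4f}  MSE={mse:.4f}')
```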
@begeekmyfriend Can you share a better synthesized sample audio?
The demo has been shown in the earlier posts with the spectrogram graphs. Maybe I need to reduce the frame period to obtain better quality. However, I was told that the fidelity of the samples is not as good as that from Griffin-Lim, even though the quality is indeed better.
By the way, if you want to hear complete synthesized samples, please wait until the stop token loss has reduced to zero.
As for alignment, remember to adapt your
@begeekmyfriend
Incompatible. In fact, maintaining both of those Tacotron projects would exhaust me, so I am focused on my T2 fork currently.
@begeekmyfriend Your learning curve is better than mine, great!!
Here is a Biaobei Mandarin demo from T2 + WORLD. F0 feature value prediction is tough for this model.
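On the difficulty of F0 prediction: a common Merlin-style treatment (a sketch of the general technique, not necessarily what this fork does) is to regress a continuous interpolated log-F0 plus a separate V/UV flag, instead of the raw F0 track that drops to zero in unvoiced frames:

```python
import numpy as np

def f0_to_lf0_vuv(f0):
    # assumes at least one voiced frame (f0 > 0)
    voiced = f0 > 0
    vuv = voiced.astype(np.float64)          # 1 = voiced, 0 = unvoiced
    idx = np.arange(len(f0))
    # interpolate log-F0 linearly through unvoiced gaps for a continuous target
    lf0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return lf0, vuv
```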
@begeekmyfriend Can you please tell me what modifications/steps are required in the current version of your T1 repo to make it run with the LJ Speech dataset?
The T1 modification is just a trivial version. I am focused on T2 currently.
@begeekmyfriend Thanks for the response. In that case, do you mind giving a concise list of the steps required for the LJ Speech dataset?
Sorry, but the synthesizer is still using the Griffin-Lim algorithm in this branch, right? I was hoping for your WORLD vocoder implementation for LJ Speech. What changes are needed in hparams and preprocessing?
Sorry, I forgot it. You might diff the
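Since the reply above is cut off, here is a hypothetical sketch of the WORLD-related hparams one would expect to differ, with dimensions taken from the issue body (lf0[1], mgc[60], ap[5]). The exact key names are assumptions, not necessarily the repo's actual ones:

```python
num_lf0 = 1        # log-F0 stream (assumed key name)
num_mgc = 60       # mel-generalized cepstral coefficients (assumed key name)
num_bap = 5        # coded band aperiodicities for 48 kHz audio (appears in this thread)
frame_period = 5   # WORLD analysis/synthesis hop in milliseconds (assumed key name)
```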
Thanks! I actually wanted to try the WORLD vocoder because GL is really slow. I have managed to perform optimizations in the Tacotron part of the model and it is performing faster than before on CPU, but GL is still slow, and reducing the number of iterations just kills the quality. I will try your implementation and see if I can get a good result.
I have given up on this solution and turned to WaveRNN. Feel free to reopen this issue.
@begeekmyfriend Hi, thank you for your great work. Have you tried Tacotron2 + WORLD vocoder on long sentences? When I test the WORLD vocoder on very long sentences (about 200 characters), I get some uncomfortable sounds (something like breaking sounds) in the audio. Have you experienced this problem, and what do you think the solution might be?
You might split that long sentence into two sentences at a pause point.
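One simple way to implement this suggestion, as a sketch (the punctuation set and length limit are assumptions; it covers both ASCII and Chinese full-width pause marks):

```python
import re

def split_at_pauses(text, max_chars=100):
    # split after pause punctuation, then regroup into chunks under max_chars
    parts = re.split(r'(?<=[,.;:!?，。；：！？])\s*', text)
    chunks, current = [], ''
    for part in parts:
        if current and len(current) + len(part) > max_chars:
            chunks.append(current)
            current = part
        else:
            current += part
    if current:
        chunks.append(current)
    return chunks
```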
Hello, the Python-Wrapper-for-World-Vocoder link is no longer valid. Could you post it again? Has a pretrained model been provided?
@JJ-Guo1996 I am not engaged in TTS any more, and I do not remember how I deleted the repo you mentioned. There may be substitutes for the WORLD vocoder listed in the README file; you can refer to it.
Hey, I am glad to inform you that I have succeeded in merging the Tacotron model with the WORLD vocoder and have generated some evaluation results, attached below. The results sound not bad, though still not perfect. However, it shows another way to train different feature parameters with Tacotron. The WORLD vocoder is an open source project, so everyone can use it freely. Moreover, the quality of the resynthesis results from that vocoder is better than that from Griffin-Lim, since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore, the depth of the features is low enough that we do not need the postnet in the Tacotron model. Training time can be reduced to 0.7 seconds per step, and inference can also be quick enough even when it runs only on CPU. So it is really worth trying.
I would like to share my experimental source code with you as follows. Note that it currently only supports Chinese Mandarin; you may modify it for other languages:
tacotron-world-vocoder branch
Python-Wrapper-for-World-Vocoder
pysptk merlin-world-vocoder branch
By the way, you need to run
python setup.py install
and then copy the .so file manually into the system path for both pysptk and the Python wrapper project. Besides, I would also like to provide two Python scripts for a WORLD vocoder resynthesis test.
world_vocoder_resynth_scripts.zip
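For readers who cannot download the attachment, here is a minimal analysis/resynthesis round trip in the same spirit (a sketch assuming a recent pyworld; it is not the attached script). Reducing frame_period, as mentioned in the thread, trades speed for quality.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read('input.wav')                  # mono audio
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs, frame_period=5.0)   # or pw.dio + pw.stonemask
sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity

y = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write('resynth.wav', y, fs)
```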
@Rayhane-mamah Let us rock with it! And @r9y9, thanks for your pysptk project.
world_vocoder_demo.zip