Taking T2 output to r9y9 wavenet vocoder #6
Hello @danshirron, thanks for your contribution. This is indeed an extremely interesting remark. I'm at school right now; I'll look into it tonight and provide a clear answer. (The audio hparams I'm currently using are those used in keithito's original Tacotron implementation; I hadn't really considered adapting them until now.)
Hello again @danshirron. After doing some research this is the best I can help with:
Others: about fmin and fmax, I am not really sure why they limited the frequency range in the WaveNet implementation, but considering how promising their results are, I also added this limitation in my preprocessing so it's in concordance with their work (and with the T2 paper as well, even if I still don't understand the actual reason behind it). Thanks again for your contribution @danshirron, and if you happen to get results using our work, please feel free to share them with us (especially if they need improvement x) ).
@Rayhane-mamah, taking a Tacotron-2-generated mel spectrogram (generated using ...): is this correct?
Just to follow up on my previous post: I did generate a mel using an unknown text sentence. Using my method above (...reshape), I converted and inverted it to audio. The result was not good (OK, no problem). Then I took the mel and fed it to r9y9's wavenet_vocoder. The resulting audio quality was better, though the audio itself is still incomprehensible. I haven't changed any fft_size or similar to adapt Tacotron-2 to r9y9's wavenet_vocoder; they were both trained on the same dataset. So, using this method works, though I may need to continue training T2.
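The reshape step described above can be sketched as follows. This is a minimal illustration, assuming the mel was saved with a leading batch dimension of 1 and 80 mel channels (the file name and shapes are assumptions, not taken from either repository):

```python
import numpy as np

# Stand-in for np.load("mel-prediction.npy"); the saved Tacotron-2 mel is
# assumed to have shape (1, T, 80): batch axis, T frames, 80 mel channels.
mel = np.zeros((1, 875, 80), dtype=np.float32)

# Drop the batch dimension so an external vocoder sees a plain (T, 80) array.
mel = mel.reshape(-1, mel.shape[-1])  # equivalently: np.squeeze(mel, axis=0)

print(mel.shape)  # (875, 80)
```

The same `(T, n_mels)` layout is what a conditioning-feature loader would typically expect, which is why dropping the batch axis before saving matters.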
@imdatsolak, thanks for reporting the issue; your method works perfectly (I forgot to remove the batch size (1) before saving the mels). I will add this to the repository shortly. Naturally the WaveNet outputs a better audio quality than a simple Griffin-Lim; that's why we will use a WaveNet as vocoder later. However, to pinpoint the cause of such results I would like some info about your training configuration, especially after noticing that the decoder hit the total max of 175 steps (with a reduction factor of 5 that gives 175 x 5 = 875 mel frames), which means the model has not yet learned to dynamically stop generation (you probably noticed a noisy silence at the end of the generated audio). So could you please answer the following questions? (It would help me find the optimal configuration as well):
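The frame-count arithmetic above can be checked directly. A small sketch (variable names are illustrative, not necessarily the repository's actual hparam names):

```python
# With a reduction factor r, the decoder emits r mel frames per step,
# so hitting the step ceiling produces max_iters * r frames total.
max_iters = 175          # decoder step ceiling reported above
outputs_per_step = 5     # reduction factor r

total_mel_frames = max_iters * outputs_per_step
print(total_mel_frames)  # 875
```

Hitting exactly this ceiling on every sentence is the giveaway that the stop-token prediction has not been learned yet; a trained model stops earlier on short inputs.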
Hi @Rayhane-mamah, here are the answers:
Here are the alignments (at 61,000 steps) and the mel spectrogram. The mel file: ljspeech-mel-prediction-step-61000.npy.zip. Also note: apart from the parameters above, I haven't changed anything in hparams.py. The training data is a German dataset.
I am no German expert, but after listening to the inverted audio for this mel spectrogram, I don't think the output is understandable, right? Without paying much attention to the voice quality, can you understand what "she's" saying? (Could you also provide what she's supposed to say, please?) The attention seems very ugly at 61k steps, however, so I will have to take another look at that. On the other hand, the mels are quite average; they should get better with further training (that, or I will have to increase the model complexity).
The audio is in fact just noise; there are no recognizable words. Even though the voice itself is correct (from the speaker), it is just a repetition of the initial frame (I think). The thing is that the text I used is not known to the model, though the words can be composed of frames from the training set. Unfortunately, I'll be busy tomorrow, but Friday morning (my time) I can do a test and maybe send you something. Also, I'll pull the changes and start training again early tomorrow morning. Let's see if things get better. BTW: we're preparing some large amounts of training data, to be available next week. There is also some English data that is not LJSpeech, so we can start testing with that as well. Thanks for the great work!
Yes, the model should be able to read new unseen words since it works on a character level (as long as the pronunciation rules apply to them the same way as in the training data). And yes, the 741-hour dataset of March 20th; I am really excited about that one. Thank you so much for all the work you're doing to gather, clean and align that much data.
I have pulled the latest changes and am re-training. I'll let you know what happens.
Okay @imdatsolak, thanks for your support. If you reach 18~20k steps and the alignment is still not learned, please notify me. Also, could you report whether training got noticeably slower, or does it seem normal?
Hi @Rayhane-mamah, thanks for your previous comments. As you suggested, I have pulled the latest commits and re-trained (all hparams and the DB are the same), but I think the alignments are still not learned (not monotonic). Training speed seems normal (1.082 sec/step, one Tesla P40 GPU). Here are the alignments at 60,000 steps; I attached the mel spectrogram plot. The test sentences were:
1: "In Dallas, one of the nine agents was assigned to assist in security measures at Love Field, and four had protective assignments at the Trade Mart."
2: "The remaining four had key responsibilities as members of the complement of the follow-up car in the motorcade."
3: "Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car."
So, since I haven't changed anything in the TacoTestHelper (simply because I found nothing suspicious yet), and the natural synthesis output changed a bit, I'm going to suppose it's due to the actual model architecture (and by that I'm pointing at the attention mechanism in particular). I am, however, at the point where I have absolutely no idea where the alignment problem is coming from, sigh. So, to try and locate it, I'm going to change the model to use a simple BahdanauAttention/LuongAttention (tonight, probably). Here are the possible outcomes and their interpretation:
Finally, you mentioned that the output audio using GTA is audible; could you spare a sample, please? Just to have an idea of the voice quality.
@Rayhane-mamah, thanks for the comments and the great work. Here is a sample of GTA synthesis.
@ohleo, thank you very much for the samples. I have refactored the code to work with all attentions implemented in TensorFlow, as well as our SensitiveLocationAttention, by simply changing the name of the attention mechanism in the tacotron.py file. (Please note that you need TensorFlow v1.6 or it will raise bugs; if you have a lower version and don't want to upgrade, please notify me and I'll make sure to add a version that supports TensorFlow 1.4.) As expected, I have temporarily set the model to use BahdanauAttention, and switched to using the previous decoder step's output concatenated with the previous context vector as the query vector (this seemed the most natural choice after looking at TensorFlow's attention wrapper). Hopefully we can locate where the attention problems are coming from, or at least get a step closer to improving the model quality. If you happen to retrain the model, please feel free to share the results with us. Thanks a lot for your contributions; your help is adding huge value to this work.
@Rayhane-mamah: just started new training with the latest commit. I will let you know how it proceeds. The training is not noticeably slower.
I believe all porting to r9y9's WaveNet has been taken care of by the man himself in his repository. We also have our own WaveNet in case you prefer to keep things under TensorFlow.
Has anyone had experience with the above?
I guess the audio hparams need to be the same for both. My intuition for using LJSpeech:
Others: in T2 I don't have fmin (125 in WaveNet) or fmax (7600 in WaveNet). Looking into the T2 code, the spectrogram fmin is set to 0 and fmax to fsample/2 = 22050/2 = 11025 Hz. Since I'm using a pre-trained WaveNet model, I guess I'll need to change the params in the T2 code.
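To get a feel for how much of the mel scale the fmin/fmax band discards, one can convert the band edges with the standard HTK-style mel formula. A rough illustration (this is the textbook formula, not code from either repository):

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

sr = 22050
full_band = (0.0, sr / 2)        # T2 default: 0 .. 11025 Hz (Nyquist)
wavenet_band = (125.0, 7600.0)   # band used in the WaveNet preprocessing

for lo, hi in (full_band, wavenet_band):
    print(f"{lo:7.1f}-{hi:7.1f} Hz -> {hz_to_mel(lo):7.1f}-{hz_to_mel(hi):7.1f} mel")
```

Because the mel filterbank spreads its bands over this range, a model trained on 125-7600 Hz filters will see differently positioned mel channels than one trained on 0-11025 Hz, which is why mixing the two settings between T2 and a pre-trained vocoder gives mismatched features.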
Any remarks, suggestions?