
Taking T2 output to r9y9 wavenet vocoder #6

Closed

danshirron opened this issue Mar 6, 2018 · 17 comments

@danshirron

Has anyone had experience with the above?
I guess the audio hparams need to be the same for both. My intuition for LJSpeech:

  • num_mels=80
  • num_freq=1025; in the wavenet code fft_size=1024, while in T2 fft_size=(1025-1)*2=2048. As far as I understand I can keep this as is, since it ends up being reduced to the mel bands anyway
  • sample_rate=20050 (as in the LJSpeech dataset)
  • frame_length_ms=46.44 (corresponds to wavenet's fft_size/22050)
  • frame_shift_ms=11.61 (corresponds to wavenet's hop_size=256; 256/22050 = 11.61 ms)
  • preemphasis: not available in the r9y9 wavenet implementation

Other points: in T2 I don't have fmin (125 in wavenet) or fmax (7600 in wavenet). Looking into the T2 code, the spectrogram fmin is set to 0 and fmax to fsample/2 = 22050/2 = 11025 Hz. Since I'm using a pre-trained wavenet model, I guess I'll need to change these params in the T2 code. The arithmetic behind the frame settings is sketched below.
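
As a sanity check, here is the arithmetic behind the frame settings spelled out (a minimal sketch; the r9y9 defaults are my assumption):

```python
# Arithmetic behind the proposed frame settings (r9y9 defaults assumed).
fft_size = 1024          # r9y9 wavenet fft_size
hop_size = 256           # r9y9 wavenet hop_size
sample_rate = 22050      # LJSpeech sample rate

frame_length_ms = 1000.0 * fft_size / sample_rate   # ~46.44 ms
frame_shift_ms = 1000.0 * hop_size / sample_rate    # ~11.61 ms
```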

Any remarks, suggestions?

@Rayhane-mamah
Owner

Hello @danshirron, thanks for your contribution.

This is indeed an extremely interesting remark. I'm at school right now; I'll look into it tonight and provide a clear answer. (The audio hparams I'm currently using are those from keithito's original tacotron implementation; I hadn't really considered adapting them until now.)

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 6, 2018

Hello again @danshirron.

After doing some research this is the best I can help with:

  • no problem with num_mels
  • I'm not really sure about num_freq; if you really want to match r9y9's wavenet exactly, simply change it to 513 and fft_size will fall to 1024
  • you're right, I don't know where I came up with 24 kHz; 22050 Hz is the correct sample rate for LJSpeech (you made a typo :p)
  • Now, for the frame_length and frame_shift: in the T2 paper they addressed an issue where a small feature spacing can cause pronunciation issues, so they used a 12.5 ms frame_shift which, if I'm not mistaken, is equivalent to a hop_size of 300 (for a sample_rate of 24000 Hz). Because of that, I'm pretty sure the values I'm using right now will change in the future depending on the model's output, but I'll keep them as is for now. However, since you will use a pre-trained model, I think the natural thing to do is to use the values you proposed.
  • Finally, for the preemphasis: we only use it in an lfilter, which is indeed nonexistent in the r9y9 wavenet. For this, in commit be09c5a I added an "lfilter" hyperparameter to choose whether to apply preemphasis. (In your case, setting it to False would be the natural thing to do; see the sketch right after this list.)
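
For illustration, a minimal sketch of what that optional preemphasis amounts to (the 0.97 coefficient and the exact wiring of the hyperparameter are assumptions here, not the repo's exact code):

```python
from scipy.signal import lfilter

def preemphasize(wav, enabled=True, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]; set enabled=False (the new "lfilter"-style
    # hyperparameter) when targeting r9y9's wavenet, which expects raw audio.
    return lfilter([1, -coef], [1], wav) if enabled else wav
```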

Other points: about fmin and fmax, I'm not really sure why they limited the frequency range in the wavenet implementation, but considering how promising their results are, I also added this limitation to my preprocessing so it stays consistent with their work (and with the T2 paper as well, even if I still don't understand the actual reason behind it).

Thanks again for your contribution @danshirron and if you happen to get results using our work, please feel free to share them with us (especially if they need improvement x) ).

@imdatceleste

@Rayhane-mamah, taking a Tacotron-2 generated mel-spectrogram (generated using python synthesize.py --mode=eval) to the wavenet vocoder requires the mels to be of shape (T, 80). The mels you are saving are of shape (1, 875, 80). I needed to reshape them using mel_spectrogram.reshape(-1, 80) in order to use them with wavenet_vocoder (or, in fact, to generate audio using inv_mel_spectrogram(mel_spectrogram.T)).
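
Concretely, this is the reshape I mean (a minimal sketch; the file name is just an example of a saved eval mel):

```python
import numpy as np

# Load a mel saved by synthesize.py --mode=eval (file name is illustrative)
mel = np.load("ljspeech-mel-prediction-step-61000.npy")  # shape (1, T, 80)

# Drop the leading batch dimension so downstream code sees (T, 80)
mel = mel.reshape(-1, 80)

# wavenet_vocoder expects (T, num_mels); the repo's Griffin-Lim inversion
# (inv_mel_spectrogram) expects (num_mels, T), hence passing mel.T to it.
```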

Is this correct?
Thanks for great work

@imdatceleste

Just to follow up on my previous post: I generated a mel from an unseen text sentence. Using my method above (the reshape), I converted and inverted it to audio. The result was not good (ok, no problem). Then I took the mel and fed it to r9y9's wavenet_vocoder. The resulting audio quality was better, though the audio itself is still incomprehensible. I haven't changed any fft_size or similar to adapt Tacotron-2 to r9y9's wavenet_vocoder; both were trained on the same dataset.

So, using this method works, though I may need to continue training T2.

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 14, 2018

@imdatsolak, thanks for reporting the issue; your method works perfectly (I forgot to remove the batch dimension (1) before saving the mels). I will add this fix to the repository shortly.

Naturally the wavenet outputs better audio quality than simple Griffin-Lim; that's why we will use a Wavenet as the vocoder later on.

However, to pinpoint the cause of such results I would like some info about your training configuration, especially after noticing that the decoder ran for the maximum of 175 steps (with a reduction factor of 5 that gives 175 x 5 = 875 mel frames), which means the model has not yet learned to dynamically stop generation; you probably noticed a noisy silence at the end of the generated audio.
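
(To spell that arithmetic out; the hparam names below are assumptions following the keithito-style config, not necessarily the exact ones in this repo:)

```python
# Why the prediction is exactly 875 frames long (hparam names assumed).
max_iters = 175          # decoder hit its hard step limit
outputs_per_step = 5     # reduction factor r: mel frames emitted per decoder step
mel_frames = max_iters * outputs_per_step  # 175 * 5 = 875

# Landing exactly on max_iters * r means the <stop> token was never predicted,
# hence the noisy "silence" at the end of the generated audio.
```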

So could you please provide the answers for the following questions? (it would help me find the optimal configuration as well):

  • What batch size did you use?
  • How many steps did you train for?
  • Did you update your audio.py file? (After this commit, @neverjoe fixed a _mel_to_linear bug; see the sketch after this list.)
  • Could you provide the alignment plot, along with the predicted mel spectrogram plot of your last checkpoint? (They should be under "logs-Tacotron/plots".)
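
(For context on that bug: _mel_to_linear essentially inverts the mel filterbank. A rough sketch of the idea, with illustrative parameter values, not the repo's exact code:)

```python
import numpy as np
import librosa

def mel_to_linear(mel_spectrogram, sr=22050, n_fft=2048, n_mels=80):
    # Approximate the inverse mel warp via the pseudo-inverse of the mel
    # filterbank; input is (n_mels, T), output is (1 + n_fft // 2, T).
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    inv_basis = np.linalg.pinv(mel_basis)
    return np.maximum(1e-10, np.dot(inv_basis, mel_spectrogram))
```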

@imdatceleste

imdatceleste commented Mar 14, 2018

Hi @Rayhane-mamah, here are the answers:

  • batch_size = 32
  • 61000 steps
  • Yes, I had already pulled the commit

Here are the alignments (at 61,000 steps) and the mel spectrograms:

Alignment plot: step-61000-align

Predicted mel: step-61000-pred-mel-spectrogram

Real mel: step-61000-real-mel-spectrogram

The mel-file:

ljspeech-mel-prediction-step-61000.npy.zip

Also, note:
Audio-Freq: 16,000 Hz
Training Data size: 18.5 hrs
Cleaners: basic_cleaners
The charset includes additional German characters (that I have added to symbols.py)

Apart from the parameters above, I haven't changed anything in hparams.py.

The training data is a German dataset.
Avg loss at step 61,000 was 0.032499.

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 14, 2018

I am no German expert, but after listening to the inverted audio for this mel spectrogram, I don't think the output is understandable, right? Without paying much attention to the voice quality, can you understand what "she's" saying? (Could you also provide what she's supposed to say, please?)

The attention still looks very ugly at 61k steps, however, so I will have to take another look at that. On the other hand, the mels are quite average; they should get better with further training (that, or I will have to increase the model complexity).

EDIT:
I have pushed a commit that hopefully makes things better. In the meantime, I'll be locally testing another approach for attention (since the paper discusses the matter only vaguely, one has to try multiple approaches depending on one's understanding). Sorry you had to train the model countless times; hopefully we find good results shortly. (Some of the changes require running preprocessing once again.)

@imdatceleste

The audio is in fact just noise; there are no recognizable words. Even though the voice itself is correct (it sounds like the speaker), it is just a repetition of the initial frame (I think). The thing is that the text I used is not known to the model, though the words could be composed of frames from the training set.

Unfortunately, I'll be busy tomorrow, but Friday morning (my time) I can run a test and maybe send you something. Also, I'll pull the changes and start training again early tomorrow morning. Let's see if things get better.

BTW: we're preparing a large amount of training data, to be available next week. There is also some English data that is not LJSpeech, so we can start testing with that as well.

Thanks for great work!

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 14, 2018

Yes, the model should be able to read new unseen words since it works at the character level (as long as the pronunciation rules apply to them the same way as in the training data).

And yes, the 741-hour dataset of March 20th. I am really excited about that one.

Thank you so much for all the work you're doing to gather, clean and align that much data.

@imdatceleste

I have pulled the latest changes and am re-training. I'll let you know what happens.
Here is an audio sample from training data:
20081231_neujahrsansprache_f000019.wav.zip

@Rayhane-mamah
Owner

Okay @imdatsolak, thanks for your support. If you reach 18-20k steps and the alignment is still not learned, please notify me.

Also, could you report if the training got noticeably slower? Or does it seem normal?

@ohleo

ohleo commented Mar 16, 2018

Hi @Rayhane-mamah, thanks for the previous comments.

As per your comment, I have pulled the latest commits and re-trained (all hparams and the DB are the same).

But I think the alignments are still not learned (not monotonic).
Likewise, natural synthesis (using TacoTestHelper) cannot produce proper mel-spectrograms, while GTA synthesis generates audible results.

Also, the training speed seems normal (1.082 sec/step on a single Tesla P40 GPU).

Here are the alignments at 60,000 steps (alignment plot attached).

I attached the mel-spectrogram plots for three sentences below.

1 : “In Dallas, one of the nine agents was assigned to assist in security measures at Love Field, and four had protective assignments at the Trade Mart."

Ground Truth, GTA, and Natural mel-spectrogram plots (images attached).

2 : ”The remaining four had key responsibilities as members of the complement of the follow-up car in the motorcade."

Ground Truth, GTA, and Natural mel-spectrogram plots (images attached).

3 : “Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car."

Ground Truth, GTA, and Natural mel-spectrogram plots (images attached).

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 16, 2018

Since I haven't changed anything in the TacoTestHelper (simply because I found nothing suspicious there yet), and the natural synthesis output changed a bit, I'm going to assume it's due to the actual model architecture (and by that I'm pointing at the attention mechanism in particular).

I am, however, at the point where I have absolutely no idea where the alignment problem is coming from, sigh. So to try to locate it, I'm going to change the model to use a simple BahdanauAttention/LuongAttention (tonight, probably). Here are the possible outcomes and their interpretations:

  • The alignment works and natural synthesis is fixed: this proves that natural synthesis was only broken because the attention was not being learned, and that the problem lies somewhere in the LocationSensitiveAttention itself.
  • The alignment works but natural synthesis is still not good enough: we will try to train the model further, or the problem is somewhere else.
  • The alignment is still not learned with one of the basic attentions: I am using the wrong query vector. (I have actually tried using the prenet output (as I understood from the paper) and the decoder RNN hidden states (which corresponds to classic attention work like NMT) and didn't get anything noticeably "good"; any suggestions about the query vector would help us a lot.)

Finally, you mentioned that the output audio using GTA is audible; could you spare a sample please, so I can get an idea of the voice quality?

@ohleo

ohleo commented Mar 16, 2018

@Rayhane-mamah, thanks for the comments and the great work.

Here is a sample of GTA synthesis.
gta_synth.zip

@Rayhane-mamah
Owner

Rayhane-mamah commented Mar 16, 2018

@ohleo, Thank you very much for the samples.

I have refactored the code to work with all the attention mechanisms implemented in tensorflow, as well as our LocationSensitiveAttention, by simply changing the name of the attention mechanism in the tacotron.py file. (Please note that you need tensorflow v1.6 or it will raise errors; if you have a lower version and don't want to upgrade, please notify me and I'll make sure to add a version that supports tensorflow 1.4.)

As expected, I have temporarily set the model to use BahdanauAttention and switched to using the previous decoder step output concatenated with the previous context vector as the query vector (this seemed the most natural choice after looking at tensorflow's attention wrapper). Hopefully we can locate where the attention problems are coming from, or at least get a step closer to improving the model quality. A simplified sketch of the attention selection is below.
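
To give an idea of what the refactor looks like (a simplified sketch with illustrative names, using the tf.contrib.seq2seq classes from TF 1.6-era tensorflow, not the repo's exact code):

```python
import tensorflow as tf  # TF 1.x assumed (tf.contrib.seq2seq)

def create_attention_mechanism(name, num_units, memory):
    # Pick an attention mechanism by name (illustrative helper, not the repo's code).
    if name == "BahdanauAttention":
        return tf.contrib.seq2seq.BahdanauAttention(num_units, memory)
    if name == "LuongAttention":
        return tf.contrib.seq2seq.LuongAttention(num_units, memory)
    # The repo's custom LocationSensitiveAttention would be one more branch here.
    raise ValueError("Unknown attention mechanism: {}".format(name))
```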

If you happen to retrain the model, please feel free to share the results with us. Thanks a lot for your contributions, your help is adding huge value to this work.

@imdatceleste

@Rayhane-mamah: just started a new training run with the latest commit. Will let you know how it proceeds. Training is not noticeably slower.

@Rayhane-mamah
Owner

I believe all the porting to r9y9's wavenet has been taken care of by the man himself in his repository. We also have our own Wavenet in case you prefer to keep everything under tensorflow.
