reduction window is vital for the model to pick up alignment. #280

Open · bfs18 opened this issue Nov 2, 2019 · 89 comments
@bfs18

bfs18 commented Nov 2, 2019

The hparams.py says n_frames_per_step=1, # currently only 1 is supported, but a reduction window is very important for the model to pick up alignment. Using a reduction window can be seen as dropping teacher-forcing frames at equal intervals, which widens the information gap between the teacher-forcing input and the target. Tacotron2 tends to predict the target from the autoregressive input (the teacher-forcing input at training time) without exploiting the conditioning text if that gap is not large enough.
The reduction window can be replaced by a frame-dropout trick if it is not convenient to implement in the current code: just set a certain percentage of the teacher-forcing input frames to the global mean (sketched below).
I implement this in my fork. It picks up alignment at much earlier steps, without warm-starting.
my fork: df_mi branch
NVIDIA-tacotron2: nv branch
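A minimal PyTorch sketch of the frame-dropout trick described above (an illustration, not the fork's exact code; the function name, tensor layout, and global_mean argument are assumptions):

```python
import torch

def drop_frames(decoder_inputs, global_mean, drop_frame_rate=0.2):
    """Replace a random subset of teacher-forcing frames with the global mean frame.

    decoder_inputs: (batch, n_mel_channels, time) teacher-forcing mel frames
    global_mean:    (n_mel_channels,) mean mel frame over the training set
    """
    if drop_frame_rate <= 0.0:
        return decoder_inputs
    batch, _, time = decoder_inputs.size()
    # Bernoulli mask: True marks a frame that will be replaced by the mean.
    drop_mask = torch.rand(batch, time, device=decoder_inputs.device) < drop_frame_rate
    drop_mask = drop_mask.unsqueeze(1)                            # (batch, 1, time)
    mean_frames = global_mean.view(1, -1, 1).to(decoder_inputs)   # broadcast to (1, n_mel, 1)
    return torch.where(drop_mask, mean_frames, decoder_inputs)
```

During training this would be applied to the ground-truth mel frames before they are fed to the prenet.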

@xDuck

xDuck commented Nov 12, 2019

Read up a bit on your implementation and it seems very promising. Going to give it a go with a fork I've been working on that is struggling to learn attention fully. I was looking into applying something similar (but not nearly as elegant) myself.

Can you provide a link to the paper you reference in your fork's README?

@bfs18
Author

bfs18 commented Nov 13, 2019

Hi, the paper is available at https://arxiv.org/abs/1909.01145

@onyedikilo

@bfs18 Hi, tried your fork but somehow I am getting NaN's on gradient.norm and mi loss, any ideas? I trained master successfully with the same data.

Capture

@bfs18
Author

bfs18 commented Dec 5, 2019

@bfs18 Hi, tried your fork but somehow I am getting NaN's on gradient.norm and mi loss, any ideas? I trained master successfully with the same data.

Capture

Hi @onyedikilo

  1. You can only use drop frame first, just set use_mmi=False.
  2. If you would like to use MMI, make sure that the blank index and vocab size are set correctly (torch.CTCLoss doc; see the sketch below). Besides, silent symbols such as 'SPACE' and punctuation should be avoided in the CTC target. Finally, you may try reducing gaf to 0.1.
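For reference, a self-contained torch.nn.CTCLoss setup along the lines of point 2; the vocabulary size, blank index, and shapes below are made-up placeholders:

```python
import torch
import torch.nn as nn

vocab_size, blank_index = 41, 0          # 40 text symbols plus a reserved blank at index 0
ctc_loss = nn.CTCLoss(blank=blank_index, zero_infinity=True)

T, N, S = 200, 8, 30                     # decoder frames, batch size, target length
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1)   # recognizer outputs
targets = torch.randint(1, vocab_size, (N, S))                  # no blank / silence symbols in the targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```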

@onyedikilo

@bfs18 Sorry I couldn't understand what you meant with

You can only use drop frame first

Can you explain it in different words?

@bfs18
Author

bfs18 commented Dec 5, 2019

Hi @onyedikilo
I added several new options in hparams.py in my fork.

  • use_mmi (use mmi training objective or not)
  • use_gaf (use gradient adaptive factor or not, to keep the max norm of gradients from the taco_loss and mi_loss approximately equal)
  • max_gaf (maximum value of gradient adaptive factor)
  • drop_frame_rate (drop teacher-forcing input frames at a certain rate)

When use_mmi=False and drop_frame_rate is set to a value in the range (0., 1.), only the drop-frame trick is used; a configuration sketch follows.
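A sketch of the drop-frame-only configuration described above; the option names come from this comment, while the container and the concrete values are just illustrative:

```python
# hparams sketch: drop-frame trick only, MMI/CTC objective disabled
hparams = dict(
    use_mmi=False,          # skip the MMI/CTC objective entirely
    use_gaf=True,           # gradient adaptive factor (only relevant when use_mmi=True)
    max_gaf=0.5,            # upper bound on the adaptive factor
    drop_frame_rate=0.2,    # fraction of teacher-forcing frames replaced by the global mean
)
```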

@hadaev8

hadaev8 commented Dec 7, 2019

@bfs18
Author

bfs18 commented Dec 9, 2019

@bfs18
Why do you have this line?
https://github.com/bfs18/tacotron2/blob/master/train.py#L253

Hi @hadaev8 , this line has no influence on the numerical values of the gradients. When calculating taco_loss, the variables of the CTC recognizer are not used, so the gradients of taco_loss with respect to these variables are None. This line simply turns those None gradients into zero tensors.
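In other words, a line of that kind does something like the following sketch (assuming `model` is the Tacotron2 instance and backward has already been called on taco_loss):

```python
import torch

# Parameters of the CTC recognizer are not touched by taco_loss, so after
# loss.backward() their .grad is still None; replace those Nones with zero
# tensors so gradient clipping and the optimizer step need no special cases.
for param in model.parameters():
    if param.grad is None:
        param.grad = torch.zeros_like(param)
```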

@onyedikilo

I can confirm that the alignment picks up significantly faster with my data set.

@bfs18
Author

bfs18 commented Dec 16, 2019

I can confirm that the alignment picks up significantly faster with my data set.

Hi @onyedikilo , thanks a lot for your confirmation.

@hadaev8

hadaev8 commented Dec 17, 2019

@bfs18
Any ideas why my alignment looks like this with CTC loss?
https://i.imgur.com/17Wz22v.png

@bfs18
Author

bfs18 commented Dec 18, 2019

@bfs18
Any ideas why my alignment looks like this with CTC loss?
https://i.imgur.com/17Wz22v.png

Hi @hadaev8 , this is caused by the CTC loss being over-weighted. When the CTC loss is over-weighted, the model depends more on the text input to reduce the total loss, which leads to a strictly diagonal alignment when combined with the location-sensitive attention.

Setting hparams.use_gaf=True and a smaller hparams.max_gaf, such as 0.1, should solve the problem; a sketch of what the factor does follows.
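A rough sketch of the gradient adaptive factor, reconstructed from the description in this thread (not the fork's exact implementation): scale the mi_loss gradients so their norm roughly matches the taco_loss gradient norm, capped at max_gaf.

```python
import torch

def gradient_adaptive_factor(taco_grads, mi_grads, max_gaf=0.1, eps=1e-8):
    """taco_grads / mi_grads: gradient tensors of the two losses w.r.t. shared parameters."""
    taco_norm = torch.sqrt(sum(g.pow(2).sum() for g in taco_grads))
    mi_norm = torch.sqrt(sum(g.pow(2).sum() for g in mi_grads))
    # Shrink the mi_loss contribution when its gradients dominate; never exceed max_gaf.
    return torch.clamp(taco_norm / (mi_norm + eps), max=max_gaf)

# combined gradient ~ grad(taco_loss) + gaf * grad(mi_loss)
```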

@hadaev8

hadaev8 commented Dec 18, 2019

Well, I read the paper again.
They say:

In Tacotron2, the attention context is concatenated to the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we use this Mel spectrum as the input to the CTC recognizer, the text information is too easily accessible for the recognizer. This may cause the text information to be encoded in a pathological way in the Mel spectrum and lead to a strict diagonal alignment map (one acoustic frame output for one phoneme input) combined with location-sensitive attention. So before the linear transform operation, we add an extra LSTM layer to mix the text information and acoustic information.

Could you point out where this LSTM layer should go?

@bfs18
Author

bfs18 commented Dec 18, 2019

Well, I read the paper again.
They say:

In Tacotron2, the attention context is concatenated to the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we use this Mel spectrum as the input to the CTC recognizer, the text information is too easily accessible for the recognizer. This may cause the text information to be encoded in a pathological way in the Mel spectrum and lead to a strict diagonal alignment map (one acoustic frame output for one phoneme input) combined with location-sensitive attention. So before the linear transform operation, we add an extra LSTM layer to mix the text information and acoustic information.

Could you point out where this LSTM layer should go?

Hi, the paper uses an internal TensorFlow implementation, which is a bit different from the open-sourced fork. In the open-sourced fork, a feed-forward layer with a ReLU activation is used to mix the information; it is this line: https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L388 (an illustrative sketch of such a layer follows).
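An illustrative version of such a mixing layer; the dimensions are borrowed from the standard Tacotron 2 hparams, and this is a sketch rather than the fork's exact module:

```python
import torch
import torch.nn as nn

class DecoderOutputMixer(nn.Module):
    """Feed-forward layer with ReLU that blends the decoder LSTM output with the
    attention context, so the CTC recognizer does not see a purely linear copy
    of the text information."""

    def __init__(self, decoder_rnn_dim=1024, encoder_embedding_dim=512, out_dim=1024):
        super().__init__()
        self.ff = nn.Linear(decoder_rnn_dim + encoder_embedding_dim, out_dim)

    def forward(self, decoder_hidden, attention_context):
        mixed = torch.cat((decoder_hidden, attention_context), dim=-1)
        return torch.relu(self.ff(mixed))
```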

@xDuck

xDuck commented Dec 18, 2019

Finally got around to trying out your fork on my modified spectrums and I can confirm it picked up attention much faster! Thanks!

@hadaev8

hadaev8 commented Dec 18, 2019

@bfs18
Are you the author of the paper?
Do you know the LSTM mixer dim?

@bfs18
Author

bfs18 commented Dec 18, 2019

Hi @hadaev8

Are you the author of the paper?

yes.

Do you know the LSTM mixer dim?

I just use the same dimension as decoder_rnn_dim, which is 1024.

@bfs18
Author

bfs18 commented Dec 18, 2019

Finally got around to trying out your fork on my modified spectrums and I can confirm it picked up attention much faster! Thanks!

Hi @xDuck , I am glad to hear that.

@hadaev8

hadaev8 commented Dec 18, 2019

@bfs18
I added an LSTM for mixing the decoder outputs like this:
https://pastebin.com/SNxAPcUD
but it looks like it does not mix them enough.
The alignment crashes a bit later, but it still crashes.
Maybe it should be a bidirectional LSTM?
Or am I doing something wrong?

@bfs18
Author

bfs18 commented Dec 19, 2019

Hi @hadaev8 ,
Have you tried a smaller max_gaf?
The diagonal alignment has mixed causes. According to my later experiments, a feed-forward layer with a nonlinear activation function usually mixes the information sufficiently.
The alignment also gets corrupted when gradients from the CTC loss dominate the training; in that case, too much text information is picked up in the decoder_output in order to reduce the total loss.

@hadaev8

hadaev8 commented Dec 19, 2019

@bfs18
Using gaf in a distributed setup worsens training.
So I am trying this approach: https://arxiv.org/pdf/1705.07115.pdf
Adding the LSTM mixer and feeding the mel output to the CTC recognizer makes it more stable, but training still crashes later.

My gradients indeed suffer.
https://i.imgur.com/amcGgHJ.png
Orange is your original implementation, the others are my experiments.

@bfs18
Author

bfs18 commented Dec 19, 2019

@bfs18
Using gaf in a distributed setup worsens training.
So I am trying this approach: https://arxiv.org/pdf/1705.07115.pdf
Adding the LSTM mixer and feeding the mel output to the CTC recognizer makes it more stable, but training still crashes later.

My gradients indeed suffer.
https://i.imgur.com/amcGgHJ.png
Orange is your original implementation, the others are my experiments.

Hi @hadaev8 ,

That paper is a bit complicated; I haven't gone through it yet.

gaf is just a dynamic weight for the mi_loss.
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/train.py#L259
You can use a small weight, such as 1e-2 or 1e-3, instead of gaf. You can even use an annealing schedule, just like the KL-annealing trick; a sketch follows. In my experiments, the gaf becomes very small after 10k steps.
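For example, a fixed or annealed weight could replace gaf along these lines (the schedule itself is an assumption, in the spirit of KL-annealing):

```python
def mi_weight(step, start=1e-2, end=1e-3, anneal_steps=10000):
    """Linearly anneal the mi_loss weight from `start` down to `end`."""
    t = min(step / anneal_steps, 1.0)
    return start + t * (end - start)

# total_loss = taco_loss + mi_weight(iteration) * mi_loss
```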

@xDuck

xDuck commented Dec 19, 2019

Hey @bfs18 Just wanted to let you know your fork is working great with my GST adaptation as well, based on Google's GST paper. Alignment is learned super quickly and my models produce recognizable speech in about 3 hours on a 2070 graphics card - way faster than before.

@rafaelvalle
Contributor

@bfs18 what are the most important changes for the results @xDuck mentioned?

@xDuck

xDuck commented Dec 19, 2019

@rafaelvalle I should mention I am using bark-scale spectrograms with 18 channels and 2 pitch features, along with an LPCNet-forked vocoder (targeting faster-than-realtime CPU inference; currently 1/3 realtime speed on a 2017 MacBook Pro for synthesis). I have noticed that in general this speeds up training a lot too (fewer features to predict). Samples of her attached after not much training, with different GST reference clips. Single-speaker LJSpeech used - these are from my very first test.

gst_results.zip

Alignment after 3k steps w/ batch size 20
image

@bfs18
Author

bfs18 commented Dec 20, 2019

Hey @bfs18 Just wanted to let you know your fork is working great with my GST adaptation as well, based on Google's GST paper. Alignment is learned super quickly and my models produce recognizable speech in about 3 hours on a 2070 graphics card - way faster than before.

Hi @xDuck , thanks for your information.

@bfs18
Author

bfs18 commented Dec 20, 2019

@bfs18 what are the most important changes for the results @xDuck mentioned?

Hi @rafaelvalle , setting a certain percentage of the teacher-forcing input frames to the global mean is a stable trick that speeds up alignment learning a lot. The extra CTC loss also speeds up alignment learning and reduces bad cases; however, it is a bit tricky to tune.

@hadaev8

hadaev8 commented Dec 21, 2019

I can align this bad boy slaps tacotron with only 2k steps.

F3rr7Th

Also wondering why you here have it not aligned on the decoder timestep axis.

@bfs18
Author

bfs18 commented Dec 22, 2019

Hi @hadaev8

I can align this bad boy slaps tacotron with only 2k steps.

How did you solve your problem?

Also wondering why you here have it not aligned on the decoder timestep axis.

I don't quite get what you are trying to say. I guess you mean that the tail of the alignment is different from the figures above. It's a bit weird; I am wondering about that too. However, the padding frames are not important.

@hadaev8

hadaev8 commented Dec 22, 2019

I turn off the CTC loss once it becomes too low.

@CookiePPP

@chazo1994
Updating the dropout like @bfs18 stated earlier may help with Loss.
#280 (comment)


I had not heard of this paper before, though reading it, very interesting indeed!
I solved most of my alignment issues by using more data and a multispeaker model; however, I would definitely be interested in recreating the paper and using Guided Attention in later models.

In regards to your problem, I don't fully understand it (I still consider myself new to Deep Learning, so not too useful), but I will try to help with the parts I do understand.

I've also heard of guided attention from ESPnet a few times, though looking further, I believe they just use Diagonal Guided Attention.

Maybe explore FastSpeech/ForwardTacotron
https://github.com/as-ideas/ForwardTacotron#-training-your-own-model
for ways to generate alignments and input them to your loss function?

@chazo1994

Hi @CookiePPP ,
You may try a smaller dropout rate in these 2 lines, e.g. setting p=0.2.
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L240
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L246

I will try a smaller dropout, but the over-smoothed mel spectrogram and the horizontal line noise are big problems.

@hadaev8

hadaev8 commented May 3, 2020

@bfs18
Why do you drop out frames with the mean value?

@chazo1994

chazo1994 commented May 4, 2020

I report my results with MMI and DFR:

Drop Frame Rate = 0

  • Alignment:
    alignment_DFR0
  • Mels:
    meldfr0

Drop Frame Rate = 0.1

  • Alignment:
    alignmentdfr1
  • Mels:
    meldfr01
  • Loss:
    lossdfr01

Drop Frame Rate = 0.2
19k Step

  • alignment:
    alignmentdfr02_19k
  • mels:
    meldfr02_19k
  • Loss:
    lossdfr02_34k

34k_Step

  • Alignment:
    alignmentdfr02_34k
  • Mels:
    meldfr02_34k

gaf is NaN after 30k steps (I modified the code to train the model with mixed precision).
As you can see, my models converged early, but the loss explodes after 30k steps.
@bfs18
@rafaelvalle
@CookiePPP

@chazo1994

@bfs18
In your paper, you decay the learning rate by a factor of sqrt(4000/step) from step 4000, but your fork doesn't have any learning-rate decay code.

@bfs18
Author

bfs18 commented May 5, 2020

@bfs18
Why do you drop out frames with the mean value?

Hi @hadaev8 , because this value does not distort the input values to the prenet much, and therefore does not distort the activation values of the following modules.

@bfs18
Author

bfs18 commented May 5, 2020

Hi @chazo1994 , it seems that numerical errors occurred in your run.

In your paper, you decay the learning rate by a factor of sqrt(4000/step) from step 4000, but your fork doesn't have any learning-rate decay code.

I found the gradient adaptive factor works better, so I use that trick instead (the paper's schedule is sketched below for reference).
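For completeness, the schedule described in the paper (constant until step 4000, then decayed by sqrt(4000/step)) could be sketched like this in PyTorch; the optimizer below is only a stand-in:

```python
import math
import torch

params = [torch.nn.Parameter(torch.zeros(1))]     # stand-in for the model parameters
optimizer = torch.optim.Adam(params, lr=1e-3)

warmup_steps = 4000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: 1.0 if step < warmup_steps
    else math.sqrt(warmup_steps / step))

# call scheduler.step() once per training iteration
```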

@hadaev8

hadaev8 commented May 5, 2020

@bfs18
I wonder about trying Gaussian noise instead of a fixed value,
and whether I should set the whole frame to a single value or add noise separately to every frame value.

@rafaelvalle
Contributor

@chazo1994 thank you for sharing this, can you share spectrogram reconstruction training and validation loss?

@chazo1994

@chazo1994 thank you for sharing this, can you share spectrogram reconstruction training and validation loss?

This is my validation loss with MMI and DFR=0.2.
I am not sure how to get the spectrogram reconstruction loss separately.
valossdfr02

@lalindra-desilva

@bfs18 Just trying out your fork for the first time and followed instructions in this thread with ljspeech pretrained model. Running into the following error. Any idea why?

RuntimeError: Error(s) in loading state_dict for Tacotron2: size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 1024]).

Appreciate any feedback.

@terryyizhong

@bfs18 Just trying out your fork for the first time and followed instructions in this thread with ljspeech pretrained model. Running into the following error. Any idea why?

RuntimeError: Error(s) in loading state_dict for Tacotron2: size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 1024]).

Appreciate any feedback.

You cannot use the pretrained Tacotron2 model with this branch; the model structure has been modified. (A generic partial-loading workaround is sketched below.)
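A generic PyTorch workaround (not part of the fork): warm-start from the NVIDIA checkpoint by copying only the parameters whose names and shapes still match, and let the changed layers (e.g. the gate layer) train from scratch. The checkpoint path and `model` variable below are placeholders.

```python
import torch

ckpt = torch.load("pretrained_tacotron2.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)                 # handle both checkpoint layouts
model_dict = model.state_dict()                      # `model` is the fork's Tacotron2 instance
compatible = {k: v for k, v in state.items()
              if k in model_dict and v.shape == model_dict[k].shape}
model_dict.update(compatible)
model.load_state_dict(model_dict)
```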

@titospadini

@bfs18 I am using your fork with my dataset and it has just started to align, but I am facing some problems when I try to use inference.ipynb with the Tacotron model trained with your fork and the WaveGlow model; however, when I use that very same WaveGlow model with a Tacotron model trained with the NVIDIA repository, I have no problem.

The problem is this one:

AttributeError: 'WN' object has no attribute 'cond_layer'

If I am not wrong, the convert_model.py (from WaveGlow) should be used in this case, right? I have used it, but this error persists.

I need to use WaveGlow.
Any ideas to solve this, please?

@CookiePPP

@titocaco
Download WaveGlow repo from
https://github.com/NVIDIA/waveglow
and replace the one in the tacotron2 folder?
(WaveGlow from bfs18's repo might be out of date)

@titospadini

@CookiePPP it works! Thank you! =)

@zhitiankai

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
calculating global mean...
Traceback (most recent call last):
File "train.py", line 341, in
train(args.output_directory, args.log_directory, args.checkpoint_path, args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 193, in train
global_mean = calculate_global_mean(train_loader, hparams.global_mean_npy)
File "train.py", line 159, in calculate_global_mean
for i, batch in enumerate(data_loader):
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in next
return self._process_next_batch(batch)
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in
samples = collate_fn([dataset[i] for i in batch_indices])
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 63, in getitem
return self.get_mel_text_pair(self.audiopaths_and_text[index])
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 34, in get_mel_text_pair
mel = self.get_mel(audiopath)
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 46, in get_mel
melspec = self.stft.mel_spectrogram(audio_norm)
File "/data2/user/ztk/tacotron2_tibetan/layers.py", line 73, in mel_spectrogram
assert(torch.min(y.data) >= -1)
RuntimeError: invalid argument 1: tensor must have one dimension at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:574

When I run your code (https://github.com/bfs18/tacotron2), I hit this error. Can you give me some suggestions?

@ErfolgreichCharismatisch

I should mention I am using bark-scale spectrograms with 18 channels and 2 pitch features for my spectrograms along with a lpcnet-forked vocoder

Can lpcnet help me with this issue: #463?

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

Interesting. How do you use your NVIDIA Tacotron2 model with LPCNet?

Yes, Tacotron 2 + LPCNet should let you perform inference on a CPU, but the best speed I was able to achieve was about 2x real-time on a current-gen Intel CPU with AVX2 support.


@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

ErfolgreichCharismatisch commented Mar 24, 2021

If you had to do it all over again, how would you start?

PS: Can you share a diff between your files and the vanilla files?

You will have to adjust the number of mels (and maybe other params) and feed it bark spectrograms, training from scratch. I made a lot of modifications that I don't really remember, but it is not a simple task.

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

ErfolgreichCharismatisch commented Mar 24, 2021

Yes. Which setup would you recommend for my goal?

EDIT: I just tried SqueezeWave, but NVIDIA is in it yet again, this time via Apex. Therefore I get AssertionError: Torch not compiled with CUDA enabled.

I’ve already mostly abandoned the project after considering my research “completed”. I do not have the diff accessible anymore, sorry. As for doing it over again, now there are better alternatives like SqueezeWave, HiFi-GAN, etc. Keep in mind you will trade quality for speed in the vocoders; it is hard to match the quality of WaveGlow. This project was not designed to run on the CPU (rightfully so, NVIDIA makes GPUs, not CPUs), so it might not be what you are looking for - but it does a damn good job on GPUs.

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

@keonlee9420

Hey guys, I just published Comprehensive Tacotron2, which includes the reduction factor (reduction window) and other techniques to boost model robustness and efficiency. You can also play around with the pre-trained models. Check the following link:
https://github.com/keonlee9420/Comprehensive-Tacotron2

@A-d-DASARE

A-d-DASARE commented Aug 9, 2022

Hi, can someone please explain to me what the x and y axes of the mel spectrogram are, and how they differ from the x and y axes of the alignment graph? Thanks!

@finardi

finardi commented Jan 2, 2023

Hi, the paper is available at https://arxiv.org/abs/1909.01145

Is there an implementation with FP16?
I've been trying to run your fork with Apex and hparams FP16 Run: True, but have not succeeded.
