reduction window is vital for the model to pick up alignment. #280

Open · bfs18 opened this issue Nov 2, 2019 · 89 comments
@bfs18

bfs18 commented Nov 2, 2019

The hparams.py says n_frames_per_step=1, # currently only 1 is supported, but a reduction window is very important for the model to pick up alignment. Using a reduction window can be seen as dropping teacher-forcing frames at equal intervals, which widens the information gap between the teacher-forcing input and the target. Tacotron2 tends to predict the target from the autoregressive input (the teacher-forcing input at training time) without exploiting the conditioning text if that gap is not large enough.
The reduction window can be replaced by a frame-dropout trick if it is not convenient to implement in the current code: just set a certain percentage of the teacher-forcing input frames to the global mean (sketched below).
I implement this in my fork. It picks up alignment at much earlier steps, without warm-starting.
my fork: df_mi branch
NVIDIA-tacotron2: nv branch
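A minimal PyTorch sketch of the frame-dropout trick described above (an illustration, not the fork's exact code; the function name, tensor layout, and global_mean argument are assumptions):

```python
import torch

def drop_frames(decoder_inputs, global_mean, drop_frame_rate=0.2):
    """Replace a random subset of teacher-forcing frames with the global mean frame.

    decoder_inputs: (batch, n_mel_channels, time) teacher-forcing mel frames
    global_mean:    (n_mel_channels,) mean mel frame over the training set
    """
    if drop_frame_rate <= 0.0:
        return decoder_inputs
    batch, _, time = decoder_inputs.size()
    # Bernoulli mask: True marks a frame that will be replaced by the mean.
    drop_mask = torch.rand(batch, time, device=decoder_inputs.device) < drop_frame_rate
    drop_mask = drop_mask.unsqueeze(1)                            # (batch, 1, time)
    mean_frames = global_mean.view(1, -1, 1).to(decoder_inputs)   # broadcast to (1, n_mel, 1)
    return torch.where(drop_mask, mean_frames, decoder_inputs)
```

During training this would be applied to the ground-truth mel frames before they are fed to the prenet.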

@xDuck

xDuck commented Nov 12, 2019

Read up a bit on your implementation and it seems very promising. Going to give it a go with a fork I've been working on that is struggling to learn attention fully. I was looking into applying something similar (but not nearly as elegant) myself.

Can you provide a link to the paper you reference in your fork's README?

@bfs18
Author

bfs18 commented Nov 13, 2019

Hi, the paper is available at https://arxiv.org/abs/1909.01145

@onyedikilo

@bfs18 Hi, tried your fork but somehow I am getting NaN's on gradient.norm and mi loss, any ideas? I trained master successfully with the same data.

Capture

@bfs18
Author

bfs18 commented Dec 5, 2019

@bfs18 Hi, tried your fork but somehow I am getting NaN's on gradient.norm and mi loss, any ideas? I trained master successfully with the same data.

Capture

Hi @onyedikilo

  1. You can only use drop frame first, just set use_mmi=False.
  2. If you would like to use MMI, make sure that the blank index and vocab size are set correctly (torch.CTCLoss doc; see the sketch below). Besides, silent symbols such as 'SPACE' and punctuation should be avoided in the CTC target. Finally, you may try reducing gaf to 0.1.
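For reference, a self-contained torch.nn.CTCLoss setup along the lines of point 2; the vocabulary size, blank index, and shapes below are made-up placeholders:

```python
import torch
import torch.nn as nn

vocab_size, blank_index = 41, 0          # 40 text symbols plus a reserved blank at index 0
ctc_loss = nn.CTCLoss(blank=blank_index, zero_infinity=True)

T, N, S = 200, 8, 30                     # decoder frames, batch size, target length
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1)   # recognizer outputs
targets = torch.randint(1, vocab_size, (N, S))                  # no blank / silence symbols in the targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```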

@onyedikilo

@bfs18 Sorry I couldn't understand what you meant with

You can only use drop frame first

Can you explain it in different words?

@bfs18
Author

bfs18 commented Dec 5, 2019

Hi @onyedikilo
I added several new options in hparams.py in my fork.

  • use_mmi (use mmi training objective or not)
  • use_gaf (use gradient adaptive factor or not, to keep the max norm of gradients from the taco_loss and mi_loss approximately equal)
  • max_gaf (maximum value of gradient adaptive factor)
  • drop_frame_rate (drop teacher-forcing input frames at a certain rate)

When use_mmi=False and drop_frame_rate is set to a value in the range (0., 1.), only the drop-frame trick is used; a configuration sketch follows.
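A sketch of the drop-frame-only configuration described above; the option names come from this comment, while the container and the concrete values are just illustrative:

```python
# hparams sketch: drop-frame trick only, MMI/CTC objective disabled
hparams = dict(
    use_mmi=False,          # skip the MMI/CTC objective entirely
    use_gaf=True,           # gradient adaptive factor (only relevant when use_mmi=True)
    max_gaf=0.5,            # upper bound on the adaptive factor
    drop_frame_rate=0.2,    # fraction of teacher-forcing frames replaced by the global mean
)
```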

@hadaev8

hadaev8 commented Dec 7, 2019

@bfs18
Author

bfs18 commented Dec 9, 2019

@bfs18
Why do you have this line?
https://github.com/bfs18/tacotron2/blob/master/train.py#L253

Hi @hadaev8 , this line has no influence on the numerical values of the gradients. When calculating taco_loss, the variables of the CTC recognizer are not used, so the gradients of taco_loss with respect to these variables are None. This line simply turns those None gradients into zero tensors.
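In other words, a line of that kind does something like the following sketch (assuming `model` is the Tacotron2 instance and backward has already been called on taco_loss):

```python
import torch

# Parameters of the CTC recognizer are not touched by taco_loss, so after
# loss.backward() their .grad is still None; replace those Nones with zero
# tensors so gradient clipping and the optimizer step need no special cases.
for param in model.parameters():
    if param.grad is None:
        param.grad = torch.zeros_like(param)
```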

@onyedikilo

I can confirm that the alignment picks up significantly faster with my data set.

@bfs18
Author

bfs18 commented Dec 16, 2019

I can confirm that the alignment picks up significantly faster with my data set.

Hi @onyedikilo , thanks a lot for your confirmation.

@hadaev8

hadaev8 commented Dec 17, 2019

@bfs18
Any ideas why my alignment looks like this with CTC loss?
https://i.imgur.com/17Wz22v.png

@bfs18
Author

bfs18 commented Dec 18, 2019

@bfs18
Any ideas why my alignment looks like this with CTC loss?
https://i.imgur.com/17Wz22v.png

Hi @hadaev8 , this is caused by the CTC loss being over-weighted. When the CTC loss is over-weighted, the model depends more on the text input to reduce the total loss, which leads to a strictly diagonal alignment when combined with the location-sensitive attention.

Setting hparams.use_gaf=True and a smaller hparams.max_gaf, such as 0.1, should solve the problem; a sketch of what the factor does follows.
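A rough sketch of the gradient adaptive factor, reconstructed from the description in this thread (not the fork's exact implementation): scale the mi_loss gradients so their norm roughly matches the taco_loss gradient norm, capped at max_gaf.

```python
import torch

def gradient_adaptive_factor(taco_grads, mi_grads, max_gaf=0.1, eps=1e-8):
    """taco_grads / mi_grads: gradient tensors of the two losses w.r.t. shared parameters."""
    taco_norm = torch.sqrt(sum(g.pow(2).sum() for g in taco_grads))
    mi_norm = torch.sqrt(sum(g.pow(2).sum() for g in mi_grads))
    # Shrink the mi_loss contribution when its gradients dominate; never exceed max_gaf.
    return torch.clamp(taco_norm / (mi_norm + eps), max=max_gaf)

# combined gradient ~ grad(taco_loss) + gaf * grad(mi_loss)
```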

@hadaev8

hadaev8 commented Dec 18, 2019

Well, I read the paper again.
They say:

In Tacotron2, the attention context is concatenated to the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we use this Mel spectrum as the input to the CTC recognizer, the text information is too easily accessible for the recognizer. This may cause the text information to be encoded in a pathological way in the Mel spectrum and lead to a strict diagonal alignment map (one acoustic frame output for one phoneme input) combined with location-sensitive attention. So before the linear transform operation, we add an extra LSTM layer to mix the text information and acoustic information.

Could you point out where this LSTM layer should go?

@bfs18
Author

bfs18 commented Dec 18, 2019

Well, I read the paper again.
They say:

In Tacotron2, the attention context is concatenated to the LSTM output and projected by a linear transform to predict the Mel spectrum. This means the predicted Mel spectrum contains linear components of the text information. If we use this Mel spectrum as the input to the CTC recognizer, the text information is too easily accessible for the recognizer. This may cause the text information to be encoded in a pathological way in the Mel spectrum and lead to a strict diagonal alignment map (one acoustic frame output for one phoneme input) combined with location-sensitive attention. So before the linear transform operation, we add an extra LSTM layer to mix the text information and acoustic information.

Could you point out where this LSTM layer should go?

Hi, the paper uses an internal TensorFlow implementation, which is a bit different from the open-sourced fork. In the open-sourced fork, a feed-forward layer with a ReLU activation is used to mix the information; it is this line: https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L388 (an illustrative sketch of such a layer follows).
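An illustrative version of such a mixing layer; the dimensions are borrowed from the standard Tacotron 2 hparams, and this is a sketch rather than the fork's exact module:

```python
import torch
import torch.nn as nn

class DecoderOutputMixer(nn.Module):
    """Feed-forward layer with ReLU that blends the decoder LSTM output with the
    attention context, so the CTC recognizer does not see a purely linear copy
    of the text information."""

    def __init__(self, decoder_rnn_dim=1024, encoder_embedding_dim=512, out_dim=1024):
        super().__init__()
        self.ff = nn.Linear(decoder_rnn_dim + encoder_embedding_dim, out_dim)

    def forward(self, decoder_hidden, attention_context):
        mixed = torch.cat((decoder_hidden, attention_context), dim=-1)
        return torch.relu(self.ff(mixed))
```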

@xDuck

xDuck commented Dec 18, 2019

Finally got around to trying out your fork on my modified spectrums and I can confirm it picked up attention much faster! Thanks!

@hadaev8

hadaev8 commented Dec 18, 2019

@bfs18
Are you the author of the paper?
Do you know the LSTM mixer dim?

@bfs18
Author

bfs18 commented Dec 18, 2019

Hi @hadaev8

Are you the author of the paper?

yes.

Do you know the LSTM mixer dim?

I just use the same dimension as decoder_rnn_dim, which is 1024.

@bfs18
Author

bfs18 commented Dec 18, 2019

Finally got around to trying out your fork on my modified spectrums and I can confirm it picked up attention much faster! Thanks!

Hi @xDuck , I am glad to hear that.

@hadaev8

hadaev8 commented Dec 18, 2019

@bfs18
I added an LSTM for mixing the decoder outputs like this:
https://pastebin.com/SNxAPcUD
but it looks like it does not mix them enough.
The alignment crashes a bit later, but it still crashes.
Maybe it should be a bidirectional LSTM?
Or am I doing something wrong?

@bfs18
Author

bfs18 commented Dec 19, 2019

Hi @hadaev8 ,
Have you tried a smaller max_gaf?
The diagonal alignment has mixed causes. According to my later experiments, a feed-forward layer with a nonlinear activation function usually mixes the information sufficiently.
The alignment also gets corrupted when gradients from the CTC loss dominate the training; in that case, too much text information is picked up in the decoder_output in order to reduce the total loss.

@hadaev8

hadaev8 commented Dec 19, 2019

@bfs18
Using gaf in a distributed setup worsens training.
So I am trying this approach: https://arxiv.org/pdf/1705.07115.pdf
Adding the LSTM mixer and feeding the mel output to the CTC recognizer makes it more stable, but training still crashes later.

My gradients indeed suffer.
https://i.imgur.com/amcGgHJ.png
Orange is your original implementation, the others are my experiments.

@bfs18
Author

bfs18 commented Dec 19, 2019

@bfs18
Using gaf in a distributed setup worsens training.
So I am trying this approach: https://arxiv.org/pdf/1705.07115.pdf
Adding the LSTM mixer and feeding the mel output to the CTC recognizer makes it more stable, but training still crashes later.

My gradients indeed suffer.
https://i.imgur.com/amcGgHJ.png
Orange is your original implementation, the others are my experiments.

Hi @hadaev8 ,

That paper is a bit complicated; I haven't gone through it yet.

gaf is just a dynamic weight for the mi_loss.
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/train.py#L259
You can use a small weight, such as 1e-2 or 1e-3, instead of gaf. You can even use an annealing schedule, just like the KL-annealing trick; a sketch follows. In my experiments, the gaf becomes very small after 10k steps.
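For example, a fixed or annealed weight could replace gaf along these lines (the schedule itself is an assumption, in the spirit of KL-annealing):

```python
def mi_weight(step, start=1e-2, end=1e-3, anneal_steps=10000):
    """Linearly anneal the mi_loss weight from `start` down to `end`."""
    t = min(step / anneal_steps, 1.0)
    return start + t * (end - start)

# total_loss = taco_loss + mi_weight(iteration) * mi_loss
```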

@xDuck

xDuck commented Dec 19, 2019

Hey @bfs18 Just wanted to let you know your fork is working great with my GST adaptation as well, based on Google's GST paper. Alignment is learned super quickly and my models produce recognizable speech in about 3 hours on a 2070 graphics card - way faster than before.

@rafaelvalle
Contributor

@bfs18 what are the most important changes for the results @xDuck mentioned?

@xDuck

xDuck commented Dec 19, 2019

@rafaelvalle I should mention I am using bark-scale spectrograms with 18 channels and 2 pitch features, along with an LPCNet-forked vocoder (targeting faster-than-realtime CPU inference; currently 1/3 realtime speed on a 2017 MacBook Pro for synthesis). I have noticed that in general this speeds up training a lot too (fewer features to predict). Samples of her attached after not much training, with different GST reference clips. Single-speaker LJSpeech used - these are from my very first test.

gst_results.zip

Alignment after 3k steps w/ batch size 20
image

@bfs18
Author

bfs18 commented Dec 20, 2019

Hey @bfs18 Just wanted to let you know your fork is working great with my GST adaptation as well, based on Google's GST paper. Alignment is learned super quickly and my models produce recognizable speech in about 3 hours on a 2070 graphics card - way faster than before.

Hi @xDuck , thanks for your information.

@bfs18
Author

bfs18 commented Dec 20, 2019

@bfs18 what are the most important changes for the results @xDuck mentioned?

Hi @rafaelvalle , setting a certain percentage of the teacher-forcing input frames to the global mean is a stable trick that speeds up alignment learning a lot. The extra CTC loss also speeds up alignment learning and reduces bad cases; however, it is a bit tricky to tune.

@hadaev8

hadaev8 commented Dec 21, 2019

I can align this bad boy slaps tacotron with only 2k steps.

F3rr7Th

Also wondering why you here have it not aligned on the decoder timestep axis.

@bfs18
Author

bfs18 commented Dec 22, 2019

Hi @hadaev8

I can align this bad boy slaps tacotron with only 2k steps.

How did you solve your problem?

Also wondering why you here have it not aligned on the decoder timestep axis.

I don't quite get what you are trying to say. I guess you mean that the tail of the alignment is different from the figures above. It's a bit weird; I am wondering about that too. However, the padding frames are not important.

@hadaev8

hadaev8 commented Dec 22, 2019

I turn off the CTC loss once it becomes too low.

@CookiePPP

@chazo1994
Updating the dropout like @bfs18 stated earlier may help with Loss.
#280 (comment)


I had not heard of this paper before, though reading it, very interesting indeed!
I solved most of my alignment issues by using more data and a multispeaker model; however, I would definitely be interested in recreating the paper and using Guided Attention in later models.

In regards to your problem, I don't fully understand it (I still consider myself new to Deep Learning, so not too useful), but I will try to help with the parts I do understand.

I've also heard of guided attention from ESPnet a few times, though looking further, I believe they just use Diagonal Guided Attention.

Maybe explore FastSpeech/ForwardTacotron
https://github.com/as-ideas/ForwardTacotron#-training-your-own-model
for ways to generate alignments and input them to your loss function?

@chazo1994

Hi @CookiePPP ,
You may try a smaller dropout rate in these 2 lines, e.g. setting p=0.2.
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L240
https://github.com/bfs18/tacotron2/blob/8f8605ee0f67f6f571e74725030f16b13e4c7d2d/model.py#L246

I will try a smaller dropout, but the over-smoothed mel spectrogram and the horizontal line noise are big problems.

@hadaev8

hadaev8 commented May 3, 2020

@bfs18
Why do you drop out frames with the mean value?

@chazo1994

chazo1994 commented May 4, 2020

I report my results with MMI and DFR:

Drop Frame Rate = 0

  • Alignment:
    alignment_DFR0
  • Mels:
    meldfr0

Drop Frame Rate = 0.1

  • Alignment:
    alignmentdfr1
  • Mels:
    meldfr01
  • Loss:
    lossdfr01

Drop Frame Rate = 0.2
19k Step

  • alignment:
    alignmentdfr02_19k
  • mels:
    meldfr02_19k
  • Loss:
    lossdfr02_34k

34k_Step

  • Alignment:
    alignmentdfr02_34k
  • Mels:
    meldfr02_34k

gaf is NaN after 30k steps (I modified the code to train the model with mixed precision).
As you can see, my models converged early, but the loss explodes after 30k steps.
@bfs18
@rafaelvalle
@CookiePPP

@chazo1994

@bfs18
In your paper, you decay the learning rate by a factor of sqrt(4000/step) from step 4000, but your fork doesn't have any learning-rate decay code.

@bfs18
Author

bfs18 commented May 5, 2020

@bfs18
Why do you drop out frames with the mean value?

Hi @hadaev8 , because this value does not distort the input values to the prenet much, and therefore does not distort the activation values of the following modules.

@bfs18
Author

bfs18 commented May 5, 2020

Hi @chazo1994 , it seems that numerical errors occurred in your run.

In your paper, you decay the learning rate by a factor of sqrt(4000/step) from step 4000, but your fork doesn't have any learning-rate decay code.

I found the gradient adaptive factor works better, so I use that trick instead (the paper's schedule is sketched below for reference).
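For completeness, the schedule described in the paper (constant until step 4000, then decayed by sqrt(4000/step)) could be sketched like this in PyTorch; the optimizer below is only a stand-in:

```python
import math
import torch

params = [torch.nn.Parameter(torch.zeros(1))]     # stand-in for the model parameters
optimizer = torch.optim.Adam(params, lr=1e-3)

warmup_steps = 4000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: 1.0 if step < warmup_steps
    else math.sqrt(warmup_steps / step))

# call scheduler.step() once per training iteration
```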

@hadaev8

hadaev8 commented May 5, 2020

@bfs18
I wonder about trying Gaussian noise instead of a fixed value,
and whether I should set the whole frame to a single value or add noise separately to every frame value.

@rafaelvalle
Contributor

@chazo1994 thank you for sharing this, can you share spectrogram reconstruction training and validation loss?

@chazo1994

@chazo1994 thank you for sharing this, can you share spectrogram reconstruction training and validation loss?

This is my validation loss with MMI and DFR=0.2.
I am not sure how to get the spectrogram reconstruction loss separately.
valossdfr02

@lalindra-desilva

@bfs18 Just trying out your fork for the first time and followed instructions in this thread with ljspeech pretrained model. Running into the following error. Any idea why?

RuntimeError: Error(s) in loading state_dict for Tacotron2: size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 1024]).

Appreciate any feedback.

@terryyizhong

@bfs18 Just trying out your fork for the first time and followed instructions in this thread with ljspeech pretrained model. Running into the following error. Any idea why?

RuntimeError: Error(s) in loading state_dict for Tacotron2: size mismatch for decoder.gate_layer.linear_layer.weight: copying a param with shape torch.Size([1, 1536]) from checkpoint, the shape in current model is torch.Size([1, 1024]).

Appreciate any feedback.

You cannot use the pretrained Tacotron2 model with this branch; the model structure has been modified. (A generic partial-loading workaround is sketched below.)
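A generic PyTorch workaround (not part of the fork): warm-start from the NVIDIA checkpoint by copying only the parameters whose names and shapes still match, and let the changed layers (e.g. the gate layer) train from scratch. The checkpoint path and `model` variable below are placeholders.

```python
import torch

ckpt = torch.load("pretrained_tacotron2.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)                 # handle both checkpoint layouts
model_dict = model.state_dict()                      # `model` is the fork's Tacotron2 instance
compatible = {k: v for k, v in state.items()
              if k in model_dict and v.shape == model_dict[k].shape}
model_dict.update(compatible)
model.load_state_dict(model_dict)
```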

@titospadini

@bfs18 I am using your fork with my dataset and it has just started to align, but I am facing some problems when I try to use inference.ipynb with the Tacotron model trained with your fork and the WaveGlow model; however, when I use that very same WaveGlow model with a Tacotron model trained with the NVIDIA repository, I have no problem.

The problem is this one:

AttributeError: 'WN' object has no attribute 'cond_layer'

If I am not wrong, the convert_model.py (from WaveGlow) should be used in this case, right? I have used it, but this error persists.

I need to use WaveGlow.
Any ideas to solve this, please?

@CookiePPP

@titocaco
Download WaveGlow repo from
https://github.com/NVIDIA/waveglow
and replace the one in the tacotron2 folder?
(WaveGlow from bfs18's repo might be out of date)

@titospadini

@CookiePPP it works! Thank you! =)

@zhitiankai

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
calculating global mean...
Traceback (most recent call last):
File "train.py", line 341, in
train(args.output_directory, args.log_directory, args.checkpoint_path, args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 193, in train
global_mean = calculate_global_mean(train_loader, hparams.global_mean_npy)
File "train.py", line 159, in calculate_global_mean
for i, batch in enumerate(data_loader):
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in next
return self._process_next_batch(batch)
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/data2/user/ztk/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in
samples = collate_fn([dataset[i] for i in batch_indices])
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 63, in getitem
return self.get_mel_text_pair(self.audiopaths_and_text[index])
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 34, in get_mel_text_pair
mel = self.get_mel(audiopath)
File "/data2/user/ztk/tacotron2_tibetan/data_utils.py", line 46, in get_mel
melspec = self.stft.mel_spectrogram(audio_norm)
File "/data2/user/ztk/tacotron2_tibetan/layers.py", line 73, in mel_spectrogram
assert(torch.min(y.data) >= -1)
RuntimeError: invalid argument 1: tensor must have one dimension at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:574

When I run your code (https://github.com/bfs18/tacotron2), I hit this error. Can you give me some suggestions?

@ErfolgreichCharismatisch

I should mention I am using bark-scale spectrograms with 18 channels and 2 pitch features for my spectrograms along with a lpcnet-forked vocoder

Can lpcnet help me with this issue: #463?

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

Interesting. How do you use your NVIDIA Tacotron2 model with LPCNet?

Yes, Tacotron 2 + LPCNet should let you perform inference on a CPU, but the best speed I was able to achieve was about 2x real-time on a current-gen Intel CPU with AVX2 support.


@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

ErfolgreichCharismatisch commented Mar 24, 2021

If you had to do it all over again, how would you start?

PS: Can you share a diff between your files and the vanilla files?

You will have to adjust the number of mels (and maybe other params) and feed it bark spectrograms, training from scratch. I made a lot of modifications that I don't really remember, but it is not a simple task.

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

ErfolgreichCharismatisch commented Mar 24, 2021

Yes. Which setup would you recommend for my goal?

EDIT: I just tried SqueezeWave, but NVIDIA is in it yet again, this time via Apex. Therefore I get AssertionError: Torch not compiled with CUDA enabled.

I’ve already mostly abandoned the project after considering my research “completed”. I do not have the diff accessible anymore, sorry. As for doing it over again, now there are better alternatives like SqueezeWave, HiFi-GAN, etc. Keep in mind you will trade quality for speed in the vocoders; it is hard to match the quality of WaveGlow. This project was not designed to run on the CPU (rightfully so, NVIDIA makes GPUs, not CPUs), so it might not be what you are looking for - but it does a damn good job on GPUs.

@xDuck

xDuck commented Mar 24, 2021 via email

@ErfolgreichCharismatisch

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

@keonlee9420

Hey guys, I just published Comprehensive Tacotron2, which includes the reduction factor (reduction window) and other techniques to boost model robustness and efficiency. You can also play around with the pre-trained models. Check the following link:
https://github.com/keonlee9420/Comprehensive-Tacotron2

@A-d-DASARE

A-d-DASARE commented Aug 9, 2022

Hi, can someone please explain to me what the x and y axes of the mel spectrogram are, and how they differ from the x and y axes of the alignment graph? Thanks!

@finardi

finardi commented Jan 2, 2023

Hi, the paper is available at https://arxiv.org/abs/1909.01145

Is there an implementation with FP16?
I've been trying to run your fork with Apex and hparams FP16 Run: True, but have not succeeded.
