Tacotron2: Everything becomes nan at 53k steps #125
hi @tekinek, I never get NaN when training with Tacotron-2, but I can give you some suggestions :)):
Also, pulling the newest code and running it with the newest TensorFlow version may help you solve the NaN problem. I guess disabling the guided attention loss is the solution for the NaN problem, but let's try :v. BTW, can you share your alignment figure?
@dathudeptrai thanks for your quick reply. I will try loss_att * 0.0 if my current run gets NaN again. Now it is at 51k. Here are some predicted alignments at 50k steps. Do they look fine? stopnet seems to have a long way to go, right? :)
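For reference, a minimal sketch of the loss_att * 0.0 trick in a custom training step; the loss names and the weighting helper are assumptions, not the repo's exact code:

```python
import tensorflow as tf

def combine_tacotron2_losses(stop_token_loss, mel_loss_before, mel_loss_after,
                             loss_att, use_guided_attention=False):
    """Combine Tacotron-2 losses (names assumed, not the repo's exact ones).

    With use_guided_attention=False the guided attention term is multiplied
    by 0.0, so it is still computed and loggable but contributes no gradient,
    which is the workaround for the NaN problem discussed above.
    """
    att_weight = tf.constant(1.0 if use_guided_attention else 0.0)
    return stop_token_loss + mel_loss_before + mel_loss_after + loss_att * att_weight
```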
@tekinek hmm, it's not as good as LJSpeech and the other datasets I tried before; the alignment is not strong, but I hope it's still enough to get durations for FastSpeech2 training with the window masking trick. There may be something wrong in your preprocessing: did you add a stop symbol at the end of character_ids? Did you lowercase all your text, and did you change the English cleaner to your target language's cleaner?
It seems I haven't done that explicitly. Every sentence in the dataset ends with one of ".?!".
or do you mean I should append
No, because in the transcripts used in the dataset, the lower and upper cases of the same letter represent different characters (my language's alphabet has more than 26 letters).
Yes, I did. FYI: I have formatted my dataset into LJSpeech style, including the folder structure and metadata.csv.
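Since case is phonemic in that language, the default lowercasing cleaner would merge distinct characters; a minimal sketch of a case-preserving cleaner, in the style of the repo's English cleaners (the function name is hypothetical):

```python
import re

def my_language_cleaners(text):
    """Hypothetical cleaner for a language where letter case is phonemic:
    normalize whitespace, but deliberately skip the lowercasing step
    that the English cleaners perform."""
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```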
@dathudeptrai restarting from 50k seems to have solved the "nan" problem
@dathudeptrai where is
@tekinek it is in the generator function in tacotron_dataset.py
Following your suggestion, I went back to the dataset and preprocessing. Yes, there are some issues: long silences between words, and bad min/max frequency settings for the mel spectrogram. I realize that inconsistent and long silences between words are fairly common in my dataset. Almost every utterance also has long leading and trailing silence, but those should have been handled by the silence trimming in preprocessing. My initial mel-spectrogram min/max frequencies were 60-7600, but I found that 0-8000 is much better. Now I can see the alignment becoming stronger, but the model still fails to stop at the right location in most cases. What might be the other reasons? Thanks!
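For illustration, a minimal sketch of trimming leading/trailing silence and building a mel spectrogram with the 0-8000 Hz band mentioned above; the file name, top_db, and the other parameters are assumptions:

```python
import librosa

# Load, trim leading/trailing silence, and compute a mel spectrogram.
audio, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical file
trimmed, _ = librosa.effects.trim(audio, top_db=30)  # threshold assumed
mel = librosa.feature.melspectrogram(
    y=trimmed, sr=sr, n_fft=1024, hop_length=256,
    n_mels=80, fmin=0, fmax=8000,  # the 0-8000 Hz band from the discussion
)
```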
@tekinek it seems OK; the alignment is strong enough to extract durations for FastSpeech. For the stop token, I think the reason is that you don't add the stop_token to the end of the sentence. And you might need to train it to 100k to be able to do inference without teacher forcing :D.
@dathudeptrai thanks for your quick response.
How should I interpret this sentence? Should I manually append the stop_token "_" to each sentence in my dataset before preprocessing? I see this happening as default behavior in tacotron_dataset.py (not at inference time?)
@tekinek at inference time you should add the eos token, as tacotron2_dataset does :d
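A minimal sketch of that inference-time step; text_to_ids and the eos id convention are assumptions, not the repo's exact processor API:

```python
import numpy as np

def prepare_inference_input(text, text_to_ids, eos_id):
    """Convert text to character ids and append the eos token, mirroring
    what the training-time dataset generator does automatically."""
    char_ids = list(text_to_ids(text))
    char_ids.append(eos_id)  # without this, the model tends not to stop
    return np.asarray(char_ids, dtype=np.int32)[np.newaxis, :]  # batch dim
```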
@dathudeptrai I got it, thanks.
@dathudeptrai Sorry, wait a minute. The above figures are taken from the predictions folder generated by
@tekinek the yellow line you see is padding, everything is fine :)))
Hi @dathudeptrai, here are the learning curve and some mels from FastSpeech2. How do these figures look to your eyes? What is wrong with the energy and f0 losses? Thanks!
@tekinek the mels from FastSpeech2 look very good, I think. You need to train MB-MelGAN to get better audio; Griffin-Lim (GL) always sounds noisy.
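For context, a rough sketch of Griffin-Lim synthesis from a saved mel spectrogram; the file name and every parameter here are assumptions, and the result will sound noisy, as noted above:

```python
import librosa
import numpy as np

mel = np.load("mel.npy")  # hypothetical [n_mels, frames] power mel spectrogram
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=32,
    fmin=0, fmax=8000,  # assumed to match the feature-extraction settings
)
```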
Hi @dathudeptrai,
[train]: 0%| | 0/4000000 [00:00<?, ?it/s]
Traceback (most recent call last):
@tekinek are you using the newest code? If not, try the newest code, then I can debug it easily.
@dathudeptrai Yes, it was an older code base, but updating to the newest introduced a new error. It seems your recent update to the
@tekinek replace generator_params with multiband_generator_params
@dathudeptrai I'll send you a TensorBoard image shortly
@tekinek what is your upper bound value? :))) @manmay-nakhashi 5.0 is a magic number haha :)), I guess 4.0 is the best number :v.
@dathudeptrai haha, I'll try it with 4.0 :P
@dathudeptrai I was looking into the discriminator loss, and it doesn't have a real-vs-fake loss in the master branch. Is it needed?
@manmay-nakhashi so for now, everything is still OK? I think we should apply a sigmoid function for the discriminator :))). Can you try applying a sigmoid to the last convolution here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L411-L416), then retrain and report the training progress here?
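A minimal sketch of that experiment, not the repo's actual melgan.py: bounding a discriminator head's output with a sigmoid, with illustrative layer sizes. (@manmay-nakhashi reports below that swish worked better for him; in TensorFlow that would be tf.nn.swish in place of tf.nn.sigmoid.)

```python
import tensorflow as tf

# Illustrative final discriminator layer; filters/kernel size are assumptions.
final_conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, padding="same")

def discriminator_head(features):
    """Map discriminator features to scores bounded in (0, 1)."""
    logits = final_conv(features)
    return tf.nn.sigmoid(logits)
```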
@dathudeptrai the generator trained properly till 200k steps; once I start the discriminator, it becomes unstable after 5k steps
@dathudeptrai it's been 20k steps and the training is mimicking the English graph pattern, so I am hoping it'll converge better after some time. I'll post the TensorBoard after 50k training steps
Oops, here :( My MB-MelGAN training still seems problematic. Mine was a 10.0 clip for the STFT losses and tanh on the synthesis output. Should I try 4.0, and is resuming from 200k fine?
@tekinek what are your discriminator parameters?
@dathudeptrai I have tried the sigmoid function, but once the discriminator starts it begins adding a beep to the waveform. I then replaced it with swish and it started working for me, but there is an edge effect in the audio ("straight spikes"). I think it can be handled with padding or filtering (or maybe it'll go away as the model converges).
@dathudeptrai I haven't touched the defaults.
@tekinek there is no problem with the STFT loss in your TensorBoard. The problem is with the discriminator :D. Check your current code against this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L379-L380).
@dathudeptrai have you encountered edge effects in initial discriminator training?
It is like this:
A quick debug shows that the values of all those downsample_scales are 4.
@tekinek what is the number of parameters in your discriminator? All downsample_scales being 4 is correct.
@dathudeptrai The discriminator's parameter count is 3,981,507
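For anyone comparing, a minimal sketch of counting trainable parameters; `discriminator` is assumed to be an already-built tf.keras.Model:

```python
import numpy as np

def count_trainable_params(model):
    """Sum the sizes of all trainable variables in a tf.keras.Model."""
    return int(sum(np.prod(v.shape) for v in model.trainable_variables))

# print(count_trainable_params(discriminator))  # ~3.98M here vs. >16M expected below
```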
FYI: this PyTorch implementation of MB-MelGAN worked before with the same dataset.
TensorFlowTTS/tensorflow_tts/models/melgan.py, lines 379 to 380 at 7d9e497
That is totally wrong somehow; the correct parameter count is > 16M. That is why your discriminator loss converges at 0.25 :)). Everything else is OK :)).
@dathudeptrai I see. Then what causes such a big difference in the number of params under default settings?
@tekinek sorry, it should be:
Let me check the private framework again :)). |
Something is weird here. I ran into the same problem, and I noticed that y_mb_hat becomes either 1 or -1. Once it's in that state, masking the loss wouldn't help.
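A minimal sketch for diagnosing that saturation; y_mb_hat is assumed to be the multi-band generator's tanh output pulled out as a NumPy array:

```python
import numpy as np

def saturation_fraction(y_mb_hat, eps=1e-3):
    """Fraction of samples pinned near +/-1; with a tanh output layer,
    a value near 1.0 means the pre-activation has blown up and the
    generator output has collapsed, matching the symptom described above."""
    return float(np.mean(np.abs(y_mb_hat) > 1.0 - eps))
```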
Hi there, I am getting the NaN problem when training my Tacotron2; it occurs between 18.1k and 18.2k iterations (figures: 18,100 itr and 18,200 itr). I have tried setting loss_att * 0.0 but it still occurs. I am training on my own dataset, which is much smaller than LJSpeech but is still English. I use the LJSpeech preprocessor. Any idea what is causing this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
I am experiencing this issue when training a non-English dataset. I have about 1 hour of audio with text to test this model, and after 18k steps I see the mel_loss go to NaN and not recover. I tried setting the attention loss as suggested, multiplying it by 0 like below. I realize that the dataset I have might be too small, but I was hoping to acquire more data if I got a small glimmer of success that this is working. I created preprocessors and modifications for this data source based on the existing samples, but I am not certain I am doing it right. My changes are in this fork: https://github.com/bemnet4u/TensorFlowTTS. @dathudeptrai or @GavinStein1, any advice on how to overcome this?
More training logs
Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26h long). However, I still have some issues and questions.
So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again.
By the way, do my figures look fine? It looks like the model is overfitting; should I wait for a "surprise"?
My language is somewhat under-resourced, and there is no phoneme dictionary (at least I couldn't find one) to train a G2P and MFA model. However, unlike English, a character roughly represents a phone, except that some vowels sound longer or shorter depending on the meaning of the host word. So a character-based model seems fine to me. This Tacotron2 has been trained just for duration extraction.
Which step seems best for duration extraction so far?
How can I improve the quality of duration extraction?
extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the tacotron2-extracted durations is forced to match the length of the ground-truth mels by
alignment = alignment[:real_char_length, :real_mel_length]
this is just based on the assumption that predicted mels and their ground-truth counterparts are roughly one-to-one (from index 0). So, when the goal of training a tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?
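For clarity, a minimal sketch of the idea behind duration extraction from an attention alignment, not the repo's exact extract_duration.py:

```python
import numpy as np

def durations_from_alignment(alignment, real_char_length, real_mel_length):
    """alignment: [char_len, mel_len] attention weights from Tacotron-2.

    Assign each mel frame to its most-attended character and count frames
    per character; the durations then sum to real_mel_length by construction."""
    alignment = alignment[:real_char_length, :real_mel_length]
    frame_to_char = np.argmax(alignment, axis=0)
    return np.bincount(frame_to_char, minlength=real_char_length)
```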
Any idea on MFA model training for a language with no phone dictionary available?
Has anyone tried making a fake phone dictionary like the one below to force MFA to align characters instead of phonemes?
....
hello h e l l o
nice n i c e
....
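Such a grapheme lexicon can be generated mechanically; a minimal sketch, where words.txt (one word per line) and lexicon.txt are hypothetical file names:

```python
# Build a grapheme "phone" dictionary so MFA aligns characters, not phonemes.
with open("words.txt") as fin, open("lexicon.txt", "w") as fout:
    for line in fin:
        word = line.strip()
        if word:
            fout.write(f"{word} {' '.join(word)}\n")  # e.g. "hello h e l l o"
```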
Thanks.