
Tacotron2: Everything become nan at 53k steps #125

Closed
tekinek opened this issue Jul 21, 2020 · 57 comments

tekinek commented Jul 21, 2020

Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26h of audio). However, I still have some issues and questions.

  1. When training reaches 53.5k steps, the model seems to lose "everything". The train and eval losses and the model predictions all become nan (but training continues without raising an exception).

tensorboard1

So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again.
By the way, do my figures look fine? It looks like the model is overfitting; should I wait for a "surprise"?

  2. My language is somewhat under-resourced and there is no phoneme dictionary (at least I couldn't find one) to train a G2P and MFA model. However, unlike English, a character roughly represents a phone, except that some vowels sound longer or shorter depending on the meaning of the host word. So a character-based model seems fine to me. This tacotron2 has been trained just for duration extraction.

    Which step seems best for duration extraction so far?

  3. How can I improve the quality of duration extraction? (There is a small duration-extraction sketch at the end of this comment.)
    extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the tacotron2-extracted durations is forced to match the length of the ground-truth mels by alignment = alignment[:real_char_length, :real_mel_length], this rests on the assumption that the predicted mels and their ground-truth counterparts line up roughly one-to-one (from index 0).

    So, when the goal of training a tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?

  4. Any ideas on MFA model training for a language with no phone dictionary available?
    Has anyone tried making a fake phone dictionary like the one below to force MFA to align characters instead of phonemes? (A small dictionary-generator sketch follows this list.)
    ....
    hello h e l l o
    nice n i c e
    ....
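
For question 4, here is roughly what I mean, as a sketch (not tested with MFA; the file paths and the one-word-per-line input format are assumptions):

def write_char_dictionary(words_path, dict_path):
    # Write an MFA-style dictionary where every "phone" is just a character,
    # e.g. "hello h e l l o".
    with open(words_path, encoding="utf-8") as f:
        words = {w.strip() for w in f if w.strip()}
    with open(dict_path, "w", encoding="utf-8") as f:
        for word in sorted(words):
            f.write("%s %s\n" % (word, " ".join(word)))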

Thanks.
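
P.S. For question 3, what I mean by "duration extraction" is the usual argmax binning over the alignment matrix; this is just a sketch with an assumed (characters, mel frames) shape, not the actual code in extract_duration.py:

import numpy as np

def durations_from_alignment(alignment, real_char_length, real_mel_length):
    # Trim padding so the durations sum to the ground-truth mel length.
    alignment = alignment[:real_char_length, :real_mel_length]
    # Assign each decoder frame to the character it attends to most.
    char_per_frame = np.argmax(alignment, axis=0)  # shape: (real_mel_length,)
    durations = np.bincount(char_per_frame, minlength=real_char_length)
    assert durations.sum() == real_mel_length
    return durations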

@tekinek tekinek changed the title Everything is nan from 53k steps Everything become nan start from 53k steps Jul 21, 2020
dathudeptrai (Collaborator) commented Jul 21, 2020

hi @tekinek, i never got nan when training Tacotron-2, but i can give you some suggestions :)):

  1. Can you disable the guided attention loss when resuming training at 50k steps? You can do this by simply multiplying loss_att by 0.0 (see the sketch below).
  2. Around 60k->80k is ok for duration extraction; ur 50k steps is also enough :v.
  3. when extracting durations with tacotron-2, we use teacher forcing, which means the previous mel is the ground truth, so that is ok.
  4. Let me think :))).

Also, pulling the newest code and running it with the newest tensorflow version may help you solve the nan problem. I guess disabling the guided attention loss is the solution for the nan problem, but let's try :v. BTW, can you share ur alignment figures?
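
Something like this (just a sketch with an invented helper name; the real change is one line in the trainer's loss computation):

def combine_tacotron2_losses(stop_token_loss, mel_loss_before, mel_loss_after,
                             loss_att, use_guided_attention=False):
    # Multiplying loss_att by 0.0 keeps the logging keys intact while
    # effectively removing the guided attention term from the gradients.
    if not use_guided_attention:
        loss_att = loss_att * 0.0
    return stop_token_loss + mel_loss_before + mel_loss_after + loss_att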

@dathudeptrai dathudeptrai self-assigned this Jul 21, 2020
@dathudeptrai dathudeptrai added Discussion 😁 Discuss new feature performance 🏍 Slow question ❓ Further information is requested Tacotron Tacotron related question. labels Jul 21, 2020
@dathudeptrai dathudeptrai added this to In progress in Tacotron 2 Jul 21, 2020
tekinek (Author) commented Jul 21, 2020

@dathudeptrai thanks for your quick reply. I will try loss_att * 0.0 if my current run gets nan again. Now it is at 51k.

Here are some predicted alignments at 50k steps. Do they look fine? The stopnet seems to have a long way to go, right? :)

7_alignment
14_alignment
2_alignment
12_alignment
11_alignment
1_alignment

dathudeptrai (Collaborator) commented Jul 21, 2020

@tekinek hmm, it's not as good as ljspeech and the other datasets i tried before; the alignment is not strong, but i hope it's still enough to get durations for fastspeech2 training with the window masking trick. There is something wrong in ur preprocessing: did you add a stop symbol at the end of charactor_ids? did you lowercase all ur text, and did you change the english cleaner to ur target language cleaner?

tekinek (Author) commented Jul 21, 2020

@dathudeptrai

did you add stop symbols in the end of charactor_ids ?

It seems I haven't done that explicitly. Every sentence in the dataset ends with one of ".?!".
I've written a cleaner and a processor based on cleaner.py and ljspeech.py; here is the processor, ugspeech.py:

import re
import os
import numpy as np
import soundfile as sf

from tensorflow_tts.utils import ugspeech_cleaners

valid_symbols = [
]

_pad = "_"
_eos = "~"
_punctuation = "!'(),.:;?«» "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_arpabet = ["@" + s for s in valid_symbols]

symbols = (
    [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet + [_eos]
)

_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")

class UGSpeechProcessor(object):
    def __init__(self, root_path, cleaner_names):
        self.root_path = root_path
        self.cleaner_names = cleaner_names
        items = []
        self.speaker_name = "ugspeech"
        if root_path is not None:
            with open(os.path.join(root_path, "metadata.csv"), encoding="utf-8") as ttf:
                for line in ttf:
                    parts = line.strip().split("|")
                    wav_path = os.path.join(root_path, "wavs", "%s.wav" % parts[0])
                    text = parts[2]
                    if len(self.text_to_sequence(text)) > 200: 
                        continue
                    print(text)
                    items.append([text, wav_path, self.speaker_name])

            self.items = items

    def get_one_sample(self, idx):
        text, wav_file, speaker_name = self.items[idx]
        audio, rate = sf.read(wav_file)
        audio = audio.astype(np.float32)
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)
        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": self.items[idx][1].split("/")[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }

        return sample

    def text_to_sequence(self, text):
        global _symbol_to_id

        sequence = []
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += _symbols_to_sequence(
                    _clean_text(text, [self.cleaner_names])
                )
                break
            sequence += _symbols_to_sequence(
                _clean_text(m.group(1), self.cleaner_names)
            )
            sequence += _arpabet_to_sequence(m.group(2))
            text = m.group(3)
        return sequence

def _clean_text(text, cleaner_names):
    for name in cleaner_names:
        cleaner = getattr(ugspeech_cleaners, name)
        if not cleaner:
            raise Exception("Unknown cleaner: %s" % name)
        text = cleaner(text)
    return text

def _symbols_to_sequence(symbols):
    return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]

def _arpabet_to_sequence(text):
    return _symbols_to_sequence(["@" + s for s in text.split()])

def _should_keep_symbol(s):
    return s in _symbol_to_id and s != "_" and s != "~"

or do you mean I should append the id of _eos to text_ids somewhere before get_one_sample returns?

did you lower all ur text and did you change english cleaner to ur target language cleaner ?

No, because in the transcript used in the dataset, the lower and upper cases of the same letter represent different characters (my language's alphabet has more than 26 letters).

did you change english cleaner to ur target language cleaner;

Yes, I did.

FYI: I have formatted my dataset into LJSpeech style including folder structure and metadata.csv.

@tekinek tekinek changed the title Everything become nan start from 53k steps Everything become nan at 53k steps Jul 21, 2020
@tekinek tekinek changed the title Everything become nan at 53k steps Tacotron2: Everything become nan at 53k steps Jul 21, 2020
tekinek (Author) commented Jul 22, 2020

@dathudeptrai restarting from 50k seems to have solved the "nan" problem.

tensorboard2

tekinek (Author) commented Jul 22, 2020

@dathudeptrai where is _eos actually used in preprocessing with ljspeech.py? Is it supposed to be appended to every sentence in text_to_sequence, whether the normalization is phone- or character-based? That doesn't seem to be the case there.

dathudeptrai (Collaborator) commented:

@tekinek it is in the generator function in tacotron_dataset.py

tekinek (Author) commented Jul 29, 2020

Hi @dathudeptrai

Following your suggestion, I went back to the dataset and preprocessing. Yes, there are some issues: long silences between words, and bad min/max frequency settings for the mel-spec.

I realized that inconsistent, long silences between words are fairly common in my dataset. Sure, almost every utterance has long leading and trailing silence, but those should already have been handled by trim_silence = True. This time, I shortened every silence > 500ms by 50%. (By the way, I wrote a script for that; I will share it soon. A rough sketch of the idea is below.)
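
Roughly the idea of the script (not the actual one; this sketch assumes pydub, and the -40 dBFS silence threshold is illustrative):

from pydub import AudioSegment
from pydub.silence import detect_silence

def shorten_long_silences(in_wav, out_wav, min_silence_ms=500, keep_ratio=0.5,
                          silence_thresh_db=-40):
    audio = AudioSegment.from_wav(in_wav)
    silent_spans = detect_silence(audio, min_silence_len=min_silence_ms,
                                  silence_thresh=silence_thresh_db)
    pieces, cursor = [], 0
    for start, end in silent_spans:
        pieces.append(audio[cursor:start])
        # keep only a fraction of each long silence
        pieces.append(audio[start:start + int((end - start) * keep_ratio)])
        cursor = end
    pieces.append(audio[cursor:])
    shortened = pieces[0]
    for piece in pieces[1:]:
        shortened += piece
    shortened.export(out_wav, format="wav")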

My initial settings for the mel-spec min/max frequencies were 60-7600 Hz. But I found that 0-8000 Hz is much better by doing:
ground truth waveform -> mel -> griffin_lim -> waveform -> hearing
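
The check was along these lines (a sketch assuming librosa; the n_fft/hop_length values are illustrative and should match the preprocessing config):

import librosa
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, fmin=0, fmax=8000
)
# Griffin-Lim inversion of the mel spectrogram, then listen to the result.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, fmin=0, fmax=8000
)
sf.write("sample_griffin_lim.wav", y_hat, sr)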

Now I can see the alignment becoming stronger, but the model still fails to stop at the right location in most cases. What other reasons might there be? Thanks!

(blue one is a fresh run on newly cleaned data)
tensorboard3

tensorboard4

6_alignment
7_alignment-2
8_alignment
9_alignment
10_alignment-2
1
1_alignment-2
3_alignment
4_alignment
5_alignment
14_alignment-2
15_alignment

dathudeptrai (Collaborator) commented:

@tekinek it seems ok; the alignment is strong enough to extract durations for fastspeech. For the stop token, I think the reason is that you don't add the stop_token to the end of the sentence. And you might need to train it to 100k to be able to do inference without teacher forcing :D.

tekinek (Author) commented Jul 29, 2020

@dathudeptrai thanks for your quick response.

"you don't add the stop_token to the end of sentence"

How should I interpret this? Should I manually append the stop_token "_" to each sentence in my dataset before preprocessing? I see this happening as the default behavior in tacotron_dataset.py (but not at inference time?)

dathudeptrai (Collaborator) commented:

@tekinek at inference time you should add the eos token, as tacotron2_dataset does :d
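
Something like this at inference time (a sketch; the module path for ur processor is assumed, and `symbols` is the list from the processor you pasted above, whose last entry "~" is the eos):

import numpy as np
import tensorflow as tf

from tensorflow_tts.processor.ugspeech import symbols  # assumed module path

def encode_for_inference(processor, text):
    ids = processor.text_to_sequence(text)
    ids = np.concatenate([ids, [len(symbols) - 1]], -1)  # append the eos id
    return tf.convert_to_tensor(ids[None, :], dtype=tf.int32)  # add batch dim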

tekinek (Author) commented Jul 29, 2020

@dathudeptrai I got it, thanks.

tekinek (Author) commented Jul 29, 2020

@dathudeptrai Sorry, wait a minute. The above figures are taken from the predictions folder generated by generate_and_save_intermediate_result at a certain training step, so the corresponding sentences should already have _eos appended.

dathudeptrai (Collaborator) commented:

@tekinek the yellow line you see is padding, everything is fine :)))

tekinek (Author) commented Jul 30, 2020

Hi @dathudeptrai,
I extracted durations using the 50k tacotron2 without error and started a fastspeech2 training session. While tac2 took almost 5 days to barely reach 70k steps, fs2 passed 120k within a day and produces better sound (maybe tac2 just isn't ready yet).

Here is learning curve and some mels from fs2:

fs2_eval
fs2_train

fs2_2
fs2_3
fs2_4

How do these figures look to you? What is wrong with the energy and f0 losses?
One observed problem: fs2 fails to synthesize short, single-word sentences; the griffin-lim'ed sound is not understandable at all (tac2 is fine in such cases). Longer sentences are fine, though both tac2 and fs2 are noisier than the Mozilla TTS version of tac2:

fs_short_word

Thanks!

dathudeptrai (Collaborator) commented:

@tekinek the mels from fastspeech2 look very good i think. You need to train mb-melgan to get better audio; GL is always noisy.

tekinek (Author) commented Aug 2, 2020

Hi @dathudeptrai ,
When I try to train a multi-band melgan model, I get an error that says "Paddings must be non-negative: 0 -6400". It happens during evaluation. Is anything wrong with the eval data?

[train]: 0%| | 0/4000000 [00:00<?, ?it/s]
2020-08-02 12:54:31.041261: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 2465 of 9238
2020-08-02 12:54:41.040099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 5080 of 9238
2020-08-02 12:54:51.040689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 7644 of 9238
2020-08-02 12:54:57.273513: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-08-02 12:55:11.379315: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
[eval]: 6it [00:27, 4.55s/it] | 5000/4000000 [14:12<182:44:23, 6.07it/s]
Traceback (most recent call last):
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
yield
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
output_shapes=self._flat_output_shapes)
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -6400
[[{{node cond_4/else/_38/Pad}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]] [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in
main()
File "examples/multiband_melgan/train_multiband_melgan.py", line 484, in main
resume=args.resume,
File "./tensorflow_tts/trainers/base_trainer.py", line 587, in fit
self.run()
File "./tensorflow_tts/trainers/base_trainer.py", line 101, in run
self._train_epoch()
File "./tensorflow_tts/trainers/base_trainer.py", line 127, in _train_epoch
self._check_eval_interval()
File "./tensorflow_tts/trainers/base_trainer.py", line 164, in _check_eval_interval
self._eval_epoch()
File "./tensorflow_tts/trainers/base_trainer.py", line 422, in _eval_epoch
tqdm(self.eval_data_loader, desc="[eval]"), 1
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tqdm/std.py", line 1129, in iter
for obj in iterable:
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 296, in next
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 296, in next
return self.get_next()
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 316, in get_next
self._iterators[i].get_next_as_list_static_shapes(new_name))
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 1112, in get_next_as_list_static_shapes
return self._iterator.get_next()
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 581, in get_next
result.append(self._device_iterators[i].get_next())
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 741, in get_next
return self._next_internal()
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 661, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/contextlib.py", line 130, in exit
self.gen.throw(type, value, traceback)
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1989, in execution_mode
executor_new.wait()
File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -6400
[[{{node cond_4/else/_38/Pad}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[train]: 0%|▏ | 5000/4000000 [14:40<195:28:46, 5.68it/s]

dathudeptrai (Collaborator) commented:

@tekinek are u using the newest code? If not, try pulling the newest code; then i can debug more easily.

tekinek (Author) commented Aug 2, 2020

@dathudeptrai Yes, it was an older code base, but updating to the newest introduced a new error. It seems your recent update to multiband_melgan.v1.yaml is not fully compatible with train_multiband_melgan.py, where the older key name "generator_params" still appears and causes a problem when remove_short_samples is enabled.

Traceback (most recent call last):
  File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in <module>
    main()
  File "examples/multiband_melgan/train_multiband_melgan.py", line 366, in main
    ] + 2 * config["generator_params"].get("aux_context_window", 0)
KeyError: 'generator_params'

dathudeptrai (Collaborator) commented:

@tekinek replace generator_params with multiband_generator_params
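
So the failing lines in train_multiband_melgan.py should end up something like this (a sketch reconstructed from the traceback above, with illustrative config values, not copied from the repo):

config = {
    "remove_short_samples": True,
    "batch_max_steps": 8192,           # illustrative values
    "hop_size": 256,
    "multiband_generator_params": {},  # aux_context_window defaults to 0
}
if config["remove_short_samples"]:
    mel_length_threshold = config["batch_max_steps"] // config[
        "hop_size"
    ] + 2 * config["multiband_generator_params"].get("aux_context_window", 0)
else:
    mel_length_threshold = None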

manmay-nakhashi commented:

@dathudeptrai i'll send you tensorboard image shortly

manmay-nakhashi commented:

@dathudeptrai

before clipping
mb_melgan

after clipping at 5 and applying tanh, that fixes the issue i guess :))
mb_melgan_after_clipping_5_tanh

dathudeptrai (Collaborator) commented:

@tekinek what is ur upper bound value :))). @manmay-nakhashi 5.0 is a magic number haha :)), i guess 4.0 is the best number :v.

manmay-nakhashi commented:

@dathudeptrai haha i'll try it with 4.0 :P

manmay-nakhashi commented Aug 7, 2020

@dathudeptrai
after starting the discriminator it happened again one time, but after that it settled down:
[WARNING] (Step: 205600) train_adversarial_loss = 1.0104.
[WARNING] (Step: 205600) train_subband_spectral_convergence_loss = 0.9997.
[WARNING] (Step: 205600) train_subband_log_magnitude_loss = 1.1088.
[WARNING] (Step: 205600) train_fullband_spectral_convergence_loss = 1.0251.
[WARNING] (Step: 205600) train_fullband_log_magnitude_loss = 1.3121.
[WARNING] (Step: 205600) train_gen_loss = 4.7488.
[WARNING] (Step: 205600) train_real_loss = 0.0664.
[WARNING] (Step: 205600) train_fake_loss = 0.1495.
[WARNING] (Step: 205600) train_dis_loss = 0.2159.
[train]: 5%|███████▍ | 205800/4000000 [47:51<511:38:08, 2.06it/s]
[WARNING] (Step: 205800) train_adversarial_loss = 267.4664.
[WARNING] (Step: 205800) train_subband_spectral_convergence_loss = 1.0560.
[WARNING] (Step: 205800) train_subband_log_magnitude_loss = 1.1541.
[WARNING] (Step: 205800) train_fullband_spectral_convergence_loss = 1.0531.
[WARNING] (Step: 205800) train_fullband_log_magnitude_loss = 1.3672.
[WARNING] (Step: 205800) train_gen_loss = 670.9814.
[WARNING] (Step: 205800) train_real_loss = 16.8144.
[WARNING] (Step: 205800) train_fake_loss = 1557.5889.
[WARNING] (Step: 205800) train_dis_loss = 1574.4030.

i was looking into the discriminator loss, and it doesn't have the real vs fake loss in the master branch. is it needed?

if self.steps >= self.config["discriminator_train_start_steps"]:
    # generator adversarial loss (as in master)
    p_hat = self._discriminator(y_hat)
    p = self._discriminator(tf.expand_dims(audios, 2))
    adv_loss = 0.0
    for i in range(len(p_hat)):
        adv_loss += calculate_3d_loss(
            tf.ones_like(p_hat[i][-1]), p_hat[i][-1], loss_fn=self.mse_loss
        )
    adv_loss /= i + 1
    gen_loss += self.config["lambda_adv"] * adv_loss

    dict_metrics_losses.update({"adversarial_loss": adv_loss},)

    # is the real/fake loss calculation needed in the discriminator ??
    # discriminator
    p = self._discriminator(tf.expand_dims(audios, 2))
    p_hat = self._discriminator(y_hat)
    real_loss = 0.0
    fake_loss = 0.0
    for i in range(len(p)):
        real_loss += self.mse_loss(p[i][-1], tf.ones_like(p[i][-1], tf.float32))
        fake_loss += self.mse_loss(
            p_hat[i][-1], tf.zeros_like(p_hat[i][-1], tf.float32)
        )
    real_loss /= i + 1
    fake_loss /= i + 1
    dis_loss = real_loss + fake_loss
dathudeptrai (Collaborator) commented Aug 7, 2020

@manmay-nakhashi so for now, everything is still ok? I think we should apply a sigmoid function to the discriminator :))). Can you try applying a sigmoid after the last convolution? here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L411-L416). Then retrain and report the training progress here?
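
Something in this spirit (a minimal standalone sketch, not the exact layer stack in melgan.py): the last convolution of the discriminator gets a sigmoid on top so its scores are bounded in (0, 1).

import tensorflow as tf

def build_discriminator_head(channels=64):
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(channels, kernel_size=5, padding="same"),
        tf.keras.layers.LeakyReLU(alpha=0.2),
        tf.keras.layers.Conv1D(1, kernel_size=3, padding="same"),
        tf.keras.layers.Activation("sigmoid"),  # the proposed extra activation
    ])

# Scores now live in (0, 1), so the MSE real/fake targets (1.0 / 0.0) are reachable.
scores = build_discriminator_head()(tf.random.normal([1, 16000, 1]))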

manmay-nakhashi commented Aug 7, 2020

@dathudeptrai the generator trained properly till 200k steps; once i start the discriminator it becomes unstable after 5k steps.
i'll make that change and post the tensorboard here.

dathudeptrai (Collaborator) commented:

@manmay-nakhashi real/fake loss computed here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py#L179-L198)

manmay-nakhashi commented:

@dathudeptrai it's been 20k steps and training is mimicking the english graph pattern, so i am hoping it'll converge better after some time. i'll post the tensorboard after 50k training steps.

tekinek (Author) commented Aug 7, 2020

hi @dathudeptrai

oops :( My mb-melgan training still seems problematic. Mine was a 10.0 clip for the stft losses and tanh on the synthesis output. Should I try 4.0, and is resuming from 200k fine?

melgan_err1_0807

melgan_err2_0807

dathudeptrai (Collaborator) commented:

@tekinek what is ur discriminator parameter?

manmay-nakhashi commented Aug 7, 2020

@dathudeptrai i have tried the sigmoid function, but as the discriminator starts it begins adding a beep to the waveform. then i replaced it with swish and it started working for me, but there is an edge effect in the audio ("straight spikes"); i think it can be handled with padding or filtering (or maybe it'll go away as the model converges).

tekinek (Author) commented Aug 7, 2020

@dathudeptrai I haven't touched the defaults.

dathudeptrai (Collaborator) commented:

@tekinek there is no problem with the stft loss in ur tensorboard. The problem is the discriminator :D. Check ur current code against this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L379-L380).

manmay-nakhashi commented Aug 7, 2020

@dathudeptrai have you encountered edge effects in initial discriminator training ?

tekinek (Author) commented Aug 7, 2020

@dathudeptrai

It is like this:

discriminator += [
    GroupConv1D(
        filters=out_chs,
        kernel_size=downsample_scale * 10 + 1,
        strides=downsample_scale,
        padding="same",
        use_bias=use_bias,
        groups=in_chs // 4,
        kernel_initializer=get_initializer(initializer_seed),
    )
]

A quick debug shows that all the downsample_scale values are 4.

dathudeptrai (Collaborator) commented:

@tekinek what is the number of parameters in ur discriminator? All downsample_scales being 4 is correct.

tekinek (Author) commented Aug 7, 2020

@dathudeptrai Parameter number of discriminator is 3,981,507

tekinek (Author) commented Aug 7, 2020

FYI: this pytorch implementation of mb-melgan worked before with the same dataset.

kernel_size=downsample_scale * 10 + 1,
strides=downsample_scale,

dathudeptrai (Collaborator) commented:

@dathudeptrai Parameter number of discriminator is 3,981,507

that is somehow totally wrong; the correct parameter count is > 16M. That is why ur discriminator loss converges at 0.25 :)). everything is ok :)).

tekinek (Author) commented Aug 7, 2020

@dathudeptrai I see. Then what causes such a big difference in the number of params under default settings?

Model: "multi_band_melgan_discriminator"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
melgan_discriminator_scale_. multiple                  1327169   
_________________________________________________________________
melgan_discriminator_scale_. multiple                  1327169   
_________________________________________________________________
melgan_discriminator_scale_. multiple                  1327169   
_________________________________________________________________
average_pooling1d_2 (Average multiple                  0         
=================================================================
Total params: 3,981,507
Trainable params: 3,981,507
Non-trainable params: 0

dathudeptrai (Collaborator) commented Aug 7, 2020

@tekinek sorry. it should be:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
sequential (Sequential)      (None, None, 4)           2534356   
=================================================================
Total params: 2,534,356
Trainable params: 2,534,356
Non-trainable params: 0
_________________________________________________________________
Model: "multi_band_melgan_discriminator"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
melgan_discriminator_scale_. multiple                  1450305   
_________________________________________________________________
melgan_discriminator_scale_. multiple                  1450305   
_________________________________________________________________
melgan_discriminator_scale_. multiple                  1450305   
_________________________________________________________________
average_pooling1d_2 (Average multiple                  0         
=================================================================
Total params: 4,350,915
Trainable params: 4,350,915
Non-trainable params: 0

Let me check the private framework again :)).

cxcxcxcx commented:

sub_sc_loss = tf.where(sub_sc_loss >= 2.0, 0.0, sub_sc_loss)
sub_mag_loss = tf.where(sub_mag_loss >= 2.0, 0.0, sub_mag_loss)
full_sc_loss = tf.where(full_sc_loss >= 2.0, 0.0, full_sc_loss)
full_mag_loss = tf.where(full_mag_loss >= 2.0, 0.0, full_mag_loss)
gen_loss = 0.5 * (sub_sc_loss + sub_mag_loss) + 0.5 * (
     full_sc_loss + full_mag_loss
)

Something is weird here. I ran into the same problem, and I noticed that y_mb_hat becomes either 1 or -1. Once it's in that state, masking the loss doesn't help.
For the baker dataset, I also noticed it trains ok for some time if I don't interrupt it. But if I stop and then load the saved checkpoint, the loss goes crazy fast (in <10 cycles).

GavinStein1 commented:

Hi there, I am getting the NaN problem when training my Tacotron2. It occurs between 18.1k and 18.2k iterations.


18,100 itr

0_alignment
1_alignment
2_alignment
3_alignment
4_alignment
5_alignment
6_alignment
7_alignment
8_alignment


18,200 itr

0_alignment
1_alignment
2_alignment
3_alignment
4_alignment
5_alignment
6_alignment
7_alignment
8_alignment

I have tried setting loss_att * 0.0 but it still occurs. I am training on my own dataset, which is much smaller than ljspeech but is still English. I use the ljspeech preprocessor. Any idea what is causing this?


stale bot commented Dec 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Dec 11, 2020
@stale stale bot closed this as completed Dec 18, 2020
bemnet4u commented Jan 30, 2022

I am experiencing this issue when training a non-English dataset. I have about 1 hour of audio with text to test this model, and after 18k steps i see the mel_loss go to nan and not recover. I tried disabling the attention loss as suggested, by multiplying it by 0 like below.

I realize that the dataset I have might be too small, but I was hoping to acquire more data if I saw a small sign that this is working. I created preprocessors and modifications for this data source based on the existing samples, but I am not certain I am doing this right. My changes are in this fork: https://github.com/bemnet4u/TensorFlowTTS

@dathudeptrai or @GavinStein1 any advice on how to overcome this?

loss_att = loss_att * 0.0
per_example_losses = (
    stop_token_loss + mel_loss_before + mel_loss_after + loss_att
)

dict_metrics_losses = {
    "stop_token_loss": stop_token_loss,
    "mel_loss_before": mel_loss_before,
    "mel_loss_after": mel_loss_after,
    "guided_attention_loss": loss_att,
}

I see this in the logs.

2022-01-30 05:40:05,977 (base_trainer:988) INFO: (Step: 24400) train_mel_loss_before = nan.
2022-01-30 05:40:05,978 (base_trainer:988) INFO: (Step: 24400) train_mel_loss_after = nan.
2022-01-30 05:40:05,978 (base_trainer:988) INFO: (Step: 24400) train_guided_attention_loss = 0.0000.

image

More training logs

2022-01-29 20:38:28,882 (tacotron_dataset:93) INFO: Using guided attention loss
2022-01-29 20:38:28,887 (train_tacotron2:456) INFO: Updating save_interval_steps from 2000 to 500
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: hop_size = 256
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: format = npy
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: model_type = tacotron2
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: tacotron2_params = {'dataset': 'ljspeech', 'embedding_hidden_size': 512, 'initializer_range': 0.02, 'embedding_dropout_prob': 0.1, 'n_speakers': 1, 'n_conv_encoder': 5, 'encoder_conv_filters': 512, 'encoder_conv_kernel_sizes': 5, 'encoder_conv_activation': 'relu', 'encoder_conv_dropout_rate': 0.5, 'encoder_lstm_units': 256, 'n_prenet_layers': 2, 'prenet_units': 256, 'prenet_activation': 'relu', 'prenet_dropout_rate': 0.5, 'n_lstm_decoder': 1, 'reduction_factor': 1, 'decoder_lstm_units': 1024, 'attention_dim': 128, 'attention_filters': 32, 'attention_kernel': 31, 'n_mels': 80, 'n_conv_postnet': 5, 'postnet_conv_filters': 512, 'postnet_conv_kernel_sizes': 5, 'postnet_dropout_rate': 0.1, 'attention_type': 'lsa'}
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: batch_size = 32
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: remove_short_samples = True
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: allow_cache = True
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: mel_length_threshold = 32
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: is_shuffle = True
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: use_fixed_shapes = True
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 1e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: gradient_accumulation_steps = 1
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: var_train_expr = None
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: train_max_steps = 200000
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: save_interval_steps = 500
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: eval_interval_steps = 500
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: log_interval_steps = 200
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: start_schedule_teacher_forcing = 200001
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: start_ratio_value = 0.5
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: schedule_decay_steps = 50000
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: end_ratio_value = 0.0
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: num_save_intermediate_results = 1
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: train_dir = /tmp/dataset/dump/dump_amharic/train/
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: dev_dir = /tmp/dataset/dump/dump_amharic/valid
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: use_norm = True
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: outdir = /tmp/dataset/dump/examples/tacotron2/exp/train.tacotron2.v1
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: config = /databricks/driver/TensorFlowTTS/examples/tacotron2/conf/tacotron2.v1.yaml
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: resume = /tmp/dataset/dump/examples/tacotron2/exp/train.tacotron2.v1/checkpoints/ckpt-16000
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: verbose = 1
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: mixed_precision = False
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: pretrained = 
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: use_fal = False
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: version = 0.0
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: max_mel_length = 859
2022-01-29 20:38:28,887 (train_tacotron2:460) INFO: max_char_length = 84
2022-01-29 20:38:29,212 (tacotron_dataset:93) INFO: Using guided attention loss
Model: "tacotron2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
encoder (TFTacotronEncoder)  multiple                  8218624   
_________________________________________________________________
decoder_cell (TFTacotronDeco multiple                  18246402  
_________________________________________________________________
post_net (TFTacotronPostnet) multiple                  5460480   
_________________________________________________________________
residual_projection (Dense)  multiple                  41040     
=================================================================
Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240
_________________________________________________________________
