Should there be any noise output? #82

Open
deepglugs opened this issue Oct 21, 2020 · 6 comments

Comments

@deepglugs

deepglugs commented Oct 21, 2020

I'm having trouble getting decent results out of Flowtron and am trying to figure out why. With my fairly small dataset (0.67 hrs) and a warm start from LJS, I get nothing but noise when running inference on my checkpoints. I tried warm-starting from LJS with flow=1 and flow=2, and trained for 240k steps. I've tried adjusting p_arpabet (1.0, 0.5), but no dice. I also tried lowering the learning rate to 1e-5.

It seems I should be getting something other than noise at some point?

pytorch 1.6, python 3.8: noise up to 200k+

pytorch 1.3, python 3.7.4: step 5k: (102400,) noise. step 10k: (9984,) noise, step 20k: (2816,) noise

I know the dataset can't be too bad because deepvoice3 works on it to a reasonable degree...

@deepglugs deepglugs changed the title Mean, LogVar, Prob = None in compute_validation_loss Should there be any noise output? Oct 21, 2020
@deepglugs
Author

A bit of an update: I'm training on the LJ dataset and I don't get noise, so something about my dataset is troublesome for Flowtron. My data has a lot of shorter utterances, maybe 2-5 words. I also noticed that the loss dropped much faster on my data: -1.0 in under 500 steps, whereas LJ isn't even below 0.9 at 100k steps. I also noticed that my WAVs are 32-bit and LJ's are 16-bit. My data was silently converted when I used librosa's WAV writer after trimming silence. Oops! Retraining now. Hoping for the best.
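For anyone wanting to check their own files for the same bit-depth mismatch, here is a minimal sketch that reads the `fmt ` chunk of a WAV header directly. It is not part of Flowtron; the helper names `wav_format` and `float_to_pcm16` are made up for illustration, and it assumes a plain RIFF/WAVE file with a standard 16-byte format header. Format tag 1 means integer PCM and tag 3 means IEEE float.

```python
import struct


def wav_format(path):
    """Return (format_tag, bits_per_sample) from a WAV file's 'fmt ' chunk.

    Hypothetical helper: format_tag 1 is integer PCM, 3 is IEEE float.
    """
    with open(path, "rb") as f:
        riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError(f"no 'fmt ' chunk found in {path}")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                # tag, channels, sample rate, byte rate, block align, bits
                fields = struct.unpack("<HHIIHH", f.read(16))
                return fields[0], fields[5]
            # Chunks are word-aligned: skip the payload plus a pad byte if odd.
            f.seek(chunk_size + (chunk_size & 1), 1)


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to 16-bit signed integers."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
```

Calling `wav_format("clip.wav")` on a 16-bit PCM file should give `(1, 16)`, while a 32-bit float file gives `(3, 32)`. The scaling in `float_to_pcm16` is one common convention for float-to-integer conversion, not necessarily what any particular audio library does.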

@deepglugs
Author

Another update. Looks like the 32-bit WAV data was my issue. Now I get gibberish output, with the model never attending to the text. Attention weights still look poor after 1.6m steps, similar to #41 and others:

image

Training loss

image

I wonder if my dataset is too small? I have < 1 hr of audio data. Would adding another speaker help? Another difference between my dataset and, say, LJS is that mine has many more short utterances (1-3 words).

I've gone through another pass and cleaned my data, checking the transcripts and removing things like laughing. At the same time, I'm training another model on this dataset plus one more as an additional speaker. That makes almost 2 hrs of data.

@deepglugs
Author

Yet another update: still trying to figure out the differences between my dataset and LJS. Two remaining possibilities come to mind: utterance length and total dataset size. I trimmed out of my training dataset any sample that was < 1s or > 10s. The min/max distribution now roughly matches LJS. However, even after 345k steps, no attention was learned.
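The duration trimming described above could be sketched roughly like this. It assumes a filelist with one `audio_path|text|speaker_id` entry per line and 16-bit PCM WAVs that Python's standard `wave` module can open; `wav_duration_seconds` and `filter_filelist` are hypothetical names, not Flowtron functions.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def filter_filelist(lines, duration_of, min_seconds=1.0, max_seconds=10.0):
    """Keep filelist lines whose audio lasts between min and max seconds.

    `lines` are assumed to look like 'audio_path|text|speaker_id';
    `duration_of` maps an audio path to its duration in seconds.
    """
    kept = []
    for line in lines:
        audio_path = line.split("|", 1)[0]
        if min_seconds <= duration_of(audio_path) <= max_seconds:
            kept.append(line)
    return kept
```

In practice you would call `filter_filelist(lines, wav_duration_seconds)` and write the kept lines to a new filelist. Passing the duration lookup as a parameter also makes it easy to reuse cached durations instead of reopening every file.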

I then created an LJS dataset with only 500 samples (~0.9 hrs). Also no attention after 350k steps. I will try again at 1k samples (1.71 hrs) and go up from there to figure out just how much data is required to learn attention.
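A reduced dataset like the ones above can be drawn reproducibly with a seeded random sample. This is only a sketch: the `(filelist_line, duration_seconds)` pairs, the helper names, and the seed value are assumptions for illustration, and the thread does not say how the 500-sample subset was actually chosen.

```python
import random


def sample_subset(entries, n, seed=1234):
    """Return a reproducible random subset of at most n entries."""
    rng = random.Random(seed)
    return rng.sample(entries, min(n, len(entries)))


def total_hours(entries):
    """Sum the durations of (filelist_line, duration_seconds) pairs, in hours."""
    return sum(duration for _line, duration in entries) / 3600.0
```

Reporting `total_hours(sample_subset(entries, 500))` next to the sample count makes runs at 500, 1k, and 2500 samples easier to compare, since the hours depend on which utterances happen to be drawn.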

@deepglugs
Author

With 2500 LJS samples, attention starts to appear at 85k steps. Here's 185k:
image

@rafaelvalle
Contributor

Please make sure you set the attention prior to True here:
https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
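If you prefer to flip the flag from a script rather than editing the file by hand, something like the following would work. It assumes the option is a JSON key named `use_attn_prior`; that name is an assumption here, so check it against the linked line in your copy of `config.json`. The helper names are hypothetical.

```python
import json


def set_nested_key(node, key, value):
    """Set every occurrence of `key` in nested dicts/lists; return the count."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += set_nested_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_nested_key(item, key, value)
    return count


def enable_attn_prior(config_path, key="use_attn_prior"):
    """Rewrite a JSON config in place with `key` set to true.

    The default key name is an assumption; verify it in your config file.
    """
    with open(config_path) as f:
        config = json.load(f)
    if set_nested_key(config, key, True) == 0:
        raise KeyError(f"{key!r} not found in {config_path}")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
```

Searching for the key recursively avoids hard-coding which section of the config it lives in, and raising when it is missing prevents silently training with the prior still disabled.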

@deepglugs
Author

That seems to have done the trick! The directions for training from scratch seem to apply to pre-trained models as well.

I'm seeing a lot of stuttering in the audio output though. What is typically the cause for this? Need more training time? Data issues? (sigma==0.8)

out.mp4
