NaN issue #8

Open
CaiwenXu opened this issue Jun 2, 2023 · 3 comments

Comments

CaiwenXu commented Jun 2, 2023

Hi, many thanks for your excellent work! I have a problem when training the VQ-GAN: the loss suddenly becomes NaN. Do you know why this happens? I used the LIDC dataset.
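In case it helps with debugging, here is a minimal sketch of how one might pinpoint where the NaN first appears. This is generic PyTorch, not code from this repo, and the loss names in the usage comments are placeholders:

```python
import torch

# Generic NaN diagnostics, not specific to this repo: anomaly detection makes
# autograd report the operation that first produced a NaN/Inf in the backward pass.
torch.autograd.set_detect_anomaly(True)  # noticeably slower; enable only while debugging

def check_finite(name, tensor):
    """Print a warning as soon as a loss term stops being finite."""
    if not torch.isfinite(tensor).all():
        print(f"non-finite values in {name} at this step")

# Example usage inside the training step (loss names are placeholders):
# check_finite("recon_loss", recon_loss)
# check_finite("commitment_loss", commitment_loss)
# check_finite("disc_loss", disc_loss)
```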

@benearnthof

I'm currently having the same problem. I used the exact same configs provided here and still no luck; training is very unstable.
The model also suffers from mode collapse after the discriminator starts training.

@benearnthof

I believe this problem may stem from the accumulate_grad_batches parameter. I trained a run for more than 50,000 steps successfully, but trying to replicate training with accumulate_grad_batches > 1 runs into the NaN problem. @CWX-student can you confirm this, or do you have any other info on your end?
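For reference, a minimal sketch of the setting I mean, written against the plain PyTorch Lightning Trainer API rather than this repo's exact training script (the surrounding arguments and names are placeholders):

```python
import pytorch_lightning as pl

# Hedged sketch, not this repo's exact training script: keeping gradient
# accumulation at 1 is the configuration that trained stably for me.
trainer = pl.Trainer(
    accumulate_grad_batches=1,  # values > 1 were where the NaN losses appeared in my runs
    max_steps=50000,            # roughly the length of the run that succeeded
)
# trainer.fit(vqgan_model, train_dataloader)  # model/dataloader names are placeholders
```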

@benearnthof

Update: Setting the precision parameter in the config to at least 32 seems to alleviate this problem. https://discuss.pytorch.org/t/distributed-training-gives-nan-loss-but-single-gpu-training-is-fine/63664/6
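Concretely, the combination that has been stable for me looks like the sketch below. These are standard PyTorch Lightning Trainer arguments; how they are wired through this repo's config may differ, so treat the exact keys as an assumption:

```python
import pytorch_lightning as pl

# Hedged sketch: full fp32 training plus no gradient accumulation is what
# alleviated the NaN losses for me.
trainer = pl.Trainer(
    precision=32,               # at least 32; mixed/16-bit precision was where the NaNs showed up
    accumulate_grad_batches=1,  # see the previous comment
)
# trainer.fit(vqgan_model, train_dataloader)  # placeholder names
```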
