NaN issue #8

Open
CaiwenXu opened this issue Jun 2, 2023 · 3 comments

Comments

CaiwenXu commented Jun 2, 2023

Hi, many thanks for your excellent work! I have a problem when training the VQ-GAN: the loss suddenly becomes NaN. Do you know why this happens? I used the LIDC dataset.
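In case it helps with debugging, here is a minimal sketch of how one might pinpoint where the NaN first appears. This is generic PyTorch, not code from this repo, and the loss names in the usage comments are placeholders:

```python
import torch

# Generic NaN diagnostics, not specific to this repo: anomaly detection makes
# autograd report the operation that first produced a NaN/Inf in the backward pass.
torch.autograd.set_detect_anomaly(True)  # noticeably slower; enable only while debugging

def check_finite(name, tensor):
    """Print a warning as soon as a loss term stops being finite."""
    if not torch.isfinite(tensor).all():
        print(f"non-finite values in {name} at this step")

# Example usage inside the training step (loss names are placeholders):
# check_finite("recon_loss", recon_loss)
# check_finite("commitment_loss", commitment_loss)
# check_finite("disc_loss", disc_loss)
```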

@benearnthof

I'm currently having the same problem. I used the exact same configs provided here and still no luck; training is very unstable.
The model also suffers from mode collapse after the discriminator starts training.

@benearnthof

I believe this problem may stem from the accumulate_grad_batches parameter. I trained a run for more than 50,000 steps successfully, but trying to replicate training with accumulate_grad_batches > 1 runs into the NaN problem. @CWX-student can you confirm this, or do you have any other info on your end?
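For reference, a minimal sketch of the setting I mean, written against the plain PyTorch Lightning Trainer API rather than this repo's exact training script (the surrounding arguments and names are placeholders):

```python
import pytorch_lightning as pl

# Hedged sketch, not this repo's exact training script: keeping gradient
# accumulation at 1 is the configuration that trained stably for me.
trainer = pl.Trainer(
    accumulate_grad_batches=1,  # values > 1 were where the NaN losses appeared in my runs
    max_steps=50000,            # roughly the length of the run that succeeded
)
# trainer.fit(vqgan_model, train_dataloader)  # model/dataloader names are placeholders
```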

@benearnthof

Update: Setting the precision parameter in the config to at least 32 seems to alleviate this problem. https://discuss.pytorch.org/t/distributed-training-gives-nan-loss-but-single-gpu-training-is-fine/63664/6
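Concretely, the combination that has been stable for me looks like the sketch below. These are standard PyTorch Lightning Trainer arguments; how they are wired through this repo's config may differ, so treat the exact keys as an assumption:

```python
import pytorch_lightning as pl

# Hedged sketch: full fp32 training plus no gradient accumulation is what
# alleviated the NaN losses for me.
trainer = pl.Trainer(
    precision=32,               # at least 32; mixed/16-bit precision was where the NaNs showed up
    accumulate_grad_batches=1,  # see the previous comment
)
# trainer.fit(vqgan_model, train_dataloader)  # placeholder names
```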
