
Unable to Reproduce Training Process #46

Closed
Toneyaya opened this issue Jun 19, 2024 · 1 comment

Comments

@Toneyaya

Hello, thank you for your outstanding open-source work! I ran into a problem in the second training stage while trying to reproduce the training process: the loss drops to zero within the first few iterations and stays there, regardless of whether I use my own trained mm_projector.bin or the weights you released. I followed your instructions exactly.

(If anyone has successfully reproduced the training process, let's discuss this issue together.)

[Screenshot: training log showing the loss at 0]

@jpthu17 (Member) commented Jun 20, 2024

A loss of 0 may be caused by numerical overflow. What kind of GPUs do you use? You can confirm whether the GPU supports bf16 with `torch.cuda.is_bf16_supported()`. If it does not support bf16, the overflow is likely the cause.
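A quick way to check this (a minimal sketch, not from the repo; the suggestion to fall back to fp16/fp32 is an assumption about the training scripts):

```python
import torch

# Minimal sketch: verify bf16 support before launching stage-2 training.
# On GPUs without native bf16 (e.g. pre-Ampere cards such as V100),
# training with bf16 enabled can overflow and collapse the loss to 0.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("bf16 supported -- keep bf16 enabled in the training config")
else:
    # Assumption: the training scripts expose fp16/fp32 alternatives;
    # the exact flag names depend on the launch scripts you are using.
    print("bf16 NOT supported -- switch to fp16 or fp32 in the training config")
```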

@jpthu17 closed this as completed Jul 9, 2024