
Unable to Reproduce Training Process #46

Closed
Toneyaya opened this issue Jun 19, 2024 · 1 comment

Comments

@Toneyaya

Hello, thank you for your outstanding open-source work! I ran into a problem in the second training stage while trying to reproduce the training process: the loss drops to zero within the first few iterations and stays there, regardless of whether I use my own trained mm_projector.bin or the weights you released. I followed your instructions exactly.

(If anyone has successfully reproduced the training process, let's discuss this issue together.)

[Screenshot: training log showing the loss at 0]

@jpthu17 (Member) commented Jun 20, 2024

A loss of 0 may be caused by numerical overflow. What kind of GPUs do you use? You can confirm whether the GPU supports bf16 with `torch.cuda.is_bf16_supported()`. If it does not support bf16, the overflow is likely the cause.
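A quick way to check this (a minimal sketch, not from the repo; the suggestion to fall back to fp16/fp32 is an assumption about the training scripts):

```python
import torch

# Minimal sketch: verify bf16 support before launching stage-2 training.
# On GPUs without native bf16 (e.g. pre-Ampere cards such as V100),
# training with bf16 enabled can overflow and collapse the loss to 0.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("bf16 supported -- keep bf16 enabled in the training config")
else:
    # Assumption: the training scripts expose fp16/fp32 alternatives;
    # the exact flag names depend on the launch scripts you are using.
    print("bf16 NOT supported -- switch to fp16 or fp32 in the training config")
```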

@jpthu17 closed this as completed Jul 9, 2024