Has anyone run into a NaN loss during pretraining? #17
Comments
Same here: my loss gets down to around 4.x and then turns into NaN. Did you manage to solve this?
It's a warmup problem. Set warmup iters larger; if that still doesn't help, lower the learning rate a bit.
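For readers unsure what a larger warmup changes, here is a minimal sketch of a linear-warmup-then-cosine-decay schedule. It is not taken from this repo's `pretrain.py`; the function name `lr_at` and the values of `min_lr`, `warmup_iters`, and `decay_iters` are assumptions for illustration. Only the peak of 1e-4 and the start at 0 match the logs posted in this thread.

```python
import math

def lr_at(it, max_lr=1e-4, min_lr=1e-5, warmup_iters=1000, decay_iters=100_000):
    """Hypothetical schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if it < warmup_iters:
        # Linear ramp from 0 up to max_lr over the warmup window.
        return max_lr * it / warmup_iters
    if it >= decay_iters:
        # After the decay window, hold the floor value.
        return min_lr
    # Cosine decay: progress runs from 0 to 1 between warmup end and decay end.
    progress = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

# A longer warmup gives a smaller learning rate at the same early iteration.
short_warmup = lr_at(500, warmup_iters=1000)
long_warmup = lr_at(500, warmup_iters=5000)
```

With these assumed values, iteration 500 sits halfway through the short warmup but only a tenth of the way through the long one, so the early updates are much smaller in the second case.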
OK, thank you.
[2023-09-01 10:39:57,559][pretrain.py][INFO] Epoch:0/2 loss:11.271 lr:0.0000000 epoch_Time:150283.0min:
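Hello author, after changing the warmup iters and the learning rate I still get NaN. Could it be related to the batch size? I set it to 16; does it need to be larger for an LLM?
Strange, could it be related to the data? I'd suggest troubleshooting from the following angles:
Pretraining was fine for me with PyTorch 1.12.1, but after switching to PyTorch 2.0 today I also got NaN.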
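I'm on 2.0 as well, because with 1.x it tells me I can't use Flash Attention.
Has anyone measured how much the speed differs with and without Flash Attention?
Set up the environment strictly according to requirements.txt, especially torch.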
Use
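My guess is that this is a mixed-precision training problem. If you can't resolve it and your GPU memory allows, run the whole training in fp32.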
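To illustrate why reduced precision can produce NaN at all, here is a small standard-library sketch. It assumes the low-precision dtype is IEEE half precision (fp16), which nobody in this thread has confirmed; bf16 has a much wider range and behaves differently. The helper `to_fp16` is a hypothetical name, not part of any library.

```python
import math
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision (assumed dtype)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range (max 65504);
        # fp16 arithmetic on hardware would overflow to +/-inf instead.
        return math.copysign(math.inf, x)

largest = to_fp16(65504.0)      # largest finite fp16 value, survives intact
overflowed = to_fp16(70000.0)   # beyond the range, becomes inf
poisoned = overflowed - overflowed  # inf - inf is NaN
vanished = to_fp16(1e-8)        # too small for fp16, flushes to 0.0
```

Once a single intermediate value overflows, the NaN propagates through the loss and every later update, which matches the pattern of a loss that never recovers. In fp32 both the overflow and the underflow shown here would not occur for these magnitudes.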
[2023-08-30 16:04:47,404][pretrain.py][INFO] Epoch:0/2 loss:11.271 lr:0.0000000 epoch_Time:137483.0min:
[2023-08-30 16:08:27,427][pretrain.py][INFO] Epoch:0/2 loss:6.268 lr:0.0001000 epoch_Time:1208.0min:
[2023-08-30 16:12:01,041][pretrain.py][INFO] Epoch:0/2 loss:5.627 lr:0.0001000 epoch_Time:1121.0min:
[2023-08-30 16:15:35,618][pretrain.py][INFO] Epoch:0/2 loss:4.548 lr:0.0000999 epoch_Time:1091.0min:
[2023-08-30 16:19:08,321][pretrain.py][INFO] Epoch:0/2 loss:4.591 lr:0.0000997 epoch_Time:1072.0min:
[2023-08-30 16:22:43,731][pretrain.py][INFO] Epoch:0/2 loss:4.309 lr:0.0000994 epoch_Time:1062.0min:
[2023-08-30 16:26:16,924][pretrain.py][INFO] Epoch:0/2 loss:4.294 lr:0.0000991 epoch_Time:1053.0min:
[2023-08-30 16:29:49,699][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000987 epoch_Time:1044.0min:
[2023-08-30 16:33:33,730][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000983 epoch_Time:1043.0min:
[2023-08-30 16:37:10,391][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000977 epoch_Time:1039.0min:
[2023-08-30 16:40:49,196][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000971 epoch_Time:1035.0min:
[2023-08-30 16:44:29,060][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000965 epoch_Time:1031.0min:
[2023-08-30 16:48:10,314][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000958 epoch_Time:1029.0min:
[2023-08-30 16:51:50,553][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000950 epoch_Time:1025.0min:
[2023-08-30 16:55:41,688][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000941 epoch_Time:1025.0min:
[2023-08-30 16:59:56,754][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000932 epoch_Time:1033.0min:
[2023-08-30 17:04:02,156][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000922 epoch_Time:1036.0min:
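When comparing runs, it helps to pin down where the loss first went to NaN and what the learning rate was at that point. The following is a rough sketch that parses lines in the format shown above; the regular expression and the function name `first_nan_step` are assumptions based only on these log lines, not on the logging code in `pretrain.py`.

```python
import math
import re

# Matches the "loss:<value> lr:<value>" part of the log lines in this thread.
LOG_RE = re.compile(r"loss:(\S+) lr:(\S+)")

def first_nan_step(lines):
    """Return (line index, last finite loss, lr) for the first NaN line, or None."""
    last_finite = None
    for i, line in enumerate(lines):
        m = LOG_RE.search(line)
        if m is None:
            continue  # not a training log line
        loss, lr = float(m.group(1)), float(m.group(2))
        if math.isnan(loss):
            return i, last_finite, lr
        last_finite = loss
    return None

# Three lines copied from the log above.
sample = [
    "[2023-08-30 16:22:43,731][pretrain.py][INFO] Epoch:0/2 loss:4.309 lr:0.0000994 epoch_Time:1062.0min:",
    "[2023-08-30 16:26:16,924][pretrain.py][INFO] Epoch:0/2 loss:4.294 lr:0.0000991 epoch_Time:1053.0min:",
    "[2023-08-30 16:29:49,699][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000987 epoch_Time:1044.0min:",
]
result = first_nan_step(sample)
```

On these lines the loss was still finite (4.294) in the entry right before the first NaN, and the learning rate had already started decaying from its peak, so warmup length alone may not explain this particular run.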