
Has anyone run into NaN loss during pretraining? #17

Open
ZK-Zhou opened this issue Aug 30, 2023 · 15 comments


ZK-Zhou commented Aug 30, 2023

[2023-08-30 16:04:47,404][pretrain.py][INFO] Epoch:0/2 loss:11.271 lr:0.0000000 epoch_Time:137483.0min:
[2023-08-30 16:08:27,427][pretrain.py][INFO] Epoch:0/2 loss:6.268 lr:0.0001000 epoch_Time:1208.0min:
[2023-08-30 16:12:01,041][pretrain.py][INFO] Epoch:0/2 loss:5.627 lr:0.0001000 epoch_Time:1121.0min:
[2023-08-30 16:15:35,618][pretrain.py][INFO] Epoch:0/2 loss:4.548 lr:0.0000999 epoch_Time:1091.0min:
[2023-08-30 16:19:08,321][pretrain.py][INFO] Epoch:0/2 loss:4.591 lr:0.0000997 epoch_Time:1072.0min:
[2023-08-30 16:22:43,731][pretrain.py][INFO] Epoch:0/2 loss:4.309 lr:0.0000994 epoch_Time:1062.0min:
[2023-08-30 16:26:16,924][pretrain.py][INFO] Epoch:0/2 loss:4.294 lr:0.0000991 epoch_Time:1053.0min:
[2023-08-30 16:29:49,699][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000987 epoch_Time:1044.0min:
[2023-08-30 16:33:33,730][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000983 epoch_Time:1043.0min:
[2023-08-30 16:37:10,391][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000977 epoch_Time:1039.0min:
[2023-08-30 16:40:49,196][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000971 epoch_Time:1035.0min:
[2023-08-30 16:44:29,060][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000965 epoch_Time:1031.0min:
[2023-08-30 16:48:10,314][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000958 epoch_Time:1029.0min:
[2023-08-30 16:51:50,553][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000950 epoch_Time:1025.0min:
[2023-08-30 16:55:41,688][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000941 epoch_Time:1025.0min:
[2023-08-30 16:59:56,754][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000932 epoch_Time:1033.0min:
[2023-08-30 17:04:02,156][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000922 epoch_Time:1036.0min:

@zhangheyi-1

Same here: my loss starts going NaN once it drops to around 4.x. Have you solved this?

DLLXW (Owner) commented Aug 30, 2023


That's a warmup problem. Increase warmup iters; if that still doesn't help, lower the learning rate a bit.
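
For reference, the schedule in question is the usual linear-warmup-plus-cosine-decay. A minimal sketch of how such a schedule is typically computed per step (nanoGPT-style; the function and default values here are assumptions, not necessarily this repo's exact code):

import math

def get_lr(it, learning_rate=1e-4, warmup_iters=4000, lr_decay_iters=80000, min_lr=1e-5):
    # 1) linear warmup: ramp the lr up from 0 so early updates stay small
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

Making warmup_iters larger stretches phase 1, so the model takes more small-lr steps before the peak learning rate is reached.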

ZK-Zhou (Author) commented Aug 31, 2023


> That's a warmup problem. Increase warmup iters; if that still doesn't help, lower the learning rate a bit.

OK, thank you.

ZK-Zhou (Author) commented Sep 1, 2023

gradient_accumulation_steps = 1 # used to simulate larger batch sizes
batch_size = 16  # if gradient_accumulation_steps > 1, this is the micro-batch size
# model settings, adjust as needed
max_seq_len = 512
dim = 1024
n_layers = 12
n_heads = 8
multiple_of = 32
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 8e-5 # max learning rate
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 4000 # how many steps to warm up for
lr_decay_iters = 80000 # should be ~= max_iters per Chinchilla
min_lr = 1e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
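
With these settings, a typical training step wires the AdamW optimizer, fp16 autocast, and gradient clipping together roughly like this. A minimal sketch that reuses the names from the config above; the model and data are placeholders, not this repo's actual code:

import torch

model = torch.nn.Linear(dim, dim).cuda()  # placeholder for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate,
                              weight_decay=weight_decay, betas=(beta1, beta2))
scaler = torch.cuda.amp.GradScaler(enabled=True)  # loss scaling for fp16

for step in range(10):
    x = torch.randn(batch_size, dim, device='cuda')
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).float().pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)  # unscale before clipping real grad norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)  # silently skips the step if grads contain inf/nan
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

Note that GradScaler only skips steps when the gradients are inf/nan; a NaN produced in the fp16 forward pass still shows up directly in the logged loss.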

ZK-Zhou (Author) commented Sep 1, 2023

[2023-09-01 10:39:57,559][pretrain.py][INFO] Epoch:0/2 loss:11.271 lr:0.0000000 epoch_Time:150283.0min:
[2023-09-01 10:43:36,003][pretrain.py][INFO] Epoch:0/2 loss:7.089 lr:0.0000200 epoch_Time:1213.0min:
[2023-09-01 10:47:13,709][pretrain.py][INFO] Epoch:0/2 loss:6.374 lr:0.0000400 epoch_Time:1134.0min:
[2023-09-01 10:51:03,485][pretrain.py][INFO] Epoch:0/2 loss:5.279 lr:0.0000600 epoch_Time:1124.0min:
[2023-09-01 10:54:55,572][pretrain.py][INFO] Epoch:0/2 loss:5.059 lr:0.0000800 epoch_Time:1120.0min:
[2023-09-01 10:59:03,063][pretrain.py][INFO] Epoch:0/2 loss:4.653 lr:0.0000800 epoch_Time:1131.0min:
[2023-09-01 11:02:54,313][pretrain.py][INFO] Epoch:0/2 loss:4.521 lr:0.0000799 epoch_Time:1124.0min:
[2023-09-01 11:06:40,714][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000797 epoch_Time:1114.0min:
[2023-09-01 11:10:27,965][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000795 epoch_Time:1106.0min:
[2023-09-01 11:14:20,178][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000793 epoch_Time:1103.0min:
[2023-09-01 11:18:18,577][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000789 epoch_Time:1102.0min:
[2023-09-01 11:22:09,949][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000785 epoch_Time:1097.0min:

ZK-Zhou (Author) commented Sep 1, 2023

Hi, after changing the warmup iters and the learning rate I still get NaN. Could it be related to the batch size? I set it to 16; does an LLM need a larger one?

@zhaojainshi

The same problem shows up during finetuning, too. (screenshot)

DLLXW (Owner) commented Sep 1, 2023

> Hi, after changing the warmup iters and the learning rate I still get NaN. Could it be related to the batch size? I set it to 16; does an LLM need a larger one?

That's odd. Could it be the data? I'd suggest debugging from these angles:
1. Check whether the NaN appears after hitting a particular stretch of data; if so, you could skip that stretch (see the sketch after this list).
2. Try training without fp16 mixed precision at first: run in fp32 for a while, and switch to fp16 once the loss is stable.
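
To act on point 1, one approach is to detect a non-finite loss and skip that batch. A minimal sketch, assuming a standard PyTorch loop with a GradScaler; train_loader, model, optimizer, and scaler are placeholders, not this repo's code:

import torch
import torch.nn.functional as F

nan_batches = []
for step, (X, Y) in enumerate(train_loader):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(X)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), Y.view(-1))
    if not torch.isfinite(loss):
        nan_batches.append(step)  # remember which batch blew up for inspection
        optimizer.zero_grad(set_to_none=True)
        continue  # skip this batch entirely
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

Logging nan_batches lets you inspect the offending samples afterwards; if the same indices always trigger it, the data rather than the optimizer is the likely culprit.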

DLLXW (Owner) commented Sep 1, 2023

> The same problem shows up during finetuning, too. (screenshot)

Yours is even stranger: the loss had already dropped to 2.x and it still suddenly went NaN. Try iterating in fp32 for a while, then switching back to fp16?

qxj commented Sep 5, 2023

Pretraining worked fine for me on PyTorch 1.12.1; today I switched to PyTorch 2.0 and I'm hitting NaN too...

[2023-09-05 11:59:59,270][pretrain.py][INFO] Epoch:[0/1](5600/260508) loss:3.004 lr:0.0002976 epoch_Time:1507.0min:
[2023-09-05 12:00:34,668][pretrain.py][INFO] Epoch:[0/1](5700/260508) loss:2.988 lr:0.0002975 epoch_Time:1507.0min:
[2023-09-05 12:01:10,066][pretrain.py][INFO] Epoch:[0/1](5800/260508) loss:3.152 lr:0.0002974 epoch_Time:1506.0min:
[2023-09-05 12:01:45,469][pretrain.py][INFO] Epoch:[0/1](5900/260508) loss:3.047 lr:0.0002973 epoch_Time:1506.0min:
[2023-09-05 12:02:20,865][pretrain.py][INFO] Epoch:[0/1](6000/260508) loss:2.898 lr:0.0002971 epoch_Time:1505.0min:
[2023-09-05 12:02:56,262][pretrain.py][INFO] Epoch:[0/1](6100/260508) loss:3.095 lr:0.0002970 epoch_Time:1504.0min:
[2023-09-05 12:03:31,657][pretrain.py][INFO] Epoch:[0/1](6200/260508) loss:2.887 lr:0.0002969 epoch_Time:1503.0min:
[2023-09-05 12:04:07,051][pretrain.py][INFO] Epoch:[0/1](6300/260508) loss:2.839 lr:0.0002968 epoch_Time:1502.0min:
[2023-09-05 12:04:42,444][pretrain.py][INFO] Epoch:[0/1](6400/260508) loss:3.011 lr:0.0002967 epoch_Time:1502.0min:
[2023-09-05 12:05:17,842][pretrain.py][INFO] Epoch:[0/1](6500/260508) loss:3.048 lr:0.0002965 epoch_Time:1501.0min:
[2023-09-05 12:05:53,236][pretrain.py][INFO] Epoch:[0/1](6600/260508) loss:2.900 lr:0.0002964 epoch_Time:1500.0min:
[2023-09-05 12:06:28,629][pretrain.py][INFO] Epoch:[0/1](6700/260508) loss:3.206 lr:0.0002963 epoch_Time:1500.0min:
[2023-09-05 12:07:04,028][pretrain.py][INFO] Epoch:[0/1](6800/260508) loss:2.725 lr:0.0002962 epoch_Time:1499.0min:
[2023-09-05 12:07:39,436][pretrain.py][INFO] Epoch:[0/1](6900/260508) loss:2.975 lr:0.0002960 epoch_Time:1499.0min:
[2023-09-05 12:08:14,897][pretrain.py][INFO] Epoch:[0/1](7000/260508) loss:nan lr:0.0002959 epoch_Time:1498.0min:
[2023-09-05 12:08:50,363][pretrain.py][INFO] Epoch:[0/1](7100/260508) loss:nan lr:0.0002958 epoch_Time:1498.0min:
[2023-09-05 12:09:25,826][pretrain.py][INFO] Epoch:[0/1](7200/260508) loss:nan lr:0.0002956 epoch_Time:1497.0min:
[2023-09-05 12:10:01,288][pretrain.py][INFO] Epoch:[0/1](7300/260508) loss:nan lr:0.0002955 epoch_Time:1496.0min:
[2023-09-05 12:10:36,755][pretrain.py][INFO] Epoch:[0/1](7400/260508) loss:nan lr:0.0002953 epoch_Time:1496.0min:
[2023-09-05 12:11:12,219][pretrain.py][INFO] Epoch:[0/1](7500/260508) loss:nan lr:0.0002952 epoch_Time:1495.0min:
[2023-09-05 12:11:47,689][pretrain.py][INFO] Epoch:[0/1](7600/260508) loss:nan lr:0.0002950 epoch_Time:1495.0min:

ZK-Zhou (Author) commented Sep 6, 2023

> Pretraining worked fine for me on PyTorch 1.12.1; today I switched to PyTorch 2.0 and I'm hitting NaN too... (quoted log omitted; see the previous comment)

I'm on 2.0 as well, because on 1.x it said I couldn't use Flash Attention.

ZK-Zhou (Author) commented Sep 6, 2023

Has anyone benchmarked how much of a speed difference using Flash Attention makes versus not using it?
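
For a rough comparison you can force PyTorch 2.0's scaled_dot_product_attention onto a single backend and time it. A minimal micro-benchmark sketch; the shapes are arbitrary examples, not this repo's configuration:

import time
import torch
import torch.nn.functional as F

# batch=16, heads=8, seq=512, head_dim=64; fp16 on GPU so the flash backend is eligible
q = k = v = torch.randn(16, 8, 512, 64, device='cuda', dtype=torch.float16)

def bench(**backends):
    with torch.backends.cuda.sdp_kernel(**backends):  # PyTorch 2.0-era context manager
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(100):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        return time.time() - t0

print('flash:', bench(enable_flash=True, enable_math=False, enable_mem_efficient=False))
print('math :', bench(enable_flash=False, enable_math=True, enable_mem_efficient=False))

Note that torch.backends.cuda.sdp_kernel was later replaced by torch.nn.attention.sdpa_kernel in newer PyTorch releases.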

@Fangddm123

Set up the environment strictly following requirements.txt, especially the torch version.

crj1998 commented Apr 26, 2024

bfloat16

@Camellia-hz

My guess is it's the mixed-precision training. If you can't resolve it otherwise, and your GPU memory allows, just train everything in fp32.
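
Both this suggestion and the bfloat16 one above come down to the same switch: which dtype the autocast context uses. A minimal nanoGPT-style sketch; the dtype variable and overall structure are assumptions, not this repo's exact code:

import torch
from contextlib import nullcontext

dtype = 'bfloat16'  # try 'bfloat16' or 'float32'; 'float16' is the NaN-prone option
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16,
           'float16': torch.float16}[dtype]

# float32 means no autocast at all; bfloat16 keeps fp32's exponent range,
# so it avoids the fp16 overflows that typically produce NaN losses.
ctx = nullcontext() if dtype == 'float32' else torch.cuda.amp.autocast(dtype=ptdtype)

# GradScaler is only needed for float16; enabled=False makes it a no-op.
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

with ctx:
    pass  # forward pass and loss computation go here

bfloat16 needs hardware support (e.g. NVIDIA Ampere or newer); on older GPUs, fp32 is the safe fallback at the cost of speed and memory.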
