Description
Please describe your question

When training GLM on multiple GPUs and resuming training from a previously saved checkpoint, the following error occurs:
```
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/GLM_paddle/glm/glm_multi_card/finetune_generation.py", line 193, in <module>
    main()
  File "/root/paddlejob/workspace/env_run/GLM_paddle/glm/glm_multi_card/finetune_generation.py", line 181, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 581, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1984, in _load_optimizer_and_scheduler
    self.scaler.load_state_dict(paddle.load(os.path.join(checkpoint, SCALER_NAME), return_numpy=True))
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddle/amp/grad_scaler.py", line 1197, in load_state_dict
    super().load_state_dict(state_dict)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddle/amp/grad_scaler.py", line 559, in load_state_dict
    raise RuntimeError(
RuntimeError: The input state dict is empty, possibly because it was saved from a disabled instance of GradScaler.
I0829 10:16:15.477458 943 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop
```
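The error message suggests a mismatch: the checkpoint's scaler file contains an empty state dict, which typically happens when the run that wrote the checkpoint had AMP (fp16) disabled, while the resuming run has it enabled. A minimal, self-contained sketch (using a hypothetical stand-in class, not Paddle's actual `GradScaler` implementation) of why the load then raises:

```python
# Hypothetical stand-in illustrating the failure mode: a disabled scaler
# saves an empty state dict, and loading that empty dict into an enabled
# scaler raises the RuntimeError seen in the traceback.

class GradScalerSketch:
    def __init__(self, enable=True):
        self.enable = enable
        self.loss_scaling = 65536.0  # illustrative default

    def state_dict(self):
        # A disabled scaler contributes no state: this empty dict is what
        # later triggers the RuntimeError on load.
        return {"scale": self.loss_scaling} if self.enable else {}

    def load_state_dict(self, state_dict):
        if not state_dict:
            raise RuntimeError(
                "The input state dict is empty, possibly because it was "
                "saved from a disabled instance of GradScaler."
            )
        self.loss_scaling = state_dict["scale"]

# Checkpoint written with AMP disabled ...
saved = GradScalerSketch(enable=False).state_dict()  # empty dict
# ... then resumed with AMP enabled reproduces the reported error.
try:
    GradScalerSketch(enable=True).load_state_dict(saved)
except RuntimeError as exc:
    print(exc)
```

If this is the cause, keeping the AMP/fp16 settings identical between the original run and the resumed run should avoid the empty state dict.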
Code:
```python
if training_args.do_train:
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
    trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1)
    trainer.log_metrics("train", train_result.metrics)
    trainer.save_metrics("train", train_result.metrics)
    trainer.save_state()
```