Resuming multi-GPU training from a checkpoint fails with: RuntimeError: The input state dict is empty, possibly because it was saved from a disabled instance of GradScaler. #6853

@cycao77

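Please describe your question

I am training GLM on multiple GPUs. Resuming training from a previously saved checkpoint fails with the following error: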
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/GLM_paddle/glm/glm_multi_card/finetune_generation.py", line 193, in <module>
    main()
  File "/root/paddlejob/workspace/env_run/GLM_paddle/glm/glm_multi_card/finetune_generation.py", line 181, in main
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 581, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1984, in _load_optimizer_and_scheduler
    self.scaler.load_state_dict(paddle.load(os.path.join(checkpoint, SCALER_NAME), return_numpy=True))
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddle/amp/grad_scaler.py", line 1197, in load_state_dict
    super().load_state_dict(state_dict)
  File "/root/paddlejob/workspace/env_run/GLM_paddle/anaconda3/lib/python3.9/site-packages/paddle/amp/grad_scaler.py", line 559, in load_state_dict
    raise RuntimeError(
RuntimeError: The input state dict is empty, possibly because it was saved from a disabled instance of GradScaler.
I0829 10:16:15.477458 943 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop

Code:
if training_args.do_train:
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
    trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1)
    trainer.log_metrics("train", train_result.metrics)
    trainer.save_metrics("train", train_result.metrics)
    trainer.save_state()
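
Going only by the error message, my guess is that the checkpoint was written while the scaler was disabled (so its saved state is empty) and the resumed run has the scaler enabled. The sketch below reproduces that situation without Paddle. `StandInScaler` and `load_scaler_state_if_present` are hypothetical names invented for illustration; they are not Paddle or PaddleNLP APIs, and the empty-when-disabled behaviour is an assumption taken from the wording of the error.

```python
class StandInScaler:
    """Hypothetical stand-in for a gradient scaler; NOT Paddle's implementation."""

    def __init__(self, enable=True, init_loss_scaling=2.0 ** 15):
        self._enable = enable
        self._scale = init_loss_scaling

    def state_dict(self):
        # Assumption: a disabled scaler has nothing to save and returns {}.
        return {"scale": self._scale} if self._enable else {}

    def load_state_dict(self, state_dict):
        if not self._enable:
            return  # a disabled scaler ignores any saved state
        if len(state_dict) == 0:
            # Same message as in the traceback above.
            raise RuntimeError(
                "The input state dict is empty, possibly because it was saved "
                "from a disabled instance of GradScaler."
            )
        self._scale = state_dict["scale"]


def load_scaler_state_if_present(scaler, state_dict):
    """Load the scaler state only if it is non-empty; return True if loaded."""
    if not state_dict:
        # Nothing was saved, so keep the scaler's freshly initialised state.
        return False
    scaler.load_state_dict(state_dict)
    return True


if __name__ == "__main__":
    saved = StandInScaler(enable=False).state_dict()  # checkpoint from a disabled scaler
    resumed = StandInScaler(enable=True)              # resumed run with scaling enabled
    try:
        resumed.load_state_dict(saved)
    except RuntimeError as err:
        print("unguarded load failed:", err)
    print("guarded load applied state:", load_scaler_state_if_present(resumed, saved))
```

If this guess is right, either resuming with the same mixed-precision settings that were used when the checkpoint was saved, or a guard like the one above around the scaler load, should avoid the error. I have not verified this against the PaddleNLP source.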

Metadata

Labels

question (Further information is requested), triage
