
Training fails with a size mismatch error after changing the MP size #49

Open
samulew opened this issue Mar 7, 2023 · 2 comments

samulew commented Mar 7, 2023

I downloaded the 100000 checkpoint and used CPM-1-Generate/change_mp.py to split it into 16 partitions, then set the MP size to 16 for training, but training fails with a size mismatch error. It looks like the checkpoint's partitioning scheme differs from what the training script expects. How can this be resolved?

172.24.241.143: File "/CPM/CPM-2-Finetune-master/finetune_cpm2.py", line 790, in main
172.24.241.143: model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 231, in setup_model_and_optimizer
172.24.241.143: args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 497, in load_checkpoint
172.24.241.143: checkpoint_name, sd = model.load_checkpoint(
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1452, in load_checkpoint
172.24.241.143: load_path, client_states = self._load_checkpoint(load_dir,
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1487, in _load_checkpoint
172.24.241.143: self.load_module_state_dict(state_dict=checkpoint['module'],
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1389, in load_module_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/model/distributed.py", line 90, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/fp16/fp16.py", line 71, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
172.24.241.143: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
172.24.241.143: RuntimeError: Error(s) in loading state_dict for EncDecModel:
172.24.241.143: size mismatch for lm_head.weight: copying a param with shape torch.Size([6560, 1024]) from checkpoint, the shape in current model is torch.Size([1640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.4.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
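A quick way to see where the mismatch comes from (a sketch with hypothetical shapes, assuming Megatron-style column-parallel splitting; the full sizes below are inferred from the error message, not taken from the CPM-2 code):

```python
import numpy as np

# Hypothetical full (unpartitioned) lm_head weight: [vocab, hidden] = [26240, 4096]
# (26240 = 16 * 1640 = 4 * 6560, and 4096 = 4 * 1024, per the error message).
full = np.zeros((26240, 4096), dtype=np.float16)

# What the current model (mp=16, split along dim 0) expects per rank:
expected = np.split(full, 16, axis=0)[0]
print(expected.shape)  # (1640, 4096) -- matches "shape in current model"

# The checkpoint shard is (6560, 1024): the same number of elements,
# but consistent with a 4-way split along BOTH dimensions rather than
# a 16-way split along dim 0 alone.
assert 6560 * 1024 == 1640 * 4096
```

So the shards are not corrupt; they were just partitioned along different axes than the model's parallel layout assumes.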

zzy14 (Contributor) commented Mar 8, 2023

You can convert the checkpoint by referring to CPM-1's MP-conversion code.

samulew (Author) commented Mar 8, 2023

> You can convert the checkpoint by referring to CPM-1's MP-conversion code.

Thanks for the reply. The size mismatch appeared precisely after I converted the checkpoint with CPM-1's MP-conversion code.
Do you mean that CPM-1's MP-conversion code is only meant as a reference and cannot be used directly, i.e. it has to be modified first?
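If the shards really were produced with a different partitioning scheme, one possible fix is to merge them back into the full tensor and re-split along the axis the CPM-2 model expects. A minimal numpy sketch, assuming the old shards were split along dim 0 with old mp=4 and the new run wants mp=16 (the function name and axis conventions are illustrative, not from change_mp.py):

```python
import numpy as np

def resplit(shards, new_mp, axis=0):
    """Merge model-parallel shards and re-split into new_mp pieces.

    Assumes every shard was split along `axis`; in Megatron-style layouts,
    column-parallel weights split along dim 0 and row-parallel along dim 1.
    """
    full = np.concatenate(shards, axis=axis)
    return np.split(full, new_mp, axis=axis)

# Toy example: a [26240, 4096] weight held as 4 shards of [6560, 4096],
# each filled with its rank index so we can check value preservation.
old = [np.full((6560, 4096), i, dtype=np.float16) for i in range(4)]
new = resplit(old, new_mp=16, axis=0)
print(len(new), new[0].shape)  # 16 (1640, 4096)
```

In a real conversion the split axis differs per parameter (e.g. attention/FFN input projections vs. output projections), so each key in the state_dict would need its own axis; the helper above only shows the mechanics.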
