
Training fails with a size mismatch error after changing the MP size #49

Open
samulew opened this issue Mar 7, 2023 · 2 comments

samulew commented Mar 7, 2023

I downloaded the 100000 checkpoint and used CPM-1-Generate/change_mp.py to split it into 16 partitions, then set the MP size to 16 for training, but training fails with a size mismatch error. It looks like the checkpoint's partitioning scheme differs from what the training script expects. How can this be resolved?

172.24.241.143: File "/CPM/CPM-2-Finetune-master/finetune_cpm2.py", line 790, in main
172.24.241.143: model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 231, in setup_model_and_optimizer
172.24.241.143: args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 497, in load_checkpoint
172.24.241.143: checkpoint_name, sd = model.load_checkpoint(
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1452, in load_checkpoint
172.24.241.143: load_path, client_states = self._load_checkpoint(load_dir,
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1487, in _load_checkpoint
172.24.241.143: self.load_module_state_dict(state_dict=checkpoint['module'],
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1389, in load_module_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/model/distributed.py", line 90, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/fp16/fp16.py", line 71, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
172.24.241.143: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
172.24.241.143: RuntimeError: Error(s) in loading state_dict for EncDecModel:
172.24.241.143: size mismatch for lm_head.weight: copying a param with shape torch.Size([6560, 1024]) from checkpoint, the shape in current model is torch.Size([1640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.4.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
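A quick way to see where the mismatch comes from (a sketch with hypothetical shapes, assuming Megatron-style column-parallel splitting; the full sizes below are inferred from the error message, not taken from the CPM-2 code):

```python
import numpy as np

# Hypothetical full (unpartitioned) lm_head weight: [vocab, hidden] = [26240, 4096]
# (26240 = 16 * 1640 = 4 * 6560, and 4096 = 4 * 1024, per the error message).
full = np.zeros((26240, 4096), dtype=np.float16)

# What the current model (mp=16, split along dim 0) expects per rank:
expected = np.split(full, 16, axis=0)[0]
print(expected.shape)  # (1640, 4096) -- matches "shape in current model"

# The checkpoint shard is (6560, 1024): the same number of elements,
# but consistent with a 4-way split along BOTH dimensions rather than
# a 16-way split along dim 0 alone.
assert 6560 * 1024 == 1640 * 4096
```

So the shards are not corrupt; they were just partitioned along different axes than the model's parallel layout assumes.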

zzy14 (Contributor) commented Mar 8, 2023

You can convert the checkpoint by referring to CPM-1's MP-conversion code.

samulew (Author) commented Mar 8, 2023

> You can convert the checkpoint by referring to CPM-1's MP-conversion code.

Thanks for the reply. The size mismatch appeared precisely after I converted the checkpoint with CPM-1's MP-conversion code.
Do you mean that CPM-1's MP-conversion code is only meant as a reference and cannot be used directly, i.e. it has to be modified first?
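If the shards really were produced with a different partitioning scheme, one possible fix is to merge them back into the full tensor and re-split along the axis the CPM-2 model expects. A minimal numpy sketch, assuming the old shards were split along dim 0 with old mp=4 and the new run wants mp=16 (the function name and axis conventions are illustrative, not from change_mp.py):

```python
import numpy as np

def resplit(shards, new_mp, axis=0):
    """Merge model-parallel shards and re-split into new_mp pieces.

    Assumes every shard was split along `axis`; in Megatron-style layouts,
    column-parallel weights split along dim 0 and row-parallel along dim 1.
    """
    full = np.concatenate(shards, axis=axis)
    return np.split(full, new_mp, axis=axis)

# Toy example: a [26240, 4096] weight held as 4 shards of [6560, 4096],
# each filled with its rank index so we can check value preservation.
old = [np.full((6560, 4096), i, dtype=np.float16) for i in range(4)]
new = resplit(old, new_mp=16, axis=0)
print(len(new), new[0].shape)  # 16 (1640, 4096)
```

In a real conversion the split axis differs per parameter (e.g. attention/FFN input projections vs. output projections), so each key in the state_dict would need its own axis; the helper above only shows the mechanics.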
