Support unified checkpoint for expert_parallel #8591
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #8591      +/-   ##
===========================================
+ Coverage    53.86%   55.74%    +1.88%
===========================================
  Files          620      620
  Lines        97110    96741      -369
===========================================
+ Hits         52306    53930     +1624
+ Misses       44804    42811     -1993

☔ View full report in Codecov by Sentry.
Branch force-pushed from 6415e4c to 92d5432.
@@ -22,6 +22,7 @@
 import paddle
 import paddle.distributed as dist
 from paddle.distributed import fleet
+from paddle.framework import core
It would be best to add a test for this; the framework may need a model that supports expert_parallel.
if model_state_dict[key_name[0]].dtype != core.VarDesc.VarType.FP32:
    key_name = "_".join([static_name, FP32_MASTER, key_name[1]])
else:
    key_name = "_".join([static_name, key_name[1]])
Add a unit test for the FP32 case.
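For reference, a minimal sketch of the key-naming rule in the hunk above, assuming FP32_MASTER is the "fp32_master_0" infix; `optimizer_key` is a hypothetical helper, not PaddleNLP API:

```python
# Sketch only: the FP32_MASTER value and the helper name are assumptions.
FP32_MASTER = "fp32_master_0"

def optimizer_key(static_name: str, suffix: str, is_fp32_param: bool) -> str:
    # Non-FP32 params keep an FP32 master copy, so their optimizer-state
    # keys carry the FP32_MASTER infix; FP32 params are keyed directly.
    if not is_fp32_param:
        return "_".join([static_name, FP32_MASTER, suffix])
    return "_".join([static_name, suffix])

assert optimizer_key("linear_0.w_0", "moment1_0", False) == "linear_0.w_0_fp32_master_0_moment1_0"
assert optimizer_key("linear_0.w_0", "moment1_0", True) == "linear_0.w_0_moment1_0"
```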
need_files = set()
state_dict = get_expected_state_dict(model)
for key in state_dict.keys():
    filename = index["weight_map"][key]
    # When using expert parallel, there's no need to check tensors with `no_sync=False` when dp_rank > 0.
    if args.use_expert_parallel and dp_rank > 0 and not getattr(state_dict[key], "no_sync", False):
Is this skipping the `no_sync` parameters here?
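For context, a hedged sketch of the skip rule the diff implements (attribute and flag names follow the excerpt; the surrounding loop body is assumed):

```python
# Sketch, assuming `no_sync=True` marks expert parameters that live only
# on their own data-parallel rank under expert parallel.
def keys_needed_on_this_rank(state_dict, use_expert_parallel, dp_rank):
    needed = []
    for key, tensor in state_dict.items():
        # dp_rank 0 already covers all shared tensors, so ranks > 0 only
        # keep the expert-local ones (no_sync=True) and skip the rest.
        if use_expert_parallel and dp_rank > 0 and not getattr(tensor, "no_sync", False):
            continue
        needed.append(key)
    return needed
```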
@@ -962,6 +1015,7 @@ def save_single_card_optimizer(args, model, optimizer, output_dir):
     if master_weights is not None:
         for key in list(master_weights.keys()):
             master_weights[static2struct_name_mappings[key]] = master_weights.pop(key)
+        master_weights.update(fp32_weight)
Will this be popped back out when loading?
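As a minimal illustration of the save-side merge (dummy values; whether the load path pops these entries back out is exactly the reviewer's question and is not shown in this excerpt):

```python
# Dummy stand-ins; the real values are fp32 tensors.
master_weights = {"linear_0.w_0": "fp32 master of a bf16 param"}
fp32_weight = {"layer_norm_0.w_0": "param trained directly in fp32"}

# The diff folds fp32 params into master_weights so both are saved together.
master_weights.update(fp32_weight)
assert set(master_weights) == {"linear_0.w_0", "layer_norm_0.w_0"}
```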
else:
    shard_file = file_name.replace(
        ".pdparams",
        f"-{args.logical_process_index + 1:05d}-of-{args.world_size//args.dataset_world_size:05d}.pdparams",
Is the shard-index numbering issue mentioned earlier still present?
    )
    shard_file = shard_file.replace(
        ".safetensors",
        f"-{args.logical_process_index + 1:05d}-of-{args.world_size//sd_degree:05d}.safetensors",
This is too long; please merge the two calls and simplify the code.
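One possible consolidation, as a sketch rather than the PR's final code (`add_shard_suffix` is a hypothetical helper; the shard count per format follows the excerpts above):

```python
def add_shard_suffix(file_name: str, index: int, num_shards: int) -> str:
    """E.g. model.safetensors -> model-00001-of-00008.safetensors."""
    for ext in (".pdparams", ".safetensors"):
        if file_name.endswith(ext):
            stem = file_name[: -len(ext)]
            return f"{stem}-{index + 1:05d}-of-{num_shards:05d}{ext}"
    return file_name

# Usage mirroring the branches above, with the format-specific shard count:
# shard_file = add_shard_suffix(file_name, args.logical_process_index,
#                               args.world_size // sd_degree)
```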
This Pull Request is stale because it has been open for 60 days with no activity.
PR types: New features
PR changes: Others
Description: Support unified checkpoint for expert_parallel.