
[Semi Auto] Refactor Completion Mechanism (Part1) #57447

Merged
merged 21 commits into from Sep 26, 2023

Conversation

JZ-LIANG
Contributor

@JZ-LIANG JZ-LIANG commented Sep 18, 2023

PR types

Function optimization

PR changes

Others

Description

Pcard-70448

Refactor the op-internal dims_mapping update logic:

  • Enable the new SPMD rules in static-mode Completion via the flag FLAGS_infer_spmd_enable=1 (default: 1); see the flag-setting sketch after this list.
  • Deprecate the original update_dims_mapping and is_xxx_compatible methods used in Completion (matmul, elementwise, reduction, and layer_norm have been adapted).
  • Re-map DistOp: a new SPMD rule may produce a parallel pattern that has no corresponding DistOperatorImpl. When that happens, we re-map the op to replicated.
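
A minimal sketch of toggling the flag, assuming FLAGS_infer_spmd_enable behaves like other Paddle global flags (whether it is settable at runtime via paddle.set_flags is an assumption, not confirmed by this PR):

```python
import os

# Set the flag through the environment before Paddle initializes
# (per this PR, FLAGS_infer_spmd_enable defaults to 1, i.e. enabled).
os.environ["FLAGS_infer_spmd_enable"] = "1"

import paddle

# Or toggle it at runtime, assuming the flag is exposed like other
# Paddle global flags.
paddle.set_flags({"FLAGS_infer_spmd_enable": True})
```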

The new logic has two main branches:
left: the original logic, for ops that have no new SPMD rule;
right: the new logic, for ops whose new SPMD rule is implemented.
[figure: the two branches of the new completion logic]
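
As a rough illustration of the two branches, here is a hedged pseudocode sketch; every helper name (get_new_spmd_rule, find_compatible_impl, make_replicated, and the dist_op methods) is hypothetical and only mirrors the control flow described above, not the exact APIs in this PR:

```python
def update_op_dims_mapping(dist_op, infer_spmd_enabled: bool):
    # Hypothetical lookup: returns None for ops without a new SPMD rule.
    rule = get_new_spmd_rule(dist_op.op_type)
    if infer_spmd_enabled and rule is not None:
        # Right branch: infer dims_mapping with the general SPMD rule,
        # then map the inferred parallel pattern to a DistOperatorImpl.
        inferred = rule.infer_forward(dist_op.input_dist_attrs())
        impl = find_compatible_impl(dist_op, inferred)
        if impl is None:
            # No impl supports this pattern: re-map the op to replicated.
            inferred = make_replicated(inferred)
        dist_op.set_dims_mapping(inferred)
    else:
        # Left branch: the original per-DistOperatorImpl update logic.
        dist_op.update_dims_mapping()
```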

How we plan to refactor static-mode Semi Auto Parallel:

Currently:
[figure: the current completion architecture]
There are three problems:

  1. Limited coverage: each DistOperatorImpl is hard-coded for one specific parallel pattern of the operator (both spmd_rule and parallel_transform), so only a limited set of parallel patterns is supported.
  2. Code redundancy: there is plenty of redundant logic among the DistOperatorImpls of the same op (in both spmd_rule and parallel_transform).
  3. Not compatible with Semi Auto Parallel in dynamic mode.

After Refactoring Completion (this and the next PRs):
[figure: the completion architecture after this refactor]

  • Each operator will have only one spmd_rule, which is general enough to cover all parallel patterns.
  • The parallel pattern produced by the SpmdRule is mapped to a DistOperatorImpl; if there is no corresponding impl, the op is reverted to replicated (see the sketch below).
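
In Paddle's auto-parallel representation, a dims_mapping entry of -1 marks a tensor axis as replicated (not sharded), so "reverted to replicated" amounts to setting every entry to -1. A self-contained sketch of such a helper (the name make_replicated is hypothetical, matching the pseudocode above):

```python
def make_replicated(dims_mappings):
    """Replace every sharded tensor axis with -1, i.e. replicated.

    dims_mappings maps tensor names to per-axis mesh-dimension indices,
    e.g. {"x": [0, -1], "y": [-1, 1]}; -1 marks an axis as not sharded.
    """
    return {name: [-1] * len(m) for name, m in dims_mappings.items()}

# Example: shard-annotated inputs collapse to fully replicated.
print(make_replicated({"x": [0, -1], "y": [-1, 1]}))
# {'x': [-1, -1], 'y': [-1, -1]}
```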

After Refactoring DistOperatorImpl (future PRs):
[figure: the architecture after the DistOperatorImpl refactor]

  • A LocalReshard module will be integrated to unify all DistOperatorImpls of the same operator and to support the transformations needed by any parallel pattern.

@paddle-bot

paddle-bot bot commented Sep 18, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JZ-LIANG JZ-LIANG changed the title [Semi Auto] Refactor Completion Mechanism [Semi Auto] Refactor Completion Mechanism I Sep 21, 2023
@JZ-LIANG JZ-LIANG changed the title [Semi Auto] Refactor Completion Mechanism I [Semi Auto] Refactor Completion Mechanism Part1 Sep 21, 2023
@JZ-LIANG JZ-LIANG changed the title [Semi Auto] Refactor Completion Mechanism Part1 [Semi Auto] Refactor Completion Mechanism (Part1) Sep 21, 2023
Contributor

@zhiqiu zhiqiu left a comment


LGTM

@zhiqiu zhiqiu merged commit 8bbd558 into PaddlePaddle:develop Sep 26, 2023
27 checks passed
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
* first commit

* framework

* matmul done

* elementwise done

* adapt done

* polish code

* revise logging

* revise log

* update doc

* enable LN unitest

* precommit

* bugfix reduce_sum

* bugfix assign

* bugfix for print program

* enable rule for dropout

* bugfix for dist op
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 16, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023