[Reshard] Support reshard on nd mesh with same placement #57432

Merged
merged 1 commit into from Sep 19, 2023

@LiYuRio LiYuRio commented Sep 18, 2023

PR types

New features

PR changes

Others

Description

Pcard-73145

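This PR adds support for converting between different states on the same high-dimensional mesh, which is needed for the hybrid DP+MP parallel scenario. The unit test exists only for coverage, since two cards cannot exercise a fully high-dimensional mesh; the change was tested locally on 8 cards. How it works: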

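  • Scanning from back to front, find the first tensor dimension at which the two dims_mapping values differ;
  • Starting from that dimension and moving toward the front, convert every non-replicated dimension of the input to replicated;
    • This reuses the 1-D same-mesh state transition functions, which requires extracting the relevant sub-ProcessMesh to reduce the dimensionality;
    • After the reduction, the low-dimensional dist_attr must be restored to the high-dimensional one to guarantee correctness;
  • Convert the resulting input, dimension by dimension, from replicated to the state required by the output;
    • This likewise reuses the 1-D same-mesh state transition functions, which requires extracting the relevant sub-ProcessMesh to reduce the dimensionality;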
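The steps above can be sketched as a small planning function. This is an illustrative sketch only, not the code in this PR: `Step`, `StepKind`, `FindFirstDiffDim`, and `PlanReshard` are hypothetical names, and it assumes that `dims_mapping[i]` holds the mesh dimension that tensor dimension `i` is sharded on, with `-1` meaning replicated. It also assumes the second phase walks the dimensions in the same back-to-front order as the first, which the description does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; they are not part of Paddle.
enum class StepKind { kShardToReplicated, kReplicatedToShard };

struct Step {
  StepKind kind;
  int64_t tensor_dim;  // tensor dimension being converted
  int64_t mesh_dim;    // mesh dimension whose 1-D sub-mesh is used
};

// Scan from back to front and return the first tensor dimension whose
// mapping differs, or -1 if the two mappings are identical.
int64_t FindFirstDiffDim(const std::vector<int64_t>& in,
                         const std::vector<int64_t>& out) {
  assert(in.size() == out.size());
  for (int64_t i = static_cast<int64_t>(in.size()) - 1; i >= 0; --i) {
    if (in[i] != out[i]) return i;
  }
  return -1;
}

// Build the list of 1-D transitions: first everything sharded becomes
// replicated, then each dimension is sharded as the output requires.
std::vector<Step> PlanReshard(const std::vector<int64_t>& in,
                              const std::vector<int64_t>& out) {
  std::vector<Step> steps;
  const int64_t first_diff = FindFirstDiffDim(in, out);
  for (int64_t i = first_diff; i >= 0; --i) {
    if (in[i] != -1) {
      steps.push_back({StepKind::kShardToReplicated, i, in[i]});
    }
  }
  // Assumed order: same back-to-front walk as the first phase.
  for (int64_t i = first_diff; i >= 0; --i) {
    if (out[i] != -1) {
      steps.push_back({StepKind::kReplicatedToShard, i, out[i]});
    }
  }
  return steps;
}
```

For example, `PlanReshard({0, 1, -1}, {1, 0, -1})` yields four steps: two shard-to-replicated transitions (tensor dimensions 1 and 0), followed by two replicated-to-shard transitions. Each step would then be executed on the 1-D sub-mesh along its `mesh_dim`.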

TODO:

  • Optimize the shard-to-replicated transition: when consecutive low dimensions are converted, reshape can replace split and concat;

Also fixes an issue where launch logs were printed twice under newer Python versions (3.9).

@LiYuRio LiYuRio force-pushed the dev_reshard branch 2 times, most recently from 4d20565 to b54fe48 Compare September 18, 2023 07:31

@chenwhql chenwhql left a comment

LGTM

for (int64_t i = 0; i < shape_of_axis; ++i) {
  coord[axis] = i;
  int64_t rank = coord.back();
  for (int64_t j = coord.size() - 2; j >= 0; --j) {

If the mesh passed in is not in order, does this multiple relationship still hold? For example, [[2,5], [6,1], [4,7], [3,0]].

@LiYuRio LiYuRio Sep 19, 2023

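There is indeed a problem: a step is missing that uses the index to look up the corresponding id in the mesh. In this example, suppose the current global rank is 7. GetCurRankCoordInMesh returns its coordinate in the mesh as (2, 1). To get the sub-mesh along dimension 0, the members of its group are (0, 1), (1, 1), (2, 1), (3, 1); to get the sub-mesh along dimension 1, the members of its group are (2, 0) and (2, 1). Each coordinate is then combined with the mesh shape by multiplication to get the flat index, and the process_id at that index is taken.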

@XieYunshen XieYunshen left a comment

LGTM for set_tests_properties(test_reshard_nd_mesh PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 100)

@LiYuRio LiYuRio merged commit 89013ee into PaddlePaddle:develop Sep 19, 2023
27 checks passed
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023