
[Prepare for MultiProcess xpu] unified gen nccl id, refine imperative reducer #30455

Merged
merged 3 commits into PaddlePaddle:develop on Jan 19, 2021

Conversation

@wangxicoding (Contributor) commented Jan 14, 2021

PR types

Others

PR changes

Others

Describe

This is a preparation PR for multi-card imperative (dynamic graph) mode on Kunlun XPU (the next PR will add the actual Kunlun imperative multi-card support). Main changes:
1. Unify gen_nccl_id between imperative and static graph modes, and add broadcasting of the bkcl id. This makes it easier for the next PR to support Kunlun imperative multi-card training as well as the multi-process multi-card static graph mode.
2. Extract the imperative ParallelContext into parallel_context.h so that it can serve as the base class for communication-library contexts such as NCCLParallelContext, BKCLParallelContext, and GLOOParallelContext. Add WaitCompute() and WaitComm() interfaces that encapsulate the "communication waits for computation" and "computation waits for communication" logic, keeping the Reducer code as independent of the device communication library as possible (a sketch of this interface is shown after this list).
3. Adjust part of the imperative reducer code to remove device-specific code wherever possible.
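To make change 2 concrete, here is a minimal sketch of what the extracted base class could look like. The class name, the WaitCompute()/WaitComm() interfaces, and the strategy fields come from the PR description and the header snippet quoted in the review below; Init() and all other details are illustrative assumptions, not the exact contents of parallel_context.h.

```cpp
// Minimal sketch only: names follow the PR description, everything else is
// an assumption rather than the merged header.
#pragma once

#include <string>
#include <vector>

namespace paddle {
namespace imperative {

struct ParallelStrategy {
  int nranks_{1};
  int local_rank_{0};
  std::vector<std::string> trainer_endpoints_{};
  std::string current_endpoint_{""};
};

class ParallelContext {
 public:
  explicit ParallelContext(const ParallelStrategy& strategy)
      : strategy_(strategy) {}
  virtual ~ParallelContext() = default;

  virtual void Init() = 0;

  // Make the communication stream of `ring_id` wait for the compute stream
  // ("communication waits for computation").
  virtual void WaitCompute(int ring_id) = 0;

  // Make the compute stream wait for the communication stream of `ring_id`
  // ("computation waits for communication").
  virtual void WaitComm(int ring_id) = 0;

 protected:
  ParallelStrategy strategy_;
};

// Device-specific contexts such as NCCLParallelContext, BKCLParallelContext
// and GLOOParallelContext derive from this base, so the Reducer only depends
// on the ParallelContext interface.

}  // namespace imperative
}  // namespace paddle
```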

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -12,7 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/operators/collective/gen_nccl_id_op_helper.h"
#ifdef PADDLE_WITH_NCCL

How can multiple communication libraries such as NCCL and BKCL be supported side by side, especially when several of them need to be used at the same time?

@wangxicoding (PR author) replied:
The BKCL and NCCL interfaces are almost identical. This file is only used for NCCL-like libraries, and a template is used further down:

#ifdef PADDLE_WITH_NCCL
INSTANT_TEMPLATE(ncclUniqueId)
#endif
#ifdef PADDLE_WITH_XPU_BKCL
INSTANT_TEMPLATE(bkclUniqueId)
#endif
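For readers unfamiliar with the pattern: a single templated helper implements the id exchange once, and the INSTANT_TEMPLATE macro above explicitly instantiates it per unique-id type. The declaration below is a hypothetical illustration of that shape; the function name and signature are assumptions, not the actual contents of gen_nccl_id_op_helper.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: one templated helper broadcasts a communication id of
// any type whose API mirrors NCCL's; explicit instantiations (as in the
// snippet above) stamp it out for ncclUniqueId, bkclUniqueId, etc.
template <typename CommUniqueId>
void BroadcastCommID(std::vector<CommUniqueId>* ids,
                     const std::vector<std::string>& peer_endpoints);
```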

->stream();
auto comm_stream =
platform::NCCLCommContext::Instance().Get(ring_id, place_)->stream();
auto event = compute_events_[ring_id].get();
A reviewer (Contributor) commented:
Add an assert that ring_id < compute_events_.size()? Same for WaitComm.

@wangxicoding (PR author) replied:
Done. Added asserts that ring_id >= 0 and ring_id < compute_events_.size().
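For context, the added check is along the following lines. This is a sketch based on the quoted diff and the exchange above; the exact error messages and the surrounding event logic are assumptions, not the merged implementation.

```cpp
// Sketch only: validate ring_id before indexing compute_events_, then make
// the communication stream wait on an event recorded on the compute stream.
void NCCLParallelContext::WaitCompute(int ring_id) {
  PADDLE_ENFORCE_GE(ring_id, 0,
                    platform::errors::OutOfRange(
                        "Expected ring_id >= 0, but received %d.", ring_id));
  PADDLE_ENFORCE_LT(static_cast<size_t>(ring_id), compute_events_.size(),
                    platform::errors::OutOfRange(
                        "ring_id %d is out of range of compute_events_.",
                        ring_id));

  auto compute_stream = static_cast<platform::CUDADeviceContext*>(
                            platform::DeviceContextPool::Instance().Get(place_))
                            ->stream();
  auto comm_stream =
      platform::NCCLCommContext::Instance().Get(ring_id, place_)->stream();
  auto event = compute_events_[ring_id].get();

  // Record completion of outstanding compute work, then block the comm
  // stream until that event has fired.
  PADDLE_ENFORCE_CUDA_SUCCESS(cudaEventRecord(event, compute_stream));
  PADDLE_ENFORCE_CUDA_SUCCESS(cudaStreamWaitEvent(comm_stream, event, 0));
}
```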

int local_rank_{0};
std::vector<std::string> trainer_endpoints_{};
std::string current_endpoint_{""};
// TODO(shenliang03): support multi stream communication
A reviewer (Contributor) commented:
Please help shenliang remove this TODO line.

@wangxicoding (PR author) replied:
Done

PADDLE_ENFORCE_CUDA_SUCCESS(
platform::dynload::ncclGetUniqueId(&(*nccl_ids)[i]));
}
}
A reviewer (Member) commented:
Could this also be unified with the function in nccl_context.h that generates the nccl id, so there is only one implementation?

@wangxicoding (PR author) replied:
In theory, yes. I plan to abstract it further in a follow-up and move it into the nccl helper or the nccl comm code.
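For illustration, a shared helper of the kind discussed here could look roughly like this. The name GenNCCLIds and its placement are hypothetical; only the loop body is taken from the quoted code above.

```cpp
// Hypothetical shared helper: both the static-graph gen_nccl_id op and the
// imperative nccl_context could call this instead of duplicating the loop.
inline void GenNCCLIds(std::vector<ncclUniqueId>* nccl_ids) {
  for (auto& nccl_id : *nccl_ids) {
    PADDLE_ENFORCE_CUDA_SUCCESS(
        platform::dynload::ncclGetUniqueId(&nccl_id));
  }
}
```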

@ForFishes (Member) left a comment:
LGTM

@wangxicoding changed the title from "[Prepare for xpu] unified gen nccl id, refine imperative reducer" to "[Prepare for MultiProcess xpu] unified gen nccl id, refine imperative reducer" on Jan 19, 2021.
@wangxicoding merged commit 572c466 into PaddlePaddle:develop on Jan 19, 2021.
@wangxicoding deleted the unified_gen_nccl_id branch on January 21, 2021.