[Prepare for MultiProcess xpu] unified gen nccl id, refine imperative reducer #30455
Conversation
Thanks for your contribution!
Force-pushed from 2c2dcd6 to 82dc202
@@ -12,7 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/operators/collective/gen_nccl_id_op_helper.h"
#ifdef PADDLE_WITH_NCCL
How do we stay compatible with multiple communication libraries, such as NCCL and BKCL, especially when several of them need to be used at the same time?
The BKCL and NCCL interfaces are largely identical. This file is only used for NCCL-like libraries, and a template is used below:
#ifdef PADDLE_WITH_NCCL
INSTANT_TEMPLATE(ncclUniqueId)
#endif
#ifdef PADDLE_WITH_XPU_BKCL
INSTANT_TEMPLATE(bkclUniqueId)
#endif
Force-pushed from 82dc202 to e5f4e61
->stream();
auto comm_stream =
    platform::NCCLCommContext::Instance().Get(ring_id, place_)->stream();
auto event = compute_events_[ring_id].get();
Add an assert for ring_id < compute_events_.size()? Same for WaitComm.
Done. Added asserts for ring_id >= 0 and ring_id < compute_events_.size().
int local_rank_{0};
std::vector<std::string> trainer_endpoints_{};
std::string current_endpoint_{""};
// TODO(shenliang03): support multi stream communication
Please help shenliang remove this TODO line.
Done
PADDLE_ENFORCE_CUDA_SUCCESS(
    platform::dynload::ncclGetUniqueId(&(*nccl_ids)[i]));
}
}
Could this also be unified with the nccl-id-generating function in nccl_context.h, so there is a single implementation?
In theory, yes. I plan to abstract it further later and move it into nccl helper or nccl comm.
LGTM
PR types
Others
PR changes
Others
Describe
Preparation PR for Kunlun (XPU) dygraph multi-card training (the next PR adds the actual support). Main changes:
1. Unify gen_nccl_id between the dygraph and static graph, along with the broadcast of bkcl ids, to prepare the next PR for Kunlun dygraph multi-card training and the static-graph multi-process multi-card mode.
2. Extract the dygraph ParallelContext into parallel_context.h so it can serve as the base class for communication-library contexts such as NCCLParallelContext, BKCLParallelContext, and GLOOParallelContext. Add WaitCompute() and WaitComm() interfaces to encapsulate the communication-waits-for-compute and compute-waits-for-communication logic, keeping the Reducer code as independent of the device communication library as possible.
3. Adjust the dygraph reducer code to remove device-specific code where possible.