[Kunlun]add collective ops for multi XPU cards training and add Kunlun multi XPU cards CI #32302

vslyu · 2021-04-15T08:12:12Z

PR types

New features

PR changes

OPs

Describe

Add collective ops for training on multiple XPU cards.
Adds the allreduce_sum/prod/max/min and reduce_sum/prod/max/min ops for XPU.

Example, launched on two XPU cards with:
fleetrun --xpus 0,1 example.py

import numpy as np
import paddle
from paddle.distributed import init_parallel_env

paddle.set_device('xpu:%d'%paddle.distributed.ParallelEnv().dev_id)
init_parallel_env()
if paddle.distributed.ParallelEnv().local_rank == 0:
    np_data = np.array([[4, 5, 6], [4, 5, 6]]).astype(np.float32)
else:
    np_data = np.array([[1, 2, 3], [1, 2, 3]]).astype(np.float32)
data = paddle.to_tensor(np_data)
paddle.distributed.reduce(data, 0)
out = data.numpy()
# on rank 0 (the reduce destination): [[5, 7, 9], [5, 7, 9]]
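
To make the difference between the reduce_* and allreduce_* families concrete, here is a minimal pure-Python sketch that simulates both on the two-rank data from the example above. It does not call Paddle or BKCL; the helper names `combine`, `simulate_reduce` and `simulate_all_reduce` are hypothetical, and leaving non-destination ranks unchanged after a reduce is an assumption of this sketch, not something the PR states.

```python
from functools import reduce as fold
import operator

# Element-wise operators for the four op families named in the PR.
REDUCE_OPS = {
    "sum": operator.add,
    "prod": operator.mul,
    "max": max,
    "min": min,
}


def combine(tensors, op):
    """Element-wise reduction over same-shaped 2-D lists, one per rank."""
    fn = REDUCE_OPS[op]
    return [
        [fold(fn, vals) for vals in zip(*rows)]
        for rows in zip(*tensors)
    ]


def simulate_reduce(per_rank, op="sum", dst=0):
    """Only rank `dst` receives the result.

    Assumption of this sketch: other ranks keep their input unchanged.
    """
    result = combine(per_rank, op)
    return [
        result if rank == dst else [row[:] for row in tensor]
        for rank, tensor in enumerate(per_rank)
    ]


def simulate_all_reduce(per_rank, op="sum"):
    """Every rank receives its own copy of the reduced result."""
    result = combine(per_rank, op)
    return [[row[:] for row in result] for _ in per_rank]


rank0 = [[4.0, 5.0, 6.0], [4.0, 5.0, 6.0]]
rank1 = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

after_reduce = simulate_reduce([rank0, rank1], op="sum", dst=0)
after_all_reduce = simulate_all_reduce([rank0, rank1], op="max")
```

In this simulation `after_reduce[0]` holds the element-wise sum, matching the comment in the example above, while `after_all_reduce` holds the same reduced tensor for every rank.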

Add a CI job for Kunlun multi-XPU-card tests.
Add distributed XPU tests in python/paddle/fluid/tests/unittests/CMakeLists.txt.
A unit test's name must contain "_xpu" for it to be executed, so existing tests are re-registered under such a name, as in the following example:

# dist xpu tests: 
if (WITH_XPU_BKCL)
    py_test(test_collective_reduce_api_xpu SRCS "test_collective_reduce_api.py")
    py_test(test_collective_allreduce_api_xpu SRCS "test_collective_allreduce_api.py")
endif()
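
The naming rule above can be illustrated with a short sketch. How the Kunlun CI actually selects tests is not shown in this PR; the sketch assumes a simple match on the registered test name, and `select_xpu_tests` is a hypothetical helper, not part of Paddle.

```python
import re


def select_xpu_tests(test_names, pattern=r"_xpu"):
    """Return the registered test names that contain the given pattern."""
    rx = re.compile(pattern)
    return [name for name in test_names if rx.search(name)]


# The first name is the original registration; the other two are the
# re-registered names from the CMake snippet above.
registered = [
    "test_collective_reduce_api",
    "test_collective_reduce_api_xpu",
    "test_collective_allreduce_api_xpu",
]

selected = select_xpu_tests(registered)
```

Under that assumption only the two names registered with the "_xpu" suffix are selected, which is why the same source file is registered a second time under a new name.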

paddle-bot-old · 2021-04-15T08:12:15Z

Thanks for your contribution!
Please wait for the CI result first. See Paddle CI Manual for details.

wangxicoding · 2021-04-20T05:17:47Z

paddle/fluid/operators/collective/c_allreduce_op.h

 #include "paddle/fluid/platform/collective_helper.h"
+#endif
+
+#if (defined PADDLE_WITH_NCCL)


#if defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_RCCL)

wangxicoding · 2021-04-20T05:22:24Z

paddle/fluid/operators/collective/c_reduce_op.h

 #include "paddle/fluid/platform/collective_helper.h"
 #include "paddle/fluid/platform/nccl_helper.h"
 #endif
+
+#if defined(PADDLE_WITH_XPU_BKCL)


#if defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_RCCL)

QingshuChen

LGTM

vslyu added 9 commits February 9, 2021 08:33

add print log in broadcast_op_xpu.cc

ad254e6

add allreduce_sum api for xpu

4a9bd3a

merge origin dev/add_xpu_multimachine, add xpu_colletive_ops init

63ead73

add allreduce_sum api for xpu

5bace53

Merge remote-tracking branch 'upstream' into dev/add_xpu_colletive_ops

bc2bc9b

fix

2c6d536

add c_allreduce_op for xpu

648f490

add c_reduce_op for xpu

b975c06

Merge remote-tracking branch 'origin/dev/add_xpu_colletive_ops' into …

9473419

…dev/add_xpu_colletive_ops1

vslyu added 4 commits April 19, 2021 06:30

add ut fof xpu collective ops

a2c8443

fix ut fof xpu collective ops

e51f000

add CI for xpu dist test case

ca9914b

fix ut rules, test=kunlun

26bfd67

vslyu force-pushed the dev/add_xpu_colletive_ops1 branch from 09406ed to 26bfd67 Compare April 19, 2021 15:17

wangxicoding reviewed Apr 20, 2021

vslyu changed the title ~~[Kunlun]add collective ops for multi XPU cards training~~ [Kunlun]add collective ops for multi XPU cards training and add kunlun multicards CI Apr 20, 2021

vslyu changed the title ~~[Kunlun]add collective ops for multi XPU cards training and add kunlun multicards CI~~ [Kunlun]add collective ops for multi XPU cards training and add Kunlun multi XPU cards CI Apr 20, 2021

vslyu added 3 commits April 20, 2021 09:57

notest, test=rocm

2ab374e

notest, test=rocm

e1b90cb

fix,test=develop

fcbec44

QingshuChen approved these changes Apr 21, 2021

wangxicoding approved these changes Apr 21, 2021

wangxicoding merged commit 2194ad1 into PaddlePaddle:develop Apr 21, 2021
