
custom allreduce cuda kernel #20703

Open · wants to merge 23 commits into main
Conversation

@wangyems (Member) commented May 16, 2024

Description

Conditionally route to the custom AllReduce CUDA kernel when the buffer size and the number of GPUs meet certain requirements; otherwise, keep using NCCL's AllReduce.
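
For illustration, a rough sketch of the kind of gate this describes. The function name, threshold, and supported GPU counts below are placeholders, not the values implemented by this PR:

#include <cstddef>

// Hypothetical dispatch check: the real limits live in the kernel implementation.
bool UseCustomAllReduce(std::size_t buffer_size_bytes, int world_size, bool has_peer_access) {
  constexpr std::size_t kMaxCustomAllReduceBytes = 8 * 1024 * 1024;  // placeholder threshold
  const bool supported_world_size = (world_size == 2 || world_size == 4 || world_size == 8);
  return has_peer_access && supported_world_size && buffer_size_bytes <= kMaxCustomAllReduceBytes;
}

// Call site (pseudocode): fall back to NCCL whenever the custom kernel does not apply.
//   if (UseCustomAllReduce(bytes, world_size, peer_access_enabled)) {
//     LaunchCustomAllReduce(...);   // hypothetical name for the new kernel entry point
//   } else {
//     ncclAllReduce(...);           // existing NCCL path
//   }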

Motivation and Context

@wangyems marked this pull request as ready for review May 17, 2024 04:22
@wangyems requested a review from a team as a code owner May 17, 2024 21:32
@yuslepukhin (Member) left a comment:

🕐

tianleiwu previously approved these changes May 28, 2024
@wangyems requested a review from tianleiwu June 3, 2024 21:38
tianleiwu previously approved these changes Jun 4, 2024
@wangyems requested a review from yuslepukhin June 6, 2024 02:36
#if defined(USE_MPI) || defined(USE_NCCL)

struct CudaDeleter {
void operator()(void* ptr) {

Review comment (Member) on void operator()(void* ptr):

Add the const noexcept qualifier.

};

struct IpcDeleter {
void operator()(void* ptr) {

Review comment (Member) on void operator()(void* ptr) in IpcDeleter:

Same here: add the const noexcept qualifier.
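
A minimal sketch of what the qualified deleters could look like, assuming (as the surrounding diff suggests) that they wrap cudaFree and cudaIpcCloseMemHandle; error handling is elided:

#include <cuda_runtime.h>

struct CudaDeleter {
  // const: the call does not modify the deleter; noexcept: it runs from a
  // unique_ptr destructor and must not throw.
  void operator()(void* ptr) const noexcept {
    if (ptr != nullptr) {
      (void)cudaFree(ptr);
    }
  }
};

struct IpcDeleter {
  void operator()(void* ptr) const noexcept {
    if (ptr != nullptr) {
      (void)cudaIpcCloseMemHandle(ptr);
    }
  }
};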

// A global resource pack for IPC memory used in custom reduce kernel.
// Resource retrieval and deserialization are made atomic to thread safety of accessing it.
struct IPCMemoryResourcePack {
InlinedVector<std::shared_ptr<IpcMemory>> m_ipc_momery_handles;

Review comment (Member) on std::shared_ptr<IpcMemory>:

Which entities share IpcMemory?


Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));
m_buffer_uptr_ = std::move(CudaMemPtrT{m_buffer_ptr_});

Review comment (Member) on std::move:

std::move is redundant here: CudaMemPtrT{m_buffer_ptr_} is already a temporary (an rvalue), so it is moved from without the cast.
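
The point in isolation, shown with a plain unique_ptr (the same applies to CudaMemPtrT):

#include <memory>

void Assign(std::unique_ptr<int>& owner, int* raw) {
  // The right-hand side is already a temporary (an rvalue), so this selects
  // move assignment; wrapping it in std::move changes nothing.
  owner = std::unique_ptr<int>{raw};
}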

}

Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));

Review comment (Member) on &m_buffer_ptr_:

What is the purpose of storing it temporarily in the member variable?

int world_size_;
InlinedVector<void*> m_comm_ptrs_;
std::size_t mbuffer_size_;
void* m_buffer_ptr_{nullptr};

Review comment (Member) on void* m_buffer_ptr_{nullptr};:

This defeats the purpose of unique_ptr.
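
A minimal sketch of the ownership model these comments point toward: a single unique_ptr member owns the allocation, raw pointers are read through it on demand, and no raw-pointer member is kept alongside it. Class and member names here are illustrative, not the PR's:

#include <cuda_runtime.h>
#include <memory>

struct CudaDeleter {
  void operator()(void* ptr) const noexcept { (void)cudaFree(ptr); }
};
using CudaMemPtrT = std::unique_ptr<void, CudaDeleter>;

class IpcBuffer {
 public:
  // Allocate into a local, then hand ownership straight to the member.
  cudaError_t Allocate(std::size_t bytes) {
    void* raw = nullptr;
    const cudaError_t err = cudaMalloc(&raw, bytes);
    if (err != cudaSuccess) {
      return err;
    }
    buffer_ = CudaMemPtrT{raw};
    return cudaSuccess;
  }

  // Callers that need the raw device pointer obtain it from the owner.
  void* data() const noexcept { return buffer_.get(); }

 private:
  CudaMemPtrT buffer_;  // sole owner; no separate void* member
};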


for (size_t node_id = 0; node_id < handles.size(); node_id++) {
if ((int)node_id == rank_) {
m_comm_ptrs_[node_id] = m_buffer_ptr_;

Review comment (Member) on m_comm_ptrs_[node_id] = m_buffer_ptr_;:

I am still not clear about the purpose of storing different types of ptrs in the same array.

uint8_t* foreign_buffer;
CUDA_RETURN_IF_ERROR(cudaIpcOpenMemHandle(
reinterpret_cast<void**>(&foreign_buffer), handles[node_id], cudaIpcMemLazyEnablePeerAccess));
m_ipc_uptrs_.emplace_back(IpcMemPtrT{foreign_buffer});

Review comment (Member) on m_ipc_uptrs_.emplace_back(...):

The purpose of emplace() is to construct in place, not to construct a temporary and then move it, which is the same as push_back. Pass only the constructor arguments, so that construction takes place where the element is supposed to reside.
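
A small illustration of the difference, assuming m_ipc_uptrs_ is a vector of unique_ptr with the IPC deleter (this is a sketch, not the PR's exact types):

#include <cstdint>
#include <memory>
#include <vector>

struct IpcDeleter {
  void operator()(void* ptr) const noexcept { /* cudaIpcCloseMemHandle(ptr) in the real code */ }
};
using IpcMemPtrT = std::unique_ptr<std::uint8_t, IpcDeleter>;

void Append(std::vector<IpcMemPtrT>& ipc_uptrs, std::uint8_t* foreign_buffer) {
  // Builds a temporary IpcMemPtrT and then moves it into the vector -- no
  // better than push_back:
  //   ipc_uptrs.emplace_back(IpcMemPtrT{foreign_buffer});
  // Passing the raw pointer lets the unique_ptr be constructed in place,
  // directly inside the vector's storage:
  ipc_uptrs.emplace_back(foreign_buffer);
}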


Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));
m_buffer_uptr_ = std::move(CudaMemPtrT{m_buffer_ptr_});

Review comment (Member) on CudaMemPtrT{m_buffer_ptr_}:

Here and below:

  1. use std::make_unique
  2. for a unique_ptr with a custom deleter, you need to pass an instance of the deleter as the second argument to the constructor.
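
A sketch of the second point: when the deleter type is stateless and default-constructible (a small struct), CudaMemPtrT{raw} already default-constructs it, but once the deleter carries state or is a function pointer, an instance has to be handed to the constructor together with the raw pointer. The alias below is an assumption for illustration only:

#include <cstddef>
#include <cuda_runtime.h>
#include <memory>

// Deleter expressed as a function pointer: the type alone no longer says how
// to free the memory, so a deleter instance must accompany the pointer.
using CudaMemPtrT = std::unique_ptr<void, cudaError_t (*)(void*)>;

CudaMemPtrT AllocateDevice(std::size_t bytes) {
  void* raw = nullptr;
  if (cudaMalloc(&raw, bytes) != cudaSuccess) {
    return CudaMemPtrT{nullptr, &cudaFree};
  }
  // Raw pointer and deleter instance are passed together as the two constructor arguments.
  return CudaMemPtrT{raw, &cudaFree};
}

(std::make_unique covers the default-deleter case; for a custom deleter, the two-argument constructor shown here is the available form.)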
