
custom allreduce cuda kernel #20703

Open · wants to merge 23 commits into main
Conversation

@wangyems (Member) commented May 16, 2024

Description

Conditionally route to the custom AllReduce CUDA kernel when the buffer size and the number of GPUs meet certain requirements; otherwise, keep using NCCL's AllReduce.
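
For illustration, a rough sketch of the kind of gate this describes. The function name, threshold, and supported GPU counts below are placeholders, not the values implemented by this PR:

#include <cstddef>

// Hypothetical dispatch check: the real limits live in the kernel implementation.
bool UseCustomAllReduce(std::size_t buffer_size_bytes, int world_size, bool has_peer_access) {
  constexpr std::size_t kMaxCustomAllReduceBytes = 8 * 1024 * 1024;  // placeholder threshold
  const bool supported_world_size = (world_size == 2 || world_size == 4 || world_size == 8);
  return has_peer_access && supported_world_size && buffer_size_bytes <= kMaxCustomAllReduceBytes;
}

// Call site (pseudocode): fall back to NCCL whenever the custom kernel does not apply.
//   if (UseCustomAllReduce(bytes, world_size, peer_access_enabled)) {
//     LaunchCustomAllReduce(...);   // hypothetical name for the new kernel entry point
//   } else {
//     ncclAllReduce(...);           // existing NCCL path
//   }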

Motivation and Context

@wangyems marked this pull request as ready for review May 17, 2024 04:22
@wangyems requested a review from a team as a code owner May 17, 2024 21:32
@yuslepukhin (Member) left a comment:

🕐

tianleiwu previously approved these changes May 28, 2024
@wangyems requested a review from tianleiwu June 3, 2024 21:38
tianleiwu previously approved these changes Jun 4, 2024
@wangyems requested a review from yuslepukhin June 6, 2024 02:36
#if defined(USE_MPI) || defined(USE_NCCL)

struct CudaDeleter {
void operator()(void* ptr) {

Review comment (Member) on void operator()(void* ptr):

Add the const noexcept qualifier.

};

struct IpcDeleter {
void operator()(void* ptr) {

Review comment (Member) on void operator()(void* ptr) in IpcDeleter:

Same here: add the const noexcept qualifier.
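
A minimal sketch of what the qualified deleters could look like, assuming (as the surrounding diff suggests) that they wrap cudaFree and cudaIpcCloseMemHandle; error handling is elided:

#include <cuda_runtime.h>

struct CudaDeleter {
  // const: the call does not modify the deleter; noexcept: it runs from a
  // unique_ptr destructor and must not throw.
  void operator()(void* ptr) const noexcept {
    if (ptr != nullptr) {
      (void)cudaFree(ptr);
    }
  }
};

struct IpcDeleter {
  void operator()(void* ptr) const noexcept {
    if (ptr != nullptr) {
      (void)cudaIpcCloseMemHandle(ptr);
    }
  }
};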

// A global resource pack for IPC memory used in custom reduce kernel.
// Resource retrieval and deserialization are made atomic to thread safety of accessing it.
struct IPCMemoryResourcePack {
InlinedVector<std::shared_ptr<IpcMemory>> m_ipc_momery_handles;

Review comment (Member) on std::shared_ptr<IpcMemory>:

Which entities share IpcMemory?


Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));
m_buffer_uptr_ = std::move(CudaMemPtrT{m_buffer_ptr_});

Review comment (Member) on std::move:

std::move is redundant here: CudaMemPtrT{m_buffer_ptr_} is already a temporary (an rvalue), so it is moved from without the cast.
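
The point in isolation, shown with a plain unique_ptr (the same applies to CudaMemPtrT):

#include <memory>

void Assign(std::unique_ptr<int>& owner, int* raw) {
  // The right-hand side is already a temporary (an rvalue), so this selects
  // move assignment; wrapping it in std::move changes nothing.
  owner = std::unique_ptr<int>{raw};
}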

}

Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));

Review comment (Member) on &m_buffer_ptr_:

What is the purpose of storing it temporarily in the member variable?

int world_size_;
InlinedVector<void*> m_comm_ptrs_;
std::size_t mbuffer_size_;
void* m_buffer_ptr_{nullptr};

Review comment (Member) on void* m_buffer_ptr_{nullptr};:

This defeats the purpose of unique_ptr.
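
A minimal sketch of the ownership model these comments point toward: a single unique_ptr member owns the allocation, raw pointers are read through it on demand, and no raw-pointer member is kept alongside it. Class and member names here are illustrative, not the PR's:

#include <cuda_runtime.h>
#include <memory>

struct CudaDeleter {
  void operator()(void* ptr) const noexcept { (void)cudaFree(ptr); }
};
using CudaMemPtrT = std::unique_ptr<void, CudaDeleter>;

class IpcBuffer {
 public:
  // Allocate into a local, then hand ownership straight to the member.
  cudaError_t Allocate(std::size_t bytes) {
    void* raw = nullptr;
    const cudaError_t err = cudaMalloc(&raw, bytes);
    if (err != cudaSuccess) {
      return err;
    }
    buffer_ = CudaMemPtrT{raw};
    return cudaSuccess;
  }

  // Callers that need the raw device pointer obtain it from the owner.
  void* data() const noexcept { return buffer_.get(); }

 private:
  CudaMemPtrT buffer_;  // sole owner; no separate void* member
};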


for (size_t node_id = 0; node_id < handles.size(); node_id++) {
if ((int)node_id == rank_) {
m_comm_ptrs_[node_id] = m_buffer_ptr_;

Review comment (Member) on m_comm_ptrs_[node_id] = m_buffer_ptr_;:

I am still not clear about the purpose of storing different types of ptrs in the same array.

uint8_t* foreign_buffer;
CUDA_RETURN_IF_ERROR(cudaIpcOpenMemHandle(
reinterpret_cast<void**>(&foreign_buffer), handles[node_id], cudaIpcMemLazyEnablePeerAccess));
m_ipc_uptrs_.emplace_back(IpcMemPtrT{foreign_buffer});

Review comment (Member) on m_ipc_uptrs_.emplace_back(...):

The purpose of emplace() is to construct in place, not to construct a temporary and then move it, which is the same as push_back. Pass only the constructor arguments, so that construction takes place where the element is supposed to reside.
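
A small illustration of the difference, assuming m_ipc_uptrs_ is a vector of unique_ptr with the IPC deleter (this is a sketch, not the PR's exact types):

#include <cstdint>
#include <memory>
#include <vector>

struct IpcDeleter {
  void operator()(void* ptr) const noexcept { /* cudaIpcCloseMemHandle(ptr) in the real code */ }
};
using IpcMemPtrT = std::unique_ptr<std::uint8_t, IpcDeleter>;

void Append(std::vector<IpcMemPtrT>& ipc_uptrs, std::uint8_t* foreign_buffer) {
  // Builds a temporary IpcMemPtrT and then moves it into the vector -- no
  // better than push_back:
  //   ipc_uptrs.emplace_back(IpcMemPtrT{foreign_buffer});
  // Passing the raw pointer lets the unique_ptr be constructed in place,
  // directly inside the vector's storage:
  ipc_uptrs.emplace_back(foreign_buffer);
}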


Status IpcMemory::AllocateIpcMemory() {
CUDA_RETURN_IF_ERROR(cudaMalloc(&m_buffer_ptr_, mbuffer_size_));
m_buffer_uptr_ = std::move(CudaMemPtrT{m_buffer_ptr_});

Review comment (Member) on CudaMemPtrT{m_buffer_ptr_}:

Here and below:

  1. use std::make_unique
  2. for a unique_ptr with a custom deleter, you need to pass an instance of the deleter as the second argument to the constructor.
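
A sketch of the second point: when the deleter type is stateless and default-constructible (a small struct), CudaMemPtrT{raw} already default-constructs it, but once the deleter carries state or is a function pointer, an instance has to be handed to the constructor together with the raw pointer. The alias below is an assumption for illustration only:

#include <cstddef>
#include <cuda_runtime.h>
#include <memory>

// Deleter expressed as a function pointer: the type alone no longer says how
// to free the memory, so a deleter instance must accompany the pointer.
using CudaMemPtrT = std::unique_ptr<void, cudaError_t (*)(void*)>;

CudaMemPtrT AllocateDevice(std::size_t bytes) {
  void* raw = nullptr;
  if (cudaMalloc(&raw, bytes) != cudaSuccess) {
    return CudaMemPtrT{nullptr, &cudaFree};
  }
  // Raw pointer and deleter instance are passed together as the two constructor arguments.
  return CudaMemPtrT{raw, &cudaFree};
}

(std::make_unique covers the default-deleter case; for a custom deleter, the two-argument constructor shown here is the available form.)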
