
Global lock for multiple communicators in one process #1174

Open
akhoroshev opened this issue Feb 7, 2024 · 2 comments

Comments

@akhoroshev

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently

Operations on different communicators should therefore be used at different epochs with a locking mechanism, and applications should ensure operations are submitted in the same order across ranks.

Does this mean that we should wait for any NCCL operation to complete (by calling cudaStreamSynchronize) before starting a new one on a different communicator from a different thread?

Or can we start a new operation on another thread immediately after the previous operation has been enqueued (without a cudaStreamSynchronize call under the mutex)?
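
For concreteness, a minimal sketch of that second interpretation (names such as g_ncclLock and allReduceOnWorker are illustrative, not NCCL API; error checking omitted): the mutex serializes only the enqueue, and each thread waits on its own stream outside the lock.

```cpp
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

std::mutex g_ncclLock;  // hypothetical process-wide lock for NCCL submissions

// One worker thread per communicator. Only the enqueue is serialized;
// the stream synchronization happens after the lock is released.
void allReduceOnWorker(ncclComm_t comm, cudaStream_t stream,
                       const float* sendbuf, float* recvbuf, size_t count) {
    {
        std::lock_guard<std::mutex> lock(g_ncclLock);
        ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    }  // lock released before waiting
    cudaStreamSynchronize(stream);  // wait outside the mutex
}
```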

@akhoroshev (Author)

My case is 4 GPU devices: two TP2 (tensor-parallel size 2) inference engines in one process.
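
For reference, such a topology might be set up as in the sketch below. The device-to-engine mapping ({0,2} and {1,3}) is an assumption about the setup, and error checking is omitted.

```cpp
#include <nccl.h>

// Two disjoint TP2 communicators in one process (assumed mapping:
// engine A on GPUs 0 and 2, engine B on GPUs 1 and 3).
ncclComm_t engineA[2], engineB[2];
const int devsA[2] = {0, 2};
const int devsB[2] = {1, 3};

void initEngines() {
    ncclCommInitAll(engineA, 2, devsA);  // communicator spanning GPUs 0,2
    ncclCommInitAll(engineB, 2, devsB);  // communicator spanning GPUs 1,3
}
```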

@akhoroshev (Author)

I found the same question here: #195 (comment) (my case is launching the allreduce between GPU0/GPU2 and GPU1/GPU3 concurrently):

Is it safe to launch concurrent allreduces on communicators with different GPUs? For example, let's say we launch the allreduce on all 4 GPUs and wait for it to complete. Then we launch the allreduce between GPU0 and GPU2, and between GPU1 and GPU3, concurrently. Would this be safe since the GPUs used are distinct?

And the answer is in #195 (comment):

Yes, that is safe.

So if I don't have dependencies between the groups (GPUs 0,2 and GPUs 1,3), then I don't need any synchronization at all, right?
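
If that reading is correct, the flow would reduce to something like the following sketch, building on the init example above. Engine is an illustrative struct (its buffers and streams are assumed to be allocated elsewhere), and the two threads take no shared lock because their GPU sets are disjoint, per the answer in #195.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative per-engine state; allocation and initialization are
// assumed to happen elsewhere.
struct Engine {
    ncclComm_t   comms[2];    // TP2 communicator, one handle per device
    int          devs[2];     // e.g. {0, 2} for engine A, {1, 3} for engine B
    float*       bufs[2];     // per-device device buffers (in-place allreduce)
    cudaStream_t streams[2];
    size_t       count;
};

// Enqueue and wait for one allreduce on a single 2-GPU communicator.
void allReducePair(Engine& e) {
    ncclGroupStart();  // group the per-device calls of this communicator
    for (int i = 0; i < 2; ++i) {
        cudaSetDevice(e.devs[i]);
        ncclAllReduce(e.bufs[i], e.bufs[i], e.count,
                      ncclFloat, ncclSum, e.comms[i], e.streams[i]);
    }
    ncclGroupEnd();
    for (int i = 0; i < 2; ++i) {
        cudaSetDevice(e.devs[i]);
        cudaStreamSynchronize(e.streams[i]);
    }
}

void runConcurrently(Engine& engineA, Engine& engineB) {
    // Disjoint GPU sets {0,2} and {1,3}: per the answer in #195,
    // the two threads take no shared lock at all.
    std::thread tA(allReducePair, std::ref(engineA));
    std::thread tB(allReducePair, std::ref(engineB));
    tA.join();
    tB.join();
}
```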
