NCCL error when running backward #363
Comments
Thanks for filing the issue. Could you provide the output of …?
You need to add it in. BTW, we are also working on a DDP-compatible API. After #312 gets merged, it should be a matter of …
Got it, but I still don't know why setting the CUDA device is necessary to init the NCCL communicator. At least in Horovod and torch DDP there is no such constraint; is there some consideration behind this? Also, I noticed Bagua invokes torch's `init_process_group` inside its own `init_process_group`. What is this for?

```python
# TODO remove the dependency on torch process group
if not dist.is_initialized():
    torch.distributed.init_process_group(
        backend="nccl",
        store=_default_store,
        rank=get_rank(),
        world_size=get_world_size(),
    )  # fmt: off
_default_pg = new_group(stream=torch.cuda.Stream(priority=-1))
```
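For context, an NCCL communicator is bound to one GPU per rank, so each process has to select its device before the communicator is created. A minimal sketch of picking the device index from the launcher's environment (this assumes a torchrun-style `LOCAL_RANK` variable; the `torch.cuda.set_device` and Bagua init calls in the comments are illustrative usage, not verified against a specific Bagua version):

```python
import os

def local_device_index(default=0):
    """Return the GPU index this process should use.

    NCCL binds a communicator to a single device per rank, so the device
    must be selected before the communicator is initialized. Launchers
    such as torchrun export LOCAL_RANK for each worker process.
    """
    return int(os.environ.get("LOCAL_RANK", default))

# Hypothetical usage in a training script (assumes torch and bagua are
# installed; the exact Bagua init entry point may differ by version):
#   torch.cuda.set_device(local_device_index())
#   bagua.torch_api.init_process_group()
```

This is why frameworks that don't require an explicit `set_device` call still end up doing an equivalent device selection internally before NCCL init.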
That's a requirement for it. We will eventually remove this dependency in a future release.
I ran a very simple example and got an error:

I used nccl-2.10.3 and cuda-10.2 with a local NCCL install, but the same error occurs when I install NCCL using `bagua_core.install_deps`, and everything works fine if I use DDP.

Here's my code: