NCCL error when running backward #363

Closed
ProHuper opened this issue Nov 4, 2021 · 5 comments · Fixed by #365

@ProHuper

ProHuper commented Nov 4, 2021

I ran a very simple example and got this error:

WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Using network IB
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Setting affinity for GPU 1 to 3f,07ff0000,003e07ff
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Setting affinity for GPU 0 to 3f,07ff0000,003e07ff
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 00 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 01 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 02 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 03 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all rings
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all trees
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all rings
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all trees
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO comm 0x55bc8aee70c0 rank 1 nranks 2 cudaDev 1 busId 3d000 - Init COMPLETE
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO comm 0x555f0e926110 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
2021-11-04T14:16:06.243214Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
2021-11-04T14:16:06.243246Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Launch mode Parallel

ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] enqueue.cc:329 NCCL WARN Cuda failure 'invalid resource handle'
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] NCCL INFO enqueue.cc:1047 -> 1
fatal runtime error: Rust cannot catch foreign exceptions
Killing subprocess 93207
Killing subprocess 93208
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 342, in <module>
    main()
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 327, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 290, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', 'train.py']' died with <Signals.SIGABRT: 6>.

I'm using nccl-2.10.3 and cuda-10.2 with the local (system) NCCL, but the same error occurs when I install NCCL using bagua_core.install_deps, and everything works fine if I use DDP.

Here's my code:

import torch
from torch.nn.modules.loss import CrossEntropyLoss
from torch.utils.data.dataloader import DataLoader
from LAMB import LAMB
from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer
import torch.nn as nn
import torch.optim
from torch.utils.data import Dataset, DataLoader
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import argparse

class MyDataset(Dataset):
    def __init__(self) -> None:
        self.input = torch.randn(10000, 10)
        self.label = torch.randn(10000, 1)

    def __getitem__(self, index):
        return self.input[index], self.label[index]

    def __len__(self):
        return 10000


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()
    # dist.init_process_group(backend='nccl')
    bagua.init_process_group()

    model = nn.Sequential(
        nn.Linear(10, 5),
        nn.Linear(5, 2),
        nn.Linear(2, 1),
    )   

    optimizer = torch.optim.Adam(
        params=model.parameters(),
        lr=0.1,
        betas=(0.9, 0.999),
        eps=1e-06,
        weight_decay=0
    )

    algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
    model.to(bagua.get_local_rank())
    # model.to(args.local_rank)
    # model = DDP(model, device_ids=[args.local_rank])
    model = model.with_bagua(
        [optimizer],
        algorithm
    )
    dataset = MyDataset()
    dataloader = DataLoader(dataset, batch_size=5)

    for i in range(10):
        for x, y in dataloader:
            # x = x.to(args.local_rank)
            # y = y.to(args.local_rank)
            x = x.to(bagua.get_local_rank())
            y = y.to(bagua.get_local_rank())
            optimizer.zero_grad()
            output = model(x)
            loss = (output - y).pow(2).sum()
            loss.backward()
            optimizer.step()
@NOBLES5E
Member

NOBLES5E commented Nov 4, 2021

Thanks for filing the issue. Could you provide the output of python -c "import bagua_core; bagua_core.show_version()" to check the actual NCCL version used?

@ProHuper
Author

ProHuper commented Nov 5, 2021

(base) [root@ts-fadc083f9f7d443e933cc3b7e98478a7-launcher ~]# python -c "import bagua_core; bagua_core.show_version()"
WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
project_name: bagua-core-internal
is_debug: false
version:
pkg_version:0.1.2
branch:master
commit_hash:5228e756
build_time:2021-11-02 16:44:13 +00:00
build_env:rustc 1.56.1 (59eed8a2a 2021-11-01),stable-x86_64-unknown-linux-gnu (default)
tag:
commit_hash: 5228e756b5fac9ed242f05bb1c6ce3edfa201a2f
commit_date: 2021-11-02 16:27:10 +00:00
build_os: linux-x86_64
rust_version: rustc 1.56.1 (59eed8a2a 2021-11-01)
build_time: 2021-11-02 16:44:13 +00:00
NCCL version: 21003

@wangraying
Member

wangraying commented Nov 5, 2021

You need to add torch.cuda.set_device(bagua.get_local_rank()) before bagua.init_process_group().

In bagua.init_process_group() we initialize the NCCL communicator, so the CUDA device needs to be set before we call it.
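
For reference, here is a minimal sketch of the corrected initialization order applied to your script (only the lines before model construction change):

import torch
import bagua.torch_api as bagua

# Select this process's CUDA device *before* initializing Bagua, so the
# NCCL communicator created in init_process_group() is bound to it.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

The rest of the script (model.to(bagua.get_local_rank()), with_bagua, and the training loop) can stay as-is.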


BTW, we are also working on a DDP-compatible API. After #312 gets merged, migrating from DDP to Bagua should be just a matter of from bagua.torch_api.data_parallel import DistributedDataParallel as DDP.
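
For illustration, migration might then look roughly like this (a sketch based on the import path above; the exact wrapper API may still change until #312 is merged, and device_ids mirroring torch's DDP constructor is an assumption here):

import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.data_parallel import DistributedDataParallel as DDP  # import path from the comment above

torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = nn.Sequential(nn.Linear(10, 5), nn.Linear(5, 2), nn.Linear(2, 1)).to(bagua.get_local_rank())
# Assumed to be a drop-in replacement for torch.nn.parallel.DistributedDataParallel:
model = DDP(model, device_ids=[bagua.get_local_rank()])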

NOBLES5E linked a pull request Nov 5, 2021 that will close this issue
@ProHuper
Author

ProHuper commented Nov 5, 2021

Got it, but I still don't understand why setting the CUDA device is necessary to initialize the NCCL communicator. At least in Horovod and torch DDP there is no such constraint; is there some consideration behind this?

Also, I noticed Bagua invokes torch's init_process_group inside its own init_process_group. What is this for?

# TODO remove the dependency on torch process group
if not dist.is_initialized():
    torch.distributed.init_process_group(
        backend="nccl",
        store=_default_store,
        rank=get_rank(),
        world_size=get_world_size(),
    )  # fmt: off

_default_pg = new_group(stream=torch.cuda.Stream(priority=-1))

@wangraying
Member

That's the requirement for ncclCommInitRank.

We will eventually remove this dependency in a future release.
