NCCL error when running backward #363

Closed
ProHuper opened this issue Nov 4, 2021 · 5 comments · Fixed by #365

@ProHuper

ProHuper commented Nov 4, 2021

I ran a very simple example and got this error:

WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Bootstrap : Using eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth1:11.214.158.37<0>
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Using network IB
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Setting affinity for GPU 1 to 3f,07ff0000,003e07ff
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03/04 :    0   1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Setting affinity for GPU 0 to 3f,07ff0000,003e07ff
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 00 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 01 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[3d000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 02 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Channel 03 : 1[3d000] -> 0[1a000] via P2P/IPC
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all rings
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO Connected all trees
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all rings
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Connected all trees
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [1] NCCL INFO comm 0x55bc8aee70c0 rank 1 nranks 2 cudaDev 1 busId 3d000 - Init COMPLETE
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO comm 0x555f0e926110 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
2021-11-04T14:16:06.243214Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
2021-11-04T14:16:06.243246Z  WARN bagua_core_internal: Parameter autotuning service not detected. Enabling it may further improve the performance. See https://tutorials.baguasys.com/performance-autotuning/ for more details.
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93207:93207 [0] NCCL INFO Launch mode Parallel

ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] enqueue.cc:329 NCCL WARN Cuda failure 'invalid resource handle'
ts-fadc083f9f7d443e933cc3b7e98478a7-launcher:93208:93208 [0] NCCL INFO enqueue.cc:1047 -> 1
fatal runtime error: Rust cannot catch foreign exceptions
Killing subprocess 93207
Killing subprocess 93208
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 342, in <module>
    main()
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 327, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/anaconda3/lib/python3.8/site-packages/bagua/distributed/launch.py", line 290, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', 'train.py']' died with <Signals.SIGABRT: 6>.

I'm using nccl-2.10.3 and cuda-10.2 with the local (system) NCCL, but the same error occurs when I install NCCL using bagua_core.install_deps, and everything works fine if I use DDP.

Here's my code:

import torch
from torch.nn.modules.loss import CrossEntropyLoss
from torch.utils.data.dataloader import DataLoader
from LAMB import LAMB
from bagua.torch_api.contrib.fuse.optimizer import fuse_optimizer
import torch.nn as nn
import torch.optim
from torch.utils.data import Dataset, DataLoader
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import argparse

class MyDataset(Dataset):
    def __init__(self) -> None:
        self.input = torch.randn(10000, 10)
        self.label = torch.randn(10000, 1)

    def __getitem__(self, index):
        return self.input[index], self.label[index]

    def __len__(self):
        return 10000


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()
    # dist.init_process_group(backend='nccl')
    bagua.init_process_group()

    model = nn.Sequential(
        nn.Linear(10, 5),
        nn.Linear(5, 2),
        nn.Linear(2, 1),
    )   

    optimizer = torch.optim.Adam(
        params=model.parameters(),
        lr=0.1,
        betas=(0.9, 0.999),
        eps=1e-06,
        weight_decay=0
    )

    algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
    model.to(bagua.get_local_rank())
    # model.to(args.local_rank)
    # model = DDP(model, device_ids=[args.local_rank])
    model = model.with_bagua(
        [optimizer],
        algorithm
    )
    dataset = MyDataset()
    dataloader = DataLoader(dataset, batch_size=5)

    for i in range(10):
        for x, y in dataloader:
            # x = x.to(args.local_rank)
            # y = y.to(args.local_rank)
            x = x.to(bagua.get_local_rank())
            y = y.to(bagua.get_local_rank())
            optimizer.zero_grad()
            output = model(x)
            loss = (output - y).pow(2).sum()
            loss.backward()
            optimizer.step()
@NOBLES5E
Member

NOBLES5E commented Nov 4, 2021

Thanks for filing the issue. Could you provide the output of python -c "import bagua_core; bagua_core.show_version()" to check the actual NCCL version used?

@ProHuper
Author

ProHuper commented Nov 5, 2021

(base) [root@ts-fadc083f9f7d443e933cc3b7e98478a7-launcher ~]# python -c "import bagua_core; bagua_core.show_version()"
WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
project_name: bagua-core-internal
is_debug: false
version:
pkg_version:0.1.2
branch:master
commit_hash:5228e756
build_time:2021-11-02 16:44:13 +00:00
build_env:rustc 1.56.1 (59eed8a2a 2021-11-01),stable-x86_64-unknown-linux-gnu (default)
tag:
commit_hash: 5228e756b5fac9ed242f05bb1c6ce3edfa201a2f
commit_date: 2021-11-02 16:27:10 +00:00
build_os: linux-x86_64
rust_version: rustc 1.56.1 (59eed8a2a 2021-11-01)
build_time: 2021-11-02 16:44:13 +00:00
NCCL version: 21003

@wangraying
Member

wangraying commented Nov 5, 2021

You need to add torch.cuda.set_device(bagua.get_local_rank()) before bagua.init_process_group().

In bagua.init_process_group() we initialize the NCCL communicator, so the CUDA device needs to be set before we call it.
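
For reference, here is a minimal sketch of the corrected initialization order applied to your script (only the lines before model construction change):

import torch
import bagua.torch_api as bagua

# Select this process's CUDA device *before* initializing Bagua, so the
# NCCL communicator created in init_process_group() is bound to it.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

The rest of the script (model.to(bagua.get_local_rank()), with_bagua, and the training loop) can stay as-is.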


BTW, we are also working on a DDP-compatible API. After #312 gets merged, migrating from DDP to Bagua should be just a matter of from bagua.torch_api.data_parallel import DistributedDataParallel as DDP.
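
For illustration, migration might then look roughly like this (a sketch based on the import path above; the exact wrapper API may still change until #312 is merged, and device_ids mirroring torch's DDP constructor is an assumption here):

import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.data_parallel import DistributedDataParallel as DDP  # import path from the comment above

torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = nn.Sequential(nn.Linear(10, 5), nn.Linear(5, 2), nn.Linear(2, 1)).to(bagua.get_local_rank())
# Assumed to be a drop-in replacement for torch.nn.parallel.DistributedDataParallel:
model = DDP(model, device_ids=[bagua.get_local_rank()])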

NOBLES5E linked a pull request Nov 5, 2021 that will close this issue
@ProHuper
Author

ProHuper commented Nov 5, 2021

Got it, but I still don't understand why setting the CUDA device is necessary to initialize the NCCL communicator. At least in Horovod and torch DDP there is no such constraint; is there some consideration behind this?

Also, I noticed Bagua invokes torch's init_process_group inside its own init_process_group. What is this for?

# TODO remove the dependency on torch process group
if not dist.is_initialized():
    torch.distributed.init_process_group(
        backend="nccl",
        store=_default_store,
        rank=get_rank(),
        world_size=get_world_size(),
    )  # fmt: off

_default_pg = new_group(stream=torch.cuda.Stream(priority=-1))

@wangraying
Member

That's the requirement for ncclCommInitRank.

We will eventually remove this dependency in a future release.
