
NCCL Unable to register memory #563

Open
chaoyanghe opened this issue Sep 6, 2021 · 11 comments

Comments

@chaoyanghe

chaoyanghe commented Sep 6, 2021

I have been stuck on this problem for many days. May I know how to debug such issues?

Currently, I am using all_to_all over NCCL with MPI (world_size = 64). The error is as follows.

10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] register_mr_buffers:465 NCCL WARN NET/OFI Unable to register memory (type = 1) for device 0. RC: 12, Error: Cannot allocate memory
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO include/net.h:23 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO transport/net.cc:223 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO transport.cc:111 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO group.cc:137 -> 2 [Async thread]
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:149 [5] NCCL INFO group.cc:210 -> 2
10.0.87.143: 10.0.87.143: Traceback (most recent call last):
10.0.87.143: 10.0.87.143:   File "main_moe.py", line 187, in <module>
10.0.87.143: 10.0.87.143:     main()
10.0.87.143: 10.0.87.143:   File "main_moe.py", line 158, in main
10.0.87.143: 10.0.87.143:     output = model(x)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 801, in forward
10.0.87.143: 10.0.87.143:     output = self.module(*inputs, **kwargs)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/standard_moe_layer.py", line 45, in forward
10.0.87.143: 10.0.87.143:     h = self.moe_layer(hidden_states)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/switch_transformer_layers.py", line 170, in forward
10.0.87.143: 10.0.87.143:     expert_inputs = _AllToAll.apply(self.process_group, torch.cat(route_inputs))
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/switch_transformer_layers.py", line 25, in forward
10.0.87.143: 10.0.87.143:     dist.all_to_all_single(output, input, group=group)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2386, in all_to_all_single
10.0.87.143: 10.0.87.143:     work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
10.0.87.143: 10.0.87.143: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 21.0.3
10.0.87.143: 10.0.87.143: ncclSystemError: System call (socket, malloc, munmap, etc) failed.
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] bootstrap.cc:528 NCCL WARN Unexpected connections are not empty
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] NCCL INFO init.cc:202 -> 3
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] NCCL INFO init.cc:990 -> 3
10.0.87.143: 10.0.78.113: terminate called after throwing an instance of 'std::runtime_error'
10.0.87.143: 10.0.78.113:   what():  NCCL error in: /tmp/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, internal error, NCCL version 21.0.3
10.0.87.143: 10.0.78.113: ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
10.0.87.143: 10.0.72.60:     output = model(x)
10.0.87.143: 10.0.72.60:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
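To get more context around failures like this, NCCL's own logging can be turned up before relaunching the job. A minimal sketch (these are standard NCCL and libfabric environment variables; the exact values shown are illustrative, not taken from the original run):

```shell
# Increase NCCL logging before relaunching the job.
export NCCL_DEBUG=INFO             # per-rank NCCL log messages
export NCCL_DEBUG_SUBSYS=INIT,NET  # focus on init and network transport
export FI_LOG_LEVEL=warn           # libfabric (EFA provider) warnings
```

With these set, each rank prints which transport and plugin it selected, which often narrows a registration failure down to a specific device or limit.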
@chaoyanghe reopened this Sep 6, 2021
@wzamazon

wzamazon commented Sep 6, 2021

It looks like you are using EFA, because register_mr_buffers is a function defined in the aws-ofi-nccl plugin.

If that is the case, consider opening an issue at https://github.com/aws/aws-ofi-nccl.

@wzamazon

wzamazon commented Sep 6, 2021

Also, try running dmesg to see if there are kernel-level error messages.

@chaoyanghe
Author

chaoyanghe commented Sep 7, 2021

dmesg result:

[    7.333531] nvidia: module license 'NVIDIA' taints kernel.
[    7.333531] Disabling lock debugging due to kernel taint
[    7.333532] nvidia: module license 'NVIDIA' taints kernel.
[    7.333533] nvidia: module license 'NVIDIA' taints kernel.
[    7.345808] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem fd014000, mac addr 0e:e2:cf:7d:1a:ff
[    7.408229] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    7.449617] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[    7.457237] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[    7.504615] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
[    7.599766] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
[    7.755746] mousedev: PS/2 mouse device common for all mice
[    7.828257] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.80.02  Wed Sep 23 01:13:39 UTC 2020
[    7.924954] Elastic Fabric Adapter (EFA) v1.10.2g
[    8.038953] efa 0000:00:06.0: Setup irq:0xffff88ddba9459a0 vector:123 name:efa-mgmnt@pci:0000:00:06.0
[    8.048374] infiniband efa_0: IB device registered
[    8.150111] Neuron Driver Started with Version:1.1.2.0
[   15.356571] RPC: Registered named UNIX socket transport module.
[   15.365231] RPC: Registered udp transport module.
[   15.369147] RPC: Registered tcp transport module.
[   15.373190] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   15.484024] random: crng init done
[   15.484026] random: 7 urandom warning(s) missed due to ratelimiting
[   21.265449] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   21.271986] Bridge firewalling registered
[   21.301224] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   22.417880] Initializing XFRM netlink socket
[   22.429731] Netfilter messages via NETLINK v0.30.
[   22.440908] ctnetlink v0.93: registering with nfnetlink.
[   24.627980] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   27.914135] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[  541.962521] libcfs: module is from the staging directory, the quality is unknown, you have been warned.
[  541.973476] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[  541.977826] alg: No test for adler32 (adler32-zlib)
[  542.744119] lnet: module is from the staging directory, the quality is unknown, you have been warned.
[  542.771058] obdclass: module is from the staging directory, the quality is unknown, you have been warned.
[  542.788107] Lustre: Lustre: Build Version: 2.10.8
[  542.811380] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.
[  542.838343] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.
[  542.848721] LNet: Added LNI 10.0.87.143@tcp [8/256/0/180]
[  542.851520] LNet: Accept secure, port 988
[  542.867378] lov: module is from the staging directory, the quality is unknown, you have been warned.
[  542.880697] fid: module is from the staging directory, the quality is unknown, you have been warned.
[  542.897622] mdc: module is from the staging directory, the quality is unknown, you have been warned.
[  542.913663] fld: module is from the staging directory, the quality is unknown, you have been warned.
[  542.930234] lmv: module is from the staging directory, the quality is unknown, you have been warned.
[  542.955365] lustre: module is from the staging directory, the quality is unknown, you have been warned.
[  542.976986] mgc: module is from the staging directory, the quality is unknown, you have been warned.
[  543.010859] osc: module is from the staging directory, the quality is unknown, you have been warned.
[  543.221195] Lustre: Mounted 4vozvbmv-client
[193073.072231] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[193073.190398] nvidia-uvm: Loaded the UVM driver, major device number 243.
[193073.219362] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.80.02  Wed Sep 23 00:48:09 UTC 2020
[669689.122792] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583759/real 1630583759]  req@ffff893d76464b00 x1709089496410816/t0(0) o400->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 224/224 e 0 to 1 dl 1630583766 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669689.122812] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection to 4vozvbmv-MDT0000 (at 10.0.205.151@tcp) was lost; in progress operations using this service will wait for recovery to complete
[669689.144092] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669689.165176] LustreError: 166-1: MGC10.0.205.151@tcp: Connection to MGS (at 10.0.205.151@tcp) was lost; in progress operations using this service will fail
[669694.242785] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583766/real 1630583766]  req@ffff8938c5082700 x1709089496416160/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583772 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669695.266788] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583767/real 1630583767]  req@ffff8938c5080f00 x1709089496416176/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583773 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669720.866793] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792]  req@ffff8938c5083600 x1709089496416208/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583798 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669725.986801] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792]  req@ffff8938c5080c00 x1709089496416192/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583803 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669752.610835] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583819/real 0]  req@ffff8938cdda9500 x1709089496421552/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583830 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.482918] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583899/real 0]  req@ffff89357abe0600 x1709089496437520/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583910 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.504005] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669890.850961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583952/real 0]  req@ffff89389755f800 x1709089496448176/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583968 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669890.873292] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669958.434964] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584005/real 0]  req@ffff8938cbc7e000 x1709089496458816/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.151@tcp:2
6/25 lens 520/544 e 0 to 1 dl 1630584036 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669958.454443] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670108.963185] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584165/real 0]  req@ffff8938cbb37200 x1709089496490768/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630584186 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[670108.984529] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670381.155547] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584404/real 1630584458]  req@ffff8938ad64f200 x1709089496538720/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630584459 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670381.176961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[670937.156108] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584964/real 1630585014]  req@ffff893d83ed0900 x1709089496650656/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630585019 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670937.177572] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[671099.175718] Lustre: Evicted from MGS (at 198.19.54.206@tcp1) after server handle changed from 0x5bc0911ee77e15fb to 0x1694543d7262fec
[671099.184834] LustreError: 10110:0:(client.c:3009:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88dd66d7ad00 x1709089395116832/t95103249569(95103249569) o101->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@10.0.205.151@tcp:12/10 
lens 992/544 e 0 to 0 dl 1630585348 ref 2 fl Interpret:RP/4/0 rc 301/301
[671099.184923] Lustre: MGC10.0.205.151@tcp: Connection restored to 198.19.54.206@tcp1 (at 10.0.205.151@tcp)
[671521.495128] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection restored to 10.0.205.151@tcp (at 10.0.205.151@tcp)
[678853.930111] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630592924/real 1630592924]  req@ffff8938aefeb900 x1709089498252000/t0(0) o400->4vozvbmv-OST00e3-osc-ffff88dd
b2378000@10.0.220.229@tcp:28/4 lens 224/224 e 0 to 1 dl 1630592931 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678853.930135] Lustre: 4vozvbmv-OST00e2-osc-ffff88ddb2378000: Connection to 4vozvbmv-OST00e2 (at 10.0.220.229@tcp) was lost; in progress operations using this service will wait for recovery to complete
[678853.952118] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[678944.042170] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630593010/real 1630593010]  req@ffff893d27b93600 x1709089498268048/t0(0) o8->4vozvbmv-OST00e2-osc-ffff88ddb2
378000@10.0.220.229@tcp:28/4 lens 520/544 e 0 to 1 dl 1630593021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678944.063418] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[679238.188040] Lustre: 4vozvbmv-OST00e3-osc-ffff88ddb2378000: Connection restored to 198.19.32.207@tcp1 (at 10.0.220.229@tcp)
[1038619.687627] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.695161] infiniband efa_0: Failed to register mr [-12]
[1038619.695196] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.700624] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707090] infiniband efa_0: Failed to register mr [-12]
[1038619.707145] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707147] infiniband efa_0: Failed to register mr [-12]
[1038619.714600] infiniband efa_0: Failed to register mr [-12]
[1038619.714602] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.714604] infiniband efa_0: Failed to register mr [-12]
[1038619.726444] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726511] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726512] infiniband efa_0: Failed to register mr [-12]
[1038619.766218] infiniband efa_0: Failed to register mr [-12]
[1088198.846130] python3[41849]: segfault at 10 ip 00007f6ac5b5a1af sp 00007ffdd32cdd80 error 4 in libc-2.27.so[7f6ac5ac5000+1e7000]
[1090746.882487] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.889954] infiniband efa_0: Failed to register mr [-12]
[1090746.889990] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.895063] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901576] infiniband efa_0: Failed to register mr [-12]
[1090746.901584] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901587] infiniband efa_0: Failed to register mr [-12]
[1090746.909000] infiniband efa_0: Failed to register mr [-12]
[1090746.909007] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.914505] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920860] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920904] infiniband efa_0: Failed to register mr [-12]
[1090746.920939] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920941] infiniband efa_0: Failed to register mr [-12]
[1090746.925258] infiniband efa_0: Failed to register mr [-12]
[1090746.972127] infiniband efa_0: Failed to register mr [-12]

@chaoyanghe
Author

export OMP_NUM_THREADS=8
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_SOCKET_NTHREADS=8

When world_size = 64, is this setting too big?
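As an aside, REG_MR failing with ENOMEM (rc 12) is often tied to per-process resource limits rather than actual RAM exhaustion, so checking them is a cheap first step. A general sketch (not something the thread itself ran):

```shell
# Memory registration needs locked memory and, with many
# connections, plenty of file descriptors and mmap regions.
ulimit -l                        # max locked memory; ideally "unlimited" for EFA
ulimit -n                        # max open file descriptors
cat /proc/sys/vm/max_map_count   # max mmap regions per process
```

If any of these are low on the failing node, raising them (e.g. via /etc/security/limits.conf or the container runtime) is worth trying before tuning NCCL itself.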

@chaoyanghe
Author

NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 in your command line causes each connection to create 64 sockets (each socket using one file descriptor), hence:

64 sockets per connection *
8 processes per node *
(64 - 8) remote ranks *
2 (send/receive)
= 57344
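The estimate above can be checked directly in the shell:

```shell
# sockets_per_connection * processes_per_node * remote_ranks * 2 (send/recv)
echo $((64 * 8 * (64 - 8) * 2))   # prints 57344
```

That total is well above the default open-file limit of 1024 on many systems, which would explain why file-descriptor-backed allocations start failing.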

@wzamazon

wzamazon commented Sep 7, 2021

Hi, I am a little confused about how sockets come into play in this scenario. If you are using EFA, then you are not using sockets. Or am I missing something?

@chaoyanghe
Author

Hi, I am a little confused about how sockets come into play in this scenario. If you are using EFA, then you are not using sockets. Or am I missing something?

I see. I checked this by setting them to 1; the same error happened.

@chaoyanghe
Author

@wzamazon I also created an issue on the aws-ofi-nccl GitHub:

aws/aws-ofi-nccl#70

If you are at Amazon, can we create a channel to debug? I know EFA's engineers. What's your internal ID at Amazon?

@chaoyanghe
Author

NVIDIA/nccl-tests#78

@chaoyanghe
Author

chaoyanghe commented Sep 7, 2021

@chaoyanghe
Author

This issue was finally resolved by setting NCCL_BUFFSIZE=2097152.
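For reference, NCCL_BUFFSIZE is a documented NCCL environment variable whose default is 4194304 bytes (4 MiB); halving it reduces the buffer that must be registered per connection, which shrinks total registered memory across the tens of thousands of connections computed above. A sketch of applying the fix (the mpirun flag usage is illustrative):

```shell
# Halve NCCL's per-connection buffer from the 4 MiB default to 2 MiB,
# reducing the total memory that must be registered with the EFA device.
export NCCL_BUFFSIZE=2097152
# or pass it through the launcher, e.g.:
# mpirun -x NCCL_BUFFSIZE=2097152 ...
```

Smaller buffers can cost some bandwidth on large messages, so this trades a little throughput for staying under the registration limit.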
