transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 21938, vendor err 81 - GPU Direct RDMA error when running NCCL-tests allreduce with nv_peer_mem #214
Comments
@sjeaugey should I cross post this to nv_peer_mem or rdma-core?
No need to post it elsewhere for now. I have a couple of questions:
Both in a container and on bare metal.
It was enabled, but after disabling it via the BIOS we still get errors: ENV 1 shows the same error 11 and error 5 as before, and ENV 2 now shows error 11 and error 5 instead of error 4, so both environments fail the same way.
Those are dual-port NICs. SR-IOV is not enabled.
Could you double-check there is no ACS enabled on either of the nodes?
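A typical way to check, assuming stock pciutils on each node (a generic sketch, not necessarily the exact command used in this thread):

```
# Run on each node: any ACSCtl flag printed with a '+' (e.g. SrcValid+)
# means ACS is still enabled on that PCI bridge.
sudo lspci -vvv | grep -i acsctl
```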
Both nodes show this, but let me get the actual vendor names printed:
It would be good to make sure all of them show ACS disabled.
This fixes it. Thank you, Sylvain.
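For readers hitting the same problem: when the BIOS has no ACS switch, a common workaround is to clear the ACS control register on the offending PCI bridges with setpci. The bridge address below is a placeholder, and a pciutils new enough to know the ECAP_ACS name is assumed:

```
# Placeholder bridge address -- first locate the bridges that report an
# ACSCtl line via 'sudo lspci -vvv'. Writing 0 disables all ACS features
# on that bridge; the change does not persist across reboots.
sudo setpci -s 0000:80:02.0 ECAP_ACS+0x6.w=0000
```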
These errors occur when running the NCCL-tests allreduce bandwidth test with the nv_peer_mem kernel module loaded. When the nv_peer_mem kernel module is not loaded, the test completes with ~20 GB/s of bandwidth (about half of what we should see with GPU Direct RDMA enabled). I have run it in two separate software environments and with two separate versions of NCCL. Hoping to find a solution to this.
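For anyone reproducing, a typical NCCL-tests allreduce invocation looks like the sketch below; the host file, environment variables, and message sizes are placeholders, not the exact commands used here (those are given per environment under Hardware Environment):

```
# Placeholder invocation of nccl-tests' all_reduce_perf via OpenMPI:
# 16 ranks across two 8-GPU nodes, message sizes from 8 B to 512 MB.
mpirun -np 16 -hostfile ./hostfile \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
```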
Hardware Environment
Two nodes:
Env 1 (Got completion with error 11, opcode 2, len 1048576, vendor err 137)
Command
Error Excerpt:
Full Error Log:
https://gist.github.com/wavesj/258782634523281d238e99c4c4a79990
Env 2 (Got completion with error 4, opcode 1, len 21938, vendor err 81)
Command
Error Excerpt:
Full Error Log:
https://gist.github.com/wavesj/c8be90e8716a6dc77cba1228bff53347
OpenMPI config (for those trying to reproduce)
Hostfile looks like this:
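A minimal OpenMPI hostfile sketch, with host names modeled on the ones in the logs above (slots = ranks per node; adjust for your cluster):

```
node-8x-v100-nvlink-1 slots=8
node-8x-v100-nvlink-2 slots=8
```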
SSH config looks like this (~/.ssh/config):
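And a matching ~/.ssh/config sketch, again with placeholder user and key path:

```
Host node-8x-v100-nvlink-*
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
```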
Additional Notes
Mellanox said that vendor error 0x81 means a timeout and that the transport retry counter was exceeded. RDMAMojo's description of the ibv_poll_cq function suggests that IBV_WC_LOC_PROT_ERR is the error we're getting back when the code is 4.