ucx_perftest -t am_bw works; ucp_am_bw retries and fails #9711
Comments
I've encountered this problem too. At first glance, this looks like the offending commit: ed2011b (@evgeny-leksikov for attention). Specifically, I observed it during MPI_Finalize, when UCX seems to wait a long time because endpoints (auxiliary with respect to the wireup endpoints) do not close properly. It looks to me like UD endpoints remain alive during tear-down even if unused, and for some reason are "detached" rather than just closed.
@alex--m do you also see that "ucx_perftest -t ucp_am_bw" doesn't work?
@yosefe I'm seeing this with CX-6, and I'm not sure it's a HW-related issue. I'll run ucx_perftest too; so far I've seen this with osu_bw. Right now it looks like this UD ep was never used (RC was), and it takes ~20 minutes for it to time out and die on its own (since it's not a hang, it might have been missed in CI?).
Confirmed also in
@alex--m I tried running the same ucx_perftest command on ConnectX 6 (master and v1.15.x) and saw no issue.
Describe the bug
I am not sure this is a UCX bug. Hopefully someone can give me ideas about next steps.
I have two identical servers with BCM57414 hardware. They are connected by a direct-attach cable (no switch).
As far as I can tell, the network connection works fine. Both "ib_write_bw" from the perftest package and "ucx_perftest -t am_bw" are working fine and with reasonable speed.
However I have been unable to get the ucp_am_bw test to work.
I compared the attached log with a run from a working pair of servers (different hardware but also bnxt_re driver), and things seem to go awry here:
Steps to Reproduce
On destination server: "ucx_perftest"
On originating server: "env UCX_LOG_LEVEL=data ucx_perftest gpu03-pp -t ucp_am_bw"
UCX 1.15.1 (from AlmaLinux 9.3 distro)
UCX 1.16.0 RC3 (compiled from source earlier today)
Doesn't seem to matter. I ran with UCX_LOG_LEVEL=data to capture a full log (attached).
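The comparison between the working and the failing test can be scripted roughly as follows. This is only a sketch: the host name gpu03-pp and the test names am_bw and ucp_am_bw come from this report, while the perftest_cmd function, the RUN switch, and the log file names are made up for illustration. By default it only prints the client commands; with RUN=1 it assumes "ucx_perftest" has already been started on the destination server (restart it between runs if it exits after a single test).

```shell
#!/bin/sh
# Illustrative helper, not part of UCX. The function name, the RUN switch
# and the log file names are assumptions made for this sketch.

SERVER_HOST="${SERVER_HOST:-gpu03-pp}"

# Print the client-side command line for one ucx_perftest test.
# $1 = server host, $2 = test name (e.g. am_bw or ucp_am_bw)
perftest_cmd() {
    printf 'env UCX_LOG_LEVEL=data ucx_perftest %s -t %s\n' "$1" "$2"
}

for t in am_bw ucp_am_bw; do
    cmd=$(perftest_cmd "$SERVER_HOST" "$t")
    if [ "${RUN:-0}" = "1" ]; then
        # Keep each test's output in its own log for side-by-side comparison.
        sh -c "$cmd" > "ucx_perftest_${t}.log" 2>&1
        echo "$t: exit status $?"
    else
        # Dry run: only show what would be executed.
        echo "$cmd"
    fi
done
```

Running both tests with the same log level makes it easier to diff the two logs and see where the ucp_am_bw run starts to diverge from the am_bw run.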
Setup and versions
AlmaLinux 9.3 x86-64
rdma-core-46.0-1.el9.x86_64
ibstat or ibv_devinfo -vv command (attached)
N/A (not trying to use GPUDirect yet)
ibv_devinfo.log
ucx_perftest.log
Any help or suggestions would be appreciated. Thank you!