Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ucx_perftest -t am_bw works; ucp_am_bw retries and fails #9711

Open
plopresti opened this issue Feb 23, 2024 · 5 comments
Open

ucx_perftest -t am_bw works; ucp_am_bw retries and fails #9711

plopresti opened this issue Feb 23, 2024 · 5 comments
Labels

Comments

@plopresti
Copy link

Describe the bug

I am not sure this is a UCX bug. Hopefully someone can give me ideas about next steps.

I have two identical servers with BCM57414 hardware. They are connected by a direct-attach cable (no switch).

As far as I can tell, the network connection works fine. Both "ib_write_bw" from the perftest package and "ucx_perftest -t am_bw" are working fine and with reasonable speed.

However I have been unable to get the ucp_am_bw test to work.

I compared the attached log with a run from a working pair of servers (different hardware but also bnxt_re driver), and things seem to go awry here:

[1708728616.025791] [gpu01:16495:0]           flush.c:381  UCX  DEBUG flush_worker ep 0x7f8e11715000
[1708728616.025800] [gpu01:16495:0]           ud_ep.c:93   UCX  DEBUG ep: 0x55cde0e59c10 ca drop@cwnd = 2 in flight: 1
[1708728616.025802] [gpu01:16495:0]           ud_ep.c:1424 UCX  DEBUG ep(0x55cde0e59c10): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1708728616.025804] [gpu01:16495:0]           ud_ep.c:1430 UCX  DEBUG ep(0x55cde0e59c10): resending completed
[1708728616.088720] [gpu01:16495:0]           ud_ep.c:93   UCX  DEBUG ep: 0x55cde0e59c10 ca drop@cwnd = 2 in flight: 1
[1708728616.088721] [gpu01:16495:0]           ud_ep.c:1424 UCX  DEBUG ep(0x55cde0e59c10): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1708728616.088722] [gpu01:16495:0]           ud_ep.c:1430 UCX  DEBUG ep(0x55cde0e59c10): resending completed
[1708728616.248118] [gpu01:16495:0]           ud_ep.c:93   UCX  DEBUG ep: 0x55cde0e59c10 ca drop@cwnd = 2 in flight: 1
(etc.)

Steps to Reproduce

On destination server: "ucx_perftest"

On originating server: "env UCX_LOG_LEVEL=data ucx_perftest gpu03-pp -t ucp_am_bw"

  • UCX version used

UCX 1.15.1 (from AlmaLinux 9.3 distro)
UCX 1.16.0 RC3 (compiled from source earlier today)

  • Any UCX environment variables used

Doesn't seem to matter. I ran with UCX_LOG_LEVEL=data to capture a full log (attached).

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)

AlmaLinux 9.3 x86-64

  • For RDMA/IB/RoCE related issues:
    • Driver version:
      rdma-core-46.0-1.el9.x86_64
    • HW information from ibstat or ibv_devinfo -vv command
      (attached)
  • For GPU related issues:
    N/A (not trying to use GPUDirect yet)

ibv_devinfo.log
ucx_perftest.log

Any help or suggestions would be appreciated. Thank you!

@plopresti plopresti added the Bug label Feb 23, 2024
@alex--m
Copy link
Contributor

alex--m commented Apr 21, 2024

I've encountered this problem too - on first glance it looks like this is the offending commit: ed2011b (@evgeny-leksikov for attention)

Specifically, I observed it during MPI_Finalize, when UCX seems to wait a long time for endpoints (auxiliary w.r.t. to wireup endpoints) do not close properly. It looks to me like UD endpoints remain alive during tear-down even if unused, and for some reason are "detached" and not just closed.

@yosefe
Copy link
Contributor

yosefe commented Apr 21, 2024

@alex--m do you also see "ucx_perftest -t ucp_am_bw" doesn't work?
The original issue seems to happen with Broadcom NIC, it's possible some assumptions we did in UD/RC transports are not correct for this NIC. Would be great is someone with access to Broadcom NIC could help fix this issue.
Regarding UD, then yes, when we close UD ep for UCT API it's kept on the iface in case the remove UD endpoint is still alive.

@alex--m
Copy link
Contributor

alex--m commented Apr 21, 2024

@yosefe I'm seeing this with CX-6, and I'm not sure it's a HW-related issue. I'll run ucx_perftest too, I see this on osu_bw. Right now it looks like this UD ep was never used (RC was), and it takes ~20 minutes for it to timeout and die on it's own (since it's not a hang - it might have been missed in CI?).

@alex--m
Copy link
Contributor

alex--m commented Apr 21, 2024

Confirmed also in ucx_perftest CX:

UCX_NET_DEVICES=mlx5_0:1 UCX_LOG_LEVEL=req /shared/alexm/applications/ucx_perftest/ucx_perftest -t ucp_am_bw
[1713684825.142475] [srv25-r203:480403:0]           debug.c:1152 UCX  DEBUG using signal stack 0x7fbdeb05d000 size 148160
[1713684825.157998] [srv25-r203:480403:0]            init.c:121  UCX  DEBUG /home/alexm/environments/rhel-8.9/home/lib/libucs.so.0 loaded at 0x7fbded824000
[1713684825.158034] [srv25-r203:480403:0]            init.c:122  UCX  DEBUG cmd line: /shared/alexm/applications/ucx_perftest/ucx_perftest -t ucp_am_bw 
[1713684825.158060] [srv25-r203:480403:0]          module.c:72   UCX  DEBUG ucs library path: /home/alexm/environments/rhel-8.9/home/lib/libucs.so.0
[1713684825.158076] [srv25-r203:480403:0]          module.c:280  UCX  DEBUG loading modules for ucs
[1713684825.159911] [srv25-r203:480403:0]          module.c:280  UCX  DEBUG loading modules for ucx_perftest
[1713684825.160081] [srv25-r203:480403:0]        perftest.c:789  UCX  WARN  CPU affinity is not set (bound to 96 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.12.4.253:39734
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         am bandwidth / message rate                                                                |
| Data layout:  (automatic)                                                                                |
| Send memory:  host                                                                                       |
| Recv memory:  host                                                                                       |
| Message size: 8                                                                                          |
| AM header size: 0                                                                        |
+----------------------------------------------------------------------------------------------------------+
[1713684826.788977] [srv25-r203:480403:0]         libperf.c:219  UCX  DEBUG set send allocator by send mem type host
[1713684826.788994] [srv25-r203:480403:0]         libperf.c:223  UCX  DEBUG set recv allocator by recv mem type host
[1713684826.789016] [srv25-r203:480403:0]            time.c:22   UCX  DEBUG arch clock frequency: 2100000000.00 Hz
[1713684826.790377] [srv25-r203:480403:0]     ucp_context.c:2131 UCX  INFO  Version 1.15.0 (loaded from /home/alexm/environments/rhel-8.9/home/lib/libucp.so.0)
[1713684826.790418] [srv25-r203:480403:0]     ucp_context.c:1899 UCX  DEBUG estimated number of endpoints is 1
[1713684826.790429] [srv25-r203:480403:0]     ucp_context.c:1906 UCX  DEBUG estimated number of endpoints per node is 1
[1713684826.790444] [srv25-r203:480403:0]     ucp_context.c:1916 UCX  DEBUG estimated bcopy bandwidth is 6081740800.000000
[1713684826.791306] [srv25-r203:480403:0]     ucp_context.c:1975 UCX  DEBUG allocation method[0] is md 'sysv'
[1713684826.791319] [srv25-r203:480403:0]     ucp_context.c:1975 UCX  DEBUG allocation method[1] is md 'posix'
[1713684826.791344] [srv25-r203:480403:0]     ucp_context.c:1987 UCX  DEBUG allocation method[2] is 'huge'
[1713684826.791357] [srv25-r203:480403:0]     ucp_context.c:1987 UCX  DEBUG allocation method[3] is 'thp'
[1713684826.791369] [srv25-r203:480403:0]     ucp_context.c:1975 UCX  DEBUG allocation method[4] is md '*'
[1713684826.791381] [srv25-r203:480403:0]     ucp_context.c:1987 UCX  DEBUG allocation method[5] is 'mmap'
[1713684826.791394] [srv25-r203:480403:0]     ucp_context.c:1987 UCX  DEBUG allocation method[6] is 'heap'
[1713684826.791451] [srv25-r203:480403:0]          module.c:280  UCX  DEBUG loading modules for uct
[1713684826.796318] [srv25-r203:480403:0]          module.c:280  UCX  DEBUG loading modules for uct_ib
[1713684826.817388] [srv25-r203:480403:0]            topo.c:790  UCX  DEBUG /sys/class/infiniband/mlx5_0: PF sysfs path is '/sys/devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0'
[1713684826.817412] [srv25-r203:480403:0]            topo.c:240  UCX  DEBUG added sys_dev 0 for bus id 4b:00.0
[1713684826.817423] [srv25-r203:480403:0]            topo.c:473  UCX  DEBUG mlx5_0: bdf_name 0000:4b:00.0 sys_dev 0
[1713684826.817458] [srv25-r203:480403:0]       ib_device.c:487  UCX  DEBUG mlx5_0: vendor_id 0x15b3 device_id 4125
[1713684826.817557] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb094660 [id=5 ref 1] ???() to hash
[1713684826.818248] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 5 events 0x1 mode thread_spinlock
[1713684826.818262] [srv25-r203:480403:0]       ib_device.c:586  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1713684826.818281] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1713684826.818294] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1713684826.818313] [srv25-r203:480403:0]           ib_md.c:1759 UCX  DEBUG mlx5_0: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1713684826.820628] [srv25-r203:480403:0]           ib_md.c:1726 UCX  DEBUG mlx5_0: relaxed order memory access is disabled
[1713684826.820641] [srv25-r203:480403:0]           ib_md.c:1404 UCX  DEBUG mlx5_0: rcache disabled in builds with NextSilicon support
[1713684829.216752] [srv25-r203:480403:0]           ib_md.c:1448 UCX  DEBUG mlx5_0: using odp global key
[1713684829.216789] [srv25-r203:480403:0]           ib_md.c:1663 UCX  DEBUG mlx5_0: md open by 'uct_ib_verbs_md_ops' is successful
[1713684829.245835] [srv25-r203:480403:0]            topo.c:790  UCX  DEBUG /sys/class/infiniband/mlx5_1: PF sysfs path is '/sys/devices/pci0000:4a/0000:4a:02.0/0000:4b:00.1'
[1713684829.245856] [srv25-r203:480403:0]            topo.c:240  UCX  DEBUG added sys_dev 1 for bus id 4b:00.1
[1713684829.245864] [srv25-r203:480403:0]            topo.c:473  UCX  DEBUG mlx5_1: bdf_name 0000:4b:00.1 sys_dev 1
[1713684829.245892] [srv25-r203:480403:0]       ib_device.c:487  UCX  DEBUG mlx5_1: vendor_id 0x15b3 device_id 4125
[1713684829.245973] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb099010 [id=10 ref 1] ???() to hash
[1713684829.245992] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1713684829.246001] [srv25-r203:480403:0]       ib_device.c:586  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1713684829.246016] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1713684829.246029] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1713684829.246044] [srv25-r203:480403:0]           ib_md.c:1759 UCX  DEBUG mlx5_1: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1713684829.248362] [srv25-r203:480403:0]           ib_md.c:1726 UCX  DEBUG mlx5_1: relaxed order memory access is disabled
[1713684829.248373] [srv25-r203:480403:0]           ib_md.c:1404 UCX  DEBUG mlx5_1: rcache disabled in builds with NextSilicon support
[1713684831.598712] [srv25-r203:480403:0]           ib_md.c:1448 UCX  DEBUG mlx5_1: using odp global key
[1713684831.598749] [srv25-r203:480403:0]           ib_md.c:1663 UCX  DEBUG mlx5_1: md open by 'uct_ib_verbs_md_ops' is successful
[1713684831.605325] [srv25-r203:480403:0]     ucp_context.c:1563 UCX  DEBUG closing md mlx5_1 because it has no selected transport resources
[1713684831.605416] [srv25-r203:480403:0]       ib_device.c:605  UCX  DEBUG destroying ib device mlx5_1
[1713684831.605446] [srv25-r203:480403:0]           async.c:157  UCX  DEBUG removed async handler 0x7fbdeb099010 [id=10 ref 1] ???() from hash
[1713684831.605459] [srv25-r203:480403:0]           async.c:547  UCX  DEBUG removing async handler 0x7fbdeb099010 [id=10 ref 1] ???()
[1713684831.605477] [srv25-r203:480403:0]           async.c:172  UCX  DEBUG release async handler 0x7fbdeb099010 [id=10 ref 0] ???()
[1713684831.628532] [srv25-r203:480403:0]            topo.c:790  UCX  DEBUG /sys/class/infiniband/mlx5_2: PF sysfs path is '/sys/devices/pci0000:b0/0000:b0:04.0/0000:b1:00.0'
[1713684831.628569] [srv25-r203:480403:0]            topo.c:240  UCX  DEBUG added sys_dev 2 for bus id b1:00.0
[1713684831.628581] [srv25-r203:480403:0]            topo.c:473  UCX  DEBUG mlx5_2: bdf_name 0000:b1:00.0 sys_dev 2
[1713684831.628626] [srv25-r203:480403:0]       ib_device.c:487  UCX  DEBUG mlx5_2: vendor_id 0x15b3 device_id 4115
[1713684831.628728] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb09a390 [id=10 ref 1] ???() to hash
[1713684831.628755] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1713684831.628767] [srv25-r203:480403:0]       ib_device.c:586  UCX  DEBUG initialized device 'mlx5_2' (InfiniBand channel adapter) with 1 ports
[1713684831.628789] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_2: cuda GPUDirect RDMA is disabled
[1713684831.628807] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_2: rocm GPUDirect RDMA is disabled
[1713684831.628829] [srv25-r203:480403:0]           ib_md.c:1759 UCX  DEBUG mlx5_2: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1713684831.631250] [srv25-r203:480403:0]           ib_md.c:1726 UCX  DEBUG mlx5_2: relaxed order memory access is disabled
[1713684831.631260] [srv25-r203:480403:0]           ib_md.c:1404 UCX  DEBUG mlx5_2: rcache disabled in builds with NextSilicon support
[1713684834.291070] [srv25-r203:480403:0]           ib_md.c:1448 UCX  DEBUG mlx5_2: using odp global key
[1713684834.291108] [srv25-r203:480403:0]           ib_md.c:1663 UCX  DEBUG mlx5_2: md open by 'uct_ib_verbs_md_ops' is successful
[1713684834.296268] [srv25-r203:480403:0]       ib_device.c:1052 UCX  DEBUG no compatible IB ports found for flags 0x0
[1713684834.296297] [srv25-r203:480403:0]          uct_md.c:131  UCX  DEBUG failed to query rc_verbs resources: No such device
[1713684834.296309] [srv25-r203:480403:0]       ib_device.c:1052 UCX  DEBUG no compatible IB ports found for flags 0x0
[1713684834.296320] [srv25-r203:480403:0]          uct_md.c:131  UCX  DEBUG failed to query ud_verbs resources: No such device
[1713684834.296333] [srv25-r203:480403:0]     ucp_context.c:1111 UCX  DEBUG No tl resources found for md mlx5_2
[1713684834.296343] [srv25-r203:480403:0]     ucp_context.c:1563 UCX  DEBUG closing md mlx5_2 because it has no selected transport resources
[1713684834.296405] [srv25-r203:480403:0]       ib_device.c:605  UCX  DEBUG destroying ib device mlx5_2
[1713684834.296423] [srv25-r203:480403:0]           async.c:157  UCX  DEBUG removed async handler 0x7fbdeb09a390 [id=10 ref 1] ???() from hash
[1713684834.296435] [srv25-r203:480403:0]           async.c:547  UCX  DEBUG removing async handler 0x7fbdeb09a390 [id=10 ref 1] ???()
[1713684834.296453] [srv25-r203:480403:0]           async.c:172  UCX  DEBUG release async handler 0x7fbdeb09a390 [id=10 ref 0] ???()
[1713684834.326479] [srv25-r203:480403:0]            topo.c:790  UCX  DEBUG /sys/class/infiniband/mlx5_3: PF sysfs path is '/sys/devices/pci0000:b0/0000:b0:04.0/0000:b1:00.1'
[1713684834.326501] [srv25-r203:480403:0]            topo.c:240  UCX  DEBUG added sys_dev 3 for bus id b1:00.1
[1713684834.326508] [srv25-r203:480403:0]            topo.c:473  UCX  DEBUG mlx5_3: bdf_name 0000:b1:00.1 sys_dev 3
[1713684834.326539] [srv25-r203:480403:0]       ib_device.c:487  UCX  DEBUG mlx5_3: vendor_id 0x15b3 device_id 4115
[1713684834.326614] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb0dc640 [id=10 ref 1] ???() to hash
[1713684834.326632] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1713684834.326641] [srv25-r203:480403:0]       ib_device.c:586  UCX  DEBUG initialized device 'mlx5_3' (InfiniBand channel adapter) with 1 ports
[1713684834.326657] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_3: cuda GPUDirect RDMA is disabled
[1713684834.326669] [srv25-r203:480403:0]           ib_md.c:1738 UCX  DEBUG mlx5_3: rocm GPUDirect RDMA is disabled
[1713684834.326685] [srv25-r203:480403:0]           ib_md.c:1759 UCX  DEBUG mlx5_3: ibv_reg_dmabuf_mr(fd=-1) returned Protocol not supported, dmabuf is not supported
[1713684834.329728] [srv25-r203:480403:0]           ib_md.c:1726 UCX  DEBUG mlx5_3: relaxed order memory access is disabled
[1713684834.329739] [srv25-r203:480403:0]           ib_md.c:1404 UCX  DEBUG mlx5_3: rcache disabled in builds with NextSilicon support
[1713684838.355815] [srv25-r203:480403:0]           ib_md.c:1448 UCX  DEBUG mlx5_3: using odp global key
[1713684838.355852] [srv25-r203:480403:0]           ib_md.c:1663 UCX  DEBUG mlx5_3: md open by 'uct_ib_verbs_md_ops' is successful
[1713684838.361103] [srv25-r203:480403:0]       ib_device.c:1052 UCX  DEBUG no compatible IB ports found for flags 0x0
[1713684838.361125] [srv25-r203:480403:0]          uct_md.c:131  UCX  DEBUG failed to query rc_verbs resources: No such device
[1713684838.361139] [srv25-r203:480403:0]       ib_device.c:1052 UCX  DEBUG no compatible IB ports found for flags 0x0
[1713684838.361149] [srv25-r203:480403:0]          uct_md.c:131  UCX  DEBUG failed to query ud_verbs resources: No such device
[1713684838.361162] [srv25-r203:480403:0]     ucp_context.c:1111 UCX  DEBUG No tl resources found for md mlx5_3
[1713684838.361172] [srv25-r203:480403:0]     ucp_context.c:1563 UCX  DEBUG closing md mlx5_3 because it has no selected transport resources
[1713684838.361230] [srv25-r203:480403:0]       ib_device.c:605  UCX  DEBUG destroying ib device mlx5_3
[1713684838.361247] [srv25-r203:480403:0]           async.c:157  UCX  DEBUG removed async handler 0x7fbdeb0dc640 [id=10 ref 1] ???() from hash
[1713684838.361260] [srv25-r203:480403:0]           async.c:547  UCX  DEBUG removing async handler 0x7fbdeb0dc640 [id=10 ref 1] ???()
[1713684838.361277] [srv25-r203:480403:0]           async.c:172  UCX  DEBUG release async handler 0x7fbdeb0dc640 [id=10 ref 0] ???()
[1713684838.363029] [srv25-r203:480403:0]            topo.c:477  UCX  DEBUG eth3: system device unknown
[1713684838.364595] [srv25-r203:480403:0]            topo.c:477  UCX  DEBUG eth2: system device unknown
[1713684838.365573] [srv25-r203:480403:0]            topo.c:477  UCX  DEBUG eth0: system device unknown
[1713684838.366528] [srv25-r203:480403:0]            topo.c:477  UCX  DEBUG lo: system device unknown
[1713684838.366564] [srv25-r203:480403:0]     ucp_context.c:1563 UCX  DEBUG closing md tcp because it has no selected transport resources
[1713684838.366625] [srv25-r203:480403:0]      uct_md_vfs.c:134  UCX  DEBUG failed to query md attributes
[1713684838.366738] [srv25-r203:480403:0]      uct_md_vfs.c:134  UCX  DEBUG failed to query md attributes
[1713684838.366897] [srv25-r203:480403:0]     ucp_context.c:1622 UCX  DEBUG register host memory on: mlx5_0, self, cma
[1713684838.366910] [srv25-r203:480403:0]     ucp_context.c:1622 UCX  DEBUG register cuda memory on: 
[1713684838.366928] [srv25-r203:480403:0]     ucp_context.c:1622 UCX  DEBUG register cuda-managed memory on: 
[1713684838.366937] [srv25-r203:480403:0]     ucp_context.c:1622 UCX  DEBUG register rocm memory on: 
[1713684838.366946] [srv25-r203:480403:0]     ucp_context.c:1622 UCX  DEBUG register rocm-managed memory on: 
[1713684838.367018] [srv25-r203:480403:0]     ucp_context.c:2222 UCX  DEBUG created ucp context perftest 0x7fbdeb08cde0 [3 mds 4 tls] features 0x40 tl bitmap 0xf 0x0
[1713684838.367070] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367082] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367093] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367205] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367226] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367261] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367271] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367280] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367304] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367324] [srv25-r203:480403:0]         uct_mem.c:301  UCX  DEBUG   could not allocate memory with any of the provided methods
[1713684838.367600] [srv25-r203:480403:0]        ib_iface.c:907  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/RoCE
[1713684838.367801] [srv25-r203:480403:0]       ib_device.c:916  UCX  DEBUG mlx5_0:1 using gid_index 3
[1713684838.367889] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.369907] [srv25-r203:480403:0]        ib_iface.c:1433 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1713684838.369989] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8355
[1713684838.370005] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1713684838.370248] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1713684838.371257] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1713684838.371273] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1713684838.372679] [srv25-r203:480403:0]        ib_iface.c:1032 UCX  DEBUG iface=0x7fbdeb131150: created RC QP 0x53c7 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1713684838.373661] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.375259] [srv25-r203:480403:0]      ucp_worker.c:1400 UCX  DEBUG created interface[0]=0x7fbdeb131150 using rc_verbs/mlx5_0:1 on worker 0x7fbdeb0e84c0
[1713684838.375326] [srv25-r203:480403:0]        ib_iface.c:907  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/RoCE
[1713684838.375399] [srv25-r203:480403:0]       ib_device.c:916  UCX  DEBUG mlx5_0:1 using gid_index 3
[1713684838.375430] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.376213] [srv25-r203:480403:0]        ib_iface.c:1433 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4136
[1713684838.376922] [srv25-r203:480403:0]        ib_iface.c:1032 UCX  DEBUG iface=0x7fbdeb08b2d0: created UD QP 0x53c8 on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1713684838.378509] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4188
[1713684838.378552] [srv25-r203:480403:0]        memtrack.c:541  UCX  DEBUG ucs_try_to_mark_as_unmigratable request 7fbddd313000 85000
[1713684838.378560] [srv25-r203:480403:0]        memtrack.c:544  UCX  DEBUG Asking for memory protection without the loader is present
[1713684838.378699] [srv25-r203:480403:0]           mpool.c:282  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7fbddd313018 of 544744 bytes with 128 elements
[1713684838.378971] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4208
[1713684838.379000] [srv25-r203:480403:0]        ud_iface.c:383  UCX  DEBUG iface 0x7fbdeb08b2d0: adding gid fe80::ac0:ebff:fea2:c542 to hash on device mlx5_0 port 1 index 0)
[1713684838.379018] [srv25-r203:480403:0]        ud_iface.c:383  UCX  DEBUG iface 0x7fbdeb08b2d0: adding gid fe80::ac0:ebff:fea2:c542 to hash on device mlx5_0 port 1 index 1)
[1713684838.379034] [srv25-r203:480403:0]        ud_iface.c:383  UCX  DEBUG iface 0x7fbdeb08b2d0: adding gid ::ffff:10.12.28.13 to hash on device mlx5_0 port 1 index 2)
[1713684838.379050] [srv25-r203:480403:0]        ud_iface.c:383  UCX  DEBUG iface 0x7fbdeb08b2d0: adding gid ::ffff:10.12.28.13 to hash on device mlx5_0 port 1 index 3)
[1713684838.381247] [srv25-r203:480403:0]     timer_wheel.c:41   UCX  DEBUG high res timer created log=23 resolution=3994.575238 usec wanted: 2500.000000 usec
[1713684838.381263] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb0980c0 [id=10 ref 1] ???() to hash
[1713684838.381277] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 10 events 0x5 mode thread_spinlock
[1713684838.381310] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381345] [srv25-r203:480403:0]      ucp_worker.c:1400 UCX  DEBUG created interface[1]=0x7fbdeb08b2d0 using ud_verbs/mlx5_0:1 on worker 0x7fbdeb0e84c0
[1713684838.381368] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool self_msg_desc: align 64, maxelems 4294967295, elemsize 8200
[1713684838.381375] [srv25-r203:480403:0]            self.c:228  UCX  DEBUG created self iface id 0xc03b6c74001c153c send_size 8192
[1713684838.381385] [srv25-r203:480403:0]      ucp_worker.c:1400 UCX  DEBUG created interface[2]=0x7fbdeb0e0510 using self/memory on worker 0x7fbdeb0e84c0
[1713684838.381424] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool uct_scopy_iface_tx_mp: align 64, maxelems 4294967295, elemsize 736
[1713684838.381453] [srv25-r203:480403:0]      ucp_worker.c:1400 UCX  DEBUG created interface[3]=0x7fbdeb12e770 using cma/memory on worker 0x7fbdeb0e84c0
[1713684838.381466] [srv25-r203:480403:0]      ucp_worker.c:1135 UCX  DEBUG selected scalable tl bitmap: 0xf 0x0 (4 tls)
[1713684838.381488] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb223ae0 [id=9 ref 1] ???() to hash
[1713684838.381498] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 9 events 0x0 mode thread_spinlock
[1713684838.381526] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381544] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381560] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381577] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381594] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381610] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381626] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381643] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381659] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381675] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381713] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381729] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381746] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381762] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381778] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381796] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381813] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381829] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381846] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.381864] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684838.382114] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG added async handler 0x7fbdeb097d90 [id=11 ref 1] ???() to hash
[1713684838.382125] [srv25-r203:480403:0]           async.c:494  UCX  DEBUG listening to async event fd 11 events 0x1 mode thread_spinlock
[1713684838.382132] [srv25-r203:480403:0]       rdmacm_cm.c:981  UCX  DEBUG created rdmacm_cm 0x7fbdeb08bbe0 with event_channel 0x7fbdeb090570 (fd=11)
[1713684838.382150] [srv25-r203:480403:0]      tcp_sockcm.c:224  UCX  DEBUG created tcp_sockcm 0x7fbdeb130120
[1713684838.383151] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1713684838.383159] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1713684838.383167] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1713684838.383183] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1713684838.383191] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1713684838.383198] [srv25-r203:480403:0]           mpool.c:138  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 8344
[1713684838.383206] [srv25-r203:480403:0]       mpool_set.c:130  UCX  DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 8255, mpools num 3
[1713684838.385224] [srv25-r203:480403:0]          parser.c:2047 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=req UCX_NET_DEVICES=mlx5_0:1
[1713684840.021136] [srv25-r203:480403:0]          ucp_ep.c:403  UCX  DEBUG created ep 0x7fbddd59a000 to <no debug data> from api call
[1713684840.021417] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.021463] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.021488] [srv25-r203:480403:0]      ucp_worker.c:1855 UCX  INFO    perftest inter-node cfg#0 am(rc_verbs/mlx5_0:1)
[1713684840.021506] [srv25-r203:480403:0]          wireup.c:1212 UCX  DEBUG   ep 0x7fbddd59a000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x1
[1713684840.021527] [srv25-r203:480403:0]          wireup.c:1222 UCX  DEBUG   ep 0x7fbddd59a000: lane[0]:  0:rc_verbs/mlx5_0:1.0 md[0]     -> addr[0].md[0]/ib/sysdev[255] rma_bw#0 am am_bw#0 wireup
[1713684840.021539] [srv25-r203:480403:0]          wireup.c:1226 UCX  DEBUG   ep 0x7fbddd59a000: err mode 0, flags 0x0
[1713684840.022422] [srv25-r203:480403:0]        ib_iface.c:1032 UCX  DEBUG   iface=0x7fbdeb131150: created RC QP 0x53c9 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1713684840.022447] [srv25-r203:480403:0]           rc_ep.c:165  UCX  DEBUG   created rc ep 0x7fbdeb08c690
[1713684840.022778] [srv25-r203:480403:0]        memtrack.c:541  UCX  DEBUG   ucs_try_to_mark_as_unmigratable request 7fbddac00000 2600000
[1713684840.022791] [srv25-r203:480403:0]        memtrack.c:544  UCX  DEBUG   Asking for memory protection without the loader is present
[1713684840.023370] [srv25-r203:480403:0]           mpool.c:282  UCX  DEBUG   mpool rc_recv_desc: allocated chunk 0x7fbddac00018 of 39845864 bytes with 4752 elements
[1713684840.029748] [srv25-r203:480403:0]       wireup_ep.c:489  UCX  DEBUG   ep 0x7fbddd59a000: wireup_ep 0x7fbddd3a1340 created next_ep 0x7fbdeb08c690 to <no debug data> using rc_verbs/mlx5_0:1
[1713684840.029795] [srv25-r203:480403:0]        ib_iface.c:797  UCX  DEBUG   iface 0x7fbdeb08b2d0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.12.29.148 sgid_index=3 traffic_class=106
[1713684840.029859] [srv25-r203:480403:0]        ib_iface.c:797  UCX  DEBUG   iface 0x7fbdeb08b2d0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.12.29.148 sgid_index=3 traffic_class=106
[1713684840.029885] [srv25-r203:480403:0]           ud_ep.c:414  UCX  DEBUG   created ep ep=0x7fbdeb0e3110 iface=0x7fbdeb08b2d0 id=0
[1713684840.029900] [srv25-r203:480403:0]           ud_ep.c:523  UCX  DEBUG   mlx5_0:1/RoCE lid 0 qpn 0x53c8 epid 0 ep 0x7fbdeb0e3110 connected to IFACE ::ffff:10.12.29.148pkey 0xffff  qpn 0x1c07
[1713684840.029911] [srv25-r203:480403:0]        ib_iface.c:797  UCX  DEBUG   iface 0x7fbdeb08b2d0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.12.29.148 sgid_index=3 traffic_class=106
[1713684840.029930] [srv25-r203:480403:0]        ib_iface.c:797  UCX  DEBUG   iface 0x7fbdeb08b2d0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.12.29.148 sgid_index=3 traffic_class=106
[1713684840.029993] [srv25-r203:480403:0]        memtrack.c:541  UCX  DEBUG   ucs_try_to_mark_as_unmigratable request 7fbdda400000 600000
[1713684840.030001] [srv25-r203:480403:0]        memtrack.c:544  UCX  DEBUG   Asking for memory protection without the loader is present
[1713684840.030347] [srv25-r203:480403:0]           mpool.c:282  UCX  DEBUG   mpool ud_tx_skb: allocated chunk 0x7fbdda400018 of 6291432 bytes with 1489 elements
[1713684840.031099] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031129] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031155] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031181] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031206] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031231] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031257] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031282] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031307] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031334] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031359] [srv25-r203:480403:0]           async.c:232  UCX  DEBUG   added async handler 0x7fbddd3a1b80 [id=1000007 ref 1] ???() to hash
[1713684840.031413] [srv25-r203:480403:0]       wireup_ep.c:262  UCX  DEBUG   ep 0x7fbddd59a000: wireup_ep 0x7fbddd3a1340 created aux_ep 0x7fbdeb0e3110 to <no debug data> using ud_verbs/mlx5_0:1
[1713684840.031443] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031468] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031492] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031516] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031539] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031563] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031586] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031610] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031633] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031657] [srv25-r203:480403:0]       ib_device.c:1287 UCX  DEBUG   RoCE LAG level on mlx5_0:1 (eth2) is 1
[1713684840.031682] [srv25-r203:480403:0]          wireup.c:1643 UCX  DEBUG ep 0x7fbddd59a000: send wireup request (flags=0x40)
[1713684840.031761] [srv25-r203:480403:0]           mpool.c:282  UCX  DEBUG mpool ucp_requests: allocated chunk 0x7fbddd459084 of 41044 bytes with 128 elements
[1713684840.031834] [srv25-r203:480403:0]        memtrack.c:541  UCX  DEBUG ucs_try_to_mark_as_unmigratable request 7fbdd8e00000 1400000
[1713684840.031842] [srv25-r203:480403:0]        memtrack.c:544  UCX  DEBUG Asking for memory protection without the loader is present
[1713684840.032217] [srv25-r203:480403:0]           mpool.c:282  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7fbdd8e00018 of 20971496 bytes with 4964 elements
[1713684840.034894] [srv25-r203:480403:0]           flush.c:377  UCX  DEBUG flush_worker ep 0x7fbddd59a000
480403: ucp_wireup_ep_flush(ep=0x7fbddd3a1340, aux=0x7fbdeb0e3110, flags=0)
480403: ucp_wireup_ep_flush(ep=0x7fbddd3a1340, aux=0x7fbdeb0e3110, flags=0)
[1713684840.062960] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684840.062969] [srv25-r203:480403:0]           ud_ep.c:1445 UCX  DEBUG ep(0x7fbdeb0e3110): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1713684840.062975] [srv25-r203:480403:0]           ud_ep.c:1448 UCX  DEBUG ep(0x7fbdeb0e3110): resending completed
[1713684840.103011] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 3 in flight: 1
[1713684840.123036] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1

 480403:ucp_wireup_msg_handler(1)!
[1713684840.144893] [srv25-r203:480403:0]        ib_iface.c:797  UCX  DEBUG   iface 0x7fbdeb131150: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.12.29.148 sgid_index=3 traffic_class=106
[1713684840.145376] [srv25-r203:480403:0]        rc_iface.c:934  UCX  DEBUG   connected rc qp 0x53c9 on mlx5_0:1/RoCE to lid 49152(+0) sl 0 remote_qp 0x1c08 mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
480403: ucp_wireup_ep_flush(ep=0x7fbddd3a1340, aux=0x7fbdeb0e3110, flags=0)
480403: ucp_wireup_ep_flush(ep=0x7fbddd3a1340, aux=0x7fbdeb0e3110, flags=0)
[1713684840.163086] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684840.243202] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684840.243226] [srv25-r203:480403:0]           ud_ep.c:1445 UCX  DEBUG ep(0x7fbdeb0e3110): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1713684840.243231] [srv25-r203:480403:0]           ud_ep.c:1448 UCX  DEBUG ep(0x7fbdeb0e3110): resending completed
[1713684840.403387] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684840.723766] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684841.364557] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684842.646028] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684842.646048] [srv25-r203:480403:0]           ud_ep.c:1445 UCX  DEBUG ep(0x7fbdeb0e3110): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1713684842.646052] [srv25-r203:480403:0]           ud_ep.c:1448 UCX  DEBUG ep(0x7fbdeb0e3110): resending completed
[1713684845.208977] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684849.305591] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684853.402251] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684857.498916] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684861.595562] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1
[1713684865.692226] [srv25-r203:480403:0]           ud_ep.c:94   UCX  DEBUG ep: 0x7fbdeb0e3110 ca drop@cwnd = 2 in flight: 1


^C

@yosefe
Copy link
Contributor

yosefe commented Apr 21, 2024

@alex--m I've tried to run same ucx_perftest command on ConnectX 6 (master and v1.15.x) and there was no issue.
Can you try also with UCX master branch without any modifications or debug prints?
Anyway I suggest to open a separate issue for this (since it fails on different HW) with all details according to the template.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants