
ib_mlx5: transport retry count exceeded error #9590

Open
denisbertini opened this issue Jan 11, 2024 · 14 comments

@denisbertini

Environment

  • OS: Rocky linux 8.8
  • IB (HDR) fabric
  • openMPI 5.0.1
  • UCX 1.15.0

Error

I recently observed this kind of intermittent error:

[lxbk1127:2330856:0:2330856] ib_mlx5_log.c:177  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[lxbk1127:2330856:0:2330856] ib_mlx5_log.c:177  DCI QP 0xd16 wqe[6168]: SEND s-e [rqpn 0x17de rlid 1037] [va 0x7f9f745cea80 len 5770 lkey 0x15de4] 
[lxbk1127:2330859:0:2330859] ib_mlx5_log.c:177  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[lxbk1127:2330859:0:2330859] ib_mlx5_log.c:177  DCI QP 0xd1e wqe[47519]: SEND s-e [rqpn 0x17db rlid 1037] [inl len 61] 
==== backtrace (tid:2330856) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fada58370a4]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7fada5834070]
 2  /usr/local/ucx/lib/libucs.so.0(ucs_log_default_handler+0xf09) [0x7fada5838bd9]
 3  /usr/local/ucx/lib/libucs.so.0(ucs_log_dispatch+0xdc) [0x7fada5838f8c]
 4  /usr/local/ucx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x24c) [0x7fa59879061c]
 5  /usr/local/ucx/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_handle_failure+0xc7) [0x7fa5987c0517]
 6  /usr/local/ucx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x35) [0x7fa5987923b5]
 7  /usr/local/ucx/lib/ucx/libuct_ib.so.0(+0x57ad7) [0x7fa5987c2ad7]
 8  /usr/local/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fada5cf76aa]
 9  /usr/local/lib/libopen-pal.so.80(opal_progress+0x2c) [0x7fada5fb00bc]
10  /usr/local/lib/libopen-pal.so.80(ompi_sync_wait_mt+0x125) [0x7fada5fe2275]
11  /usr/local/lib/libmpi.so.40(ompi_request_default_wait_all+0x13c) [0x7fadb0afa6bc]
12  /usr/local/lib/libmpi.so.40(PMPI_Waitall+0x6f) [0x7fadb0b4aaff]

At the moment I do not know whether this kind of error is linked to UCX or to the underlying IB fabric.
Any ideas?

@brminich
Contributor

Can you try a basic IB perf test? For example:
server: ib_send_lat -d mlx5_0 -c DC
client: ib_send_lat -d mlx5_0 -c DC $server_ip_addr

@denisbertini
Author

Performing the ib_send_lat tests:

root@server:~# ib_send_lat           

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 RX depth        : 512
 Mtu             : 2048[B]
 Link type       : IB
 Max inline data : 236[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x26d QPN 0x020a PSN 0xf7bdf4
 remote address: LID 0x07 QPN 0x0046 PSN 0xf1afb7
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
Conflicting CPU frequency values detected: 1199.951000 != 2499.847000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 1199.951000 != 2499.847000. CPU Frequency is not max.
 2       1000          0.00           50.67        5.53                5.69             2.62           7.24                 74.21  
---------------------------------------------------------------------------------------

root@client:~# ib_send_lat  server.gsi.de
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 1
 Mtu             : 2048[B]
 Link type       : IB
 Max inline data : 236[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x0046 PSN 0xf1afb7
 remote address: LID 0x26d QPN 0x020a PSN 0xf7bdf4
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
Conflicting CPU frequency values detected: 1499.569000 != 1466.477000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 1498.379000 != 1544.324000. CPU Frequency is not max.
 2       1000          1.33           18.59        1.39                1.40             0.08            1.83                 18.59  
---------------------------------------------------------------------------------------

Instead of DC, RC is used ... (DC is not supported, for some reason)

@brminich
Contributor

DC is used in the original error. What is the application, and what are its command line parameters?

@denisbertini
Author

Which application? The one which triggers the errors?

@denisbertini
Author

This problem does not depend on the application but is linked to the Open MPI + UCX layer. We have already observed it in several independent codes.

This can be related to this old issue:
#6669

@brminich
Contributor

Yes, I meant the app which triggers the error.
"Transport retry count exceeded" is a very generic symptom; I am not sure it is really related to #6669.
E.g., does it happen in 100% of the cases, and at what scale?

@denisbertini
Author

The simulation which triggers these errors was running on 8 separate nodes, each containing 8 AMD MI100 GPU devices.
It uses MPI+UCX to communicate between the GPUs intra- and inter-node, with one MPI rank associated to one GPU device.
If you suspect the DC transport causes the problem, one can remove it using UCX_TLS, right?

@denisbertini
Author

BTW, a particularity of our IB fabric is that we use HDR splitter cables everywhere, and all HDR switches are set to full splitting mode.
Do you think this could be related somehow?

@brminich
Contributor

You can try to run the app without DC by setting UCX_TLS=^dc.
BTW, why is the ib_send_lat test not working with DC in your environment?
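For an Open MPI launch, the env var would typically be exported to all ranks; a minimal sketch, where the application name and rank count are placeholders, not from this thread:

```shell
# Disable UCX's DC transport: the '^' prefix excludes the listed
# transports, so UCX falls back to the remaining ones (e.g. RC).
# './my_app' and '-np 64' are hypothetical placeholders.
export UCX_TLS=^dc
mpirun -np 64 -x UCX_TLS ./my_app
```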

@denisbertini
Author

No idea for the moment!

@denisbertini
Author

I did not run the ib_send_lat tests myself; I asked our sysadmin colleague to run them.

@gabrieleiannetti

gabrieleiannetti commented Jan 17, 2024

Hi @brminich,

indeed we have faced a problem with DC on a pair of nodes in the HDR fabric:

# SERVER
$ ib_send_lat --all --CPU-freq --perform_warm_up --connection=DC --rdma_cm
---------------------------------------------------------------------------------------
 DC does not support RDMA_CM

We haven't solved that yet.

So we executed the test on other nodes on a different leaf switch, with a few idle nodes connected to it via single cables instead of splitter cables.

# SERVER
$ ib_send_lat --all --CPU-freq --comm_rdma_cm --perform_warm_up --connection=DC --iters=1000000

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 2            Transport type : IB
 Connection type : DC           Using SRQ      : ON
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 150[B]
 rdma_cm QPs     : OFF
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x418 QPN 0x0048 PSN 0xc8d251
 local address: LID 0x418 QPN 0x12ca PSN 0x602f9e
 remote address: LID 0x415 QPN 0x0048 PSN 0x692bd2
 remote address: LID 0x415 QPN 0x12ca PSN 0xfc318b
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000000          1.18           71.69        1.25             1.26             0.17            1.35                    1.91
 4       1000000          1.18           8.04         1.25             1.26             0.02            1.35                    1.82
 8       1000000          1.18           7.36         1.25             1.26             0.02            1.35                    1.80
 16      1000000          1.18           7.51         1.25             1.26             0.02            1.35                    1.83
 32      1000000          1.18           7.16         1.26             1.27             0.02            1.34                    1.86
 64      1000000          1.19           8.34         1.27             1.27             0.02            1.36                    1.84
 128     1000000          1.20           6.86         1.28             1.29             0.02            1.39                    1.84
 256     1000000          1.67           8.63         1.76             1.78             0.02            1.90                    2.31
 512     1000000          1.70           7.54         1.78             1.80             0.02            1.93                    2.37
 1024    1000000          1.76           8.26         1.84             1.87             0.02            1.97                    2.44
 2048    1000000          1.89           8.03         1.97             1.98             0.02            2.10                    2.55
 4096    1000000          2.21           8.33         2.30             2.32             0.02            2.44                    2.85
 8192    1000000          2.44           9.12         2.59             2.60             0.02            2.74                    3.15
 16384   1000000          2.96           9.80         3.08             3.09             0.02            3.21                    3.53
 32768   1000000          3.81           10.81        3.90             3.91             0.02            4.06                    4.45
 65536   1000000          5.38           11.84        5.50             5.51             0.02            5.64                    6.00
 131072  1000000          8.05           14.63        8.15             8.17             0.03            8.32                    8.79
 262144  1000000          13.44          20.21        13.56            13.58            0.03            13.70                   14.12
 524288  1000000          24.12          30.95        24.23            24.24            0.04            24.39                   24.84
 1048576 1000000          45.45          52.37        45.64            45.65            0.05            45.81                   46.30
 2097152 1000000          88.12          94.85        88.26            88.28            0.06            88.46                   88.99
 4194304 1000000          173.41         180.36       173.53           173.54           0.08            173.74                  174.36
 8388608 1000000          343.92         350.71       344.06           344.08           0.10            344.32                  345.18
---------------------------------------------------------------------------------------

# CLIENT
ib_send_lat --all --CPU-freq --comm_rdma_cm --perform_warm_up --connection=DC --iters=1000000 ${SERVER}

---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 2            Transport type : IB
 Connection type : DC           Using SRQ      : ON
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 150[B]
 rdma_cm QPs     : OFF
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x415 QPN 0x0048 PSN 0x692bd2
 local address: LID 0x415 QPN 0x12ca PSN 0xfc318b
 remote address: LID 0x418 QPN 0x0048 PSN 0xc8d251
 remote address: LID 0x418 QPN 0x12ca PSN 0x602f9e
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000000          1.18           71.67        1.25             1.26             0.17            1.34                    1.92
 4       1000000          1.18           8.02         1.25             1.26             0.02            1.34                    1.82
 8       1000000          1.18           7.39         1.25             1.26             0.02            1.34                    1.81
 16      1000000          1.18           7.46         1.25             1.26             0.02            1.34                    1.83
 32      1000000          1.18           7.14         1.25             1.27             0.02            1.34                    1.86
 64      1000000          1.19           8.40         1.27             1.27             0.02            1.36                    1.84
 128     1000000          1.21           6.90         1.28             1.29             0.02            1.38                    1.84
 256     1000000          1.68           8.67         1.77             1.78             0.02            1.90                    2.33
 512     1000000          1.70           7.63         1.78             1.80             0.02            1.93                    2.38
 1024    1000000          1.76           8.23         1.84             1.87             0.02            1.97                    2.44
 2048    1000000          1.89           7.99         1.97             1.98             0.02            2.10                    2.56
 4096    1000000          2.21           8.37         2.30             2.32             0.02            2.44                    2.87
 8192    1000000          2.46           9.11         2.59             2.60             0.02            2.71                    3.17
 16384   1000000          2.95           9.82         3.08             3.09             0.02            3.20                    3.62
 32768   1000000          3.82           10.84        3.90             3.91             0.02            4.03                    4.46
 65536   1000000          5.38           11.82        5.50             5.51             0.02            5.64                    6.00
 131072  1000000          8.04           14.63        8.15             8.17             0.03            8.30                    8.79
 262144  1000000          13.45          20.12        13.56            13.58            0.03            13.70                   14.11
 524288  1000000          24.12          31.00        24.23            24.24            0.04            24.39                   24.84
 1048576 1000000          45.47          52.29        45.64            45.65            0.05            45.82                   46.30
 2097152 1000000          88.12          94.74        88.27            88.28            0.06            88.47                   88.99
 4194304 1000000          173.41         180.40       173.53           173.54           0.08            173.75                  174.35
 8388608 1000000          343.91         350.65       344.06           344.08           0.10            344.33                  345.19
---------------------------------------------------------------------------------------

@brminich
Contributor

OK, thanks for the confirmation. Can you please clarify the following items:

  1. Does the error always happen, or is it a rare occurrence?
  2. If it's reproducible, can you please try to run the relevant simulation with the UCX_TLS=^dc env var? This would help to check whether the problem is DC-related or not.
  3. Can you please reproduce the original issue with UCX_LOG_LEVEL=debug and provide me with the output?
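For item 3, one way to collect the debug output is sketched below. The log path, application name, and rank count are placeholders; UCX_LOG_FILE with its %h/%p host/pid substitutions is an assumption based on UCX's logging options.

```shell
# Capture UCX debug output per process instead of mixing it into stdout.
# '/tmp/ucx_%h_%p.log', './my_app' and '-np 64' are hypothetical;
# %h/%p expansion by UCX_LOG_FILE is assumed, not taken from this thread.
export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx_%h_%p.log
mpirun -np 64 -x UCX_LOG_LEVEL -x UCX_LOG_FILE ./my_app
```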

@gabrieleiannetti

gabrieleiannetti commented Jan 17, 2024

Hi @brminich,

thanks for looking at it.

What do you think about the posted runtimes?

Is it a problem if there is a significant difference between t_max[usec] and t_min[usec], or is only t_stdev relevant here? In other words, if t_stdev is near 0, do we have no latency problem?

Example:

#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
2       1000000          1.18           71.69        1.25             1.26             0.17            1.35                    1.91
4       1000000          1.18           8.04         1.25             1.26             0.02            1.35                    1.82
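A quick numeric sketch (illustrative values only, loosely modeled on the 2-byte row above) shows why a single outlier can dominate t_max while barely moving t_stdev:

```shell
# One million samples: 999,999 at a typical 1.25 us plus a single
# 71.69 us outlier. The max explodes, the standard deviation barely moves.
awk 'BEGIN {
  n = 1000000
  for (i = 1; i < n; i++) { x = 1.25; sum += x; sumsq += x * x }
  x = 71.69; sum += x; sumsq += x * x
  mean = sum / n
  sd = sqrt(sumsq / n - mean * mean)
  printf "mean=%.3f stdev=%.3f max=%.2f\n", mean, sd, x
}'
# → mean=1.250 stdev=0.070 max=71.69
```

So over a run of this length, t_stdev and the percentiles are more informative than t_max, which a single transient hiccup can dominate.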

Best
Gabriele
