ib_mlx5: transport retry count exceeded error #9590
Comments
can you try to check basic ib perf test? Like |
Performing the
Instead of DC, RC is used ... (DC is not supported for some reason) |
DC is used in the original error. What is the application and what are its command line parameters? |
Which application? The one that triggers the errors? |
This problem does not depend on the application but is linked to the Open MPI + UCX layer. We have already observed it with several independent codes. This may be related to this old issue: |
Yes, I meant the app which triggers the error. |
The simulation that triggers these errors was running on 8 separate nodes, each containing 8 AMD MI100 GPUs. |
BTW, a particularity of our IB fabric is that we use HDR splitter cables everywhere, and all HDR switches are set to full splitting mode. |
You can try to run the app without DC by setting |
No idea for the moment! |
I did not do the tests |
Hi @brminich, indeed we have faced a problem with DC on a pair of nodes in the HDR fabric:
# SERVER
$ ib_send_lat --all --CPU-freq --perform_warm_up --connection=DC --rdma_cm
---------------------------------------------------------------------------------------
DC does not support RDMA_CM
We haven't solved that yet. So we ran the test on other nodes, attached to a different leaf switch that has a few idle nodes connected via single cables instead of splitter cables.
# SERVER
$ ib_send_lat --all --CPU-freq --comm_rdma_cm --perform_warm_up --connection=DC --iters=1000000
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 2 Transport type : IB
Connection type : DC Using SRQ : ON
PCIe relax order: ON
ibv_wr* API : ON
RX depth : 512
Mtu : 4096[B]
Link type : IB
Max inline data : 150[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0x418 QPN 0x0048 PSN 0xc8d251
local address: LID 0x418 QPN 0x12ca PSN 0x602f9e
remote address: LID 0x415 QPN 0x0048 PSN 0x692bd2
remote address: LID 0x415 QPN 0x12ca PSN 0xfc318b
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000000 1.18 71.69 1.25 1.26 0.17 1.35 1.91
4 1000000 1.18 8.04 1.25 1.26 0.02 1.35 1.82
8 1000000 1.18 7.36 1.25 1.26 0.02 1.35 1.80
16 1000000 1.18 7.51 1.25 1.26 0.02 1.35 1.83
32 1000000 1.18 7.16 1.26 1.27 0.02 1.34 1.86
64 1000000 1.19 8.34 1.27 1.27 0.02 1.36 1.84
128 1000000 1.20 6.86 1.28 1.29 0.02 1.39 1.84
256 1000000 1.67 8.63 1.76 1.78 0.02 1.90 2.31
512 1000000 1.70 7.54 1.78 1.80 0.02 1.93 2.37
1024 1000000 1.76 8.26 1.84 1.87 0.02 1.97 2.44
2048 1000000 1.89 8.03 1.97 1.98 0.02 2.10 2.55
4096 1000000 2.21 8.33 2.30 2.32 0.02 2.44 2.85
8192 1000000 2.44 9.12 2.59 2.60 0.02 2.74 3.15
16384 1000000 2.96 9.80 3.08 3.09 0.02 3.21 3.53
32768 1000000 3.81 10.81 3.90 3.91 0.02 4.06 4.45
65536 1000000 5.38 11.84 5.50 5.51 0.02 5.64 6.00
131072 1000000 8.05 14.63 8.15 8.17 0.03 8.32 8.79
262144 1000000 13.44 20.21 13.56 13.58 0.03 13.70 14.12
524288 1000000 24.12 30.95 24.23 24.24 0.04 24.39 24.84
1048576 1000000 45.45 52.37 45.64 45.65 0.05 45.81 46.30
2097152 1000000 88.12 94.85 88.26 88.28 0.06 88.46 88.99
4194304 1000000 173.41 180.36 173.53 173.54 0.08 173.74 174.36
8388608 1000000 343.92 350.71 344.06 344.08 0.10 344.32 345.18
---------------------------------------------------------------------------------------
# CLIENT
$ ib_send_lat --all --CPU-freq --comm_rdma_cm --perform_warm_up --connection=DC --iters=1000000 ${SERVER}
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 2 Transport type : IB
Connection type : DC Using SRQ : ON
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 150[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0x415 QPN 0x0048 PSN 0x692bd2
local address: LID 0x415 QPN 0x12ca PSN 0xfc318b
remote address: LID 0x418 QPN 0x0048 PSN 0xc8d251
remote address: LID 0x418 QPN 0x12ca PSN 0x602f9e
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000000 1.18 71.67 1.25 1.26 0.17 1.34 1.92
4 1000000 1.18 8.02 1.25 1.26 0.02 1.34 1.82
8 1000000 1.18 7.39 1.25 1.26 0.02 1.34 1.81
16 1000000 1.18 7.46 1.25 1.26 0.02 1.34 1.83
32 1000000 1.18 7.14 1.25 1.27 0.02 1.34 1.86
64 1000000 1.19 8.40 1.27 1.27 0.02 1.36 1.84
128 1000000 1.21 6.90 1.28 1.29 0.02 1.38 1.84
256 1000000 1.68 8.67 1.77 1.78 0.02 1.90 2.33
512 1000000 1.70 7.63 1.78 1.80 0.02 1.93 2.38
1024 1000000 1.76 8.23 1.84 1.87 0.02 1.97 2.44
2048 1000000 1.89 7.99 1.97 1.98 0.02 2.10 2.56
4096 1000000 2.21 8.37 2.30 2.32 0.02 2.44 2.87
8192 1000000 2.46 9.11 2.59 2.60 0.02 2.71 3.17
16384 1000000 2.95 9.82 3.08 3.09 0.02 3.20 3.62
32768 1000000 3.82 10.84 3.90 3.91 0.02 4.03 4.46
65536 1000000 5.38 11.82 5.50 5.51 0.02 5.64 6.00
131072 1000000 8.04 14.63 8.15 8.17 0.03 8.30 8.79
262144 1000000 13.45 20.12 13.56 13.58 0.03 13.70 14.11
524288 1000000 24.12 31.00 24.23 24.24 0.04 24.39 24.84
1048576 1000000 45.47 52.29 45.64 45.65 0.05 45.82 46.30
2097152 1000000 88.12 94.74 88.27 88.28 0.06 88.47 88.99
4194304 1000000 173.41 180.40 173.53 173.54 0.08 173.75 174.35
8388608 1000000 343.91 350.65 344.06 344.08 0.10 344.33 345.19
--------------------------------------------------------------------------------------- |
ok thanks for the confirmation. Can you please clarify the following items:
|
Hi @brminich, thanks for looking at it. What do you think about the posted runtimes... Is it a problem if we get a significant difference between
Example:
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000000 1.18 71.69 1.25 1.26 0.17 1.35 1.91
4 1000000 1.18 8.04 1.25 1.26 0.02 1.35 1.82
Best |
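To put a number on the spread in the rows quoted above, here is a minimal Python sketch that parses `ib_send_lat` result rows and reports how far `t_max` sits above `t_typical`. The names `LatRow`, `parse_rows` and `max_over_typical` are made up for illustration and are not part of perftest or UCX. The parser assumes the nine-column layout shown in this thread.

```python
from typing import List, NamedTuple


class LatRow(NamedTuple):
    """One result row of ib_send_lat (nine-column layout, as posted in this thread)."""
    size_bytes: int
    iterations: int
    t_min: float
    t_max: float
    t_typical: float
    t_avg: float
    t_stdev: float
    p99: float
    p999: float


def parse_rows(text: str) -> List[LatRow]:
    """Keep only lines with nine numeric fields; headers and separators are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue
        try:
            rows.append(LatRow(int(fields[0]), int(fields[1]),
                               *(float(f) for f in fields[2:])))
        except ValueError:
            continue  # nine fields, but not all numeric
    return rows


def max_over_typical(row: LatRow) -> float:
    """Ratio of the single worst sample to the typical latency."""
    return row.t_max / row.t_typical


# The two rows quoted in the comment above (server side).
SAMPLE = """\
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000000 1.18 71.69 1.25 1.26 0.17 1.35 1.91
4 1000000 1.18 8.04 1.25 1.26 0.02 1.35 1.82
"""

rows = parse_rows(SAMPLE)
for row in rows:
    print(f"{row.size_bytes:>8} B  t_max/t_typical = {max_over_typical(row):6.1f}  "
          f"p99.9 = {row.p999:.2f} usec")
```

For the 2-byte row, `t_max` is roughly 57 times `t_typical`, while the 99.9% percentile is 1.91 usec, so fewer than 0.1% of the 1,000,000 iterations were slower than that. Whether such rare outliers have anything to do with the retry-count error is not something these numbers can establish.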
Environment
Error
I recently observed this kind of intermittent error:
At the moment I do not know whether this kind of error is linked to UCX or to the underlying IB fabric.
Any idea(s)?