
Legion + UCX network: slower compared to GASNet #1650

Closed
MoraruMaxim opened this issue Mar 14, 2024 · 20 comments

Comments

@MoraruMaxim

MoraruMaxim commented Mar 14, 2024

Running Legion with UCX results in a significant slowdown: for our test case, UCX is about 2x slower than GASNet+ibv.

We run our test on multiple nodes (CPU only), each with 36 cores (2 sockets) and a ConnectX-4 network card. The 2x slowdown was also observed in single-node runs. We also tried different UCX configurations (e.g. with xpmem configured manually).

Below is an example of a UCX configuration that we have tested:

#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations  --with-verbs --with-mlx5-dv --enable-mt"

#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem

Are we missing some UCX configuration details?
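For readers trying to reproduce this setup: the transport list above matches the kind of output ucx_info prints. A quick sanity check of which transports and devices UCX actually sees on a node might look like this (standard ucx_info invocations; the grep pattern is only illustrative):

# List the transports and devices UCX detects on this node
ucx_info -d | grep -E 'Transport|Device'

# Confirm the configure flags took effect in the installed build
ucx_info -b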

@rohany
Contributor

rohany commented Mar 14, 2024

cc @SeyedMir

@SeyedMir
Contributor

Do you use -ll:bgworkpin 1? If not, can you please rerun with that option?
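For context, -ll:bgworkpin 1 asks Realm to pin its background worker threads, which can reduce jitter from threads migrating across cores. A hypothetical launch line (the binary name and the other -ll: counts are placeholders, not values from this thread):

mpirun -n 2 ./my_app -ll:cpu 16 -ll:util 2 -ll:bgwork 4 -ll:bgworkpin 1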

@MoraruMaxim
Author

I have just re-run with -ll:bgworkpin 1 and obtained similar results.

@SeyedMir
Contributor

What UCX version are you using? Please share the output of ucx_info -v.

@MoraruMaxim
Author

UCX 1.15.0

@SeyedMir
Contributor

Let's get the output with -level ucp=2, and also collect UCX logs by setting UCX_LOG_LEVEL=debug UCX_LOG_FILE=<some_path>/ucx_log.%h.%p UCX_PROTO_INFO=y.
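One way to assemble that (the path and launcher are placeholders; %h and %p are UCX's hostname and process-id substitutions in the log file name):

export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx_log.%h.%p   # %h = hostname, %p = process id
export UCX_PROTO_INFO=y
mpirun -n 2 ./my_app -level ucp=2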

@MoraruMaxim
Author

MoraruMaxim commented Mar 14, 2024

Here are the logs for a small run, single node:

ucp:

[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized ucp context 0x1ff2f30 max_am_header 3945
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: total num_eps 1
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: attached segments
[1 - 14f28eef5dc0]    0.002592 {4}{threads}: reservation ('utility proc 1d00010000000000') cannot be satisfied
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized ucp context 0x1b6d480 max_am_header 3945
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: total num_eps 1
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: attached segments
[0 - 14b82d78fdc0]    0.002654 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
...
[1 - 14f28eef5dc0]  419.778561 {2}{ucp}: detaching segments
[1 - 14f28eef5dc0]  419.778599 {2}{ucp}: ended ucp pollers
[1 - 14f28eef5dc0]  419.870804 {2}{ucp}: unmapped ucp-mapped memory
[1 - 14f28eef5dc0]  420.408270 {2}{ucp}: finalized ucp contexts
[1 - 14f28eef5dc0]  420.411369 {2}{ucp}: finalized ucp bootstrap
[0 - 14b82d78fdc0]  419.753321 {2}{ucp}: detaching segments
[0 - 14b82d78fdc0]  419.753354 {2}{ucp}: ended ucp pollers
[0 - 14b82d78fdc0]  419.840665 {2}{ucp}: unmapped ucp-mapped memory
[0 - 14b82d78fdc0]  420.294596 {2}{ucp}: finalized ucp contexts
[0 - 14b82d78fdc0]  420.297354 {2}{ucp}: finalized ucp bootstrap

UCX:
ucx_log_1.log

@SeyedMir
Contributor

What version of Legion are you using? It seems like you're using a relatively old one.

@MoraruMaxim
Author

It is an older version, corresponding to the following commit: 45afa8e658ae06cb19d8f0374de699b7fe4a197c

Do you believe a newer Legion version would improve the performance when running with UCX?

@SeyedMir
Contributor

Yes, let's test with the latest Legion (or at least something after 13d4101) and then take it from there.

@MoraruMaxim
Author

MoraruMaxim commented Mar 26, 2024

With the latest Legion I obtained better performance. However, UCX is still around 12% slower than GASNet on our test case.

@SeyedMir Is there something else that I could test (e.g. a specific UCX configuration)?

@MoraruMaxim
Author

@SeyedMir would you have other suggestions to improve the Legion+UCX performance?

@SeyedMir
Contributor

Hard to say without profiling. Is this test/code something you can share with me so I can take a look?
Also, can you get UCX logs again and this time also set UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y?

@MoraruMaxim
Author

Let me re-run and obtain the logs.

Our test case is available on GitHub: https://github.com/flecsi/flecsi/tree/2/tutorial/standalone/poisson
Note that it is not implemented directly in Legion; it is implemented using FleCSI.

@MoraruMaxim
Author

Here are the logs for a run on two nodes:
ucx_log_0.log
ucx_log_1.log

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

By hand-tuning our runs (and using the new Legion release) I was able to obtain better results with UCX on a single node (around 15% better than GASNet). However, when I try to run on multiple nodes I get the following error:

[cn355:11558:0:11678] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11558:0:11678] ib_mlx5_log.c:171  RC QP 0x1cc0 wqe[1069]: SEND --e [inl len 84] [rqpn 0x1040 dlid=88 sl=0 port=1 src_path_bits=0]
[cn355:11561:0:11682] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11561:0:11682] ib_mlx5_log.c:171  RC QP 0x1cd8 wqe[12485]: SEND --e [inl len 84] [rqpn 0x15f0 dlid=97 sl=0 port=1 src_path_bits=0]

It looks like there are too many requests and InfiniBand is not able to handle them. Can I change the UCX configuration to avoid this error?

@SeyedMir
Contributor

That signals an issue in the network. For some reason, packets are being dropped, and the underlying network transport (RC, in this case) reaches the maximum retry count and gives up. This is not a UCX or application issue. You can set UCX_RC_RETRY_COUNT to a higher value (the default is 7, I believe) and see if that helps. Though a healthy network should not really need that.
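A sketch of those knobs (the launcher line is a placeholder). Note that the RC retry count is a 3-bit field in the InfiniBand queue pair, so UCX caps it at 7, which is why the warning quoted below appears; lengthening the per-attempt timeout is another knob UCX exposes:

# Retry count: IB hardware limits this 3-bit field to 7
export UCX_RC_RETRY_COUNT=7
# Assumption: a longer RC transport timeout gives each retry more slack
export UCX_RC_TIMEOUT=5.0s
mpirun -n 72 ./my_app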

@SeyedMir
Contributor

I'm curious what tuning helped you get better results.

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

I will contact our cluster administrator to see if he can help. I think 7 is the maximum that we can set for UCX_RC_RETRY_COUNT. I am getting the following warning:

[1713985050.849256] [cn337:8455 :0]        rc_iface.c:526  UCX  WARN  using maximal value for RETRY_COUNT (7) instead of 20

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

> I'm curious what tuning helped you get better results.

Previously we were running with multiple colors per MPI process (launching multiple tasks, which potentially requires more communication). Now we run with multiple threads per MPI process (usually one MPI process per socket), and each process launches OpenMP kernels.

We also increased the problem size for our tests and used the new Legion release (24.03.00).
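A sketch of that process layout with Open MPI-style binding flags (flag spellings vary by launcher; the thread count assumes the 36-core, 2-socket nodes described at the top of the thread):

# One rank per socket, 18 OpenMP threads each, ranks bound to their socket
export OMP_NUM_THREADS=18
mpirun --map-by ppr:1:socket --bind-to socket ./my_app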
