
Legion + UCX network: slower compared to GASNet #1650

Closed
MoraruMaxim opened this issue Mar 14, 2024 · 20 comments

Comments

@MoraruMaxim

MoraruMaxim commented Mar 14, 2024

Running Legion with UCX results in a significant slowdown: for our test case, UCX is about 2x slower than GASNet+ibv.

We run our test on multiple nodes (CPU only), each with 36 cores (2 sockets) and a ConnectX-4 network card. The 2x slowdown was also observed in single-node runs. We also tried different UCX configurations (e.g. with xpmem configured manually).

Below is an example of a UCX configuration that we have tested:

#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations  --with-verbs --with-mlx5-dv --enable-mt"

#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem

Are we missing some UCX configuration details?
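For readers trying to reproduce this setup: the transport list above matches the kind of output ucx_info prints. A quick sanity check of which transports and devices UCX actually sees on a node might look like this (standard ucx_info invocations; the grep pattern is only illustrative):

# List the transports and devices UCX detects on this node
ucx_info -d | grep -E 'Transport|Device'

# Confirm the configure flags took effect in the installed build
ucx_info -b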

@rohany
Contributor

rohany commented Mar 14, 2024

cc @SeyedMir

@SeyedMir
Contributor

Do you use -ll:bgworkpin 1? If not, can you please rerun with that option?
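For context, -ll:bgworkpin 1 asks Realm to pin its background worker threads, which can reduce jitter from threads migrating across cores. A hypothetical launch line (the binary name and the other -ll: counts are placeholders, not values from this thread):

mpirun -n 2 ./my_app -ll:cpu 16 -ll:util 2 -ll:bgwork 4 -ll:bgworkpin 1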

@MoraruMaxim
Author

I have just re-run with -ll:bgworkpin 1 and obtained similar results.

@SeyedMir
Contributor

What UCX version are you using? Please share the output of ucx_info -v.

@MoraruMaxim
Author

UCX 1.15.0

@SeyedMir
Contributor

Let's get the output with -level ucp=2, and also collect UCX logs by setting UCX_LOG_LEVEL=debug UCX_LOG_FILE=<some_path>/ucx_log.%h.%p UCX_PROTO_INFO=y.
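One way to assemble that (the path and launcher are placeholders; %h and %p are UCX's hostname and process-id substitutions in the log file name):

export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx_log.%h.%p   # %h = hostname, %p = process id
export UCX_PROTO_INFO=y
mpirun -n 2 ./my_app -level ucp=2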

@MoraruMaxim
Author

MoraruMaxim commented Mar 14, 2024

Here are the logs for a small run, single node:

ucp:

[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized ucp context 0x1ff2f30 max_am_header 3945
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: total num_eps 1
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: attached segments
[1 - 14f28eef5dc0]    0.002592 {4}{threads}: reservation ('utility proc 1d00010000000000') cannot be satisfied
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized ucp context 0x1b6d480 max_am_header 3945
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: total num_eps 1
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: attached segments
[0 - 14b82d78fdc0]    0.002654 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
...
[1 - 14f28eef5dc0]  419.778561 {2}{ucp}: detaching segments
[1 - 14f28eef5dc0]  419.778599 {2}{ucp}: ended ucp pollers
[1 - 14f28eef5dc0]  419.870804 {2}{ucp}: unmapped ucp-mapped memory
[1 - 14f28eef5dc0]  420.408270 {2}{ucp}: finalized ucp contexts
[1 - 14f28eef5dc0]  420.411369 {2}{ucp}: finalized ucp bootstrap
[0 - 14b82d78fdc0]  419.753321 {2}{ucp}: detaching segments
[0 - 14b82d78fdc0]  419.753354 {2}{ucp}: ended ucp pollers
[0 - 14b82d78fdc0]  419.840665 {2}{ucp}: unmapped ucp-mapped memory
[0 - 14b82d78fdc0]  420.294596 {2}{ucp}: finalized ucp contexts
[0 - 14b82d78fdc0]  420.297354 {2}{ucp}: finalized ucp bootstrap

UCX:
ucx_log_1.log

@SeyedMir
Contributor

What version of Legion are you using? It seems like you're using a relatively old one.

@MoraruMaxim
Author

It is an older version, corresponding to the following commit: 45afa8e658ae06cb19d8f0374de699b7fe4a197c

Do you believe a newer Legion version would improve the performance when running with UCX?

@SeyedMir
Contributor

Yes, let's test with the latest Legion (or at least something after 13d4101) and then take it from there.

@MoraruMaxim
Author

MoraruMaxim commented Mar 26, 2024

With the latest Legion I obtained better performance. However, UCX is still around 12% slower than GASNet on our test case.

@SeyedMir Is there something else that I could test (e.g. a specific UCX configuration)?

@MoraruMaxim
Author

@SeyedMir would you have other suggestions to improve the Legion+UCX performance?

@SeyedMir
Contributor

Hard to say without profiling. Is this test/code something you can share with me so I can take a look?
Also, can you get UCX logs again and this time also set UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y?

@MoraruMaxim
Author

Let me re-run and obtain the logs.

Our test case is available on GitHub: https://github.com/flecsi/flecsi/tree/2/tutorial/standalone/poisson
Note that it is not implemented directly in Legion; it is implemented using FleCSI.

@MoraruMaxim
Author

Here are the logs for a run on two nodes:
ucx_log_0.log
ucx_log_1.log

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

By hand-tuning our runs (and using the new Legion release) I was able to obtain better results with UCX on a single node (around 15% better than GASNet). However, when I try to run on multiple nodes I get the following error:

[cn355:11558:0:11678] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11558:0:11678] ib_mlx5_log.c:171  RC QP 0x1cc0 wqe[1069]: SEND --e [inl len 84] [rqpn 0x1040 dlid=88 sl=0 port=1 src_path_bits=0]
[cn355:11561:0:11682] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11561:0:11682] ib_mlx5_log.c:171  RC QP 0x1cd8 wqe[12485]: SEND --e [inl len 84] [rqpn 0x15f0 dlid=97 sl=0 port=1 src_path_bits=0]

It looks like there are too many requests and InfiniBand is not able to handle them. Can I change the UCX configuration to avoid this error?

@SeyedMir
Contributor

That signals an issue in the network. For some reason, packets are being dropped, and the underlying network transport (RC, in this case) reaches the maximum retry count and gives up. This is not a UCX or application issue. You can set UCX_RC_RETRY_COUNT to a higher value (the default is 7, I believe) and see if that helps. Though a healthy network should not really need that.
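A sketch of those knobs (the launcher line is a placeholder). Note that the RC retry count is a 3-bit field in the InfiniBand queue pair, so UCX caps it at 7, which is why the warning quoted below appears; lengthening the per-attempt timeout is another knob UCX exposes:

# Retry count: IB hardware limits this 3-bit field to 7
export UCX_RC_RETRY_COUNT=7
# Assumption: a longer RC transport timeout gives each retry more slack
export UCX_RC_TIMEOUT=5.0s
mpirun -n 72 ./my_app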

@SeyedMir
Contributor

I'm curious what tuning helped you get better results.

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

I will contact our cluster administrator to see if he can help. I think 7 is the maximum that we can set for UCX_RC_RETRY_COUNT. I am getting the following warning:

[1713985050.849256] [cn337:8455 :0]        rc_iface.c:526  UCX  WARN  using maximal value for RETRY_COUNT (7) instead of 20

@MoraruMaxim
Author

MoraruMaxim commented Apr 24, 2024

> I'm curious what tuning helped you get better results.

Previously we were running with multiple colors per MPI process (launching multiple tasks, which potentially requires more communication). Now we run with multiple threads per MPI process (usually one MPI process per socket), and each process launches OpenMP kernels.

We also increased the problem size for our tests and used the new Legion release (24.03.00).
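A sketch of that process layout with Open MPI-style binding flags (flag spellings vary by launcher; the thread count assumes the 36-core, 2-socket nodes described at the top of the thread):

# One rank per socket, 18 OpenMP threads each, ranks bound to their socket
export OMP_NUM_THREADS=18
mpirun --map-by ppr:1:socket --bind-to socket ./my_app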
