Legion + UCX network: slower compared to GASNet #1650
cc @SeyedMir
Do you use
I have just re-run with
What UCX version are you using? The output of
UCX
Let's get the output with
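For reference, UCX ships a diagnostic tool, `ucx_info`, that is commonly used to collect this kind of information (the exact command the maintainer asked for is cut off above; the flags below are the commonly documented ones, so check `ucx_info -h` on your build):

```shell
ucx_info -v   # UCX version and build configuration
ucx_info -d   # available devices and transports
ucx_info -c   # current configuration values
ucx_info -f   # full configuration with descriptions
```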
Here are the logs for a small run on a single node, UCP:
UCX:
What version of Legion are you using? It seems like you're using a relatively old one.
It is an older version, which corresponds to the following commit: Do you believe a newer Legion version would improve the performance when running with UCX?
Yes, let's test with the latest Legion (or at least something after
With the latest Legion I obtained better performance. However, UCX is still slower on our test case (around 12% slower). @SeyedMir Is there something else that I could test (e.g. a specific UCX configuration)?
@SeyedMir would you have other suggestions to improve the Legion+UCX performance?
Hard to say without profiling. Is this test/code something you can share with me so I can take a look?
Let me re-run and obtain the logs. Our test case is available on GitHub: https://github.com/flecsi/flecsi/tree/2/tutorial/standalone/poisson
Here are the logs for a run on two nodes.
By hand-tuning our runs (and using the new Legion release) I was able to obtain better results with UCX on a single node (around 15% better than GASNet). However, when I try to run on multiple nodes I obtain the following error:
Looks like there are too many requests and InfiniBand is not able to handle them. Can I change the UCX configuration to avoid this error?
That signals an issue in the network. For some reason, packets are being dropped and the underlying network transport (i.e., RC in this case) reaches the maximum retry count and gives up. This is not a UCX or application issue. You can set
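For reference, a sketch of the retry-related knobs UCX exposes for the RC transport (the specific variable the maintainer suggested is cut off above; these names are from UCX's RC transport configuration, so verify they exist in your build with `ucx_info -f`):

```shell
# Loosen the RC transport's retry behavior (assumption: these variables
# are available in your UCX build; check `ucx_info -f` to confirm).
export UCX_RC_TIMEOUT=1ms        # per-attempt transport timeout
export UCX_RC_RETRY_COUNT=7      # the InfiniBand spec caps this field at 7
export UCX_RC_RNR_RETRY_COUNT=7  # receiver-not-ready retries
```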
I'm curious what tuning helped you get a better result.
I will contact our cluster administrator to see if he can help. I think 7 is the maximum that we can set for
Previously we were running with multiple colors per MPI process (launching multiple tasks that would potentially require more communication). Now we run with multiple threads per MPI process (usually one MPI process per socket). Each process launches OpenMP kernels. We also increased the problem size for our tests and used the new Legion release (24.03.00).
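For illustration, a launch matching that description might look like the following with Open MPI (hypothetical executable name and flags; Slurm or MPICH equivalents differ):

```shell
# One MPI rank per socket on a 36-core (2-socket) node,
# with OpenMP threads filling each socket.
export OMP_NUM_THREADS=18
mpirun -np 2 --map-by socket --bind-to socket ./poisson
```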
Running Legion with UCX results in a significant slowdown (i.e. UCX is about 2 times slower than GASNet+ibv for our test case).
We run our test on multiple nodes (CPU only), each having 36 cores (2 sockets) and equipped with ConnectX-4 network cards. The 2x slowdown was also observed on single-node runs. We also tried different UCX configurations (e.g. with xpmem configured manually). Below is an example of a UCX configuration that we have tested:
Are we missing some UCX configuration details?
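For comparison, a minimal sketch of the kind of UCX settings one might experiment with on ConnectX-4 hardware (hypothetical values chosen for illustration, not the reporter's actual configuration, which was omitted from the report):

```shell
# Restrict UCX to the accelerated RC transport plus shared memory and
# loopback, and pin to a specific HCA port (device name is an assumption;
# list real device names with `ucx_info -d`).
export UCX_TLS=rc_x,sm,self
export UCX_NET_DEVICES=mlx5_0:1
```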