S3D freeze at large node counts #1657
Do these runs have the same bug that @syamajala found with incomplete partitions being marked complete? If so, that can lead to all sorts of undefined behavior in the runtime, including hangs, so I would make sure we don't have that bug in these runs at all before doing any more debugging. |
The branch @elliottslaughter is running does not have those partitions. I have fixed the issue with partitions incorrectly being marked as complete, but it has not solved my problem. I will open a separate issue for that soon. |
For this issue I am running the same code as in #1653. In the initial investigation into that issue I ran a number of checks, including My freeze appears to be sensitive to |
I got a run to freeze at 4096 nodes with
One pattern I see in the backtraces is threads like this:
Otherwise I don't see much going on. |
This looks to me like the network locked up. There are multiple threads on pretty much every single node trying to push active messages into the network, and they are all spinning on some lock in OFI inside a call to |
Since we're discussing a potential network hang, I will just include the network-related environment variables here for posterity:
(I think that And for @PHHargrove @bonachea's benefit, this is GASNet 2023.9.0 without memory kinds enabled. |
FWIW, I spot checked a few more of the threads in the same processes across the backtraces at different times and the threads are not making any progress. The |
Please try with |
Is this going to exacerbate the issue with the number of receive buffers and "unexpected" messages like we saw with the |
@elliottslaughter wrote:
With use of CXI's buggy multi-recv feature disabled via If the application is currently hitting the known bugs in multi-recv that arise under heavy load, then turning off that feature is probably the only surgical alternative from a correctness perspective. The only other (more intrusive) possibility that comes to mind is changing the communication pattern in the hopes of lowering the chances of breaking multi-recv.
The |
@lightsighter wrote:
First a clarification: Setting that aside, you are correct that Regardless of whether or not Moreover, it's worth noting that ofi-conduit does not currently have a native implementation of NPAM at all ( Coincidentally we've just been awarded funding that we hope can be used to improve both of these sub-optimal behaviors in ofi-conduit, but that work is still in its early stages and won't be ready for users for some time. |
Ok, I see where Realm is running a separate polling thread that always pulls active messages off the wire and that should be sufficient to ensure forward progress even if all the background worker threads get stuck in a commit call. I thought we had gotten rid of the polling thread in the GASNetEX module, but apparently not.
So to make sure I'm clear: this bug is sufficient to prevent forward progress? I would kind of hope that even if we were running short of buffers for sending and blocking in the commit calls, eventually the active message polling thread could pull a bunch of messages off the wire, drain the network, and thereby free up resources for doing the sends again, and that would ensure forward progress. Is that not possible here, and something else is going wrong in OFI?
How is that an upshot? 😅
That is promising to hear! |
Thread partially continued in email... |
Several people have been asking me to run experiments, so I am going to report the results of those experiments here. Experiment 1:
|
I should also point out that I am hitting the following warning in my runs:
You can see my variables here: #1657 (comment). I am indeed setting (S3D is a hybrid code and I believe that we do initialize MPI first.) |
I think it's worth noting that Frontier was upgraded to Slingshot 2.1.2 earlier this week (Mar 18 2024, if directory timestamps are to be believed). This upgrade included replacing the installed libfabric CXI provider, portions of Cray MPI, and other components of the system stack. If you've observed changes in network behavior relative to runs preceding Mar 18 2024, then the new Slingshot stack should be high on the list of suspects (not that we can do much about it). Additionally, I'd like to ensure we are not "chasing phantoms" here. Because we are encountering problems, please ensure you've rebuilt all objects and executables from scratch since Mar 18, especially GASNet and anything making MPI calls.
First, I should note for the record that all
This is NOT an endorsement by GASNet of guaranteed stability for this combination of CXI provider settings or suitability for all use cases; IIUC it's just a "better than nothing" default that we've found to be reasonable (especially for "pure GASNet" use cases where Cray MPI is not in use). ofi-conduit issues this warning if it finds one of these settings is already set, partially because they globally affect CXI provider behavior and MPI is known to set some of these even if the user did not (so the warning alerts you to the current settings in use). So in your case the appearance of the warning is expected (because you are explicitly setting Here is some documentation from the
The docs above go out of their way to emphasize that the value
and consult the generated warning message to confirm all three values "survived" to configure CXI. Once that's confirmed, then perhaps a larger run to see if it impacts the deadlocks.
The reason we currently default Based on reading the code and GASNet trace outputs, the per-process memory impact of
entail a total of about 3.5 MiB of GASNet-level receive buffers at each process. This excludes other sources of ofi-conduit buffer memory consumption (notably send buffers, which are controlled independently). |
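To make the order of magnitude concrete, with purely illustrative numbers (the actual buffer count and size settings did not survive into this rendering of the thread): something like 8 receive buffers of 448 KiB each gives 8 × 448 KiB = 3584 KiB ≈ 3.5 MiB per process, which is the figure quoted above; send buffers would add to that separately.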
Thanks. I'll run these tests and get back to you. I do need to rebuild my software. I'll just note that we have given up on the 8 PPN configuration of S3D and run exclusively in a 1 PPN configuration, which we intend to do for the foreseeable future (as we have other, unrelated issues at 8 PPN). Therefore, I think we could push these values even higher (if I understand correctly). |
Amendment to previous post:
In
I'm able to crank Frontier's |
Here is my status report as of tonight. I rebuilt the code from an entirely fresh checkout. This is the version that traces 1 timestep at a time (similar to Experiment 3 in #1657 (comment), but note this is a fresh build now), because that version had the shortest startup time. Experiment 4:
|
I was asked to do runs with libfabric logging enabled. The results are below. This is still the same application configuration as #1657 (comment). Experiment 10:
|
There are too many experiments to track at this point, so I have moved the tracking into a spreadsheet here: https://docs.google.com/spreadsheets/d/1OoDW9ie4uewUWaGHKQTOC8bIavGzQfxPNHE2EwJIY5Q/edit?usp=sharing |
To follow up on this, we did eventually reach a set of variables that allow S3D to run up to 8,192 nodes on Frontier. Having said that, it's not entirely obvious that these variables are actually necessary or sufficient. They may not be necessary because Seshu has been running recently without them, and got up to 8,192 nodes. They may not be sufficient because we have still seen issues: Seshu in #1683 and myself in #1696. I'm not sure what else to say. There is more to do to figure out what actually is necessary and sufficient for running on these networks. |
In the Realm meeting today I think we did identify a potential forward progress issue in the GASNetEX module. My earlier claim, based on investigating the Realm code, that Realm has an independent progress thread for pulling messages off of the wire is not actually correct:
Instead, my first assertion, that we had gotten rid of this progress thread and instead rely on the background worker threads to service incoming active messages, is actually right:
Realm's GASNet1 networking module does have such a progress thread, but the GASNetEX module has indeed removed this polling progress thread. Unfortunately this model of forward progress is only sound if calls to
What this means is that it is possible for all the background worker threads in Realm to get stuck trying to send active messages, leaving no remaining threads available to poll incoming active messages and drain the network. The fact that we see this much more commonly on Slingshot than InfiniBand aligns with the assertion that GASNet is much more precise about when active message sends are going to block on InfiniBand than on Slingshot. If this is the case, I suspect we'll need to alter the architecture of the Realm GASNetEX module to look more like the GASNet1 module in order to guarantee forward progress. @manopapad for determining priorities for addressing this issue |
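To make the two progress models concrete, here is a minimal sketch; this is not Realm or GASNet code, and poll_incoming() and try_send_am() are hypothetical stand-ins for the AM poll and (potentially blocking) AM commit operations.

```cpp
// Minimal sketch of the two progress models discussed above; not Realm or
// GASNet code. poll_incoming() and try_send_am() are hypothetical stand-ins
// for the AM poll and (potentially blocking) AM commit operations.
#include <atomic>
#include <thread>

std::atomic<bool> shutting_down{false};

void poll_incoming() { /* would drain incoming AMs, e.g. via gasnet_AMPoll() */ }
void try_send_am()   { /* may block inside the network stack under pressure  */ }

// Model A (current GASNetEX module): every background worker both sends and
// polls. If every worker blocks inside try_send_am() at the same time, nobody
// is left to poll and the network cannot drain; this is the hazard described
// above.
void background_worker() {
  while (!shutting_down.load()) {
    poll_incoming();
    try_send_am();
  }
}

// Model B (GASNet1-module style): one dedicated thread only ever polls, so
// incoming AMs keep draining even when every sender is stuck.
void dedicated_progress_thread() {
  while (!shutting_down.load()) {
    poll_incoming();
    std::this_thread::yield();
  }
}

int main() {
  std::thread poller(dedicated_progress_thread);
  std::thread worker(background_worker);
  shutting_down.store(true);
  poller.join();
  worker.join();
  return 0;
}
```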
Checked with the UCX team and we should be fine with UCX. ucp_worker_progress will not block the calling thread. Of course, this is not applicable/related to Slingshot in any way as UCX does not have support for Slingshot. |
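For reference, a tiny illustration of why that property matters; this assumes an already-initialized ucp_worker_h and is just the standard UCP progress call, not Legion/Realm code.

```cpp
// ucp_worker_progress() performs whatever work is immediately available and
// returns; it never blocks the calling thread, so any worker thread can call
// it without risking the kind of in-network blocking discussed for the ofi
// path.
#include <ucp/api/ucp.h>

void drain_ucx(ucp_worker_h worker) {
  while (ucp_worker_progress(worker) != 0) {
    // keep draining as long as progress is being made
  }
}
```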
@bonachea Do you suspect that ofi communication calls ( |
@JiakunYan Together, @bonachea and I have looked previously at the libfabric documentation and we determined that it allows for the (unfortunate) behavior you describe. However, we do not know the behavior of any particular providers (such as |
@PHHargrove @bonachea Thanks! To be honest, I would be very surprised if any provider actually behaves in this way in practice. If one does, it seems to me the only way to avoid this issue is to always have dedicated progress threads, which would put most MPI programs at risk of deadlock. |
@JiakunYan |
@PHHargrove Thanks! I see what you mean. I don't understand why they have this weird "retry" requirement. It would still be good to make sure that this is what actually happens (stuck inside libfabric, rather than in the infinite gasnetex send-retry-progress loop). FWIW, I did see libfabric progress being called from @elliottslaughter's freeze backtrace.
(http://sapling.stanford.edu/~eslaught/s3d_freeze_debug2/bt4096-3/bt_frontier00001_81054.log line 322) |
Here are the two earlier backtraces for the same thread on the same node:
It's going around a loop here somewhere. Not sure if that tells us anything or not. I wouldn't necessarily rule out part of the problem also being in the Slingshot driver/hardware not behaving as we would expect. [edit:] Given that
What the backtraces tell me is that Realm is not stuck in ofi send but in the gasnetex try-send-progress loop.
The |
Tracing down the code connected to your previous posting: The backtraces show a call to Quoting from
So, |
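For readers following along, the retry convention being referenced looks roughly like the sketch below. This is not GASNet's actual implementation; ep, tx_cq, and the message arguments are assumed to have been set up elsewhere, and the pattern follows the libfabric documentation's guidance to progress the relevant completion queue(s) and retry when a transmit call returns -FI_EAGAIN.

```cpp
// Sketch of the libfabric retry convention under discussion: when a send
// returns -FI_EAGAIN, the application is expected to progress the relevant
// completion queue(s) and retry. ep, tx_cq, buf, len, and dest are assumed
// to have been set up elsewhere.
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

ssize_t send_with_retry(struct fid_ep *ep, struct fid_cq *tx_cq,
                        const void *buf, size_t len, fi_addr_t dest) {
  ssize_t rc;
  do {
    rc = fi_send(ep, buf, len, /*desc=*/nullptr, dest, /*context=*/nullptr);
    if (rc == -FI_EAGAIN) {
      // Reap completions so the provider can free internal resources. A real
      // implementation would also progress the receive-side CQ here, which is
      // exactly the progress question being debated in this thread.
      struct fi_cq_entry entry;
      (void)fi_cq_read(tx_cq, &entry, 1);  // returns -FI_EAGAIN when empty
    }
  } while (rc == -FI_EAGAIN);
  return rc;
}
```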
@lightsighter wrote (in part)
I agree that the "flow-control gets stuck" behavior we've seen and/or speculated about when speaking w/ HPE would likely lead to the observed lack of progress. |
@PHHargrove Thanks for the explanation! I don't want to trouble you with too many questions. I will just poke around one last possibility.
Just found it also polls the reply ep/cq when it is sending requests. Nevermind. |
@PHHargrove Do you think that if Realm had a separate progress thread here that did nothing but poll for incoming active messages and pull them off the wire, it would make a difference, or would it not matter? |
I think it would, at best, reduce the probability of the Slingshot network stack getting into a "bad state". The same is true of using various environment variables to increase the buffering in the network stack. I would not care to speculate whether that is "good enough" or not. |
If the concern is that all the threads get stuck in a pattern such as the one shown in the stack traces (and assuming that a Slingshot bug is not the root cause), then perhaps there is a better approach than dedicating a progress/polling thread. Imagine you maintain a counting semaphore with initial value one less than the number of threads. Before an operation which might block, you "try down" the semaphore. If that succeeds, then there must be at least one thread not attempting potentially blocking operations. So, you proceed with the (potentially) blocking operation and "up" the semaphore on completion. In the event you fail the "try down", you know that the caller is the "last" thread not to be in a potentially blocking operation. In this case that thread could just begin alternating poll/try-down until obtaining the semaphore indicates at least one other thread is now not in a potentially blocking operation. Alternatively, a failure to obtain the semaphore might be handled in the same manner as "immediate failure" (a return of |
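A minimal sketch of that scheme, assuming C++20; maybe_blocking_send() and poll_network() are hypothetical placeholders for the potentially blocking AM commit and the non-blocking AM poll.

```cpp
// Sketch of the counting-semaphore idea above (C++20). maybe_blocking_send()
// and poll_network() are hypothetical stand-ins for the potentially blocking
// AM commit and the non-blocking AM poll.
#include <semaphore>

constexpr int kNumWorkers = 8;

// One fewer permit than worker threads, so at least one thread is always kept
// out of potentially blocking operations.
std::counting_semaphore<kNumWorkers> may_block{kNumWorkers - 1};

void maybe_blocking_send() { /* may block inside the network stack */ }
void poll_network()        { /* drains incoming traffic; never blocks */ }

void send_one_message() {
  if (!may_block.try_acquire()) {
    // We would be the last thread not already in a potentially blocking call:
    // alternate polling with retrying the "try down" until a permit frees up.
    // (Alternatively, treat this like an "immediate failure" return and defer
    // the send, as suggested at the end of the comment above.)
    do {
      poll_network();
    } while (!may_block.try_acquire());
  }
  // At this point at least one other thread is not in a blocking operation.
  maybe_blocking_send();
  may_block.release();  // "up" the semaphore on completion
}

int main() {
  send_one_message();
  return 0;
}
```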
Realm should come back around and call It's not clear to me if that is equivalent to the polling loop already happening in |
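If I'm reading this exchange correctly, the alternative being described is roughly the loop below (hypothetical names, not Realm's actual code; GASNet-EX does expose a fail-fast mode via its GEX_FLAG_IMMEDIATE flag, which is what try_commit_immediate() is meant to suggest).

```cpp
// Sketch of the "come back around and retry" pattern: ask the commit to fail
// immediately instead of blocking, and do the polling explicitly in the
// caller's own loop. try_commit_immediate() and poll_incoming() are
// hypothetical stand-ins, not actual Realm or GASNet entry points.
bool try_commit_immediate() { return true; /* fails fast instead of blocking */ }
void poll_incoming()        { /* e.g. gasnet_AMPoll() */ }

void send_am_without_blocking() {
  while (!try_commit_immediate()) {
    // Polling between retries keeps incoming AMs draining, which is the same
    // property a blocking commit that polls internally would provide.
    poll_incoming();
  }
}

int main() {
  send_am_without_blocking();
  return 0;
}
```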
I don't believe there should be a difference, based on the dive I took into the code on Wed. |
That is what I would expect as well. That suggests that the fact that Realm does not have an independent progress thread is not actually a forward progress issue. It might cause some performance hiccups but we should not hang as a result since GASNet is going to do polling inside the active message commit to ensure forward progress as necessary. |
FWIW, I'm reasonably confident the failure to return from We were never able to isolate a reproducer we could report for the progress hang (which I'd still like us to do), but I'm aware of at least one internal ticket with a lot of similar characteristics and an MPI reproducer that was very recently fixed. I have no way to predict the timeline for that actually making it to the DoE systems though, and without a reproducer no way to promise that it actually is the same issue. I don't think we ever validated that processing the receives actually would have allowed progress to resume. It seems to me like figuring out how to test that would be a good first step. I'm a little worried that we're trying to architect a solution based on reasoning about software that, because of a bug, is not functioning as expected or in a reasonable way. I'd be more interested in Realm or GASNet detecting when progress has stopped or slowed to a crawl and giving as much information as they can about what they were doing. Even once the software bugs are fixed, the Legion communication patterns tend to be different enough from standard MPI applications that users might need to change some of the default CXI provider settings based on their workload to get reasonable performance. |
As suspected by Mike in our meeting today, we are seeing freezes in S3D at large node counts. This is a preliminary report because I just encountered the issue and am still getting data.
What I know so far:
The rc branch froze at 2048 and 8192 nodes, but ran at 4096 nodes.

I have backtraces from the 2048 node run I just did. Don't blame me for not having -ll:force_kthreads; I was not expecting the freeze and wasn't prepared to collect backtraces, so they are what they are. From an initial scan I haven't seen anything interesting, so likely I will need to rerun with -ll:force_kthreads.

http://sapling.stanford.edu/~eslaught/s3d_freeze_debug2/bt2048/