Static Control Replication End-of-Life #1528
No users are currently using static control replication; they have all migrated to dynamic control replication. The cost of maintaining static control replication is high for both Regent (compiler passes, data structures, and optimizations) and for Legion (must-epoch launches and their interactions with all sorts of things). @elliottslaughter and I are tentatively planning to end-of-life static control replication and its constituent features by the end of the year, contingent upon two requirements:
If there are concerns or other requirements for removing static control replication support, please add them here. This will be discussed in a future Legion group meeting after the summer.
Side note: users that currently rely on must-epoch launches for collective libraries can get the same safety guarantees from lighter-weight concurrent index space task launches.
Comments
Here are the results of running Task Bench on Frontier up to 256 nodes:
In the results, "Regent" means Regent with SCR and "Legion" means Legion with DCR. Legion "sr" refers to the
Some overall observations:
I'm happy with these results (though there is still some room for improvement). That was the last blocker on my side. I am not aware of any reasons to avoid removing SCR. I would request that we schedule the final SCR removal for after the
Would it be possible to get some profiles of the 4 rank/node cases with 64, 128, and 256 nodes? I'd just like to see what they look like and confirm that hypothesis as to the source of the slowdown. The barriers really only show up in the dependences for the meta-tasks and not for the actual trace we're capturing and replaying, so slowdowns like this would only be caused by not replaying the trace fast enough compared to the actual execution of the trace.
Good to confirm, but not surprising since trace capture and replay are the same in both branches.
Agreed. No need to do it before then. It does mean that I'll stop worrying about performance optimizations for things like must-epoch launches, but they will remain functional so the CI continues to pass.
Just looking at the difference between the 64 node and 256 node profiles, the primary difference seems to be in the copies on the channels. In both cases the tasks take around 1.4-1.5ms, and the utility processors are underloaded doing the trace replays in both cases. However, in the 64 node case all the copies take <100us, while in the 256 node case there are periodic copies that take 1-3ms to move 16B of data from one node to another. The pattern repeats every so many iterations of the code, so whatever is going wrong is deterministically bad, but it only happens on some of the nodes (e.g. nodes 2 and 3 in the 256 node run). It's hard for me to explain why the copies are slower, and only on those two nodes (at least as far as I can see). @artempriakhin @muraj @streichler any ideas on what to do next to investigate why specific copies are slow on those nodes?
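For what it's worth, one way to quantify that pattern is to pull the copy intervals out of the profile and flag the outliers per (source, destination) pair. The sketch below is only an illustration: it assumes the copies have already been exported to a CSV with start/stop timestamps in microseconds, byte counts, and source/destination node ids; the column names and the export step are assumptions for the example, not something Legion Prof provides in exactly this form.

```python
# Hypothetical post-processing sketch (not a Legion Prof feature): flag copies
# that are far slower than the typical copy of the same size between the same
# pair of nodes. Assumes a CSV with columns:
#   src_node, dst_node, bytes, start_us, stop_us
import csv
from collections import defaultdict
from statistics import median

def find_slow_copies(path, factor=10.0):
    durations = defaultdict(list)   # (src, dst, bytes) -> [duration_us, ...]
    rows = []
    with open(path) as f:
        for row in csv.DictReader(f):
            d = float(row["stop_us"]) - float(row["start_us"])
            key = (int(row["src_node"]), int(row["dst_node"]), int(row["bytes"]))
            durations[key].append(d)
            rows.append((key, d))
    meds = {k: median(v) for k, v in durations.items()}
    # A copy is "slow" if it takes `factor` times the median for its key.
    return [(k, d) for k, d in rows if d > factor * meds[k]]

if __name__ == "__main__":
    for (src, dst, nbytes), dur in find_slow_copies("copies.csv"):
        print(f"{src} -> {dst}: {nbytes} B took {dur:.0f} us")
```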
Is it possible that a non-scalable barrier is causing the copies to get backed up? After all, the barriers themselves won't show up, so if one node just gets inexplicably slow, that seems like a possible culprit.
No, there won't be any non-scalable barriers in the generated Realm event graph. All the barriers used for sharded trace replays will be between adjacent shards here. The barriers that span all the nodes only affect when the runtime is allowed to issue the trace replays, and clearly the runtime is well ahead in issuing the trace replays here. Also, if the barriers were the issue, then they would delay the copies before they even started, but that isn't what is happening here. The copies have had all their preconditions met and have actually started running according to Realm's own timing, and are taking longer to run on some nodes than on others.
I'm not suggesting that the copies depend on the barriers. They're totally unrelated, as you said, and the runtime is running ahead. But if they just happen to execute simultaneously, having one node spew O(Ranks) messages for barriers could easily create head-of-line blocking for completely unrelated DMAs, which could result in the behavior we're seeing.
That is indeed a fair point and certainly could be the cause of the problem. Do you think you can add some more of the small-numbered node profiles here (you can remove some of the larger-numbered nodes since they look fine)? I'd like to see if more of the small-numbered nodes exhibit similar issues.
What do you mean by "small-numbered nodes"? In the 256-node profile, we should be capturing nodes 0-74 (remember, this is 4 ranks per node so this is ranks 0-299). I'm happy to render any subset of the nodes, but you're already looking at the lowest rank numbers.
At least when I look at the 256-node profile it jumps from
That may be a profiler bug. I passed I can try to massage the profile, maybe passing a subset of logs instead of the
It also jumped from
Hm, how do we assign node numbers in profiles? Do we just assume the first log file is node 0 and the second one is node 1 and so on?
We assign file names (using the
I was thinking more of the parser end of it, but I guess it comes in through a logger statement (and is therefore hard to mess up):
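The logger line itself isn't reproduced above, so purely as an illustration of the idea: if each log file begins with a prologue line stating which node produced it, the parser can take the node id from that line rather than from the order the files were passed in. The line format and regex below are invented for this sketch and are not the actual Legion Prof format.

```python
# Hypothetical sketch: read the node id from a prologue line in each log file,
# e.g. a line containing "node 2", instead of assuming the files are passed in
# node order. The prologue format here is invented for the example.
import re

NODE_RE = re.compile(r"node\s+(\d+)")

def node_id_for_log(path):
    with open(path) as f:
        for line in f:
            m = NODE_RE.search(line)
            if m:
                return int(m.group(1))
    raise ValueError(f"no node id found in {path}")

# Usage: map node ids to files regardless of the order they were given in.
# logs = {node_id_for_log(p): p for p in paths}
```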
Anyway, here's a profile from manually passing the first 300 ranks into the profiler (and no
That does look better. It does seem like only nodes 2 and 3 are the problem.
@elliottslaughter How hard would it be to get some logs from nodes 2 and 3? I am still going through all the updates here. However, considering the possibility that it's non-barrier-related, we would need "-level xpath=1,xplan=1,dma=1,gpudma=1,barrier=1" to get started and see what's going on.
I looked at the profiles more with @apryakhin today. It looks like the same issue occurs at 64 and 128 nodes too (also on nodes 2-3), but it is less pronounced (e.g. the copies take 800us on 64 nodes versus 3.5ms on 256 nodes). It should be possible to get profiles on a smaller number of nodes to debug the problem, but I'll let @apryakhin say the smallest number of nodes he's comfortable using to debug it. I also don't think you'll need
64-node logs are here:
If you want the corresponding Legion Prof, let me know. A 256-node job is still in the queue.
Thanks Elliott, I think now the silly question is on my side: how do I access Frontier? Is there a guideline doc somewhere?
Contact me privately and I'll discuss it with you. Short answer is that we might be able to do it, but we're talking something like a month of lead time. It's not quick. In the meantime, if there is other debugging that I can do, let me know.
I didn't realize @apryakhin didn't have a Sapling account. Here is a public link to the logs: |
@elliottslaughter Would it be possible for me to give you either a small patch or a custom branch to run an experiment on? Also, did we run it with "-level barrier=1"? From the logs we have, I am unable to determine the root cause. The 16-byte 1D memcpys (sysmem->sysmem) do look okay from the dma perspective. There are lots of those 16-byte copies though, so we may need to look at the pipelining, e.g. to profile the background work manager and probably the active message handlers (this all needs to be enabled with compile-time flags).
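As a first pass at the pipelining question, it may help to histogram the latencies of just the 16-byte copies per node. The snippet below is a sketch only; the log line layout (node id, size, queue and completion timestamps) is an assumed format for illustration, not the actual output of the Realm dma logger.

```python
# Hypothetical sketch: per-node latency summary for just the 16-byte copies.
# Assumes each relevant log line carries fields like:
#   node=<n> size=<bytes> queue=<us> done=<us>
# which is an invented format, not the actual Realm dma logger output.
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(r"node=(\d+)\s+size=(\d+)\s+queue=(\d+)\s+done=(\d+)")

def summarize(paths, size=16):
    lats = defaultdict(list)
    for path in paths:
        with open(path) as f:
            for line in f:
                m = LINE_RE.search(line)
                if m and int(m.group(2)) == size:
                    lats[int(m.group(1))].append(int(m.group(4)) - int(m.group(3)))
    for node in sorted(lats):
        v = sorted(lats[node])
        print(f"node {node}: n={len(v)} p50={v[len(v) // 2]}us max={v[-1]}us")

if __name__ == "__main__":
    summarize(sys.argv[1:])
```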
Yes, feel free to provide patches to test. I ran the loggers you requested here: #1528 (comment)
I don't see how the DMA system could be responsible. The communication pattern is inherently scalable, or should be. If you're seeing a growing communication volume out of a specific node, that would be concerning. But Legion should broadcast anything that needs that, so you shouldn't even see it. It would help to know more about what you're concerned about, if this is still a concern for you.
On each iteration, each node should be issuing copies to gather data from its nearest neighbors. Each node does this independently without communicating with the other nodes. As Elliott says, this should be inherently scalable. There's no reason that nodes 2/3 should be any different from the rest.
To be a little more explicit: what we're interested in is how communication changes between node counts. So if your logs indicate that in a 4-node run, node 2 sends (or receives)
If you were to find such hot spots, that's a different conversation we'd need to have at that point.
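A minimal version of that check is sketched below: it just totals bytes sent and received per node for one run, so that, say, a 4-node and a 256-node run can be compared for hot spots. The tuple format for the copy records is an assumption for the example and is not tied to the actual Realm log syntax.

```python
# Hypothetical sketch: total bytes sent and received per node for one run, so
# that runs at different node counts can be compared for hot spots. The copy
# records are assumed to be (src_node, dst_node, nbytes) tuples extracted from
# the logs; the extraction step is not shown.
from collections import defaultdict

def volume_per_node(copies):
    sent = defaultdict(int)
    recv = defaultdict(int)
    for src, dst, nbytes in copies:
        sent[src] += nbytes
        recv[dst] += nbytes
    return sent, recv

# In a scalable nearest-neighbor pattern the per-node totals should stay
# roughly flat as the machine grows; a node whose totals grow with the node
# count would be the kind of hot spot described above.
sent, recv = volume_per_node([(2, 3, 16), (3, 2, 16), (1, 2, 16)])
print(dict(sent), dict(recv))
```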
Also to be very clear, there is no evidence of this happening in either the Realm logs or the Legion profiles, so I wouldn't spend a lot of time investigating whether this is the case. Just a cursory check should be fine. The real problem is that despite each node performing the same number of copies to its nearest neighbors, for nodes 2 and 3 those copies are several orders of magnitude slower than copies done on any other nodes, and seem to get worse with the scale of the machine.
It seems like we got quite off topic in this thread, but I posted a formal deprecation notice in: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1162
If there are still concerns with Task Bench scaling, we should probably move them elsewhere.
MR here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1219
I have left the
SCR has been removed.