
Static Control Replication End-of-Life #1528

Closed
lightsighter opened this issue Aug 22, 2023 · 30 comments
@lightsighter
Contributor

No users are currently using static control replication as they have all migrated over to using dynamic control replication. The cost to maintain static control replication is high for both Regent (compiler passes, data structures, and optimizations) and for Legion (must-epoch launches and their interactions with all sorts of things). @elliottslaughter and I are tentatively planning to end-of-life static control replication and its constituent features by the end of the year, contingent upon two requirements:

  1. Dynamic control replication with tracing can be shown to achieve a near-comparable METG to static control replication on the Task Bench benchmarks. (This should probably be a requirement for Control Replication Merge to Master #765.)
  2. Dynamic control replication is merged into the master branch of Legion (Control Replication Merge to Master #765).

If there are concerns or other requirements for removing static control replication support please add them here. This will be discussed in a future Legion group meeting after the summer.

Side note: users who currently rely on must-epoch launches for collective libraries can get the same safety guarantees needed for collectives from lighter-weight concurrent index space task launches, as sketched below.
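To make the side note concrete, here is a rough C++ sketch of the two launch styles. It is illustrative only; in particular, the `concurrent` field on IndexTaskLauncher is assumed from the feature description above rather than copied from legion.h, so check the header for the exact spelling.

```cpp
#include "legion.h"
using namespace Legion;

// Illustrative sketch only: contrast a must-epoch launch with a concurrent
// index space launch for tasks that call into a collective library.
void launch_collective_tasks(Context ctx, Runtime *runtime,
                             TaskID task_id, const Rect<1> &launch_bounds)
{
  ArgumentMap arg_map;

  // Must-epoch launch: guarantees all point tasks execute concurrently,
  // so blocking collectives inside the point tasks cannot deadlock.
  MustEpochLauncher must_epoch;
  IndexTaskLauncher epoch_tasks(task_id, Domain(launch_bounds),
                                TaskArgument(), arg_map);
  must_epoch.add_index_task(epoch_tasks);
  runtime->execute_must_epoch(ctx, must_epoch);

  // Lighter-weight alternative: request the same all-points-run-together
  // guarantee on a plain index space launch.
  IndexTaskLauncher concurrent_tasks(task_id, Domain(launch_bounds),
                                     TaskArgument(), arg_map);
  concurrent_tasks.concurrent = true;  // assumed field name; see legion.h
  runtime->execute_index_space(ctx, concurrent_tasks);
}
```

The point of the second form is that the runtime no longer has to reason about must-epoch semantics and their interactions everywhere, which is part of what makes SCR expensive to maintain.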

@lightsighter added the enhancement, Legion, and Regent labels on Aug 22, 2023
@elliottslaughter
Contributor

Here are the results of running Task Bench on Frontier up to 256 nodes:

[Figure: METG for the stencil pattern vs. node count (metg_stencil-1; also available in PDF format)]

In the results, "Regent" means Regent with SCR and "Legion" means Legion with DCR. Legion "sr" refers to the shardrefine branch (522ef64). The other Regent/Legion/Realm results are on control_replication (3ff376d).

Some overall observations:

  1. At low node counts, Legion DCR METGs are better than Regent SCR (by a factor of about 2×). This is the opposite of what we used to see, so very welcome.
  2. Legion DCR METGs rise with node count, though less aggressively than I remember seeing in the past. This may potentially be explained by uses of non-scalable Realm barriers, which is something @artempriakhin is working to address.
  3. shardrefine vs. control_replication makes no difference; the performance is near-identical.

I'm happy with these results (though there is still some room for improvement). That was the last blocker on my side. I am not aware of any reasons to avoid removing SCR.

I would request that we schedule the SCR final removal for after the control_replication -> master merge and subsequent stable release, unless there is a specific reason we want to remove must epochs before then. That way we have a known release with DCR + SCR all working before we rip it out.

@lightsighter
Contributor Author

Legion DCR METGs rise with node count, though less aggressively than I remember seeing in the past. This may potentially be explained by uses of non-scalable Realm barriers, which is something @artempriakhin is working to address.

Would it be possible to get some profiles of the 4 rank/node cases with 64, 128, and 256 nodes? I'd just like to see what they look like and confirm that hypothesis as to the source of the slowdown. The barriers really only show up in the dependences for the meta-tasks and not for the actual trace we're capturing and replaying, so slowdowns like this would only be caused by not replaying the trace fast enough compared to the actual execution of the trace.
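For context on what "the trace" is here: a Legion application marks a repeated loop body for capture and replay with begin_trace/end_trace. A minimal generic sketch (not Task Bench's actual code; the TraceID value is arbitrary):

```cpp
#include "legion.h"
using namespace Legion;

// Generic Legion tracing sketch: launches issued between begin_trace and
// end_trace are captured on the first iteration and replayed thereafter.
void run_traced_iterations(Context ctx, Runtime *runtime, int num_iterations)
{
  const TraceID STENCIL_TRACE_ID = 42;  // arbitrary ID for illustration
  for (int iter = 0; iter < num_iterations; iter++) {
    runtime->begin_trace(ctx, STENCIL_TRACE_ID);
    // ... issue the same index task launches and copies every iteration ...
    runtime->end_trace(ctx, STENCIL_TRACE_ID);
  }
}
```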

shardrefine vs control_replication makes no difference, the performance is near-identical.

Good to confirm, but not surprising since trace capture and replay is the same in both branches.

I would request that we schedule the SCR final removal for after the control_replication -> master merge and subsequent stable release, unless there is a specific reason we want to remove must epochs before then.

Agreed. No need to do it before then. It does mean that I'll stop worrying about performance optimizations for things like must-epoch launches, though they will remain functional so that CI continues to pass.

@elliottslaughter
Contributor

elliottslaughter commented Oct 24, 2023

Ok, here are the profiles. This is weak scaling, so problem size per node is constant. Note that on the 256-node run, I had to limit the visualization to about 75 nodes due to memory constraints on Sapling.

I picked a problem size that best demonstrates the fall-off in efficiency. For this problem, at 64 nodes we achieve 71% efficiency, but that falls to 34% efficiency by 256 nodes. Hopefully the difference is visible in the profiles.
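(For scale, assuming both efficiencies are measured against the same ideal per-node rate: 34% / 71% ≈ 0.48, so per-node throughput roughly halves going from 64 to 256 nodes even though the per-node problem size is fixed.)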

@lightsighter
Contributor Author

Just looking at the difference between the 64-node and 256-node profiles, the primary difference seems to be in the copies on the channels. In both cases the tasks take around 1.4-1.5ms. The utility processors are underloaded doing the trace replays in both cases. However, in the 64-node case all the copies take <100us, while in the 256-node case there are periodic copies that take 1-3ms to move 16B of data from one node to another. The pattern repeats every so many iterations of the code, so whatever is going wrong is deterministically bad, but it only happens on some of the nodes (e.g. nodes 2 and 3 in the 256-node run). It's hard for me to explain why the copies are slower, and only on those two nodes (at least as far as I can see).

@artempriakhin @muraj @streichler ideas of what to do next to investigate why specific copies are slow on those nodes?

@elliottslaughter
Contributor

Is it possible that a non-scalable barrier is causing the copies to get backed up? After all, the barriers themselves won't show up, so if one node just gets inexplicably slow, that seems like a possible culprit.

@lightsighter
Contributor Author

No, there won't be any non-scalable barriers in the generated Realm event graph. All the barriers used for sharded trace replays will be between adjacent shards here. The barriers that span all the nodes will be reflected in when the runtime is allowed to issue the trace replays, and clearly the runtime is well ahead in issuing the trace replays here.

Also, if the barriers were the issue, then they would delay the copies before they even started, but that isn't what is happening here. The copies have had all their preconditions met and have actually started running according to Realm's own timing and are taking longer to run on some nodes than others.

@elliottslaughter
Contributor

I'm not suggesting that the copies depend on the barriers. They're totally unrelated, as you said, and the runtime is running ahead. But if they just happen to execute simultaneously, having one node spew O(Ranks) messages for barriers could easily create head-of-line blocking for completely unrelated DMAs, which could result in the behavior we're seeing.

@lightsighter
Contributor Author

That is indeed a fair point and certainly could be the cause of the problem. Do you think you can add some more of the small-numbered node profiles here (you can remove some of the larger-numbered nodes since they look fine)? I'd like to see whether more of the small-numbered nodes exhibit similar issues.

@elliottslaughter
Contributor

What do you mean by "small-numbered nodes"? In the 256-node profile, we should be capturing nodes 0-74 (remember, this is 4 ranks per node so this is ranks 0-299). I'm happy to render any subset of the nodes, but you're already looking at the lowest rank numbers.

@lightsighter
Contributor Author

At least when I look at the 256-node profile it jumps from n3 to n10. Not sure where the nodes in between went, unless these are physical node names.

@elliottslaughter
Contributor

That may be a profiler bug. I passed --nodes 0-300 on the command line, so nodes 4-9 should be there.

I can try to massage the profile, maybe passing a subset of logs instead of the --nodes filter, to see what I get.

@lightsighter
Contributor Author

It also jumped from n39 to n100, so I think it is showing the first 75 nodes sorted lexicographically...

@elliottslaughter
Contributor

Hm, how do we assign node numbers in profiles? Do we just assume the first log file is node 0 and the second one is node 1 and so on?

@lightsighter
Contributor Author

We assign file names (via the % operator in the filename) using the same scheme that Realm does, which is by the "address space" ID (the same as the GASNet rank ID). That does mean the numbers count like 0, 1, 2, ... and you won't see anything like 000, 001, ..., 123.
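(For reference, this is the usual per-rank profiler logfile pattern, e.g. `-lg:prof <ranks> -lg:prof_logfile prof_%.gz`, where % expands to the address space ID and yields prof_0.gz, prof_1.gz, ...; the exact flag spellings here are from memory, so check the Legion profiling documentation.)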

@elliottslaughter
Contributor

I was thinking more of the parser end of it, but I guess it comes in through a logger statement (and is therefore hard to mess up):

MachineDesc { node_id: NodeID, num_nodes: u32 },

Anyway, here's a profile from manually passing the first 300 ranks into the profiler (and no --nodes filter). I think it has the nodes you're looking for.

https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~eslaught/frontier-task-bench/prof_legion_util_2_rank4_ngraphs_1_type_stencil_1d_nodes_256_s7_rep0_archive_n0-299/

@lightsighter
Contributor Author

That does look better. It does seem like only nodes 2 and 3 are the problem.

@apryakhin
Contributor

apryakhin commented Oct 26, 2023

@elliottslaughter How hard would it be to get some logs from nodes 2 and 3? I am still going through all the updates here. However, considering the possibility that it's not barrier-related, we would need "-level xpath=1,xplan=1,dma=1,gpudma=1,barrier=1" to get started and see what's going on.
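(These levels would normally be combined with per-rank Realm log files, e.g. something like `-logfile realm_%.log` with % expanding to the rank; that flag spelling is from memory, so adjust to however the Task Bench runs already capture Realm logging.)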

@lightsighter
Contributor Author

I looked at the profiles more with @apryakhin today. It looks like the same issue occurs at 64 and 128 nodes too (also on nodes 2-3), but it is less pronounced (e.g. the copies take 800us on 64 nodes versus 3.5ms on 256 nodes). It should be possible to get profiles on a smaller number of nodes to debug the problem, but I'll let @apryakhin say the smallest number of nodes he's comfortable using to debug it.

I also don't think you'll need gpudma=1 since there are no GPUs in these runs.

@elliottslaughter
Contributor

64-node logs are here:

/scratch/eslaught/frontier-task-bench/log_legion_util_2_rank4_ngraphs_1_type_stencil_1d_nodes_64_s7_rep0_logs

If you need the corresponding Legion Prof output, let me know.

A 256-node job is still in the queue.

@apryakhin
Contributor

Thanks Elliott. I think now the silly question is on my side: how do I access Frontier? Is there a guideline doc somewhere?

@elliottslaughter
Contributor

Contact me privately and I'll discuss it with you. Short answer is that we might be able to do it, but we're talking something like a month of lead time. It's not quick.

In the meantime, if there is other debugging that I can do, let me know.

@elliottslaughter
Contributor

I didn't realize @apryakhin didn't have a Sapling account. Here is a public link to the logs:

64 nodes: http://sapling.stanford.edu/~eslaught/frontier-task-bench/log_legion_util_2_rank4_ngraphs_1_type_stencil_1d_nodes_64_s7_rep0_logs/

256 nodes: http://sapling.stanford.edu/~eslaught/frontier-task-bench/log_legion_util_2_rank4_ngraphs_1_type_stencil_1d_nodes_256_s7_rep0_logs/

@apryakhin
Contributor

@elliottslaughter Would it be possible for me to give you either a small patch or a custom branch to run an experiment on? Also, did we run it with "-level barrier=1"?

From the logs we have, I am unable to determine the root cause. The 16-byte 1D memcpys sysmem->sysmem do look okay from the DMA perspective. There are lots of those 16-byte copies though, so we may need to look at the pipelining, e.g. to profile the background work manager and probably the active message handlers (this all needs to be enabled with compile-time flags).

@elliottslaughter
Contributor

Yes, feel free to provide patches to test.

I ran the loggers you requested here: #1528 (comment)

I don't see how the DMA system could be responsible. The communication pattern is inherently scalable, or should be. If you're seeing a growing communication volume out of a specific node, that would be concerning. But Legion should broadcast anything that needs broadcasting, so you shouldn't even see that. It would help to know more about what you're concerned about, if this is still a concern for you.

@lightsighter
Contributor Author

The communication pattern is inherently scalable, or should be.

On each iteration, each node should be issuing copies to gather data from its nearest neighbors. Each node does this independently without communicating with the other nodes. As Elliott says, this should be inherently scalable. There's no reason that nodes 2/3 should be any different from the rest.
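(Concretely, for the 1D stencil pattern that means a small, fixed number of nearest-neighbor exchanges per node per iteration, regardless of the total node count.)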

@elliottslaughter
Contributor

To be a little more explicit: what we're interested in is how communication changes between node counts. So if your logs indicate that in a 4 node run, node 2 sends (or receives) N messages, but at 8 nodes, node 2 sends (or receives) 2*N messages, that would be interesting (and concerning). However, based on the properties of the application, we would expect this NOT to be the case. We expect that N will be relatively constant at each node count, and you shouldn't find hot spots that grow as we scale out.

If you were to find such hot spots, that's a different conversation we'd need to have at that point.

@lightsighter
Contributor Author

So if your logs indicate that in a 4 node run, node 2 sends (or receives) N messages, but at 8 nodes, node 2 sends (or receives) 2*N messages, that would be interesting (and concerning). However, based on the properties of the application, we would expect this NOT to be the case.

Also to be very clear, there is no evidence of this happening in either the Realm logs or the Legion profiles, so I wouldn't spend a lot of time investigating whether this is the case. Just a cursory check should be fine.

The real problem is that despite each node performing the same number of copies to its nearest neighbors, for nodes 2 and 3 those copies are several orders of magnitude slower than copies done on any other nodes, and seem to get worse with the scale of the machine.

@elliottslaughter
Contributor

It seems like we got quite off topic in this thread, but I posted a formal deprecation notice in: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1162

If there are still concerns with Task Bench scaling, we should probably move them elsewhere.

@elliottslaughter
Contributor

MR here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1219

I have left the -fflow flag in place for the moment so we don't immediately break downstream users. -fflow 1 is now a hard error. -fflow 0 generates a warning explaining that the flag can be removed, as it is no longer required. I plan to remove the -fflow flag entirely once we can confirm all users have moved off of it.
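(In practice, assuming the usual Regent launcher script, that means an invocation like `./regent.py my_app.rg -fflow 0` keeps working for now but prints the warning, and once ready the flag can simply be dropped: `./regent.py my_app.rg`. The launcher path and application name here are illustrative.)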

@elliottslaughter
Contributor

SCR has been removed.
