Optimize submission process for eager submission case #1054
Conversation
@illuhad Thanks for the effort, this looks very interesting. I'll let you know when I've had a chance to give it a try!
Hi! We did some testing on MI250. This patch clearly improves things for small systems: up to a 6% increase in overall performance (here we're comparing the best results over a few tested configurations, including different values of the relevant environment variables). Thank you! 👍

The CPU usage also clearly goes down. When GROMACS uses hipSYCL+ROCm, there are 5 extra threads spawned: one HSA worker thread, and four hipSYCL threads. The first two hipSYCL threads do nothing in our case; the third thread is a worker doing submission to the underlying runtime, while the fourth thread does GC (and perhaps other things). Sorry for the messy nomenclature.

Even for the lazy submission case (`HIPSYCL_RT_MAX_CACHED_NODES=100`), …

For the eager submission case (`HIPSYCL_RT_MAX_CACHED_NODES=0`), …

Note: the CPU utilization measurements are very approximate (looking at htop and choosing an average-ish value), and the error bars are my visual estimate of how bad at visual estimation I am :)

Note 2: The CPU utilization of the HSA thread is not changed significantly; it's around 95% for small systems and gradually drops to 60% for larger systems.

This brings a follow-up question: can we assume that both hipSYCL threads would be happy to share the same core or PU?

Can this CPU usage be further reduced? Even with this patch and for large system sizes, the CPU usage of the hipSYCL runtime is not insignificant. E.g., on LUMI, we have 8 cores per GPU, one is already reserved for HSA, and deciding how many we must further reserve for hipSYCL is kind of a big deal.

P.S.: Looking forward to your DevSummit talk!
Thanks for the feedback! That sounds great :)
That's quite right: there's one submission thread and one GC thread. The CPU backend creates one thread as its OpenMP "execution queue". Off the top of my head, I'm not sure where the last thread is coming from. In any case, I'd indeed expect two active threads (submission and GC) in your case.
That's good to know - I hadn't looked at this case and did not expect a change here :-)
Do you have an idea what the HSA thread does when it spikes, or if there are some particular patterns that trigger its high CPU utilization that we should avoid?
Try it :) In theory the GC thread should not have high demands and might be fine running on the same core. In practice there is also some locking going on between the submission and GC threads (the submission thread produces new nodes, and the GC thread removes them again), so especially in the eager case it is likely that there is little concurrency between them anyway.
Maybe. This PR contains the straightforward optimizations that I was able to find in a couple of days of profiling and experimentation, but there are some tuning knobs. The first thing I'd recommend is to play with the environment variable `HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE`.

Independently, I think in the long term we need to look into the memory allocation behavior and reduce the required mallocs during a submission. That requires larger changes and a migration to an object pool architecture, so it is not a quick change.

EDIT: I wonder what would happen if we dispatched submission and GC to the same thread. Semantically, I'm not 100% sure, but I think it should be fine. The GC thread currently also waits on all nodes to update the status in the state cache to avoid backend state queries; that would still need to be done in a separate thread.
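To make the EDIT above concrete: funneling both submission and GC work through one worker thread could be prototyped with an ordinary single-worker job queue. The sketch below is purely illustrative — none of these names exist in hipSYCL; it just shows how the two kinds of work would serialize on one thread/core.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Generic single-worker job queue: submission jobs and GC jobs would both
// be enqueued here and thus serialize on one thread/core.
class single_worker {
public:
  single_worker() : worker_{[this] { run(); }} {}
  ~single_worker() {
    {
      std::lock_guard<std::mutex> lk{mutex_};
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }
  void enqueue(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lk{mutex_};
      jobs_.push(std::move(job));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lk{mutex_};
    while (true) {
      cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
      if (done_ && jobs_.empty())
        return;
      auto job = std::move(jobs_.front());
      jobs_.pop();
      lk.unlock();
      job(); // e.g. a backend submission, or a purge of completed nodes
      lk.lock();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool done_ = false;
  std::thread worker_;
};
```

Whether this actually helps would depend on how much the two kinds of work currently overlap; as noted above, in the eager case there is probably little concurrency between them anyway.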
```cpp
private:
  void purge_known_completed();
  void copy_node_list(std::vector<dag_node_ptr>& out) const;
```
Stylistic question: we have `wait_for_all()` and `wait_for_group(std::size_t node_group)`. Since we have `std::vector<dag_node_ptr> get_group(std::size_t node_group);`, it would, IMO, be more consistent to rename this function to `std::vector<dag_node_ptr> get_all()`. Both functions can be `const`.
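For illustration, the suggested naming would make the interface read like this (a hypothetical sketch mirroring the declarations above, not the actual hipSYCL code):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct dag_node;                                // stand-in for the real node type
using dag_node_ptr = std::shared_ptr<dag_node>;

class node_list {                               // hypothetical class name
public:
  void wait_for_all();
  void wait_for_group(std::size_t node_group);

  // Symmetric, const accessors as suggested:
  std::vector<dag_node_ptr> get_group(std::size_t node_group) const;
  std::vector<dag_node_ptr> get_all() const;    // renamed from copy_node_list(out)
};
```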
Not sure which thread is which, but (after the optimization in this PR) we saw 20-30% and 50-60% CPU utilization for the two hipSYCL runtime threads, respectively. Any double-digit value I'd consider relatively high demand, since I have the feeling that these runtime threads interfere with each other and cause performance loss (and/or performance inconsistency). Assigning both of these threads to the same core can be done as a manual hack at runtime (by setting affinities after application startup), but in general that's just a hack.
That's exactly what my thought was: purely based on the amount of work observed empirically, a single thread should be sufficient (especially if the GC trigger frequency is low).
The issue is that, at least with MPI, we need to wait quite frequently (several times per iteration) because communication has to be initiated from the CPU upon the readiness of data on the device. While the data Andrey presented is single-GPU, hence no MPI and very infrequent CPU waiting on the GPU, any multi-GPU run will suffer from the GC triggered by `flush_sync()`.
Something else: do you have any experience with the impact of the HSA worker thread's placement relative to the hipSYCL worker threads, e.g. same core, same CCX, same NUMA node, different NUMA nodes, and, at least in the first 2-3 cases, with/without contention/competition for resources?
It's not spiking; it's more or less stable at 60-90% (depending on how much we're hammering it). Randomly sampling backtraces with …
Thinking more about this, I'm not sure to what extent this is meaningfully possible, as GC without knowing that events have completed makes little sense. So the current coupling between waiting on events (which needs to be done asynchronously) and GC might be difficult to remove.
Understood. We could potentially experiment with other heuristics for your use case, but for this we should first look at some data on how big the impact in the MPI case is in practice.
Sorry, I don't know. I also don't know what the HSA workers do exactly and how they relate to input data from HIP or hipSYCL layers.
Ok, it's unfortunate that we know so little about this.
We should probably merge this for now, as it has been shown to improve things, and then figure out how to improve performance further from there.
I agree. We are looking into performance in detail. Performance is consistently improved in fast-iterating cases with the fully GPU-resident mode of GROMACS, where tens of iterations are launched without CPU involvement. We are going to look next at cases where there is CPU involvement per iteration, including synchronization prior to MPI, and will report back.
- Avoid triggering event query when calling `for_each_nonvirtual_requirement()`
- Instrumentation: Return to using spin lock instead of `signal_channel` to avoid additional latency when submitting even if profiling is disabled
- Garbage collection: Insert new nodes at the beginning of the queue to reduce number of event waits and purges of completed nodes
- RT garbage collection: Decouple GC from submission; trigger GC in `flush_sync` and every N submissions
- Remove outdated comment
This PR optimizes a couple of current pain points when `HIPSYCL_RT_MAX_CACHED_NODES=0` (eager submission). With this PR, for a loop that submits empty kernels with hipSYCL coarse-grained events on an in-order queue, I now achieve roughly 80% of the task throughput compared to `HIPSYCL_RT_MAX_CACHED_NODES=100`. This corresponds to a 3x to 4x increase of task throughput in this scenario.
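For context, the microbenchmark described above could look roughly like the sketch below. This is an assumption-laden illustration: the exact property name for hipSYCL's coarse-grained events extension and the header path are taken from my reading of the hipSYCL extension docs and may differ.

```cpp
#include <sycl/sycl.hpp>

int main() {
  // In-order queue with hipSYCL's coarse-grained events extension
  // (the property name is an assumption, see above).
  sycl::queue q{sycl::property_list{
      sycl::property::queue::in_order{},
      sycl::property::queue::hipSYCL_coarse_grained_events{}}};

  constexpr int num_tasks = 100000;
  for (int i = 0; i < num_tasks; ++i)
    q.single_task([] { /* empty kernel: measures pure submission overhead */ });

  q.wait(); // waiting on the queue goes through the runtime's flush_sync() path
}
```

Running such a loop with `HIPSYCL_RT_MAX_CACHED_NODES=0` versus `HIPSYCL_RT_MAX_CACHED_NODES=100` is the comparison quoted above.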
In more detail, this PR:

- removes the `signal_channel` used to notify waiting user threads that instrumentations are ready, and uses a spin lock instead. This is not ideal, but solves a current problem: even if a task does not use instrumentation, the `signal_channel` had to be set up so that we can inform waiting user threads that no instrumentations are there. As it turns out, the `std::promise`/`std::future` mechanism that `signal_channel` uses is costly to set up, so we pay for something in the submission of every task even if most tasks are not being instrumented. The spin lock wastes CPU cycles when instrumentations are actually needed, but is almost free to set up, and therefore performs better in the majority of cases when we don't need instrumentations (see the signalling sketch after this list).
- decouples garbage collection from submission. GC is now triggered in `flush_sync()` (this will happen when the user waits on a queue or event, and so we know that submission bursts are likely over) or if, after a submission, the number of in-flight nodes has exceeded `HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE` (sketched after this list).
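To make the first bullet concrete, here is a hedged sketch (hypothetical types, not the actual hipSYCL internals) of the two signalling schemes: a `std::promise`/`std::future` pair has to allocate shared state at construction, i.e. on every task submission, while an atomic flag is nearly free to construct and only burns CPU when a thread actually spins on it.

```cpp
#include <atomic>
#include <future>
#include <thread>

// Old scheme (sketch): the promise/future pair heap-allocates shared state
// on every task submission, even if nobody ever waits for instrumentation.
struct promise_based_signal {
  std::promise<void> ready_promise;
  std::future<void> ready_future{ready_promise.get_future()};
  void signal() { ready_promise.set_value(); }
  void wait() { ready_future.wait(); }
};

// New scheme (sketch): an atomic flag costs almost nothing to set up;
// waiting threads spin, but instrumentation waits are the rare case.
struct spin_based_signal {
  std::atomic<bool> ready{false};
  void signal() { ready.store(true, std::memory_order_release); }
  void wait() const {
    while (!ready.load(std::memory_order_acquire))
      std::this_thread::yield(); // busy-wait until results are published
  }
};
```

And the second bullet's trigger policy can be pictured roughly like this (again with invented names; `gc_trigger_batch_size` would be read from `HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE`):

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the GC trigger policy described above (illustrative names only).
struct gc_policy {
  std::size_t gc_trigger_batch_size = 128; // hypothetical default; env-configurable
  std::atomic<std::size_t> in_flight{0};

  void on_submission() {
    // Mid-burst trigger: too many in-flight nodes have accumulated.
    if (++in_flight > gc_trigger_batch_size)
      run_gc();
  }

  void on_flush_sync() {
    // The user is waiting anyway, so the submission burst is likely over.
    run_gc();
  }

  void run_gc() {
    // Purge nodes whose events are known to have completed. Here we just
    // reset the counter to keep the sketch self-contained.
    in_flight.store(0);
  }
};
```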
@al42and @pszi1ard I could not yet try this with Gromacs to see to what extent the benefits translate to real-world apps, but I imagine it could be interesting for you :)