Optimize submission process for eager submission case #1054

Merged: 5 commits into develop from feature/rt-optimizations on Jun 26, 2023

Conversation

@illuhad (Collaborator) commented May 31, 2023

This PR optimizes a couple of pain points that currently arise when HIPSYCL_RT_MAX_CACHED_NODES=0 (eager submission). With this PR, for a loop that submits empty kernels with hipSYCL coarse-grained events on an in-order queue, I now achieve roughly 80% of the task throughput of HIPSYCL_RT_MAX_CACHED_NODES=100.

This corresponds to a 3x to 4x increase of task throughput in this scenario.
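For reference, the measured scenario is roughly a loop like the following (a minimal sketch, not the actual benchmark code; the coarse-grained events property name refers to the hipSYCL extension and is given here from memory):

```c++
// Minimal sketch of the benchmark scenario described above.
// Run with HIPSYCL_RT_MAX_CACHED_NODES=0 for eager submission.
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::property_list{
      sycl::property::queue::in_order{},
      sycl::property::queue::hipSYCL_coarse_grained_events{}}};

  constexpr int num_tasks = 100000;
  for (int i = 0; i < num_tasks; ++i)
    q.single_task([]() { /* empty kernel */ }); // one submission per iteration

  q.wait(); // task throughput = num_tasks / elapsed wall time
}
```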

In more detail, this PR

  • Removes a backend event state query that had been overlooked in the submission path. The query is replaced with a lookup in the event state cache, which should be sufficient.
  • Removes the usage of signal_channel to notify waiting user threads that instrumentations are ready, and uses a spin lock instead. This is not ideal, but solves a current problem: even if a task does not use instrumentation, the signal_channel had to be set up so that we can inform waiting user threads that no instrumentations are there. As it turns out, the std::promise/std::future mechanism that signal_channel uses is costly to set up, so we pay for it on the submission of every task even though most tasks are not instrumented. The spin lock wastes CPU cycles when instrumentations are actually needed, but is almost free to set up, and therefore performs better in the majority of cases where we don't need instrumentations.
  • Lastly, it completely changes the way garbage collection works. GC is needed for two reasons: Firstly, SYCL builds a global task graph behind the scenes, and old completed nodes must be removed from time to time so that the graph does not grow to a size where it starts to slow down performance. Secondly, we need to ensure that buffers stay alive as long as kernels are operating on them, so nodes must be removed once they have completed such that the associated buffers can be freed. Previously, a GC task was spawned for each submission batch and took care of GC for that batch's nodes. This becomes a problem in the eager submission scenario, where one GC job is spawned per task: all these small GC jobs cause a lot of overhead and lock contention in our data structures. With this PR, a GC job is only spawned (see the sketch below)
    • if there is no GC job running already, and
    • if we are in a flush_sync() (this happens when the user waits on a queue or event, so we know that a submission burst is likely over), or if after a submission the number of in-flight nodes has exceeded HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE.
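To illustrate the new trigger condition, here is a minimal sketch of the heuristic (illustrative names only, not the actual runtime classes):

```c++
#include <atomic>
#include <cstddef>
#include <thread>

// Illustrative sketch: spawn at most one GC job at a time, and only when a
// flush_sync() happens or enough nodes are in flight.
class gc_trigger {
public:
  explicit gc_trigger(std::size_t batch_size) : _batch_size{batch_size} {}

  // Called after every submission and from flush_sync().
  void maybe_spawn_gc(std::size_t in_flight_nodes, bool is_flush_sync) {
    if (is_flush_sync || in_flight_nodes > _batch_size) {
      bool expected = false;
      // Skip if a GC job is already running.
      if (_gc_running.compare_exchange_strong(expected, true)) {
        std::thread{[this] {
          purge_completed_nodes(); // wait for node events, drop completed nodes
          _gc_running.store(false, std::memory_order_release);
        }}.detach();
      }
    }
  }

private:
  void purge_completed_nodes() { /* actual GC work happens here */ }

  std::atomic<bool> _gc_running{false};
  std::size_t _batch_size; // corresponds to HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE
};
```

The actual runtime implementation differs in detail (e.g. how the GC worker is dispatched); the sketch only captures the "at most one concurrent GC job, triggered by flush_sync() or by exceeding the batch size" idea.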

@al42and @pszi1ard I could not yet try this with GROMACS to see to what extent the benefits translate to real-world apps, but I imagine it could be interesting for you :)

@pszi1ard commented Jun 1, 2023

@illuhad Thanks for the effort, this looks very interesting. I'll let you know when I've had a chance to give it a try!

@al42and (Contributor) commented Jun 13, 2023

Hi!

We did some testing on MI250. This patch clearly improves things for small systems, with up to a 6% increase in overall performance (here we're comparing the best results over a few tested configurations, including different values of HIPSYCL_RT_MAX_CACHED_NODES):

[Chart: speedup of this patch as a function of system size; performance improvements of up to 6% for small systems of fewer than 100,000 atoms]

Thank you! 👍


The CPU usage also clearly goes down. When GROMACS uses hipSYCL+ROCm, there are 5 extra threads spawned: one HSA worker thread, and four hipSYCL threads. The first two hipSYCL threads do nothing in our case, the third thread is a worker doing submission to the underlying runtime, while the fourth thread does GC (and perhaps other things). Sorry for the messy nomenclature.

Even for the lazy submission case (HIPSYCL_RT_MAX_CACHED_NODES=100), the CPU usage by the GC thread is noticeably lowered by this patch:

[Chart: CPU usage comparison with and without this patch, lazy submission case]

For the eager submission (HIPSYCL_RT_MAX_CACHED_NODES=0), the difference is much more noticeable:

[Chart: CPU usage comparison with and without this patch, eager submission case]

Note: the CPU utilization measurements are very approximate (looking at htop and choosing average-ish value), and the error-bars are my visual estimate of how bad at visual estimation I am :)

Note 2: The CPU utilization by the HSA thread is not changed significantly; it's around 95% for small system and gradually drops to 60% for larger systems.

This brings a follow-up question: can we assume that both hipSYCL threads would be happy to share the same core or PU? Can this CPU usage be further reduced? Even with this patch and for large system sizes, the CPU usage of the hipSYCL runtime is not insignificant. E.g., on LUMI we have 8 cores per GPU, one is already reserved for HSA, and deciding how many we must further reserve for hipSYCL is kind of a big deal.

P.S.: Looking forward to your DevSummit talk!

@illuhad (Collaborator, Author) commented Jun 14, 2023

Thanks for the feedback! That sounds great :)

The CPU usage also clearly goes down. When GROMACS uses hipSYCL+ROCm, there are 5 extra threads spawned: one HSA worker thread, and four hipSYCL threads. The first two hipSYCL threads do nothing in our case, the third thread is a worker doing submission to the underlying runtime, while the fourth thread does GC (and perhaps other things). Sorry for the messy nomenclature.

That's quite right, there's one submission thread and one GC thread. The CPU backend creates one thread as its OpenMP "execution queue". Off the top of my head, I'm not sure where the last thread is coming from. In any case, I'd indeed expect two active threads (submission and GC) in your case.

Even for the lazy submission case (HIPSYCL_RT_MAX_CACHED_NODES=100), the CPU usage by the GC thread is noticeably lowered by this patch:

That's good to know - I hadn't looked at this case and did not expect a change here :-)

Note 2: The CPU utilization by the HSA thread is not changed significantly; it's around 95% for small system and gradually drops to 60% for larger systems.

Do you have an idea what the HSA thread does when it spikes, or if there are some particular patterns that trigger its high CPU utilization that we should avoid?

This brings a follow-up question: can we assume that both hipSYCL threads would be happy to share the same core or PU?

Try it :) In theory the GC thread should not have high demands, and might be fine running on the same core. In practice there is also some locking going on between submission and GC thread (submission thread produces new nodes, and GC thread removes them again), so especially in the eager case it is likely that there is little concurrency between them anyway.

Can this CPU usage be further reduced? Even with this patch and for large system sizes, the CPU usage of hipSYCL runtime is not-insignificant. E.g., on LUMI, we have 8 cores per GPU, one is already reserved for HSA, and deciding how much we must further reserve for hipSYCL is kind-of a big deal.

Maybe. This PR contains the straightforward optimizations that I was able to find in a couple of days of profiling and experimentation, but there are some tuning knobs. The first thing I'd recommend is playing with the environment variable HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE, which controls how many nodes we can have before GC kicks in. The default is 128. A larger value means less frequent but larger GC jobs. Note that currently, when the application waits on something like a queue::wait() or event::wait(), GC is spawned in any case. The assumption is that when the code is waiting, a submission burst is over and that might be a good time to clean up. I think we might be able to loosen this if it causes too many GC jobs.
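As a hedged illustration of the behavior described above (assuming the default batch size of 128), the GC activity for a typical submission burst looks roughly like this:

```c++
#include <sycl/sycl.hpp>

// Illustration only; the comments restate the GC behavior described above.
void run_iteration(sycl::queue &q, int burst_size) {
  for (int i = 0; i < burst_size; ++i)
    q.single_task([]() { /* kernel */ });
  // If burst_size exceeds HIPSYCL_RT_GC_TRIGGER_BATCH_SIZE (default 128),
  // GC already kicks in during the burst once enough nodes are in flight.

  q.wait(); // flush_sync(): a GC job is spawned here in any case, on the
            // assumption that the submission burst is over.
}
```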

Independently, I think in the long term we need to look into the memory allocation behavior and reduce the number of mallocs required during a submission. That requires larger changes and a migration to an object pool architecture, so it is not a quick change.

EDIT: I wonder what would happen if we dispatched submission and GC to the same thread. Semantically, I'm not 100% sure, but I think it should be fine. The GC thread currently also waits on all nodes to update the status in the state cache to avoid backend state queries. That would still need to be done in a separate thread.

A Contributor commented on this diff hunk:

private:
void purge_known_completed();
void copy_node_list(std::vector<dag_node_ptr>& out) const;

Stylistic question:

We have wait_for_all() and wait_for_group(std::size_t node_group).

Since we have std::vector<dag_node_ptr> get_group(std::size_t node_group);, it would, IMO, be more consistent to rename this function to std::vector<dag_node_ptr> get_all(). Both functions can be const.

@pszi1ard

In theory the GC thread should not have high demands, and might be fine running on the same core. In practice there is also some locking going on between submission and GC thread (submission thread produces new nodes, and GC thread removes them again), so especially in the eager case it is likely that there is little concurrency between them anyway.

Not sure which thread is which, but (after the optimization in this PR) we saw 20-30% and 50-60% CPU utilization for the two hipSYCL runtime threads, respectively; any double-digit value I'd consider relatively high demand, since I have the feeling that these runtime threads interfere with each other and cause performance loss (and/or performance inconsistency).

Assigning both of these threads to the same core can be done manually at runtime (by setting affinities after application startup), but in general that's just a hack.

EDIT: I wonder what would happen if we dispatched submission and GC to the same thread.

That's exactly what my thought was: purely based on the amount of work observed empirically, a single thread should be sufficient (especially if the GC trigger frequency is low).

A larger value means less frequent but larger GC jobs. Note that currently, when the application waits on something like a queue::wait() or event::wait(), GC is spawned in any case. The assumption is that when the code is waiting, a submission burst is over and that might be a good time to clean up. I think we might be able to loosen this if it causes too many GC jobs.

The issue is that, at least with MPI, we need to wait quite frequently (several times per iteration) because communication has to be initiated from the CPU upon the readiness of data on the device. While the data Andrey presented is single-GPU, hence no MPI and very infrequent CPU waiting on the GPU, any multi-GPU run will suffer from the GC triggered by event::wait().

@pszi1ard

Something else: do you have any experience with the impact of the HSA worker thread's placement relative to the hipSYCL worker threads e.g. same cores, same CCX, same NUMA, different NUMA and at least in the first 2-3 cases with/without contention/competition for resources?

@al42and (Contributor) commented Jun 14, 2023

Do you have an idea what the HSA thread does when it spikes, or if there are some particular patterns that trigger its high CPU utilization that we should avoid?

It's not spiking; it's more or less stable at 60-90% (depending on how much we're hammering it). Randomly sampling backtraces with gdb was not enlightening since AMD stripped all debug info from their binary distributions: most time is spent deep in libhsa-runtime64.so.1, and sometimes in ioctl.

@illuhad (Collaborator, Author) commented Jun 19, 2023

That's exactly what my thought was: purely based on the amount of work observed empirically, a single thread should be sufficient (especially if the GC trigger frequency is low).

Thinking more about this, I'm not sure to what extent this is meaningfully possible, as GC without knowing that events have completed makes little sense. So the current coupling between waiting on events (which needs to be done asynchronously) and GC might be difficult to remove.

The issue is that, at least with MPI, we need to wait quite frequently (several times per iteration) because communication has to be initiated from the CPU upon the readiness of data on the device. While the data Andrey presented is single-GPU, hence no MPI and very infrequent CPU waiting on the GPU, any multi-GPU run will suffer from the GC triggered by event::wait().

Understood. We could potentially experiment with other heuristics for your use case, but for this we should first look at some data of how big the impact in the MPI case is in practice.

Something else: do you have any experience with the impact of the HSA worker thread's placement relative to the hipSYCL worker threads e.g. same cores, same CCX, same NUMA, different NUMA and at least in the first 2-3 cases with/without contention/competition for resources?

Sorry, I don't know. I also don't know what the HSA workers do exactly and how they relate to input data from HIP or hipSYCL layers.

It's not spiking; it's more or less stable at 60-90% (depending on how much we're hammering it). Randomly sampling backtraces with gdb was not enlightening since AMD stripped all debug info from their binary distributions: most time is spent deep in libhsa-runtime64.so.1, and sometimes in ioctl.

Ok, that's unfortunate that we know so little about this.

@illuhad (Collaborator, Author) commented Jun 26, 2023

We should probably merge this for now, as it has been shown to improve things, and then figure out how to improve performance further from there.

@illuhad illuhad merged commit cd2c72f into develop Jun 26, 2023
31 of 32 checks passed
@illuhad illuhad deleted the feature/rt-optimizations branch June 26, 2023 17:43
@pszi1ard

We should probably merge this for now, as it has been shown to improve things, and then figure out how to improve performance further from there.

I agree. We are looking into performance in detail; it is consistently improved in fast-iterating cases with the fully GPU-resident mode of GROMACS, where tens of iterations are launched without CPU involvement. We will next look at cases where there is CPU involvement per iteration, including synchronization prior to MPI, and will report back.

nilsfriess pushed a commit to nilsfriess/OpenSYCL that referenced this pull request Aug 7, 2023
* Avoid triggering event query when calling for_each_nonvirtual_requirement()

* Instrumentation: Return to using spin lock instead of signal_channel to avoid additional latency when submitting even if profiling is disabled

* Garbage collection: Insert new nodes at the beginning of the queue to reduce number of event waits and purges of completed nodes

* RT garbage collection: Decouple GC from submission; trigger GC in flush_sync and every N submissions

* Remove outdated comment
@al42and al42and mentioned this pull request Sep 15, 2023