Reduce Alpaka event synchronization calls via EDMetadata #44841

Merged (3 commits) on Apr 29, 2024

makortel commented Apr 24, 2024

PR description:

Prompted by #44769 (comment), this PR reduces the number of blocking alpaka::wait() calls (leading to cudaEventSynchronize() on NVIDIA GPUs) made via the EDMetadata destructor. To check the impact I used the reproducer of #44769 (even if it eventually leads to an assertion failure). Initially the reproducer led to 2470 calls to cudaEventSynchronize().

The first commit avoids calling alpaka::wait() in EDProducer::produce() when no asynchronous work was launched (possible async work is identified by the use of device::Event::queue() in any way, including consuming a device side data product). In this case there is no need to synchronize the device and host. This commit reduced the number of cudaEventSynchronize() calls to 2284 (-7.5 %).
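As a rough illustration of the first commit's idea, the sketch below records whether the queue was ever requested and skips the blocking wait otherwise. All types here (`MockQueue`, `MockDeviceEvent`, `SyncCounter`, `finishProduce`) are hypothetical stand-ins, not the actual alpaka or CMSSW classes, and the counter only simulates a blocking alpaka::wait().

```cpp
#include <cassert>

// Hypothetical stand-in for an alpaka Queue; not the real type.
struct MockQueue {
  int pendingTasks = 0;
};

// Counts simulated blocking waits (stand-in for alpaka::wait()).
struct SyncCounter {
  int blockingWaits = 0;
};

// Hypothetical stand-in for device::Event: any access to queue()
// marks the event as possibly having launched asynchronous work.
class MockDeviceEvent {
public:
  explicit MockDeviceEvent(MockQueue& q) : queue_(q) {}

  MockQueue& queue() {
    queueUsed_ = true;
    return queue_;
  }

  bool queueUsed() const { return queueUsed_; }

private:
  MockQueue& queue_;
  bool queueUsed_ = false;
};

// Called at the end of a produce()-like function: synchronize only
// if the queue was touched, otherwise there is nothing to wait for.
inline void finishProduce(MockDeviceEvent const& ev, SyncCounter& counter) {
  if (ev.queueUsed()) {
    ++counter.blockingWaits;
  }
}

// Two modules share a queue; only the second one touches it.
inline int demoSkipWait() {
  MockQueue q;
  SyncCounter counter;

  MockDeviceEvent hostOnly(q);
  finishProduce(hostOnly, counter);  // no wait: queue never requested

  MockDeviceEvent withWork(q);
  withWork.queue().pendingTasks += 1;
  finishProduce(withWork, counter);  // one simulated wait

  return counter.blockingWaits;
}
```

In this simplified model `demoSkipWait()` counts a single blocking wait for the two modules, since only one of them requested the queue.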

The second commit caches the alpaka::Event completion, so that alpaka::wait() does not need to be called again on an already-completed alpaka Event. This commit reduced the number of cudaEventSynchronize() calls to 2019 (-18 % wrt the starting point).
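The caching in the second commit can be pictured with the following single-threaded sketch. `MockEvent` and `CachedEvent` are hypothetical names, `synchronize()` merely stands in for a blocking alpaka::wait() on the event, and a real implementation would also have to consider concurrent access to the flag.

```cpp
#include <cassert>

// Hypothetical stand-in for an alpaka Event; counts blocking syncs.
struct MockEvent {
  int syncCalls = 0;
  void synchronize() { ++syncCalls; }  // stand-in for alpaka::wait(event)
};

// Remembers that the event has completed, so later wait() calls
// return immediately without touching the underlying event.
class CachedEvent {
public:
  void wait() {
    if (completed_) {
      return;  // already known to be complete
    }
    event_.synchronize();
    completed_ = true;
  }

  int syncCalls() const { return event_.syncCalls; }

private:
  MockEvent event_;
  bool completed_ = false;
};

// Waiting three times results in a single blocking synchronization.
inline int demoCachedWait() {
  CachedEvent ev;
  ev.wait();
  ev.wait();
  ev.wait();
  return ev.syncCalls();
}
```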

The third commit avoids calling alpaka::wait() in the destructor of an EDMetadata when the Queue of the EDMetadata object has been re-used by another EDMetadata (in an EDProducer::produce() that consumed the device-side data product holding the first EDMetadata object). Since the only purpose of the alpaka::wait() in the ~EDMetadata() is to ensure all asynchronous work of an edm::Event has completed before the edm::Stream moves to the next edm::Event, for any Queue it is enough to wait() on the alpaka::Event that was last recorded on that Queue. This commit reduced the number of cudaEventSynchronize() calls to 1462 (-40 % wrt the starting point).
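A minimal sketch of the queue re-use logic, under the assumption that handing the queue to a consumer transfers the responsibility for the final wait. `MockMetadata`, `handOverQueue()` and the wait counter are hypothetical and only model the bookkeeping, not the real EDMetadata.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a Queue; counts simulated blocking waits.
struct MockQueue {
  int blockingWaits = 0;
};

// Hypothetical stand-in for EDMetadata: the destructor waits only if
// no later metadata object took over the same queue.
class MockMetadata {
public:
  explicit MockMetadata(std::shared_ptr<MockQueue> q) : queue_(std::move(q)) {}

  ~MockMetadata() {
    if (!queueReused_) {
      ++queue_->blockingWaits;  // stand-in for alpaka::wait() on the last event
    }
  }

  // A consumer re-uses the queue; its metadata will do the final wait.
  std::shared_ptr<MockQueue> handOverQueue() {
    queueReused_ = true;
    return queue_;
  }

private:
  std::shared_ptr<MockQueue> queue_;
  bool queueReused_ = false;
};

// Producer and consumer share one queue: only the consumer waits.
inline int demoQueueReuse() {
  auto queue = std::make_shared<MockQueue>();
  {
    MockMetadata producer(queue);
    MockMetadata consumer(producer.handOverQueue());
  }  // consumer is destroyed first (waits), then producer (skips)
  return queue->blockingWaits;
}
```

Without the hand-over flag both destructors would wait, giving two blocking calls for one queue instead of one.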

There is one case left (on purpose) where the EDProducer::produce() still calls alpaka::wait(). This happens when the produce() launches asynchronous work (i.e. calls device::Event::queue()), but does not produce any device-side data products. I can't imagine a good use case for such an EDProducer (except maybe a DQM-style module, but that would need a separate module base class in any case that could then deal with the necessary edm::Event-level synchronization).
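The resulting behaviour can be summarized as a small decision function. This is only a sketch of my reading of the description above; `SyncPoint` and `syncPoint()` are hypothetical names and do not exist in CMSSW.

```cpp
#include <cassert>

// Where the host-side blocking wait would happen (hypothetical enum).
enum class SyncPoint {
  None,               // no asynchronous work was launched
  EndOfProduce,       // the remaining case: wait inside produce()
  MetadataDestructor  // deferred to the metadata of the device product
};

// queueUsed: produce() touched the queue in any way.
// producedDeviceProduct: produce() put a device-side data product.
// The combination (false, true) is not expected in practice and is
// mapped to None here only to keep the function total.
constexpr SyncPoint syncPoint(bool queueUsed, bool producedDeviceProduct) {
  if (!queueUsed) {
    return SyncPoint::None;
  }
  if (producedDeviceProduct) {
    return SyncPoint::MetadataDestructor;
  }
  return SyncPoint::EndOfProduce;
}
```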

I don't expect a huge impact from the reduction of these calls, as in the present cases in HLT the alpaka::Events should have completed by the time the alpaka::wait() is called. The stack traces in #44769 (comment) nevertheless show the cudaEventSynchronize() acquiring a lock even when the CUDA event was complete at the time the event was recorded, so at least this PR should reduce the contention on that lock.

Resolves cms-sw/framework-team#902

PR validation:

HeterogeneousCore/Alpaka{Core,Test} unit tests pass.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Likely will be backported to 14_0_X

…as launched

Notably the case where an EDProducer launches asynchronous work, but
does not produce any device-side data products continues to lead to
blocking alpaka::wait() being called at the end of produce().
…not been reused by a consuming module

If the consuming module does not produce any device-side products, the
EDMetadata destructor should result in a blocking alpaka::wait() call.
@cmsbuild

cmsbuild commented Apr 24, 2024

cms-bot internal usage

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44841/40091

@cmsbuild

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • HeterogeneousCore/AlpakaTest (heterogeneous)

@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel

enable gpu

@makortel

@cmsbuild, please test

@makortel

> Likely will be backported to 14_0_X

@fwyzard This PR might warrant some further testing on the HLT side. Let me know how you want to proceed (e.g. I have a 14_0_X-based branch already too).

@@ -170,7 +170,7 @@ Also note that the `fillDescription()` function must have the same content for a
* All Event data products in the host memory space are guaranteed to be accessible for all operations (after the data product has been obtained from the `edm::Event` or `device::Event`).
* All EventSetup data products in the device memory space are guaranteed to be accessible only for operations enqueued in the `Queue` given by `device::Event::queue()` when accessed via the `device::EventSetup` (ED modules), or by `device::Record<TRecord>::queue()` when accessed via the `device::Record<TRecord>` (ESProducers).
* The EDM Stream does not proceed to the next Event until after all asynchronous work of the current Event has finished.
* **Note**: currently this guarantee does not hold if the job has any EDModule that launches asynchronous work but does not explicitly synchronize or produce any device-side data products.
* **Note**: this implies if an EDProducer in its `produce()` function uses the `Event::queue()` or gets a device-side data product, and does not produce any device-side data products, the `produce()` call will be synchronous (i.e. will block the CPU thread until the asynchronous work finishes)
Contributor Author

This update in the README actually holds (and is stronger) even without this PR, I just hadn't realized it.

Contributor

My brain hurts, but I trust you on this :-)

@cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fda4d/39077/summary.html
COMMIT: c237b97
CMSSW: CMSSW_14_1_X_2024-04-24-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44841/39077/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
24834.78 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39721
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@fwyzard

fwyzard commented Apr 25, 2024

The code changes look good to me.

I will try to test the impact on an HLT workflow.

@fwyzard

fwyzard commented Apr 25, 2024

here's the impact on the HLT (8 jobs, 32 threads / 24 streams each):

CMSSW_14_0_5_patch2
Running 4 times over 10300 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   669.6 ±   0.2 ev/s (10000 events, 99.1% overlap)
   672.7 ±   0.2 ev/s (10000 events, 98.7% overlap)
   668.4 ±   0.3 ev/s (10000 events, 98.5% overlap)
   671.1 ±   0.2 ev/s (10000 events, 99.0% overlap)
 --------------------
   670.5 ±   1.9 ev/s
CMSSW_14_0_5_patch2 plus cms-sw/cmssw#44841
Running 4 times over 10300 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   671.1 ±   0.2 ev/s (10000 events, 99.4% overlap)
   671.9 ±   0.2 ev/s (10000 events, 98.7% overlap)
   673.5 ±   0.2 ev/s (10000 events, 98.8% overlap)
   675.6 ±   0.2 ev/s (10000 events, 98.2% overlap)
 --------------------
   673.0 ±   2.0 ev/s

Whether the small improvement is real or a fluctuation is hard to say - but anyway it goes in the right direction.
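For a rough sense of scale, and assuming the two quoted ± values are independent standard deviations (an assumption, the benchmark output does not say how they are computed), the difference between the two averages works out to:

```latex
\Delta = 673.0 - 670.5 = 2.5\ \text{ev/s}, \qquad
\sigma_\Delta = \sqrt{1.9^2 + 2.0^2} \approx 2.8\ \text{ev/s}, \qquad
\frac{\Delta}{\sigma_\Delta} \approx 0.9
```

That is a relative change of about 0.4 %, less than one combined standard deviation under this assumption.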

@fwyzard

fwyzard commented Apr 25, 2024

+heterogeneous

@cmsbuild

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela

+1

@cmsbuild cmsbuild merged commit e6283d3 into cms-sw:master Apr 29, 2024
14 checks passed
@makortel makortel deleted the alpakaReducedEventSynchronize branch April 30, 2024 14:30