Reduce Alpaka event synchronization calls via EDMetadata #44841

makortel · 2024-04-24T17:48:10Z

PR description:

Prompted by #44769 (comment), this PR reduces the number of blocking alpaka::wait() calls (leading to cudaEventSynchronize() on NVIDIA GPUs) made via the EDMetadata destructor. To check the impact I used the reproducer from #44769 (even if it eventually leads to an assertion failure). Initially the reproducer led to 2470 calls to cudaEventSynchronize().

The first commit avoids calling alpaka::wait() in EDProducer::produce() when no asynchronous work was launched (possible async work is identified by the use of device::Event::queue() in any way, including consuming a device side data product). In this case there is no need to synchronize the device and host. This commit reduced the number of cudaEventSynchronize() calls to 2284 (-7.5 %).
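As an illustration of the idea in the first commit, here is a minimal sketch using mock types. `MockQueue`, `ProduceScopeMetadata`, and `finishProduce()` are hypothetical names invented for this example, not the CMSSW or alpaka API, and the sketch only models a producer that makes no device-side data products.

```cpp
#include <cassert>

// Hypothetical stand-in for a device queue; it only counts blocking waits.
struct MockQueue {
  int blockingWaits = 0;
  void wait() { ++blockingWaits; }  // stands in for a blocking alpaka::wait()
};

// Hypothetical stand-in for the metadata object that lives for one produce() call.
class ProduceScopeMetadata {
public:
  explicit ProduceScopeMetadata(MockQueue* queue) : queue_(queue) { assert(queue_ != nullptr); }

  // Any access to the queue is treated as "asynchronous work may have been launched".
  MockQueue& queue() {
    mayHaveAsyncWork_ = true;
    return *queue_;
  }

  // Called at the end of produce(): block only if the queue was ever handed out.
  void finishProduce() {
    if (mayHaveAsyncWork_) {
      queue_->wait();
    }
  }

private:
  MockQueue* queue_;
  bool mayHaveAsyncWork_ = false;
};
```

In this sketch a producer that never calls `queue()` finishes without any blocking wait, while one that touches the queue still synchronizes once.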

The second commit caches the alpaka::Event completion, so that alpaka::wait() does not need to be called again on an already-completed alpaka Event. This commit reduced the number of cudaEventSynchronize() calls to 2019 (-18 % wrt the starting point).
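The caching idea can be sketched as follows, again with mock types. `MockEvent` and `CachedCompletion` are hypothetical names for this example only; the sketch uses a plain `bool`, whereas state shared between threads would need thread-safe handling.

```cpp
#include <cassert>

// Hypothetical stand-in for a device event; it only counts blocking waits.
struct MockEvent {
  int blockingWaits = 0;
  void wait() { ++blockingWaits; }  // stands in for a blocking alpaka::wait()
};

// Remembers that the event has completed so that later calls do not block again.
class CachedCompletion {
public:
  explicit CachedCompletion(MockEvent* event) : event_(event) { assert(event_ != nullptr); }

  // Blocks at most once; every later call returns immediately.
  void synchronize() {
    if (completed_) {
      return;
    }
    event_->wait();
    completed_ = true;
  }

  bool completed() const { return completed_; }

private:
  MockEvent* event_;
  bool completed_ = false;
};
```

Calling `synchronize()` several times on the same object results in a single blocking wait in this sketch.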

The third commit avoids calling alpaka::wait() in the destructor of an EDMetadata when the Queue of the EDMetadata object has been re-used by another EDMetadata (in an EDProducer::produce() that consumed the device-side data product holding the first EDMetadata object). Since the only purpose of the alpaka::wait() in ~EDMetadata() is to ensure all asynchronous work of an edm::Event has completed before the edm::Stream moves to the next edm::Event, for any Queue it is enough to wait() on the alpaka::Event that was last recorded on that Queue. This commit reduced the number of cudaEventSynchronize() calls to 1462 (-40 % wrt the starting point).
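A minimal sketch of the queue-reuse rule, with mock types. `CountingQueue`, `ReusableQueueMetadata`, and `reuseQueue()` are hypothetical names for this example; the real classes and their ownership model may differ.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a device queue; it only counts blocking waits.
struct CountingQueue {
  int blockingWaits = 0;
};

// Hypothetical stand-in for a metadata object that shares ownership of a queue.
class ReusableQueueMetadata {
public:
  explicit ReusableQueueMetadata(std::shared_ptr<CountingQueue> queue) : queue_(std::move(queue)) {
    assert(queue_ != nullptr);
  }

  // A consuming module takes over the queue; the consumer's metadata then
  // becomes responsible for the final synchronization on this queue.
  std::shared_ptr<CountingQueue> reuseQueue() {
    queueReused_ = true;
    return queue_;
  }

  // Wait only if nobody reused the queue, i.e. this object saw the last recorded event.
  ~ReusableQueueMetadata() {
    if (!queueReused_) {
      ++queue_->blockingWaits;  // stands in for a blocking alpaka::wait()
    }
  }

private:
  std::shared_ptr<CountingQueue> queue_;
  bool queueReused_ = false;
};
```

With a chain of two metadata objects on one queue, only the last one in the chain performs the blocking wait in this sketch, regardless of the destruction order.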

There is one case left (on purpose) where the EDProducer::produce() still calls alpaka::wait(). This happens when the produce() launches asynchronous work (i.e. calls device::Event::queue()), but does not produce any device-side data products. I can't imagine a good use case for such an EDProducer (except maybe a DQM-style module, but that would need a separate module base class in any case that could then deal with the necessary edm::Event-level synchronization).

I don't expect a huge impact from the reduction of these calls, as in the present cases in HLT the alpaka::Events should have completed by the time the alpaka::wait() is called. The stack traces in #44769 (comment) nevertheless show the cudaEventSynchronize() acquiring a lock even when the CUDA event was complete at the time the event was recorded, so at least this PR should reduce the contention on that lock.

Resolves cms-sw/framework-team#902

PR validation:

HeterogeneousCore/Alpaka{Core,Test} unit tests pass.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Likely will be backported to 14_0_X

cmsbuild · 2024-04-24T17:48:38Z

cms-bot internal usage

cmsbuild · 2024-04-24T17:53:22Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44841/40091

This PR adds an extra 36KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File HeterogeneousCore/AlpakaCore/README.md modified in PR(s): [RFC] Add {Copy,Move}ToDeviceCache<T> class templates and moveToDeviceAsync function template #43969
- File HeterogeneousCore/AlpakaTest/test/testAlpakaModules_cfg.py modified in PR(s): [RFC] Add {Copy,Move}ToDeviceCache<T> class templates and moveToDeviceAsync function template #43969

cmsbuild · 2024-04-24T17:53:46Z

A new Pull Request was created by @makortel for master.

It involves the following packages:

HeterogeneousCore/AlpakaCore (heterogeneous)
HeterogeneousCore/AlpakaTest (heterogeneous)

@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

makortel · 2024-04-24T17:54:30Z

enable gpu

makortel · 2024-04-24T17:54:35Z

@cmsbuild, please test

makortel · 2024-04-24T17:56:13Z

> Likely will be backported to 14_0_X

@fwyzard This PR might warrant some further testing on the HLT side. Let me know how you want to proceed (e.g. I have a 14_0_X-based branch already too).

makortel · 2024-04-24T20:16:23Z

HeterogeneousCore/AlpakaCore/README.md

@@ -170,7 +170,7 @@ Also note that the `fillDescription()` function must have the same content for a
 * All Event data products in the host memory space are guaranteed to be accessible for all operations (after the data product has been obtained from the `edm::Event` or `device::Event`).
 * All EventSetup data products in the device memory space are guaranteed to be accessible only for operations enqueued in the `Queue` given by `device::Event::queue()` when accessed via the `device::EventSetup` (ED modules), or by `device::Record<TRecord>::queue()` when accessed via the `device::Record<TRecord>` (ESProducers).
 * The EDM Stream does not proceed to the next Event until after all asynchronous work of the current Event has finished.
-  * **Note**: currently this guarantee does not hold if the job has any EDModule that launches asynchronous work but does not explicitly synchronize or produce any device-side data products.
+  * **Note**: this implies if an EDProducer in its `produce()` function uses the `Event::queue()` or gets a device-side data product, and does not produce any device-side data products, the `produce()` call will be synchronous (i.e. will block the CPU thread until the asynchronous work finishes)


This update in the README actually holds (and is stronger) even without this PR, I just hadn't realized it.

fwyzard · 2024-04-25

My brain hurts, but I trust you on this :-)

cmsbuild · 2024-04-24T21:15:13Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fda4d/39077/summary.html
COMMIT: c237b97
CMSSW: CMSSW_14_1_X_2024-04-24-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44841/39077/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
24834.78 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

You potentially removed 60 lines from the logs
Reco comparison results: 40 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3319852
DQMHistoTests: Total failures: 3
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3319829
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39740
DQMHistoTests: Total failures: 19
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 39721
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

fwyzard · 2024-04-25T11:49:11Z

The code changes look good to me.

I will try to test the impact on an HLT workflow.

fwyzard · 2024-04-25T19:48:33Z

Here's the impact on the HLT (8 jobs, 32 threads / 24 streams each):

CMSSW_14_0_5_patch2

Running 4 times over 10300 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   669.6 ±   0.2 ev/s (10000 events, 99.1% overlap)
   672.7 ±   0.2 ev/s (10000 events, 98.7% overlap)
   668.4 ±   0.3 ev/s (10000 events, 98.5% overlap)
   671.1 ±   0.2 ev/s (10000 events, 99.0% overlap)
 --------------------
   670.5 ±   1.9 ev/s

CMSSW_14_0_5_patch2 plus cms-sw/cmssw#44841

Running 4 times over 10300 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
   671.1 ±   0.2 ev/s (10000 events, 99.4% overlap)
   671.9 ±   0.2 ev/s (10000 events, 98.7% overlap)
   673.5 ±   0.2 ev/s (10000 events, 98.8% overlap)
   675.6 ±   0.2 ev/s (10000 events, 98.2% overlap)
 --------------------
   673.0 ±   2.0 ev/s

Whether the small improvement is real or a fluctuation is hard to say - but anyway it goes in the right direction.
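To put a number on "hard to say", the two sets of four measurements can be compared directly. The sketch below assumes the quoted ± on each summary line is the sample standard deviation of the four runs (the numbers above are consistent with that); `Summary`, `summarize()`, and `separation()` are names made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
  double mean;
  double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// Mean and sample standard deviation of a set of throughput measurements.
Summary summarize(const std::vector<double>& values) {
  assert(values.size() > 1);
  const double n = static_cast<double>(values.size());
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  const double mean = sum / n;
  double squares = 0.0;
  for (double v : values) {
    squares += (v - mean) * (v - mean);
  }
  return {mean, std::sqrt(squares / (n - 1.0))};
}

// Difference of the means in units of the two spreads added in quadrature.
double separation(const Summary& before, const Summary& after) {
  return (after.mean - before.mean) / std::hypot(before.stddev, after.stddev);
}
```

Feeding in {669.6, 672.7, 668.4, 671.1} and {671.1, 671.9, 673.5, 675.6} reproduces the quoted 670.5 ± 1.9 and 673.0 ± 2.0 ev/s, and the separation comes out below one combined standard deviation, which matches the cautious reading above.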

fwyzard · 2024-04-25T19:48:40Z

+heterogeneous

cmsbuild · 2024-04-25T19:48:59Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

antoniovilela · 2024-04-29T16:41:43Z

+1

makortel added 3 commits April 24, 2024 12:42

7b30903 Avoid calling alpaka::wait() in produce() when no asynchronous work was launched

Notably the case where an EDProducer launches asynchronous work, but does not produce any device-side data products continues to lead to blocking alpaka::wait() being called at the end of produce().

d18a254 Add a cache to record event completion to further reduce alpaka::wait() calls

c237b97 Call alpaka::wait() in EDMetadata destructor only when its Queue has not been reused by a consuming module

If the consuming module does not produce any device-side products, the EDMetadata destructor should result in blocking alpaka::wait() call.

cmsbuild added this to the CMSSW_14_1_X milestone Apr 24, 2024

cmsbuild added pending-signatures tests-pending orp-pending code-checks-pending heterogeneous-pending labels Apr 24, 2024

makortel mentioned this pull request Apr 24, 2024

HLT farm crash in run 379617 #44769

Open

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 24, 2024

cmsbuild added tests-started and removed tests-pending labels Apr 24, 2024

cmsbuild added tests-approved and removed tests-started labels Apr 24, 2024

cmsbuild added fully-signed and removed pending-signatures heterogeneous-pending labels Apr 25, 2024

cmsbuild added the heterogeneous-approved label Apr 25, 2024

fwyzard mentioned this pull request Apr 25, 2024

Reduce Alpaka event synchronization calls via EDMetadata [14.0.x] #44854

Merged

cmsbuild added orp-approved and removed orp-pending labels Apr 29, 2024

cmsbuild merged commit e6283d3 into cms-sw:master Apr 29, 2024
14 checks passed

cmsbuild mentioned this pull request Apr 30, 2024

Reorder include statements #44870

Merged

makortel deleted the alpakaReducedEventSynchronize branch April 30, 2024 14:30
