Reduce the ECAL and HCAL GPU memory usage [12.4.x] #39580

fwyzard · 2022-10-03T12:44:37Z

PR description:

Allocate memory buffers based on the actual number of events, instead of always allocating the maximum size.

Declare the obsolete parameters as optional, and ignore them if they are present.
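The allocation change described above can be illustrated with a minimal sketch. This is not the CMSSW code from this PR: host memory stands in for the GPU allocation so the example runs anywhere, and `Buffer`, `allocateMax`, `allocateForEvent`, and `kMaxChannels` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical upper bound that a "maximum size" scheme would always allocate.
constexpr std::size_t kMaxChannels = 20000;

// Stand-in for a device buffer; host memory is used here instead of GPU memory.
struct Buffer {
  std::unique_ptr<float[]> data;
  std::size_t size = 0;
};

// Before: every event pays for the worst case, whatever it actually contains.
Buffer allocateMax() {
  return Buffer{std::make_unique<float[]>(kMaxChannels), kMaxChannels};
}

// After: the buffer is sized from the number of elements present in the event.
Buffer allocateForEvent(std::size_t nChannels) {
  return Buffer{std::make_unique<float[]>(nChannels), nChannels};
}
```

With many concurrent streams each holding its own buffers, the difference between the worst-case size and the actual size is paid once per stream, which is presumably why the saving is visible at the level of the whole job.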

Reduces the total GPU memory used when running the HLT with 4 jobs with 32 threads and 32 streams by about 25%:

| release | total | reserved | used | free |
|---|---|---|---|---|
| CMSSW_12_4_9 | 15360 MB | 449 MB | 10678 - 10090 MB | 4231 - 4819 MB |
| with #39580 | 15360 MB | 449 MB | 7366 - 8056 MB | 7543 - 6853 MB |
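As a rough cross-check of the "about 25%" figure, the reduction can be computed from the midpoints of the reported "used" ranges. Using midpoints is an assumption made here for illustration, not necessarily how the figure was derived; `midpoint` and `reduction` are hypothetical helper names.

```cpp
// Midpoint of a reported range, in MB.
constexpr double midpoint(double lo, double hi) { return 0.5 * (lo + hi); }

// Fractional reduction when going from `before` to `after`.
constexpr double reduction(double before, double after) {
  return (before - after) / before;
}

constexpr double usedBefore = midpoint(10090, 10678);  // CMSSW_12_4_9
constexpr double usedAfter = midpoint(7366, 8056);     // with #39580
constexpr double saved = reduction(usedBefore, usedAfter);
```

Under this assumption `saved` comes out just under 26%, consistent with the quoted value.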

Thanks to @VinInn for finding the issue and for the changes.

PR validation:

The full HLT menu runs on GPU (with 12.4.9 plus #39580) without issues.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #39577.

cmsbuild · 2022-10-03T12:45:04Z

A new Pull Request was created by @fwyzard (Andrea Bocci) for CMSSW_12_4_X.

It involves the following packages:

RecoLocalCalo/EcalRecProducers (reconstruction)
RecoLocalCalo/HcalRecProducers (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@youyingli, @apsallid, @rchatter, @argiro, @missirol, @bsunanda, @thomreis, @simonepigazzini, @mariadalfonso, @abdoulline this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild · 2022-10-03T12:50:40Z

Pull request #39580 was updated. @cmsbuild, @missirol, @mandrenguyen, @clacaputo, @Martin-Grunewald can you please check and sign again.

RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducerGPU.cc

cmsbuild · 2022-10-03T14:12:57Z

Pull request #39580 was updated. @cmsbuild, @mandrenguyen, @clacaputo can you please check and sign again.

Allocate memory buffers based on the actual number of events, instead of always allocating the maximum size.

cmsbuild · 2022-10-03T14:16:57Z

Pull request #39580 was updated. @cmsbuild, @mandrenguyen, @clacaputo can you please check and sign again.

fwyzard · 2022-10-03T14:17:01Z

enable gpu

fwyzard · 2022-10-03T14:17:05Z

please test

clacaputo · 2022-10-05T08:16:55Z

It seems the test got stuck and the build was aborted:
22:26:36 Build timed out (after 600 minutes). Marking the build as aborted.

clacaputo · 2022-10-05T08:17:16Z

please abort

clacaputo · 2022-10-05T08:17:24Z

enable gpu

clacaputo · 2022-10-05T08:17:32Z

please test

cmsbuild · 2022-10-05T12:20:50Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3e9dc2/28012/summary.html
COMMIT: 7d941ac
CMSSW: CMSSW_12_4_X_2022-10-04-2300/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39580/28012/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 3677402
DQMHistoTests: Total failures: 2
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3677378
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
Checked 208 log files, 45 edm output root files, 50 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
Reco comparison had 3 failed jobs
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19876
DQMHistoTests: Total failures: 8
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19868
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: found differences in 1 / 3 workflows

mandrenguyen · 2022-10-05T17:29:28Z

+1

cmsbuild · 2022-10-05T17:29:49Z

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_4_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_6_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-10-06T05:15:34Z

+1

Differences with respect to master are due to Reduce the ECAL and HCAL GPU memory usage [12.4.x] #39580 (comment): ok!

Reduce ECAL memory usage

e63520a

cmsbuild added this to the CMSSW_12_4_X milestone Oct 3, 2022

cmsbuild added orp-pending pending-signatures reconstruction-pending tests-pending labels Oct 3, 2022

cmsbuild added the hlt-pending label Oct 3, 2022

missirol reviewed Oct 3, 2022

RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducerGPU.cc

This was referenced Oct 3, 2022

Reduce the ECAL and HCAL GPU memory usage #39577

Merged

Reduce the ECAL and HCAL GPU memory usage [12.5.x] #39579

Merged

fwyzard force-pushed the reduce_ECAL_HCAL_GPU_memory_usage_124x branch from 3931451 to e8d2ba3 Compare October 3, 2022 14:12

cmsbuild removed the hlt-pending label Oct 3, 2022

Reduce the ECAL and HCAL GPU memory usage

7d941ac

Allocate memory buffers based on the actual number of events, instead of always allocating the maximum size.

fwyzard force-pushed the reduce_ECAL_HCAL_GPU_memory_usage_124x branch from e8d2ba3 to 7d941ac Compare October 3, 2022 14:16

cmsbuild added tests-started and removed tests-pending labels Oct 3, 2022

cmsbuild added tests-pending and removed tests-started labels Oct 5, 2022

cmsbuild added tests-started and removed tests-pending labels Oct 5, 2022

cmsbuild added tests-approved and removed tests-started labels Oct 5, 2022

cmsbuild added fully-signed reconstruction-approved and removed reconstruction-pending pending-signatures labels Oct 5, 2022

cmsbuild added orp-approved and removed orp-pending labels Oct 6, 2022

cmsbuild merged commit 2bc334f into cms-sw:CMSSW_12_4_X Oct 6, 2022

cmsbuild mentioned this pull request Oct 6, 2022

[Do Not Merge] New "CUDA" Memory Pool [12.4.X] #39623

Closed

fwyzard mentioned this pull request Oct 11, 2022

CUDA-related HLT crashes between run-359694 and run-359764 #39680

Closed

fwyzard deleted the reduce_ECAL_HCAL_GPU_memory_usage_124x branch October 11, 2022 14:09
