ECAL - GPU unpacker buffer overflow fix #38088

thomreis · 2022-05-25T17:07:51Z

PR description:

Increase the default maximum number of bytes per ECAL FED to support the full readout mode. This fixes the crash from http://cmsonline.cern.ch/cms-elog/1140795

PR validation:

Increasing the value in the HLT configuration fixes the ECAL full readout GPU unpacker crash in run 352176.

cmsbuild · 2022-05-25T17:14:23Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38088/30180

This PR adds an extra 12KB to repository

cmsbuild · 2022-05-25T17:14:51Z

A new Pull Request was created by @thomreis (Thomas Reis) for master.

It involves the following packages:

EventFilter/EcalRawToDigi (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@rchatter, @argiro, @Martin-Grunewald, @missirol, @thomreis, @simonepigazzini this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

thomreis · 2022-05-25T17:15:55Z

type bugfix

thomreis · 2022-05-25T17:18:28Z

Since this PR just changes the default value, which is anyway overridden by the HLT menu, there might be no need for a backport. I can still make one if desired.

missirol · 2022-05-27T15:06:35Z

please test

Hi @thomreis , (Marino here, from HLT)

thanks for the PR. Just a few (maybe ignorant) questions: would the larger buffer now be used in all cases? If so, is there any disadvantage in doing that (e.g. using buffers larger than needed in most cases)? Is "full readout mode" something which is used only for special tests, or also standard operations?

Maybe not a question for you, but is a similar change needed for HCAL? (cc: @fwyzard)

cmssw/EventFilter/HcalRawToDigi/plugins/DeclsForKernels.h

Line 18 in f4084ec

constexpr uint32_t nbytes_per_fed_max = 10 * 1024;

fwyzard · 2022-05-27T17:18:42Z

What is the total increase in memory size ?
If it is too large, and all the memory ends up allocated on the GPU at the same time, it might become a problem.

cmsbuild · 2022-05-27T19:10:38Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3dcdd4/25042/summary.html
COMMIT: aa640d4
CMSSW: CMSSW_12_5_X_2022-05-27-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38088/25042/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 3648315
DQMHistoTests: Total failures: 8
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3648285
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
Checked 208 log files, 45 edm output root files, 50 DQM output files
TriggerResults: no differences found

thomreis · 2022-05-28T22:01:29Z

This PR changes the default value to the maximum expected size per FED. This is the safe option and can always be overridden in the configuration.
For the 54 ECAL FEDs, the change increases the total buffer size about 4x, from ~550 kB to ~2.2 MB.

The maximum size per FED set here is for barrel FEDs. Endcap FEDs are typically about half that size in full readout mode, so the total buffer size calculated is larger than what is actually needed. One could optimise and save ~300 kB by taking into account that the endcap FED size is smaller.

Full readout mode is a special ECAL readout mode that is used only in special cases. The normal mode is selective readout, where the raw data size is around 2 kB per FED.
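
The sizes quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the 54 FEDs and the 41616-byte maximum come from this thread, while the previous default of 10 KiB per FED and the split into 36 barrel and 18 endcap FEDs are assumptions made here for illustration.

```python
# Values quoted in this thread.
N_ECAL_FEDS = 54
NEW_MAX_FED_SIZE = 41616  # bytes per FED, full readout mode (barrel)

# Assumption: the previous default was 10 KiB per FED, which reproduces
# the ~550 kB total quoted above.
OLD_MAX_FED_SIZE = 10 * 1024


def total_buffer_bytes(n_feds: int, max_fed_size: int) -> int:
    """Total raw-data buffer if every FED is given the same maximum size."""
    return n_feds * max_fed_size


old_total = total_buffer_bytes(N_ECAL_FEDS, OLD_MAX_FED_SIZE)
new_total = total_buffer_bytes(N_ECAL_FEDS, NEW_MAX_FED_SIZE)
growth = new_total / old_total

# Assumption: 18 of the 54 FEDs are endcap FEDs, each needing exactly half
# of the barrel maximum. "About half" in the thread is looser than this.
N_ENDCAP_FEDS = 18
endcap_saving = N_ENDCAP_FEDS * (NEW_MAX_FED_SIZE - NEW_MAX_FED_SIZE // 2)

print(f"old total: {old_total / 1e3:.0f} kB")
print(f"new total: {new_total / 1e6:.2f} MB")
print(f"growth factor: {growth:.2f}x")
print(f"possible endcap saving: {endcap_saving / 1e3:.0f} kB")
```

With exactly half-size endcap FEDs the estimated saving comes out somewhat above the ~300 kB quoted above, which is expected if the real endcap size is a bit more than half of the barrel size.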

fwyzard · 2022-05-30T09:27:44Z

I see, thanks.
So the value is not hardcoded, it is only the default -- then setting it to the safest value sounds like a very good idea.

In parallel to this change, the HLT configuration should be changed to set process.hltEcalDigisGPU.maxFedSize to 41616, but only for the special runs?
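
A minimal sketch of what such an override could look like as a customisation function. The module and parameter names (hltEcalDigisGPU, maxFedSize) and the value 41616 are taken from this thread; the function name is hypothetical, and the stand-in process object with its 10240 starting value exists only so that the snippet runs outside CMSSW.

```python
from types import SimpleNamespace

# Value discussed in this thread for ECAL full readout mode.
ECAL_FULL_READOUT_MAX_FED_SIZE = 41616


def customise_ecal_full_readout(process, max_fed_size=ECAL_FULL_READOUT_MAX_FED_SIZE):
    """Hypothetical helper: raise maxFedSize on the ECAL GPU unpacker if present.

    Configurations without the module, or without the parameter,
    are returned unchanged.
    """
    module = getattr(process, "hltEcalDigisGPU", None)
    if module is not None and hasattr(module, "maxFedSize"):
        module.maxFedSize = max_fed_size
    return process


# Stand-in for a real process object, for illustration only.
# The starting value 10240 is an assumption, not taken from an HLT menu.
process = SimpleNamespace(hltEcalDigisGPU=SimpleNamespace(maxFedSize=10240))
process = customise_ecal_full_readout(process)
print(process.hltEcalDigisGPU.maxFedSize)
```

Guarding on the presence of the module keeps the same helper usable for menus that do not run the GPU unpacker, so it could be applied only to the configuration of the special runs.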

thomreis · 2022-05-30T09:51:11Z

In parallel to this change, the HLT configuration should be changed to set process.hltEcalDigisGPU.maxFedSize to 41616, but only for the special runs?

There is a JIRA ticket about this: https://its.cern.ch/jira/browse/CMSHLT-2323

jpata · 2022-05-30T14:42:57Z

assign hlt

cmsbuild · 2022-05-30T14:43:26Z

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

jpata · 2022-05-30T14:44:25Z

+reconstruction

bugfix for HLT (no effect in reco)

missirol · 2022-05-30T14:59:46Z

+hlt

changes the default to a meaningful value (max needed for ECAL in full-readout mode)
adjustments for HLT operations (e.g. value to be used for standard pp collisions) can be solved via JIRA ticket

@thomreis , I think it would make sense to change this default setting also in previous releases (even though this is not critical). Would you like to make backports to 12_4_X and 12_3_X?

@fwyzard , I'm still wondering if anything similar to this should be done for HCAL (and it seems that the number for HCAL is not configurable from the python PSet). Do we need to ask the HCAL DPG?

cmssw/EventFilter/HcalRawToDigi/plugins/DeclsForKernels.h

Line 18 in f4084ec

constexpr uint32_t nbytes_per_fed_max = 10 * 1024;

cmsbuild · 2022-05-30T15:00:07Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-05-30T15:26:13Z

+1

jpata · 2022-06-02T13:25:55Z

type ecal

Increase max number of bytes per FED to support full readout mode.

aa640d4

cmsbuild added this to the CMSSW_12_5_X milestone May 25, 2022

cmsbuild added code-checks-pending orp-pending pending-signatures reconstruction-pending tests-pending labels May 25, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels May 25, 2022

cmsbuild added the bug-fix label May 25, 2022

cmsbuild added tests-started and removed tests-pending labels May 27, 2022

cmsbuild added tests-approved and removed tests-started labels May 27, 2022

cmsbuild added the hlt-pending label May 30, 2022

cmsbuild added reconstruction-approved and removed reconstruction-pending labels May 30, 2022

cmsbuild added fully-signed hlt-approved and removed hlt-pending pending-signatures labels May 30, 2022

cmsbuild added orp-approved and removed orp-pending labels May 30, 2022

cmsbuild merged commit e5e67af into cms-sw:master May 30, 2022

This was referenced May 30, 2022

ECAL - GPU unpacker buffer overflow fix - 12_4_X #38123

Merged

ECAL - GPU unpacker buffer overflow fix - 12_3_X #38124

Merged

This was referenced May 30, 2022

Remove model files of depreciated tauIds. cms-data/RecoTauTag-TrainingFiles#9

Merged

Fix static variable problem in payload inspector code #38155

Merged

cmsbuild added the ecal label Jun 2, 2022

thomreis deleted the ecal-gpu-unpacker-buffer-overflow-fix branch June 3, 2022 15:09
