ECAL - GPU unpacker buffer overflow fix #38088

Merged


thomreis
Contributor

PR description:

Increase the default maximum number of bytes per ECAL FED to support the full readout mode. This fixes the crash from http://cmsonline.cern.ch/cms-elog/1140795

PR validation:

Increasing the value in the HLT configuration fixes the ECAL full readout GPU unpacker crash in run 352176.
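
To illustrate why a per-FED maximum that is too small can overflow a buffer, here is a minimal Python sketch. It assumes the raw data of all FEDs are packed into one flat buffer with one fixed-size slot per FED. The function name `pack_feds` and this layout are hypothetical and are not taken from the actual unpacker code. The payload sizes reuse numbers quoted in this thread (about 2 kB in selective readout, 41616 bytes in full readout).

```python
def pack_feds(payloads, max_fed_size):
    """Copy each FED payload into its own fixed-size slot of one flat buffer.

    Hypothetical layout for illustration: slot i starts at i * max_fed_size.
    """
    buffer = bytearray(len(payloads) * max_fed_size)
    for i, payload in enumerate(payloads):
        if len(payload) > max_fed_size:
            # Without this check the copy would spill into the next slot,
            # or past the end of the buffer for the last FED.
            raise ValueError(
                f"FED slot {i}: {len(payload)} bytes exceed "
                f"max_fed_size={max_fed_size}"
            )
        offset = i * max_fed_size
        buffer[offset:offset + len(payload)] = payload
    return buffer


# Selective readout (~2 kB per FED) fits into 10 kB slots.
small = pack_feds([b"\x01" * 2048, b"\x02" * 2048], 10 * 1024)

# A full-readout-sized payload does not fit into a 10 kB slot.
try:
    pack_feds([b"\x00" * 41616], 10 * 1024)
    overflow_detected = False
except ValueError:
    overflow_detected = True
```

In this sketch, raising the slot size to the largest expected payload removes the failure, at the cost of a larger total buffer.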

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38088/30180

  • This PR adds an extra 12KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @thomreis (Thomas Reis) for master.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@rchatter, @argiro, @Martin-Grunewald, @missirol, @thomreis, @simonepigazzini this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@thomreis
Contributor Author

type bugfix

@thomreis
Contributor Author

Since this PR just changes the default value, which is anyway overridden by the HLT menu, there might be no need for a backport. I can still make one if desired.

@missirol
Contributor

please test

Hi @thomreis (Marino here, from HLT),

thanks for the PR. Just a few (maybe ignorant) questions: would the larger buffer now be used in all cases? If so, is there any disadvantage in doing that (e.g. using buffers larger than needed in most cases)? Is "full readout mode" something which is used only for special tests, or also standard operations?

Maybe not a question for you, but is a similar change needed for HCAL? (cc: @fwyzard)

constexpr uint32_t nbytes_per_fed_max = 10 * 1024;

@fwyzard
Contributor

fwyzard commented May 27, 2022

What is the total increase in memory size?
If it is too large, and all the memory ends up allocated on the GPU at the same time, it might become a problem.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3dcdd4/25042/summary.html
COMMIT: aa640d4
CMSSW: CMSSW_12_5_X_2022-05-27-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38088/25042/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3648315
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3648285
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@thomreis
Contributor Author

This PR changes the default value to the maximum expected size per FED. This is the safe option and can always be overridden in the configuration.
For the 54 ECAL FEDs, the change increases the total buffer size about fourfold, from ~550 kB to ~2.2 MB.

The maximum size per FED set here is for barrel FEDs. Endcap FEDs are typically about half that size in full readout mode. This means that the total buffer size calculated is larger than actually needed. One could optimise and save ~300 kB by taking into account that the endcap FED size is smaller.

Full readout mode is a special readout mode for ECAL and is used only in special cases. The normal mode is selective readout, where the raw data size is around 2 kB per FED.
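
The buffer sizes quoted above can be reproduced with a short calculation. This is a sketch under stated assumptions: the old per-FED default of 10 * 1024 bytes is inferred from the ~550 kB total, the new value of 41616 bytes is the `maxFedSize` figure mentioned elsewhere in this thread, and the split into 36 barrel and 18 endcap FEDs, with endcap FEDs at exactly half the barrel size, is an assumption used only to estimate the possible saving.

```python
N_ECAL_FEDS = 54                # number of ECAL FEDs, from the thread
OLD_MAX_FED_SIZE = 10 * 1024    # assumed old default, consistent with ~550 kB
NEW_MAX_FED_SIZE = 41616        # maxFedSize value mentioned in the thread


def total_buffer_bytes(n_feds, max_fed_size):
    """Total raw-data buffer size if every FED gets a slot of max_fed_size."""
    return n_feds * max_fed_size


old_total = total_buffer_bytes(N_ECAL_FEDS, OLD_MAX_FED_SIZE)
new_total = total_buffer_bytes(N_ECAL_FEDS, NEW_MAX_FED_SIZE)
ratio = new_total / old_total

# Assumption: 18 of the 54 FEDs are endcap FEDs and need only half a slot.
N_ENDCAP_FEDS = 18
endcap_saving = N_ENDCAP_FEDS * (NEW_MAX_FED_SIZE // 2)

print(f"old total: {old_total} bytes")
print(f"new total: {new_total} bytes")
print(f"ratio: {ratio:.2f}")
print(f"possible endcap saving: {endcap_saving} bytes")
```

With these assumptions the totals are 552960 and 2247264 bytes, matching the ~550 kB and ~2.2 MB quoted above. The estimated endcap saving of 374544 bytes is of the same order as the ~300 kB quoted; the difference depends on how close to "half" the endcap FED size really is.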

@fwyzard
Contributor

fwyzard commented May 30, 2022

I see, thanks.
So the value is not hardcoded, it is only the default -- then setting it to the safest value sounds like a very good idea.

In parallel to this change, should the HLT configuration be changed to set process.hltEcalDigisGPU.maxFedSize to 41616, but only for the special runs?
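
As an illustration of such an override, a minimal customisation function could look like the sketch below. The function name `customise_ecal_full_readout` is hypothetical, and the sketch assumes that `maxFedSize` can be set by plain attribute assignment on the `hltEcalDigisGPU` module of the `process` object. A `SimpleNamespace` stands in for the real `process` so that the snippet runs without CMSSW.

```python
from types import SimpleNamespace


def customise_ecal_full_readout(process, max_fed_size=41616):
    """Set maxFedSize on the ECAL GPU unpacker module, if the menu has one.

    Hypothetical helper for illustration; not part of the PR.
    """
    module = getattr(process, "hltEcalDigisGPU", None)
    if module is not None:
        module.maxFedSize = max_fed_size
    return process


# Stand-in for a real configuration object (assumption: 10240 was the old value).
process = SimpleNamespace(hltEcalDigisGPU=SimpleNamespace(maxFedSize=10 * 1024))
process = customise_ecal_full_readout(process)

# A configuration without the module is returned unchanged.
empty = customise_ecal_full_readout(SimpleNamespace())
```

Guarding on the presence of the module keeps the function safe to apply to menus that do not run the GPU unpacker.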

@thomreis
Contributor Author

> In parallel to this change, should the HLT configuration be changed to set process.hltEcalDigisGPU.maxFedSize to 41616, but only for the special runs?

There is a JIRA ticket about this: https://its.cern.ch/jira/browse/CMSHLT-2323

@jpata
Contributor

jpata commented May 30, 2022

assign hlt

@cmsbuild
Contributor

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jpata
Contributor

jpata commented May 30, 2022

+reconstruction

  • bugfix for HLT (no effect in reco)

@missirol
Contributor

+hlt

  • changes the default to a meaningful value (max needed for ECAL in full-readout mode)
  • adjustments for HLT operations (e.g. the value to be used for standard pp collisions) can be handled via the JIRA ticket

@thomreis, I think it would make sense to change this default setting in previous releases as well (even though this is not critical). Would you like to make backports to 12_4_X and 12_3_X?

@fwyzard, I'm still wondering whether anything similar should be done for HCAL (where the number does not seem to be configurable from the python PSet). Do we need to ask the HCAL DPG?

constexpr uint32_t nbytes_per_fed_max = 10 * 1024;

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@jpata
Contributor

jpata commented Jun 2, 2022

type ecal

@cmsbuild cmsbuild added the ecal label Jun 2, 2022
@thomreis thomreis deleted the ecal-gpu-unpacker-buffer-overflow-fix branch June 3, 2022 15:09

6 participants