increase size of buffer in InitMsgBuilder [12_3_X] #37940

Merged

missirol
Contributor

backport of #37937

Attn: @smorovic

PR description:

From the original PR:

This PR fixes a problem reported by @fwyzard when running the full HLT GRun menu using GlobalEvFOutputModule as cms.OutputModule (which is how the HLT produces output files online).

The issue occurs in the beginRun stage, when serializing the content of the "INI" streamer files: the buffer that InitMsgBuilder passes to EventMsgBuilder can be too small when the configuration contains a large enough number of L1T seeds and HLT paths.

The current size of 256 is insufficient, for example, in the presence of 500 L1T seeds and 500 HLT paths.

The issue leads to a crash, which can be reproduced by applying the following minimal change to the relevant DAQ unit test and then running the script listed after the diff.

diff --git a/EventFilter/Utilities/test/startFU.py b/EventFilter/Utilities/test/startFU.py
index c00d612aae8..9133a1196aa 100644
--- a/EventFilter/Utilities/test/startFU.py
+++ b/EventFilter/Utilities/test/startFU.py
@@ -128,6 +128,9 @@ process.tcdsRawToDigi.InputLabel = cms.InputTag("rawDataCollector")
 process.p1 = cms.Path(process.a*process.tcdsRawToDigi*process.filter1)
 process.p2 = cms.Path(process.b*process.filter2)
 
+for pidx in range(3,1000):
+  setattr(process, f'p{pidx}', cms.Path(process.b))
+
 process.streamA = cms.OutputModule("EvFOutputModule",
     SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring( 'p1' ))
 )
./EventFilter/Utilities/test/LocalRunBUFU.sh
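
For readers without a CMSSW area at hand, the effect of the loop in the diff can be checked in plain Python: it adds paths p3 through p999 on top of the existing p1 and p2. This is only a sketch; `SimpleNamespace` is a hypothetical stand-in for the real `process` object, and the string values stand in for `cms.Path` instances.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the cms.Process object (NOT the CMSSW API).
# The two existing paths of the unit test are represented by plain strings.
process = SimpleNamespace(p1="path-1", p2="path-2")

# Same loop as in the diff, with a string in place of cms.Path(process.b).
for pidx in range(3, 1000):
    setattr(process, f"p{pidx}", f"path-{pidx}")

# Collect every attribute whose name looks like a path (p1, p2, ..., p999).
path_names = [name for name in vars(process) if name.startswith("p")]
print(len(path_names))
```

The modified test configuration therefore defines 999 paths in total, while the output module still selects events on 'p1' only.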

To my knowledge, this problem affects both GlobalEvFOutputModule and EvFOutputModule.

Given the deadline for 12_4_0_pre4 (and possible need for a patch release in 12_3_X), this PR applies a minimal fix increasing the buffer size.

A buffer size of 640 should be sufficient for 512 L1T seeds and 2000 HLT paths (the current HLT menu for pp collisions has approx. 800 paths).

In the near future, the algorithm could be improved to compute a suitable buffer size from the number of L1T seeds and HLT paths in the configuration.
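
As a rough illustration of what such a computation might look like, the sketch below estimates the space taken by the packed trigger bits. It is not the CMSSW implementation: the function name is hypothetical, and the packing (1 bit per L1T seed, 2 bits per HLT path, each array rounded up to whole bytes) is an assumption, as is the idea that the buffer size is counted in bytes. The fixed size of the other header fields is not stated in this PR, so it is left as a parameter.

```python
def packed_trigger_bytes(n_l1_seeds: int, n_hlt_paths: int, header_overhead: int = 0) -> int:
    """Estimate the bytes needed for the trigger-bit arrays of an event header.

    ASSUMPTIONS (not taken from the PR): 1 bit per L1T seed and 2 bits per
    HLT path, each array rounded up to a whole number of bytes.
    `header_overhead` is a caller-supplied guess for all other header fields.
    """
    if n_l1_seeds < 0 or n_hlt_paths < 0 or header_overhead < 0:
        raise ValueError("counts and overhead must be non-negative")
    l1_bytes = (n_l1_seeds + 7) // 8    # ceil(n / 8): 8 seeds per byte
    hlt_bytes = (n_hlt_paths + 3) // 4  # ceil(n / 4): 4 paths per byte
    return l1_bytes + hlt_bytes + header_overhead


if __name__ == "__main__":
    # The two trigger-count scenarios quoted in the PR description.
    for n_l1, n_hlt in [(500, 500), (512, 2000)]:
        print(n_l1, n_hlt, packed_trigger_bytes(n_l1, n_hlt))
```

Under these assumptions the bit arrays alone grow linearly with the trigger counts, so a size derived from the configuration would avoid having to revisit a hard-coded constant as the menu grows.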

Debugged with @fwyzard.

PR validation:

None. Relies on the testing done for the original PR.

If this PR is a backport, please specify the original PR and why you need to backport that PR:

#37937

Potentially needed for collisions data-taking in 12_3_X.

@cmsbuild
Contributor

cmsbuild commented May 13, 2022

A new Pull Request was created by @missirol (Marino Missiroli) for CMSSW_12_3_X.

It involves the following packages:

  • IOPool/Streamer (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@makortel, @wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@missirol
Contributor Author

type bugfix

@missirol
Contributor Author

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ebab18/24694/summary.html
COMMIT: 9c8fbba
CMSSW: CMSSW_12_3_X_2022-05-12-2300/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37940/24694/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 27.0: 27.0_WM+WMINPUT+DIGI+RECO+HARVEST/step2_WM+WMINPUT+DIGI+RECO+HARVEST.log
  • 26.0: 26.0_WE+WEINPUT+DIGI+RECOAlCaCalo+HARVEST/step2_WE+WEINPUT+DIGI+RECOAlCaCalo+HARVEST.log
  • 23.0: 23.0_JpsiMM+JpsiMMINPUT+DIGI+RECO+HARVEST/step2_JpsiMM+JpsiMMINPUT+DIGI+RECO+HARVEST.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3696954
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3696924
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

Looks like a failure in test infrastructure

Fatal in <TROOT::InitInterpreter>: cannot load library /cvmfs/cms-ib.cern.ch/nweek-02732/slc7_amd64_gcc10/lcg/root/6.24.07-6b24df5a7040a677b8f0d27957c7cb74/lib/libRIO.so: cannot open shared object file: Transport endpoint is not connected
/bin/sh: /cvmfs/cms-ib.cern.ch/nweek-02732/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-05-08-0000/bin/slc7_amd64_gcc10/cmsDriver.py: Transport endpoint is not connected
13-May-2022 06:37:05 UTC  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_9_2_4/RelValWE/GEN-SIM/91X_mcRun1_realistic_v2-v1/00000/4EE1D275-BA61-E711-8D08-0CC47A4D7644.root
%MSG-w XrdAdaptor:  file_open 13-May-2022 06:37:07 UTC pre-events
Data is served from cern.ch instead of original site eoscms
%MSG
13-May-2022 06:37:07 UTC  Successfully opened file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_9_2_4/RelValWE/GEN-SIM/91X_mcRun1_realistic_v2-v1/00000/4EE1D275-BA61-E711-8D08-0CC47A4D7644.root


A fatal system signal has occurred: bus error
The following is the call stack containing the origin of the signal.

Fri May 13 06:37:08 UTC 2022
Thread 1 (process 22086):

Current Modules:

Module: none (crashed)

A fatal system signal has occurred: bus error

@makortel
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_3_X IBs (but tests are reportedly failing) and once validation in the development release cycle CMSSW_12_4_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@missirol
Contributor Author

urgent

To signal that this fix targets the next 12_3_X release (discussed today in the Joint PPD/TRG/O&C Ops Meeting).

@qliphy
Contributor

qliphy commented May 14, 2022

merge

@cmsbuild cmsbuild merged commit 43c0137 into cms-sw:CMSSW_12_3_X May 14, 2022
@missirol missirol deleted the devel_fixBufferOfInitMsgBuilder_123X branch May 15, 2022 21:56