Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTPPS related issue in several IB workflows #35928

Closed
qliphy opened this issue Nov 1, 2021 · 29 comments
Closed

CTPPS related issue in several IB workflows #35928

qliphy opened this issue Nov 1, 2021 · 29 comments

Comments

@qliphy
Copy link
Contributor

qliphy commented Nov 1, 2021

Although the issues mentioned in #35927 should have been fixed mostly by #35766
There appears several CTPPS related issues in IB:

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_2/2021-10-31-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=X&selectedStatus=failed

For example, workflow 136.8311
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-10-31-2300/pyRelValMatrixLogs/run/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/step2_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017.log#/115-115

----- Begin Fatal Exception 01-Nov-2021 02:51:10 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 305064 lumi: 36 event: 55020723 stream: 1
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Calling method for module CTPPSProtonProducer/'ctppsProtons'
Additional Info:
[a] Fatal Root Error: @sub=TFormula::Eval
Formula is invalid and not ready to execute

and workflow 136.796

Module: CTPPSProtonProducer:ctppsProtons (crashed)
Module: StandAloneMuonProducer:displacedStandAloneMuons
Module: LXXXCorrectorProducer:ak4CaloResidualCorrector
Module: PreshowerPhiClusterProducer:multi5x5SuperClustersWithPreshower

A fatal system signal has occurred: segmentation violation

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 1, 2021

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@qliphy
Copy link
Contributor Author

qliphy commented Nov 1, 2021

The issues should be related to #35766 or #35914

@malbouis @CTPPS

@qliphy
Copy link
Contributor Author

qliphy commented Nov 1, 2021

assign dqm, db, alca

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 1, 2021

New categories assigned: dqm,db,alca

@jfernan2,@ahmad3213,@yuanchao,@rvenditti,@emanueleusai,@ggovi,@francescobrivio,@francescobrivio,@pbo0,@malbouis,@malbouis,@tvami,@tvami,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@tvami
Copy link
Contributor

tvami commented Nov 1, 2021

@jan-kaspar can you please have a look?

@qliphy
Copy link
Contributor Author

qliphy commented Nov 1, 2021

@tvami @malbouis Thanks. Yes, it is curious. Let's wait for next IB results. In the meantime, I will do some local tests.

@makortel
Copy link
Contributor

makortel commented Nov 1, 2021

If it problem is a threading issue (e.g. a race condition), it would not show up in PR tests (unless multithreaded tests were enabled explicitly), but would show up IB tests.

@francescobrivio
Copy link
Contributor

Just to make it more evident there is another crash (spotted by @mmusich) in the same IB and still related to CTPPS:

----- Begin Fatal Exception 01-Nov-2021 03:11:13 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 317435 lumi: 36 event: 47465289 stream: 3
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module DQMMessageLogger/'DQMMessageLogger'
   [3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [4] Calling method for module CTPPSProtonProducer/'ctppsProtons'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
----- End Fatal Exception -------------------------------------------------

from: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-10-31-2300/pyRelValMatrixLogs/run/136.8642_RunJetHT2018BHEfail+RunJetHT2018BHEfail+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Prompt_HEfail+HARVEST2018_HEfail/step3_RunJetHT2018BHEfail+RunJetHT2018BHEfail+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Prompt_HEfail+HARVEST2018_HEfail.log#/833

I have reported it in #35936

@francescobrivio
Copy link
Contributor

If it problem is a threading issue (e.g. a race condition), it would not show up in PR tests (unless multithreaded tests were enabled explicitly), but would show up IB tests.

Thanks @makortel for the suggestion, I believe @malbouis is testing this.
Then maybe it would be worth to enable multithreading by default also in the PR test in order to avoid such issues in the future...

@malbouis
Copy link
Contributor

malbouis commented Nov 1, 2021

As a comment, I'm not sure how useful it is, but I have ran workflows 136.8311 and 136.796 with runTheMatrix on the latest IB and observed no crashes. Will try it with --nThreads > 1 in one of these to see if I'm able to reproduce the error.

@makortel
Copy link
Contributor

makortel commented Nov 1, 2021

Then maybe it would be worth to enable multithreading by default also in the PR test in order to avoid such issues in the future...

Multi-threading in PR tests would effectively destroy bitwise reproducibility of simulated workflows, e.g. because different GEN events could be simulated in different EDM streams leading to different random number sequences between runs. So far race conditions have been rare enough to continue with the current setup.

@jan-kaspar
Copy link
Contributor

I have run runTheMatrix -l 136.796,136.816,136.8642,23234.103 --ibeos on LXPLUS and it yielded 4 4 3 3 tests passed, 0 0 1 0 failed. Thus I cannot reproduce many of the failures anymore. I continue debugging 136.8642, which is the failure I can reproduce (and also reported by Francesco above: #35928 (comment)).

@malbouis
Copy link
Contributor

malbouis commented Nov 1, 2021

As a comment, I'm not sure how useful it is, but I have ran workflows 136.8311 and 136.796 with runTheMatrix on the latest IB and observed no crashes. Will try it with --nThreads > 1 in one of these to see if I'm able to reproduce the error.

I guess testing the multithreaded option is not as straightforward as I thought.
I tried rerunning step 3 of wf 136.8311 while enabling --nThreads 2 but I get the following error:

%MSG-i ThreadStreamSetup: (NoModuleName) 01-Nov-2021 13:07:29 CET pre-events
setting # threads 2
setting # streams 2
%MSG
----- Begin Fatal Exception 01-Nov-2021 13:07:47 CET-----------------------
An exception of category 'ModulesSynchingOnLumis' occurred while
[0] Calling beginJob
Exception Message:
The framework is configured to use at least two streams, but the following modules
require synchronizing on LuminosityBlock boundaries:
QualityTester qTesterJet
QualityTester qTesterMET
DataCertificationJetMET dataCertificationJetMET
QualityTester muonSourcesQualityTests
MuonTrackResidualsTest muTrackResidualsTest
EfficiencyPlotter effPlotterLooseMiniAOD
EfficiencyPlotter effPlotterMediumMiniAOD
EfficiencyPlotter effPlotterTightMiniAOD
MuonRecoTest muRecoTest
QualityTester muonClientsQualityTests
MuonTestSummary muonTestSummary
TriggerMatchEfficiencyPlotter triggerMatchEffPlotterTightMiniAOD
PFJetDQMPostProcessor pfJetDQMPostProcessor
OffsetDQMPostProcessor offsetDQMPostProcessor

The situation can be fixed by either

  • modifying the modules to support concurrent LuminosityBlocks (preferred), or
  • setting 'process.options.numberOfConcurrentLuminosityBlocks = 1' in the configuration file
    ----- End Fatal Exception -------------------------------------------------

@qliphy
Copy link
Contributor Author

qliphy commented Nov 1, 2021

I guess testing the multithreaded option is not as straightforward as I thought. I tried rerunning step 3 of wf 136.8311 while enabling --nThreads 2 but I get the following error:

Instead of just running step3 of wf 136.8311, I tried to run locally with 4 threads:
runTheMatrix.py -l 136.8311 --job-reports -t 4 --ibeos
under both CMSSW_12_2_X_2021-11-01-1100 and CMSSW_12_2_X_2021-10-31-2300
and both work well without any issue.

@Dr15Jones
Copy link
Contributor

A very quick look at the PPS code I found

mutable std::vector<std::shared_ptr<TF1> > f_means_ COND_TRANSIENT;
mutable std::vector<std::shared_ptr<TF1> > f_thresholds_ COND_TRANSIENT;

where a mutable is likely to be the cause of thread-safety problems.

@Dr15Jones
Copy link
Contributor

Indeed, it looks like those vectors are filled from a const function

// build functions if not already done
// (this may happen if data (string representation) are loaded from DB and the constructor is not executed)
if (f_means_.size() < s_means_.size())
buildFunctions();

This is not thread-safe.

@Dr15Jones
Copy link
Contributor

NOTE: the DB does have a mechanism to modify the object right after read from DB but before it is put out into the EventSetup. That allows one to avoid using mutables and provides a thread-safe way to update objects coming out of storage.

@jan-kaspar
Copy link
Contributor

Thanks @Dr15Jones ! Could you please give me a pointer to this mechanism? I can open a fix RP shortly then.

@Dr15Jones
Copy link
Contributor

@jan-kaspar I can try to find it but this is really the domain for @cms-sw/db-l2

@jan-kaspar
Copy link
Contributor

Thanks @Dr15Jones ! Anyone's help appreciated. I will try googling in the meantime.

@Dr15Jones
Copy link
Contributor

Dr15Jones commented Nov 1, 2021

So it looks like you must call REGISTER_PLUGIN_INIT when registering your new EventSetup data product object and pass in an 'Initializer' class which will update the object before handing it to the EventSetup for access. I found examples at

REGISTER_PLUGIN_INIT(SiPixelGainCalibrationRcd, SiPixelGainCalibration, InitGains<SiPixelGainCalibration>);

with

template <typename G>
struct InitGains {
void operator()(G& g) { g.initialize(); }
};

@jan-kaspar
Copy link
Contributor

Thanks again, I will give it a try!

@Dr15Jones
Copy link
Contributor

In general, one should never use mutable data for data products for the Run, LuminosityBlock, Event or the EventSetup.

@jan-kaspar
Copy link
Contributor

In general, one should never use mutable data for data products for the Run, LuminosityBlock, Event or the EventSetup.

Got it! @grzanka @fabferro Possibly good to point this out in the next PPS SW meeting (so as we prevent making this mistake again).

@jan-kaspar
Copy link
Contributor

Hopefully, here's a fix: #35941

@qliphy
Copy link
Contributor Author

qliphy commented Nov 4, 2021

After merging #35941 new IB tests look good.

@qliphy qliphy closed this as completed Nov 4, 2021
@jfernan2
Copy link
Contributor

jfernan2 commented Nov 4, 2021

+1
For the records

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants