CTPPS related issue in several IB workflows #35928

qliphy · 2021-11-01T06:11:01Z

Although the issues mentioned in #35927 should mostly have been fixed by #35766,
several CTPPS-related issues still appear in the IB:

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_2/2021-10-31-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=X&selectedStatus=failed

For example, workflow 136.8311
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-10-31-2300/pyRelValMatrixLogs/run/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/step2_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017.log#/115-115

----- Begin Fatal Exception 01-Nov-2021 02:51:10 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 305064 lumi: 36 event: 55020723 stream: 1
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Calling method for module CTPPSProtonProducer/'ctppsProtons'
Additional Info:
[a] Fatal Root Error: @sub=TFormula::Eval
Formula is invalid and not ready to execute

and workflow 136.796

Module: CTPPSProtonProducer:ctppsProtons (crashed)
Module: StandAloneMuonProducer:displacedStandAloneMuons
Module: LXXXCorrectorProducer:ak4CaloResidualCorrector
Module: PreshowerPhiClusterProducer:multi5x5SuperClustersWithPreshower

A fatal system signal has occurred: segmentation violation

cmsbuild · 2021-11-01T06:11:22Z

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

qliphy · 2021-11-01T06:11:57Z

The issues should be related to #35766 or #35914

@malbouis @CTPPS

qliphy · 2021-11-01T06:12:52Z

assign dqm, db, alca

cmsbuild · 2021-11-01T06:13:11Z

New categories assigned: dqm,db,alca

@jfernan2,@ahmad3213,@yuanchao,@rvenditti,@emanueleusai,@ggovi,@francescobrivio,@francescobrivio,@pbo0,@malbouis,@malbouis,@tvami,@tvami,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

tvami · 2021-11-01T06:16:17Z

@jan-kaspar can you please have a look?

tvami · 2021-11-01T06:23:02Z

I'm somewhat puzzled how these didn't come up in the tests of #35766:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c5d209/20110/runTheMatrixINPUT-results/136.796_RunMET2017C+RunMET2017C+HLTDR2_2017+RECODR2_2017reHLT_skimMET_Prompt+HARVEST2017/
and
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c5d209/20110/runTheMatrixINPUT-results/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/
don't show this error, right?

malbouis · 2021-11-01T07:44:45Z

I'm somewhat puzzled how these didn't come up in the tests of #35766: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c5d209/20110/runTheMatrixINPUT-results/136.796_RunMET2017C+RunMET2017C+HLTDR2_2017+RECODR2_2017reHLT_skimMET_Prompt+HARVEST2017/ and https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c5d209/20110/runTheMatrixINPUT-results/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/ don't show this error, right?

In addition to those two above, also wf 136.8642 was tested in #35914 without crashes.

qliphy · 2021-11-01T09:12:37Z

@tvami @malbouis Thanks. Yes, it is curious. Let's wait for next IB results. In the meantime, I will do some local tests.

makortel · 2021-11-01T11:08:02Z

If the problem is a threading issue (e.g. a race condition), it would not show up in PR tests (unless multithreaded tests were enabled explicitly), but it would show up in IB tests.

francescobrivio · 2021-11-01T11:23:30Z

Just to make it more evident there is another crash (spotted by @mmusich) in the same IB and still related to CTPPS:

----- Begin Fatal Exception 01-Nov-2021 03:11:13 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 317435 lumi: 36 event: 47465289 stream: 3
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module DQMMessageLogger/'DQMMessageLogger'
   [3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [4] Calling method for module CTPPSProtonProducer/'ctppsProtons'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
----- End Fatal Exception -------------------------------------------------

from: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-10-31-2300/pyRelValMatrixLogs/run/136.8642_RunJetHT2018BHEfail+RunJetHT2018BHEfail+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Prompt_HEfail+HARVEST2018_HEfail/step3_RunJetHT2018BHEfail+RunJetHT2018BHEfail+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Prompt_HEfail+HARVEST2018_HEfail.log#/833

I have reported it in #35936
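
For reference, the message in the log above is the text that GCC's libstdc++ attaches to the std::out_of_range thrown by a bounds-checked std::vector::at() call, here with index 0 on an empty vector. The sketch below reproduces that kind of failure in isolation. The helper name checkedAccessError is made up for illustration, and nothing here identifies which vector inside CTPPSProtonProducer was actually empty.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the what() text of the exception thrown by a
// bounds-checked access, or an empty string if the access succeeds.
std::string checkedAccessError(const std::vector<double>& v, std::size_t i) {
  try {
    (void)v.at(i);  // at() validates i against size(); operator[] would not
  } catch (const std::out_of_range& e) {
    return e.what();  // exact wording depends on the standard library in use
  }
  return {};
}
```

Calling checkedAccessError on an empty vector with index 0 returns a non-empty message, while the same call on a one-element vector returns an empty string.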

francescobrivio · 2021-11-01T11:31:09Z

If the problem is a threading issue (e.g. a race condition), it would not show up in PR tests (unless multithreaded tests were enabled explicitly), but it would show up in IB tests.

Thanks @makortel for the suggestion, I believe @malbouis is testing this.
Then maybe it would be worth enabling multithreading by default in the PR tests as well, in order to avoid such issues in the future...

malbouis · 2021-11-01T11:32:50Z

As a comment, I'm not sure how useful it is, but I have run workflows 136.8311 and 136.796 with runTheMatrix on the latest IB and observed no crashes. I will try one of them with --nThreads > 1 to see if I'm able to reproduce the error.

makortel · 2021-11-01T11:41:32Z

Then maybe it would be worth enabling multithreading by default in the PR tests as well, in order to avoid such issues in the future...

Multi-threading in PR tests would effectively destroy bitwise reproducibility of simulated workflows, e.g. because different GEN events could be simulated in different EDM streams leading to different random number sequences between runs. So far race conditions have been rare enough to continue with the current setup.
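
The reproducibility argument can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how CMSSW seeds its engines: each stream owns one std::mt19937 seeded from its stream index, and each event draws a single number from the engine of whichever stream happens to process it. The function name simulate and the seed offset 1000 are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Toy model (assumption): one engine per stream, seeded by stream index.
// streamOfEvent[i] is the stream that processes event i; in a real
// multi-threaded job this assignment depends on scheduling and timing.
std::vector<std::uint32_t> simulate(const std::vector<int>& streamOfEvent, int nStreams) {
  std::vector<std::mt19937> engines;
  for (int s = 0; s < nStreams; ++s)
    engines.emplace_back(1000u + static_cast<unsigned>(s));

  std::vector<std::uint32_t> drawn;
  for (int s : streamOfEvent)
    drawn.push_back(static_cast<std::uint32_t>(engines[s]()));  // advances that stream's engine
  return drawn;
}
```

With a single stream the event-to-stream assignment is fixed, so two runs give identical numbers. With two streams, swapping which stream handles events 2 and 3 changes the numbers those events see, even though the seeds are unchanged.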

jan-kaspar · 2021-11-01T11:48:18Z

I have run runTheMatrix -l 136.796,136.816,136.8642,23234.103 --ibeos on LXPLUS and it yielded 4 4 3 3 tests passed, 0 0 1 0 failed. Thus I cannot reproduce many of the failures anymore. I continue debugging 136.8642, which is the failure I can reproduce (and also reported by Francesco above: #35928 (comment)).

malbouis · 2021-11-01T12:13:25Z

As a comment, I'm not sure how useful it is, but I have run workflows 136.8311 and 136.796 with runTheMatrix on the latest IB and observed no crashes. I will try one of them with --nThreads > 1 to see if I'm able to reproduce the error.

I guess testing the multithreaded option is not as straightforward as I thought.
I tried rerunning step 3 of wf 136.8311 while enabling --nThreads 2 but I get the following error:

%MSG-i ThreadStreamSetup: (NoModuleName) 01-Nov-2021 13:07:29 CET pre-events
setting # threads 2
setting # streams 2
%MSG
----- Begin Fatal Exception 01-Nov-2021 13:07:47 CET-----------------------
An exception of category 'ModulesSynchingOnLumis' occurred while
[0] Calling beginJob
Exception Message:
The framework is configured to use at least two streams, but the following modules
require synchronizing on LuminosityBlock boundaries:
QualityTester qTesterJet
QualityTester qTesterMET
DataCertificationJetMET dataCertificationJetMET
QualityTester muonSourcesQualityTests
MuonTrackResidualsTest muTrackResidualsTest
EfficiencyPlotter effPlotterLooseMiniAOD
EfficiencyPlotter effPlotterMediumMiniAOD
EfficiencyPlotter effPlotterTightMiniAOD
MuonRecoTest muRecoTest
QualityTester muonClientsQualityTests
MuonTestSummary muonTestSummary
TriggerMatchEfficiencyPlotter triggerMatchEffPlotterTightMiniAOD
PFJetDQMPostProcessor pfJetDQMPostProcessor
OffsetDQMPostProcessor offsetDQMPostProcessor

The situation can be fixed by either

modifying the modules to support concurrent LuminosityBlocks (preferred), or
setting 'process.options.numberOfConcurrentLuminosityBlocks = 1' in the configuration file
----- End Fatal Exception -------------------------------------------------

qliphy · 2021-11-01T12:26:46Z

I guess testing the multithreaded option is not as straightforward as I thought. I tried rerunning step 3 of wf 136.8311 while enabling --nThreads 2 but I get the following error:

Instead of just running step3 of wf 136.8311, I tried to run locally with 4 threads:
runTheMatrix.py -l 136.8311 --job-reports -t 4 --ibeos
under both CMSSW_12_2_X_2021-11-01-1100 and CMSSW_12_2_X_2021-10-31-2300
and both work well without any issue.

Dr15Jones · 2021-11-01T12:37:06Z

From a very quick look at the PPS code, I found

cmssw/CondFormats/PPSObjects/interface/PPSAssociationCuts.h

Lines 55 to 56 in 6681a5f

    
    mutable std::vector<std::shared_ptr<TF1> > f_means_ COND_TRANSIENT;
    mutable std::vector<std::shared_ptr<TF1> > f_thresholds_ COND_TRANSIENT;

where a mutable is likely to be the cause of thread-safety problems.

Dr15Jones · 2021-11-01T12:40:08Z

Indeed, it looks like those vectors are filled from a const function

cmssw/CondFormats/PPSObjects/src/PPSAssociationCuts.cc

Lines 60 to 63 in 6681a5f

    
    // build functions if not already done
    // (this may happen if data (string representation) are loaded from DB and the constructor is not executed)
    if (f_means_.size() < s_means_.size())
      buildFunctions();

This is not thread-safe.
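
A reduced illustration of why this pattern races, and of one way to avoid it. Everything in the sketch is a stand-in and none of it is the actual CMSSW code: the classes LazyCuts and EagerCuts are hypothetical, std::function replaces TF1 so that no ROOT dependency is needed, and the "formula" is just a slope parsed from a string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Func = std::function<double(double)>;  // stand-in for TF1

// Stand-in for compiling a formula string: the string holds a slope k.
inline Func compile(const std::string& slope) {
  const double k = std::stod(slope);
  return [k](double x) { return k * x; };
}

// Problematic shape: a const accessor lazily fills a mutable cache.
class LazyCuts {
public:
  explicit LazyCuts(std::vector<std::string> s) : s_means_(std::move(s)) {}

  double mean(std::size_t i, double x) const {
    // Two threads can both see the cache as incomplete and both call
    // build(), or one can read f_means_ while the other is still growing it.
    if (f_means_.size() < s_means_.size())
      build();
    return f_means_.at(i)(x);
  }

private:
  void build() const {
    for (std::size_t i = f_means_.size(); i < s_means_.size(); ++i)
      f_means_.push_back(compile(s_means_[i]));
  }
  std::vector<std::string> s_means_;
  mutable std::vector<Func> f_means_;  // unsynchronized shared state
};

// Safer shape: all mutation happens in a non-const step that runs once,
// before the object is shared; const accessors only read.
class EagerCuts {
public:
  explicit EagerCuts(std::vector<std::string> s) : s_means_(std::move(s)) {}

  void initialize() {
    f_means_.clear();
    for (const auto& s : s_means_)
      f_means_.push_back(compile(s));
  }

  double mean(std::size_t i, double x) const { return f_means_.at(i)(x); }

private:
  std::vector<std::string> s_means_;
  std::vector<Func> f_means_;  // no mutable needed
};
```

Both classes give the same answers when used from a single thread, which is consistent with single-threaded tests passing. Only the second can be shared between threads through a const reference without synchronization, provided initialize() has finished before the object is shared.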

Dr15Jones · 2021-11-01T12:43:31Z

NOTE: the DB does have a mechanism to modify the object right after it is read from the DB but before it is put into the EventSetup. That allows one to avoid using mutables and provides a thread-safe way to update objects coming out of storage.

jan-kaspar · 2021-11-01T12:44:42Z

Thanks @Dr15Jones ! Could you please give me a pointer to this mechanism? I can open a fix PR shortly then.

Dr15Jones · 2021-11-01T12:46:11Z

@jan-kaspar I can try to find it but this is really the domain for @cms-sw/db-l2

jan-kaspar · 2021-11-01T12:47:35Z

Thanks @Dr15Jones ! Anyone's help appreciated. I will try googling in the meantime.

Dr15Jones · 2021-11-01T12:55:36Z

So it looks like you must call REGISTER_PLUGIN_INIT when registering your new EventSetup data product object and pass in an 'Initializer' class which will update the object before handing it to the EventSetup for access. I found examples at

cmssw/CondCore/SiPixelPlugins/plugins/plugin.cc

Line 58 in 6d2f660

    
           REGISTER_PLUGIN_INIT(SiPixelGainCalibrationRcd, SiPixelGainCalibration, InitGains<SiPixelGainCalibration>);

with

cmssw/CondCore/SiPixelPlugins/plugins/plugin.cc

Lines 51 to 54 in 6d2f660

    
    template <typename G>
    struct InitGains {
      void operator()(G& g) { g.initialize(); }
    };
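
The idea behind such an initializer can be sketched without the CMSSW plugin machinery. In the mock below, loadAndPublish is a hypothetical function playing the role of the framework code that reads an object from storage and hands it out. Payload and InitPayload are likewise invented for illustration, and the "transient" data is simply the length of each stored string. This is not how REGISTER_PLUGIN_INIT is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payload: 'formulas' is what gets persisted, 'lengths' is
// derived data that has to be rebuilt after every load.
struct Payload {
  std::vector<std::string> formulas;
  std::vector<std::size_t> lengths;

  void initialize() {
    lengths.clear();
    for (const auto& f : formulas)
      lengths.push_back(f.size());
  }
};

// Same shape as the InitGains functor quoted above.
template <typename T>
struct InitPayload {
  void operator()(T& p) const { p.initialize(); }
};

// Hypothetical stand-in for the load step: the initializer runs while the
// object is still private to the loader, then only a const view escapes.
template <typename T, typename Initializer>
std::shared_ptr<const T> loadAndPublish(T raw, Initializer init) {
  auto obj = std::make_shared<T>(std::move(raw));
  init(*obj);  // the only point where the object is mutated
  return obj;  // converts to shared_ptr<const T>
}
```

Because consumers only ever see a shared_ptr<const Payload>, the derived data needs neither mutable members nor lazy building in const methods.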

jan-kaspar · 2021-11-01T12:56:35Z

Thanks again, I will give it a try!

Dr15Jones · 2021-11-01T12:57:30Z

In general, one should never use mutable data for data products for the Run, LuminosityBlock, Event or the EventSetup.

jan-kaspar · 2021-11-01T13:01:12Z

In general, one should never use mutable data for data products for the Run, LuminosityBlock, Event or the EventSetup.

Got it! @grzanka @fabferro It would be good to point this out in the next PPS SW meeting (so that we avoid making this mistake again).

jan-kaspar · 2021-11-01T14:04:20Z

Hopefully, here's a fix: #35941

qliphy · 2021-11-04T00:49:47Z

After merging #35941, the new IB tests look good.

jfernan2 · 2021-11-04T08:35:43Z

+1
For the record

cmsbuild added the pending-assignment label Nov 1, 2021

cmsbuild added alca-pending db-pending dqm-pending pending-signatures and removed pending-assignment labels Nov 1, 2021

jan-kaspar mentioned this issue Nov 1, 2021

PPS: association cut fix #35941

Merged

qliphy closed this as completed Nov 4, 2021

cmsbuild added dqm-approved and removed dqm-pending labels Nov 4, 2021

francescobrivio mentioned this issue Nov 4, 2021

Range check issue in CTPPS breaking IB #35936

Closed
