DQMEDHarvesters seg faulting in nightly IB #22281
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm. @kmaeshima, @vanbesien, @jfernan2, @vazzolini, @dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks
It is likely this problem comes from #22218
@fwyzard FYI
The logs of the job do have error/warning messages from the module …
My guess is that …
The bigger question is: why does the AlCa harvesting run multi-threaded?
Given there are no events processed in the job and the failure happens at endJob, the threads are just being used for concurrent module running. Such concurrency shouldn't matter, since if you do not declare a data dependency between modules which do depend on one another, the framework will run them in an arbitrary order even in the single-threaded case.
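For context, the thread count is steered from the job configuration (cmsDriver's --nThreads option ends up setting roughly this); a minimal sketch, assuming a CMSSW environment:

import FWCore.ParameterSet.Config as cms

process = cms.Process("HARVESTING")

# With more than one thread, the framework may run independent modules'
# transitions (e.g. endRun/endJob) concurrently, even in a job that
# processes zero events.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(4)
)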
Given the comments at the beginning of the .h file, I would say that it is definitely intended.
@dmitrijus Indeed this is not intentional. The steps of these workflows are defined in: …
Here is the culprit (wild guess, I didn't try): ALCAHARVEST needs to be added to the exception, I guess.
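To illustrate what "adding ALCAHARVEST to the exception" could look like, here is a hypothetical sketch -- the names and structure are illustrative only, not the actual runTheMatrix code:

# Hypothetical illustration only -- not the real runTheMatrix code.
# Harvesting-like steps should not receive the multi-threading flags
# that the other steps get.
HARVESTING_STEPS = {'HARVESTING', 'ALCAHARVEST'}  # ALCAHARVEST added to the exception

def threading_args(step_type, n_threads=4):
    # Return the cmsDriver threading flag for a step, except for
    # harvesting steps, which have to stay single-threaded.
    if step_type in HARVESTING_STEPS:
        return ''
    return '--nThreads %d' % n_threads

print(threading_args('RECO'))         # --nThreads 4
print(threading_args('ALCAHARVEST'))  # (empty: single-threaded)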
However, the problem is likely not in the harvesting step, but in the module that produces the DQM plots in the first place, which is not using the thread-safe interface to the DQMStore.
@Dr15Jones The issue is caused by regular legacy framework EDAnalyzer modules in a sequence. We asked AlCa to migrate them to DQMEDHarvesters; that would be most ideal. However, that opens up another possible issue: even when running single-threaded, these modules must remain properly scheduled. I imagine this is the case for EDAnalyzer modules; it may be the case for DQMEDHarvesters too...
Judging by [1], there is something horribly wrong here. @cerminar's guess sounds very plausible, though I'd like to know the entire runTheMatrix command line used here; there are lots of interesting options. At the moment, I simply can't reproduce the issue.
The current theory is that the edm::one migration uncovered a relval config issue that accidentally worked before. What fails is interesting for us, but not really relevant, since nobody ever expected anything like that to work.
If you go to the page, …
I fail to see why this should not work, let alone be "expected" not to work.
… On Feb 21, 2018, at 5:05 PM, Marcel Schneider wrote:
The current theory is that the edm::one migration uncovered a relval config issue that accidentally worked before. What fails is interesting for us, but not really relevant, since nobody ever expected anything like that to work.
@Dr15Jones though this is already after the bug happened.
You can find it here:
https://cms-sw.github.io/relvalLogDetail.html#slc7_amd64_gcc630;CMSSW_10_1_X_2018-02-21-1100
… On Feb 21, 2018, at 5:02 PM, Marcel Schneider wrote:
Judging by [1], there is something horribly wrong here. @cerminar's guess sounds very plausible, though I'd like to know the entire runTheMatrix command line used here; there are lots of interesting options.
At the moment, I simply can't reproduce the issue.
[1] https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/cmdLog
@davidlange6 We don't guarantee anything for unmigrated, DQM legacy modules running multi-threaded ("unmigrated" relative to the threaded migration from many years ago). ALCA uses many legacy modules.
For a test, #22285
@davidlange6 I see the command lines that came out of runTheMatrix, which are already broken -- we suspect the bug is within runTheMatrix. @fwyzard regarding the comments: whether this workflow/these modules are supposed to work multi-threaded or not is an interesting detail for us. However, as I confirmed with @cerminar, this should not run multi-threaded, and the reason for that is some legacy plugins in ALCA.
Right, as of last night. Presumably you are now in the process of getting them migrated...
… On Feb 21, 2018, at 5:11 PM, Marcel Schneider wrote:
@davidlange6 We don't guarantee anything for unmigrated, DQM legacy modules running multi-threaded ("unmigrated" relative to the threaded migration from many years ago).
ALCA uses many legacy modules.
Off topic: running a WF like 1001 locally (lxplus/cmsdev) fails (w/o an error message!), since DAS returns no files. Removing … Is there a fool-proof way of running these locally, without patching …?
I can reproduce the issue now. #22285 seems not to fix it, even though step8 now runs single-threaded. Running 1001 without …
Right - I was answering the question of what was actually run...
Legacy plugins are not a reason why something does not run threaded... [maybe not efficiently]
… On Feb 21, 2018, at 5:21 PM, Marcel Schneider wrote:
@davidlange6 I see the command lines that came out of runTheMatrix, which are already broken -- we suspect the bug is within runTheMatrix.
@fwyzard regarding the comments: whether this workflow/these modules are supposed to work multi-threaded or not is an interesting detail for us. However, as I confirmed with @cerminar, this should not run multi-threaded, and the reason for that is some legacy plugins in ALCA.
@Dr15Jones sounds useful, I'll give it a try.
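For readers following along: the Tracer service referred to here is enabled with a one-line addition to the job configuration. A minimal sketch, assuming a CMSSW environment:

import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

# The Tracer service logs every framework transition (module
# construction, begin/end run, event processing, endJob, ...), which
# makes the actual module execution order visible.
process.Tracer = cms.Service("Tracer")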
Ah, and since it might not be obvious to outsiders -- most of the issues outlined above are completely unrelated to #22218; it just happened to make things fail that accidentally worked in the past. And these things were probably only used in the relvals, not Tier0 production.
the das query is working just fine...
…
Protip: use the --ibeos option with runTheMatrix to run WF 1001 locally; the DAS query it uses is broken. Maybe we should open an issue for that.
@davidlange6 I am pretty sure it returns an empty result on cmsdev machines.
which is the correct result given the query... what is "broken" is elsewhere - in this case, data is not available on T2_CH_CERN.
… On Feb 22, 2018, at 6:50 PM, Marcel Schneider wrote:
@davidlange6 I am pretty sure it returns an empty result on cmsdev machines.
$ runTheMatrix.py -l 1001
...
0 0 0 0 0 0 0 0 tests passed, 1 0 0 0 0 0 0 0 failed
$ cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/
$ cat cmdLog
# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
echo '{
"165121":[[1,268435455]]
}' > step1_lumiRanges.log 2>&1
# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
(dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN') | sort -u > step1_dasquery.log 2>&1
$ dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN'
$ # no output
Then, why does runTheMatrix try to load it from there? This is the issue I'd like to see fixed... (or maybe the data needs to be fixed, but there is clearly an issue...)
there is an on-going GitHub issue on the topic #22278 (not that it will necessarily fix this issue)
… On Feb 22, 2018, at 6:54 PM, Marcel Schneider wrote:
Then, why does runTheMatrix try to load it from there? This is the issue I'd like to see fixed...
Ah, great, so I am not the only one confused by this issue, and people are working on it. Then …
@Dr15Jones the Tracer service seems like a really useful tool; it saves a ton of manual debug output. However, it actually appears that the … Also, I don't see …
Well, @schneiml, I don't know how you configured your local setup, but … and look for:
Anyway, you don't have to take my word for it :) process.ALCARECOSiStripCalibAAG = cms.EDAnalyzer("SiStripGainsPCLWorker", ...) enters process.seqALCARECOPromptCalibProdSiStripGainsAAG, which in turn enters process.pathALCARECOPromptCalibProdSiStripGainsAAG, which finally enters the …
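A minimal sketch of the nesting described here, with the module parameters elided (the configuration names are taken from the comment above):

import FWCore.ParameterSet.Config as cms

process = cms.Process("ALCA")

# The worker module (its parameters are elided here)...
process.ALCARECOSiStripCalibAAG = cms.EDAnalyzer("SiStripGainsPCLWorker")

# ...enters a sequence...
process.seqALCARECOPromptCalibProdSiStripGainsAAG = cms.Sequence(
    process.ALCARECOSiStripCalibAAG
)

# ...which in turn enters a path, which would then end up on the schedule.
process.pathALCARECOPromptCalibProdSiStripGainsAAG = cms.Path(
    process.seqALCARECOPromptCalibProdSiStripGainsAAG
)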
By the way, looking into the tracer output of step3 of …
After a night of debugging, I can certainly say what the bug is: getAllContents() is currently broken. It only returns MonitorElements from a single module (since they are indexed by module id) -- or rather, from the module with the lower moduleId. The "workaround" is to copy MEs at endRun, as it was done before (52c0ef9). However, there are several other issues we will have to address...
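A toy model of the failure mode described above (plain Python, not the actual DQMStore code): if MonitorElements are stored keyed by the booking module's id and a global lookup only consults one key, everything booked by the other modules is silently dropped.

# Toy model of the getAllContents() bug -- not the real DQMStore.
# MonitorElements, indexed by the id of the module that booked them:
store = {
    41: ['path/meA', 'path/meB'],  # booked by module 41
    57: ['path/meC'],              # booked by module 57
}

def get_all_contents_broken(store):
    # Broken: only consults the module with the lower id.
    return store[min(store)]

def get_all_contents_fixed(store):
    # What callers expect: the MEs from every module.
    return [me for mes in store.values() for me in mes]

print(get_all_contents_broken(store))  # ['path/meA', 'path/meB'] -- meC is lost
print(get_all_contents_fixed(store))   # all three MEs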
@mmusich I was looking for something labeled …
@dmitrijus it seems that fixing …
In this particular case, what values for the arguments …
@Dr15Jones note also that #22218 did confuse …
+1
This is fixed now.
This issue is fully signed and ready to be closed. |
We have two RelVals failing with segmentation faults in DQMEDHarvesters
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1004.0_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI/step6_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI.log
and
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step8_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log
with both having the traceback: …