Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQMEDHarvesters seg faulting in nightly IB #22281

Closed
Dr15Jones opened this issue Feb 21, 2018 · 70 comments
Closed

DQMEDHarvesters seg faulting in nightly IB #22281

Dr15Jones opened this issue Feb 21, 2018 · 70 comments

Comments

@Dr15Jones
Copy link
Contributor

We have two RelVals failing with segmentation faults in DQMEDHarvesters
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1004.0_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI/step6_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI.log

and

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step8_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log

with both having the traceback

#4  <signal handler called>
#5  0x00007f7249c59d30 in MonitorElement::getTH2S() const () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libDQMServicesCore.so
#6  0x00007f722fb39267 in SiStripGainsPCLHarvester::gainQualityMonitor(DQMStore::IBooker&, MonitorElement const*) const () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libCalibTrackerSiStripChannelGain.so
#7  0x00007f722fb3d289 in SiStripGainsPCLHarvester::dqmEndJob(DQMStore::IBooker&, DQMStore::IGetter&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libCalibTrackerSiStripChannelGain.so
#8  0x00007f7249c5d7a6 in DQMEDHarvester::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libDQMServicesCore.so
#9  0x00007f724ec92e75 in edm::Worker::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#10 0x00007f724ec53670 in edm::WorkerManager::endJob(edm::ExceptionCollector&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#11 0x00007f724ebfb6a8 in edm::Schedule::endJob(edm::ExceptionCollector&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#12 0x00007f724ec5f1c9 in edm::EventProcessor::endJob() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc630/cms/cmssw/CMSSW_10_1_X_2018-02-20-2300/lib/slc6_amd64_gcc630/libFWCoreFramework.so
@cmsbuild
Copy link
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@kmaeshima,@vanbesien,@jfernan2,@vazzolini,@dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor Author

It is likely this problem comes from #22218

@Dr15Jones
Copy link
Contributor Author

@fwyzard FYI

@Dr15Jones
Copy link
Contributor Author

The logs of the job do have error/warning messages from the module

%MSG-e SiStripGainsPCLHarvester:   SiStripGainsPCLHarvester:alcaSiStripGainsHarvester@endJob  21-Feb-2018 08:39:02 CET PostGlobalEndRun
Harvesting: could not retrieve AlCaReco/SiStripGains/Charge_Vs_Index_StdBunch, statistics will not be summed!
%MSG
%MSG-e SiStripGainsPCLHarvester:   SiStripGainsPCLHarvester:alcaSiStripGainsHarvester@endJob  21-Feb-2018 08:39:02 CET PostGlobalEndRun
Harvesting: could not execute algoComputeMPVandGain method because StdBunch statistics cannot be retrieved.
Please check if input contains StdBunch data.
%MSG
%MSG-e SiStripGainsPCLHarvester:   SiStripGainsPCLHarvester:alcaSiStripGainsHarvester@endJob  21-Feb-2018 08:39:02 CET PostGlobalEndRun
produceTagFilter -> Return false: could not retrieve the StdBunch statistics.
Please check if input contains StdBunch data.
%MSG
%MSG-w SiStripGainsPCLHarvester:   SiStripGainsPCLHarvester:alcaSiStripGainsHarvester@endJob  21-Feb-2018 08:39:02 CET PostGlobalEndRun
getNewObject -> will not produce a paylaod because produceTagFilter returned false 
%MSG

@fwyzard
Copy link
Contributor

fwyzard commented Feb 21, 2018

My guess is that SiStripGainFromCalibTree needs to be migrated from legacy to DQMEDAnalyzer.

@dmitrijus
Copy link
Contributor

The bigger question is, why does alca harvesting run in multi thread?
This is for sure not intended.

@Dr15Jones
Copy link
Contributor Author

Given there are no events processed in the job and the failure happens at endJob, the threads are just being used for concurrent module running. Such concurrency shouldn't matter since if you do not have a data dependency between modules which do depend on one another, the framework will run them in an arbitrary order even in the single threaded case.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 21, 2018

given the comments at the beginning of the .h file

//  Original Author: L. Quertermont (calibration algorithm)`
//  Contributors:    M. Verzetti    (data access)`
//                   A. Di Mattia   (PCL multi stream processing and monitoring)`
//                   M. Delcourt    (monitoring)`
//                   M. Musich      (migration to thread-safe DQMStore access)`

I would say that it is definitely intended

@cerminar
Copy link
Contributor

@dmitrijus Indeed this is not intentional. The step of these workflows are defined in:
https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_steps.py#L1789 and following lines. No mention there of the 4 threads

@cerminar
Copy link
Contributor

Here the culprit (wild guess, I didn't try)
https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/WorkFlowRunner.py#L156

ALCAHARVEST needs to be added to the exception I guess

@fwyzard
Copy link
Contributor

fwyzard commented Feb 21, 2018

However, the problem is likely not in the harvesting step, but in the module that produces the DQM plots in the first place - which is not using the thread-safe interface to the DQMStore.

@dmitrijus
Copy link
Contributor

@Dr15Jones The issue is cause by regular legacy framework EDAnalyzer modules, in a sequence.

We asked AlCa to migrate them to DQMEDHarvesters, that would be most ideal.
Last time the answer was "too much work and not necessary, as modules never run in multithread."
Therefore, this is why I made the comment, why do these modules run multithread?

However, that opens up another possible issue: even when running single thread, these modules must remain properly scheduled. I imagine this is the case for EDAnalyzer modules, it may be the case for DQMEDHarvesters too....

@schneiml
Copy link
Contributor

Judging by [1], there is sth. horribly wrong here. @cerminar 's guess sounds very plausible, though I'd like to know the entire runTheMatrix commandline used here, there a lots of interesting options.

At the moment, I simply can't reproduce the issue.

[1] https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/cmdLog

@schneiml
Copy link
Contributor

The current theory is that the edm::one migration uncovered a relval config issue, that accidentally worked before. What fails it interesting for us, but not really relevant, since nobody ever expected anthing like that to work.

@Dr15Jones
Copy link
Contributor Author

If you go to the page,
https://cms-sw.github.io/relvalLogDetail.html#slc6_amd64_gcc630;CMSSW_10_1_X_2018-02-20-2300
you can click on the left had side cmd to see the actual commands. In this case

cmsDriver.py step5 --data --conditions auto:run1_data -s ALCAHARVEST:SiStripQuality --scenario HeavyIons --filein file:PromptCalibProdSiStrip.root --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --prefix '/build/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 7200 ' -n 100 --fileout file:step5.root --suffix "-j JobReport5.xml " --nThreads 4 > step5_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI.log 2>&1

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 21, 2018 via email

@schneiml
Copy link
Contributor

@Dr15Jones though this is already after the bug happend. --nThreads 4 should never make it into the command line.

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 21, 2018 via email

@schneiml
Copy link
Contributor

@davidlange6 We gon't guarantee anything for unmigrated, DQM legacy modules running multi threaded ("unmigrated" relative to the threaded migration from many years ago).

ALCA uses many legacy modules.

@dmitrijus
Copy link
Contributor

For a test, #22285

@schneiml
Copy link
Contributor

@davidlange6 I see the commandlines that came out of runTheMatrix, which are already broken -- we suspect the bug within runTheMatrix.

@fwyzard regarding the comments: If this workflow/modules are supposed to work multi-threaded or not is an interesting detail for us. However, as I confirmed with @cerminar, this should not run multi-threaded, and the reason for that are some legacy plugins in ALCA.

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 21, 2018 via email

@schneiml
Copy link
Contributor

Off topic: Running WF like 1001 locally (lxplus/cmsdev) fails (w/o error message!), since DAS returns no files. Removing site=T2_CH_CERN fixes that.

Is there a fool-proof way of running these locally, without patching Configuration/PyReleaseValidation/python/MatrixUtil.py?

I can reproduce the issue now. #22285 seems not to fix it, even though step8 now runs single threaded. Running 1001 without -t 4 option seems to work. This is getting more interesting.

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 21, 2018 via email

@schneiml
Copy link
Contributor

@Dr15Jones sounds useful, I'll give it a try.

@schneiml
Copy link
Contributor

Ah, and since it might not be obvious to the outsiders -- most of the issues outlined above are completely unrelated to #22218, it just happend to make things fail that accidentally worked in the past. And these things were probably only used in the relvals, not Tier0 production.

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 22, 2018 via email

@schneiml
Copy link
Contributor

schneiml commented Feb 22, 2018

@davidlange6 I am pretty sure it returns an empty result on cmsdev machines.

$ runTheMatrix.py -l 1001        
...
0 0 0 0 0 0 0 0 tests passed, 1 0 0 0 0 0 0 0 failed
$ cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/
$ cat cmdLog 

# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
 echo '{
"165121":[[1,268435455]]
}' > step1_lumiRanges.log  2>&1
 
# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
 (dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN') | sort -u > step1_dasquery.log  2>&1

$ dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN'
$ # no output

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 22, 2018 via email

@schneiml
Copy link
Contributor

schneiml commented Feb 22, 2018

Then, why does runTheMatrix try to load it from there? This is the issue I'd like to see fixed...

(or maybe the data needs to be fixed, but there is clearly an issue...)

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 22, 2018 via email

@schneiml
Copy link
Contributor

Ah, great, so I am not the only one confused by this issue and people work on it. Then --ibeos is the workaround for now. Thanks for the pointer!

@schneiml
Copy link
Contributor

@Dr15Jones the Tracer service seems like a really useful tool, saves a ton of manual debug output. However, it actually appears the the MEtoEDMConverters have the same order (-t 4 or not) and only the MEtoEDMConvertSiStripGainsAAG seems to do sth. either way. Which is confusing, since AAG is the content missing, causing the crash in MT.

Also, I don't see SiStripGainsPCLWorker anywhere running (neither ST nor MT), which according to @mmusich should produce the missing plot. Maybe I am just stupid now and looking at the wrong thing, but this is confusing. We'll have a closer look tomorrow.

@mmusich
Copy link
Contributor

mmusich commented Feb 22, 2018

Also, I don't see SiStripGainsPCLWorker anywhere running (neither ST nor MT), which according to @mmusich should produce the missing plot. Maybe I am just stupid now and looking at the wrong thing, but this is confusing. We'll have a closer look tomorrow.

well, @schneiml, I don't know how you configured your local setup, but SiStripGainsPCLWorker is most certainly designed to be run in step3 (which by the way appears to be the case in the relval IB, see eg. https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-22-1100/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step3_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log

and look for:

Begin processing the 8th record. Run 165121, Event 23697286, LumiSection 62 on stream 0 at 22-Feb-2018 18:34:53.285 CET
%MSG-w MemoryCheck:  SiStripGainsPCLWorker:ALCARECOSiStripCalibAAG  22-Feb-2018 18:34:53 CET Run: 165121 Event: 23673454
MemoryCheck: module SiStripGainsPCLWorker:ALCARECOSiStripCalibAAG VSIZE 1395.19 0 RSS 999.637 0.0195312

Anyway you don't have to take my word for it :)
Just go ahead and dump the python configuration of step3, you will see that:

process.ALCARECOSiStripCalibAAG = cms.EDAnalyzer("SiStripGainsPCLWorker", ....

enters in

process.seqALCARECOPromptCalibProdSiStripGainsAAG

that in turn enters in

process.pathALCARECOPromptCalibProdSiStripGainsAAG

that finally enters in the schedule.

@mmusich
Copy link
Contributor

mmusich commented Feb 22, 2018

By the way, looking into the tracer output of step3 of runTheMatrix.py -l 1001.0 -t 4 --ibeos I see it live and well:

$ more tracer.out | grep ALCARECOSiStripCalibAAG
++++ starting: constructing module with label 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: constructing module with label 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: begin job for module with label 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: begin job for module with label 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: begin stream for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: begin stream for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: begin stream for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: begin stream for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: begin stream for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: begin stream for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: begin stream for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: begin stream for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ starting: global begin run for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ finished: global begin run for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ starting: global begin lumi for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ finished: global begin lumi for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ starting: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++++ finished: processing event for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ starting: global end lumi for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ finished: global end lumi for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ starting: global end run for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++++ finished: global end run for module: label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: end stream for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: end stream for module: stream = 0 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: end stream for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: end stream for module: stream = 1 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: end stream for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: end stream for module: stream = 2 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: end stream for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: end stream for module: stream = 3 label = 'ALCARECOSiStripCalibAAG' id = 34
++++ starting: end job for module with label 'ALCARECOSiStripCalibAAG' id = 34
++++ finished: end job for module with label 'ALCARECOSiStripCalibAAG' id = 34

@dmitrijus
Copy link
Contributor

After a night of debugging, I can certainly say what the bug is:

getAllContents() is currently broken. It only return MonitorElements from a single module (since they are indexed by module id). Or rather, the module with the lower moduleId.
https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L1887

The "workaround" is to copy ME at endRun, as it was done before, 52c0ef9

However, there are several other issues we will have to address....

@schneiml
Copy link
Contributor

@mmusich I was looking for sth. labeled SiStripGainsPCLWorker, since that is also what I saw as a name in some config file. Nervermind, then.SiStripGainsPCLWorker.

@dmitrijus it seems that fixing getAllContents logic might make more sense. It looks like that could be done by properly removing the enableMultiThread stuff. I'll have to figure out what the intended semantics are.

@Dr15Jones
Copy link
Contributor Author

getAllContents() is currently broken. It only return MonitorElements from a single module (since they are indexed by module id). Or rather, the module with the lower moduleId.

In this particular case, what values for the arguments runNumber and lumi are passed to getAllContents() ?

@schneiml
Copy link
Contributor

@Dr15Jones note also that #22218 did confuse moduleId with streamId in some places. This did not cause harm so far, but could explain some weird behaviour.

@fabiocos
Copy link
Contributor

Is there any update?

@fwyzard I understand that a fix should be applied to the MonitorElement usage (according to your latest comment in #22218)

@fwyzard
Copy link
Contributor

fwyzard commented Feb 26, 2018

I just finished checking, and I don't see any inconsistency introduced by #22218; if @schneiml can point them to me I can try to have a fix.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 27, 2018

@schneiml pointed out that I missed the changes to the MonitorElement constructor itself.
That should be fixed by #22364 .

@dmitrijus
Copy link
Contributor

+1

This is fixed now.

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 6, 2018

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants