DQMEDHarvesters seg faulting in nightly IB #22281
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm. @kmaeshima, @vanbesien, @jfernan2, @vazzolini, @dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks
It is likely this problem comes from #22218
@fwyzard FYI
The logs of the job do have error/warning messages from the module …
My guess is that …
The bigger question is: why does the AlCa harvesting run multi-threaded?
Given there are no events processed in the job and the failure happens at endJob, the threads are just being used for concurrent module running. Such concurrency shouldn't matter, since if you do not declare a data dependency between modules which do depend on one another, the framework will run them in an arbitrary order even in the single-threaded case.
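For context, the thread count is steered from the job configuration (cmsDriver's --nThreads option ends up setting roughly this); a minimal sketch, assuming a CMSSW environment:

import FWCore.ParameterSet.Config as cms

process = cms.Process("HARVESTING")

# With more than one thread, the framework may run independent modules'
# transitions (e.g. endRun/endJob) concurrently, even in a job that
# processes zero events.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(4)
)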
Given the comments at the beginning of the .h file, I would say that it is definitely intended.
@dmitrijus Indeed this is not intentional. The steps of these workflows are defined in: …
Here is the culprit (wild guess, I didn't try): ALCAHARVEST needs to be added to the exception, I guess.
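To illustrate what "adding ALCAHARVEST to the exception" could look like, here is a hypothetical sketch -- the names and structure are illustrative only, not the actual runTheMatrix code:

# Hypothetical illustration only -- not the real runTheMatrix code.
# Harvesting-like steps should not receive the multi-threading flags
# that the other steps get.
HARVESTING_STEPS = {'HARVESTING', 'ALCAHARVEST'}  # ALCAHARVEST added to the exception

def threading_args(step_type, n_threads=4):
    # Return the cmsDriver threading flag for a step, except for
    # harvesting steps, which have to stay single-threaded.
    if step_type in HARVESTING_STEPS:
        return ''
    return '--nThreads %d' % n_threads

print(threading_args('RECO'))         # --nThreads 4
print(threading_args('ALCAHARVEST'))  # (empty: single-threaded)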
However, the problem is likely not in the harvesting step, but in the module that produces the DQM plots in the first place, which is not using the thread-safe interface to the DQMStore.
@Dr15Jones The issue is caused by regular legacy framework EDAnalyzer modules in a sequence. We asked AlCa to migrate them to DQMEDHarvesters; that would be most ideal. However, that opens up another possible issue: even when running single-threaded, these modules must remain properly scheduled. I imagine this is the case for EDAnalyzer modules; it may be the case for DQMEDHarvesters too...
Judging by [1], there is something horribly wrong here. @cerminar's guess sounds very plausible, though I'd like to know the entire runTheMatrix command line used here; there are lots of interesting options. At the moment, I simply can't reproduce the issue.
The current theory is that the edm::one migration uncovered a relval config issue that accidentally worked before. What fails is interesting for us, but not really relevant, since nobody ever expected anything like that to work.
If you go to the page, …
I fail to see why this should not work, let alone be "expected" not to work.
… On Feb 21, 2018, at 5:05 PM, Marcel Schneider wrote:
The current theory is that the edm::one migration uncovered a relval config issue that accidentally worked before. What fails is interesting for us, but not really relevant, since nobody ever expected anything like that to work.
@Dr15Jones though this is already after the bug happened.
You can find it here:
https://cms-sw.github.io/relvalLogDetail.html#slc7_amd64_gcc630;CMSSW_10_1_X_2018-02-21-1100
… On Feb 21, 2018, at 5:02 PM, Marcel Schneider wrote:
Judging by [1], there is something horribly wrong here. @cerminar's guess sounds very plausible, though I'd like to know the entire runTheMatrix command line used here; there are lots of interesting options.
At the moment, I simply can't reproduce the issue.
[1] https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/cmdLog
@davidlange6 We don't guarantee anything for unmigrated, DQM legacy modules running multi-threaded ("unmigrated" relative to the threaded migration from many years ago). ALCA uses many legacy modules.
For a test, #22285
@davidlange6 I see the command lines that came out of runTheMatrix, which are already broken -- we suspect the bug is within runTheMatrix. @fwyzard regarding the comments: whether this workflow/these modules are supposed to work multi-threaded or not is an interesting detail for us. However, as I confirmed with @cerminar, this should not run multi-threaded, and the reason for that is some legacy plugins in ALCA.
Right, as of last night. Presumably you are now in the process of getting them migrated...
… On Feb 21, 2018, at 5:11 PM, Marcel Schneider wrote:
@davidlange6 We don't guarantee anything for unmigrated, DQM legacy modules running multi-threaded ("unmigrated" relative to the threaded migration from many years ago).
ALCA uses many legacy modules.
Off topic: running a WF like 1001 locally (lxplus/cmsdev) fails (w/o an error message!), since DAS returns no files. Removing … Is there a fool-proof way of running these locally, without patching …?
I can reproduce the issue now. #22285 seems not to fix it, even though step8 now runs single-threaded. Running 1001 without …
Right - I was answering the question of what was actually run...
Legacy plugins are not a reason why something does not run threaded... [maybe not efficiently]
… On Feb 21, 2018, at 5:21 PM, Marcel Schneider wrote:
@davidlange6 I see the command lines that came out of runTheMatrix, which are already broken -- we suspect the bug is within runTheMatrix.
@fwyzard regarding the comments: whether this workflow/these modules are supposed to work multi-threaded or not is an interesting detail for us. However, as I confirmed with @cerminar, this should not run multi-threaded, and the reason for that is some legacy plugins in ALCA.
@Dr15Jones sounds useful, I'll give it a try.
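For readers following along: the Tracer service referred to here is enabled with a one-line addition to the job configuration. A minimal sketch, assuming a CMSSW environment:

import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

# The Tracer service logs every framework transition (module
# construction, begin/end run, event processing, endJob, ...), which
# makes the actual module execution order visible.
process.Tracer = cms.Service("Tracer")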
Ah, and since it might not be obvious to outsiders -- most of the issues outlined above are completely unrelated to #22218; it just happened to make things fail that accidentally worked in the past. And these things were probably only used in the relvals, not Tier0 production.
the das query is working just fine...
…
Protip: use the --ibeos option with runTheMatrix to run WF 1001 locally; the DAS query it uses is broken. Maybe we should open an issue for that.
@davidlange6 I am pretty sure it returns an empty result on cmsdev machines.
which is the correct result given the query... what is "broken" is elsewhere - in this case, data is not available on T2_CH_CERN.
… On Feb 22, 2018, at 6:50 PM, Marcel Schneider wrote:
@davidlange6 I am pretty sure it returns an empty result on cmsdev machines.
$ runTheMatrix.py -l 1001
...
0 0 0 0 0 0 0 0 tests passed, 1 0 0 0 0 0 0 0 failed
$ cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/
$ cat cmdLog
# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
echo '{
"165121":[[1,268435455]]
}' > step1_lumiRanges.log 2>&1
# in: /build/schneiml/CMSSW_10_1_X_2018-02-21-1100/src/blub going to execute cd 1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5
(dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN') | sort -u > step1_dasquery.log 2>&1
$ dasgoclient --limit 0 --query 'file dataset=/MinimumBias/Run2011A-v1/RAW run=165121 site=T2_CH_CERN'
$ # no output
Then, why does runTheMatrix try to load it from there? This is the issue I'd like to see fixed... (or maybe the data needs to be fixed, but there is clearly an issue...)
there is an on-going GitHub issue on the topic #22278 (not that it will necessarily fix this issue)
… On Feb 22, 2018, at 6:54 PM, Marcel Schneider wrote:
Then, why does runTheMatrix try to load it from there? This is the issue I'd like to see fixed...
Ah, great, so I am not the only one confused by this issue, and people are working on it. Then …
@Dr15Jones the Tracer service seems like a really useful tool; it saves a ton of manual debug output. However, it actually appears that the … Also, I don't see …
Well, @schneiml, I don't know how you configured your local setup, but … and look for:
Anyway, you don't have to take my word for it :) process.ALCARECOSiStripCalibAAG = cms.EDAnalyzer("SiStripGainsPCLWorker", ...) enters process.seqALCARECOPromptCalibProdSiStripGainsAAG, which in turn enters process.pathALCARECOPromptCalibProdSiStripGainsAAG, which finally enters the …
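A minimal sketch of the nesting described here, with the module parameters elided (the configuration names are taken from the comment above):

import FWCore.ParameterSet.Config as cms

process = cms.Process("ALCA")

# The worker module (its parameters are elided here)...
process.ALCARECOSiStripCalibAAG = cms.EDAnalyzer("SiStripGainsPCLWorker")

# ...enters a sequence...
process.seqALCARECOPromptCalibProdSiStripGainsAAG = cms.Sequence(
    process.ALCARECOSiStripCalibAAG
)

# ...which in turn enters a path, which would then end up on the schedule.
process.pathALCARECOPromptCalibProdSiStripGainsAAG = cms.Path(
    process.seqALCARECOPromptCalibProdSiStripGainsAAG
)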
By the way, looking into the tracer output of step3 of …
After a night of debugging, I can certainly say what the bug is: getAllContents() is currently broken. It only returns MonitorElements from a single module (since they are indexed by module id) -- or rather, from the module with the lower moduleId. The "workaround" is to copy MEs at endRun, as it was done before (52c0ef9). However, there are several other issues we will have to address...
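A toy model of the failure mode described above (plain Python, not the actual DQMStore code): if MonitorElements are stored keyed by the booking module's id and a global lookup only consults one key, everything booked by the other modules is silently dropped.

# Toy model of the getAllContents() bug -- not the real DQMStore.
# MonitorElements, indexed by the id of the module that booked them:
store = {
    41: ['path/meA', 'path/meB'],  # booked by module 41
    57: ['path/meC'],              # booked by module 57
}

def get_all_contents_broken(store):
    # Broken: only consults the module with the lower id.
    return store[min(store)]

def get_all_contents_fixed(store):
    # What callers expect: the MEs from every module.
    return [me for mes in store.values() for me in mes]

print(get_all_contents_broken(store))  # ['path/meA', 'path/meB'] -- meC is lost
print(get_all_contents_fixed(store))   # all three MEs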
@mmusich I was looking for something labeled …
@dmitrijus it seems that fixing …
In this particular case, what values for the arguments …
@Dr15Jones note also that #22218 did confuse …
+1
This is fixed now.
This issue is fully signed and ready to be closed. |
We have two RelVals failing with segmentation faults in DQMEDHarvesters
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1004.0_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI/step6_RunHI2011+RunHI2011+TIER0EXPHI+ALCAEXPHI+ALCAHARVD1HI+ALCAHARVD2HI+ALCAHARVD3HI+ALCAHARVD5HI.log
and
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc630/CMSSW_10_1_X_2018-02-20-2300/pyRelValMatrixLogs/run/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step8_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log
with both having the traceback: …