
DQM: Fix potential deadlock in setBinLabel. #28916

Merged
merged 4 commits into cms-sw:master on Feb 12, 2020

Conversation

@schneiml
Contributor

commented Feb 11, 2020

PR description:

Fixes the IB test failure in 136.8861 (and maybe others). The deadlock was triggered by LogMessageMonitor:ClusterizerLogMessageMonCommon in bookHistograms. Reported in #28622.

This one only triggers in the error path, because getFullname tries to take a lock that we already hold.
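To illustrate the pattern with a minimal sketch (made-up names, not the actual MonitorElement/DQMStore code):

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real class, just to show the self-deadlock.
struct Element {
  mutable std::mutex lock_;  // non-recursive, like a plain std::mutex
  std::string dir_, name_;

  std::string getFullname() const {
    std::lock_guard<std::mutex> guard(lock_);  // takes the same lock again
    return dir_ + "/" + name_;
  }

  void setBinLabel(int bin, std::string const& label) {
    std::lock_guard<std::mutex> guard(lock_);  // lock is now held
    if (bin < 1) {
      // Error path: getFullname() blocks forever on the lock held above.
      throw std::runtime_error("illegal bin in " + getFullname());
    }
    // ... set the label on the underlying histogram ...
  }
};
```

With a non-recursive mutex, the second lock attempt on the same thread never returns, which matches the hang reported in bookHistograms.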

However, there is another opportunity for a deadlock via edm::Log* (which probably takes a lock) and the per-ME locks, which could be taken before or after edm::Log* is called. I reviewed all usages of edm::Log* in DQMServices/Core and did not see anything problematic.

Note that this is the only usage of edm::Log* in MonitorElement, and it now avoids taking multiple locks at the same time, so there should not be any more problems of this type.

This PR now also includes some commits from #28813 related to Sumw2, to fix a completely unrelated problem with 250202.183 that only appears once the previous deadlock is fixed.

PR validation:

Runs 136.8861 without deadlock.

@schneiml
Contributor Author

@cmsbuild ping ?

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@schneiml mentioned this pull request on Feb 11, 2020
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28916/13709

@cmsbuild
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

DQMServices/Core

@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks.
@barvic this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Contributor Author

please test 136.8861, 250202.183

No changes expected, except that these two workflows now don't time out.

@Dr15Jones
Contributor

AFAIK the MessageLogger does not take any locks.

@Dr15Jones
Contributor

To understand where the deadlock happens:

  1. Run the job in gdb
  2. When the job seems deadlocked, interrupt it with Ctrl-C
  3. Do thread apply all bt

This will give you the stack traces for all threads.

If you post the results here we can help.
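Concretely, those steps look something like this (step3_cfg.py is just a placeholder for the hanging job's configuration):

```
$ gdb --args cmsRun step3_cfg.py    # placeholder config name
(gdb) run
# ... wait until the job appears to hang, then interrupt it with Ctrl-C ...
(gdb) thread apply all bt
```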

@schneiml
Contributor Author

@Dr15Jones the actual deadlock here is trivial: getFullname takes the lock that the access in setBinLabel already holds.

The MessageLogger story was hypothetical, but it is good to know that there are no locks there, so it cannot be a problem.
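For reference, a common way to break such a self-deadlock is to route the error path through an unlocked helper so the same thread never requests the lock twice (a sketch in that spirit; not necessarily the exact change made in this PR):

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

struct Element {
  mutable std::mutex lock_;
  std::string dir_, name_;

  // Internal helper: the caller must already hold lock_.
  std::string fullnameUnlocked() const { return dir_ + "/" + name_; }

  std::string getFullname() const {
    std::lock_guard<std::mutex> guard(lock_);
    return fullnameUnlocked();
  }

  void setBinLabel(int bin, std::string const& label) {
    std::lock_guard<std::mutex> guard(lock_);
    if (bin < 1) {
      // The error path no longer calls getFullname(), so the same thread
      // never tries to take lock_ a second time.
      throw std::runtime_error("illegal bin in " + fullnameUnlocked());
    }
    // ... set the label on the underlying histogram ...
  }
};
```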

@silviodonato
Contributor

please test workflow 136.8861, 250202.183

@cmsbuild
Contributor

cmsbuild commented Feb 11, 2020

The tests are being triggered in jenkins.
Test Parameters:

@cmsbuild
Contributor

-1

Tested at: e2a9965

CMSSW: CMSSW_11_1_X_2020-02-10-2300
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4599/summary.html

I found the following errors while testing this PR:

Failed tests: RelVals

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
250202.183 step3

runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/step3_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25.log

@cmsbuild
Contributor

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@schneiml
Contributor Author

250202.183 now fails with a variation of the Sumw2 problem we know from #28813. I am quite surprised to see it appear here, but it is possible when multiple modules book the same ME (this story appeared a lot in #28622). I will provide a fix; it will be needed in the future anyway...
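For context, one generic shape of this class of problem is enabling TH1::Sumw2() on a histogram that already has its sum-of-weights structure (which can happen when the same histogram is effectively booked twice); a small guard avoids the duplicate call. This is an illustrative sketch only, not the actual fix pulled in from #28813:

```cpp
#include "TH1.h"

// Enable per-bin sum-of-weights storage only if it is not there yet.
void enableSumw2(TH1& h) {
  if (h.GetSumw2N() == 0) {
    h.Sumw2();
  }
}
```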

@cmsbuild
Contributor

Pull request #28916 was updated. @SiewYan, @andrius-k, @mkirsano, @kmaeshima, @schneiml, @efeyazgan, @cmsbuild, @jfernan2, @agrohsje, @fioriNTU, @alberto-sanchez, @qliphy can you please check and sign again.

@schneiml
Contributor Author

please test workflow 136.8861, 250202.183

Hope I got the command right this time, and that 250202.183 actually passes. An output comparison for that wf would be interesting; it triggers a lot of rare code paths (related to processing multiple runs).

@cmsbuild
Contributor

cmsbuild commented Feb 11, 2020

The tests are being triggered in jenkins.
Test Parameters:

@cmsbuild
Contributor

+1
Tested at: 6d3777a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/summary.html
CMSSW: CMSSW_11_1_X_2020-02-11-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/summary.html

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-99170b/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-99170b/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2694005
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2693658
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@silviodonato
Contributor

Thanks @schneiml. This PR fixes 250202.183 and 136.8861. The log is partially changed, but is that expected?

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-02-07-2300/pyRelValMatrixLogs/run/250202.181_TTbar_13UP18+TTbar_13UP18INPUT+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18INPUT+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt/step4_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt.log

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-02-07-2300/pyRelValMatrixLogs/run/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt/step4_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt.log

@schneiml
Contributor Author

@silviodonato "Expected" is maybe a bit too confident, but it is within the range of changes that #28622 and the PRs that came before it caused in general. Specifically, that TrigReport output is written by EDM (in schedule.cc) when there are any TrigPaths. I am not really sure what that means, but it sounds unrelated and not very worrying for this kind of job.

The new MergeFailure messages in 250202.183 seem more significant -- this job does multi-run harvesting, and the given MEs seem not to merge across runs. I suspect that in the past this harvesting job did not actually do multi-run harvesting, and instead only considered the last run or something like that. Can anybody explain what the expected result for 250202.183 is? What kind of processing is it supposed to model?

@schneiml
Contributor Author

+1

Insofar as this PR fixes the problems caused in #28622. The discussion around 250202.183 is more about #28622 itself, not this PR -- it only became visible here.

@silviodonato
Contributor

merge
We need this PR urgently in the next IB. @alberto-sanchez @agrohsje @efeyazgan @mkirsano @qliphy @SiewYan please check even if it has already been merged.

@cmsbuild merged commit 5ebc557 into cms-sw:master on Feb 12, 2020
@makortel
Contributor

makortel commented Feb 12, 2020

> Can anybody explain what the expected result for 250202.183 is? What kind of processing is it supposed to model?

Run-dependent MC, see #28214.

@schneiml
Contributor Author

@makortel thanks for the link, I was guessing it was something in that direction. What matters is that this wf processes multiple runs in a single reco job (something we technically support, though I was not aware we actually do that), followed by multi-run harvesting with a rather broken configuration: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/DQM_V0001_R000316083__Global__CMSSW_X_Y_Z__RECO.root

With the old DQMStore that file would probably only contain data from the last run, and the runs before would be silently dropped (because multi-run harvesting, a.k.a. "collate", is not enabled, but "single job, multiple files" harvesting, a.k.a. saveAtEndRun, is not enabled either -- and even if it was, it has been broken since the threaded migration). With the new DQMStore it now does proper multi-run harvesting (the new DQMStore only supports multi-run harvesting; "normal" harvesting is just multi-run harvesting with a single run as input), but the file name is still wrong, since the "multi-run harvesting" config fragment (DQMSaver_AtEndJob IIRC), which forces the run number in the output file name to 999999, is not loaded.

I guess I am fine with all of that, though potential users of this WF might want to be aware of it. Also, the output files of such jobs must not be uploaded to production DQMGUIs as-is -- for files with data from multiple runs, the correct run number 999999 must be used. (Or one could argue that run number 1 should be used if this is in fact simulated data; I'd be fine with that.)

@makortel
Contributor

@srimanob @chayanit @amassiro Please take note of what @schneiml described above about the harvesting step of the run-dependent MC workflow.

@chayanit

@schneiml we have to start running these Run-dependent MC RelVals. What would you recommend for the harvesting step? Who should I get in touch with to fix this for long-term usage?

@schneiml
Contributor Author

@chayanit I think the right thing to do would be to use the AtEndJob configuration (there is a flag in cmsDriver for that). I don't know exactly what should happen, but that way you get a single DQM output file that covers all data.

Regarding run number 1 vs. 999999, I think we (DQM/PPD) need to make a definition, but to me using 999999, as for any other multi-run harvesting, seems easier.
