
DQM: Fix potential deadlock in setBinLabel. #28916

Merged
merged 4 commits into cms-sw:master on Feb 12, 2020

Conversation

@schneiml
Contributor

commented Feb 11, 2020

PR description:

Fixes the IB test failure in 136.8861 (and maybe others). The deadlock was triggered by LogMessageMonitor:ClusterizerLogMessageMonCommon in bookHistograms. Reported in #28622.

This one only triggers in the error path, because getFullname tries to take a lock that we already hold.
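To illustrate the pattern with a minimal sketch (made-up names, not the actual MonitorElement/DQMStore code):

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real class, just to show the self-deadlock.
struct Element {
  mutable std::mutex lock_;  // non-recursive, like a plain std::mutex
  std::string dir_, name_;

  std::string getFullname() const {
    std::lock_guard<std::mutex> guard(lock_);  // takes the same lock again
    return dir_ + "/" + name_;
  }

  void setBinLabel(int bin, std::string const& label) {
    std::lock_guard<std::mutex> guard(lock_);  // lock is now held
    if (bin < 1) {
      // Error path: getFullname() blocks forever on the lock held above.
      throw std::runtime_error("illegal bin in " + getFullname());
    }
    // ... set the label on the underlying histogram ...
  }
};
```

With a non-recursive mutex, the second lock attempt on the same thread never returns, which matches the hang reported in bookHistograms.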

However, there is another opportunity for a deadlock via edm::Log* (which probably takes a lock) and the per-ME locks, which could be taken before or after edm::Log* is called. I reviewed all usages of edm::Log* in DQMServices/Core and did not see anything problematic.

Note that this is the only usage of edm::Log* in MonitorElement, and it now avoids taking multiple locks at the same time, so there should not be any more problems of this type.

This PR now also includes some commits from #28813 related to Sumw2, to fix a completely unrelated problem with 250202.183 that only appears once the previous deadlock is fixed.

PR validation:

Runs 136.8861 without deadlock.

@schneiml
Contributor Author

@cmsbuild ping ?

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@schneiml mentioned this pull request on Feb 11, 2020
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28916/13709

@cmsbuild
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

DQMServices/Core

@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks.
@barvic this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Contributor Author

please test 136.8861, 250202.183

No changes expected, except that these two workflows now don't time out.

@Dr15Jones
Contributor

AFAIK the MessageLogger does not take any locks.

@Dr15Jones
Contributor

To understand where the deadlock happens:

  1. Run the job in gdb
  2. When the job seems deadlocked, interrupt it with Ctrl-C
  3. Do thread apply all bt

This will give you the stack traces for all threads.

If you post the results here we can help.
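Concretely, those steps look something like this (step3_cfg.py is just a placeholder for the hanging job's configuration):

```
$ gdb --args cmsRun step3_cfg.py    # placeholder config name
(gdb) run
# ... wait until the job appears to hang, then interrupt it with Ctrl-C ...
(gdb) thread apply all bt
```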

@schneiml
Contributor Author

@Dr15Jones the actual deadlock here is trivial: getFullname takes the lock that the access in setBinLabel already holds.

The MessageLogger story was hypothetical, but it is good to know that there are no locks there, so it cannot be a problem.
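For reference, a common way to break such a self-deadlock is to route the error path through an unlocked helper so the same thread never requests the lock twice (a sketch in that spirit; not necessarily the exact change made in this PR):

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

struct Element {
  mutable std::mutex lock_;
  std::string dir_, name_;

  // Internal helper: the caller must already hold lock_.
  std::string fullnameUnlocked() const { return dir_ + "/" + name_; }

  std::string getFullname() const {
    std::lock_guard<std::mutex> guard(lock_);
    return fullnameUnlocked();
  }

  void setBinLabel(int bin, std::string const& label) {
    std::lock_guard<std::mutex> guard(lock_);
    if (bin < 1) {
      // The error path no longer calls getFullname(), so the same thread
      // never tries to take lock_ a second time.
      throw std::runtime_error("illegal bin in " + fullnameUnlocked());
    }
    // ... set the label on the underlying histogram ...
  }
};
```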

@silviodonato
Contributor

please test workflow 136.8861, 250202.183

@cmsbuild
Contributor

cmsbuild commented Feb 11, 2020

The tests are being triggered in jenkins.
Test Parameters:

@cmsbuild
Contributor

-1

Tested at: e2a9965

CMSSW: CMSSW_11_1_X_2020-02-10-2300
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4599/summary.html

I found the following errors while testing this PR:

Failed tests: RelVals

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
250202.183 step3

runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/step3_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25.log

@cmsbuild
Contributor

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@schneiml
Contributor Author

250202.183 now fails with a variation of the Sumw2 problem we know from #28813. I am quite surprised to see it appear here, but it is possible when multiple modules book the same ME (this story appeared a lot in #28622). I will provide a fix; it will be needed in the future anyway...
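For context, one generic shape of this class of problem is enabling TH1::Sumw2() on a histogram that already has its sum-of-weights structure (which can happen when the same histogram is effectively booked twice); a small guard avoids the duplicate call. This is an illustrative sketch only, not the actual fix pulled in from #28813:

```cpp
#include "TH1.h"

// Enable per-bin sum-of-weights storage only if it is not there yet.
void enableSumw2(TH1& h) {
  if (h.GetSumw2N() == 0) {
    h.Sumw2();
  }
}
```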

@cmsbuild
Contributor

Pull request #28916 was updated. @SiewYan, @andrius-k, @mkirsano, @kmaeshima, @schneiml, @efeyazgan, @cmsbuild, @jfernan2, @agrohsje, @fioriNTU, @alberto-sanchez, @qliphy can you please check and sign again.

@schneiml
Contributor Author

please test workflow 136.8861, 250202.183

Hope I got the command right this time, and that 250202.183 actually passes. An output comparison for that wf would be interesting; it triggers a lot of rare code paths (related to processing multiple runs).

@cmsbuild
Contributor

cmsbuild commented Feb 11, 2020

The tests are being triggered in jenkins.
Test Parameters:

@cmsbuild
Contributor

+1
Tested at: 6d3777a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/summary.html
CMSSW: CMSSW_11_1_X_2020-02-11-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/summary.html

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-99170b/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-99170b/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2694005
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2693658
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@silviodonato
Contributor

Thanks @schneiml. This PR fixes 250202.183 and 136.8861. The log is partially changed, but is that expected?

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-02-07-2300/pyRelValMatrixLogs/run/250202.181_TTbar_13UP18+TTbar_13UP18INPUT+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18INPUT+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt/step4_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt.log

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-02-07-2300/pyRelValMatrixLogs/run/136.8861_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt/step4_RunEGamma2018Dml1+RunEGamma2018Dml1+HLTDR2_2018ml+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQMml+HARVEST2018_L1TEgDQM_Prompt.log

@schneiml
Contributor Author

@silviodonato "Expected" is maybe a bit too confident, but it is within the range of changes that #28622 and the PRs that came before it caused in general. Specifically, that TrigReport output is written by EDM (in schedule.cc) when there are any TrigPaths. I am not really sure what that means, but it sounds unrelated and not very worrying for this kind of job.

The new MergeFailure messages in 250202.183 seem more significant -- this job does multi-run harvesting, and the given MEs seem not to merge across runs. I suspect that in the past this harvesting job did not actually do multi-run harvesting, and instead only considered the last run or something like that. Can anybody explain what the expected result for 250202.183 is? What kind of processing is it supposed to model?

@schneiml
Contributor Author

+1

Insofar as this PR fixes the problems caused in #28622. The discussion around 250202.183 is more about #28622 itself, not this PR -- it only became visible here.

@silviodonato
Contributor

merge
We need this PR urgently in the next IB. @alberto-sanchez @agrohsje @efeyazgan @mkirsano @qliphy @SiewYan please check even if it has already been merged.

@cmsbuild merged commit 5ebc557 into cms-sw:master on Feb 12, 2020
@makortel
Contributor

makortel commented Feb 12, 2020

> Can anybody explain what the expected result for 250202.183 is? What kind of processing is it supposed to model?

Run-dependent MC, see #28214.

@schneiml
Contributor Author

@makortel thanks for the link, I was guessing it was something in that direction. What matters is that this wf processes multiple runs in a single reco job (something we technically support, though I was not aware we actually do that), followed by multi-run harvesting with a rather broken configuration: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/DQM_V0001_R000316083__Global__CMSSW_X_Y_Z__RECO.root

With the old DQMStore that file would probably only contain data from the last run, and the runs before would be silently dropped (because multi-run harvesting, a.k.a. "collate", is not enabled, but "single job, multiple files" harvesting, a.k.a. saveAtEndRun, is not enabled either -- and even if it was, it has been broken since the threaded migration). With the new DQMStore it now does proper multi-run harvesting (the new DQMStore only supports multi-run harvesting; "normal" harvesting is just multi-run harvesting with a single run as input), but the file name is still wrong, since the "multi-run harvesting" config fragment (DQMSaver_AtEndJob IIRC), which forces the run number in the output file name to 999999, is not loaded.

I guess I am fine with all of that, though potential users of this WF might want to be aware of it. Also, the output files of such jobs must not be uploaded to production DQMGUIs as-is -- for files with data from multiple runs, the correct run number 999999 must be used. (Or one could argue that run number 1 should be used if this is in fact simulated data; I'd be fine with that.)

@makortel
Contributor

@srimanob @chayanit @amassiro Please take note of what @schneiml described above about the harvesting step of the run-dependent MC workflow.

@chayanit

@schneiml we have to start running these Run-dependent MC RelVals. What would you recommend for the harvesting step? Who should I get in touch with to fix this for long-term usage?

@schneiml
Contributor Author

@chayanit I think the right thing to do would be to use the AtEndJob configuration (there is a flag in cmsDriver for that). I don't know exactly what should happen, but that way you get a single DQM output file that covers all data.

Regarding run number 1 vs. 999999, I think we (DQM/PPD) need to make a definition, but to me using 999999, as for any other multi-run harvesting, seems easier.
