Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IB Failure with workflow 23234.103 #35979

Closed
qliphy opened this issue Nov 4, 2021 · 11 comments
Closed

IB Failure with workflow 23234.103 #35979

qliphy opened this issue Nov 4, 2021 · 11 comments

Comments

@qliphy
Copy link
Contributor

qliphy commented Nov 4, 2021

It appeared firstly from IB CMSSW_12_2 2021-10-31-2300, and persists up to most recent IB.

Previously we thought it is due to #35744 , and #35941 may fix this as a byproduct
#35744 (comment)
The PR test with #35941 enabled with multithreading worked well with 23234.103
#35941 (comment)

However, after merging #35941 the issue persists up to most recent IB. We may need to wait to see the next IB results.

Let's open a thread here to track this issue.

Adding @drankincms

p.s. I tried several local tests and they work well with
runTheMatrix.py -l 23234.103 --job-reports -t -4 --ibeos

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 4, 2021

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@qliphy
Copy link
Contributor Author

qliphy commented Nov 4, 2021

assign dqm,l1

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 4, 2021

New categories assigned: dqm,l1

@jfernan2,@ahmad3213,@rvenditti,@emanueleusai,@rekovic,@pbo0,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor

I believe I've traced down the problem to a thread safety issue in ROOT.

The call to QuantilesX does

https://github.com/root-project/root/blob/de7bcdff077be4ec08892c1d8349d9f8e62dd420/hist/hist/src/TH2.cxx#L2460-L2463

and in DoQuantiles there is a call to DoProjection which uses the histogram name "tmp".
https://github.com/root-project/root/blob/de7bcdff077be4ec08892c1d8349d9f8e62dd420/hist/hist/src/TH2.cxx#L2531

It then checks gROOT if that histogram name already exists and then attempts to re-use it
https://github.com/root-project/root/blob/de7bcdff077be4ec08892c1d8349d9f8e62dd420/hist/hist/src/TH2.cxx#L2189-L2199

So if any other module is calling another ROOT function which also uses the "tmp" histogram then the histogram could be reset right as the call to ComputeIntegral happens.

The upshot is it is not safe to call QuantilesX or QuantilesY and you will have to find another way to fill the histograms.

FYI @pcanal.

@Dr15Jones
Copy link
Contributor

Dr15Jones commented Nov 4, 2021

@drankincms
Copy link
Contributor

@Dr15Jones thanks a lot for tracking this down! Ok, I will go ahead and implement QuantilesX() by hand. @qliphy I will go ahead and submit a PR to fix this once it's ready.

@Dr15Jones
Copy link
Contributor

@drankincms it looks like if you first set ROOT to a unique new directory it will place the "tmp" into that directory which should isolate you from any other threads.

@Dr15Jones
Copy link
Contributor

So the ROOT code does search gDirectory for an object with the name "tmp". It looks like you can temporarily replace gDirectory via

https://github.com/root-project/root/blob/52afd08119469dae2ebecf14fbfe7c05e8a756b3/core/base/inc/TDirectory.h#L73-L84

@drankincms
Copy link
Contributor

@Dr15Jones Thanks a bunch for the tip, your TContext trick seems to fix this.

@jfernan2
Copy link
Contributor

+1
I believe this issue is being fixed by #36067

@perrotta
Copy link
Contributor

@qliphy I think we can close this (please reopen if I'm wrong)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants