CMSSW tests fails with Fatal Root Error: @SUB=Minuit2 #42979

Closed
smuzaffar opened this issue Oct 11, 2023 · 14 comments · Fixed by #43106

Comments

@smuzaffar
Contributor

smuzaffar commented Oct 11, 2023

We updated the ROOT master commit for the CMSSW ROOT6 IBs to 744dcdea97, but it caused many tests in cmssw to fail with error [a]. Tests with ROOT commit 5df0ef8bfa worked fine. The ROOT change set in question is root-project/root@5df0ef8...744dcde. I see there are some changes in ROOT's Minuit2 code (e.g. Minuit2 is now the default minimizer). Is there something we need to update in cmssw to accommodate the new ROOT Minuit2 changes?

[a]

----- Begin Fatal Exception 11-Oct-2023 00:42:39 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing global end Run run: 1
   [1] Calling method for module DQMGenericClient/'postProcessorTrack'
   Additional Info:
      [a] Fatal Root Error: @SUB=Minuit2
VariableMetricBuilder Initial matrix not pos.def.

----- End Fatal Exception -------------------------------------------------
@cmsbuild
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@makortel, @smuzaffar, @Dr15Jones, @antoniovilela, @rappoccio, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar
Contributor Author

FYI @guitargeek

@smuzaffar
Contributor Author

Do we need a fix like #38687 ( https://github.com/cms-sw/cmssw/blob/master/DQM/BeamMonitor/plugins/Vx3DHLTAnalyzer.cc#L423-L432 ) to add a protection for Minuit2 Fatal Root Error?

@smuzaffar
Contributor Author

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

@guitargeek
Contributor

guitargeek commented Oct 11, 2023

Hi @smuzaffar, thanks for pinging me about this!

According to @lmoneta, this looks like an actual problem in Minuit2 that we need to debug and fix, so please don't just sweep the problem under the rug with a try-catch 🙂

How can I reproduce the failure? Note that I don't have access to CMS Jenkins as a non CMS member, but I can still use CMSSW on lxplus. Maybe you could tell me which tag to checkout and which test to run?

Note that after we have fixed the actual problem on the ROOT side, this PR should probably also be reverted:

Linking also two PRs for reference:

@Dr15Jones
Contributor

Is the problem because, by default, CMS turns ROOT Error/Warning messages into exceptions? Maybe this is one we should tell our code not to convert?

@guitargeek
Contributor

Well, it's actually nice that you turned this into an exception, because this is most likely a logic error in Minuit2 that we would not have spotted otherwise!
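The message-to-exception conversion discussed here can be sketched in plain Python. This is only an illustration of the general pattern, not the CMSSW or ROOT API: `Severity`, `FatalToolkitError`, `DO_NOT_CONVERT`, and `handle_message` are hypothetical names, and the severity levels are assumed for the example.

```python
# Hypothetical sketch of "promote severe toolkit messages to exceptions".
# None of these names exist in CMSSW or ROOT; they only illustrate the idea.
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    ERROR = 2
    FATAL = 3


class FatalToolkitError(RuntimeError):
    """Raised when a toolkit message is promoted to an exception."""


# Message fragments that are logged but never promoted (hypothetical list).
DO_NOT_CONVERT = ("some benign message",)


def handle_message(severity, location, message, log):
    """Always log the message; raise if it is severe and not exempted."""
    log.append(f"[{severity.name}] {location}: {message}")
    exempt = any(fragment in message for fragment in DO_NOT_CONVERT)
    if severity >= Severity.ERROR and not exempt:
        raise FatalToolkitError(f"@SUB={location}\n{message}")
```

Adding a fragment to the exemption list would silence one specific message, which is the option raised above; leaving it out keeps the failure visible, which is what made this problem noticeable in the first place.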

@smuzaffar
Contributor Author

smuzaffar commented Oct 11, 2023

@guitargeek, you can reproduce it using the following on lxplus8:

> ssh lxplus8
> cd /tmp/$(whoami)
> /cvmfs/cms.cern.ch/common/scram p CMSSW_13_3_ROOT6_X_2023-10-10-2300
> cd CMSSW_13_3_ROOT6_X_2023-10-10-2300
> eval `/cvmfs/cms.cern.ch/common/scram run -sh`
> cmsRun /afs/cern.ch/user/c/cmsbuild/public/root6/36.0/step4_HARVESTING.py

@guitargeek
Contributor

guitargeek commented Oct 12, 2023

Hi @smuzaffar, thanks for this, I can reproduce it!

But I'm still at a loss as to what actually happens. Is it possible to make cmsRun rethrow the exception so I can actually get a stack trace?

I tried this in the process options:

Rethrow = cms.untracked.vstring("ProductNotFound", "FatalRootError"),

But it didn't change the behavior. CMSSW is just carrying on after the fatal exception.

@makortel
Contributor

What do you mean by "rethrow"? When the framework catches an exception, it works to shut down. Do you mean rethrowing the exception to be caught by the C++ runtime? We don't have a facility to do that.

In order to get a stack trace of the exception, you could run cmsTraceExceptions cmsRun .../step4_HARVESTING.py (which is just a wrapper for gdb).

@guitargeek
Contributor

Perfect! Thanks, yes that was exactly what I needed

@guitargeek
Contributor

I could extract the relevant histogram and opened an upstream issue. We'll work on this with high priority.

@smuzaffar
Contributor Author

#43106 fixes the issue by using a likelihood fit instead of a chi-square fit, but this change does show many differences in the DQM/reco comparisons.

zhenbinwu pushed a commit to zhenbinwu/cmssw that referenced this issue Feb 14, 2024
The DQM plots use the `TH2::FitSlicesY()` function to fit some Gaussians.
However, some of the fits are failing. This was not resulting in errors
so far, but with the switch to Minuit2 by default in ROOT 6.30 it will.

The problem is that it uses chi-square fits to fit slices with many
empty bins, which is not appropriate. Doing a likelihood fit with the
`"l"` option is one way to fix the problem, because it can better deal
with empty bins.

Closes cms-sw#42979.
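
To illustrate why empty bins matter, here is a minimal toy sketch. It is not the Gaussian slice fit used in the DQM code: it fits a single constant to bin counts, and it assumes a Neyman-style chi-square in which each bin's variance is its observed count, so empty bins carry no usable weight and are skipped. The function names and the example counts are made up for illustration.

```python
# Toy comparison: fit a constant level c to histogram bin counts.
# Assumption: the chi-square uses the observed count as the bin variance,
# so bins with zero entries cannot be weighted and are dropped.


def neyman_chi2_constant(counts):
    """Minimise sum((n - c)^2 / n) over non-empty bins.

    Setting the derivative to zero gives c = N_filled / sum(1 / n),
    the harmonic mean of the non-empty bins.
    """
    filled = [n for n in counts if n > 0]
    return len(filled) / sum(1.0 / n for n in filled)


def poisson_likelihood_constant(counts):
    """Minimise the Poisson negative log-likelihood sum(c - n * log(c)).

    Setting the derivative to zero gives c = sum(n) / N_bins,
    the plain mean over all bins, empty ones included.
    """
    return sum(counts) / len(counts)


counts = [0, 0, 1, 0, 2, 0, 0, 1]  # sparse slice: most bins are empty
print("chi-square estimate:", neyman_chi2_constant(counts))
print("likelihood estimate:", poisson_likelihood_constant(counts))
```

For these counts the chi-square estimate is 1.2, while the likelihood estimate is 0.5, the actual average bin content. The chi-square result ignores the five empty bins entirely, which is the kind of mismatch a likelihood fit avoids.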