
Problem in the 94X backport of HcalDetId protections for calibration events #23744

Closed
fabiocos opened this issue Jul 4, 2018 · 19 comments

fabiocos commented Jul 4, 2018

The merge of #23688 has caused a set of reproducible failures in 2016 workflows in CMSSW_9_4_X_2018-07-03-2300, within DQM, with the exception:

----- Begin Fatal Exception 04-Jul-2018 10:09:31 CEST-----------------------
An exception of category 'Conditions not found' occurred while
[0] Processing Event run: 283877 lumi: 17 event: 27631378 stream: 1
[1] Running path 'dqmoffline_step'
[2] Calling method for module DigiTask/'digiTask'
Exception Message:
Unavailable Conditions of type HcalQIEData for cell (0x4e280440) (CastorRadFacility 1 / 2 / 0)
----- End Fatal Exception -------------------------------------------------

triggered from

https://cmssdt.cern.ch/lxr/source/CondFormats/CastorObjects/interface/CastorCondObjectContainer.h#0090

as used in the module https://cmssdt.cern.ch/lxr/source/DQM/HcalTasks/python/DigiTask.py, as far as I can see.
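
For reference, a minimal stand-alone sketch (assuming the usual DetId packing, with the detector field in bits 28-31 and the subdetector field in bits 25-27, and the standard DetId::Hcal / HcalSubdetector enum values) showing that the cell quoted in the exception decodes to a calibration-type id rather than a regular HB/HE/HO/HF channel, which is why no HcalQIEData entry exists for it:

```cpp
// Stand-alone sketch (assumed DetId layout: detector in bits 28-31,
// subdetector in bits 25-27, as in DataFormats/DetId and HcalDetId).
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t rawId = 0x4e280440;            // cell quoted in the exception
  const unsigned det = (rawId >> 28) & 0xF;     // 4 -> Hcal
  const unsigned subdet = (rawId >> 25) & 0x7;  // 7 -> HcalOther (calibration-type id)

  // Only subdetectors 1-4 (HB, HE, HO, HF) carry per-channel QIE conditions;
  // anything else must be skipped before asking the conditions container.
  const bool regular = (det == 4) && (subdet >= 1 && subdet <= 4);
  std::printf("det=%u subdet=%u -> %s\n", det, subdet,
              regular ? "regular HCAL channel" : "no HcalQIEData expected");
  return 0;
}
```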

cmsbuild commented Jul 4, 2018

A new Issue was created by @fabiocos Fabio Cossutti.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fabiocos commented Jul 4, 2018

@bsunanda FYI: the failing workflows are not present in 8_0_X; are we sure the problem does not also affect that release?

bsunanda commented Jul 4, 2018 via email

@abdoulline

@fabiocos,
(Cc @bsunanda, @DryRun, @deguio)

I guess the DetId fix may have "provoked" something (an undesirable consequence) that until now was hidden behind the improper treatment of calibration-channel ids, and it may need a fix in HCAL DQM as well.
We have never had HcalQIEData (nor any other conditions) for CRF channels (which are present in the e-map and are therefore unpacked). So I'll take this "offline" to an internal HCAL discussion for the moment...
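
To illustrate that mismatch, here is a small hypothetical model (stand-in types and made-up ids, except 0x4e280440 from the log; this is not the CMSSW code): channels listed in the e-map are unpacked into digis, but the QIE conditions cover only regular HCAL channels, so the lookup for the CRF cell fails just like the fatal exception above:

```cpp
// Hypothetical model, not CMSSW code: ids other than 0x4e280440 are invented.
#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>
#include <vector>

struct QIECoder { float offset = 0.f; };  // stand-in for a per-channel QIE payload

int main() {
  // Conditions exist only for regular HB/HE channels (illustrative ids).
  const std::map<uint32_t, QIECoder> qieData = {{0x42000001u, {}}, {0x44000001u, {}}};

  // Channels present in the e-map are unpacked into digis, including the
  // CastorRadFacility cell from the exception message.
  const std::vector<uint32_t> unpacked = {0x42000001u, 0x44000001u, 0x4e280440u};

  for (uint32_t id : unpacked) {
    auto it = qieData.find(id);
    if (it == qieData.end())
      // This is the path that surfaces as the
      // "Unavailable Conditions of type HcalQIEData" fatal exception.
      throw std::runtime_error("Conditions not found for cell");
    std::cout << "got QIE conditions for 0x" << std::hex << id << std::dec << '\n';
  }
  return 0;
}
```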

DryRun commented Jul 9, 2018

Hi @fabiocos, could I ask specifically which workflows are failing? I tried a few (2016 data), but they apparently ran successfully.

@abdoulline

@DryRun
I think the simplest approach (in the end you just need to wait ~1.5 h for the results and check the 2016 wf logs) would be to take the latest 94X IB and run the default matrix:

cmsrel CMSSW_9_4_X_2018-07-08-0000
cmsenv
voms-proxy-init -voms cms -rfc
runTheMatrix.py -s --useInput all &

@abdoulline

From runTheMatrix.py -s I see that both 2016-related wf's,
136.731_RunSinglePh2016B
136.7611_RunJetHT2016E
(100 events each),
finished OK.
So the question remains: which wf did fail...

@abdoulline

@fabiocos Thank you, Fabio.

Unfortunately in my case (lxplus) all the tests you've listed, like
runTheMatrix.py -l 136.769 &

failed at step 2 with similar file-access errors:
Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/data/Run2016H/DoubleEG/RAW/v1/000/283/877/00000/00B1E52F-889B-E611-B4B9-FA163EA8BEB3.root.

And I don't see RAW for the aforementioned sample:
eos ls /eos/cms/store/data/Run2016H/DoubleEG
MINIAOD

Hopefully David (@DryRun) will have a better chance...

DryRun commented Jul 11, 2018

Thanks @abdoulline and @fabiocos. 136.731 and 136.7611 were the ones I tried successfully previously, and unfortunately I see the same file access errors as @abdoulline for most of the others. On EOS, I do see something for eos ls /eos/cms/store/data/Run2016H/ZeroBias/RAW, so perhaps 136.778 will work... trying that now.

@abdoulline

CMSSW_9_4_X_2018-07-08-0000 + 136.778 = OK...

136.778_RunZeroBias2016H+RunZeroBias2016H+HLTDR2_2016+RECODR2_2016reHLT_Prompt+HARVESTDR2 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED - time date Wed Jul 11 16:45:10 2018-date Wed Jul 11 13:39:38 2018; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

@fabiocos

@abdoulline indeed, I reverted the crashing PR in CMSSW_9_4_X_2018-07-04-2300, as you can see in the IB history, so an earlier build should be used. Alternatively, just merge #23688 on top of today's IB and try...

@abdoulline

@fabiocos admittedly I had overlooked the revert...
So for @DryRun it should now be easy to reproduce the issue with an earlier IB, e.g.
CMSSW_9_4_X_2018-07-03-2300 + 136.778

DryRun commented Jul 12, 2018

Thanks for the suggestions. I was using CMSSW_9_4_X_2018-07-03-2300, but it was crashing yesterday due to the cvmfs issues.

It works today, and I was able to locate the crash. I think the crash can be fixed with c3cdccb (which requires subdet==HcalEndcap for QIE11 digis); f8aeac9 could also be included, which does the same thing for HBHE digis. See https://github.com/cms-sw/cmssw/commits/8d539d44ecea86fea7f16929f65101103fb077d4/DQM/HcalTasks/plugins/DigiTask.cc.

However, there are 7 commits to DigiTask.cc between 9_4_X and c3cdccb. Could we cherry-pick just the relevant commits, or do we have to take all of them in order?
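
For clarity, a hypothetical stand-alone model of the guard those commits introduce (simplified stand-in types, not the actual DigiTask.cc code): any digi whose subdetector is not HcalEndcap is skipped before the conditions are queried, so calibration/CRF channels coming from the e-map are ignored:

```cpp
// Hypothetical model of the guard, not the actual DigiTask.cc code.
#include <cstdint>
#include <iostream>
#include <vector>

enum HcalSubdetector { HcalEmpty = 0, HcalBarrel = 1, HcalEndcap = 2,
                       HcalOuter = 3, HcalForward = 4, HcalOther = 7 };

struct Qie11Digi {  // stand-in for a QIE11 data frame
  uint32_t rawId;
  HcalSubdetector subdet() const { return static_cast<HcalSubdetector>((rawId >> 25) & 0x7); }
};

int main() {
  // One regular HE channel (illustrative id) plus the calibration-type cell from the log.
  const std::vector<Qie11Digi> digis = {{0x44000001u}, {0x4e280440u}};

  for (const auto& digi : digis) {
    if (digi.subdet() != HcalEndcap) {
      // Calibration / CastorRadFacility channels have no QIE conditions,
      // so skip them instead of triggering the conditions lookup.
      std::cout << "skip non-HE digi 0x" << std::hex << digi.rawId << std::dec << '\n';
      continue;
    }
    std::cout << "fill histograms for HE digi 0x" << std::hex << digi.rawId << std::dec << '\n';
  }
  return 0;
}
```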

abdoulline commented Jul 12, 2018

David, the two fixing snippets you've linked contain just a few lines of code, so I'd suggest you submit this "minimum-minimorum" fix to 94X without any other back-porting...

@abdoulline

@DryRun David?

DryRun commented Jul 17, 2018

Hi @abdoulline, thanks for the reminder. I made a backport of the two commits at #23808, which should fix this crash.

abdoulline commented Jul 17, 2018 via email

fabiocos commented Aug 2, 2018

The problem looks fixed now.

fabiocos closed this as completed Aug 2, 2018