HLT DQM for 9.3.x - part 3 #19891

Merged
merged 14 commits into cms-sw:master on Jul 25, 2017


fwyzard
Contributor

@fwyzard fwyzard commented Jul 24, 2017

Follow up to #19713.

Merge the following HLT DQM pull requests (in this order):

@fwyzard
Contributor Author

fwyzard commented Jul 24, 2017

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Jul 24, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/21721/console Started: 2017/07/24 22:06

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

DQMOffline/Trigger

@vazzolini, @kmaeshima, @dmitrijus, @cmsbuild, @vanbesien, @davidlange6 can you please review it and eventually sign? Thanks.
@battibass, @mtosi, @jhgoh, @calderona, @HuguesBrun, @trocino, @rociovilar this is something you requested to watch as well.
@davidlange6 you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

Comparison job queued.

@fwyzard
Contributor Author

fwyzard commented Jul 25, 2017

Thanks. For more HLT DQM, do I open a new PR, or keep merging things here?

@davidlange6
Contributor

merge

@davidlange6
Contributor

I meant to merge this.

@cmsbuild cmsbuild merged commit 939e204 into cms-sw:master Jul 25, 2017
@davidlange6
Contributor

hi @fwyzard - it looks like this PR has introduced a bunch of seg faults - could you find a quick fix?

e.g.

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_amd64_gcc630/CMSSW_9_3_X_2017-07-25-1100/pyRelValMatrixLogs/run/136.784_RunMET2017B+RunMET2017B+HLTDR2_2017+RECODR2_2017reHLT_skimMET_Prompt+HARVEST2017/step3_RunMET2017B+RunMET2017B+HLTDR2_2017+RECODR2_2017reHLT_skimMET_Prompt+HARVEST2017.log

----- Begin Fatal Exception 25-Jul-2017 14:24:05 CEST-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing stream end Run run: 297227 stream: 2
[1] Calling method for module TopMonitor/'susyEle17CaloIdMJet30_ele'
Additional Info:
[a] Fatal Root Error: @sub=Merge
Cannot merge histograms - limits are inconsistent:
first: elePtEta_1_denominator (9, 0.000000, 400.000000), second: elePtEta_1_denominator (10, 0.000000, 400.000000)
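The error above comes from ROOT refusing to add two histograms whose binning differs: in a multi-stream job each stream fills its own copy, and the copies are summed when the run ends. The sketch below is a simplified pure-Python illustration of that rule, not ROOT's actual `TH1::Merge` code; the `Hist` class and `merge` function are hypothetical, and only a fixed-width 1D axis is modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hist:
    """Hypothetical fixed-width 1D histogram: nbins bins spanning [xmin, xmax)."""
    name: str
    nbins: int
    xmin: float
    xmax: float
    counts: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.counts:
            self.counts = [0.0] * self.nbins

    def fill(self, x: float) -> None:
        # Values outside [xmin, xmax) are ignored (no under/overflow bins in this sketch).
        if self.xmin <= x < self.xmax:
            width = (self.xmax - self.xmin) / self.nbins
            index = min(int((x - self.xmin) / width), self.nbins - 1)  # guard against rounding
            self.counts[index] += 1.0


def merge(first: Hist, second: Hist) -> Hist:
    """Add two histograms bin by bin, refusing if their axes differ."""
    if (first.nbins, first.xmin, first.xmax) != (second.nbins, second.xmin, second.xmax):
        raise ValueError(
            "Cannot merge histograms - limits are inconsistent: "
            f"first: {first.name} ({first.nbins}, {first.xmin:f}, {first.xmax:f}), "
            f"second: {second.name} ({second.nbins}, {second.xmin:f}, {second.xmax:f})"
        )
    summed = [a + b for a, b in zip(first.counts, second.counts)]
    return Hist(first.name, first.nbins, first.xmin, first.xmax, summed)
```

Merging a 9-bin and a 10-bin copy of `elePtEta_1_denominator` over [0, 400) raises, just as in the log, while two copies with identical binning simply add up bin by bin.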

@fwyzard
Contributor Author

fwyzard commented Jul 25, 2017

will look - just to confirm, we are talking about exceptions, not segfaults, right?

@fwyzard
Contributor Author

fwyzard commented Jul 25, 2017

I suspect I understand the problem - it looks like multiple modules are booking the same histogram; I've asked @parbol to have a look in #19290.
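If that guess is right, two modules book a histogram under the same path but with different binning, so the per-stream copies can never be summed. A minimal sketch of catching such a clash at booking time rather than at the end of the run, assuming a hypothetical registry keyed by histogram path (this is not the DQMStore API; the class and method names are invented for illustration):

```python
from typing import Dict, Tuple

Binning = Tuple[int, float, float]  # (nbins, xmin, xmax)


class BookingRegistry:
    """Hypothetical registry mapping a histogram path to the module and binning that booked it."""

    def __init__(self) -> None:
        self._booked: Dict[str, Tuple[str, Binning]] = {}

    def book(self, module: str, path: str, binning: Binning) -> None:
        previous = self._booked.get(path)
        if previous is None:
            # First booking of this path: remember who booked it and how.
            self._booked[path] = (module, binning)
            return
        prev_module, prev_binning = previous
        if prev_binning != binning:
            # Same path, different axis: per-stream copies could never be merged.
            raise RuntimeError(
                f"{path}: booked by {prev_module} with {prev_binning} "
                f"and by {module} with {binning}"
            )
        # Same path and same binning: accepted in this sketch; whether the
        # two modules are meant to share one histogram is a separate question.
```

Reporting both module names at booking time points directly at the pair of configurations that collide, instead of surfacing later as a merge failure in whichever module happens to trigger the end-of-run merge.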

@fwyzard fwyzard mentioned this pull request Jul 25, 2017
@fwyzard
Contributor Author

fwyzard commented Jul 25, 2017

If my guess is correct, #19912 should fix these errors.

@fwyzard
Contributor Author

fwyzard commented Jul 25, 2017

OK, step3 of 136.784 runs fine with the fix.

@Martin-Grunewald
Contributor

@fwyzard
TSG test shows this error:

%MSG
----- Begin Fatal Exception 26-Jul-2017 08:07:55 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 3 event: 103 stream: 1
   [1] Running path 'dqmofflineOnPAT_step'
   [2] Calling method for module TopMonitor/'WprimeEle115'
Exception Message:
A std::exception was thrown.
no method or data member named "deltaEtaSuperClusterAtVtx" found for type "reco::GsfElectron"
----- End Fatal Exception -------------------------------------------------
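The message suggests that a selection string in the module's configuration names an accessor that `reco::GsfElectron` does not provide. As far as I understand, CMSSW string cuts are resolved by name against the object's type (using ROOT type information), so a missing or misspelled method only shows up once the cut is parsed for that type. A toy Python analogue, assuming a trivial cut grammar of `name() <op> number` and using `hasattr`/`getattr` in place of ROOT reflection (the `GsfElectron` class here is a hypothetical stand-in):

```python
import operator
import re
from typing import Any

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le, "==": operator.eq}
# Toy grammar (assumption): "<name>[()] <op> <number>", e.g. "pt() > 115".
_CUT = re.compile(r"^\s*(\w+)(?:\(\))?\s*(>=|<=|==|>|<)\s*([-+]?\d+(?:\.\d*)?)\s*$")


def evaluate_cut(obj: Any, cut: str) -> bool:
    """Evaluate a toy cut on obj, resolving the accessor by name at run time."""
    match = _CUT.match(cut)
    if match is None:
        raise ValueError(f"cannot parse cut {cut!r}")
    name, op, value = match.groups()
    if not hasattr(obj, name):
        # Wording mirrors the CMSSW message; the lookup itself is only a toy.
        raise AttributeError(
            f'no method or data member named "{name}" found for type "{type(obj).__name__}"'
        )
    member = getattr(obj, name)
    lhs = member() if callable(member) else member
    return bool(_OPS[op](lhs, float(value)))


class GsfElectron:
    """Hypothetical stand-in exposing only two accessors."""

    def __init__(self, pt: float, eta: float) -> None:
        self._pt, self._eta = pt, eta

    def pt(self) -> float:
        return self._pt

    def eta(self) -> float:
        return self._eta
```

With this stand-in, `evaluate_cut(GsfElectron(120.0, 0.5), "pt() > 115")` succeeds, while a cut on `deltaEtaSuperClusterAtVtx()` fails with the same kind of message as in the log.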

@fwyzard
Contributor Author

fwyzard commented Jul 26, 2017

hi @Martin-Grunewald,
how do I reproduce it?

And by the way, what does the exception mean?

@davidlange6
Contributor

davidlange6 commented Jul 26, 2017 via email

@davidlange6
Contributor

davidlange6 commented Jul 26, 2017 via email

@fwyzard
Contributor Author

fwyzard commented Jul 26, 2017

hi @davidlange6, do you want the fix on top of #19916 or stand-alone?

@davidlange6
Contributor

davidlange6 commented Jul 26, 2017 via email

@fwyzard
Contributor Author

fwyzard commented Jul 26, 2017

#19918

@Martin-Grunewald
Contributor

Using #19918 I get the next error:

Begin processing the 5th record. Run 1, Event 107, LumiSection 3 at 26-Jul-2017 10:09:21.725 CEST
----- Begin Fatal Exception 26-Jul-2017 10:09:21 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 3 event: 103 stream: 0
   [1] Running path 'dqmofflineOnPAT_step'
   [2] Calling method for module TopMonitor/'WprimeEle115'
Exception Message:
A std::exception was thrown.
no method or data member named "passConversionVeto" found for type "reco::GsfElectron"
----- End Fatal Exception -------------------------------------------------
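Since each missing accessor only surfaces when the job reaches the cut, fixes arrive one error at a time (first `deltaEtaSuperClusterAtVtx`, then `passConversionVeto`). A hypothetical pre-flight check could list every unresolved name in a selection string at once. The sketch below assumes that any identifier followed by `(` is an accessor call, treats a small toy set of function names as non-accessors, and checks names with `hasattr` against a stand-in `GsfElectron` class; none of this is CMSSW code.

```python
import re
from typing import List, Type

# Assumption: any identifier immediately followed by "(" is a call.
_CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# Toy set of function names that are not accessors of the object.
_FUNCTIONS = {"abs", "sqrt", "log", "exp", "min", "max"}


def unresolved_accessors(cls: Type, cut: str) -> List[str]:
    """Return accessor names used in `cut` that `cls` does not define, in order of appearance."""
    missing: List[str] = []
    for name in _CALL.findall(cut):
        if name in _FUNCTIONS or hasattr(cls, name) or name in missing:
            continue
        missing.append(name)
    return missing


class GsfElectron:
    """Hypothetical stand-in type with only a few accessors."""

    def pt(self) -> float: ...

    def eta(self) -> float: ...
```

For a hypothetical cut such as `pt() > 115 && abs(eta()) < 2.5 && abs(deltaEtaSuperClusterAtVtx()) < 0.01 && passConversionVeto()`, the check reports both `deltaEtaSuperClusterAtVtx` and `passConversionVeto` in a single pass.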

@fwyzard
Contributor Author

fwyzard commented Jul 26, 2017

Sigh... on which workflow?
I didn't get any errors on 136.784 and 136.785 ...

@Martin-Grunewald
Contributor

Our TSG workflow on the V2.1 menu - within HLTrigger/Configuration/test:

./runAll.csh GRun

@fwyzard
Contributor Author

fwyzard commented Jul 26, 2017

hi again,
I've run all the 136.* workflows without seeing the issue reported by Martin - and the reason could be that I did not hit an interesting event.

However, for my tests I've used #19916 (actually #19917) on top of the other changes. Is it possible that the problem has already been fixed by the commits there?

Martin, can you try with the latest IB + #19917 ?

@Martin-Grunewald
Contributor

Martin-Grunewald commented Jul 27, 2017

@fwyzard
Unfortunately the error persists:

CMSSW_9_3_X_2017-07-26-2300/src/HLTrigger/Configuration/test/[129]$ git branch
  CMSSW_9_3_X
  cms-sw/refs/pull/19884/head
  cms-sw/refs/pull/19917/head
* from-CMSSW_9_3_X_2017-07-26-2300

To reproduce, make a developer area:

cd src
git cms-addpkg HLTrigger/Configuration
scram b
rehash
cd HLTrigger/Configuration/test/

cp /afs/cern.ch/user/g/gruen/public/RelVal_DigiL1RawHLT_GRun_MC.py .
cp /afs/cern.ch/user/g/gruen/public/RelVal_RECO_GRun_MC.py .

cmsRun RelVal_DigiL1RawHLT_GRun_MC.py >& RelVal_DigiL1RawHLT_GRun_MC.log
echo $?

cmsRun RelVal_RECO_GRun_MC.py >& RelVal_RECO_GRun_MC.log
echo $?
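The two steps can also be chained from a small Python driver that mirrors the `>&` redirection and the `echo $?` checks. This is a hypothetical convenience sketch using only the standard `subprocess` module; it assumes `cmsRun` is on the `PATH` of the developer area set up above, and `run_steps`/`STEPS` are invented names.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# The two cmsRun steps from the recipe above, each with its log file.
STEPS: List[Tuple[List[str], str]] = [
    (["cmsRun", "RelVal_DigiL1RawHLT_GRun_MC.py"], "RelVal_DigiL1RawHLT_GRun_MC.log"),
    (["cmsRun", "RelVal_RECO_GRun_MC.py"], "RelVal_RECO_GRun_MC.log"),
]


def run_steps(steps: Sequence[Tuple[List[str], str]]) -> int:
    """Run each (command, logfile) in order; return the first non-zero exit code, or 0."""
    for command, logfile in steps:
        with open(logfile, "w") as log:
            # Like csh ">&": both stdout and stderr go to the log file.
            result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        print(f"{' '.join(command)} -> exit code {result.returncode}", file=sys.stderr)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `sys.exit(run_steps(STEPS))` from the `test` directory runs both steps. Unlike the shell recipe, the driver stops after a failing first step, since the second step reads the file the first one produces.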

The first cmsRun takes a GEN-SIM file and produces the file needed for the second cmsRun which shows the crash.

@fwyzard
Contributor Author

fwyzard commented Jul 27, 2017

Your test machine is bigger than mine...

In my second attempt, the job was killed after the first few events by the kernel because the machine ran out of memory.

In the first attempt the whole machine died...

I'm trying again with a single stream to reduce the memory usage.

@fwyzard
Contributor Author

fwyzard commented Jul 27, 2017

OK, with a single thread/stream I can reproduce the error.

@fwyzard fwyzard deleted the HLT_DQM_for_93x_part3 branch August 2, 2017 08:57

6 participants