Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow 11834.24 timeouts on cs8 #36492

Closed
makortel opened this issue Dec 14, 2021 · 25 comments · Fixed by #36505
Closed

Workflow 11834.24 timeouts on cs8 #36492

makortel opened this issue Dec 14, 2021 · 25 comments · Fixed by #36505

Comments

@makortel
Copy link
Contributor

makortel commented Dec 14, 2021

Workflow 11834.24 step3 appears to time out consistently on cs8. In CMSSW_12_3_X_2021-12-13-2300 cs8_amd64_gcc10

Begin processing the 1st record. Run 1, Event 102, LumiSection 2 on stream 0 at 14-Dec-2021 09:22:02.427 CET
Begin processing the 2nd record. Run 1, Event 101, LumiSection 2 on stream 0 at 14-Dec-2021 09:52:41.636 CET
Begin processing the 3rd record. Run 1, Event 103, LumiSection 2 on stream 0 at 14-Dec-2021 10:11:41.210 CET
Begin processing the 4th record. Run 1, Event 104, LumiSection 2 on stream 0 at 14-Dec-2021 10:29:59.495 CET
Begin processing the 5th record. Run 1, Event 105, LumiSection 2 on stream 0 at 14-Dec-2021 10:42:44.897 CET
Begin processing the 6th record. Run 1, Event 106, LumiSection 2 on stream 0 at 14-Dec-2021 11:02:20.171 CET
Begin processing the 7th record. Run 1, Event 107, LumiSection 2 on stream 0 at 14-Dec-2021 11:14:44.980 CET
Begin processing the 8th record. Run 1, Event 109, LumiSection 2 on stream 0 at 14-Dec-2021 11:37:49.580 CET
Begin processing the 9th record. Run 1, Event 108, LumiSection 2 on stream 0 at 14-Dec-2021 11:47:45.216 CET

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Tue Dec 14 11:49:15 CET 2021
Thread 1 (Thread 0x2b8eba686800 (LWP 21897) "cmsRun"):
#3  0x00002b8ebe30035b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b8ee6e1ad04 in CellularAutomaton::createAndConnectCells(std::vector<HitDoublets const*, std::allocator<HitDoublets const*> > const&, TrackingRegion const&, CACut const&, CACut const&, float) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#6  0x00002b8ee6e05de7 in CAHitQuadrupletGenerator::hitNtuplets(IntermediateHitDoublets const&, std::vector<OrderedHitSeeds, std::allocator<OrderedHitSeeds> >&, SeedingLayerSetsHits const&) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#7  0x00002b8f95e54e81 in CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/pluginRecoPixelVertexingPixelTripletsPlugins.so
#8  0x00002b8eb728b453 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/libFWCoreFramework.so
#9  0x00002b8eb727476f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b8eb71ced45 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02711/cs8_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2021-12-13-2300/lib/cs8_amd64_gcc10/libFWCoreFramework.so

Current Modules:

Module: CAHitQuadrupletEDProducer:initialStepHitQuadruplets (crashed)

A fatal system signal has occurred: external termination request

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cs8_amd64_gcc10/CMSSW_12_3_X_2021-12-13-2300/pyRelValMatrixLogs/run/11834.24_TTbar_14TeV+2021PU_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU/step3_TTbar_14TeV+2021PU_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU.log#/

Note that each event takes consistently 10-30 minutes to process.

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

@smuzaffar, I noticed the cmsDriver commands on cs8 appear to be missing --nThreads 4 (that I see on slc7).

@smuzaffar
Copy link
Contributor

@makortel , ah good catch.
This is fixed now cms-sw/cms-bot@36a78cd (next cs8 IBs should run threaded relvals)

@makortel
Copy link
Contributor Author

CMSSW_12_3_X_2021-12-13-2300 cs8_amd64_gcc11 shows similar time/event behavior

Begin processing the 1st record. Run 1, Event 102, LumiSection 2 on stream 0 at 14-Dec-2021 06:20:24.993 UTC
Begin processing the 2nd record. Run 1, Event 101, LumiSection 2 on stream 0 at 14-Dec-2021 06:39:09.060 UTC
Begin processing the 3rd record. Run 1, Event 103, LumiSection 2 on stream 0 at 14-Dec-2021 06:51:29.176 UTC
Begin processing the 4th record. Run 1, Event 104, LumiSection 2 on stream 0 at 14-Dec-2021 07:04:08.326 UTC
Begin processing the 5th record. Run 1, Event 105, LumiSection 2 on stream 0 at 14-Dec-2021 07:13:07.386 UTC
Begin processing the 6th record. Run 1, Event 106, LumiSection 2 on stream 0 at 14-Dec-2021 07:26:41.612 UTC
Begin processing the 7th record. Run 1, Event 107, LumiSection 2 on stream 0 at 14-Dec-2021 07:35:39.248 UTC
Begin processing the 8th record. Run 1, Event 109, LumiSection 2 on stream 0 at 14-Dec-2021 07:51:59.032 UTC
Begin processing the 9th record. Run 1, Event 108, LumiSection 2 on stream 0 at 14-Dec-2021 07:59:23.870 UTC
Begin processing the 10th record. Run 1, Event 110, LumiSection 2 on stream 0 at 14-Dec-2021 08:11:29.708 UTC

In the same IB slc7_amd64_gcc10 (but this is with 4 threads and streams)

Begin processing the 1st record. Run 1, Event 101, LumiSection 2 on stream 3 at 14-Dec-2021 07:16:59.614 CET
Begin processing the 2nd record. Run 1, Event 104, LumiSection 2 on stream 2 at 14-Dec-2021 07:16:59.641 CET
Begin processing the 3rd record. Run 1, Event 103, LumiSection 2 on stream 0 at 14-Dec-2021 07:17:09.537 CET
Begin processing the 4th record. Run 1, Event 102, LumiSection 2 on stream 1 at 14-Dec-2021 07:17:09.540 CET
Begin processing the 5th record. Run 1, Event 106, LumiSection 2 on stream 3 at 14-Dec-2021 07:28:39.637 CET
Begin processing the 6th record. Run 1, Event 105, LumiSection 2 on stream 2 at 14-Dec-2021 07:32:53.480 CET
Begin processing the 7th record. Run 1, Event 107, LumiSection 2 on stream 0 at 14-Dec-2021 07:37:19.302 CET
Begin processing the 8th record. Run 1, Event 109, LumiSection 2 on stream 1 at 14-Dec-2021 07:40:57.789 CET
Begin processing the 9th record. Run 1, Event 108, LumiSection 2 on stream 3 at 14-Dec-2021 07:42:55.061 CET
Begin processing the 10th record. Run 1, Event 110, LumiSection 2 on stream 2 at 14-Dec-2021 07:53:25.663 CET

Based on a few first events the per-event processing times are on the same order of magnitude, but not exactly the same.

@makortel
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor Author

@cms-sw/reconstruction-l2 11834.24 appears to be 0T, is 10-30 minutes per event. Just to confirm, is this as expected?

@makortel
Copy link
Contributor Author

The no-pileup variant 11634.24 appears to process only 1 event, I wonder if that could be sufficient for the pileup variant too

11634.24 appears to include also ALCA step that 11834.24 does not (@cms-sw/pdmv-l2 @cms-sw/alca-l2), I have no idea if that is intentional or not.

@slava77
Copy link
Contributor

slava77 commented Dec 14, 2021

In the same IB slc7_amd64_gcc10 (but this is with 4 threads and streams)
Begin processing the 1st record. Run 1, Event 101, LumiSection 2 on stream 3 at 14-Dec-2021 07:16:59.614 CET
Begin processing the 2nd record. Run 1, Event 104, LumiSection 2 on stream 2 at 14-Dec-2021 07:16:59.641 CET
Begin processing the 3rd record. Run 1, Event 103, LumiSection 2 on stream 0 at 14-Dec-2021 07:17:09.537 CET

the startup times don't look like 4 stream, more like 2-stream.

OTOH https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc900/CMSSW_12_3_X_2021-12-13-2300/pyRelValMatrixLogs/run/11834.24_TTbar_14TeV+2021PU_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU/step3_TTbar_14TeV+2021PU_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU.log
looks more like 4/4 thread/stream

@cms-sw/reconstruction-l2 11834.24 appears to be 0T, is 10-30 minutes per event. Just to confirm, is this as expected?

In the slc7_amd64_gcc900/CMSSW_12_3_X_2021-12-13-2300 log the job starts processing at 13:29:04.502 CET and ends at 14:47:01 CET Closed file root.... So, yes, it looks like some events take a long time.

@cms-sw/tracking-pog-l2 please take a note

Is Module: CAHitQuadrupletEDProducer:initialStepHitQuadruplets (crashed) an indication that this module took the longest time or that it was just one of the modules on one of the threads at the time of SIGKILL?

@tvami
Copy link
Contributor

tvami commented Dec 14, 2021

11634.24 appears to include also ALCA step that 11834.24 does not (@cms-sw/pdmv-l2 @cms-sw/alca-l2), I have no idea if that is intentional or not.

11834.24 has the NANO step as well (while 11634.24 does not), I'm not sure if that matters or not.
Anyway to answer the question, to me it looks like whenever we do PU wf-s, we never do the ALCA steps. So I guess it is intentional (although this decision was taken much before me)

@mmusich
Copy link
Contributor

mmusich commented Dec 14, 2021

The no-pileup variant 11634.24 appears to process only 1 event, I wonder if that could be sufficient for the pileup variant too

when they were first introduced at #30527 these wfs were designed to run just 1 event:

https://github.com/cms-sw/cmssw/pull/30527/files#diff-8c163203cbaa64f0e0e6aa2b16346cfab3eaf84a57a008c7fa37193e8b9b2adbR581

and actually it seems that dumping the command both start with 1 GS event:

11834.24 2021PU_0T+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+HARVESTPU+Nano [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 1 --conditions auto:phase1_2021_realistic_0T --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --magField 0T --relval 9000,100 
                                           [2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --magField 0T --pileup Run3_Flat55To75_PoissonOOTPU --pileup_input das:/RelValMinBias_14TeV/CMSSW_12_0_0_pre4-120X_mcRun3_2021_realistic_v2-v1/GEN-SIM
                                           [3]: cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,DQM --geometry DB:Extended --era Run3 --magField 0T --pileup Run3_Flat55To75_PoissonOOTPU --pileup_input das:/RelValMinBias_14TeV/CMSSW_12_0_0_pre4-120X_mcRun3_2021_realistic_v2-v1/GEN-SIM
                                           [4]: cmsDriver.py step4  -s HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM --conditions auto:phase1_2021_realistic_0T --mc  --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 10 --magField 0T --pileup Run3_Flat55To75_PoissonOOTPU --pileup_input das:/RelValMinBias_14TeV/CMSSW_12_0_0_pre4-120X_mcRun3_2021_realistic_v2-v1/GEN-SIM
                                           [5]: cmsDriver.py step5  -s NANO --conditions auto:phase1_2021_realistic --datatier NANOAODSIM -n 10 --eventcontent NANOEDMAODSIM --filein file:step3_inMINIAODSIM.root --geometry DB:Extended --era Run3

 -------------------------------------------------------------------------------- 


11634.24 2021_0T+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST+ALCA [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 1 --conditions auto:phase1_2021_realistic_0T --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --magField 0T --relval 9000,100 
                                           [2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-DIGI-RAW -n 1 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --magField 0T
                                           [3]: cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO -n 1 --eventcontent RECOSIM,MINIAODSIM,DQM --geometry DB:Extended --era Run3 --magField 0T
                                           [4]: cmsDriver.py step4  -s HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM --conditions auto:phase1_2021_realistic_0T --mc  --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 1 --magField 0T
                                           [5]: cmsDriver.py step5  -s ALCA:SiPixelCalSingleMuonLoose+SiPixelCalSingleMuonTight+TkAlMuonIsolated+TkAlMinBias+MuAlOverlaps+EcalESAlign+TkAlZMuMu+TkAlDiMuonAndVertex+HcalCalHBHEMuonFilter+TkAlUpsilonMuMu+TkAlJpsiMuMu+SiStripCalMinBias --conditions auto:phase1_2021_realistic_0T --datatier ALCARECO -n 1 --eventcontent ALCARECO --geometry DB:Extended --filein file:step3.root --era Run3 --magField 0T

@slava77
Copy link
Contributor

slava77 commented Dec 14, 2021

and actually it seems that dumping the command both start with 1 GS event:

apparently the IB tests recycle gen-sim; and this way they have 10 events here.

@mmusich
Copy link
Contributor

mmusich commented Dec 14, 2021

@cms-sw/tracking-pog-l2 please take a note

so we know for sure that the CA is not designed to run correctly for 0T, e.g.:

but also:

//87 cm/GeV = 1/(3.8T * 0.3)
//take less than radius given by the hardPtCut and reject everything below
float minRadius = hardPtCut * 87.f; // FIXME move out and use real MagField

Some work was started before the beam test in case it was necessary to run reconstruction at 0T, but it was stopped as soon as it became clear we would have had field.
Please clarify what short term action is expected.

@makortel
Copy link
Contributor Author

and actually it seems that dumping the command both start with 1 GS event:

apparently the IB tests recycle gen-sim; and this way they have 10 events here.

The cmsDriver commands listed in the IB dashboard show explicit -n 10, whereas for 11634.24 theyhave explicit -n 1.

@makortel
Copy link
Contributor Author

I have a vague recollection we introduced the trackingLowPU era during Run 2 for "Run 1 -like tracking" that works with 0 T out of the box. (but I don't know if this is still relevant or not)

@makortel
Copy link
Contributor Author

and actually it seems that dumping the command both start with 1 GS event:

apparently the IB tests recycle gen-sim; and this way they have 10 events here.

The cmsDriver commands listed in the IB dashboard show explicit -n 10, whereas for 11634.24 theyhave explicit -n 1.

Which, on a closer look, is exactly what @mmusich showed in #36492 (comment)

# 11834.24
[2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --magField 0T --pileup Run3_Flat55To75_PoissonOOTPU --pileup_input das:/RelValMinBias_14TeV/CMSSW_12_0_0_pre4-120X_mcRun3_2021_realistic_v2-v1/GEN-SIM

 -------------------------------------------------------------------------------- 

# 11634.24
[2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions auto:phase1_2021_realistic_0T --datatier GEN-SIM-DIGI-RAW -n 1 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --magField 0T

@makortel
Copy link
Contributor Author

I think the remaining action is to make the -n 1 applied on all steps of 11834.24.

@makortel
Copy link
Contributor Author

assign pdmv

@cmsbuild
Copy link
Contributor

New categories assigned: pdmv

@bbilin,@wajidalikhan,@jordan-martins,@kskovpen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77
Copy link
Contributor

slava77 commented Dec 15, 2021

I have a vague recollection we introduced the trackingLowPU era during Run 2 for "Run 1 -like tracking" that works with 0 T out of the box. (but I don't know if this is still relevant or not)

wasn't that a special setup for no-pixel-tracker configurations?

@slava77
Copy link
Contributor

slava77 commented Dec 15, 2021

wasn't that a special setup for no-pixel-tracker configurations?

or, wasn't it technically made so that no-pixel-tracker setup could work ... and chances are that in attempt to make the purpose more generic the name morphed to trackingLowPU

@mmusich
Copy link
Contributor

mmusich commented Dec 15, 2021

I think the remaining action is to make the -n 1 applied on all steps of 11834.24.

see please #36503.

@mmusich
Copy link
Contributor

mmusich commented Dec 15, 2021

I have a vague recollection we introduced the trackingLowPU era during Run 2 for "Run 1 -like tracking" that works with 0 T out of the box. (but I don't know if this is still relevant or not)

That's precisely what I had attempted here #33532, though we still had hopes to adapt the Run2 tracking as well.

@makortel
Copy link
Contributor Author

I see --nThreads 4 for cs8 in CMSSW_12_3 2021-12-14-2300

@makortel
Copy link
Contributor Author

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants