More Nano step-related fixes for Run3/Phase2 workflows #36350

Merged
merged 2 commits into from Dec 5, 2021

Contributor

kpedro88 commented Dec 3, 2021

PR description:

  1. Apply the fix from #36341 ("implement step skipping more consistently for Patatrack workflows") to some other workflows that similarly skipped the standalone Nano step
  2. Put the run3_nanoAOD_devel modifier into the Run3 Era (easier way to make sure workflows are consistently applying it)
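
To illustrate why item 2 makes the behaviour more consistent, here is a toy model in plain Python. The `Modifier` and `ModifierChain` classes and the `other_run3_modifier` entry are hypothetical stand-ins written for this sketch; they are not the CMSSW configuration API and the real Run3 Era definition is not reproduced here.

```python
# Toy model only: hypothetical stand-ins, NOT the CMSSW configuration API.

class Modifier:
    """A named switch that a workflow can turn on."""
    def __init__(self, name):
        self.name = name


class ModifierChain:
    """An era: a bundle of modifiers that are always enabled together."""
    def __init__(self, *modifiers):
        self.modifiers = list(modifiers)

    def active_names(self):
        return [m.name for m in self.modifiers]


other_run3_modifier = Modifier("other_run3_modifier")  # placeholder entry
run3_nanoAOD_devel = Modifier("run3_nanoAOD_devel")

# Before: the era does not carry the modifier, so each workflow
# has to remember to request it separately.
Run3_before = ModifierChain(other_run3_modifier)
# After: the modifier is part of the era, so every Run3 workflow gets it.
Run3_after = ModifierChain(other_run3_modifier, run3_nanoAOD_devel)


def workflow_modifiers(era, extra=()):
    """Names of all modifiers active for a workflow that uses `era`."""
    return era.active_names() + [m.name for m in extra]


print(workflow_modifiers(Run3_before))
print(workflow_modifiers(Run3_after))
```

In this picture a workflow that forgets the `extra` argument silently loses the modifier in the "before" case, while in the "after" case it cannot.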

attn: @cms-sw/xpog-l2

PR validation:

Compared output of this command for 12_1_0_pre5 and this branch: runTheMatrix.py -w upgrade -nel 10024.1,11624.1,11634.17,11834.17,12834.17,13034.17,11834.98,11834.99,11634.21,11834.21,11834.9821,11834.9921,35034.21,35234.21,35234.98,35234.99,35234.9821,35234.9921 --dryRun. (This tests trackingOnly, deepCore, prodLike, premix workflows.)
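
For anyone repeating this kind of comparison, a minimal sketch of diffing two captured dry-run outputs with the standard-library `difflib` follows. It assumes the output of each run was saved as text beforehand; the function name `diff_dry_run` and the two sample strings are made up for illustration and do not reflect the real `runTheMatrix.py` output format.

```python
import difflib


def diff_dry_run(reference_text, candidate_text,
                 ref_label="reference", new_label="candidate"):
    """Return the unified-diff lines between two captured dry-run outputs."""
    return list(difflib.unified_diff(
        reference_text.splitlines(),
        candidate_text.splitlines(),
        fromfile=ref_label,
        tofile=new_label,
        lineterm="",
    ))


# Made-up stand-ins for the captured output of the two releases.
reference = "workflow A: step1 step2 step3\nworkflow B: step1 step2"
candidate = "workflow A: step1 step2 step3\nworkflow B: step1 step2 step3"

for line in diff_dry_run(reference, candidate):
    print(line)
```

An empty result means the two dry runs would execute the same commands; any `+` or `-` line points at a workflow whose steps changed.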

Contributor Author

kpedro88 commented Dec 3, 2021

resolves #36347

Contributor Author

kpedro88 commented Dec 3, 2021

type bug-fix

Contributor Author

kpedro88 commented Dec 3, 2021

test parameters:
workflows = 10024.1,11634.1,11723.17,11834.21,35034.21,35234.21,35234.9921

Contributor

cmsbuild commented Dec 3, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36350/27112

Contributor

cmsbuild commented Dec 3, 2021

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

  • Configuration/Eras (operations)
  • Configuration/PyReleaseValidation (pdmv, upgrade)

@perrotta, @jordan-martins, @bbilin, @wajidalikhan, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @qliphy, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks.
@makortel, @missirol, @fabiocos, @slomeo, @Martin-Grunewald this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

Contributor Author

kpedro88 commented Dec 3, 2021

please test

Contributor

cmsbuild commented Dec 3, 2021

-1

Failed Tests: RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-71105c/20969/summary.html
COMMIT: 815a5fe
CMSSW: CMSSW_12_2_X_2021-12-03-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36350/20969/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

  • 11723.17: 11723.17_QCD_Pt_1800_2400_14+2021_seedingDeepCore+QCD_Pt_1800_2400_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step3_QCD_Pt_1800_2400_14+2021_seedingDeepCore+QCD_Pt_1800_2400_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

Contributor

qliphy commented Dec 4, 2021

urgent
to be included for 12_2_0_pre3

@cmsbuild cmsbuild added the urgent label Dec 4, 2021
Contributor

perrotta commented Dec 4, 2021

@kpedro88 Tests have not finished yet, but there is at least one issue with 11723.17 to be fixed.

Contributor Author

kpedro88 commented Dec 4, 2021

For future reference, the crash and relevant trace:

%MSG-e BasicSingleTrajectoryState:  GsfTrackProducer:electronGsfTracks  03-Dec-2021 19:34:31 CET Run: 1 Event: 5
asking for componenets to a SingleTrajectoryState
%MSG
cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/e7669cd1d991022314b0cb0a8ff2f065/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/src/TrackingTools/TrajectoryState/src/BasicTrajectoryState.cc:248: virtual const Components& BasicSingleTrajectoryState::components() const: Assertion `false' failed.

Thread 1 (Thread 0x2b65f44920c0 (LWP 15499) "cmsRun"):
#0  0x00002b65f240bddd in poll () from /lib64/libc.so.6
#1  0x00002b65f86d59d7 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00002b65f86d630c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00002b65f86d97ab in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b65f234e387 in raise () from /lib64/libc.so.6
#6  0x00002b65f234fa78 in abort () from /lib64/libc.so.6
#7  0x00002b65f23471a6 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002b65f2347252 in __assert_fail () from /lib64/libc.so.6
#9  0x00002b660004b4e8 in BasicSingleTrajectoryState::components() const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libTrackingToolsTrajectoryState.so
#10 0x00002b661c355dc7 in MultiStatePropagation<Plane>::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libTrackingToolsGsfTools.so
#11 0x00002b661c35570c in GsfPropagatorAdapter::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libTrackingToolsGsfTools.so
#12 0x00002b6620d5f6d1 in GsfPropagatorWithMaterial::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libTrackingToolsGsfTracking.so
#13 0x00002b6620d6221e in GsfTrajectoryFitter::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libTrackingToolsGsfTracking.so
#14 0x00002b6620b5eaa2 in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginTrackingToolsTrackFittersPlugins.so
#15 0x00002b666c9b416e in TrackProducerAlgorithm<reco::GsfTrack>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/libRecoTrackerTrackProducer.so
#16 0x00002b666c843791 in TrackProducerAlgorithm<reco::GsfTrack>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&) () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginRecoTrackerTrackProducerPlugins.so
#17 0x00002b666c83cf99 in GsfTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02709/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_2_X_2021-12-02-1100/lib/slc7_amd64_gcc900/pluginRecoTrackerTrackProducerPlugins.so

The interesting thing is that this workflow runs fine in 12_1_0_pre5 (with or without this PR). It appears that it's not crashing in 12_2_X IBs only because the Reco step is removed entirely without this PR.

@perrotta what do you want to do here? I don't have the time or expertise to debug this workflow...

Contributor

qliphy commented Dec 4, 2021

> The interesting thing is that this workflow runs fine in 12_1_0_pre5 (with or without this PR). It appears that it's not crashing in 12_2_X IBs only because the Reco step is removed entirely without this PR.

@kpedro88 The Reco step for 11723.17 was only removed between CMSSW_12_2_X_2021-11-29-1100 [1] and CMSSW_12_2_X_2021-11-29-2300 [2], as @perrotta commented previously here, with the suspect being your #36167. Can you check what happened in #36167 for 11723.17 and whether we can (partly) revert it? Thanks!

[1] https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_2/2021-11-29-1100?selectedArchs=slc7_amd64_gcc900&selectedFlavors=X&selectedFlavors=X&selectedStatus=failed&selectedStatus=known_failed&selectedStatus=passed

[2] https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_2/2021-11-29-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=X&selectedFlavors=X&selectedStatus=failed&selectedStatus=known_failed&selectedStatus=passed

Contributor Author

kpedro88 commented Dec 4, 2021

@qliphy I'm saying that if I copy the exact step3 command from CMSSW_12_2_X_2021-11-29-1100:

cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --conditions auto:phase1_2021_realistic --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,DQM --geometry DB:Extended --era Run3 --procModifiers seedingDeepCore  --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary  --filein  file:step2.root  --fileout file:step3.root  --suffix "-j JobReport3.xml "  --nThreads 4 > step3_QCD_Pt_1800_2400_14+2021_seedingDeepCore+QCD_Pt_1800_2400_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST.log  2>&1

it crashes in a clean CMSSW_12_2_X_2021-12-03-1100 IB. So this specific crash is not caused by #36167; rather, #36167 accidentally hid the real problem by removing the RECO step, so the crash was not visible until now.
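
Since the point here is whether RECO is present in the step list at all, a quick check can be scripted. The sketch below only tokenizes a command string and reads the value after `-s`; the helper name `driver_step_names` is made up, the first command is shortened from the one quoted above, and the second command is a hypothetical example of a step list without RECO, not the actual command produced after #36167.

```python
import shlex


def driver_step_names(command):
    """Return the step names passed via -s, without any ':sequence' suffix."""
    tokens = shlex.split(command)
    if "-s" not in tokens:
        return []
    steps = tokens[tokens.index("-s") + 1].split(",")
    return [step.split(":")[0] for step in steps]


# Shortened from the step3 command quoted above.
with_reco = (
    "cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,"
    "VALIDATION:@standardValidation+@miniAODValidation,"
    "DQM:@standardDQM+@ExtraHLT+@miniAODDQM --era Run3 "
    "--procModifiers seedingDeepCore"
)
# Hypothetical command, only to show the negative case.
without_reco = "cmsDriver.py step3 -s PAT,NANO --era Run3"

for label, cmd in (("with_reco", with_reco), ("without_reco", without_reco)):
    names = driver_step_names(cmd)
    print(label, "RECO" in names, names)
```

A workflow whose step3 no longer lists RECO cannot hit a crash inside reconstruction, which is how a step-skipping change can mask an unrelated failure.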

Contributor

perrotta commented Dec 4, 2021

I have verified what @kpedro88 wrote. If I run

 runTheMatrix.py -l 11723.17 > & rtm_11723.17.out &

in CMSSW_12_2_X_2021-11-29-1100 (i.e. before #36167 was merged), that workflow crashes as well.

But in the IB tests of CMSSW_12_2_X_2021-11-29-1100 the same workflow succeeds. The difference is in the input events: in the PR tests they are produced from scratch at step1, while in the IB tests they are taken from the store, i.e. they are different events. (I cannot verify this in https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-71105c/20969/summary.html because that page does not open now, but this is what I see locally on lxplus.)

Therefore, it confirms that the crash of 11723.17 in the tests does not depend on #36167, or even on this PR.
Since the other failing workflows seem to have been fixed, I would propose to merge this PR even if the tests fail, and wait for the results of the IB tests.

(We can even re-run the tests without 11723.17, if people prefer not to have the "test rejected" status. In any case, the issue must be debugged, but by Tracking and/or EGamma, since it involves GsfTrackProducer:electronGsfTracks.)

Contributor

perrotta commented Dec 4, 2021

test parameters:
workflows = 10024.1,11634.1,11834.21,35034.21,35234.21,35234.9921
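
To make the change relative to the first set of test parameters explicit, the two `workflows = ...` lines from this thread can be compared with a few lines of Python. The helper name `parse_workflows` is made up for this sketch; workflow IDs are kept as strings on purpose, since treating them as floats could blur IDs such as 35234.99 and 35234.9921.

```python
def parse_workflows(line):
    """Parse a 'workflows = a,b,c' line into a list of workflow IDs (strings)."""
    _, _, value = line.partition("=")
    return [w.strip() for w in value.split(",") if w.strip()]


# The two lists as posted in this conversation.
first = parse_workflows(
    "workflows = 10024.1,11634.1,11723.17,11834.21,35034.21,35234.21,35234.9921"
)
second = parse_workflows(
    "workflows = 10024.1,11634.1,11834.21,35034.21,35234.21,35234.9921"
)

dropped = [w for w in first if w not in second]
added = [w for w in second if w not in first]
print("dropped:", dropped)  # dropped: ['11723.17']
print("added:", added)      # added: []
```

The only difference is that 11723.17, the workflow with the unrelated crash, is no longer requested.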

Contributor

perrotta commented Dec 4, 2021

please test

Contributor Author

kpedro88 commented Dec 4, 2021

@perrotta thanks for confirming, I agree with your proposal for this PR.

Contributor

cmsbuild commented Dec 4, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-71105c/20992/summary.html
COMMIT: 815a5fe
CMSSW: CMSSW_12_2_X_2021-12-04-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36350/20992/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-71105c/10024.1_TTbar_13+2017_trackingOnly+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-71105c/11634.1_TTbar_14TeV+2021_trackingOnly+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-71105c/11834.21_TTbar_14TeV+2021PU_ProdLike+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoNanoPU+MiniAODPU+NanoPU
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-71105c/35034.21_TTbar_14TeV+2026D77_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+MiniAOD
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-71105c/35234.9921_TTbar_14TeV+2026D77PU_PMXS1S2ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+MiniAODPU

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 41
  • DQMHistoTests: Total histograms compared: 3041955
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3041933
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 40 files compared)
  • Checked 173 log files, 37 edm output root files, 41 DQM output files
  • TriggerResults: no differences found

Contributor

perrotta commented Dec 4, 2021

@cms-sw/pdmv-l2 @cms-sw/upgrade-l2 please check and sign, if you agree with this PR: this is the last one missing before building pre3

Contributor

srimanob commented Dec 4, 2021

+Upgrade

Contributor

bbilin commented Dec 5, 2021

+1

Contributor

qliphy commented Dec 5, 2021

+1

Contributor

cmsbuild commented Dec 5, 2021

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.

Contributor

perrotta commented Dec 6, 2021

I have opened an issue for the crash observed in 11723.17: #36369
