Crash in MkFit affecting PromptReco #38127

Closed
fabiocos opened this issue May 30, 2022 · 14 comments

@fabiocos
Contributor

Two prompt reco jobs have been reported to crash (run 352493); see https://cms-talk.web.cern.ch/t/seg-violation-on-reco-jobs-for-run-352493/11038 for details. The crash happens in MkFitProducer:initialStepTrackCandidatesMkFitPreSplitting,
with the following stack dump:

#0  0x00007fd6f2705ddd in poll () from /lib64/libc.so.6
#1  0x00007fd6e02a628f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00007fd6e02a6c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00007fd6e02a956b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fd642451d3b in mkfit::StdSeq::clean_cms_seedtracks_iter(std::vector<mkfit::Track, std::allocator<mkfit::Track> >*, mkfit::IterationConfig const&, mkfit::BeamSpot const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCMS.so
#6  0x00007fd6424557b4 in mkfit::run_OneIteration(mkfit::TrackerInfo const&, mkfit::IterationConfig const&, mkfit::EventOfHits const&, std::vector<std::vector<bool, std::allocator<bool> > const*, std::allocator<std::vector<bool, std::allocator<bool> > const*> > const&, mkfit::MkBuilder&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, bool, bool, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCMS.so
#7  0x00007fd6424efd36 in tbb::detail::d1::task_arena_function<MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const::{lambda()#1}, void>::operator()() const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/pluginRecoTrackerMkFitPlugins.so
#8  0x00007fd6f37e09ce in operator() (__closure=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:771
#9  tbb::detail::d0::try_call_proxy<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> >::on_completion<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> > (on_completion_body=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/../../include/oneapi/tbb/detail/_template_helpers.h:230
#10 tbb::detail::r1::isolate_within_arena (d=..., isolation=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:772
#11 0x00007fd6424f1b21 in MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/pluginRecoTrackerMkFitPlugins.so
#12 0x00007fd6f514be13 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_4/lib/slc7_amd64_gcc10/libFWCoreFramework.so

Running in a debugger, the crash is seen at https://github.com/cms-sw/cmssw/blob/master/RecoTracker/MkFitCMS/src/MkStdSeqs.cc#L193 where the values printed by

              std::cout << "ts, tss " << ts << " " << tss << " " << writetrack.size() << std::endl;

are

ts, tss 17 17 40
ts, tss 17 18 40
ts, tss 17 15 40
ts, tss 17 16 40
ts, tss 17 13 40
ts, tss 17 14 40
ts, tss 18 17 40
ts, tss 18 18 40
ts, tss 18 15 40
ts, tss 18 16 40
ts, tss 18 13 40
ts, tss 18 14 40
ts, tss 24 24 40
ts, tss 24 23 40
ts, tss 24 25 40
ts, tss 24 20 40
ts, tss 24 22 40
ts, tss 24 21 40
ts, tss 23 -2147483648 40

Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
mkfit::StdSeq::clean_cms_seedtracks_iter (seed_ptr=seed_ptr@entry=0x7fffffff2b00, itrcfg=..., bspot=...)
    at /build/fabiocos/123X/CMSSW_12_3_4_patch2/src/RecoTracker/MkFitCMS/src/MkStdSeqs.cc:195
195                   if (not writetrack[tss])
(gdb) q
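
The last value printed for tss, -2147483648, is the smallest 32-bit signed integer, so the access `writetrack[tss]` on line 195 lands far outside the 40-element container. As an illustration only, here is a minimal sketch of a range check that would reject such an index before it is used. The helper names `index_in_range` and `flag_or` are hypothetical, the container type is assumed to be `std::vector<bool>`, and this is not the fix that was applied upstream.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of mkFit): true only if idx addresses an
// element of a container holding `size` elements.
constexpr bool index_in_range(int idx, std::size_t size) {
  return idx >= 0 && static_cast<std::size_t>(idx) < size;
}

// Hypothetical checked lookup: returns `fallback` instead of reading out of
// bounds. The container type is an assumption; the issue only shows
// writetrack.size().
inline bool flag_or(const std::vector<bool>& flags, int idx, bool fallback) {
  return index_in_range(idx, flags.size())
             ? flags[static_cast<std::size_t>(idx)]
             : fallback;
}

// The value printed just before the crash is the smallest 32-bit int.
static_assert(std::numeric_limits<int>::min() == -2147483648LL,
              "sketch assumes a 32-bit int");
// Such an index is rejected for a 40-element container; an ordinary one passes.
static_assert(!index_in_range(std::numeric_limits<int>::min(), 40),
              "sentinel-like index must be rejected");
static_assert(index_in_range(17, 40), "ordinary index must be accepted");
```

A guard of this kind only prevents the invalid read; it does not explain why the index was invalid in the first place.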
@cmsbuild
Contributor

A new Issue was created by @fabiocos Fabio Cossutti.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented May 31, 2022

type tracking

@jpata
Contributor

jpata commented May 31, 2022

@slava77 @leonardogiannini @osschar @mmasciov et al could you please take a look?
It was highlighted at the ORP today, as something urgent for tomorrow's joint operations meeting.

@jpata
Contributor

jpata commented May 31, 2022

type urgent

@fabiocos
Contributor Author

The only pathological event I found in /store/data/Run2022A/ZeroBias13/RAW/v1/000/352/493/00000/4e54b433-58cd-4240-853e-3a66292157dc.root is 352493:178138216.

@perrotta
Contributor

urgent

@slava77
Contributor

slava77 commented May 31, 2022

@fabiocos
is there a config available to navigate directly to the crashing event?

@slava77
Contributor

slava77 commented May 31, 2022

I was looking at the logs in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022A/run352493/job_287976/tarball/job/WMTaskSpace/cmsRun1 and
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022A/run352493/job_285547/tarball/job/WMTaskSpace/cmsRun1

@cms-sw/core-l2 would it be possible to report the crashing event in the summary?
Currently we only get

Module: MkFitProducer:initialStepTrackCandidatesMkFitPreSplitting (crashed)

It's nice that Fabio identified the problematic event, but it would be useful to be able to find that event from the logs alone.

@osschar
Contributor

osschar commented May 31, 2022

I have reproduced the problem using @fabiocos's setup and config. The crash is due to round-off errors in determining the bin extents of the sorted seed container. This has already been fixed on master and in 12_4_x (#37586).

Backport is being prepared.
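
For readers unfamiliar with this class of bug, the following is a generic sketch of how round-off at a bin edge can be contained. It is not the mkFit code and not the change made in #37586; the function `clamped_bin` and its parameters are hypothetical. The idea is that a value sitting on (or, after rounding, slightly beyond) the upper edge would otherwise map to bin `nbins`, one past the last valid bin, so the result is clamped to the valid range before any float-to-int conversion takes place.

```cpp
// Hypothetical binning helper, not the mkFit implementation.
// Maps x, nominally in [lo, hi], to a bin index in [0, nbins - 1].
// Assumes hi > lo and nbins > 0.
constexpr int clamped_bin(float x, float lo, float hi, int nbins) {
  const float scaled = (x - lo) / (hi - lo) * static_cast<float>(nbins);
  // Written as a negated comparison so that NaN also falls into bin 0
  // instead of reaching the conversion below.
  if (!(scaled > 0.0f)) {
    return 0;
  }
  // Covers x == hi, values pushed past hi by round-off, and +infinity.
  if (scaled >= static_cast<float>(nbins)) {
    return nbins - 1;
  }
  // Here 0 < scaled < nbins, so the truncating conversion is well defined.
  return static_cast<int>(scaled);
}

// A value exactly on the upper edge stays in the last bin.
static_assert(clamped_bin(1.0f, 0.0f, 1.0f, 40) == 39, "upper edge is clamped");
// A value slightly below the lower edge stays in the first bin.
static_assert(clamped_bin(-0.001f, 0.0f, 1.0f, 40) == 0, "lower edge is clamped");
```

Clamping is one possible design choice; computing bin extents so that neighbouring bins share exactly the same edge value is another, and the sketch makes no claim about which approach the upstream fix takes.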

@osschar
Copy link
Contributor

osschar commented May 31, 2022

Addressed in #38151.

@fabiocos
Contributor Author

fabiocos commented Jun 1, 2022

+1

The problem looks to be fixed by the proposed backport, which has been integrated and will enter a new patch release to be deployed in operations.

@fabiocos fabiocos closed this as completed Jun 1, 2022
@jpata
Contributor

jpata commented Jun 1, 2022

+reconstruction

8 participants