Segfaults in RecHitsSortedInPhi constructor in GPU workflows #40604

Closed
makortel opened this issue Jan 24, 2023 · 16 comments

Comments

@makortel
Contributor

makortel commented Jan 24, 2023

Step 3 in a subset of the 10824.59x and 11634.59x workflows has been segfaulting in GPU IBs since CMSSW_13_0_X_2023-01-18-2300. Example stack trace:

Thread 7 (Thread 0x14e9347ff700 (LWP 3429302) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbff0129 in __xstat64 () from /lib64/libc.so.6
#5  0x000014e9bc9be6c4 in stat (__statbuf=0x14e9347f7800, __path=<optimized out>) at /usr/include/sys/stat.h:455
#6  std::filesystem::status (p=..., ec=...) at ../../../../../libstdc++-v3/src/c++17/fs_ops.cc:1513
#7  0x000014e9bc9bedfc in std::filesystem::status (p=...) at ../../../../../libstdc++-v3/src/c++17/fs_ops.cc:1578
#8  0x000014e9be66f169 in (anonymous namespace)::locateFile(std::filesystem::__cxx11::path, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .isra.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#9  0x000014e9be670d39 in edm::FileInPath::initialize_() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#10 0x000014e9be672530 in edm::FileInPath::FileInPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#11 0x000014e905faca10 in SectorProcessorLUT::read_cppf_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned int, std::allocator<unsigned int> >&, std::vector<unsigned int, std::allocator<unsigned int> >&, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#12 0x000014e905fad9f0 in SectorProcessorLUT::read(bool, int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#13 0x000014e905f72014 in EMTFSetup::reload(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#14 0x000014e905fb0d6a in TrackFinder::process(edm::Event const&, edm::EventSetup const&, std::vector<l1t::EMTFHit, std::allocator<l1t::EMTFHit> >&, std::vector<l1t::EMTFTrack, std::allocator<l1t::EMTFTrack> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#15 0x000014e905f20f97 in L1TMuonEndCapTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginL1TriggerL1TMuonEndCapPlugins.so
#16 0x000014e9beb4259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 6 (Thread 0x14e935643700 (LWP 3429295) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbfa02e9 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#5  0x000014e938882188 in SiPixelTemplate2D::pushfile(SiPixel2DTemplateDBObject const&, std::vector<SiPixelTemplateStore2D, std::allocator<SiPixelTemplateStore2D> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libCondFormatsSiPixelTransient.so
#6  0x000014e9388d1354 in PixelCPEClusterRepair::PixelCPEClusterRepair(edm::ParameterSet const&, MagneticField const*, TrackerGeometry const&, TrackerTopology const&, SiPixelLorentzAngle const*, SiPixelTemplateDBObject const*, SiPixel2DTemplateDBObject const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoLocalTrackerSiPixelRecHits.so
#7  0x000014e9389937ae in PixelCPEClusterRepairESProducer::produce(TkPixelCPERecord const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
#8  0x000014e9389a0879 in edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<PixelCPEClusterRepairESProducer, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >(PixelCPEClusterRepairESProducer*, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> > (PixelCPEClusterRepairESProducer::*)(TkPixelCPERecord const&), edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> const&, edm::es::Label const&)::{lambda(TkPixelCPERecord const&)#1}, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >::runProducerAsync(tbb::detail::d1::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so

Thread 5 (Thread 0x14e936044700 (LWP 3429294) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbfa02e4 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#5  0x000014e938882188 in SiPixelTemplate2D::pushfile(SiPixel2DTemplateDBObject const&, std::vector<SiPixelTemplateStore2D, std::allocator<SiPixelTemplateStore2D> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libCondFormatsSiPixelTransient.so
#6  0x000014e9388d1354 in PixelCPEClusterRepair::PixelCPEClusterRepair(edm::ParameterSet const&, MagneticField const*, TrackerGeometry const&, TrackerTopology const&, SiPixelLorentzAngle const*, SiPixelTemplateDBObject const*, SiPixel2DTemplateDBObject const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoLocalTrackerSiPixelRecHits.so
#7  0x000014e9389937ae in PixelCPEClusterRepairESProducer::produce(TkPixelCPERecord const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
#8  0x000014e9389a0879 in edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<PixelCPEClusterRepairESProducer, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >(PixelCPEClusterRepairESProducer*, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> > (PixelCPEClusterRepairESProducer::*)(TkPixelCPERecord const&), edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> const&, edm::es::Label const&)::{lambda(TkPixelCPERecord const&)#1}, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >::runProducerAsync(tbb::detail::d1::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so

Thread 1 (Thread 0x14e9bb42c640 (LWP 3428961) "cmsRun"):
#3  0x000014e9b5c6bb1b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x000014e94c94a724 in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x000014e94c94662c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x000014e94c9445ca in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x000014e8ebadb559 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#9  0x000014e9beb4259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so


Current Modules:
Module: HitPairEDProducer:initialStepHitDoubletsPreSplitting (crashed)
Module: none
Module: L1TMuonEndCapTrackProducer:valEmtfStage2Digis
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_GPU_X_2023-01-23-2300/pyRelValMatrixLogs/run/10824.592_TTbar_13+2018_Patatrack_FullRecoGPU/step3_TTbar_13+2018_Patatrack_FullRecoGPU.log#/

@cmsbuild
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

Here is another stack trace that points more clearly to the crash occurring in the sort:

#3  0x0000148464e82b1b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001483fbbaed27 in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#9  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#10 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#11 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#12 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#13 0x00001483fbbae2ec in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#14 0x00001483fbbaa62c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#15 0x00001483fbba85ca in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#16 0x000014839ad3b559 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#17 0x000014846dd6259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Current Modules:
Module: HitPairEDProducer:initialStepHitDoubletsPreSplitting (crashed)
Module: SiStripRecHitsValid:stripRecHitsValid
Module: SiStripRecHitConverter:siStripMatchedRecHits
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_GPU_X_2023-01-23-2300/pyRelValMatrixLogs/run/10824.593_TTbar_13+2018_Patatrack_FullRecoGPU_Validation/step3_TTbar_13+2018_Patatrack_FullRecoGPU_Validation.log#/

@makortel
Contributor Author

Assign reconstruction,heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous,reconstruction

@mandrenguyen,@fwyzard,@clacaputo,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor Author

#40465 looks like a plausible culprit. Let me also tag @AdrianoDee.

@AdrianoDee
Contributor

Let me have a look.

@makortel
Contributor Author

@AdrianoDee Have you had a chance to take a look? In principle it would be good to have the crashes fixed for 13_0_0.

@AdrianoDee
Contributor

@makortel you are right. I had a look but did not reach a conclusion. I will work on it in the next few days.

@AdrianoDee
Contributor

AdrianoDee commented Feb 23, 2023

So, I still don't understand what is happening, but something strange is that I can't reproduce this with a single thread, and the crash occurs when any of the threads moves on to its next event (so at the 5th event for 4 threads, the 9th for 8, and so on). If this rings a bell for anybody, please let me know. Debugging is getting nasty without being able to run single-threaded (also, any suggestion on how to better debug this is very welcome).

@Dr15Jones
Contributor

suggestion on how to better debug this is very welcome

Have you tried valgrind? It will also work with multiple threads.

Another thing to try would be to see if using 2 streams and 1 thread also leads to a crash.

@Dr15Jones
Contributor

After taking a look at the code (which ultimately just sorts on floats stored as member data), the most likely culprit seems to be a NaN in at least one of the phi values. A NaN breaks sorting because

   // to a sort, 1 must be equivalent to nan, since
   1 < nan == false;
   nan < 1 == false;
   // to a sort, 2 must also be equivalent to nan, since
   2 < nan == false;
   nan < 2 == false;

so, because equivalence is assumed to be transitive, the sort would conclude that 1 and 2 are equivalent as well, and would therefore expect

   1 < 2 == false;

which is not the case. The comparison is then not a strict weak ordering, and that breaks the sorting algorithm.
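
The argument above can be checked with a small standalone sketch. `HitWithPhi` and `HitLessPhi` here are hypothetical stand-ins that only mimic the names in the stack trace (a struct holding a float phi, compared with a plain `<`); they are not the CMSSW classes. `hasNaNPhi` and `sortInPhiChecked` are illustrative debugging helpers, not part of any existing API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for RecHitsSortedInPhi::HitWithPhi; only phi matters here.
struct HitWithPhi {
  float phi;
};

// Assumed comparator: plain "less than" on the stored phi.
struct HitLessPhi {
  bool operator()(const HitWithPhi& a, const HitWithPhi& b) const { return a.phi < b.phi; }
};

// A sort treats two elements as equivalent when neither compares less than the other.
inline bool equivalent(const HitWithPhi& a, const HitWithPhi& b) {
  HitLessPhi less;
  return !less(a, b) && !less(b, a);
}

// True if 1 ~ NaN and NaN ~ 2 hold while 1 ~ 2 does not, i.e. if equivalence
// is not transitive and the comparator is not a strict weak ordering.
inline bool nanBreaksTransitivity() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  const HitWithPhi one{1.f}, two{2.f}, bad{nan};
  return equivalent(one, bad) && equivalent(bad, two) && !equivalent(one, two);
}

// Debugging guard: reports whether any phi is NaN.
inline bool hasNaNPhi(const std::vector<HitWithPhi>& hits) {
  return std::any_of(hits.begin(), hits.end(), [](const HitWithPhi& h) { return std::isnan(h.phi); });
}

// Sorts only when the input is safe to sort; returns false and leaves the
// hits untouched if a NaN is present.
inline bool sortInPhiChecked(std::vector<HitWithPhi>& hits) {
  if (hasNaNPhi(hits))
    return false;
  std::sort(hits.begin(), hits.end(), HitLessPhi{});
  assert(std::is_sorted(hits.begin(), hits.end(), HitLessPhi{}));
  return true;
}
```

Calling `std::sort` with a comparator that is not a strict weak ordering on the given values is undefined behaviour, so a check like `hasNaNPhi` in front of the sort is one possible way to turn a segfault into a diagnosable condition while hunting for the source of the NaNs.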

@AdrianoDee
Contributor

Thanks @Dr15Jones, I was also seeing NaNs in the hits' phi values. I am trying to track down why they appear.

@AdrianoDee
Contributor

The problem is that localCoordToHostAsync does not take into account that the SoA layout pads each column to a 128-byte alignment, so this cudaMemcpyAsync copies the wrong portion of memory. I still don't understand how this went unnoticed. My quick fix would be:

--- a/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
+++ b/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
@@ -48,7 +48,11 @@ public:
   cms::cuda::host::unique_ptr<float[]> localCoordToHostAsync(cudaStream_t stream) const {
     auto ret = cms::cuda::make_host_unique<float[]>(4 * nHits(), stream);
     size_t rowSize = sizeof(float) * nHits();
-    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize * 4, cudaMemcpyDefault, stream));
+    
+    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits(), view().yLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 2, view().xerrLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 3, view().yerrLocal(), rowSize, cudaMemcpyDefault, stream));
 
     return ret;
   }  //move to utilities
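
To illustrate why a single bulk copy goes wrong, here is a host-only sketch under stated assumptions: the columns sit back to back in one buffer, each padded up to a multiple of 128 bytes, and `std::memcpy` stands in for `cudaMemcpyAsync`. All names (`kAlignment`, `paddedColumnStride`, `makePaddedSoA`, `copyBulk`, `copyPerColumn`) are hypothetical and do not correspond to the CMSSW SoA implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed alignment of each SoA column, in bytes.
constexpr std::size_t kAlignment = 128;

// Number of floats from the start of one column to the start of the next.
inline std::size_t paddedColumnStride(std::size_t nHits) {
  const std::size_t bytes = nHits * sizeof(float);
  const std::size_t paddedBytes = (bytes + kAlignment - 1) / kAlignment * kAlignment;
  return paddedBytes / sizeof(float);
}

// Toy buffer of nColumns padded columns: column c holds the value c + 1 in
// its nHits valid slots, and the padding is filled with a -1 sentinel.
inline std::vector<float> makePaddedSoA(std::size_t nHits, std::size_t nColumns) {
  const std::size_t stride = paddedColumnStride(nHits);
  std::vector<float> buffer(stride * nColumns, -1.f);
  for (std::size_t c = 0; c < nColumns; ++c)
    for (std::size_t i = 0; i < nHits; ++i)
      buffer[c * stride + i] = static_cast<float>(c + 1);
  return buffer;
}

// Original approach: one copy of nColumns * nHits contiguous floats,
// starting at the first column and ignoring the padding.
inline std::vector<float> copyBulk(const std::vector<float>& soa, std::size_t nHits, std::size_t nColumns) {
  std::vector<float> out(nColumns * nHits);
  assert(out.size() <= soa.size());
  if (!out.empty())
    std::memcpy(out.data(), soa.data(), out.size() * sizeof(float));
  return out;
}

// Approach of the quick fix: one copy per column, each starting at that
// column's own (padded) offset.
inline std::vector<float> copyPerColumn(const std::vector<float>& soa, std::size_t nHits, std::size_t nColumns) {
  const std::size_t stride = paddedColumnStride(nHits);
  std::vector<float> out(nColumns * nHits);
  assert(soa.size() >= stride * nColumns);
  if (nHits == 0)
    return out;
  for (std::size_t c = 0; c < nColumns; ++c)
    std::memcpy(out.data() + c * nHits, soa.data() + c * stride, nHits * sizeof(float));
  return out;
}
```

In this toy model with 5 hits and 4 columns, the stride is 32 floats, so the bulk copy returns the first column followed by padding, whereas the per-column copy returns all four columns. When `nHits * sizeof(float)` happens to be a multiple of 128, the two copies agree, so the difference only shows up for some hit counts.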
   

@AdrianoDee
Contributor

Proposed the fix in #40869

@AdrianoDee
Contributor

AdrianoDee commented Mar 2, 2023

Solved by #40869 (and #40870).

@makortel
Contributor Author

makortel commented Mar 2, 2023

+heterogeneous
