
HLT crashes in GPU and CPU in collision runs #38453

Closed
swagata87 opened this issue Jun 21, 2022 · 50 comments

@swagata87
Contributor

Dear experts,

During the week of June 13-20, the following three types of HLT crashes occurred in collision runs. HLT was running CMSSW_12_3_5.

  1. Type 1
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray<short unsigned int, 48> >; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc<unsigned int, 32769, 163840>; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray<unsigned int, 6>]: Assertion `tmpNtuplet.size() <= 4' failed.


A fatal system signal has occurred: abort signal

This crash happened on June 13th, during stable beams, with collisions at 900 GeV. Run number: 353709. The crash happened on a CPU node (fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt

  2. Type 2
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: PathStatusInserter:Dataset_ExpressPhysics
Module: EcalRawToDigi:hltEcalDigisLegacy

A fatal system signal has occurred: segmentation violation
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none

A fatal system signal has occurred: segmentation violation
Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: none
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU

A fatal system signal has occurred: segmentation violation

This type of crash happened on GPU nodes (for example fu-c2a02-35-01). It occurred in collision runs while no real collisions were taking place: on June 14th (run 353744, Pixel subdetector out), and on June 18th (runs 353932, 353935, 353941, Pixel and tracker subdetectors out).

  3. Type 3
[2] Prefetching for module MeasurementTrackerEventProducer/'hltSiStripClusters'
[3] Prefetching for module SiPixelDigiErrorsFromSoA/'hltSiPixelDigisFromSoA'
[4] Calling method for module SiPixelDigiErrorsSoAFromCUDA/'hltSiPixelDigiErrorsSoA'
Exception Message:
A std::exception was thrown.
cannot create std::vector larger than max_size()

This happened on fu-c2a02-39-01 (a GPU node), in collision run 353941 (Pixel and tracker subdetectors out), while no real collisions were ongoing.

The causes of crashes (2) and (3) may be related.
Relevant elog on (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515

Regards,
Swagata, as HLT DOC during June 13-20.

@cmsbuild
Contributor

A new Issue was created by @swagata87 Swagata Mukherjee.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign hlt, reconstruction

@cmsbuild
Contributor

New categories assigned: hlt,reconstruction

@jpata,@missirol,@clacaputo,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor

@swagata87 could you provide the full stack traces for the job that failed with the segmentation violations?

@swagata87
Contributor Author

Three examples are pasted below:

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sat Jun 18 18:31:53 CEST 2022
Thread 1 (Thread 0x7fde7a331540 (LWP 194148) "cmsRun"):
#0 0x00007fde7c1d3ddd in poll () from /lib64/libc.so.6
#1 0x00007fde70bf428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fde70bf4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fde70bf756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fde7c2366a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007fddb6e786ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007fddb6e76fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007fde7ec2dd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fde7ec16eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007fde7eb720e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fde7eb723db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fde7eb749c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007fde7eab8c45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007fde7d2c1b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fddb5dd2300, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fde7eae2ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007fde7eaed8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007fde7d2b015b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()

Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: EcalRawToDigi:hltEcalDigisLegacy
Module: none

A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue Jun 14 06:45:22 CEST 2022
Thread 1 (Thread 0x7f1d0ef42540 (LWP 251002) "cmsRun"):
#0 0x00007f1d10de4ddd in poll () from /lib64/libc.so.6
#1 0x00007f1d057f428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f1d057f4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f1d057f756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f1d10e45d29 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f1c4b0876ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f1c4b085fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f1d1383fd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f1d13828eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f1d137840e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f1d137843db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f1d137869c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f1d136cac45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f1d11ed3b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f1c4a9a1500, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f1d136f4ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f1d136ff8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f1d11ec215b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()

Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: HcalHitReconstructor:hltHoreco
Module: HcalHitReconstructor:hltHoreco

A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]  
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue Jun 14 06:45:23 CEST 2022
Thread 1 (Thread 0x7f6148fd5540 (LWP 250893) "cmsRun"):
#0 0x00007f614ae77ddd in poll () from /lib64/libc.so.6
#1 0x00007f613f1f228f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f613f1f2c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f613f1f556b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f614aed8cb5 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f60850e76ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f60850e5fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f614d8d4d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f614d8bdeaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f614d8190e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f614d8193db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f614d81b9c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f614d75fc45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f614bf5fb8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f60849ad400, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f614d789ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f614d7948fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f614bf4e15b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()

Current Modules:

Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none

A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]

The full list is here:

@swagata87
Contributor Author

Experts are working on providing a recipe to reproduce the crashes offline (tagging @mzarucki and @fwyzard).
Once that is available, it will be posted here so that the tracker DPG can have a look. The code that triggered the crashes is maintained by the tracker DPG.

@swagata87
Contributor Author

swagata87 commented Jun 22, 2022

Dear tracker DPG, (@cms-sw/trk-dpg-l2)
I managed to reproduce the GPU crash that happened during run 353941, on the machine gputest-milan-01.cms at Point 5.
I used CMSSW_12_3_5.
$CMSSW_RELEASE_BASE is /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/

General instructions for setting up a CMSSW area on the online GPU nodes are here:
https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs

The HLT configuration file is: https://swmukher.web.cern.ch/swmukher/hlt_v5.py
The .raw file I ran on is run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw.
This .raw file and all the other .raw files are available in the online machines under /store/error_stream.

I have copied one .raw here: https://swmukher.web.cern.ch/swmukher/run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw

In case it is useful,
The HLT configuration file was obtained by the following command:
https_proxy=http://cmsproxy.cms:3128/ hltConfigFromDB --adg --configName /cdaq/physics/firstCollisions22/v2.4/HLT/V5 > hlt_v5.py

Then, at the end, the following block was added:

process.EvFDaqDirector = cms.Service(
    "EvFDaqDirector",
    runNumber=cms.untracked.uint32(353941), #maybe_replace_me
    baseDir=cms.untracked.string("tmp"),
    buBaseDir=cms.untracked.string(
        "/nfshome0/swmukher/check/CMSSW_12_3_5/src" #replace_me
    ),
    useFileBroker=cms.untracked.bool(False),
    fileBrokerKeepAlive=cms.untracked.bool(True),
    fileBrokerPort=cms.untracked.string("8080"),
    fileBrokerUseLocalLock=cms.untracked.bool(True),
    fuLockPollInterval=cms.untracked.uint32(2000),
    requireTransfersPSet=cms.untracked.bool(False),
    selectedTransferMode=cms.untracked.string(""),
    mergingPset=cms.untracked.string(""),
    outputAdler32Recheck=cms.untracked.bool(False),
)

process.source.fileNames = cms.untracked.vstring("file:run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw")  #maybe_replace_me    
process.source.fileListMode = True

cmsRun hlt_v5.py reproduces the crash.
It will create a tmp folder (the EvFDaqDirector baseDir).
To reproduce the crash again, I had to remove that tmp folder before running cmsRun again.

Let me know if something was unclear.

@fwyzard
Contributor

fwyzard commented Jun 22, 2022

@swagata87 thank you for providing these instructions !

@tsusa you can use the online GPU machines to reproduce the issue:

ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py

In my test the problem did not happen every time; I had to run the job a few times before it crashed:

while cmsRun hlt_error_run353941.py; do clear; rm -rf output; done

It eventually crashed, though I'm not 100% sure if it was due to the same problem :-/

@fwyzard
Contributor

fwyzard commented Jun 22, 2022

Yes, looks like the same crash:

#4  <signal handler called>
#5  0x00007fbbf5c9f6a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6  0x00007fbb34ed06ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7  0x00007fbb34ecefab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8  0x00007fbbf8696d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9  0x00007fbbf867feaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
...

@Dr15Jones
Contributor

As a guess, I think the problem is that an extremely large amount of data is being copied, which leads to a memory overwrite into a protected memory space. This is based on what edm::Event::emplaceImpl is doing, which is basically calling

explicit SiPixelErrorsSoA(size_t nErrors, const SiPixelErrorCompact *error, const SiPixelFormatterErrors *err)
: error_(error, error + nErrors), formatterErrors_(err) {}

@Dr15Jones
Contributor

So cms::cuda::SimpleVector does not initialize any of its member data in its constructor

constexpr SimpleVector() = default;

If the first call to SiPixelDigiErrorsSoAFromCUDA::acquire hits this condition

if (gpuDigiErrors.nErrorWords() == 0)
return;

then this call in produce

iEvent.emplace(digiErrorPutToken_, error_.size(), error_.data(), formatterErrors_);

will just copy a random number of bytes from a random memory address.

@fwyzard
Contributor

fwyzard commented Jun 22, 2022

@Dr15Jones thanks for investigating the issue.

So cms::cuda::SimpleVector does not initialize any of its member data in its constructor

This is intended, because a SimpleVector is often allocated by the host in GPU memory, so the constructor cannot be run.
However, this does leave open the possibility of using uninitialised memory :-(

A minimal fix could be

diff --git a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
index 4037b4d5061..554f1425cef 100644
--- a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
+++ b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
@@ -28,7 +28,7 @@ private:
   edm::EDPutTokenT<SiPixelErrorsSoA> digiErrorPutToken_;
 
   cms::cuda::host::unique_ptr<SiPixelErrorCompact[]> data_;
-  cms::cuda::SimpleVector<SiPixelErrorCompact> error_;
+  cms::cuda::SimpleVector<SiPixelErrorCompact> error_ = cms::cuda::make_SimpleVector<SiPixelErrorCompact>(0, nullptr);
   const SiPixelFormatterErrors* formatterErrors_ = nullptr;
 };
 

With it I have been able to run over 20 times on the same input as before without triggering any errors.

@trtomei
Contributor

trtomei commented Jun 22, 2022

Hm, looks like I am late to the party... but, if it's any help, here are instructions for the error seen in run 353744 (AFAICT you have been testing with run 353941). Running on the Hilton this time:

Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353744_ls0009.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2

I also see the same problem; it crashes only every once in a while. It's probably the same bug, but I add it here for completeness.

@trtomei
Contributor

trtomei commented Jun 23, 2022

I also have the other crash here; this one is fully reproducible:

Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353709_ls0085.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2

It will always crash on the 52nd event (run 353709, event 76567528, lumi section 85), with the message:

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray<short unsigned int, 48> >; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc<unsigned int, 32769, 163840>; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray<unsigned int, 6>]: Assertion `tmpNtuplet.size() <= 4' failed.

PS: it is not necessary to run on the Hilton at all; I was running in offline-like mode.

@fwyzard
Contributor

fwyzard commented Jun 23, 2022

@trtomei could you clarify

  • what was the original error stream file ? is it /store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw
  • are you running with or without GPUs ?
  • does the error happen consistently, or only randomly ?

Running online, I have not been able to reproduce the error using the .raw input file, neither with nor without GPUs.

@trtomei
Contributor

trtomei commented Jun 26, 2022

@fwyzard To clarify:

  • Original error stream file was indeed: /store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw
  • I understand I am using GPUs, as I am using the Skylake machine (hilton-c2e36-35-04) with the process.options = cms.untracked.PSet( accelerators = cms.untracked.vstring( '*' ) ) option, and I see the lines
%MSG-i CUDAService:  (NoModuleName) 26-Jun-2022 12:24:47  pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
%MSG
  • For me, the error happens consistently.

Maybe we can sit down together tomorrow and solve this.

@missirol
Contributor

@swagata87 @trtomei

Is this issue still relevant?

@swagata87
Contributor Author

swagata87 commented Oct 12, 2022

Is this issue still relevant?

Actually, yesterday we had a crash which looks like the type-1 crash mentioned in the issue description.
Here is some relevant information on yesterday's crash:

Run number: 360224
StartTime: Oct 12 2022, 02:52
EndTime: Oct 12 2022, 04:36
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.1/HLT/V1
CMSSW_12_4_9
Crash happened in: fu-c2b05-23-01
The error stream file has been copied to hilton. So I think FOG will check if it is reproducible or not, and will follow up.

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_9-el8_amd64_gcc10/build/CMSSW_12_4_9-build/tmp/BUILDROOT/dc6747a684df926e1faea7ef7c301e1a/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.

@trtomei
Contributor

trtomei commented Oct 14, 2022

The files in ROOT format and the HLT configuration are in: /afs/cern.ch/user/t/tomei/public/issue38453
This is reproducible in the Hilton with GPU:

%MSG-i ThreadStreamSetup:  (NoModuleName) 14-Oct-2022 02:05:46  pre-events
setting # threads 4
setting # streams 4
%MSG
%MSG-i CUDAService:  (NoModuleName) 14-Oct-2022 02:05:47  pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)

@missirol
Contributor

@cms-sw/tracking-pog-l2

In this issue, one HLT crash is not yet solved, and I would say we need help from tracking experts in order to find a fix.

The crash is reproducible offline (see #38453 (comment)); it comes from the (HLT) pixel reconstruction, and it only happens on CPU, not on GPU (from what we have seen so far).

Removing some assert calls, one can find a tmpNtuplet with size=5, but that's as far as my insight goes.
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293

@fwyzard
Contributor

fwyzard commented Oct 19, 2022

I have a vague recollection of a comment from @VinInn saying that we should simply remove the assert...

I think now it's OK to have ntuplets with 5 hits, so an alternative could be to change the condition to <= 5 ?

@missirol
Contributor

At least, removing the asserts [1,2] does not lead to any other crashes, fwiw.

And just for my understanding: is it expected that, for the same event, we do not see an ntuplet with size=5 on GPU?

[1] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293
[2] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L334

@VinInn
Contributor

VinInn commented Oct 19, 2022

It does not happen on GPU because the asserts are removed there.
This is a sort of sextuplet candidate (rare? impossible?).
Anyhow, if it does not cause havoc on GPU, I would either change the condition following @fwyzard's advice or just remove the assert.
Mind the assert at the end of the function as well.

@missirol
Contributor

It does not happen on GPU because assert are removed.

Okay, thanks, but still I tried to just print the ntuplet size while running on GPU, and I didn't see a size=5..

@missirol
Contributor

Thanks for having a look.

I checked that (unsurprisingly) the HLT runs fine on these 'error events', for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I'll open PRs with that change to gain time.

@missirol
Contributor

The PRs with the 4 -> 5 change are #39780 (12_6_X), #39781 (12_5_X), and #39782 (12_4_X).

@mmusich
Contributor

mmusich commented Oct 20, 2022

@cms-sw/hlt-l2 (now speaking with the ORM hat, in order to better coordinate the creation of the next patch releases):

@missirol
Contributor

will this issue be fully solved after the merge of the backports of #39780 ?

Yes, that is my understanding.

is there any other outstanding HLT crash with recent data that still needs to be followed up (outside of this ticket)?

There are two more issues, but those crashes have been rare: #39568 , which ECAL has promised to look into, and #38651, which might somehow have been a glitch (seen only once).

FOG (@trtomei) can tell us if there are any new online crashes without a CMSSW issue.

@missirol
Contributor

+hlt

@VinInn
Contributor

VinInn commented Oct 20, 2022

So the tuplet in question joins layer pairs 0,3,10,7,12, i.e. all 6 layers: BPIX1,2,3 and FPIX1,2,3.
Geometrically (almost) impossible, but OK.
Now, why not on GPU?

how can I run hlt_for_debug.py on GPU and NOT on CPU?

@VinInn
Contributor

VinInn commented Oct 20, 2022

Anyhow, if we "observe" sextuplets, we need to allow sextuplets in the code... so the fix of the asserts is OK (the arrays were already over-dimensioned).

@VinInn
Contributor

VinInn commented Oct 20, 2022

The sextuplet is on GPU as well

@VinInn
Contributor

VinInn commented Oct 20, 2022

in case you are interested here are the coordinates of the hits

CPU 0,3,10,7,12,   r/z: 2.834839/0.714075,6.584036/-13.839200,10.662227/-29.628742,11.603213/-33.272655,13.539767/-40.767979,15.999158/-50.250683,
GPU 0,3,10,7,12,   r/z: 2.834839/0.714075,6.584036/-13.839200,10.662227/-29.628742,11.603212/-33.272655,13.539767/-40.767979,15.999158/-50.250683,

@missirol
Contributor

how can I run hlt_for_debug.py on GPU and NOT on CPU?

Looks like this was already solved. I'll add one comment for documentation purposes.

The complication comes from the fact that the HLT menu includes 2 prescaled triggers that run the pixel CPU-only reco (which is why we saw the crash online). To ensure that only the pixel GPU reco is running, one solution is to remove them, but that's tricky to do starting from the full menu [1]; alternatively, one can just run 1 appropriate Path instead of the full menu (most of the time, this is enough for a reproducer) [2]. In the future, we/HLT should maybe try to build 'minimal' reproducers, e.g. not using the full menu if that's not needed.

[1] Add at the end of hlt_for_debug.py:

del process.DQM_PixelReconstruction_v4
del process.AlCa_PFJet40_CPUOnly_v1
del process.HLT_PFJet40_GPUvsCPU_v1
process.hltMuonTriggerResultsFilter.triggerConditions = ['FALSE']
del process.PrescaleService
del process.DQMHistograms
dpaths = [foo for foo in process.paths_() if foo.startswith('Dataset_')]
for foo in dpaths: process.__delattr__(foo)
fpaths = [foo for foo in process.finalpaths_()]
for foo in fpaths: process.__delattr__(foo)

[2] In this case, it could have been

hltGetConfiguration run:360224 \
 --data \
 --no-prescale \
 --no-output \
 --globaltag 124X_dataRun3_HLT_v4 \
 --paths AlCa_PFJet40_v* \
 --max-events -1 \
 --input file:run360224_ls0081_file1.root \
 > hlt.py

cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
process.options.accelerators = ['cpu']
@EOF

cmsRun hlt.py &> hlt.log

@missirol
Contributor

in case you are interested here are the coordinates of the hits

If it's not too much trouble to explain, I would be interested to know how to extract the information on layer pairs and r-z coordinates for a given candidate.

I see the crash at Run 360224, Event 82169671, LumiSection 81; if I run CPU-only, I can see tmpNtuplet.size == 5 inside find_ntuplets, but I don't see that if I run the same event on GPU, and this got me confused.

@VinInn
Contributor

VinInn commented Oct 20, 2022

diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..a33ab98ca09 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,7 +290,19 @@ public:

     auto doubletId = this - cells;
     tmpNtuplet.push_back_unsafe(doubletId);
-    assert(tmpNtuplet.size() <= 4);
+    assert(tmpNtuplet.size() <= 5);
+    if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+      printf("GPU ");
+#else
+      printf("CPU ");
+#endif
+      for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+      printf("   r/z: ");
+      for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+      auto c = tmpNtuplet[tmpNtuplet.size()-1];  printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+      printf("\n");
+    }

     bool last = true;
     for (unsigned int otherCell : outerNeighbors()) {
@@ -331,7 +343,7 @@ public:
       }
     }
     tmpNtuplet.pop_back();
-    assert(tmpNtuplet.size() < 4);
+    assert(tmpNtuplet.size() < 5);
   }

   // Cell status management

@VinInn
Contributor

VinInn commented Oct 20, 2022

I saw the printout twice so I added the ifdef part

@VinInn
Contributor

VinInn commented Oct 20, 2022

btw, the back() method of VecArray is badly broken (it cannot compile, so any code that uses it fails to build).

@missirol
Contributor

Thanks a lot for the info.

@missirol
Contributor

(This issue is solved; the rest below is just me trying to learn things.)

With Vincenzo's diff, I get what he wrote: same sextuplet on CPU and GPU.

In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets, and in that case I couldn't see the sextuplet on GPU. I think this is somewhat reproducible: I ran 30 times with this diff [*] and I could see the sextuplet on GPU in the printouts only 2 times (on CPU, I saw it 10 times out of 10).

At least now I see what I was doing differently.

[*] (yes, most of these printouts are pointless)

diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..bfefdf7ccd6 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,15 +290,37 @@ public:
 
     auto doubletId = this - cells;
     tmpNtuplet.push_back_unsafe(doubletId);
-    assert(tmpNtuplet.size() <= 4);
+    assert(tmpNtuplet.size() <= 5);
+    if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+      printf("GPU ");
+#else
+      printf("CPU ");
+#endif
+      for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+      printf("   r/z: ");
+      for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+      auto c = tmpNtuplet[tmpNtuplet.size()-1];  printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+      printf("\n");
+    }
 
     bool last = true;
     for (unsigned int otherCell : outerNeighbors()) {
       if (cells[otherCell].isKilled())
         continue;  // killed by earlyFishbone
       last = false;
+#ifdef __CUDACC__
+      printf("GPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+      printf("CPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
       cells[otherCell].find_ntuplets<DEPTH - 1>(
           hh, cells, cellTracks, foundNtuplets, apc, quality, tmpNtuplet, minHitsPerNtuplet, startAt0);
+#ifdef __CUDACC__
+      printf("GPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+      printf("CPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
     }
     if (last) {  // if long enough save...
       if ((unsigned int)(tmpNtuplet.size()) >= minHitsPerNtuplet - 1) {
@@ -331,7 +353,12 @@ public:
       }
     }
     tmpNtuplet.pop_back();
-    assert(tmpNtuplet.size() < 4);
+    assert(tmpNtuplet.size() < 5);
+#ifdef __CUDACC__
+    printf("GPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+    printf("CPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
   }
 
   // Cell status management


@VinInn
Contributor

VinInn commented Oct 21, 2022

@missirol
I'm sorry; running on patatrack02, even changing GPU, I always observed (6 out of 6) both the GPU and CPU printouts.

  1. on which machine are you running?
  2. how exactly are you switching between cpu and gpu?
    (in my case, I'm just running cmsRun hlt_for_debug.py from a copy of the issue38453 directory using CMSSW_12_4_10_patch2)

@VinInn
Contributor

VinInn commented Oct 21, 2022

btw: printf from GPU is not guaranteed to appear if there are too many (device-side printf writes into a fixed-size buffer that is flushed to the host; once it fills up, further output is silently lost).

@missirol
Contributor

missirol commented Oct 21, 2022

running on patatrack02 even changing GPU I observed always (6 out of 6) both GPU and CPU printout.

Sorry for the trouble, then. I tested on gpu-c2a02-39-03; to run CPU-only, I add process.options.accelerators = ['cpu'] to the config [*]. I've been using 12_4_10 with the diff in #38453 (comment) (will re-try with 12_4_10_patch2, but that likely makes no difference).

btw: printf from GPU is not guaranteed to appear if there are too many.

Thanks, didn't know, it might explain what I (didn't) see.

[*]

https_proxy=http://cmsproxy.cms:3128 \
hltGetConfiguration run:360224 \
 --data \
 --no-prescale \
 --no-output \
 --globaltag 124X_dataRun3_HLT_v4 \
 --paths AlCa_PFJet40_v* \
 --max-events -1 \
 --input file:run360224_ls0081_file1.root \
 > hlt.py

cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
#process.options.accelerators = ['cpu']
@EOF

cmsRun hlt.py &> hlt.log

@missirol
Contributor

btw: printf from GPU is not guaranteed to appear if there are too many.

I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.

[*] I checked this by keeping the large number of printouts, but also adding

#ifdef __CUDACC__
    if (tmpNtuplet.size() > 4) {
      __trap();
    }
#endif

and the program crashed 10/10 times on GPU (running only on the event in question), meaning each time there was a sextuplet on GPU.

@perrotta
Contributor

perrotta commented Nov 2, 2022

@swagata87 @missirol can this issue be considered concluded, and therefore closed?

@missirol
Contributor

missirol commented Nov 2, 2022

In my understanding, yes (I signed it). Swagata can confirm and close.

@swagata87
Contributor Author

yes, I am closing this issue. Thanks everyone!
